MGs1 ¸ \(íF 0 00 j s 01 ! RL theory monograph: A monograph on RL theory based on notes from courses taught by Nan Jiang at UIUC and together with Sham Kakade at UW. The notes are being actively updated, and any feedback, typos etc. are welcome. Ph.D. Thesis. Computational Trade-offs in Statistical Learning, Ph.D. Thesis, Department of Computer Science, UC Berkeley, Recent preprints Jan 19, · Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as
This post explores how many of the most popular gradient-based optimization algorithms actually work. Note: If you are looking for a review paper, this blog post is also available as an article on arXiv, satyen kale phd thesis.
Update The discussion provides some interesting pointers to related work and other techniques. Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent e.
lasagne'ssatyen kale phd thesis, caffe'sand keras' documentation. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use.
We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the satyen kale phd thesis common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules, satyen kale phd thesis. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting, satyen kale phd thesis.
Finally, we will consider additional strategies that are helpful for optimizing gradient descent. to the parameters. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here. There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
Vanilla gradient descent, aka batch gradient descent, satyen kale phd thesis, computes the gradient of the cost function w. As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model onlinei. with new examples on-the-fly.
our parameter vector params. Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w. some parameters. If you derive the gradients yourself, then satyen kale phd thesis checking is a good idea.
Satyen kale phd thesis here for some great tips on how to check gradients properly. We then update our parameters in the opposite direction of the gradients with the learning rate determining how big of an update we perform.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces. Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time.
It is therefore usually satyen kale phd thesis faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Image 1.
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.
On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively, satyen kale phd thesis.
Its code fragment simply adds a loop over the training examples and evaluates the gradient w. each example. Note that we shuffle the training data at every epoch as explained in this section. This way, it a reduces the variance of the parameter updates, which can lead to more stable convergence; and b can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.
a mini-batch very efficient. Common mini-batch sizes range between 50 andbut satyen kale phd thesis vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used. Vanilla mini-batch gradient descent, however, does not guarantee good convergence, but offers satyen kale phd thesis few challenges that need to be satyen kale phd thesis. Choosing a proper learning rate can be difficult.
A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge. Learning rate schedules [1] try to adjust the learning rate during training by e. annealing, i, satyen kale phd thesis.
reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics [2]. Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima.
Dauphin et al. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau satyen kale phd thesis the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions. In the following, we will outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges.
We will not discuss algorithms that are infeasible to compute in practice for high-dimensional data sets, satyen kale phd thesis, e. second-order methods such as Newton's method. SGD has trouble navigating ravines, i. areas where the surface curves much more steeply in one dimension than in another [4]which are common around local optima.
In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum as in Image 2. Momentum [5] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Image 3. Note: Some implementations exchange the signs in the equations.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way until it reaches its terminal velocity if there is air resistance, i. The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions, satyen kale phd thesis.
As a result, we gain faster convergence and reduced oscillation. However, a ball that rolls down a hill, blindly following the slope, satyen kale phd thesis, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion satyen kale phd thesis where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient NAG [6] is a way to give our momentum term this kind of prescience.
We can now effectively look ahead by calculating the gradient not w. the approximate future position of our parameters:, satyen kale phd thesis. While Momentum first computes the current gradient small blue vector in Image 4 and then takes a big jump in the direction of the updated accumulated satyen kale phd thesis big satyen kale phd thesis vectorNAG first makes a big jump in the direction of the previous accumulated gradient brown vectormeasures the gradient and then makes a correction red vectorwhich results in the complete NAG update green vector.
This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks [7]. Refer to here for another explanation about the intuitions behind NAG, while Ilya Sutskever gives a more detailed overview in his PhD thesis [8]. Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance.
Adagrad [9] is an algorithm for gradient-based optimization that does satyen kale phd thesis this: It adapts the learning rate to the parameters, performing smaller updates i. low learning rates for parameters associated with frequently occurring features, and larger updates i. high learning rates for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.
Dean et al. Moreover, Pennington et al. Interestingly, without the square root operation, the algorithm performs much worse. One of Adagrad's main benefits is that it eliminates the satyen kale phd thesis to manually tune the learning rate.
Most implementations use a default value of 0. Adagrad's main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training.
This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
The following algorithms aim to resolve this flaw. Adadelta [13] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. As the denominator is just the root mean squared RMS error criterion of the gradient, we can replace it with the criterion short-hand:.
The authors note that the units in this update as well as in SGD, Momentum, or Adagrad satyen kale phd thesis not match, i. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:, satyen kale phd thesis.
With Adadelta, we do not even need to set a default learning rate, as satyen kale phd thesis has been eliminated from the update rule. RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.
RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above:.
MSc and PhD Thesis Structure
, time: 26:52We would like to show you a description here but the site won’t allow blogger.com more MGs1 ¸ \(íF 0 00 j s 01 ! Jan 19, · Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as
No comments:
Post a Comment