LTG: Losses, TensorFlow & Gradient Descent

Edudzi Mamattah
6 min read · May 5, 2021

Part One

Our journey into the world of machine learning continues. I’ve gotten up to some stuff since the last update, and I’ll try to keep it as concise as possible.

Iteration

The iterative approach

In my previous article, I mentioned that I studied some methods that ML (machine learning) engineers use to minimize loss. As I went along, I found out that one way of reducing loss is to use an iterative approach. For a linear regression problem, this means the model is fed feature values while the bias and weight(s) start out as assumed values. The model’s prediction is then fed into a loss function, such as the squared loss function discussed last week, and the loss is evaluated. From that point on, you make an educated guess at new bias and weight values that will push the loss toward its minimum (ideally close to zero), the sweet spot at which we can tell our model is reliable with its predictions.

The “compute loss” and “compute parameter updates” in the diagram combine to do this job and then feed their values back into the model.
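
To make that loop concrete, here’s a minimal Python sketch of the whole cycle on some made-up data (the toy data, variable names, and the specific update rule are my own illustrative choices; the update step quietly uses the gradient idea unpacked in the next section):

```python
import numpy as np

# Toy data for one-feature linear regression: y is roughly 3x + 4 plus noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 4 + rng.normal(0, 1, size=50)

w, b = 0.0, 0.0  # assumed starting values for the weight and bias

for step in range(2_000):
    y_pred = w * x + b                  # the model makes a prediction
    loss = np.mean((y - y_pred) ** 2)   # "compute loss" (squared loss)
    # "compute parameter updates": nudge w and b so the loss shrinks,
    # then feed the new values back into the model for the next pass
    grad_w = np.mean(-2 * x * (y - y_pred))
    grad_b = np.mean(-2 * (y - y_pred))
    w -= 0.01 * grad_w
    b -= 0.01 * grad_b

print(w, b)  # should end up close to the true values 3 and 4
```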

Gradient Descent

In the iterative approach diagram, the box labelled “Compute parameter updates” is left undefined: a black box of sorts. We will now replace that black box with something more transparent.

Assume we had the time and computing resources to calculate the loss for all possible values of w1. For the kind of regression problems I have encountered so far, the resulting plot of loss versus w1 would always be convex. That implies it will be bowl-shaped, like in the diagram below.

Convex problems have only one minimum, a place where the slope is exactly 0. That minimum is where the loss function converges. Calculating the loss value for every possible value of w1 would be an inefficient way of finding this convergence point. A better approach is a method called gradient descent.

The first stage in gradient descent is to pick a starting value for w1. The starting point doesn’t matter too much, so many algorithms simply set it to 0. The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. This value is equal to the derivative (slope) of the curve at that point, and it tells you which direction to move in to approach the minimum. NB: the gradient is a vector, so it has both magnitude and direction.

Loss vs Weight curve with the bowl-shaped nature

When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights. I will not be explaining partial derivatives in this article, to keep the length manageable.

The gradient always points in the direction of the steepest increase in the loss function, so the gradient descent algorithm takes a step in the direction of the negative gradient to reduce loss as quickly as possible. To determine the next point along the loss curve, the algorithm moves away from the starting point by some fraction of the gradient’s magnitude, in that negative direction. It then repeats this process, edging ever closer to the minimum.
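
Here is what that looks like in code: a minimal sketch of one-dimensional gradient descent on a made-up convex loss, L(w1) = (w1 - 5)**2, whose minimum sits at w1 = 5. The loss function, starting point, and step fraction are all illustrative choices of mine, not prescribed by any course material.

```python
def loss(w1):
    return (w1 - 5) ** 2

def gradient(w1):
    return 2 * (w1 - 5)  # the derivative/slope of the loss at w1

w1 = 0.0  # starting value; the starting point doesn't matter too much
for step in range(10):
    grad = gradient(w1)
    w1 -= 0.3 * grad  # step in the direction of the NEGATIVE gradient
    print(f"step {step}: w1 = {w1:.4f}, loss = {loss(w1):.4f}")
```

Run it and you can watch w1 creep toward 5 while the printed loss shrinks toward zero.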

Visualising the gradient descent algorithm (in a more fun way, of course)

P.S. I’m really hoping this makes some sense; I’m at my wit’s end with this one.

Let us imagine there is a tiny spider inside a huge bowl. Let’s assume this spider can only jump. It can’t walk along the walls of the bowl, nor can it use webs to move.

So the spider is dropped at a point on the side of the bowl: the starting point. To move, the spider must eat a smaller insect that flies by it, and based on how much energy it gets from the insect, it jumps to another point on the side of the bowl. This represents how the slope is calculated at each point, with that value guiding the algorithm to its next point on the curve. The spider keeps eating these insects until it reaches its goal: the bottom of the bowl’s interior (where the slope is zero, which means the function has converged).

The bowl is the loss vs. weight curve shown above, and the spider represents the points that the algorithm uses along the curve.

Visualisation of Gradient Descent

Learning rate

The gradient vector has both magnitude and direction. Gradient descent algorithms multiply the gradient by a scalar called the learning rate (also known as the step size) to determine the next point. For instance, if the gradient’s magnitude is 2.5 and the learning rate is 0.01, then the next point the algorithm picks will be 0.025 away from the previous point.
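
In code, that calculation is a single multiplication (the numbers below just mirror the example above):

```python
gradient = 2.5        # the gradient's magnitude
learning_rate = 0.01  # the scalar step size

step = learning_rate * gradient
print(step)  # 0.025: the next point is this far from the previous one,
             # in the direction of the negative gradient
```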

What is a hyperparameter?

Hyperparameters are the knobs and levers that programmers tweak in ML algorithms. The learning rate is one such hyperparameter; there’s no real way to know in advance what the ideal rate will be, hence the tweaking. A learning rate that is too small will take too long to converge, and one that’s too large will bounce haphazardly around the curve. There is a Goldilocks learning rate for every regression problem, and that value is related to how flat the loss function is. If you know that the gradient of the loss function is small, you can safely use a larger learning rate, which compensates for the small gradient and results in a larger step size.
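
To see all three regimes, we can rerun the toy one-dimensional descent from earlier with a few different learning rates (the specific rates below are my own illustrative guesses, exactly the kind of tweaking described above):

```python
def run(learning_rate, steps=10):
    """Run a few steps of gradient descent on the toy loss (w1 - 5)**2."""
    w1 = 0.0
    for _ in range(steps):
        w1 -= learning_rate * 2 * (w1 - 5)  # 2 * (w1 - 5) is the gradient
    return w1

print(run(0.001))  # too small: after 10 steps, w1 has barely moved from 0
print(run(0.3))    # Goldilocks: w1 lands very close to the minimum at 5
print(run(1.1))    # too large: w1 overshoots and bounces further out each step
```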

Stochastic Gradient Descent (SGD)

In gradient descent, a batch is the total number of examples used to calculate the gradient in a single iteration. Until now, we have assumed the batch to be the entire data set. In real-world scenarios, the volume of data is typically too large to use the entire data set as a batch. A very large data set with randomly sampled examples will probably contain redundant data, and as the batch grows, redundancy becomes more likely. Some redundancy can help smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than merely large ones.

By choosing examples at random from the data set, we could estimate (albeit noisily) a big average from a much smaller one.

Batch vs Whole
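
Here’s a quick sketch of that sampling idea, estimating the mean of a large synthetic data set from a small random sample (all the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = rng.normal(10, 2, size=1_000_000)  # a "huge" data set

sample = rng.choice(population, size=100)  # 100 examples chosen at random
print(population.mean())  # close to 10.0, computed over a million values
print(sample.mean())      # noisy, but a decent estimate from just 100 values
```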

Stochastic Gradient Descent takes this idea to the extreme: it uses only one example (a batch size of 1) per iteration. Given enough time, SGD works, but it is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.

Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. This method reduces the amount of noise in SGD while still being more efficient than full-batch iteration.
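
Here is a minimal NumPy sketch of mini-batch SGD on a made-up one-feature linear regression (the data, batch size, and learning rate are all my own illustrative choices). Setting batch_size = 1 would turn this into plain SGD, while batch_size = len(x) would recover full-batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=1_000)
y = 3 * x + 4 + rng.normal(0, 1, size=1_000)  # true weight 3, true bias 4

w, b = 0.0, 0.0
batch_size, learning_rate = 32, 0.01

for step in range(2_000):
    idx = rng.integers(0, len(x), size=batch_size)  # a random mini-batch
    xb, yb = x[idx], y[idx]
    error = yb - (w * xb + b)
    w -= learning_rate * np.mean(-2 * xb * error)  # gradient estimated from
    b -= learning_rate * np.mean(-2 * error)       # the batch, not the full set

print(w, b)  # should land near the true values 3 and 4
```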

NB: gradient descent and SGD can be used on feature sets that contain multiple features.

In the next part, I will talk about how I completed actual linear regression problems using some machine learning tools — with Python as the programming language.

Until next time,

Edudzi.
