The Universal Approximation Theorem asserts that for any function f (of a broad class, roughly speaking any continuous function) there is a neural network that approximates it as closely as desired, but it doesn't tell us what that network looks like or how to find it. That's where ‘machine learning’ comes in: we set up an algorithm that lets the computer find that neural network by itself, that is, ‘learn the solution’ from data.
The process is called training the neural network. The data is a list of inputs and outputs, of Xs and ys (where y = f(X)), for example a list of photos and the associated labels (e.g. cat, dog). The architecture and size of the network are set in advance and the parameters are initialized randomly. At this initial stage the neural network already works, in the sense that if you give it a photo it will output a prediction —cat or dog— but its ‘predictions’ will be random and therefore useless.
During training the parameters of the network —the weights and biases— are adjusted to better categorize the data set's photos as those of a dog or a cat. That is, the network becomes less wrong. The metric for wrongness is called ‘loss’; it quantifies the deviation of the network's predictions from the correct answers, a deviation training seeks to minimize. Even if the network correctly classified an image of a cat as an image of a cat, there would still be some loss involved if it gave a probability of only 80% for ‘cat’ instead of 100%.
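To make the 80%-versus-100% point concrete, here is a minimal sketch of one common choice of loss, the cross-entropy loss, for a single binary cat/dog prediction. The text does not commit to a particular loss function, so treat this as an illustrative assumption:

```python
import math

def cross_entropy(predicted_prob, true_label):
    """Loss for one binary prediction (cat = 1, dog = 0).

    predicted_prob is the network's probability for 'cat'.
    """
    # Probability the network assigned to the *correct* answer:
    p = predicted_prob if true_label == 1 else 1 - predicted_prob
    return -math.log(p)

# A confident correct prediction has low loss...
print(cross_entropy(0.99, 1))  # ≈ 0.01
# ...but an 80%-confident correct prediction still incurs some loss:
print(cross_entropy(0.80, 1))  # ≈ 0.22
```

The loss is zero only at 100% confidence in the right answer, so even correct-but-hesitant predictions leave something to minimize.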
The process is a kind of numerical approximation. There are mathematical problems that have analytical solutions, for example quadratic equations such as ax² + bx + c = 0.
That it has an analytical solution means there's a formula one can use to compute it, namely the quadratic formula. Polynomial equations of degree 5 and up, on the other hand, like the one below, generally cannot be solved analytically:1
Still, the solutions to the equation can be found, or at least approximated to arbitrary precision.2 How? By making guesses. You try different values of x until you hit a solution. For example, I just tried different values with a calculator and found that one of the solutions to the equation above is around -1.438. Mine was a back-of-the-napkin kind of search, and it found only one of several solutions, but there are algorithms that systematically find the solutions of such equations.
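One such systematic algorithm is bisection. Since the quintic from the text isn't reproduced here, the sketch below uses a stand-in, x⁵ - x - 1 = 0, which likewise has no closed-form solution:

```python
def f(x):
    # A stand-in degree-5 polynomial (not the one from the text):
    # x^5 - x - 1 = 0 has no quintic formula, but a real root near 1.167.
    return x**5 - x - 1

def bisect(f, lo, hi, tol=1e-9):
    """Systematic guessing: repeatedly halve an interval where f changes sign."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:   # the root lies in the lower half
            hi = mid
        else:                     # the root lies in the upper half
            lo = mid
    return (lo + hi) / 2

root = bisect(f, 1, 2)   # f(1) = -1 and f(2) = 29, so a root lies between
print(round(root, 4))    # 1.1673
```

Each halving doubles the precision, so about thirty guesses already pin the solution down to nine decimal places.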
Neural networks are formulas, and together with a data set and a loss function they constitute an equation that can be solved: just as there are values of x that make the left-hand and right-hand sides of the equation above equal, so there are parameter values that make the loss minimal. Neural networks, too, are not analytically solvable. As you might imagine, guessing the correct values of millions of parameters is not as straightforward as guessing the value of a single variable, which is why a method has been developed for this case, called gradient descent.
To give a sense of how gradient descent works, I'll start with a metaphorical illustration.
The surface of the earth is not smooth — there are hills, mountains and valleys, such that different geographical points have different elevations. In other words, over the surface of the earth altitude is a function of longitude and latitude. Each longitude-latitude combination has an altitude that can be experienced by walking there or be found on an atlas.
Imagine you're standing somewhere in nature, or on the street of a hilly city. The wind blows gently in your face and you are full of energy. Your aim is to go up, to maximize your altitude, that is, to move yourself on foot to the highest point possible. The catch: it's pitch dark, you have no phone and you don't know where you are. One way to achieve your aim is to step in the direction of the upward slope. You inspect your immediate surroundings with your foot, pick the most ascending direction and take a step. Then you inspect again, choose, step, and so forth until you've reached a peak, surrounded by descending slopes. Voilà: by incrementally changing your latitude and longitude you have reached a maximum altitude. ‘A maximum’ and not ‘the maximum’ since, unless you were around it to begin with, you did not reach the top of Mount Everest, the highest point on earth, but some other ‘local maximum.’
This kind of stepping goes on during neural network training. Instead of altitude we have loss, which we want to minimize rather than maximize. The longitude and latitude we moved over in the analogy are the parameter values in the neural network case. The network's current parameters put us at some loss value, and to minimize it we make steps, increasing and decreasing not longitude and latitude but the weight and bias values of the neurons. Looking for the mountain top was a movement in a mere two dimensions; training the neural network is a movement in a dimensionally much greater space: one dimension for the bias of each neuron and one for the weight of each neuron-neuron connection. A petty perceptron of 50 neurons might have 450 parameters,3 making the movement 450-dimensional. The direction taken is along the gradient, the mathematical term for the multidimensional version of the slope, and specifically down it (to minimize the loss), hence ‘descent.’ In gradient descent we descend along the gradient of the loss function.
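In code, the stepping loop looks like this. A toy sketch, minimizing a hand-made two-parameter ‘landscape’ rather than a real network's loss (the function and its minimum at (3, -1) are made up for illustration):

```python
def loss(x, y):
    # A toy two-parameter 'landscape' whose lowest point is at (3, -1).
    return (x - 3)**2 + (y + 1)**2

def gradient(x, y):
    # Partial derivatives of the loss: the 'slope' in each dimension.
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0       # an arbitrary starting point
learning_rate = 0.1   # how big each step is
for _ in range(100):
    gx, gy = gradient(x, y)
    x -= learning_rate * gx   # step *against* the slope in each dimension...
    y -= learning_rate * gy   # ...because we descend rather than ascend
print(round(x, 3), round(y, 3))  # ≈ 3.0, -1.0
```

A real network runs the same loop, only over hundreds or millions of dimensions instead of two, with the gradient supplied by backpropagation.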
Unlike the analogy, where we move from our starting position to a summit over constant terrain, during training the steps iterate over different batches of data samples (e.g. photo + cat/dog label), each of which yields a different relationship between the network's parameters and the loss, as if at every step the landscape around us morphed. The reason for this is more technical than theoretical: many networks are trained on huge datasets, and it would be infeasibly slow and complicated to compute the average loss over the entire data set at each step. Still, since there is a pattern shared across the samples, the landscape changes congruently, such that stepping iteratively over its different manifestations attunes the network to the commonalities of each category (dog/cat), rather than to the incidental features of any one particular image or batch thereof.4
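The batch-to-batch morphing can be sketched with a one-parameter model fitted by this batched form of gradient descent. Each step samples a different batch, so each step descends a slightly different loss surface, yet all batches share the underlying pattern y = 2x (the data, model and numbers here are illustrative assumptions):

```python
import random

# Synthetic dataset following one common pattern: y = 2x.
data = [(x, 2 * x) for x in range(-10, 11)]

w = 0.0              # a single parameter; the model predicts w * x
learning_rate = 0.01
random.seed(0)
for step in range(200):
    batch = random.sample(data, 4)   # a different batch each step...
    # ...so the gradient of the squared error (w*x - y)^2, namely
    # 2*(w*x - y)*x averaged over the batch, differs slightly per step,
    # as if the landscape morphed underfoot.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= learning_rate * grad
print(round(w, 2))  # ≈ 2.0
```

Despite never seeing the whole data set at once, the parameter settles on the commonality of all the batches.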
The process by which the 450-dimensional stepping direction is chosen5 is called backpropagation.
As you recall, the network renders a giant equation. And not just any equation, but one with a serial —multi-phase— structure: the first layer performs a computation on the network's input, the next layer performs a computation on the output of the previous layer, and so forth until the last layer. On our hike we could feel with our foot to determine how a step would affect our altitude. In backpropagation we do likewise, only in phases, starting from the last layer and going backwards towards the first — hence ‘back.’ The ‘propagation’ part comes from the fact that what is passed backwards through the layers is information about how changes would affect the loss, i.e. the information is propagated.
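The serial, multi-phase structure amounts to function composition. A toy sketch with made-up ‘layers’ (real layers would involve many weights; these are single-number stand-ins):

```python
def layer1(x):
    # First phase: a computation on the network's input.
    return 2 * x + 1

def layer2(h):
    # Next phase: a computation on the previous layer's output
    # (here a ReLU-style non-linearity).
    return max(0.0, h)

def layer3(h):
    # Last phase.
    return 3 * h - 5

def network(x):
    # The whole network is one serial formula: phase after phase.
    return layer3(layer2(layer1(x)))

print(network(2.0))  # layer1: 5.0 → layer2: 5.0 → layer3: 10.0
```

Backpropagation walks this chain in the opposite direction, from layer3 back to layer1.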
First, note that in the case of the hike, making a step in the direction of the slope is equivalent to making steps in the cardinal directions proportional to the respective slopes. That is, if the slope goes up towards north-east, then by making one step east (on the longitudinal dimension) and one step north (on the latitudinal dimension) we would in total move north-east. If the slope goes up in the north-north-east direction, it is steeper in the latitudinal direction than in the longitudinal one; if our step in the former dimension is correspondingly bigger than the one we make in the latter, in effect we would move in the overall steepest direction.
Likewise, we want to make steps in each of the 450 dimensions to the same effect. To figure out the effect of a prospective step on the loss, we begin at the last layer. It goes as follows. The loss, computed from the last layer's output, is determined by that layer's parameters and by its input. Given the input (coming from the second-to-last layer), we look at how marginally changing each of the parameters would decrease the loss. This gives us the direction of the step —the change in value— we should make for each of the last layer's parameters. We then look at the previous layer, whose effect on the loss is mediated by the layers after it, here only the last layer. Again, given the input to the layer, we look at how marginally changing each of its parameters would affect the input to the next layer, given its parameters, in such a way as to decrease the loss. And so with the layer before that, whose effect on the loss is mediated by two layers. The further away (more mediated) the parameters are from the computation of the loss at the output of the network, the longer the arithmetic computation needed to determine the change that decreases the loss, but it is merely a computation.
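A minimal sketch of that backward pass, for a ‘network’ of just two weights and one sample. The gradients are computed starting at the loss and propagated back towards the input, phase by phase (all numbers are illustrative):

```python
# One sample and two parameters, chosen arbitrarily for illustration.
x, y = 2.0, 10.0
w1, w2 = 1.5, 2.0

# Forward pass, phase by phase:
h = w1 * x            # first layer's output: 3.0
out = w2 * h          # last layer's output: 6.0
loss = (out - y)**2   # squared-error loss: 16.0

# Backward pass, starting from the loss (hence 'back'):
d_out = 2 * (out - y)   # how the loss changes with the output: -8.0
d_w2 = d_out * h        # gradient for the last layer's weight: -24.0
d_h = d_out * w2        # information propagated back to the previous layer: -16.0
d_w1 = d_h * x          # gradient for the first layer's weight: -32.0

print(d_w2, d_w1)  # -24.0 -32.0
```

Note that d_w1 required first computing d_out and d_h: the farther a parameter is from the output, the longer the chain of multiplications, but it remains mere arithmetic.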
The steps taken are small. Imagine that after shooting an arrow that missed the mark to the left, you aimed further right, moved the target left and adjusted for the wind, all at once: having changed too much, you would now miss in the other direction. In addition, the great number of non-linear computations between input and output makes the network akin to a game of Telephone (Chinese Whispers), and the effect brought about by changing one parameter turns out different from what was expected once all the other parameters have been changed too. And still, even with conservative steps, the process faces many challenges. These have been solved, or at least tackled, by techniques that fall outside the scope of this series.
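The cost of over-correcting can be seen on the simplest possible landscape, f(w) = w², whose gradient is 2w (a toy sketch, not the text's network):

```python
def descend(learning_rate, steps=20):
    """Minimize f(w) = w^2 (gradient 2w), starting from w = 1."""
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

# A small step size creeps steadily towards the minimum at 0...
print(abs(descend(0.1)))
# ...while a too-large one overshoots the minimum and lands farther
# away on the opposite slope, again and again:
print(abs(descend(1.1)))
```

With the small step the distance to the minimum shrinks every iteration; with the large one it grows, and the process diverges.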
To sum up: the neural network is initialized with random parameters; a loss function operating on the network's output, given batches of dataset samples, yields a measure of how off the network is; information on how each parameter influenced that loss is carried through the network, from the output layer to the input layer (backpropagation), and is then used to modify the parameters so as to decrease the loss (gradient descent); this is done iteratively with different dataset batches until the network yields good predictions.
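The whole loop can be sketched end to end on a toy one-feature dataset, with logistic regression standing in for a network (all specifics, the data, the model and the hyperparameters, are illustrative assumptions):

```python
import math
import random

random.seed(1)

# Toy stand-in for the photo dataset: one feature per 'image',
# label 1 ('cat') exactly when the feature is positive.
data = [(x, int(x > 0)) for x in (random.uniform(-5, 5) for _ in range(100))]

# 1. Initialize the parameters randomly.
w = random.uniform(-1, 1)
b = random.uniform(-1, 1)
learning_rate = 0.5

# 4. Iterate over different batches...
for step in range(300):
    batch = random.sample(data, 10)
    grad_w = grad_b = 0.0
    for x, y in batch:
        p = 1 / (1 + math.exp(-(w * x + b)))  # forward pass: predicted P('cat')
        # 2.-3. Gradient of the cross-entropy loss with respect to each parameter:
        grad_w += (p - y) * x / len(batch)
        grad_b += (p - y) / len(batch)
    w -= learning_rate * grad_w   # the gradient-descent step
    b -= learning_rate * grad_b

# The trained model now predicts far better than chance.
accuracy = sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)
print(round(accuracy, 2))
```

In a real network the two gradient lines are replaced by backpropagation through many layers, but the skeleton, initialize, forward, measure loss, step, repeat, is the same.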
1. Asserted by the Abel–Ruffini impossibility theorem, proved in 1824.
2. It's possible that some or all of the solutions are irrational numbers, i.e. numbers that can only be represented with an infinite, non-repeating sequence of digits past the decimal point.
3. If our perceptron has at its core 5 layers of 10 neurons each, that makes 5 × 10 = 50 biases. The layers are fully connected, so between each pair of adjacent layers ten neurons are each connected to ten others, amounting to 10 × 10 = 100 weights per pair; with 4 such pairs, that gives 4 × 100 = 400 weights altogether between the layers (ignoring the connections into and out of the perceptron). That sums to 450 parameters.
4. The last point is related to the concepts of ‘underfitting’ and ‘overfitting’. The former refers to a network that has not been fitted enough to the training data. A (randomly) initialized network, i.e. one that has not been trained at all, is the extreme example of an underfitted network. On the other end are networks that are ‘overfitted’: they predict too well. You might wonder how a network could predict ‘too well.’ The issue is that while we want the perfect network, one that, say, correctly classifies any dog/cat image, we only have a limited number of images to train it on. In principle, a network trained to the point of acing the data set might also ace any new image, i.e. one it has not been trained on, but in practice such apparently very good networks turn out to rely on incidental features. For example, if the data set contained three images of a cat lying on a red carpet, the network might learn that a ‘red carpet’ means the image is ‘cat.’ Naturally, dogs might also lie on red carpets; it just so happened that such a scenario was not represented in the data set. When training a network one therefore sets aside part of the available data as a ‘test set’, which is not used to train the network but to assess it afterwards. So long as the loss on the test set decreases with training, all is fine. Once further iterations of gradient descent lead the loss on the test set to increase while the loss on the training set keeps decreasing, the network has begun to overfit.
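An extreme caricature of overfitting: a ‘model’ that simply memorizes its training set and guesses at random on anything unseen. It aces the training data and is useless on the test set (a toy sketch of the train/test distinction, not of how networks actually fail):

```python
import random

random.seed(0)

def make_data(n):
    # One feature per sample; the true rule: label 1 when x > 0.
    return [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(n))]

train, test = make_data(50), make_data(50)

# The overfitted 'model': a lookup table of the training samples.
memory = dict(train)

def memorizer(x):
    return memory[x] if x in memory else random.randint(0, 1)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train))  # 1.0: it 'predicts too well'
print(accuracy(memorizer, test))   # around chance level on held-out data
```

The gap between the two printed numbers is exactly what the held-out test set exists to reveal.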
5. In contemporary real-life cases, the stepping direction can have as many as a trillion dimensions.