This post is the most technical of the series, employing the most mathematical expressions. It should also be the last to employ any at all. It deals with a technical point whose spirit is summed up in the next paragraph. The post could be skipped without great risk to the understanding of the following posts. Still, it gives a glimpse into what computations within neural networks look like, rendering them more prosaic and less magical, or so I'd hope.
This post is about the Universal Approximation Theorem and the role non-linear activation functions play in it. The theorem states that for any imaginable function, that is, for any assignment scheme of outputs to inputs, there is a neural network that would approximate it. As long as the data concerned can be converted to numbers, this means that there is a neural network that would approximate any assignment to the data set's items. For example, given an album of photos, we could have a neural network that classifies the photos as dog or cat photos, or daytime/night-time photos, or according to the photographer, or the number of animals in them, or what have you. Given a collection of incomplete texts, we could have a neural network that assigns the next word to any of them, i.e. a language model.
The rest of the post neither proves this nor tries to suggest why it might be the case. Rather, it discusses the role of non-linear activation functions in neural networks vis-à-vis the theorem, and why the theorem would not hold with linear activation functions. Along the way, it shows how the input is operated on as it passes through the network.
LLMs are giant equations
Multilayer perceptrons take an input and yield an output. More specifically, the input gets mathematically manipulated one layer at a time: it gets multiplied by a weight, aggregated, the activation function is applied, and so on until it finds its way out. A simple MLP with one ‘layer’ consisting of a single neuron could be represented algebraically as follows,
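$$\text{weighted input} = x \cdot w, \qquad \text{aggregate} = x \cdot w + b, \qquad \text{output} = A(x \cdot w + b)$$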
or simply:
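$$A(xw + b)$$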
where x is the input, w is the weight between the input and the one neuron, b is the neuron's ‘bias’ — nothing more than a value that is added to the neuron's aggregated input before the application of the activation function — and A() is the activation function.
If the MLP had two layers with one neuron each, we would get a simple recursion. We would apply the above formula twice, only the second time, instead of using the network's input x, we would use the output of the first layer. If we expand it, it would look like so:
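$$A\big(A(xw + b)\,w + b\big)$$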
where we substituted x with A(xw + b). Note that while w and b appear twice, these are different values: the first w and b are the weight and bias of the first neuron, the second pair belongs to the second neuron. I wanted to avoid making the equation messy with the subscripts that are commonly used for distinction. The A()s, however, are identical across the network's neurons.
Now, let's say we have one layer with two neurons followed by a second layer with a single neuron. As a reminder, each layer receives input from all the nodes of the previous layer, so each of the first layer's two neurons receives the network's input and forwards its output to the next layer's single neuron. I'll use the letters a, b, c for the neurons' biases and q, u, v, w for the weights.
Let y and z be the outputs of the first and second neuron, respectively, of the first layer. Each receives the network's input x. Their outputs are:
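$$y = A(xq + a), \qquad z = A(xu + b)$$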
The next neuron, the single neuron of the second layer, receives the output from both as input, and its own output is the network's output:
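$$\text{output} = A(yv + zw + c) = A\big(A(xq + a)\,v + A(xu + b)\,w + c\big)$$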
That's the algebraic representation of our simple three-neuron MLP, which has 7 parameters: 1 bias for each of the 3 neurons and 1 weight for each of the 4 connections.
I hope you can convince yourself that the output of an MLP of any size, be it 10, 1,000 or a million neurons big, could be expressed in such a closed-form, albeit unseemly, equation. Plugging in the numbers, with the assistance of a calculator, you could derive the network's output yourself.
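To make this concrete, here is a minimal sketch in Python of the three-neuron computation above, with made-up values for the seven parameters and tanh (an activation function that appears later in this post) arbitrarily standing in for A():

```python
import math

# One possible choice of activation function; any non-linear function
# could stand in for A().
def A(x):
    return math.tanh(x)

# Made-up parameters: q and u are the weights into the two first-layer
# neurons (with biases a and b); v and w are their weights into the
# single second-layer neuron (with bias c).
q, u, v, w = 0.5, -1.2, 2.0, 0.7
a, b, c = 0.1, 0.3, -0.2

x = 1.5                          # the network's input
y = A(x * q + a)                 # first neuron of the first layer
z = A(x * u + b)                 # second neuron of the first layer
output = A(y * v + z * w + c)    # the single neuron of the second layer
print(output)
```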
An LLM is more than an MLP, but it, too, could be expressed as a preposterous equation. You may wonder what an LLM, which takes text as input and yields text as output, could have to do with an equation that takes in and outputs numbers. The short answer is that the text is converted into numbers, which are manipulated, and these are converted back into text. We will get to the full answer down the road. For now, take a leap of faith with me that this is the case.
The Universal Approximation Theorem
Now, this is the cruncher: the Universal Approximation Theorem tells us that for any function f(x), we can make a perceptron that arbitrarily approximates it. The function f(x) does not have to be mathematical: perhaps for each input God reaches a hand through the firmament and presents the answer. The function f(x) might be the perfect divine ChatGPT that always says the right thing. Still, there is a perceptron that, given the right number of neurons with their weights and biases set correctly, would arbitrarily approximate the divine answers. We don't know how many neurons would be necessary, nor do we know how to choose the parameters, but we know that such a perceptron ‘exists’ in the mathematical sense. That is, it does not yet necessarily exist, but it would exist if we discovered it.
You know what ‘approximate’ means: to get something that is not exactly identical to the reference, but which is ‘close enough.’ The ‘arbitrarily’ means that you decide how ‘close’ is ‘enough’: whether a deviation of 0.1 or above is too much, or of 0.01, or of 0.0001. A deviation of 0 is not a deviation at all, and the theorem does not speak of such cases. But anything greater than zero, however slight, falls under the theorem's auspices.
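Put in symbols, as a loose rendering of what the theorems guarantee: for whatever tolerance ε > 0 you pick, there is a network N whose output never strays from the target function f by more than that tolerance,

$$\big|\,N(x) - f(x)\,\big| < \varepsilon \quad \text{for every input } x \text{ in the range of interest.}$$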
It should be noted here that while for practical purposes neural networks output a clear ‘decision,’ for example either a cat OR a dog as a classification, strictly speaking the perceptron itself outputs a score for each of the alternatives. These scores are then sifted through a special layer (called softmax¹) which converts them into probabilities, for example 20% dog and 80% cat. Thereafter comes another layer that picks one of the alternatives, based on the probabilities, as the output. In neural networks that do classification, that would be the option with the highest probability. Still, a network that calculates an 80% or 0.8 probability of ‘cat’ for an image of a cat is off by 0.2; a network that calculates 0.97 for the same image is only off by 0.03.
The theorem also applies to multivariate functions, i.e. functions that take more than a single number as input. For example, a perceptron that takes 64x64 black-and-white images of handwritten digits and classifies them would have an input of 64 x 64 = 4096 dimensions (each representing the darkness of a pixel, just like each of the three dimensions of a dresser represents its size in a direction: height, breadth, length) and output a single value, an integer between 0 and 9. The input is in the form of a vector X, a list of 4096 values. A perceptron that classifies 720 x 1080 colour images into cats and dogs would take a vector of 3 x 720 x 1080 = 2,332,800 dimensions (red, green and blue values times the number of pixels) and output a single value: 0 for dogs, 1 for cats, and 0.5 for God knows what this is (for example).
Reduction of linear activation function networks
There's a condition that I did not mention. For the theorem to be valid, the activation function of the perceptron's neurons must be non-linear. A linear function is a function of the form f(x) = ax + b. It's called linear because it describes a line. When you plot it, you get a straight line with a slope a, offset from the origin by b. A non-linear function is any function that cannot be written in this form.
The theorem is not really a single theorem but a family of theorems, each with its own specifications and particularities, some proven, some not. We will not get to the proof of any, but I'll demonstrate why a linear activation function would not do.
Let us take the mathematical expression of our three-neuron MLP above (represented as g):
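$$g(x) = A\big(A(xq + a)\,v + A(xu + b)\,w + c\big)$$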
Let us take the simplest linear activation function, the identity function. That's a function that yields its input unchanged, i.e. A(x) = x. It's a line whose slope is 1 and whose offset is 0. Making the substitutions in the equation above, we get:
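$$g(x) = (xq + a)\,v + (xu + b)\,w + c = x\,(vq + wu) + (va + wb + c)$$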
What did we get here? Our three-neuron network could be replaced by a ‘network’ of a single neuron with a weight equal to vq+wu and a bias equal to va+wb+c! Once we plug in the values of these parameters, these are just numbers. Indeed, when the activation function is linear, this is the case for a network of any size. You can have a network of a billion neurons, but it would amount to the operation of a single neuron.
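If you would rather check the algebra numerically than by hand, here is a small sketch (with arbitrary parameter values) showing that, with the identity activation, the three-neuron network and the single collapsed neuron agree:

```python
# Arbitrary values for the seven parameters
q, u, v, w = 0.5, -1.2, 2.0, 0.7
a, b, c = 0.1, 0.3, -0.2

def identity(x):              # the linear activation A(x) = x
    return x

def three_neuron_net(x, A=identity):
    y = A(x * q + a)
    z = A(x * u + b)
    return A(y * v + z * w + c)

def collapsed_neuron(x):
    weight = v * q + w * u    # the single equivalent weight
    bias = v * a + w * b + c  # the single equivalent bias
    return x * weight + bias

for x in (-3.0, 0.0, 1.5, 42.0):
    assert abs(three_neuron_net(x) - collapsed_neuron(x)) < 1e-9
```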
In machine learning discourse, people often speak of ‘non-linear functions,’ but what the theorems really preclude are polynomial functions. That is, the activation function must not be a polynomial. A linear function is a kind of polynomial.
A polynomial is a function whose output is a sum of scaled powers of the input. That is, if our input is x, each term has the form a*x^p (a times x to the power of p), where a is any number and p is a non-negative integer. In the case of linear functions, the slope is the coefficient of x raised to the power of 1 (which simply yields back x), and the offset is the coefficient of x raised to the power of 0 (which yields 1).
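In symbols, a polynomial in a single variable x has the general form

$$p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n,$$

and a linear function is the special case with only the first two terms, i.e. the ax + b from before.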
I won't demonstrate it here, but if you have any doubt you might take out a pen and paper and try it out for yourself: if a network's activation function is any polynomial, say p(x) = 5 + 10x - 2x^2 - 10x^3 + x^6, then however many neurons you have, the operation of the network as a whole would be equivalent to a single polynomial function.
To give you a sense of the alternative, consider the ReLU, a (formerly) popular activation function, which can be expressed as ReLU(x) = max(x, 0). It is the identity function for input greater than zero, and it yields zero everywhere else. Though it behaves just like our A(x) = x above for positive input, the fact that it returns zero for negative numbers makes all the difference. If you repeat the exercise of substituting A(x) with the ReLU function, you'd find that the network no longer boils down to a simple equation.
This might be surprising, because ReLU looks pretty linear. However, it has two cases: one where the input is smaller than zero, and one where it is greater than zero. When these are combined in a network, the cases multiply. The network could still be written down as a single equation, but it would have a great number of cases: one where x, the input to the network, is less than -10, one where it's between -10 and -8, one where it's between -8 and -6.5, and so on. Therefore the network as a whole could not be substituted with a single neuron.
But mind that what makes the network irreducible is not the cases but the non-linearity. Another potential activation function is tanh, the hyperbolic tangent, whose equation is
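$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$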
where e is Euler's number, a constant roughly equal to 2.71828. If you make the substitutions, you won't get a simple mathematical expression. The network's equation would have a single case, but it could not be written as if it were a single neuron.
Polynomial approximation
As a final note, more of an aside, I'll touch on a rarely discussed point. The statement of the Universal Approximation Theorems and the reducibility of ‘polynomial networks’ to a single polynomial function might lead one to believe that polynomial functions are less sophisticated and cannot achieve what neural networks can.
Algebraically speaking, there is indeed no sense in constructing a large neural network with a polynomial as the activation function. Nonetheless, we might say that polynomials are just as expressive as neural networks: the Stone–Weierstrass theorem tells us that a polynomial can, just like a perceptron employing non-linearities, arbitrarily approximate any continuous function.
In practice, there are a few challenges to employing polynomials to such tasks as pattern recognition or text completion.
Let's say we dispense with the network structure and just use a single multivariate polynomial — a polynomial that is a function of more than one variable. The following is purely speculative and more of a thought experiment. In our 64 x 64 pixel handwritten digits example we had an input of 4096 dimensions. Commonly in image recognition networks the pixels interact with their neighbours only ‘gradually’ over the layers, but let's say that our polynomial has a multilinear term that includes the values of all the pixels; that is, a term that consists of a coefficient times each of the pixel values to the power of one. Or:
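$$\alpha \cdot x_1 \cdot x_2 \cdot x_3 \cdots x_{4096},$$

where α is the coefficient and x_1 through x_4096 are the pixel values.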
Let us say that these are white digits over a black background, and that pixel values run from 0 (black) to 255 (white). Let us say that, on average, the digit itself occupies only a twentieth of the total image; that is, the curve of the digit fills about 4096 * 0.05 ≈ 205 pixels. The average pixel value would therefore be 255 * 205/4096 + 0 * (4096-205)/4096 ≈ 12.76. This means that, the coefficient aside, this term would have an expected value of
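$$12.76^{4096}$$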
That's a large number, to put it mildly: it has 31971 digits, 4530 of which are to the left of the decimal point.
If instead of 4096 we multiplied only 400 random pixels (which would amount to a 20x20 square of pixels, or about a tenth of the total image), we would still get a large number:
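$$12.76^{400} \approx 2 \times 10^{442}$$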
This is a problem, since such numbers have to be stored somewhere as the computer runs the computation. In machine learning, the ‘biggest’ format in which numbers are commonly stored is the double-precision floating-point format, AKA 64-bit float, which can represent positive numbers only up to ~1.798 * 10^308. In other words, the above numbers could simply not be retained in memory unless we used an appropriate format, and then it becomes a hardware issue — and we've only touched the classification of small black and white handwritten digits!
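As a quick illustration (in Python, whose built-in integers are arbitrary-precision and therefore not themselves subject to this limit), the 64-bit float overflows well before reaching numbers of this size:

```python
import sys

print(sys.float_info.max)     # the largest 64-bit float, about 1.798e308

try:
    print(12.76 ** 400)       # too large for a 64-bit float
except OverflowError as error:
    print("overflow:", error)

# Python's integers grow as needed, so the exact integer 1276 ** 400
# (which is 12.76 ** 400 scaled up by 10 ** 800) is no problem here,
# but fixed-width numeric formats on ML hardware have no such luxury.
print(len(str(1276 ** 400)), "digits")
```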
Alternatively, instead of computing everything in one go, it can be computed in steps — as happens in a neural network. Fortunately, here I can stop speculating, since several papers have investigated such networks. One of them, ‘Polynomial Activation Functions,’ looking at the handwritten digits problem, reported superior performance vis-à-vis commonly used non-linear activation functions, but also that the ‘method is computationally expensive.’ As the training of LLMs is already a tremendous computational challenge, it would not do to make it even more challenging.
That being said, polynomials can only approximate continuous functions, i.e. functions that do not ‘abruptly jump’ when the input is changed ‘smoothly.’ Neural networks can approximate discontinuous functions as well. Neural networks are therefore potentially more expressive, but I cannot say whether this comes into play. What might seem like a discrete (‘abrupt’) change in the network's output might be continuous under the hood. For example, take an image of a handwritten 8 and a handwritten 9. Now imagine we create as many images as we need to represent a transition from one to the other: smoothly changing the 8 image, one image at a time, until we get a 9. There will be a point in between where the network switches from 8 to 9. That's an abrupt change in the output. However, if we peel away the layer that picks the final answer and look at the probabilities underneath, we might discover a continuous change. The probability assigned to 8 continuously goes down, the one assigned to 9 continuously goes up, and at some point the two probabilities cross each other.
The name ‘softmax’ comes from the layer's relation to the maximum function, of which it is a ‘softer’ version. The max function yields a single decisive answer. For a row of five buildings, it would just tell you which one is the tallest, for example the second, and output [0, 1, 0, 0, 0], marking the second as ‘the tallest’ and the other four as ‘not the tallest.’ The softmax gives you information about all items by giving each a share of the total, for example it might yield: [0.1, 0.6, 0.15, 0.1, 0.05]. Here you learn that the second is the tallest, the first and fourth are both the same height and are only surpassed by the second and third buildings, and the fifth is the shortest. The sum of these numbers is 1, so that they may be interpreted as probabilities.
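For the curious, a softmax can be sketched in a few lines of Python; the scores below are made up for illustration (they are not calibrated to reproduce the exact probabilities above):

```python
import math

def softmax(scores):
    # Exponentiate each score (shifting by the maximum first for numerical
    # stability), then divide by the total so the results sum to 1.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [1.0, 3.0, 1.4, 1.0, 0.3]   # made-up 'raw' scores for five buildings
print(softmax(raw_scores))               # five positive numbers summing to 1
```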