Meaning in Context
Now let us look at the word embedder, the part of a transformer that turns words (or word-parts) into the numbers that are fed into the MLP. Before we dive into how the artificial neural network does what it does, let us consider one idea about how humans learn their language. In a paper, Heinz Werner1 (1890-1964) put forward the simple idea that one way we learn the meanings of words as children is through the context in which they appear. If you have ever listened to or read a text in a foreign language and suddenly comprehended a word that kept repeating (or even one that occurred merely once), you will remember having had this experience. If not: since new English words constantly come into being, you have likely figured some out without referring to a dictionary; the same experience again, though probably a less exciting version of it, given your familiarity with English.
This kind of learning is possible thanks to the expectations created by the context. In the sentence ‘John ate my X,’ we would expect X to be a piece of food and not, say, a piece of furniture, even though ‘he ate my chair’ is a possibly meaningful sentence. We certainly have heard the likes of ‘John ate my homework,’ where John is — you've guessed it. Our reliance on expectations makes us nearly blind to typos, lets us overlook incidental omissions of words, allows us to raed scmalrebd seencnets, and leads us to misread unexpected words.
The most trivial example thereof is learning an unfamiliar word through a dictionary, which is a list of definitions indexed by words. The definitions, too, are text. That is, any word's meaning is established through its relation to other words, and through that relation only: if the dictionary were of a completely foreign language and we followed each word by looking up the words in its definition, we would never break out of the words into the real world, the occasional illustration or audio clip notwithstanding.
For a bit of a ‘fun’ demonstration, try these:
This is my talo and my family's ancestral home.
I'll buy you an even bigger talo than this one.
Let's leave everything and go to my dad's talo.
The talo benefits from earthen walls that keep it cool in the summer.
Upon entering the talo we are welcomed by a large hall.
This is the prime minister's talo that's being graffitied.
I told you that my talo was in the debris field.
I will see your talo take prominence in all upcoming games.
Crystal strung beads and made jewelry for everybody in the talo.
Must be strange being in that talo alone, everybody gone.
Here's a more difficult one:
Despite being an aave, he often fears for his safety.
Together, perhaps you could lay this particular aave to rest.
That was even further proof that they were not seeing an aave.
Just as one should not forget, it would be best to be an aave.
The aave is fascinated by the soldier's mysterious sound device.
They describe him as some aave who appeared out of nowhere.
Either the aave followed you, or it was on campus.
One in which my mother's aave walked into the toilet...
This wrapping paper will rip: proof that the aave is real.
Another:
Please heittää your trash into the garbage can after your lunch break.
After losing his favorite game, he flipped out and heittääed his controller.
He will heittää the ball muscularly, sending it sailing across the park.
She clenched her fists and heittääed an air punch in sheer outrage.
The bad-tempered chef heittääed the ingredients in frustration when his dish failed.
He heittääed a mudball at his friend during the rainy afternoon play.
Her friends cheer as she heittääs the beanbag right into the target.
And some more, this time with three ‘occluded’ words:
Available in blue, red and off-valkoinen colours, it makes a fun companion whether likasted onto the Christmas tree, an ovi handle or your keys.
Sidling towards the ovi, he hoped to likast out without being noticed.
The shower ovi likasts out of the rail.
She's sleeping as he cracks the ovi and likasts inside.
Locate a huge valkoinen ovi and enter inside.
We need to get to the valkoinen ovi.
Please paint the ovi valkoinen.
Even the road was completely valkoinen, when suddenly my front wheel likasted away.
I then likasted through what seemed like the same tunnel and I arrived in a tunnel-like space with a soft valkoinen and somewhat golden or shimmering light.2
Naturally, it gets more difficult the more unfamiliar words are present in any sentence — but not impossible, even when all the words are unfamiliar. Michal Ryszard Wojcik —and his followers— successfully learned a language without recourse to other languages in his Norwegian experiment. Though initially no sentence makes any sense to us, though we feel that words enter one ear and exit undisturbed through the other, bit by bit we learn associations. This is how children, coming into the world knowing nothing and having no other choice (if only we could run around naked and babble all our lives! To the great admiration of our audience!), learn their first language, too. And, indeed, that's how LLMs learn language.
There's an important caveat to make here. While the only verbal context Wojcik had for any Norwegian word was other Norwegian words, he also had non-verbal context to inform him. Children, too, learn not just from verbal context but from demonstration, from relating words to immediate phenomena. While I imagine that theoretically a human could ‘learn the meaning of words’ without reference to non-verbal context, I think a lack of motivation to do so would make it practically impossible.3 LLMs, however, need no motivation in order to learn, and since they neither sleep, eat nor fart, and have patience for as long as their hunger for electricity is satisfied, they can go so far as to learn the relationships between all the words out there without any non-verbal context. Still, while this capacity of LLMs makes possible what would be difficult or impossible for a human, the lack of non-verbal context means that their ‘knowledge of language’ is not quite equivalent to that of a human, and in a sense not ‘knowledge’ at all. We'll return to this crucial point later.
Word Vectors
When we meet a new word for the first time, we get a sense of what it means and what it doesn't mean. We might make a guess with some uncertainty. The second encountered instance gives a different sense, and the third yet another. By combining the three senses, Venn-diagram-like, we arrive at a definite meaning, further corroborated (or refuted!) by future encounters with the word: if it walks like a duck, quacks like a duck and eats like a duck, it's probably a duck.4
For this to work we need memory; otherwise each instance is like our very first. Still, we don't need to remember the exact past sentences, merely to retain a notion of the sense of the new word. Depending on how frequent the encounters are, how difficult it is to fathom the meaning given the context, and so on, we might need more (or indeed fewer) than three instances in order to arrive at the meaning, but the principle is the same: each instance leaves us with a modified sense, however vague, of the word.
Suppose the new word has only ever had verbal context, such that the vague notion we retain of it depends purely on other words. Strictly speaking this is impossible, but let us pretend for a moment.
Let's imagine that the other words in the verbal context are new, too. Now the vague notion of the word depends purely on other words of which we have only a vague notion. That would be the case with a sentence in a foreign language. We wouldn't get very far reading such sentences. If we opened a book full of them, even if we could make ourselves ‘read it’ without boring ourselves to bits, by the end of the first page we would not have even the vaguest notion, had somebody asked us, of what words we had encountered at all —beside a few that had caught our attention by their suggestive morphology— to say nothing of their contexts. Mere humans that we are.
However, Large Language Models, which neither forget over time nor let their attention waver, can do exactly that. How do we let machines retain a vague notion based on other vague notions and thus learn the relationship between one word and others? The short answer is that the LLM's word embedder encodes these relationships, of each word (or part thereof), into a vector. These ‘relationships’ are any commonalities, and lack thereof, between one word and another. Butter, (good) bananas and some cars are yellow; butter and bananas are soft and might be kept in the fridge, cars are not. These are relationships. LLMs do not experience ‘yellow’ (nor, indeed, butter, bananas or cars), but to the extent these commonalities are reflected in the texts on which they are trained, they can learn them.
The LLM's word embedder converts each word (or part of a word) into a vector. A vector is an array of numbers. Each position in the array corresponds to a dimension, so an array of length 100 would correspond to a point in a 100-dimensional (vector) space. This could represent real space, as in a two-dimensional vector that holds the (x, y) coordinates of grid paper or the latitude and longitude coordinates of the earth's surface, or a three-dimensional vector that represents the space of a room. A vector might also represent more abstract entities, such as the three-dimensional RGB coordinates of a computer pixel's colour, where (0, 0, 0), (1, 1, 1), (0.5, 0.5, 0.5), (1, 0, 0) and (0, 1, 1) represent black, white, gray, red and cyan respectively.
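To make this concrete, here is a minimal sketch in Python (using numpy; the colour values are the RGB coordinates from the example above):

```python
import numpy as np

# Vectors are just arrays of numbers; each position is one dimension.
# These are the RGB colour coordinates mentioned above.
black = np.array([0.0, 0.0, 0.0])
white = np.array([1.0, 1.0, 1.0])
gray  = np.array([0.5, 0.5, 0.5])
red   = np.array([1.0, 0.0, 0.0])
cyan  = np.array([0.0, 1.0, 1.0])

# Points in this three-dimensional space have distances, like points in a room:
print(np.linalg.norm(white - gray))  # ~0.87: gray is equidistant from white and black
print(np.linalg.norm(red - cyan))    # ~1.73: red and cyan are far apart
```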
If we have an image classifier for dog, cat and firetruck pictures, it might yield (0.81, 0.17, 0.02) for a photo of a cat, (0.08, 0.85, 0.07) for a photo of a dog and (0.0, 0.01, 0.99) for a photo of a firetruck, based on each image's features. These are vectors in a three-dimensional cat-dog-firetruck image vector space, though in the end, as far as the model is concerned, any picture is either of a cat OR a dog OR a firetruck, never some mix, and its output verdict is based on the highest value: a cat, a dog, a firetruck, respectively.
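A small illustration of that ‘highest value wins’ verdict, using the example vectors above (the classifier itself is left out; we only mimic its outputs):

```python
import numpy as np

labels = ["cat", "dog", "firetruck"]

# Mimicked classifier outputs, taken from the example above:
outputs = [
    np.array([0.81, 0.17, 0.02]),  # photo of a cat
    np.array([0.08, 0.85, 0.07]),  # photo of a dog
    np.array([0.00, 0.01, 0.99]),  # photo of a firetruck
]

# The verdict is simply the dimension holding the highest value:
for vector in outputs:
    print(labels[int(np.argmax(vector))])  # cat, dog, firetruck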
The word embedder takes in words instead of images, and likewise outputs a vector. If our language had, say, 10 million words, the embedder could theoretically have output a 10-million-dimensional vector, with each dimension representing one word, as we had with the cat-dog-firetruck classifier —valued 1 if it's the corresponding word, 0 if it isn't— a scheme called ‘one-hot encoding’. There are various practical reasons why this would not do. For the current discussion it suffices to say that because we want the vectors to represent relationships, we want a much smaller vector space into which the words would cram themselves, so that each word's vector would ‘overlap’ with those of others. Every word has something of other words in it.
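Here is a toy sketch of one-hot encoding, with a made-up five-word vocabulary standing in for the 10 million:

```python
import numpy as np

# A made-up five-word vocabulary; each word 'owns' one dimension.
vocabulary = ["the", "quick", "brown", "fox", "dog"]

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

print(one_hot("fox"))  # [0. 0. 0. 1. 0.]

# Any two distinct one-hot vectors never overlap and are equally far apart:
# the encoding carries no information about inter-word relationships.
print(one_hot("fox") @ one_hot("dog"))  # 0.0, as for any pair of words
```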
To illustrate how this cramming allows vectors to capture inter-word relationships, consider the following example. Imagine there are 100 items arranged in a 10 by 10 grid. One way to represent these items as vectors is with one-hot encoding. If there were 4 items, one would be (1,0,0,0), the second (0,1,0,0), the third (0,0,1,0) and the fourth (0,0,0,1). With such a representation, each item's vector is independent of all others, and we have no information about any relationships. It should be said that the order of dimensions must be consistent, but its initial choice is arbitrary; that is, the first item could have been encoded as (0,0,0,1) just as well. This means that the ‘33rd item’ and the ‘34th item’ might be as related to each other as you are to the person with the nearest ID or bank account number — more likely than not, not at all.
Alternatively, we could encode each item as a two-dimensional vector, according to its row and column numbers. Now the vectors catch some relationships: we know that item (4,8) is on the same row as (4,2) and in the same column as (1,8); using a distance metric we could calculate that (2,3) is closer to (1,1) than (5,7) is, and so on.
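A minimal sketch of that calculation, with Euclidean distance as the metric:

```python
import numpy as np

# Items encoded by their (row, column) coordinates:
a, b, c = np.array([2, 3]), np.array([1, 1]), np.array([5, 7])

# (2,3) is indeed closer to (1,1) than (5,7) is:
print(np.linalg.norm(a - b))  # ~2.24
print(np.linalg.norm(c - b))  # ~7.21
```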
Note that with a further reduction, down to a one-dimensional vector, we again lose relational information. We could draw a line that passes through all 100 items, encoding as (1) the item the line meets first and as (100) the last. With this encoding alone, we would not know whether (5) is closer to (6) or to (81) unless we could take a look at the grid and the curve, which makes the encoding unnecessary.5
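To see the loss, here is a sketch that assumes the line simply scans the grid row by row (any other curve would serve the argument equally well):

```python
import numpy as np

# One-dimensional labels 1..100, assigned by scanning the 10x10 grid row by row.
def grid_position(label):
    index = label - 1
    return np.array([index // 10, index % 10])  # recover (row, column)

# Adjacent labels can be far apart on the grid (end of one row, start of the next)...
print(np.linalg.norm(grid_position(10) - grid_position(11)))  # ~9.06
# ...while labels 10 apart can be grid neighbours (same column, adjacent rows):
print(np.linalg.norm(grid_position(1) - grid_position(11)))   # 1.0
# The label alone cannot tell these cases apart; we needed the grid to do so.
```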
With this in mind, let us return to words. Whereas the two-dimensional encoding of the items gave us information about spatial relationships, what we want from our word embedding (the encoding of words into vectors) is to inform us about semantic relationships, i.e. relationships based on the meaning of words. The grid items' encoding was based on row and column location; LLMs' word embedding is based on words' equivalence, that is, their degree of interchangeability. Given a text and a word in it, how plausible is it to replace it with a different word? The more plausible it is to exchange two words across more different textual contexts, the more similar they are, i.e. the closer they come to having the same meaning. While in common speech ‘plausibility’ is a matter of reasoning, here I refer to the two words' actual presence in identical or similar contexts (within the texts trained on).
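In vector terms, ‘degree of interchangeability’ becomes geometric closeness. A sketch with miniature embeddings (the values here are invented purely for illustration; real models learn hundreds or thousands of dimensions from text):

```python
import numpy as np

# Invented 4-dimensional 'embeddings'; real ones are learned from text.
embeddings = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.2]),
    "cat":   np.array([0.8, 0.9, 0.1, 0.1]),
    "house": np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine_similarity(u, v):
    # 1.0 for identical directions, 0.0 for unrelated ones.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words that swap plausibly into the same contexts end up close together:
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))    # ~0.99
print(cosine_similarity(embeddings["dog"], embeddings["house"]))  # ~0.33
```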
Let's take the sentence ‘The quick brown fox jumps over the lazy dog.’ One broad category (and thus relationship) of words is part of speech. We could exchange any of the adjectives with other adjectives (and even participles) and get a valid sentence, for example, ‘The thoughtful English fox jumps over the moving dog.’ In this sense adjectives are similar to each other in a way that nouns aren't; replacing ‘quick’ with a noun, say ‘speed,’ gets us ‘The speed brown fox.’ You might notice that the context gives the word a new meaning: are speed brown foxes a subspecies of foxes? Regardless, again, what matters here is how likely the exchange is — not whether our imagination allows for such a substitution, but whether ‘speed brown foxes’ (who jump over lazy dogs) have in fact occurred in the corpora (the collections of texts) the model was trained on. So while ‘The fox jumps over the lazy house’ and ‘I applied a fresh coat of English paint on my wall’ are valid sentences, they are rare or non-existent (i.e. implausible). ‘Dog’ and ‘house’ are somewhat similar (‘I bought a new house’ and ‘I bought a new dog’ are plausible sentences), but they are somewhat different: seldom does a creature jump over a house, and houses are almost never lazy. A paint can be English, but usually it would be of a colour (blue), a material (acrylic, oil) or an application (car, wall, wood).
Note that with the substitution we are not concerned with the meaning of the sentence. ‘The fox climbed over the fence’ and ‘the cat climbed over the fence’ do not mean the same thing, but both are ‘likely’ sentences. It is the contexts that both appear in that capture their similarity, and the contexts they are unique to which distinguish them. Both cats and foxes run, prowl, scratch their heads, perk their ears, hunt birds; but only cats purr, hiss, use cat litter, are bewitched by catnip, rule the internet, and only foxes growl, dive head first hunting in the snow, were hunted by 18th-century gentlemen, go in and out of woods.
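A toy sketch of this ‘shared versus unique contexts’ idea, where a word's context is crudely taken to be its sentence with the word blanked out (real models, of course, learn from vastly more text and subtler signals):

```python
# Four invented sentences standing in for a training corpus.
sentences = [
    "the fox climbed over the fence",
    "the cat climbed over the fence",
    "the cat purred on the sofa",
    "the fox growled in the woods",
]

def contexts(word):
    # A word's 'context' here: its sentences with the word blanked out.
    return {s.replace(word, "_") for s in sentences if word in s.split()}

# The shared context captures the similarity of 'fox' and 'cat'...
print(contexts("fox") & contexts("cat"))  # {'the _ climbed over the fence'}
# ...while the contexts unique to each distinguish them:
print(contexts("cat") - contexts("fox"))  # {'the _ purred on the sofa'}
```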
Also note that while we dealt here with a single sentence, LLMs are trained to work on much longer chunks of text. The length thereof is called the model's ‘context window.’ A reasonable substitution of a word in a sentence might become unlikely if we take a broader context. The ‘rain’ in ‘He reached his home drenched through and through by rain’ could be replaced by ‘sweat,’ but the substitution would be unlikely if the preceding sentences were ‘When John parked the car at the corner of the street, the overcast skies rumbled ominously. The cold wind pierced through his light garments.’
Possibly co-authored with another. I was unfortunately unable to find the paper that presented his ideas theoretically, but for experimental results see, for example: Werner, Heinz; Kaplan, Edith (1950). Development of Word Meaning Through Verbal Context: An Experimental Study. The Journal of Psychology, 29(2), 251–257. doi:10.1080/00223980.1950.9916029.
On the first list, talo replaced house, also when used in a figurative sense; on the second, aave replaced ghost; on the third, valkoinen, ovi and likast replaced white, door and slip respectively. The words are Finnish, except likast, which I used instead of ‘liukastua’ to make it a bit more ‘English’ phonologically.
It's also questionable whether a human could theoretically perceive language at all without a context. Unless they were deprived of all senses there would be a context, and if they were, how would they perceive language? Regardless, Helen Keller (1880-1968), who had been deaf and blind from the age of 19 months, finished a university degree, wrote numerous books and gave lectures, demonstrating that even an impoverished non-verbal context —consisting of touch, proprioception, olfaction and taste only— can suffice for learning a language. Still, she was an extraordinary person.
This is a special case of how we treat any word. Whenever we encounter a ‘this’, ‘that’, ‘they’, a euphemistic articulation or other oblique reference, we also make a guess at what the words signify, and so is the case with every word out there, except that unless the speaker aims to be cryptic, it's not ‘guessing’ but a matter of simple deciphering. Which table is meant by ‘the table’ differs between contexts, for example.
There's a technical reason why such an encoding would not work well with a neural network, but that's beyond the scope of this text.