We want to get to a point where we have a probability distribution over a sequence of tokens. But what are these tokens? They are the words (or pieces of words) in a sentence, converted to numbers: each token is an integer ID that indexes into a vocabulary. Each ID in turn maps to a vector of floats that captures the word and some of its context. These floats are usually not 64-bit; they are more like 16-bit or even 8-bit. A classic example of word (embedding) context is King - Man + Woman ≈ Queen. This is a legit operation in this space. So we take a sentence, convert it into tokens, and create a vector for each token; taken together, these vectors are called word embeddings. This seems like preliminary stuff, but it is not. It is so important that if we get it wrong, the whole generative AI stuff falls on its face.
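To make the token-to-vector step concrete, here is a minimal sketch. Everything in it is made up for illustration: the five-word vocabulary, the three hand-picked embedding dimensions (roughly "royalty", "maleness", "food-ness") and the helper names `embed` and `nearest`. Real models learn their embeddings and use hundreds or thousands of dimensions.

```python
import numpy as np

# Hypothetical toy vocabulary: each word maps to an integer token ID.
vocab = {"king": 0, "man": 1, "woman": 2, "queen": 3, "taco": 4}

# Hand-picked embeddings, stored in 16-bit floats. Row i belongs to token ID i.
embeddings = np.array([
    [0.9, 0.9, 0.0],  # king
    [0.1, 0.9, 0.0],  # man
    [0.1, 0.1, 0.0],  # woman
    [0.9, 0.1, 0.0],  # queen
    [0.0, 0.0, 1.0],  # taco
], dtype=np.float16)

def embed(word):
    """Look up a word's vector via its token ID (upcast to float32 for the math)."""
    return embeddings[vocab[word]].astype(np.float32)

def nearest(vector, exclude=()):
    """Return the vocabulary word with the highest cosine similarity to `vector`."""
    best_word, best_score = None, -np.inf
    for word in vocab:
        if word in exclude:
            continue
        candidate = embed(word)
        score = vector @ candidate / (np.linalg.norm(vector) * np.linalg.norm(candidate))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# The classic analogy: King - Man + Woman lands closest to Queen.
result = nearest(embed("king") - embed("man") + embed("woman"),
                 exclude=("king", "man", "woman"))
print(result)  # queen
```

The analogy works here only because the toy vectors were chosen to make it work; in a trained model the same arithmetic holds approximately, not exactly.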
The engine of a generative AI car is the transformer. The main parts of the original transformer are the encoder and the decoder. The encoder is where you say “Yo quiero tacobell” and the decoder translates it to “I love Tacobell” (a loose translation; literally it is “I want Taco Bell”). As an aside, tokenizing this as “I love Taco Bell” will result in a vastly different generation than tokenizing it as “I love Tacobell”. But to make this generative, i.e. to change the task from translation to generation, they got rid of the encoder and used a stack of decoders. After all, given a prompt, we need to generate an essay where every word is picked from a probability distribution. A stack of decoders suits this better than an encoder/decoder architecture. This transformer architecture has two main components: first the positional encoding, and second the multi-head attention.
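What "every word is picked from a probability distribution" might look like in code: the decoder stack ends with one score (a logit) per vocabulary entry, a softmax turns those scores into probabilities, and we sample from them. This is only a sketch; the six-word vocabulary, the logit values and the function name `sample_next_token` are assumptions for illustration, not output from a real model.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_next_token(logits, rng):
    """Sample one token ID from the distribution implied by `logits`."""
    probs = softmax(logits)
    token_id = int(rng.choice(len(probs), p=probs))
    return token_id, probs

# Hypothetical vocabulary and made-up scores for the word after "I love Tacobell".
vocab = ["I", "love", "Tacobell", "around", "the", "corner"]
logits = np.array([0.1, 0.2, 0.3, 2.5, 0.4, 0.2])

rng = np.random.default_rng(0)
token_id, probs = sample_next_token(logits, rng)
print(vocab[token_id], probs.round(3))
```

Because the next word is sampled rather than always taken as the top-scoring one, the same prompt can produce different essays on different runs.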
If we start with positional encoding, we are looking at a matrix where each row corresponds to a word position in the prompt. This matrix is built on math that goes back to the 18th and early 19th centuries, which showed us that a sum of sine and cosine functions with varying frequencies can carry all the information in a signal. The input signal here is the prompt, and to give each word position its own signature, we use sine and cosine functions of different frequencies to arrive at the rows of the matrix. This matrix is then added to the word embeddings. So we started with word embeddings and now have embeddings that also carry position. But why? Attention by itself has no notion of word order, and the output of each time-step is fed back into the decoder. E.g. in “I love Tacobell around the corner”, where “around the corner” was generated in previous steps, the model can give more emphasis to the words “the corner” than to the first few words. This is the part about attention. That brings us to the next big component, called Multi-Head Attention (MHA).
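Here is a rough sketch of how such a matrix can be built. It assumes the sinusoidal scheme from the original transformer paper, where even columns use sine, odd columns use cosine, and the frequency drops as the column index grows. The function name `positional_encoding`, the base of 10000 and the tiny sizes are illustrative choices, and the sketch assumes `d_model` is even.

```python
import numpy as np

def positional_encoding(seq_len, d_model, base=10000.0):
    """Build a (seq_len, d_model) matrix; row p is the signature of position p."""
    positions = np.arange(seq_len)[:, None]    # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even column indices, shape (1, d_model / 2)
    angles = positions / base ** (dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even columns: sine
    pe[:, 1::2] = np.cos(angles)  # odd columns: cosine
    return pe

# Six positions (one per word of "I love Tacobell around the corner"), 8 dimensions.
pe = positional_encoding(seq_len=6, d_model=8)

# Stand-in word embeddings (random here); the positional matrix is simply added.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(6, 8))
decoder_input = word_embeddings + pe
print(decoder_input.shape)  # (6, 8)
```

Every row of `pe` is different, so the same word at two different positions ends up with two different input vectors.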
To understand MHA, we need to understand the concepts of query (Q), key (K) and value (V). The query is the question asked by the transformer to find the correct next word in the sentence. In our case, that would be “around”. So the query is asking: is “around” the best fit given the input sequence, i.e. the keys? The values are derived from the input sequence’s embeddings that we calculated earlier. At the end of the day, we are using matrix multiplication, specifically the dot product, to gauge the relative fit of a word given the input sentence. As we generate more words, we want the generated words to become part of the input so we can find the next best match. A dot product is literally telling us how closely my query lines up with each word of my current sentence (the keys): the larger the dot product, the better the match.
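The Q/K/V mechanics can be sketched for a single attention head as follows. This assumes the standard scaled dot-product form, softmax(QKᵀ / √d) V, with a causal mask so that a position only looks at itself and earlier positions. The weight matrices are random stand-ins for learned parameters, and the names `causal_self_attention`, `w_q`, `w_k` and `w_v` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """One attention head over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # dot product of every query with every key
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)     # hide positions that come later
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights                  # weighted mix of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, 8-dimensional inputs
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)         # (4, 8)
print(weights.round(2))  # lower-triangular: no token attends to a later one
```

Multi-head attention runs several of these heads side by side, each with its own weight matrices, and concatenates their outputs.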
Note that till now we have not really used a neural network. We need one because so far almost everything we have done is a linear transform (matrix multiplication). We need something nonlinear to activate the neurons, or else what’s the point of calling this an NN? That part is the feed-forward neural network, which takes the output (a matrix) that we just calculated. And because we use multiple heads of attention, all this processing runs in parallel, which lets us accelerate it. This was the original intent of transformers, i.e. to accelerate translation using GPUs. If you are a student of history, this is kind of like the old mechanical engineering techniques for switching and routing. They worked but were eventually replaced by electronics. I think AI is lacking that paradigm shift. It hasn’t yet found its transistor.