## Saturday, June 24, 2023

### The Model behind LLM and Transformers

We want to get to a point where we have a probability distribution over a sequence of tokens. But what are these tokens? They are the words (or pieces of words) in a sentence, converted to numbers. Each token is an integer ID that indexes into an array, and the entry at that index is a vector of numbers that captures the word and some of its context. These vectors are not made of 64-bit floats; they are more like 16-bit or 8-bit floats. A classic example of word (embedding) context is King - Man + Woman ≈ Queen. This is a legit operation in this space. So we take a sentence, convert it into tokens, and create a vector for each token; taken as a whole, these are called word embeddings. This seems like preliminary stuff, but it is not. It is so important that if we get this wrong, the whole generative AI stuff falls on its face.
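
The embedding arithmetic is usually written as king - man + woman ≈ queen. Here is a minimal sketch of that idea. The three-dimensional vectors are made up by hand for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and `analogy` is a hypothetical helper, not a library call.

```python
import math

# Hand-made toy embeddings: (royalty, maleness, femaleness).
# These values are assumptions for illustration, not learned weights.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "taco":  [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

result = analogy("king", "man", "woman")
print(result)
```

With these toy vectors the nearest remaining word is "queen", but only because I built the vectors that way. In a real model the result depends entirely on what was learned during training.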

The engine of a generative AI car is the transformer. The main parts of the original transformer are the encoder and the decoder. The encoder is where you say “Yo quiero Taco Bell” and the decoder translates it to “I want Taco Bell”. As an aside, tokenizing “Taco Bell” as two words will result in a vastly different generation than tokenizing it as “Tacobell”. But to make this generative, i.e. to change its task model, they got rid of the encoder and used stacks of decoders. After all, given a prompt, we need to generate an essay where every word is picked from a probability distribution. A stack of decoders suits this better than an encoder/decoder architecture. The transformer architecture has two main components: the first is positional encoding and the second is multi-head attention.
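
To make "every word is picked from a probability distribution" concrete, here is a rough sketch of the last step of a decoder: turning raw scores into probabilities with a softmax and sampling one token. The vocabulary and the scores are invented for illustration; a real model produces one score per entry in a vocabulary of tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(vocab, logits, rng):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores after the prompt "I love Tacobell".
vocab = ["around", "the", "corner", "tacos"]
logits = [2.0, 0.5, 0.1, 1.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
probs = softmax(logits)
next_token = sample_next(vocab, logits, rng)
print(probs, next_token)
```

Generation repeats this step in a loop: the sampled token is appended to the prompt and the model is run again to score the next position.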

If we start with positional encoding, we are looking at a matrix where each row corresponds to a word position in the prompt. This matrix is built with centuries-old math (Fourier series), which showed us that a summation of sine and cosine functions with varying frequencies can carry all the information in an input signal. Here, each position in the prompt gets its own row of sine and cosine values at different frequencies, and that row is added to the word embedding at that position so the model can tell positions apart. So far we started with word embeddings and now we have a matrix with positional embeddings. But why? Well, the output of each time-step is fed back into the decoder. E.g. in “I love Tacobell *around the corner*”, where the italics mean the words were generated in previous steps, the model will give more emphasis to the words “the corner” than to the first few words. This is the part about attention. That brings us to the next big component, called Multi-Head Attention (MHA).
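
Here is a small sketch of how such a positional matrix can be built. It follows the sinusoidal scheme from the original transformer paper, where even columns use sine, odd columns use cosine, and the frequency drops as the column index grows. The function name and the tiny sizes are my own choices for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model matrix of sinusoidal position codes."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share one frequency.
            freq = 1.0 / (10000 ** ((2 * (i // 2)) / d_model))
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Five positions ("I love Tacobell around the"), eight dimensions per token.
pe = positional_encoding(seq_len=5, d_model=8)

# Stand-in word embeddings (all zeros here); the position row is simply added.
embeddings = [[0.0] * 8 for _ in range(5)]
inputs = [[e + p for e, p in zip(emb_row, pe_row)]
          for emb_row, pe_row in zip(embeddings, pe)]
```

Every row is different, so two copies of the same word at different positions end up with different input vectors.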

To understand MHA, we need to understand the concept of query (Q), key (K) and value (V). The query is the question asked by the transformer to find the correct next word in the sentence. In our case, that would be “around”. So the query is asking: is “around” the best fit given the input sequence, i.e. the keys? The values are derived from the input sequence’s embeddings that we calculated earlier. At the end of the day, we are using matrix multiplication, specifically the dot product, to gauge the relative fit of a word given the input sentence. As we generate more words, we want the generated words to become part of the input when finding the next best match. A dot product is literally telling us how closely the next word (my query) matches each word of my current sentence (the keys): the larger the dot product, the better the fit.
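
A minimal sketch of a single attention head, assuming the standard scaled dot-product form: score each key against the query with a dot product, scale by the square root of the key size, softmax the scores into weights, and take the weighted sum of the values. The vectors are toy numbers, and the learned projection matrices and the causal mask of a real decoder are left out.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One similarity score per key: bigger dot product, better fit.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy keys/values for three input words and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0], outputs[0])
```

Multi-head attention runs several of these heads side by side, each with its own projections of Q, K and V, and concatenates the results.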

Note that till now we have not really used a neural network. We need one because so far almost all we have done is linear transforms (matrix multiplication). We need something nonlinear to activate the neurons, or else what’s the point of calling this an NN? That part is the feed-forward neural network, which receives the output (a matrix) that we calculated. And because we use multiple heads of attention, all this processing runs in parallel, which lets us accelerate it. This was the original intent of transformers, i.e. to accelerate translation using GPUs. If you are a student of history, this is kind of like the old mechanical engineering techniques used to perform switching and routing. They worked, but eventually they were replaced by electronics. I think AI is lacking that paradigm shift. It hasn’t yet found its transistor.
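
To see why the nonlinearity matters, here is a toy sketch. Two matrix multiplications back to back collapse into a single linear map, while putting a ReLU between them lets the same weights compute something no single matrix can. The weights are hand-picked for illustration, and a real feed-forward block also has biases and much wider layers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu(vector):
    """Keep positive entries, zero out negative ones."""
    return [max(0.0, v) for v in vector]

# Hand-picked weights (assumptions for the demo, not trained values).
W1 = [[1.0, -1.0],
      [-1.0, 1.0]]
W2 = [[1.0, 1.0]]

def linear_only(x):
    """Two linear layers with nothing in between."""
    return matvec(W2, matvec(W1, x))

def feed_forward(x):
    """The same two layers with a ReLU in the middle."""
    return matvec(W2, relu(matvec(W1, x)))

print(linear_only([3.0, 1.0]), feed_forward([3.0, 1.0]))
print(linear_only([1.0, 3.0]), feed_forward([1.0, 3.0]))
```

Without the ReLU the two layers cancel out and always return zero. With it, the network returns the absolute difference of its two inputs, which is not a linear function.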

## Saturday, June 10, 2023

### Perplexity, Entropy: How to measure LLMs?

How can we measure the efficacy of a language model? Language model researchers use the term “perplexity” to measure how a language model performs on tasks on standard datasets. For a language model, a task means quizzing the model to complete a sentence, hold a Q&A, or generate an essay. GPT-3 scored well, in fact very well, on perplexity on standard benchmarks like Penn Treebank. Overall, though, its results across tasks were mediocre.

Perplexity of a model measures the “surprise” factor in the generation, or how many branches the model has to deal with when predicting the next word. A perplexity of 20 means that, given a few words, the model is effectively picking between 20 equally likely choices for the next word. If that number were 2, the model would have an easier task, but that is most likely because we overfit the model to a specific task. Without this context-specific training, the GPT-3 folks claim that the model is a few-shot learner, which means it needs a few examples in the prompt before it homes in on the task’s context. There are variants of transformer models, like BERT cased, which are trained for specific tasks like “complete the last word” or “fill in the blank” and perform much better at those specific tasks than GPT-3.
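
For reference, perplexity is the exponential of the average negative log-probability the model assigned to the words that actually came next. A small sketch, with made-up probabilities standing in for a real model's outputs:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the true next tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads its bets evenly over 20 words.
uniform_20 = perplexity([1 / 20] * 10)

# A model that gives the right word a 50% chance every time.
confident = perplexity([0.5, 0.5, 0.5, 0.5])

print(uniform_20, confident)
```

The first case comes out at 20 and the second at 2 (up to floating-point rounding), which matches the "number of branches" reading. The exponent is the cross-entropy in nats, which is where the entropy in the heading comes in.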

As we are building LLMs to perform tasks like translation, generation, completion etc., should we overfit a model to a specific task, or leave it as a generic model and provide in-context training via prompting? With GPT-3, it seems prompting is the chosen route to get the model to home in. And what about the model size? Does it make sense to overfit a model with hundreds of billions of parameters to do a single task (like translation), or leave it as a few-shot performer, i.e. mediocre? These are the tradeoffs that one has to make when building a new LLM. Left generic, larger models are mediocre at best.

### Costs in Training LLMs

I went through the Llama-2 white paper that Meta released with the model. I was hoping to learn some special technique they may be ... 