Sunday, May 28, 2023

Language Models

A language model (LM) is a probability distribution over sequences of tokens. The tokens are words in a vocabulary, so a sequence is a phrase. Each phrase fragment gets a probability, and the probability is higher for fragments that are good phrases - good meaning grammatically correct and semantically plausible. "Good" depends heavily on extra information that is not in the training set. For example, the sentence “I saw the Golden Gate Bridge flying into San Francisco” needs contextual knowledge such as bridges being stationary and “I” referring to a human in the plane. This means the LM needs to determine the semantic plausibility of a phrase fragment, which is what makes LMs deceptively simple and easy to get wrong.

Mathematically, the phrase fragment is called a sequence of tokens, and the tokens are drawn from a set V, the vocabulary of the language. A probability distribution over four-token sequences assigns different probabilities to the different orderings of those four tokens; some orderings are semantically implausible and others are syntactically incorrect. So far we haven't done any generation, but that is also possible with an LM: given a sequence of, say, four tokens, we can pull five-token sequences and judge their “goodness”. The judgement is based on some parameter, a level ranging from poor to best. To continue the sentence above, “I saw .. sf on saturday” and “I saw .. sf yesterday” are both good sentences, so we need a new parameter to pick one over the other. That parameter is the randomness of the search. If we are strict, we pick the highest-probability next word at each step; if we are lenient, we randomize the pick. This conditional generation is determined by the prefix sequence, in our case “I saw .. sf”, and the completion will be “yesterday”. This is the big problem with LMs: lenient generation will create absurd sentences and even total fabrications, while strict generation will create templates. Add to that a heavy reliance on the prefix sequence, or “prompt”. The parameter is called temperature; it ranges from lenient to strict and controls the variability of the generation. So we now have the mathematical vocabulary of generation: a prompt, a completion and a temperature.
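
To make the temperature knob concrete, here is a minimal sketch in Python. The toy distribution and its numbers are made up purely for illustration; the point is only how temperature moves the pick from strict (greedy) to lenient (random).

```python
import math
import random

def sample_next_token(logprobs, temperature=1.0):
    """Sample the next token from a distribution over candidate tokens.

    `logprobs` maps each candidate token to its log-probability under the
    model. Low temperature -> strict, near-greedy picks; high temperature ->
    lenient, more random picks.
    """
    if temperature == 0:
        # Strict: always take the single most probable token.
        return max(logprobs, key=logprobs.get)
    # Rescale log-probabilities by temperature, then renormalize.
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    z = sum(math.exp(lp) for lp in scaled.values())
    probs = {tok: math.exp(lp) / z for tok, lp in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Made-up candidates for completing "I saw .. sf ..."
candidates = {"yesterday": math.log(0.55),
              "on": math.log(0.35),
              "backwards": math.log(0.10)}

print(sample_next_token(candidates, temperature=0))    # always "yesterday"
print(sample_next_token(candidates, temperature=1.5))  # varies run to run
```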

A slight change to the prompt will generate an entirely different sentence. A longer prompt also changes the generation, and foreign words in the prompt, perfectly normal in spoken English, can influence the generation in unpredictable ways. The number of tokens in the prompt is a big determinant of the computational complexity of LMs. A generation that uses all prior tokens, such as an RNN (Recurrent Neural Network), takes a very long time to train. A transformer, which sets attention to a few prior tokens and exploits the parallelism of GPUs, limits the accuracy of the generation, but that parallelism is what enables large LMs - large as in a trillion parameters. What was observed in late 2022 was that the larger the model got, the more the fixed-length generation, coupled with temperature, produced sentences that never existed before. This is called emergent behavior - the beginning of “learning”. Repeated prompts on the same topic teach the LM to hone in on the context, which makes the completions sound more like spoken language and less like an automaton. This observation of emergent behavior without having to change the model (no new gradient descents) is what is causing most of the hype around Generative AI. As the model hones in on a context, the generation feels conversational, as if it could pass a Turing test.
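
A rough sketch of that contrast, under my own toy assumptions: an RNN has to walk the prefix one token at a time, while an attention step that only looks at a small window of prior tokens can be computed for every position independently, which is what GPUs exploit. This is not a real transformer layer (no learned projections), just the shape of the idea.

```python
import numpy as np

def rnn_sequential_steps(n_tokens):
    # An RNN consumes the prefix strictly one token at a time, so processing
    # n tokens needs n sequential steps that cannot be parallelized.
    return n_tokens

def windowed_attention(x, window=4):
    """Toy causal attention where each position only attends to the previous
    `window` tokens. Each row is computed independently of the others, which
    is what makes the computation GPU-friendly."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):                       # each position is independent
        lo = max(0, i - window + 1)
        ctx = x[lo:i + 1]                    # only a few prior tokens
        scores = ctx @ x[i] / np.sqrt(d)     # similarity to current token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ ctx               # weighted mix of the window
    return out

tokens = np.random.randn(10, 8)              # 10 tokens, 8-dim embeddings
print(rnn_sequential_steps(10), windowed_attention(tokens, window=4).shape)
```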

The biggest challenge for Gen AI - as an industry - is not to create larger and larger models, but to slice and dice an existing LLM into a size that can be deployed at scale. We can see the sizes of large models in this paper, reproduced below. It is rumored that GPT-4 is over 1 trillion parameters as well. Training costs around $28K per 1 billion parameters, but that is a one-time cost. The continual cost is on the sliced/diced version - perhaps running on a phone - which needs to cost no more than a penny per 1000 generations.
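
The back-of-the-envelope arithmetic, using only the rough and rumored figures quoted above:

```python
# Rough figures from this post (rumored/approximate, for illustration only).
COST_PER_BILLION_PARAMS = 28_000           # ~$28K per 1B parameters, one-time
rumored_gpt4_params_billions = 1_000       # "over 1 trillion" = 1000+ billion

one_time_training_cost = COST_PER_BILLION_PARAMS * rumored_gpt4_params_billions
print(f"one-time training: ~${one_time_training_cost:,}")       # ~$28,000,000

# The continual budget for the sliced/diced, on-device version:
target_cost_per_generation = 0.01 / 1000   # no more than a penny per 1000
print(f"serving budget: ${target_cost_per_generation:.5f} per generation")
```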

The models also need to be trained on proprietary and current data for them to generate economic value beyond research.

Saturday, May 23, 2020

Virus and AI Reset

There are two kinds of AI: the classifier and the recognizer. Both are affected by this pandemic. First, the classifier - which is just statistical analysis to categorize scenes, events etc. and issue predictions - needs movement among its subjects, i.e. a continuous data feed. With everybody at home across the globe, that feed is quite stale and boring. Those annoying ads that were being served, which sometimes got you to think you were being observed, are gone, aren't they? Second, the recognizer is the AI technique with a multi-billion dollar industry behind it. Its biggest customers are China and other governments. That industry is facing the same problem as the small biz during the shutdown. It had trouble recognizing faces with dark skin, and now it is completely stumped recognizing faces with masks. But most of all, AI has no answer to these multiple models which are driving multi-trillion dollar policy decisions. I was a fan of AI in grad school, saw it find some traction in expert systems and then disappear. I believe we might see that again. This time, I won't be mourning it.

Wednesday, January 01, 2020

Algos and Data Structures

From the title of this blog, one might wonder if I have something to say along the lines of Niklaus Wirth's famous book Algorithms + Data Structures = Programs. Nope, it is about how these compute elements control the digital highways where we arguably spend more time than on physical interstate highways.

Imagine that on a physical highway you see a sign which says "All Indians exit here", or that you drive down a street in the restaurant district and only Indian restaurants appear open. It would be illegal, wouldn't it? But that is exactly what happens on the digital highways and streets. It does not matter how hard I try to search on popular media platforms, I keep being recommended stuff based on a single factor in my profile: nationality, as deduced from my name.

AI and ML are broad categories of algos that use data (mostly acquired using shady methods) to influence the recommendation engines used on the web. Calling them intelligent is a mockery of intelligence. They are codified stereotypes, biases and directed suggestions. Those kinds of algos are good for recognizing something that does not change every day, like your voice and your face, but not for your likes, curiosities and spirit of discovery.

We need these platforms to use data structures that allow end users to load their profile with attached contracts that are enforced. There is no reason why these algos should be allowed to remain opaque. The way they are right now, they are in violation of all the data privacy laws that come into effect in 2020.
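
For what it's worth, here is a hypothetical sketch of what a profile with an attached, enforceable contract could look like. The names and fields are mine, not any platform's API; the point is that the check has to be enforceable and auditable before a recommendation engine touches the data.

```python
from dataclasses import dataclass, field

@dataclass
class ProfileContract:
    # Terms the end user attaches to their own profile data.
    allow_recommendations_on: set = field(default_factory=set)
    deny_recommendations_on: set = field(default_factory=set)

@dataclass
class UserProfile:
    attributes: dict
    contract: ProfileContract

def may_recommend_on(profile: UserProfile, factor: str) -> bool:
    """Platform-side check that must pass before `factor` can be used."""
    c = profile.contract
    return (factor in c.allow_recommendations_on
            and factor not in c.deny_recommendations_on)

me = UserProfile(
    attributes={"nationality": "deduced from my name"},
    contract=ProfileContract(allow_recommendations_on={"stated_interests"},
                             deny_recommendations_on={"nationality"}),
)
print(may_recommend_on(me, "nationality"))   # False: the contract forbids it
```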

Happy New Year!

Saturday, July 20, 2019

Fed should embrace crypto... and extend

I have a mobile passport in my wallet. It is the digital version of my physical passport. I also have physical dollars in my wallet, but I don't have a digital version of the same. One would wonder if credit cards are the digital version of the physical dollars in my wallet. They are not. They are just extensions of my personal balance sheet. Their use creates an entry on the wrong side of my balance sheet - a hole from which I have to dig my way out. They are also not digital versions because if I lose my wallet, I lose nothing. The physical dollars in my wallet are lost, but not the credit represented by the cards.

What if I had a digital version of the dollars that I have? Besides the benefit of limiting the damaging effects of climate change by not killing more trees, it could have a paradigm-shifting effect on the world of finance. The interest on my deposits in a bank could vary depending upon the balance, my personal situation and any intent that I have declared on that digital currency. This simple-sounding functionality could give the Fed (Federal Reserve) tools to control money at the atomic level. Instead of medieval-style messing around with the funds rate, they could punish accumulation of dollars where they don't want it and reward flows which they do want. I could declare intent on my digital dollar and have a smart contract enforce it (read estate disruption). The IRS could tag dollars at source as tax revenue (read disruption of USTR). Employers could tag dollars as retirement dollars which, even if they sit in my digital wallet, would earn as if in a 401k account. The possibilities are endless. Finally, Silicon Valley can imagine again instead of the widespread re-imagination that is rampant right now. Really, I am surprised science fiction writers haven't written books on this.
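
A toy sketch of the idea, with names and rates invented purely for illustration: a dollar that carries a declared intent, a spend check that enforces it, and interest that varies by intent.

```python
from dataclasses import dataclass

@dataclass
class DigitalDollar:
    amount: float
    intent: str = "unrestricted"   # e.g. "retirement", "tax_withheld", "estate"

def spend(coin: DigitalDollar, purpose: str) -> bool:
    """Toy 'smart contract' check: the coin only moves if the spend matches
    the intent declared on it at issuance."""
    return coin.intent == "unrestricted" or coin.intent == purpose

def annual_interest(coin: DigitalDollar, base_rate: float) -> float:
    # The Fed could pay (or charge) different rates per declared intent,
    # steering flows at the level of individual dollars.
    bonus = {"retirement": 0.02, "unrestricted": 0.0}.get(coin.intent, 0.0)
    return coin.amount * (base_rate + bonus)

paycheck = DigitalDollar(100.0, intent="retirement")
print(spend(paycheck, "groceries"))       # False: the contract blocks it
print(annual_interest(paycheck, 0.03))    # earns as if in a 401k account
```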

If we had a digital dollar, what would it be based on? Today we have fiat currency (Fed In Agreement with Treasury, i.e. FIAT). The answer is crypto. The Fed can own the crypto methods under the USD. If other central banks agree, the currency crosses would be calculated every time they are used (read FX disruption). Blockchain is the bottom of this technology stack; the rest of the stack is not even funded yet. The core of all innovation is embrace and extend, and I think the Fed should do exactly that: embrace this technology and extend it for the benefit of the world of finance.

Friday, May 10, 2019

Data is the new Narcotic

About six months ago, I tweeted saying data is not the new oil, it is the new narcotic. A key property of any narcotic is that you need it periodically, and with time more of it, to achieve onset. Data has the same effect on any cloud-based digital platform. Let me explain.

Let's say you are like me - frugal - and shop at Big Lots. The data at a point in time would be the receipt issued to you, which has the item's SKU, your loyalty number (for identification) and, most importantly, the location and time of the purchase. If the store shared this data with a digital platform, that receipt would be the first hit of the drug for that platform. The platform would need this data periodically, i.e. your receipt at your next purchase. Why? Because the analytics is done on a time series: insights like time between purchases, and second-order insights like cough syrup in June meaning the spread of some respiratory infection in that zipcode, or, if that fails, crossing that data with doctor's office visits to arrive at a probability of chronic upper respiratory disease or just smoking. All of this common-sense analytics can be done by a computer in seconds, but it needs data to start and it needs to be fed periodically to increase its accuracy. And to arrive at second-order insights, it needs more of that data about you. In fact, the more of your activities, from body functions to daily habits, that are digitized and fed to this platform, the smarter it will get at predicting your intentions - to the point that it will predict (serve you an advert) before you have felt the need. Like when you wonder, how did it know that I need cough syrup?
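
Here is roughly what that first-order and second-order analytics looks like, on made-up receipts; a real platform obviously does this at a very different scale and with far more feeds.

```python
from datetime import date

# Toy receipts for one loyalty number: (purchase date, SKU category, zipcode).
receipts = [
    (date(2019, 4, 1),  "groceries",   "95035"),
    (date(2019, 4, 15), "groceries",   "95035"),
    (date(2019, 4, 29), "groceries",   "95035"),
    (date(2019, 6, 3),  "cough_syrup", "95035"),
]

# First-order insight: time between purchases.
dates = sorted(d for d, _, _ in receipts)
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
print("average days between purchases:", sum(gaps) / len(gaps))

# Second-order insight: out-of-season cough syrup in this zipcode is a weak
# signal of a respiratory issue; more feeds (doctor visits) would sharpen it.
june_syrup = [r for r in receipts if r[1] == "cough_syrup" and r[0].month == 6]
if june_syrup:
    print("flag zipcode", june_syrup[0][2], "for follow-up data")
```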

They are calling this type of learning, where repeated encounters with a data point increase the weight (probability) of that data point happening again on an underlying learning network, "AI". And data is the narcotic that this system needs regularly and in increasing quantities.

Sunday, March 31, 2019

Who killed Blockchain?

We have all been to a notary to complete a transaction or an application for important events in our lives. The general procedure involves signing and dating the document, followed by a signature and stamp from the notary. The notary also makes an entry in his or her ledger. The purpose of the ledger is to validate the notarization at some time in the future. Now, what if we had a technology that automagically places all the millions of entries across hundreds of thousands of notaries into an access-controlled ledger on the internet? That would enable notarization by any citizen or entity not in the bad books of society. It would enable cross-border notarization (read trade). Extend this use case further and you could apply this technology to validate your credentials, your deeds on your assets, your birth/death/marriage records. This technology would fundamentally change the way humans are organized today. Everybody agrees it would be great to have this technology flourish, but alas, I read its obituary every day.
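
A minimal sketch of such a shared notary ledger, assuming nothing fancier than a hash chain: each entry covers the previous one, so a later edit anywhere breaks validation.

```python
import hashlib
import json
from datetime import datetime, timezone

def add_entry(ledger, notary_id, document_digest):
    """Append a notarization record whose hash covers the previous entry,
    so tampering with any earlier record breaks the chain."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    record = {
        "notary": notary_id,
        "document": document_digest,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    ledger.append(record)
    return record

def validate(ledger):
    # Each entry must point at the hash of the one before it.
    return all(cur["prev_hash"] == prev["hash"]
               for prev, cur in zip(ledger, ledger[1:]))

ledger = []
add_entry(ledger, "notary-42", hashlib.sha256(b"deed of sale").hexdigest())
add_entry(ledger, "notary-7", hashlib.sha256(b"passport application").hexdigest())
print(validate(ledger))   # True until someone edits an earlier entry
```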

To understand why, I would invite you to watch the movie Who Killed the Electric Car? There is something in the movie for both the bulls and the bears of blockchain. The bulls will say, but wait.. EVs are not dead, so there is hope for blockchain. The bears will say, but those tactics (like filing patents and locking up competitive innovation) worked, as they delayed the EV for several decades. (My own first experience with an EV was Sun's Java car in 1999.) I think blockchain is going through one of those moments where the ones whose empires will fall (the "trusted intermediaries") first joined the party and then played party pooper.

Monday, December 24, 2018

Elephant in the Datacenter

I hear conversations at eateries discussing virtual wires to move packets between software endpoints. I believe the problem of moving packets or making a remote procedure call (across administrative domains) is already solved. What is not solved is the policy around those movements and invocations. We need to solve this problem if we are to tackle the elephant in the room. When I listen to end users of technology, they discuss their pressing problems, and those are phishing, robocalls, invasion of privacy and, at the extreme, fear of constant surveillance because of AI. The intrusion they are concerned about is not from rogue states but from rogue software that runs within our trusted perimeter. This is not going to be solved at the lower levels of the compute stack. It needs a Layer 8 security solution whose enforcement takes place at all seven (or five) layers below.

I spent the last year trying to understand the appeal of blockchain as a data structure, and now I am convinced that is how we should look at blockchain. It is a data structure, like a class in Java or a struct in C. We should use it when we want to store data with the access permissions of the owner of the data. When that happens, we will be uploading our photos and storing our documents with a checkbox that says private, and that will cause the platform's software to store the datum in a blockchain that is personal to the person who uploads it. It is like a vault in the bank, where not even the banker can access the contents without your private key.
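
An oversimplified sketch of that personal, owner-controlled blockchain; a real design would encrypt the contents with the owner's key rather than gate them behind an if-check, but the vault analogy is the same.

```python
from dataclasses import dataclass, field
from hashlib import sha256

@dataclass
class PrivateBlock:
    owner_key: str          # stand-in for the owner's private key / secret
    data: bytes             # the photo or document marked "private"
    prev_hash: str = ""

    def digest(self) -> str:
        return sha256(self.owner_key.encode() + self.data
                      + self.prev_hash.encode()).hexdigest()

@dataclass
class PersonalChain:
    owner_key: str
    blocks: list = field(default_factory=list)

    def upload(self, data: bytes):
        prev = self.blocks[-1].digest() if self.blocks else ""
        self.blocks.append(PrivateBlock(self.owner_key, data, prev))

    def read(self, presented_key: str):
        # Not even the platform operator can read the contents without
        # the owner's key - the bank-vault analogy from above.
        if presented_key != self.owner_key:
            raise PermissionError("owner's private key required")
        return [b.data for b in self.blocks]

vault = PersonalChain(owner_key="my-secret")
vault.upload(b"vacation-photo.jpg")
print(vault.read("my-secret"))        # works for the owner
# vault.read("platform-admin")        # raises PermissionError
```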

This massive misallocation of capital to copycat ideas only happened because of the ridiculously low cost of money. In a rising rate environment, hopefully we will see capital being allocated to issues for which society is waiting for resolution.

Happy Holidays
