tag:blogger.com,1999:blog-67433302024-03-19T01:48:18.972-07:00Network of ThingsBlog to discuss the digitization and network of everything that has a digital heartbeatUnknownnoreply@blogger.comBlogger102125tag:blogger.com,1999:blog-6743330.post-83017579977059029572023-07-22T08:47:00.009-07:002023-07-25T06:52:52.656-07:00Costs in Training LLMs<p> I went through the <a href="https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/10000000_662098952474184_2584067087619170692_n.pdf?_nc_cat=105&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=RYfzDCymkuYAX-mk8g9&_nc_ht=scontent-sjc3-1.xx&oh=00_AfDTiHm6K8Wed7zZbdoD_tAozBRzzac7qKyQ0o4FTzTwTw&oe=64C0613F">Llama-2 white paper</a> that was released with the model by Meta. I was hoping to learn some special technique they may be using to train their models. Apparently, there isn't any. The learning process is straightforward. What is different is the huge cost associated with fine tuning after the model is trained. This fine tuning requires human interaction and feedback. To incorporate the feedback, the model has to be altered, which requires more computation. Training and fine tuning the model cost more than $20M (~$4 per hour across ~5M GPU hours). This immediately limits the number of players who will actively develop LLMs. The cost of adding safety to these models (e.g. blocking prompts for ransom letters etc.) is almost as high as the cost of training the model. </p><p>Another interesting tidbit from the paper was the assertion that a RoCE-based 200 Gbps interconnected cluster was adequate and more economical than an InfiniBand-based cluster. RoCE uses commodity ethernet. If one can train a 70B parameter model on trillions of tokens using commodity ethernet with RDMA, what is the compelling need to move to expensive NVLink-linked superchip-based systems? Maybe they are overfitting? (pun intended)</p><p>There is a significant cost to building these models that is shared with the public (unknowingly), i.e. the climate-related cost. The carbon emission of these clusters is shown in the paper at 539 tonnes of CO2e. It took 3.3M GPU hours (A100-80G). All of this to chat with a bot?</p><p>I found more benchmarks and metrics related to safety, climate and other social concerns in the paper than one typically finds in a technical paper. </p><p>It was easy to play with the model using oobabooga's <a href="https://github.com/oobabooga/text-generation-webui/blob/main/docs/LLaMA-v2-model.md">text gen UI</a>. I used the 13B parameter model from the family of Llama-2s released. It is a bit dated. You can see for yourself. </p><p><br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgAYYm7mfrWOrhArTlXE-GaVtk5QUxrCNdtCVCnWMMW6HLGa1URDCKIb3fuAdjXOvKmhjTzAlyfj3qeZwBx6nqEk5ldjYldCDVR8wAT7vvyaJPsLAN9uql3ltNxf_xV-rF5wFkmpfeizyPqRDvM3Ei988VX6BMJwmBe7kGfErMNkudZvxJPKDuA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="429" data-original-width="863" height="159" src="https://blogger.googleusercontent.com/img/a/AVvXsEgAYYm7mfrWOrhArTlXE-GaVtk5QUxrCNdtCVCnWMMW6HLGa1URDCKIb3fuAdjXOvKmhjTzAlyfj3qeZwBx6nqEk5ldjYldCDVR8wAT7vvyaJPsLAN9uql3ltNxf_xV-rF5wFkmpfeizyPqRDvM3Ei988VX6BMJwmBe7kGfErMNkudZvxJPKDuA" width="320" /></a></div><br /><br /><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-47905396530715887492023-07-09T11:22:00.000-07:002023-07-09T11:22:04.621-07:00Learning a Model<p> Neural Networks have a bad reputation of being very confident when they are wrong. 
This is the result of bad probability estimates being calculated (i.e. learned). They also suffer from <a href="https://arxiv.org/abs/1602.02697">adversarial attacks.</a> Training is the activity that takes the most time in arriving at a functional LLM. Besides collecting, curating and integrating data sets, we also have to navigate around potholes by employing optimization techniques on the objective function. The objective function, or goal-seeking function, is a function that takes data and model parameters as arguments and outputs a number. The goal is to find values for these parameters which either maximize or minimize this number. Maximum likelihood estimation (MLE) is one of the most often used formulations for this task of finding the set of parameters that best fit the observed data.</p><p>LLMs have three model architectures (a) encoder only (BERT) (b) decoder only (GPT) (c) encoder-decoder (<a href="https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511">T5</a>). Looking at (b), which models a probability distribution over the next word given the prompt - arrived at by taking a smoothed exponential (softmax) of scores calculated using scaled dot products between each newly predicted word and the prompt - we use MLE to find the distribution that best fits the observed data.</p><p>Stochastic gradient descent (SGD) and ADAM (adaptive moment estimation) are two common methods used to optimize the objective function. The latter is memory intensive. There are many knobs - the size of a floating point, calculating moments (more values per parameter), changing learning rates, among others - that can be used to learn a model. Sometimes the knob settings result in generic learning, other times in overfitting. More often than not we just don’t converge, i.e. no learning. ADAM is a popular optimizer (I use it on the open source transformers from Hugging Face); it keeps about 3X more values per parameter than vanilla SGD. AdaFactor is an optimization on ADAM to reduce memory consumption but has been known to not always work. </p><p>A rule of thumb in ML is to gather more data, as a dumb model with lots of data beats a smart model with limited data. But training on a large amount of data is costly. It uses computational resources, and as the whole process is iterative, we need fast processors to crunch through the data so the model can be adapted and iterated upon. More data does not guarantee convergence, i.e. learning. The whole exercise of learning a model looks and feels more like art bordering on black magic than anything analytic or scientific. If modeling felt like a recipe, then this is like cooking the recipe. The end result has a lot of variance. </p><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-87782054771194539662023-06-24T13:30:00.001-07:002023-06-24T13:30:00.146-07:00The Model behind LLM and Transformers<p> <span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space-collapse: preserve;">We want to get to a point where we have a probability distribution over a sequence of tokens. But what are these tokens? They are the words (or pieces of words) in a sentence, quantized - i.e. turned into numbers. The numbers the model computes with are not 64-bit floats; they are more like 16- or even 8-bit floats. Arranged in an array, these numbers map to a word and carry some of the context of the word. 
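A toy sketch of this word-to-vector mapping, with made-up 3-dimensional vectors (real models learn hundreds of dimensions) and cosine similarity standing in for "context":
<pre>
# Toy word embeddings: hypothetical 3-d vectors; real models learn hundreds of dimensions.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.1]),
    "man":   np.array([0.1, 0.9, 0.2]),
    "woman": np.array([0.1, 0.1, 0.2]),
    "taco":  np.array([0.0, 0.5, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words used in similar contexts end up with similar vectors.
print(cosine(emb["king"], emb["queen"]))   # higher
print(cosine(emb["king"], emb["taco"]))    # lower
</pre>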
The classic example of operating in this (embedding) space is King - Man + Woman ≈ Queen; that is a legit operation in this space. So we take a sentence, convert it into tokens and create a vector for each word; taken as a whole, these vectors are called word embeddings. This seems like preliminary stuff, but it is not: it is so important that if we get it wrong, the whole generative AI edifice falls on its face. </span></p><p data-renderer-start-pos="735" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space-collapse: preserve;">The engine of a generative AI car is the <a class="css-tgpl01" data-renderer-mark="true" data-testid="link-with-safety" href="https://arxiv.org/pdf/1706.03762.pdf" style="text-decoration-line: none;" title="https://arxiv.org/pdf/1706.03762.pdf">transformer</a>. The main parts of the transformer are the encoder and the decoder. The encoder is where you say “Yo quiero tacobell” and the decoder translates it to “I love Tacobell”. As an aside, if you tokenize as “I love Taco Bell” that will result in a vastly different generation than if you tokenize as “I love Tacobell”. But to make this generative, i.e. change its task, they got rid of the encoder and used stacks of decoders. After all, given a prompt, we need to generate an essay where every word is picked from a probability distribution. A stack of decoders helps this more than an encoder/decoder architecture. This transformer architecture has two main components. First is the positional encoding and second is the multi-head attention. </p><p data-renderer-start-pos="1516" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space-collapse: preserve;">If we start with positional encoding, we are looking at a matrix where each row is a word in the sentence of the prompt. This matrix is created using early 19th century math (Fourier), which showed us that a summation of sine and cosine functions with varying frequencies can carry all the information in an input signal. The input signal here is a prompt, and to give different emphasis to each word position in the prompt, we use sine and cosine functions to arrive at the rows of the matrix. We started with word embeddings and now we have a matrix with positional embeddings. But why? Well, the output of each time-step is fed back to the decoder. E.g. “I love Tacobell <em data-renderer-mark="true">around the corner</em>”, where the italics mean those words were generated in previous steps, will give more emphasis to “around the corner” than to the first few words. This is the part about attention. That brings us to the next big component called Multi-Head Attention (MHA). </p><p data-renderer-start-pos="2442" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space-collapse: preserve;">To understand MHA, we need to understand the concept of query (Q), key (K) and value (V). 
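A minimal sketch of the scaled dot-product attention at the heart of each head (toy dimensions, a single head, and without the learned Q/K/V projection matrices a real transformer adds); the prose below walks through what Q, K and V mean:
<pre>
# Scaled dot-product attention, the core of one attention head (toy sizes).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well the query matches each key
    weights = softmax(scores, axis=-1)   # scores -> probability distribution
    return weights @ V                   # weighted mix of the values

rng = np.random.default_rng(0)
K = V = rng.normal(size=(4, 8))          # keys/values: 4 prompt tokens, 8-d embeddings
Q = rng.normal(size=(1, 8))              # query: the position being generated
print(attention(Q, K, V).shape)          # (1, 8)
</pre>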
The query is the question asked by the transformer to find the correct next word in the sentence. In our case, that would be “around”. So the query is asking: is “around” the best fit given the input sequence, i.e. the keys? The values are the input sequence’s embeddings that we calculated earlier. At the end of the day, we are using matrix multiplication, specifically the dot product, to gauge the relative fit of a word given the input sentence. As we generate more words, we want the generated words to be in the input to find the next best match. A dot product is literally telling us how far the next word (the query) is from the current sentence (the keys). </p><p data-renderer-start-pos="3179" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space-collapse: preserve;">Note that till now we have not really used a neural network. We would need that because so far all we have done is linear transforms (matrix multiplication). We need something nonlinear to activate the neurons, or else what’s the point of calling this a NN? That part is the feed-forward neural network, which gets the output (a matrix) that we calculated. And by doing all this processing in parallel - because we use multiple heads of attention - we are able to accelerate it. This was the original intent of transformers, i.e. accelerate translation using GPUs. If you are a student of history, this is kind of like the old mechanical engineering techniques used to perform switching and routing. They worked but eventually were replaced by electronics. I think AI is lacking that paradigm shift. It hasn’t yet found its transistor. </p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-71035655304958132392023-06-10T07:54:00.002-07:002023-06-10T07:54:21.690-07:00Perplexity, Entropy: How to measure LLMs? <p> <span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">How can we measure the efficacy of a language model? Language model researchers use the term “Perplexity” to measure how a language model performs on tasks over standard datasets. In a language model the task means quizzing the model to complete a sentence, hold a Q&A or generate an essay. 
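Mechanically, perplexity is just the exponentiated average negative log-likelihood a model assigns to held-out text; a small sketch with made-up token probabilities:
<pre>
# Perplexity = exp(average negative log-likelihood of the observed tokens).
# The probabilities below are made up; a real evaluation takes them from the model.
import math

token_probs = [0.20, 0.05, 0.60, 0.10, 0.33]   # p(token_i | preceding tokens)

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(round(perplexity, 1))   # ~5.5: as if choosing among ~5-6 equally likely words
</pre>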
</span><a class="css-tgpl01" data-renderer-mark="true" data-testid="link-with-safety" href="https://arxiv.org/pdf/2005.14165.pdf" style="font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; text-decoration-line: none; white-space: pre-wrap;" title="https://arxiv.org/pdf/2005.14165.pdf">GPT-3</a><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;"> scored well, in fact very well, on perplexity on standard benchmarks like the </span><a class="css-tgpl01" data-renderer-mark="true" data-testid="link-with-safety" href="https://catalog.ldc.upenn.edu/LDC99T42" style="font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; text-decoration-line: none; white-space: pre-wrap;" title="https://catalog.ldc.upenn.edu/LDC99T42">Penn Tree Bank</a><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">. Overall, though, the results were mediocre. </span></p><p data-renderer-start-pos="430" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space: pre-wrap;">The perplexity of a model measures the “surprise” factor in the generation, or how many branches the model has to deal with when predicting the next word. A perplexity of 20 would mean that, given a few words, the model has to pick among roughly 20 equally plausible choices for the next word. If that number were 2, the model would have an easier task, but that is most likely because we overfit the model to a specific task. Without this context-specific training, the GPT-3 folks claim that the model is a few-shot learner, which means it takes only a few examples in the prompt before the model homes in on the task’s context. There are variants of transformer models, like BERT cased, which are trained for specific tasks like “complete the last word” or “fill in the blank” and which perform much better at those specific tasks than GPT-3.</p><p data-renderer-start-pos="1213" style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; line-height: 1.714; margin: 0.75rem 0px 0px; padding: 0px; white-space: pre-wrap;">As we are building LLMs to perform tasks like translation, generation, completion etc., should we overfit a model to a specific task or leave it as a generic model and provide in-context training via prompting? With GPT-3, it seems prompting is the chosen route to get the model to home in. And what about the model size? Does it make sense to overfit a model with hundreds of billions of parameters to do a single task (like translate) or leave it as a few-shot performer, i.e. mediocre? 
These are the tradeoffs that one has to make when building a new LLM. Left as generic few-shot performers, larger models are mediocre at best. </p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-48610643648429006102023-05-28T09:49:00.005-07:002023-05-28T09:49:31.952-07:00Language Models<p> An LM is a p<span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">robability distribution over sequences of tokens - which are words in a vocabulary, so a sequence is a phrase. Each phrase fragment gets a probability. The probability is higher for fragments which are good phrases - good meaning grammatically correct and semantically plausible. Good is very dependent on extra information that is not in the training set. For example, a sentence “I saw the golden gate bridge flying into san francisco” needs some contextual information, like bridges are stationary and “I” refers to a human in the plane. This means the LM needs to determine the semantic plausibility of a phrase fragment. This is what makes LMs deceptively simple and easy to get wrong. </span></p><p><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">Mathematically, the phrase fragment is called a sequence of tokens. And the tokens are pulled from a set “V”, the vocabulary of the language. The probability distribution over a four-token sequence assigns different probabilities to the different orderings of those four tokens. Some orderings are semantically implausible and others are syntactically incorrect. So far, we haven’t done any generation, but that is also possible in the LM, where given a sequence of tokens, say four, we can pull five-token sequences and judge their “goodness”. The judgement is based on some parameter, which could be a level ranging from poor to best. To continue with the sentence above, the probabilities of “I saw .. sf on saturday”, “I saw ..sf yesterday” are all good as they are good sentences, but we need a new parameter to pick one over the other. This parameter is the randomness of the search. If we are strict, we would pick the highest-probability next word at every step of the prediction. If we are lenient, we would randomize the next pick. This conditional generation is then determined by the prefix sequence, in our case “I saw .. sf”, and the completion will be “yesterday”. This is the big problem with the LM: a lenient generation will create absurd sentences and even total fabrications, while a strict generation will create templates. Add to that there is a heavy reliance on the prefix sequence or “prompt”. This parameter is called temperature; it ranges from lenient to strict and controls the variability of the generation. So, we now have the mathematical vocabulary of generation. We have a prompt, a completion and temperature. </span></p><p><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;"> A slight change to the prompt will generate an entirely different sentence. 
A longer prompt also changes the generation, and any foreign words in the prompt, perfectly normal in spoken English, can also influence the generation in unpredictable ways. The number of tokens in the prompt is a big determinant of the computational complexity of LMs. An architecture that works through the tokens one after another, such as an RNN (Recurrent Neural Network), will take a very long time to train. A transformer, which restricts attention to a window of prior tokens and exploits the parallelism of GPUs, limits the accuracy of the generation. The parallelism of transformers enables large LMs. Large as in trillion parameters. What was observed in late 2022 was that the larger the model got, the more the fixed-length generation coupled with temperature produced sentences which never existed before. This is called emergent behavior - this is the beginning of “learning”. Repeated prompts on the same topic teach the LM to home in on the context. This makes the completions sound more like spoken language and less like an automaton. This observation of emergent behavior without having to change the model (as in no new gradient descents) is what is causing most of the hype around Generative AI. As the model homes in on a context, it feels like it could pass a Turing test - the generation feels conversational. </span></p><p><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">The biggest challenge for Gen AI - as an industry - is not to create larger and larger models, but instead to slice and dice an existing LLM into a size that can be deployed at scale. We can see the sizes of large models in this </span><a class="css-tgpl01" data-renderer-mark="true" data-testid="link-with-safety" href="https://dl.acm.org/doi/pdf/10.1145/3442188.3445922" style="font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; text-decoration-line: none; white-space: pre-wrap;" title="https://dl.acm.org/doi/pdf/10.1145/3442188.3445922">paper</a><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">, reproduced below. </span><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">It is rumored GPT-4 is over 1 trillion parameters as well. Training costs around $28K per 1 billion parameters. It is however a one-time cost. The continual cost is on the sliced/diced version - perhaps running on the phone - which needs to cost no more than a penny per 1000 generations. </span></p><p><span style="background-color: white; color: #172b4d; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif; font-size: 16px; letter-spacing: -0.005em; white-space: pre-wrap;">The models also need to be trained on proprietary and current data for them to generate economic value beyond research. 
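Before closing, a toy sketch of the temperature knob described above - how it slides the next-word choice between strict (always the most likely word) and lenient (roll the dice). The vocabulary and scores are made up:
<pre>
# Temperature rescales the model's next-token scores before sampling.
import numpy as np

vocab  = ["yesterday", "on", "gleefully", "purple"]
logits = np.array([3.0, 2.5, 0.5, -1.0])        # made-up raw scores for the next token

def next_token_dist(logits, temperature):
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

for t in (0.2, 1.0, 2.0):                        # strict -> lenient
    p = next_token_dist(logits, t)
    print(t, dict(zip(vocab, p.round(3))))
# Low temperature piles the probability onto "yesterday"; high temperature flattens it,
# which is where the absurd completions start to sneak in.
</pre>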
</span></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-43596293515134304042020-05-23T18:02:00.000-07:002020-05-23T18:02:00.190-07:00Virus and AI ResetThere are two kinds of AI: classifier and recognizer. Both are affected by this pandemic. First, the classifier - which is just statistical analysis to categorize scenes, events etc. and issue predictions - needs movement among the subjects, i.e. a continuous data feed. With everybody at home across the globe that feed is quite stale and boring. Those annoying ads that were being served, which sometimes got you to think that you were being observed, are gone, aren't they? Second, the recognizer is the AI technique that has a multi-billion dollar industry behind it. Its biggest customers are China and other governments. That industry is facing the same problem as the small biz during shutdown. It had trouble recognizing faces with dark skin and now it is completely stumped recognizing faces with masks. But most of all, AI has no answer to these multiple models which are driving multi-trillion dollar policy decisions. I was a fan of AI in grad school, saw it find some traction in expert systems and then disappear. I believe we might see that again. This time, I won't be mourning it.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-34892392898493207722020-01-01T09:02:00.004-08:002020-01-01T09:02:46.261-08:00Algos and Data StructuresFrom the title of this blog, one might wonder if I have something to say along the lines of Niklaus Wirth's famous book Algorithms + Data Structures = Programs. Nope, it is about how these compute elements control the digital highways where we arguably spend more time than on physical interstate highways.<br />
<br />
Imagine that on a physical highway you see a sign which says "All Indians exit here", or that you drive down a street in the restaurant district and only Indian restaurants appear open. It would be illegal, wouldn't it? But that is exactly what happens on the digital highways and streets. No matter how hard I try to search on popular media platforms, I keep being recommended stuff based on a single factor in my profile: nationality, as deduced by my name.<br />
<br />
AI and ML are broad categories of algos that use data (mostly acquired using shady methods) to influence the recommendation engines used on the web. Calling them intelligent is a mockery of intelligence. They are codified stereotypes, biases and directed suggestions. Those kinds of algos are good for recognizing something that does not change every day, like your voice and your face, but not for your likes, curiosities and spirit of discovery.<br />
<br />
We need these platforms to use data structures that allow end users to load their profile with attached contracts that are enforced. There is no reason why these algos should be allowed to remain opaque. The way they are right now, they are in violation of all the data privacy laws that come into effect in 2020.<br />
<br />
Happy New Year!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-66526935079089288052019-07-20T10:02:00.000-07:002019-07-20T10:02:00.186-07:00Fed should embrace crypto... and extendI have a mobile passport in my wallet. It is the digital version of my physical passport. I also have physical dollars in my wallet, but I don't have a digital version of the same. One would wonder if the credit cards are digital versions of the physical dollars in my wallet. They are not. They are just extensions of my personal balance sheet. Their use creates an entry on the wrong side of my balance sheet - a hole from where I have to dig my way out. They are also not digital versions because if I lose my wallet I lose nothing. The physical dollars in my wallet are lost but not the credit represented by the cards.<br />
<br />
What if I had a digital version of the dollars that I have? Besides the benefit of limiting the damaging effects of climate change by not killing more trees, it could actually have a paradigm-shifting effect on the world of finance. The interest on my deposits in a bank could vary depending upon the balance, my personal situation and any intent that I have declared on that digital currency. This simple-sounding functionality could give the Fed (Federal Reserve) tools to control money at the atomic level. Instead of medieval-style messing around with the funds rate, they could punish accumulation of dollars where they don't want it and reward the flows which they do want. I could declare intent on my digital dollar and have a smart contract enforce that (read estate disruption). The IRS could tag dollars at the source as tax revenue (read disruption of USTR). Employers could tag dollars as retirement dollars which, even if they sit in my digital wallet, will earn as if in a 401k account. The possibilities are endless. Finally, Silicon Valley can imagine again instead of the widespread re-imagination that is rampant right now. Really, I am surprised books aren't written on this by science fiction writers.<br />
<br />
If we had a digital dollar, what would it be based on? Today we have fiat currency (a tongue-in-cheek backronym: Fed In Agreement with Treasury, i.e. FIAT). The answer is crypto. The Fed can own the crypto methods under the USD. If other central banks agree, the currency crosses will be calculated every time they are used (read FX disruption). Blockchain is the bottom of this technology stack. The rest of the stack is not even funded yet. The core of all innovation is embrace and extend, and I think the Fed should do exactly that: embrace this technology and extend it for the benefit of the world of finance.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-60191582173847267702019-05-10T11:30:00.000-07:002019-05-10T11:30:03.221-07:00Data is the new NarcoticAbout six months ago, I tweeted saying Data is not the new oil, but it is the new narcotic. A key property of any narcotic is that you need it periodically and, with time, more of it to achieve onset. Data has the same effect on any cloud-based digital platform. Let me explain.<br />
<br />
Let's say you are like me - frugal - and shop at Big Lots. A data point in time would be the receipt issued to you, which has the item's SKU, your loyalty number (for identification) and (most importantly) the location/time of the purchase. If the store shared this data with a digital platform, that data would be the first attempt at a drug for that platform. The platform would need this data periodically, i.e. your receipt at your next purchase. Why? Because the analytics is done on a time series: first-order insights like time between purchases, and second-order insights like cough syrup in June meaning the spread of some respiratory infection in that zipcode - or, if that fails, crossing that data with doctor's office visits to arrive at a probability of a chronic upper respiratory disease, or just smoking. All of these common-sense analytics can be done by a computer in seconds, but it needs data to start and it needs to be fed periodically to increase its accuracy. And to arrive at second-order insights, it needs more of that data about you. In fact, the more of your activities - from body functions to daily habits - are digitized and fed to this platform, the smarter it will get at predicting your intentions. To the point that it will predict (serve you an advert) before you have felt the need for it - like when you wonder how it knew that you needed a cough syrup.<br />
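A sketch of that first-order analysis - days between purchases per shopper - on a few made-up receipt rows (pandas assumed; a real platform would ingest the store's feed):
<pre>
# Hypothetical receipt feed: loyalty id, SKU, timestamp.
import pandas as pd

receipts = pd.DataFrame({
    "loyalty_id": ["A1", "A1", "A1", "B7", "B7"],
    "sku":        ["cough_syrup", "cough_syrup", "tissues", "soda", "cough_syrup"],
    "ts": pd.to_datetime(["2019-05-01", "2019-05-20", "2019-06-02",
                          "2019-05-03", "2019-06-10"]),
}).sort_values("ts")

# First-order insight: time between purchases, per shopper.
receipts["days_since_last"] = receipts.groupby("loyalty_id")["ts"].diff().dt.days
print(receipts)
</pre>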
<br />
They are calling this type of learning - where repeated encounters with a data point increase the weight (probability) of that happening again on an underlying learning network - "AI". And the data is the narcotic that this system needs regularly and in increasing quantities.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-18222387399976659172019-03-31T09:47:00.000-07:002019-03-31T09:47:04.621-07:00Who killed Blockchain? We have all been to a notary to complete a transaction or an application for important events in our lives. The general procedure involves one signing/dating the document followed by a signature/stamp from the notary. The notary also makes an entry into his/her ledger. The purpose of the ledger is to validate the notarization at some time in the future. Now what if we had a technology that automagically places all the millions of entries across hundreds of thousands of notaries into an access-controlled ledger on the internet? That would enable notarization by any citizen or entity not in the bad books of society. It would enable cross-border notarization (read trade). Extend this use case further and you could apply this technology to validate your credentials, your deeds on your assets, your birth/death/marriage records. This technology would fundamentally change the way humans are organized today. Everybody agrees it would be great to have this technology flourish, but alas I read its obituary every day.<br />
<br />
To understand why, I would invite you to watch the movie Who Killed the Electric Car? There is something in the movie for both bulls and bears of blockchain. The bulls will say: but wait, EVs are not dead, so there is hope for blockchain. The bears will say: but those tactics (like filing patents and locking up competitive innovation) worked, as they delayed the EV for several decades. (My own first experience with an EV was Sun's Java car in 1999.) I think blockchain is going through one of those moments where the ones whose empires will fall ("trusted intermediaries") first joined the party and then played party pooper. Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-74813335535911990002018-12-24T12:27:00.000-08:002018-12-27T06:23:20.840-08:00Elephant in the DatacenterI hear conversations at eateries discussing virtual wires to move packets between software endpoints. I believe the problem of moving packets or making a remote procedure call (across administrative domains) is already solved. What is not solved is the policy around the movement and invocations. We need to solve this problem if we are to tackle the elephant in the room. When I listen to end users of technology, they discuss their pressing problems, and those are phishing, robocalls, invasion of privacy and, at the extreme, fear of constant surveillance because of AI. The intrusion they are concerned about is not from rogue states but from rogue software that runs within our trusted perimeter. This is not going to be solved at the lower levels of the compute stack. This needs a Layer 8 security solution whose enforcement takes place at all seven (or five) layers below.<br />
<br />
I spent the last year trying to understand the appeal of blockchain as a data structure and now I am convinced that is how we should look at blockchain. It is a data structure like a class in Java or struct in C. We should use it when we want to store data with access permissions of the owner of the data. When that happens, we will be uploading our photos and storing our documents with a checkbox which says private and that would cause the platform's software to store the datum in a blockchain that is personal to the person who uploads. It is like a vault in the bank where not even the banker can access the contents without your private key.<br />
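A minimal sketch of blockchain viewed exactly that way - as a plain data structure of hash-linked records, with no mining, consensus or networking:
<pre>
# Each block stores the hash of the previous block, so tampering with any record
# breaks every link after it.
import hashlib, json, time

class Block:
    def __init__(self, data, prev_hash):
        self.data = data                      # e.g. {"doc": "vacation.jpg", "private": True}
        self.prev_hash = prev_hash
        self.timestamp = time.time()
        self.hash = self._digest()

    def _digest(self):
        payload = json.dumps([self.data, self.prev_hash, self.timestamp], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

chain = [Block({"doc": "genesis"}, prev_hash="0" * 64)]
chain.append(Block({"doc": "vacation.jpg", "private": True}, chain[-1].hash))
chain.append(Block({"doc": "deed.pdf", "private": True}, chain[-1].hash))
print(all(b.prev_hash == prev.hash for prev, b in zip(chain, chain[1:])))   # True
</pre>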
<br />
This massive misallocation of capital to copy-cat ideas only happened because of the ridiculously low cost of money. In a rising rate environment, hopefully we will see capital being allocated to the issues for which society is waiting for resolution.<br />
<br />
Happy HolidaysUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-6743330.post-79954047101449597432018-10-19T17:00:00.000-07:002018-10-23T16:05:45.909-07:00(Lack of) Trust is killing the WorldNot to take on the "Software is eating the world" meme, I believe the pendulum has swung too far in the SD* world.<br />
<br />
Automation was the killer app driving the push to softwarize everything and give it internal guidance (read ML, AI etc.), but at the end of the day it ended up being a great tool for the bad actors who used the automation to conduct fraud, invade privacy and outright steal. Think of an airport with no security where airlines automate all ticketing, boarding, everything. Not a very secure flight, is it? The biggest threat to efficiency/automation/AI is fraud, malice and, hmm, bad actors - or humans.<br />
<br />
What we need now is to build a platform that promotes trust, not just communication. I went to engineering school when having an email address was a privilege given to certain university students. Today I can communicate with my family over four different chat programs - they all use different ones. Do we really need this? Given all these choices, I still could not locate the person who sold me stuff on eBay that never arrived. What's the point?<br />
<br />
If software wants to run stuff, then it should be accountable. It should be trusted. Currently, the only technology that straddles the tech/human boundary is Blockchain. These large platform companies should store all PII on Blockchain, which should be under the user's control. My digital recognitions, from university diplomas to an Employee of the Day award, should be on my personal chain. When we get to a point where I am alerted whenever someone or something (read SW) accesses my data using my public key on any platform or within any organization, only then will people trust these newly classified communication services companies. Until then, I think SD* will be put on hold. Enough of efficiency, I just want my privacy back!Unknownnoreply@blogger.comSunnyvale, CA, USA37.36883 -122.036349637.1669945 -122.35907309999999 37.570665500000004 -121.7136261tag:blogger.com,1999:blog-6743330.post-23558856949415922292018-03-17T11:33:00.000-07:002018-03-17T11:33:02.748-07:00Pets and CattleAny pet owner will understand the pain felt by the owner of the pet that died on United Airlines. What essentially happened is that the flight attendant dealt with the pet as if it were cattle. Your onboard baggage is cattle and the airline is optimized for cattle. This is not a note on UAL or pets, but on the notion in cloud computing that enterprise applications are pets and should be converted to cattle so they can leverage the cloud computing infrastructure.<br />
<br />
The market for enterprise software is around $280B, and the markets for infrastructure - servers, network and storage - are roughly $60B, $40B and $20B respectively. As cloud providers take a large upfront cost hit on the infrastructure, they want to host workloads that are more sticky than the ones that drive revenues today. Today's workloads are mostly transient. New workloads start on the cloud and migrate out of it as soon as they are viable (business-model-wise). So, it is understandable that cloud computing giants are pushing the pets vs. cattle metaphor and exhorting enterprise IT to migrate their pet applications to the cloud. And why not - the prize is huge: almost $200B of revenues should it happen. But will it? The case of UAL and the pet reminds us that pets cannot be treated as cattle, and that when they are, the consequences are disastrous.<br />
<br />
As any pet parent knows, planning a vacation requires careful selection of pet-friendly hotels, airlines and destinations. We don't have the same pet friendliness in a cloud yet. Given the economics of the cloud, it will be difficult.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-56039117081557899042018-03-14T14:11:00.001-07:002018-03-14T14:24:20.203-07:00Too many DashboardsIt seems we have a dashboard for every metric and dashboards to pick dashboards. This reminds me of the early 2000s when widgets were introduced and we placed a widget for every metric and data stream on the desktop. We have distributed our IT systems and that enables us to monitor everything, but a dashboard is not the answer. In fact, with so many dashboards I kind of miss the good old monolithic system :)<br />
<br />
What we need is an automaton which processes these metrics and makes decisions automatically. A dashboard seems too much like a business process. What would be great is if we got an alert saying "metric reached threshold and controller took some action".. kind of like what we get from our banks when a charge is made or fraud is prevented. What would be neat is if we replaced a whole bunch of dashboards with a few controllers that execute some policy recommended in real time by ML.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-56546653356767746742018-02-27T14:16:00.000-08:002018-03-14T14:23:56.308-07:00One horse or 10K Chickens?Would you like to ride a carriage pulled by 10K chickens or by a single horse? The answer is not easy. It is technically cool to distribute a job over many compute units, and when done successfully it places the workload on a more hospitable economic curve (cheaper, faster, better..). However, not all jobs can be distributed, and a system which does distribute them over cheap compute units can end up being so complicated that the focus shifts from the job to managing the system.<br />
<br />
We just spent over six months trying to distribute a web app and came to the conclusion that it is not about the app but about the data. We need to distribute the data and have compute threads scheduled on the data. But wasn't that the whole point of OOP?<br />
<br />
Anyways, the search for distribution introduced me to blockchain. Here is a distributed network that kind of mimics human networks. It keeps people honest and has potential. stay tuned...Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-84380147025708978592017-01-03T09:46:00.000-08:002017-01-03T09:46:19.893-08:00Network Data Analytics for Recommendation ToolYeah so first Happy New Year<br />
<br /><br />
<br /><br />
Can analytics data collected from the network actually be used for driving IT infrastructure sales? This is what I was thinking about when scuba diving on the coral reefs over the break. As an aside I try to think of something pleasant when scuba diving as I tend to panic and hit the button to rise to the surface. <br />
<br /><br />
So back to the question. The answer IMHO is 'yes'. If we collect the right telemetry of an application distributed in the datacenter, we can answer a query like "What resources (IT) are used to create this latency profile". For example, if we know that a certain install of SharePoint has a distribution of response times that is satisfactory to the administrator, then a simple query like the above should give us a list of the inventory that made this possible. This inventory can include switches, servers, storage software etc. Evolve this tool further and this product can also give you the TCO of an infrastructure by recording this data over time and using it for the TCO calculation. Telemetry needs to meet AI/ML for this to happen. <br />
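A sketch of what such a query could look like over made-up telemetry and inventory tables (pandas assumed; a real tool would pull these from collectors and an asset database):
<pre>
# "What resources (IT) are used to create this latency profile?"
import pandas as pd

telemetry = pd.DataFrame({
    "app":        ["sharepoint", "sharepoint", "sharepoint", "crm"],
    "latency_ms": [120, 95, 480, 200],
    "server":     ["srv01", "srv02", "srv01", "srv09"],
})
inventory = pd.DataFrame({
    "server":  ["srv01", "srv02", "srv09"],
    "switch":  ["tor-a", "tor-a", "tor-b"],
    "storage": ["array-1", "array-1", "array-2"],
})

# Requests that meet the administrator's latency profile, joined to the gear behind them.
good = telemetry[(telemetry.app == "sharepoint") & (telemetry.latency_ms < 150)]
print(good.merge(inventory, on="server")[["server", "switch", "storage"]].drop_duplicates())
</pre>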
<br /><br />
So the question is why isn't anyone doing this?Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-18566360034088164772016-12-19T17:07:00.000-08:002016-12-19T17:07:01.031-08:00Developer will own Security OperationsI have a sneaky feeling that in the coming year, the developer will strike big and get operational control of security in the datacenter and the enterprise as a whole. Earlier this year, I warned that this should not happen. Read <a href="http://www.networkofthings.com/2016_03_01_archive.html#1725743290795522123">This</a>.<br />
<br /><br />
But I feel now it is too late. Here is why. We have moved from securing the perimeter to securing interfaces, and now we are talking about process jails. Some folks call it micro-segmentation moving to nano-segmentation. From an ops person's POV this means a several-orders-of-magnitude increase in the number of endpoints that he has to identify and operationalize - i.e. he cannot do it. It will have to be done by software. And the developer owns software.<br />
<br /><br />
Yup, software is eating the world. It just ate security ops. Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-21585758613862562872016-12-06T14:33:00.001-08:002016-12-06T14:33:23.752-08:00Spark - Yup it is still all about APIsHaving spent over 6 months now reading and practicing with code snippets on the big data ecosystem, the epiphany came to me that all I was doing was learning a new API. In fact, I was learning three new APIs: RDD, DataFrame and DataSet. It wasn't so obvious when I started reading about Apache Spark. See, the beauty of an API is that it speaks the language of a developer, and as a developer at heart and by training, I can easily understand what is being said.<br />
<br />
All the stuff I had to read to get to this epiphany - about Scala, R, NumPy and in-memory databases, keeping data in CPU registers and not in L1/L2 cache - was just confusion that kept me from getting to the core. It took six months to weed through so much garbage to get here.<br />
<br />
Ok, so these three APIs help you deal with data, and depending upon the data, you pick one of them. The more structured your data, the more you think of Datasets and DataFrames over a plain RDD. And that is all there is to it. <br />
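A small PySpark sketch of the same records seen through the RDD and the DataFrame APIs (the typed Dataset API lives in Scala/Java; the rows here are made up):
<pre>
# Same data, two APIs: low-level RDD vs. structured DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()
rows = [("alice", 34), ("bob", 29), ("carol", 41)]

# RDD: you write the functions, Spark just distributes them.
rdd = spark.sparkContext.parallelize(rows)
print(rdd.filter(lambda r: r[1] > 30).count())      # 2

# DataFrame: declare the structure and let the optimizer plan the work.
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age > 30).show()

spark.stop()
</pre>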
<br />
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-50697434867512244332016-03-13T10:50:00.002-07:002016-03-13T10:50:30.584-07:00Machine Learning, Deep Learning and Streaming Data ProcessorsWhen AlphaGo beat the human last week, using ML to narrow the game down to the small set of board positions its computing power could actually process, it proved that Deep Learning ML (that which uses algorithms vs. simply data) has arrived. But can the same machine analyze a streaming set of unrelated mouse clicks to identify a "hack"? Could it have blocked the hacking of the NY Fed and saved Bangladesh $100M of lost funds?<br />
<br />
As I refresh my understanding of ML - this time with Spark MLlib - I am thinking that this use case of streaming data analysis with ML or Deep Learning is the NBT (Next Big Thing). To run this type of computational job, one requires a cloud, because no small cluster will do, and it needs a special network fabric, because nothing enforces better than a policy on the network. Next generation compute systems, and especially memory/microprocessor architectures, have found their killer app in streaming data processing, just like GPUs found video games. This time the games are played by hackers and the losses are measured in hundreds of millions. Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-17257432907955221232016-03-06T19:20:00.003-08:002016-03-06T19:20:57.389-08:00Microservices and Container - Not perfect together<br />
There has been a continuous turf battle going on since the 1990s between the developer and the deployer (a.k.a. the admin). Currently the admin controls the security, scale and size of the application. The developer controls the content, architecture and interfaces to other systems. Two emerging technologies empower (or shift power between) these two constituencies: microservices empower the developer while containers empower the deployer/admin. <br />
<br />
If either wants to take off, they need to find other partners, not each other. If microservices want to take off, they need to get the container monkey off their back. That pairing shifts the responsibility for security and scale onto the developer and away from the deployer. The developer should ideally be limited to interface design and the selection of abstractions that make logic easy to codify. If security and scale fall into the hands of the developer, it goes against the grain of application evolution, where binding is done as late as possible and certainly not at design time. A datacenter administrator will be very uneasy deploying applications where the developer has coded in the security policy and scale limits. <br />
<br />
The container has a similar story. The container is not a disruptive technology like it is portrayed to be, nor is its effect as tectonic as virtualization or Java/the JVM. It needs to find a partner in a deployment technology like Vagrant or Rails or a PaaS. Currently its main value proposition is the time it takes to bring up an execution environment. But to get that, one has to give up security, naming, directory and identity. In other words, what it is doing is pushing those decisions onto the developer, who is the most ill-equipped to handle them. Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-4282331961249631602016-02-23T10:57:00.002-08:002016-02-23T10:57:39.485-08:00JESOA<br />
In continuing my education on Microservices, I came across a blog which gave a formula for SOA. In essence the formula says remove ESB, SOAP, Persistence (Reliable Messaging) and centralized governance and add containers + PaaS. That is the microservices definition.<br />
<br />
There was an initiative a while back called JEOS, for "Just enough Operating System". Maybe we should try that for the microservices equation as a function of SOA: it is just enough SOA.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-70436702901899667132016-02-16T21:30:00.000-08:002016-02-17T09:43:06.225-08:00MicroservicesIn trying to understand the difference between Microservices and Web Services (circa 2002), I came across this definition.<br />
<br />
<em style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;">"..., the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies." source: M Fowler</em><br />
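A minimal sketch of one such small service - its own process, a single business capability, an HTTP resource API (Python standard library only; the resource, port and data are made up):
<pre>
# One microservice: one process, one narrow capability, one HTTP resource.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

INVENTORY = {"sku-42": {"name": "widget", "in_stock": 17}}   # stand-in data store

class InventoryService(BaseHTTPRequestHandler):
    def do_GET(self):
        sku = self.path.strip("/").split("/")[-1]             # e.g. GET /inventory/sku-42
        item = INVENTORY.get(sku)
        self.send_response(200 if item else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(item or {"error": "not found"}).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), InventoryService).serve_forever()
</pre>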
<em style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;"><br /></em>
<em style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;"><br /></em>
<span style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;">When web services started, a similar definition was put forth. The main focus then was to get away from Java RMI to lightweight HTTP-based communication and to bring in language independence (then championed by Microsoft's CLR). As we implemented web services, it became quite obvious that HTTP was quite heavyweight and language independence was not very economical. In fact what worked was language homogeneity, with standardization on a language that could shed weight for simple tasks and leverage a framework for heavy lifting. </span><br />
<span style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;"><br /></span>
<span style="background-color: white; font-family: sans-serif; font-size: 15px; text-align: center;">One of the biggest innovations in Java was the built-in packaging and (later) deployment mechanisms. In fact, most of the appeal of Java was in its "platform" features and not its language features. This whole CNA biz smells a lot like the language once again taking control of deployment and not leaving it to operations. We are seeing an application ops profession asserting itself. </span>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-11879749383919833512015-09-13T12:23:00.002-07:002015-09-13T12:23:16.597-07:00IoT is a Cloud NetworkCloud Native, SDN and SOA are all techniques, not technologies. IoT is not a technique; it is a use case that needs to use the above-mentioned techniques to enable a mesh of connections that is manageable, secure and, most of all, just works. IoT could use any infrastructure including the carrier network, but it will most likely end up using the cloud as IaaS. IoT is a cloud networking problem that needs some connectivity middleware to sit on top of a virtual network. IoT is an application designed using SOA that will provision its network using SDN and will use ephemeral compute threads that have preexisting bindings to the language runtimes. Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-78298457055897643412015-07-08T07:30:00.000-07:002015-07-08T11:44:21.468-07:00sdn and CMSOnly a few years ago, datacenter architects picked their overlay tunnel technology and created a list of stacks with which to build out their datacenter network. The cloud management system, or even "the cloud", was an afterthought. Today the tables have turned. We have datacenter architects debating the merits of various CMSes like OpenStack, VMware (vSphere, vCAC, NSX) and, a very distant third or fifth, CloudStack. Within their CMS they are asking for support from one or more SDN stacks. The days of standalone SDN stacks are gone. The battle today is between an open ecosystem like OpenStack vs. multiple closed ecosystems.<br />
<br />
So what are the SDN stacks being evaluated on by these cloud datacenters?<br />
<br />
First is the ability to scale. And by scale, I don't mean just overcoming the VLAN exhaustion issue with BGP announcements or by encapsulating L2 in L3, etc. Scale means the performance of the cloud network scales with the number of nodes in the datacenter. The nodes are server nodes; the cloud network does not scale with the number of switches. Scale means your automation system can manage the configuration of a 50 node cloud as easily as a 5K node cloud.<br />
<br />
Second is heterogeneity. This one is quite a beast because it requires supporting all the major hypervisors, authentication systems, SIAMs, and best-of-breed appliances (virtual and physical). From a cloud vendor's perspective this is where the R&D dollars are mostly spent, i.e. in the creation of a heterogeneous ecosystem - not proprietary ones like iCloud.<br />
<br />
Third is security. Not just network security or long, expensive compliance tests, but application data input validation and fraud prevention - almost WAF-like.<br />
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6743330.post-16951058833630941412015-04-06T07:00:00.000-07:002015-04-06T12:10:58.917-07:00Killer App for Overlay Networking/SDNSDN has been searching for a killer app since its birth in the midst of protocol and encapsulation debates of 2011. It wasn't monitoring, flow management or physical network orchestration for a controller. It turns out it is container networking.<br />
<br />
The container is challenging the VM (or a group of VMs) as the unit of an application. Its value proposition is the removal of the virtualization tax, and being open source it does not cost a whole lot to try out. The schedulers (k8s, Mesos etc.) seem to be maturing fast enough, but the networking behind them is still quite elementary. <br />
<br />
Using offloads that accelerate the encap/decap of an overlay network on the JEOS (Just enough OS), containers with virtual interfaces can outperform hypervisor-based VMs and integrate better with orchestration technology like Kubernetes. Unknownnoreply@blogger.com