Network of Things

Friday, May 23, 2025

Local LLM using Ollama and open-webUI

I have a local server with four Nvidia RTX GPUs, which I bought off eBay for $800. I run ollama on it and have downloaded a few models that I mainly use to ask how to configure other applications. I have multiple laptops where I run docker containers with open-webui and ollama-webui. Open-webui allows saving chats while ollama-webui does not.

The performance is about as good as any online LLM service. I routinely get 20+ tokens/sec, i.e. it is not a long wait to get a query answered, and I can stop a response and add information. In other words, it is usable.
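If you want to check the tokens/sec figure on your own setup, here is a minimal sketch. It assumes you have the JSON reply of a non-streaming call to Ollama's `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sample numbers are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a non-streaming Ollama reply."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Made-up numbers, shaped like the timing fields of an /api/generate reply
sample = {"eval_count": 240, "eval_duration": 12_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # prints 20.0 tokens/sec
```

This only measures generation speed; prompt processing and model load time are reported in separate fields and are ignored here.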

All the models are free and open source, so I pay nothing to use them. The only out-of-pocket expense was the server that I purchased. I could have built one with a $99 two-socket Intel motherboard (from AliExpress), but I did not have time, so I bought it from a student on eBay.

The answers are accurate and not all that different from the online versions. Give it a try! It is fun and really, really cheap.

Appendix

Once I had the setup working, I asked it to write up all the necessary steps in blog fashion. I haven't validated them, but they look roughly like what I did.

Title: Setting Up OLLAMA and Open-WebUI Across Server and Windows Laptop

In this blog post, we'll walk through the process of setting up OLLAMA, a tool for running large language models locally, on a server equipped with four NVIDIA GPUs, and connecting it to Open-WebUI, an intuitive user interface for large language models, on a Windows laptop using WSL (Windows Subsystem for Linux).

**Part 1: Server Setup**

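First, let's set up OLLAMA on the server. Begin by updating the package lists and installing OLLAMA with its official install script:

```bash
sudo apt-get update && sudo apt-get upgrade -y
# Official Linux install script; installs the ollama binary
curl -fsSL https://ollama.com/install.sh | sh
```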

Now, we'll pull your favorite models from the Ollama model library. Here, we'll use DeepSeek, Mistral, and Qwen:

```bash
ollama pull deepseek-r1
ollama pull mistral
ollama pull qwen
```

To configure OLLAMA as a service, create (or replace) the systemd unit file in `/etc/systemd/system`:

```bash
sudo nano /etc/systemd/system/ollama.service
```

Add the following content and save the file:

```ini
[Unit]
Description=OLLAMA Service
After=network-online.target

[Service]
User=<username>
ExecStart=/usr/local/bin/ollama serve
Restart=always
EnvironmentFile=-/home/<username>/.ollama/config
WorkingDirectory=/home/<username>

[Install]
WantedBy=multi-user.target
```

Next, enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```

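Configure which GPU devices OLLAMA uses and which address and port it listens on, so that it can be exposed over a public IP using port forwarding. OLLAMA reads these settings from environment variables, so put them in the `/home/<username>/.ollama/config` file that the unit file loads through `EnvironmentFile`, then restart the service with `sudo systemctl restart ollama`:

```bash
# Listen on all interfaces on port 8000 (the default is 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:8000
# GPU IDs that OLLAMA may use
CUDA_VISIBLE_DEVICES=0,1,2,3
```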
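Before moving on to the laptop, it can help to confirm that the server answers over the network. The sketch below is one way to do that, assuming the server exposes Ollama's `/api/tags` endpoint (which lists the pulled models) on the address and port you configured. The function names and the example address are hypothetical.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_remote_models(base_url: str, timeout: float = 5.0) -> list:
    """Ask a remote Ollama server which models it has pulled."""
    with urlopen(f"{base_url.rstrip('/')}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Example call (hypothetical address; replace with your server and port):
#   list_remote_models("http://192.0.2.10:8000")
```

If the call times out, the usual suspects are the listen address, the port forwarding rule, or a firewall between the laptop and the server.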

**Part 2: Windows Laptop Setup**

Install WSL and Ubuntu if not already done. Open an Ubuntu terminal, update the package lists, and install Open-WebUI:

```bash
sudo apt-get update && sudo apt-get upgrade -y
pip install open-webui
```

Point Open-WebUI at the OLLAMA service on the server using the public IP and port, and start it:

```bash
OLLAMA_BASE_URL=http://<server_public_ip>:<port> open-webui serve
```

Once it is running, Open-WebUI serves its interface at `http://localhost:8080` by default, providing a user-friendly interface for interacting with your language models.

By following this guide, you've successfully set up OLLAMA on a server with multiple GPUs and connected it to Open-WebUI on a Windows laptop using WSL, enabling seamless access to powerful AI models from anywhere.

Saturday, February 01, 2025

DSR1 = DSV3 + GRPO + VR

DSR1 is actually a reasoning model which also does chat. The surprise that it outperformed o1 and others comes from nobody having paid much attention to this company and its publications. After the DSR1 announcement, I found these two papers: DeepSeekMath (April, 24) and DeepSeekCoder (June, 24). Had I read them last year, I would have been waiting for the actual DSR1. In fact, the real contribution was in the V3 model from December, 24. DSR1 is easier to reproduce from DSV3. Getting to DSV3 is what is difficult, and it is amazing that they did it on older and crippled infra: 2.8M GPU hours, or roughly 2.8M × $3.5/hr = $9.8M. Compare that to the several billions spent on pre-training current proprietary and open source models.
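The cost estimate above is a single multiplication, so it is easy to redo with a different hourly rate. The sketch below uses the $3.5/hr rate from this post and, for comparison, the $2 per GPU hour rental rate that the DeepSeek-V3 technical report assumes. The function name is just for illustration.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rental-style cost estimate: GPU hours times hourly rate."""
    return gpu_hours * usd_per_gpu_hour


GPU_HOURS = 2.8e6  # rounded figure used in this post

# Rate used in this post
print(f"${training_cost_usd(GPU_HOURS, 3.5) / 1e6:.1f}M")  # prints $9.8M
# Rental rate assumed in the DeepSeek-V3 technical report
print(f"${training_cost_usd(GPU_HOURS, 2.0) / 1e6:.1f}M")  # prints $5.6M
```

Either way the result is in the millions, which is the point of the comparison with the billions spent elsewhere.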

To get to DSR1, you start with DSV3 and apply GRPO (defined in the DeepSeekMath paper). It is not SFT with human feedback (HF); it is RL with verified rewards (VR), with essentially no humans involved. The key contribution, GRPO, actually came out in April in the DSMath paper that is linked above.
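To make the "GRPO + VR" idea a bit more concrete, here is a minimal sketch of the group-relative advantage that GRPO is built around: sample several answers to the same prompt, score each with a rule-based check, and normalize each score against the group's mean and standard deviation. This is only the advantage step, not the full policy update (clipping and KL term are left out). The exact-match reward, the use of the population standard deviation, and the function names are assumptions made for illustration, not DeepSeek's code.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, reference: str) -> float:
    """Rule-based reward: 1.0 for an exact match with the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to "What is 6 * 7?" (made-up samples)
samples = ["42", "41", "forty-two", "42"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 0.0, 1.0]
print(advantages)  # correct answers get positive values, wrong ones negative
```

No learned reward model and no human labels appear anywhere in this loop, which is what "verified rewards" buys you.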

To reduce cost, they use FP8 and the DualPipe algorithm, which helps them reduce GPU memory consumption. Recall that successive GPUs from both Nvidia and AMD are simply adding more HBM to the GPU module. Also recall that the H800 was crippled by more than halving its NVLink (GPU-to-GPU) bandwidth. The training parameters they optimized for are the compute-to-communication ratio and near-zero all-to-all communication overhead in a GPU cluster. They avoided tensor-level parallelism and focused on traditional pipeline bubble removal. They developed middleware that optimized inter-node communication and understood whether the underlying transport was IB or NVLink.
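For readers who have not met the term, a "pipeline bubble" is the idle time while pipeline stages wait for work to arrive or drain. The sketch below uses the textbook estimate for a simple GPipe-style schedule, where the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. It is not a model of DualPipe itself, and the stage and micro-batch counts are made up.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a simple GPipe-style pipeline schedule."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


# More micro-batches per step shrink the bubble for a fixed number of stages
for m in (4, 12, 60):
    print(f"stages=4, microbatches={m}: {pipeline_bubble_fraction(4, m):.0%} idle")
```

The trend is the thing to notice: the bubble shrinks as more micro-batches are kept in flight, and smarter schedules try to shrink it further by overlapping computation with communication.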

In summary, all the cost efficiencies are standard HPC techniques, and their key innovations were published months before they achieved DSR1. The only difference is that they believed in their approach, while those who actually read the papers in April 2024 were fixated on needing more GPUs to pre-train and ignored their innovations.

The constrained infra in China pushed them in this direction, but now we should thank that constraint, because we now know that even models exhibit emergent behavior when trained in a constrained environment. You want sweet grapes, don't water the vine!

Friday, January 31, 2025

DeepSeek is On-Prem and the real value is between "<think>" and "</think>"

High-Flyer is a hedge fund which was an early adopter of GPU acceleration in finance. The expertise they built in that field helped them launch a subsidiary, DeepSeek, which recently released R1 after two other LLMs. Everybody now knows what DeepSeek is, but not many know that it is actually not running in any cloud. It is all on-prem.

In China, you can't get the H100, but you can get the H800, which is bandwidth limited, and the H20, which is a crippled version but enough to train and infer a 671B-parameter model. Some suggest that DeepSeek innovated due to constraints. That may be, but it could just as well be that they figured out first that one does not need to load the whole model into memory for inference. The latest figures are that DSR1 only activates about 37B parameters per token, which you can load on a laptop GPU with only 12GB of VRAM.

Since the news on DSR1 hit the wires, consumer-grade GPUs have been flying off the shelves. It is super easy to run DSR1 using ollama (literally just type: ollama run deepseek-r1). Most of the value I derive from DSR1 is the bit inside the <think> tags. For example, I asked DSR1 how to compress and load an LLM into VRAM. The answer was obvious, as shown below:

In summary, my approach would be:
1. Start by quantizing the model parameters to lower bit precision (e.g., 8-bit or 16-bit).
2. Apply pruning to remove unnecessary weights after quantization.
3. Use techniques like dynamic or static quantization for further compression during inference.
4. Implement checkpointing strategies to optimize memory usage during the inference process.
5. Possibly combine with other methods like knowledge distillation if there's excess capacity.
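Step 1 of that answer is the easiest to try by hand. The sketch below shows symmetric 8-bit quantization of a small list of weights: pick one scale so the largest weight maps to 127, round everything to integers, and multiply back to see the error. Real toolchains work per tensor or per block and are far more elaborate; the weights and function names here are made up for illustration.

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric 8-bit quantization: one scale shared by the whole list."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list, scale: float) -> list:
    """Map the 8-bit integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.31, -1.0, 0.07, 0.0, 0.62]  # made-up weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"worst-case error: {worst_error:.5f} (at most half a step, {scale / 2:.5f})")
```

Each weight now fits in one byte instead of two or four, at the price of a rounding error of at most half a quantization step.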

The real insight, though, was in between the <think> tags. Some of the tips there are knowledge distillation (not the same as model distillation) and the use of fixed-point math (something we did like 30 years ago to draw pictures using PostScript). I particularly liked the way it went from the top-of-the-head response of quantization to fixed point to knowledge distillation, and went on to consider (and eventually reject) sparsity techniques.
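If you want to read that reasoning separately from the final answer, a small sketch like the one below splits the two. It assumes the model output wraps its reasoning in a single `<think>...</think>` pair, as the DSR1 output described here does. The function name and the sample text are made up.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) from output containing one <think> block."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample_output = (
    "<think>Quantization first... maybe fixed point... sparsity? "
    "Probably not worth it.</think>\n"
    "In summary, my approach would be: quantize, prune, checkpoint."
)
reasoning, answer = split_reasoning(sample_output)
print("REASONING:", reasoning)
print("ANSWER:", answer)
```

Output without any `<think>` block comes back unchanged as the answer, with an empty reasoning string.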

Sunday, October 27, 2024

AI IDEs - Do you need it?

AI code-generation IDEs like Replit and Cursor.sh, and plugins for vscode, are all the rage nowadays. I blogged earlier on using continue.dev in vscode with an online model for code generation.

They are all priced between $12 and $20/month and require a subscription (API key) to an inference engine (OpenAI) on top of that. You can avoid all these fees if you run your model locally. I tried that using ollama locally on my laptop and picked a 7B or smaller model (I like codestral). It is slower but completely usable.

That works well for hobbyists, but what about teams in software factories? Writing code as a group is a different workflow. GenAI-powered tools will need to evolve to fit the team workflow, e.g. doing a diff between two code bases, or increasing the code context to include the whole team's code and not just my code base.

There is also a thing I noticed: it works well for 3rd and 4th generation languages, but not for assembly/C. It does not understand compiler optimizations yet. Charging $20/month for this early-alpha type of functionality is not worth it imho. You can get equivalent general-purpose code generation for free without seeing this error.

The impressive part of code IDEs is not that they generate code; it is that they can summarize your code for comments and audits. That may be the use case we want to focus on.

Saturday, September 14, 2024

AI AlterEgo

The killer application for AI is to enable expert profiles in enterprise and productivity applications. These are not bots that help you get through mundane tasks; these are profiles that the application consults to provide assistance in using the application. This is akin to expert levels in gaming.

Take, say, an IDE. Today, the profile that the application stores on the user is mainly there to collect credentials and secure access to outside storage and other artifacts. With GenAI, these profiles can be based on other users, experts or just AI itself. If you admire the writing style of someone, then assuming the person is willing to sell/share their profile, you can import that profile and the application can use the profile's style to generate your content.

The IDE is the easiest way to understand this concept, but you can imagine how intelligent profiles can be used in every sphere where applications perform the majority of the mundane tasks. For example, the world's best trader can export his/her stock trading profile, and you can use it in your brokerage application to receive recommendations for trades which you otherwise would not have entered into, as you would not have seen the opportunity.

All of this is essentially creating an alter ego of yourself. Now everybody can become a rock star!

Saturday, August 03, 2024

Where is the productivity in AI? Try this!

For some reason, the mainstream is now asking for proof of productivity from AI. There are some skeptics. Let me show you how it increases my productivity as a developer.

As Easy as 123

Using the continue code assistant, I was able to build, with very little help, an application that uses streamlit for UX, MySQL for the DBMS and LangChain for chaining model and logic. The ease with which I can now "talk" to my tables in the DBMS makes MySQL Workbench kind of obsolete. For run-of-the-mill DBMS reporting, we don't need to use any expensive human talent to get it done. This is a boost in productivity for anyone who has to back up an argument with data. This is probably why Snowflake acquired streamlit. The productivity gain is astronomical, as I don't have to keep searching for the "right" syntax. I barely check the API reference, as the code tells me which method and object I need.
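To give a feel for what "talking" to a table boils down to, here is a minimal sketch of the last link in such a chain: take the SQL a model produced for a question, refuse anything that is not a single SELECT, run it, and hand back rows. It is not the application described above. It uses Python's built-in sqlite3 as a stand-in for MySQL, the table and data are made up, and the "generated" SQL is hard-coded where a model call would normally go.

```python
import sqlite3


def answer_with_sql(conn: sqlite3.Connection, generated_sql: str) -> list:
    """Run model-generated SQL, allowing a single read-only SELECT only."""
    statement = generated_sql.strip().rstrip(";")
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    cursor = conn.execute(statement)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical table and data, in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# SQL a model might return for "What are total sales per region?"
generated_sql = (
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region;"
)
rows = answer_with_sql(conn, generated_sql)
print(rows)
```

The SELECT-only guard is deliberately crude. With a real database, a read-only account is the safer way to let generated SQL run.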

Agents R US

While I used chains and linked them together to get the end result, I could have created distinct semi-autonomous agents which would get the information as it updates and report the state in real time. This type of work takes weeks today in an organization, and it can now be done in hours.

PaaS This!

You need a source of data, a connector to read/write the data and an execution environment which allows for the use of models from many sources (using their API keys) and frameworks to keep all these components and their state in sync. This is not done in an IaaS setting; this needs a PaaS. No wonder you can't get away from HuggingFace. No wonder GitHub Models was just announced as competition to HF.

Models are 4Ever:

APIs come and go, but models are forever. I have used three different models in a single application and am paying no more than 2.5 cents for 10K tokens. I am beginning to wonder if they actually make money providing me this service. Let's look at their investment: loading a typical model (like GPT) requires 144 GPUs, at 750W per GPU. Most cannot afford this, so they go for a model that fits into a single system, but a single system needs to be configured for GPU passthrough to the VMs without any bloatware from K8s. We are looking at a hard requirement that a single node offer a performance of 200 TFLOPs as a minimum.
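The numbers in that paragraph are easier to weigh once they are in the same units. The sketch below only does the unit conversions, using the figures from this post (144 GPUs, 750W each, 2.5 cents per 10K tokens). It makes no assumption about electricity prices or throughput.

```python
GPUS = 144
WATTS_PER_GPU = 750
CENTS_PER_10K_TOKENS = 2.5

# Power draw of the GPUs alone (no CPUs, cooling or networking)
power_kw = GPUS * WATTS_PER_GPU / 1000
kwh_per_day = power_kw * 24

# Price I pay, scaled up to a million tokens
usd_per_million_tokens = CENTS_PER_10K_TOKENS * (1_000_000 / 10_000) / 100

print(f"GPU power draw: {power_kw:.0f} kW, {kwh_per_day:.0f} kWh per day")
print(f"Price: ${usd_per_million_tokens:.2f} per million tokens")
```

So the provider runs on the order of a hundred kilowatts of GPUs while charging a couple of dollars per million tokens, which is why the question of whether they make money is a fair one.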

Larger models with 400B+ parameters are now called giants. We need giants because they keep the context around for longer and capture deeper relationships between parameters. But these giants shouldn't share context across tenants. I believe currently they do.

Saturday, April 20, 2024

Llama 3 - More ways to run it, but still nothing new

Llama 3 is out and getting to it can be a challenge. The approval email's URL expires in 24 hours. It can take 8 hrs to download. But after the download from Meta, it can be used locally in text-generation-webui. This time it has hosted versions on HuggingChat and Meta itself. It says its training stopped in 2021, so it continues to think the PM of the UK is Boris. But it believes it is more conversational.

When asked how many params it is trained on, it initially said 1.5B. Then I asked again and it changed its mind.

Using ollama to run llama-3, I get better answers.

On text-generation-webui, the model does not load except when you pick transformers as the loader. And the chat is not fully functional.

After converting to GGUF,

LM Studio is the best one of these for now.