Friday, May 23, 2025

Local LLM using Ollama and open-webUI

I have a local server with Nvidia GPUs, which I bought off eBay for $800. The GPUs are RTX cards, and there are four of them in the server. I run Ollama on it and have downloaded a few models, which I mainly use to ask how to configure other applications. On my laptops I run Docker containers with open-webui and ollama-webui as front ends. Open-webui allows saving chats, while ollama-webui does not.



The performance is about as good as any online LLM service. I routinely get 20+ tokens/sec, so there is no long wait for an answer, and I can stop a response midway and add information. In other words, it is usable.
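For anyone who wants to check a tokens/sec figure on their own hardware: Ollama's API responses report `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), and `ollama run <model> --verbose` prints the resulting rate directly. Below is a minimal sketch of the arithmetic. The helper name `tokens_per_sec` is made up for this example, and the numbers are illustrative rather than measured.

```bash
# Convert Ollama's counters into tokens per second.
# $1 = eval_count (generated tokens), $2 = eval_duration (nanoseconds)
tokens_per_sec() {
  local eval_count="$1" eval_duration_ns="$2"
  awk -v n="$eval_count" -v d="$eval_duration_ns" \
    'BEGIN { if (d <= 0) { print "0.0"; exit } printf "%.1f\n", n / (d / 1e9) }'
}

# Illustrative input: 200 tokens generated in 10 seconds
tokens_per_sec 200 10000000000   # prints 20.0
```

Feeding it the two fields from a real response gives the same number that `--verbose` shows as the eval rate.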




All the models are free and open source, so I pay nothing to use them. The only out-of-pocket expense was the server that I purchased. I could have built one around a $99 two-socket Intel motherboard (from AliExpress), but I did not have the time, so I bought it from a student on eBay.

The answers are accurate and not all that different from those of the online versions. Give it a try! It is fun and really, really cheap.


Appendix

I got the setup working and then asked the model to write up, in blog fashion, all the steps necessary. I haven't validated them, but they look roughly like what I did.


 Title: Setting Up OLLAMA and Open-WebUI Across Server and Windows Laptop


In this blog post, we'll walk through the process of setting up Ollama, a tool for running large language models locally, on a server equipped with four NVIDIA GPUs, and connecting it to Open-WebUI, an intuitive user interface for large language models, on a Windows laptop using WSL (Windows Subsystem for Linux).


**Part 1: Server Setup**

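First, let's set up Ollama on the server. Begin by updating the system packages, then install Ollama with its official install script. Ollama itself is not installed with pip; the `ollama` package on PyPI is only a Python client library:

```bash
sudo apt-get update && sudo apt-get upgrade -y
curl -fsSL https://ollama.com/install.sh | sh
```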

Now pull your favorite models from the Ollama model library. Here, we'll use DeepSeek, Mistral, and Qwen:


```bash
# No model is published as plain "deepseek"; deepseek-r1 is one DeepSeek tag in the Ollama library
ollama pull deepseek-r1
ollama pull mistral
ollama pull qwen
```

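Ollama's official Linux install script usually registers an `ollama` systemd service on its own. To define or adjust the unit by hand, edit the unit file in `/etc/systemd/system`:

```bash
sudo nano /etc/systemd/system/ollama.service
```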

Add the following content and save the file:

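```ini
[Unit]
Description=Ollama Service
# Wait for networking; there is no nvidia-smi.service unit to depend on
After=network-online.target
Wants=network-online.target

[Service]
User=<username>
# "serve" is the subcommand that starts the Ollama server
ExecStart=/usr/local/bin/ollama serve
Restart=always
# Optional file of KEY=VALUE settings; the leading "-" means it may be absent
EnvironmentFile=-/home/<username>/.ollama/config
# The leading "-" keeps a missing directory from failing the service
WorkingDirectory=-/home/<username>/.cache/ollama

[Install]
WantedBy=multi-user.target
```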


Next, enable and start the service:


```bash

sudo systemctl daemon-reload

sudo systemctl enable ollama

sudo systemctl start ollama

```
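A quick way to confirm the service is answering, sketched under the assumption that Ollama is still listening on its default address `127.0.0.1:11434`. The helper names `ollama_endpoint` and `check_ollama` are made up for this example; `/api/tags` is the Ollama API route that lists locally available models.

```bash
# Build an Ollama API URL; host and port fall back to Ollama's defaults.
# $1 = API path, $2 = host (optional), $3 = port (optional)
ollama_endpoint() {
  local path="$1" host="${2:-127.0.0.1}" port="${3:-11434}"
  printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Ask the server for its model list; print a hint on stderr if it does not answer.
# Optional arguments: host and port, passed through to ollama_endpoint.
check_ollama() {
  local url
  url="$(ollama_endpoint /api/tags "$@")"
  curl --silent --fail --max-time 5 "$url" \
    || echo "Ollama did not answer at $url" >&2
}
```

Running `check_ollama` on the server should print a JSON list of the pulled models. From another machine, `check_ollama <server_ip> <port>` performs the same check against whatever address the server has been configured to listen on.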

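Configure Ollama to use the GPU devices and to listen on an address other machines can reach, so it can be exposed over a public IP using port forwarding. Ollama reads these settings from environment variables, so write them as `KEY=VALUE` lines in `/home/<username>/.ollama/config` (the file the unit loads through `EnvironmentFile`), then run `sudo systemctl restart ollama`:

```bash
# Listen on all interfaces, port 8000 (the default is 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:8000

# NVIDIA GPU IDs that Ollama may use
CUDA_VISIBLE_DEVICES=0,1,2,3
```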
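GPU IDs are zero-based, so a four-GPU server uses 0 through 3. Here is a small sketch for building that list from a GPU count. The helper name `gpu_id_list` is hypothetical, and taking the count from `nvidia-smi --list-gpus` (one output line per GPU) is an assumption about how you would want to discover it.

```bash
# Print a comma-separated list of zero-based GPU IDs for a given GPU count.
gpu_id_list() {
  local count="$1" ids="" i=0
  # Reject empty, non-numeric, or non-positive counts
  [ "$count" -gt 0 ] 2>/dev/null || return 1
  while [ "$i" -lt "$count" ]; do
    ids="${ids:+$ids,}$i"
    i=$((i + 1))
  done
  printf '%s\n' "$ids"
}

gpu_id_list 4   # prints 0,1,2,3

# On the server, the count could come from nvidia-smi:
#   gpu_id_list "$(nvidia-smi --list-gpus | wc -l)"
```

Listing fewer IDs than the machine has is a simple way to keep some GPUs free for other work.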


**Part 2: Windows Laptop Setup**


Install WSL and Ubuntu if not already done. Open an Ubuntu terminal, update the packages, and install Open WebUI, which is published on PyPI as `open-webui`:


```bash
sudo apt-get update && sudo apt-get upgrade -y
pip install open-webui
```

Start Open WebUI and point it at the Ollama service on the server through the `OLLAMA_BASE_URL` environment variable, using the public IP and port:


```bash
OLLAMA_BASE_URL=http://<server_public_ip>:<port> open-webui serve
```


Once it is running, Open-WebUI serves a user-friendly interface for interacting with your language models in the browser, by default at `http://localhost:8080`.
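The post itself mentions running Open-WebUI in Docker containers on the laptops rather than through pip. A sketch of that variant follows, based on the `docker run` form from the Open WebUI README. The helper name `open_webui_docker_cmd` and the sample address `192.0.2.10:8000` are hypothetical, and the function only prints the command so it can be inspected before anything is started.

```bash
# Print (not run) a docker command that starts Open WebUI against a remote Ollama.
# $1 = Ollama server host, $2 = Ollama port (defaults to Ollama's 11434)
open_webui_docker_cmd() {
  local host="$1" port="${2:-11434}"
  printf 'docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=http://%s:%s -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main\n' \
    "$host" "$port"
}

# Sample address only; substitute the real server IP and port
open_webui_docker_cmd 192.0.2.10 8000
```

With the port mapping shown, the interface would be reachable at `http://localhost:3000`, and the named volume keeps saved chats across container restarts.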


By following this guide, you've successfully set up OLLAMA on a server with multiple GPUs and connected it to Open-WebUI on a Windows laptop using WSL, enabling seamless access to powerful AI models from anywhere.

