Why?

I wanted to set up a local LLM-based chatbot for software development purposes in order to try out the capabilities of (smaller) models and to be able to develop without being tied to external services.

What you need:

  • Hardware / Configuration
    • local (server) system, e.g. Windows 11, Linux, …
    • NVIDIA graphics card with enough VRAM to run the (large) language models smoothly
    • proper hardware setup of your local network (see potential issues in the troubleshooting section)
  • Software
    • Ollama runtime on the server
    • latest drivers for the graphics card
    • CUDA toolkit to utilize the GPU for parallelizable calculations
    • NVIDIA command line process monitor nvidia-smi

Network setup

  • Any network firewalls must be properly configured to allow traffic between the VLANs/subnets where the clients and the host reside.

  • The firewall of the host must be configured to allow incoming traffic to the port where Ollama is listening. It might be advantageous to restrict the source IP addresses to a subnet where your development machine is located.

netsh advfirewall firewall add rule name="Ollama incoming requests" dir=in action=allow protocol=TCP localport=11434 profile=any
  • Windows must be configured to route the traffic internally:
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=11434 connectaddress=127.0.0.1 connectport=11434
  • In the system environment variables, the Ollama host should be configured: OLLAMA_HOST=0.0.0.0
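Instead of going through the system settings dialog, the variable can also be set from a terminal; a sketch using the built-in setx command (note that setx only affects newly started processes, so restart your terminal and Ollama afterwards):

```shell
# Make Ollama listen on all interfaces (takes effect in new processes only)
setx OLLAMA_HOST "0.0.0.0"
```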

Ollama Installation & IDE (zed, kilocode, VSCode, …)

On Windows, the Ollama application is started automatically at login anyway. Since I prefer to be in control of what is actually happening, I removed the Ollama link in the Windows startup folder.

From the command line (e.g. PowerShell), you can run Ollama with ollama serve. Ollama then starts, and you can check out the service API at http://localhost:11434. It should report that Ollama is running.
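A quick way to verify this is a plain HTTP request; the root path and /api/version are standard Ollama endpoints (curl.exe is assumed to be available, as it is on recent Windows versions):

```shell
# Should print "Ollama is running"
curl.exe http://localhost:11434/
# Reports the installed Ollama version as JSON
curl.exe http://localhost:11434/api/version
```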

Before you can run a model, you need to download (pull) it from the web to your local storage. I chose the following three models:

  • ollama pull mistral
  • ollama pull qwen2.5-coder:7b
  • ollama pull gemma3:4b
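Besides the interactive CLI, the pulled models can also be queried over the REST API. A minimal, non-streaming request against the /api/generate endpoint might look like this (assuming the mistral model from above is present; run it from PowerShell or a Unix-style shell so the single-quoted JSON is passed as one argument):

```shell
# One-shot completion via the Ollama REST API (streaming disabled)
curl.exe http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a haiku about local LLMs.",
  "stream": false
}'
```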

You can list the locally available models by running ollama list, and ollama ps shows the currently loaded model(s).

Then you can start an interactive chat with a model by running, e.g., ollama run mistral. You can check whether the hardware is utilized properly by monitoring the usage of the GPU: nvidia-smi -l 2.

Afterwards, you can set up the (local) Ollama service as a model provider in your IDE. But no one said this would be easy :-). Hence, I collected a list of issues that I encountered in the section below.
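As an example, in zed the provider is configured in settings.json; the exact keys depend on your zed version, so treat this as a sketch that simply points the IDE at the local endpoint (replace localhost with the server's IP when developing from another machine):

```json
{
  "language_models": {
    "ollama": {
      "api_url": "http://localhost:11434"
    }
  }
}
```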

Happy LLMing.

Troubleshooting

  • Potential error messages when starting Ollama or interacting with the Ollama API:
Error: listen tcp 127.0.0.1:11434:
Error: Head "http://127.0.0.1:11434/": EOF
  • Restart the host network service
net stop hns
net start hns
  • Find existing Ollama processes and stop them:
Get-Process | Where-Object { $_.ProcessName -match 'llama' } | Stop-Process -Force
  • After updating Ollama, sometimes the port is blocked or still in use by other processes; you need to find them:
netstat -ano | findstr :11434
netstat -ano | findstr :11434 | findstr 'LISTEN'

From the output, you can identify the PID, e.g. 4684. Then you can stop the process:

tasklist /svc /FI "PID eq 4684"
taskkill /F /PID 4684
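On PowerShell, the netstat/tasklist/taskkill steps above can be combined into a single pipeline. A sketch, assuming Get-NetTCPConnection is available (it ships with Windows 8/Server 2012 and later):

```shell
# Stop whatever process is listening on Ollama's port
Get-NetTCPConnection -LocalPort 11434 -State Listen |
    ForEach-Object { Stop-Process -Id $_.OwningProcess -Force }
```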