AI is everywhere. In fact, it has become a bit of an unwanted party guest – everyone is talking about it, but not everyone wants to invite it in. And I get it – it’s here to stay, but its presence raises privacy concerns.
Do you really know if your AI powered snow blower isn't silently judging you for not clearing your driveway 3 days in a row? - Mistral (7B)
This article explains how to run a Large Language Model (LLM) locally on your Raspberry Pi (or any server, really), ditching online services and keeping your personal data (and your snow blower) out of the spotlight. We are going to take a look at the performance of five different models, running on a Raspberry Pi 4 and a Raspberry Pi 5.
Overview
Hardware
Raspberry Pis are small, low-cost single-board computers designed for education and DIY projects. They run on an ARM architecture, support various peripherals, and operate using an open-source Debian-based OS, making them versatile tools for programming, software development, and IoT applications. For this article I have tested the top-end spec for both generations.
|  | Raspberry Pi 4 | Raspberry Pi 5 |
| --- | --- | --- |
| CPU | Quad-core Cortex-A72, 1.5 GHz | Quad-core Cortex-A76, 2.4 GHz |
| RAM | 8GB LPDDR4 | 16GB LPDDR4X |
| Storage | 128GB MicroSD | 128GB MicroSD |
Software
This is the software used to run a Large Language Model locally.
- Docker
- Open WebUI + Ollama
Large Language Models
LLMs differ from one another in multiple ways. They can vary in parameter count, training data and purpose. Choosing the right model for your use case is important for the best results. In this article I will compare the performance of five different models, each with its own strengths and weaknesses.
| Model | Description |
| --- | --- |
| tinyllama (1B) | A very lightweight LLM for use in low-power & edge devices |
| llama3.2 (3.2B) | An LLM by Meta, condensed down to 3.2 billion parameters |
| gemma3 (4.3B) | An LLM by Google, built on Gemini |
| mistral (7.2B) | A 7 billion parameter LLM by Mistral AI |
| phi4 (14B) | An LLM by Microsoft with 14 billion parameters |
You can find these and more Large Language Models here: https://ollama.com/
Installation
- Install Docker
  - Automated installation script: https://get.docker.com/
  - Manual installation: https://docs.docker.com/engine/install/
- Start the Open WebUI container
  - GitHub repository: https://github.com/open-webui/open-webui
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
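Note that this command starts Open WebUI only; the --add-host flag expects an Ollama instance to be reachable on the host itself. If you have not installed Ollama yet, here is a minimal sketch assuming you want it running natively on the Pi (the bundled ghcr.io/open-webui/open-webui:ollama image is an alternative):

# Install Ollama with the official install script (sets up the ollama service)
curl -fsSL https://ollama.com/install.sh | sh

# Verify that the service answers on its default port 11434
curl http://localhost:11434/api/version

# Depending on your setup, Ollama may need OLLAMA_HOST=0.0.0.0 so the
# Open WebUI container can reach it via host.docker.internal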
Once the Open WebUI image has been downloaded and the container is running, you should be able to access Open WebUI on port 3000.
http://YOUR_IP:3000
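If the page does not load, a quick way to check whether the container came up properly (a generic Docker sketch, nothing specific to Open WebUI):

docker ps --filter name=open-webui    # status should read "Up"
docker logs -f open-webui             # follow the startup log and watch for errors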
On first launch, Open WebUI will ask you to create an account. According to Open WebUI’s documentation, the account is only created locally and your data is not transmitted to their servers. The account exists to protect the administrative settings.
Q: Why am I asked to sign up? Where are my data being sent to?
A: We require you to sign up to become the admin user for enhanced security. This ensures that if the Open WebUI is ever exposed to external access, your data remains secure. It's important to note that everything is kept local. We do not collect your data. When you sign up, all information stays within your server and never leaves your device. Your privacy and security are our top priorities, ensuring that your data remains under your control at all times.
Source: https://docs.openwebui.com/faq - 11.07.2025
After logging in, you are presented with a prompt screen. However, if you were to prompt the tool now, you’d be greeted by an error message saying that you do not have a model selected.

To download a model, visit https://ollama.com/ and browse for a model that fits your needs. Then choose “Select a model” in Open WebUI and pull it. Once the download has completed, you are ready to start prompting.
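If you prefer the command line over the model selector, and assuming Ollama is installed on the host as sketched above, the same models can also be pulled directly:

ollama pull llama3.2    # pulls the default 3B tag
ollama pull mistral
ollama list             # shows all locally available models and their size on disk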

Quick performance enhancements
Since the Raspberry Pi is by no means a fast computer, here are some settings you can change to improve performance. These settings will also improve performance on any other system.
- Use a smaller model
  - A 3B model will run a lot quicker than a 12B model (see the sketch after this list for checking model sizes against available memory)
- Disable convenience options
  - Go to Admin Panel > Settings > Interface and disable the following options:
    - Follow Up Generation
    - Tag Generation
    - Retrieval Query Generation
    - Web Search Query Generation
    - Autocomplete Generation
    - (Title Generation) – If you disable this, all your chats will be called “New Chat”
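To get a feel for which models will actually fit before pulling them, a rough sketch is to compare the model sizes Ollama reports against the memory the Pi has left:

free -h         # available RAM; the model plus its context has to fit here
ollama list     # on-disk size of each pulled model, a rough lower bound for its RAM usage
ollama ps       # models currently loaded into memory and how much they occupy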
Testing different models
The tests will be done with the following prompt:
You are a professional salesperson. You are tasked to convince the owner of a start-up on implementing AI into their business. You have a maximum of 100 words. You are not allowed to lie. You are not allowed to make up facts. You need to stay neutral. If you are citing facts you need to supply sources. Sources do not count towards your 100 word limit.
The numbers – what do they mean?
- model (5B) – The number of parameters in a model. Larger models contain more parameters but require more computational power
- token – A token is an individual piece of text (words & characters) that is processed during text analysis and generation
- prompt_token – A token generated from a prompt. Each prompt is tokenized prior to analysis
- response_token – A token generated as part of a response. Each response is made up of multiple tokens
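These statistics come straight from Ollama. A minimal sketch for reproducing them yourself, assuming Ollama's default port 11434 and an already pulled model such as mistral:

# Option 1: let Ollama print its timing statistics after the response
ollama run mistral --verbose "Explain what a token is in one sentence."

# Option 2: query the API and compute tokens per second from the raw counters
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Explain what a token is in one sentence.", "stream": false}' \
  | jq '{prompt_tps: (.prompt_eval_count / (.prompt_eval_duration / 1e9)),
         response_tps: (.eval_count / (.eval_duration / 1e9))}'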
Performance
The table below shows the tokens per second for the prompt and the response. A higher number equals better performance.
| Large Language Model | Raspberry Pi 4 (8GB) prompt_token/s / response_token/s | Raspberry Pi 5 (16GB) prompt_token/s / response_token/s |
| --- | --- | --- |
| tinyllama (1.1B) | 13.6 / 5.2 | 27.13 / 17.57 |
| llama3.2 (3B) | 4.17 / 1.78 | 9.36 / 5.41 |
| gemma3 (4B) | 3.61 / 1.46 | 8.14 / 4.64 |
| mistral (7.2B) | 1.85 / 0.9 | 3.89 / 2.39 |
| phi4 (14B) | won’t run, not enough memory available | 1.88 / 1.25 |
Output Usability
Finding the right model for your use case is important since not all models are created equal. Smaller models may run faster but can also generate less accurate responses. Here are two examples:
| Model | Prompt | Response | Issue |
| --- | --- | --- | --- |
| gemma3 (4B) | Does any Raspberry Pi have a built in microphone? | Yes, some Raspberry Pi models have a built-in microphone. Here’s the breakdown: Raspberry Pi 4 Model B: This is the most common model with a built-in microphone. It utilizes a microphone array for better voice recognition and noise cancellation. Raspberry Pi 400: This keyboard-computer version of the Raspberry Pi 4 also includes a built-in microphone. Raspberry Pi Zero 2 W: This compact model does have a built-in microphone, although its quality isn’t as high as the Pi 4’s. | None of the available Raspberry Pi models have a built-in microphone |
| tinyllama (1.1B) | Please scramble a random word from the dictionary. I want to guess the word. | Sure, here’s a random word from the dictionary: “Astronaut“! Let me help you guess it! […] | I’ll take a wild guess and say the word was “Astronaut“ |
On my Raspberry Pi 5 I usually run Mistral (7B), as it seems to be the best performance-to-accuracy compromise for this tiny device. On the Raspberry Pi 4 I would not run anything larger than gemma3 (4B).
Privacy
The majority of users opt for cloud-based AI services due to their convenience and efficiency. However, it is crucial to recognize that this choice may have significant privacy implications, not only for the user but also for others. Data entered into a cloud-based AI system, which may include personal, professional, or financial information, risks being collected, analyzed, and even used by the service provider to improve the AI model itself.
Running a large language model locally or on a privately-owned server offers the advantage of maintaining exclusive control over the data utilized in conversations, thereby eliminating said privacy concerns. However, it is essential to acknowledge that this approach may be accompanied by reduced performance or less precise responses when compared to cloud-based solutions.
Privacy Proof
Trusting that a program respects your privacy and does not communicate with the outside world is nice, but proof is better. I rebooted my server and ran tcpdump while using Open WebUI to see what kind of communication I could observe. I used the same prompt as in the performance test above, but continued the conversation for a few more messages.
tcpdump command
tcpdump -i eth0 -s 65535 -w openwebui.pcap
Wireshark Filter
((((ip.src == 192.168.0.7) && !(ip.dst == 192.168.0.0/24)) || ((ip.dst == 192.168.0.7) && !(ip.src == 192.168.0.0/24))) && !(_ws.col.protocol == "NTP") && !(_ws.col.protocol == "MDNS") && !(_ws.col.protocol == "IGMPv3"))
Explanation:
- Source or destination: traffic from or to my LLM server (192.168.0.7) where the other endpoint is outside the local network (not in 192.168.0.0/24)
- Protocol: NTP, MDNS and IGMPv3 are filtered out
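The same filter can be applied on the command line as well, which is convenient on a headless Pi. A sketch assuming tshark (Wireshark's CLI counterpart) is installed:

tshark -r openwebui.pcap \
  -Y '((((ip.src == 192.168.0.7) && !(ip.dst == 192.168.0.0/24)) || ((ip.dst == 192.168.0.7) && !(ip.src == 192.168.0.0/24))) && !(_ws.col.protocol == "NTP") && !(_ws.col.protocol == "MDNS") && !(_ws.col.protocol == "IGMPv3"))' \
  -T fields -e ip.src -e ip.dst -e frame.len    # print only source, destination and packet size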

| IP | Domain | Traffic | Description |
| --- | --- | --- | --- |
| 13.226.155.93 | huggingface.co | 300 bytes, 4 packets | Gets called during container startup; not entirely captured as the container was already starting |
| 140.82.121.5 | api.github.com | 12 kB, 19 packets | Probably update checks |
This packet capture shows that there is minimal communication going on while using Open WebUI. It’s safe to say that your prompts do not leave your device.
Conclusion
Running a Large Language Model locally is not as hard as it may seem. While the performance cannot compete with services like OpenAI or Perplexity, the advantage of enhanced privacy makes it a worthwhile effort.