AI is everywhere. In fact, it has become a bit of an unwanted party guest – everyone is talking about it, but not everyone wants to invite it in. And I get it – it’s here to stay, but its presence raises privacy concerns.
Do you really know if your AI powered snow blower isn't silently judging you for not clearing your driveway 3 days in a row? - Mistral (7B)
This article explains how to run a Large Language Model (LLM) locally on your Raspberry Pi (or any server, really), ditching online services and keeping your personal data (and your snow blower) out of the spotlight. We are going to take a look at the performance of five different models, running on a Raspberry Pi 4 and a Raspberry Pi 5.
Overview
Hardware
Raspberry Pis are small, low-cost single-board computers designed for education and DIY projects. They run on an ARM architecture, support various peripherals, and operate using an open-source Debian-based OS, making them versatile tools for programming, software development, and IoT applications. For this article I have tested the top-end spec for both generations.
|  | Raspberry Pi 4 | Raspberry Pi 5 |
| --- | --- | --- |
| CPU | Quad-core Cortex-A72, 1.5 GHz | Quad-core Cortex-A76, 2.4 GHz |
| RAM | 8GB LPDDR4 | 16GB LPDDR4X |
| Storage | 128GB MicroSD | 128GB MicroSD |
Software
This is the software used to run a Large Language Model locally.
- Docker
- Open WebUI + Ollama
Large Language Models
LLMs differ from one another in multiple ways. They can vary in parameter count, training data and purpose. Choosing the right model for your use case is important for the best results. In this article I will compare the performance of five different models, each with its own strengths and weaknesses.
| Model | Description |
| --- | --- |
| tinyllama (1B) | A very lightweight LLM for use in low-power & edge devices |
| llama3.2 (3.2B) | An LLM by Meta, condensed down to 3.2 billion parameters |
| gemma3 (4.3B) | An LLM by Google, built on Gemini |
| mistral (7.2B) | A 7 billion parameter LLM by Mistral AI |
| phi4 (14B) | An LLM by Microsoft with 14 billion parameters |
You can find these and more Large Language Models here: https://ollama.com/
Installation
- Install Docker
  - Automated installation script: https://get.docker.com/
  - Manual installation: https://docs.docker.com/engine/install/
- Start the Open WebUI container
  - GitHub repository: https://github.com/open-webui/open-webui
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
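Note that this command starts Open WebUI only; the --add-host flag expects an Ollama instance to be reachable on the host itself. If you have not installed Ollama yet, here is a minimal sketch assuming you want it running natively on the Pi (the bundled ghcr.io/open-webui/open-webui:ollama image is an alternative):

# Install Ollama with the official install script (sets up the ollama service)
curl -fsSL https://ollama.com/install.sh | sh

# Verify that the service answers on its default port 11434
curl http://localhost:11434/api/version

# Depending on your setup, Ollama may need OLLAMA_HOST=0.0.0.0 so the
# Open WebUI container can reach it via host.docker.internal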
Once the Open WebUI image has been downloaded and the container is running, you should be able to access Open WebUI on port 3000.
http://YOUR_IP:3000
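If the page does not load, a quick way to check whether the container came up properly (a generic Docker sketch, nothing specific to Open WebUI):

docker ps --filter name=open-webui    # status should read "Up"
docker logs -f open-webui             # follow the startup log and watch for errors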
On first launch, Open WebUI will ask you to create an account. According to Open WebUI’s documentation, the account is only created locally and your data is not transmitted to their servers. The account exists to protect the administrative settings.
Q: Why am I asked to sign up? Where are my data being sent to?
A: We require you to sign up to become the admin user for enhanced security. This ensures that if the Open WebUI is ever exposed to external access, your data remains secure. It's important to note that everything is kept local. We do not collect your data. When you sign up, all information stays within your server and never leaves your device. Your privacy and security are our top priorities, ensuring that your data remains under your control at all times.
Source: https://docs.openwebui.com/faq - 11.07.2025
After logging in, you are presented with a prompt screen. However, if you were to prompt the tool now, you’d be greeted by an error message saying that you do not have a model selected.

To download a model, visit https://ollama.com/ and browse for a model that fits your needs. Then choose “Select a model” in Open WebUI and pull it. Once the download has completed, you are ready to start prompting.
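If you prefer the command line over the model selector, and assuming Ollama is installed on the host as sketched above, the same models can also be pulled directly:

ollama pull llama3.2    # pulls the default 3B tag
ollama pull mistral
ollama list             # shows all locally available models and their size on disk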

Quick performance enhancements
Since the Raspberry Pi is by no means a fast computer, here are some settings you can change to improve performance. These settings will also improve performance on any other system.
- Use a smaller model
  - A 3B model will run a lot quicker than a 12B model (see the sketch after this list for checking model sizes against available memory)
- Disable convenience options
  - Go to Admin Panel > Settings > Interface and disable the following options:
    - Follow Up Generation
    - Tag Generation
    - Retrieval Query Generation
    - Web Search Query Generation
    - Autocomplete Generation
    - (Title Generation) – If you disable this, all your chats will be called “New Chat”
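To get a feel for which models will actually fit before pulling them, a rough sketch is to compare the model sizes Ollama reports against the memory the Pi has left:

free -h         # available RAM; the model plus its context has to fit here
ollama list     # on-disk size of each pulled model, a rough lower bound for its RAM usage
ollama ps       # models currently loaded into memory and how much they occupy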
Testing different models
The tests will be done with the following prompt:
You are a professional salesperson. You are tasked to convince the owner of a start-up on implementing AI into their business. You have a maximum of 100 words. You are not allowed to lie. You are not allowed to make up facts. You need to stay neutral. If you are citing facts you need to supply sources. Sources do not count towards your 100 word limit.
The numbers – what do they mean?
- model (5B) – The number of parameters in a model. Larger models contain more parameters but require more computational power
- token – A token is an individual piece of text (words & characters) that is processed during text analysis and generation
- prompt_token – A token generated from a prompt. Each prompt is tokenized prior to analysis
- response_token – A token generated as part of a response. Each response is made up of multiple tokens
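These statistics come straight from Ollama. A minimal sketch for reproducing them yourself, assuming Ollama's default port 11434 and an already pulled model such as mistral:

# Option 1: let Ollama print its timing statistics after the response
ollama run mistral --verbose "Explain what a token is in one sentence."

# Option 2: query the API and compute tokens per second from the raw counters
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Explain what a token is in one sentence.", "stream": false}' \
  | jq '{prompt_tps: (.prompt_eval_count / (.prompt_eval_duration / 1e9)),
         response_tps: (.eval_count / (.eval_duration / 1e9))}'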
Performance
The table below shows the tokens per second for the prompt and the response. A higher number equals better performance.
| Large Language Model | Raspberry Pi 4 (8GB) prompt_token/s / response_token/s | Raspberry Pi 5 (16GB) prompt_token/s / response_token/s |
| --- | --- | --- |
| tinyllama (1.1B) | 13.6 / 5.2 | 27.13 / 17.57 |
| llama3.2 (3B) | 4.17 / 1.78 | 9.36 / 5.41 |
| gemma3 (4B) | 3.61 / 1.46 | 8.14 / 4.64 |
| mistral (7.2B) | 1.85 / 0.9 | 3.89 / 2.39 |
| phi4 (14B) | won’t run, not enough memory available | 1.88 / 1.25 |
Output Usability
Finding the right model for your use case is important since not all models are created equal. Smaller models may run faster but can also generate less accurate responses. Here are two examples:
| Model | Prompt | Response | Issue |
| --- | --- | --- | --- |
| gemma3 (4B) | Does any Raspberry Pi have a built in microphone? | Yes, some Raspberry Pi models have a built-in microphone. Here’s the breakdown: Raspberry Pi 4 Model B: This is the most common model with a built-in microphone. It utilizes a microphone array for better voice recognition and noise cancellation. Raspberry Pi 400: This keyboard-computer version of the Raspberry Pi 4 also includes a built-in microphone. Raspberry Pi Zero 2 W: This compact model does have a built-in microphone, although its quality isn’t as high as the Pi 4’s. | None of the available Raspberry Pi models have a built-in microphone |
| tinyllama (1.1B) | Please scramble a random word from the dictionary. I want to guess the word. | Sure, here’s a random word from the dictionary: “Astronaut“! Let me help you guess it! […] | I’ll take a wild guess and say the word was “Astronaut“ |
On my Raspberry Pi 5 I usually run Mistral (7B), as it seems to be the best performance-to-accuracy compromise for this tiny device. On the Raspberry Pi 4 I would not run anything larger than gemma3 (4B).
Privacy
The majority of users opt for cloud-based AI services due to their convenience and efficiency. However, it is crucial to recognize that this choice may have significant privacy implications, not only for the user but also for others. Data entered into a cloud-based AI system, which may include personal, professional, or financial information, risks being collected, analyzed, and even used by the service provider to improve the AI model itself.
Running a large language model locally or on a privately-owned server offers the advantage of maintaining exclusive control over the data utilized in conversations, thereby eliminating said privacy concerns. However, it is essential to acknowledge that this approach may be accompanied by reduced performance or less precise responses when compared to cloud-based solutions.
Privacy Proof
Trusting that a program respects your privacy and does not communicate with the outside world is nice, but proof is better. I rebooted my server and ran tcpdump while using Open WebUI to see what kind of communication I could observe. I used the same prompt as in the performance test above, but continued the conversation for a few more messages.
tcpdump command
tcpdump -i eth0 -s 65535 -w openwebui.pcap
Wireshark Filter
((((ip.src == 192.168.0.7) && !(ip.dst == 192.168.0.0/24)) || ((ip.dst == 192.168.0.7) && !(ip.src == 192.168.0.0/24))) && !(_ws.col.protocol == "NTP") && !(_ws.col.protocol == "MDNS") && !(_ws.col.protocol == "IGMPv3"))
Explanation:
- Source or destination: traffic from or to my LLM server (192.168.0.7) where the other endpoint is outside the local network (not in 192.168.0.0/24)
- Protocol: NTP, MDNS and IGMPv3 are filtered out
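The same filter can be applied on the command line as well, which is convenient on a headless Pi. A sketch assuming tshark (Wireshark's CLI counterpart) is installed:

tshark -r openwebui.pcap \
  -Y '((((ip.src == 192.168.0.7) && !(ip.dst == 192.168.0.0/24)) || ((ip.dst == 192.168.0.7) && !(ip.src == 192.168.0.0/24))) && !(_ws.col.protocol == "NTP") && !(_ws.col.protocol == "MDNS") && !(_ws.col.protocol == "IGMPv3"))' \
  -T fields -e ip.src -e ip.dst -e frame.len    # print only source, destination and packet size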

| IP | Domain | Traffic | Description |
| --- | --- | --- | --- |
| 13.226.155.93 | huggingface.co | 300 bytes, 4 packets | Gets called during container startup; not entirely captured as the container was already starting |
| 140.82.121.5 | api.github.com | 12 kB, 19 packets | Probably update checks |
This packet capture shows that there is minimal communication going on while using Open WebUI. It’s safe to say that your prompts do not leave your device.
Conclusion
Running a Large Language Model locally is not as hard as it may seem. While the performance cannot compete with services like OpenAI or Perplexity, the advantage of enhanced privacy makes it a worthwhile effort.