Running the official llama.cpp benchmark tool (llama-bench)
Here’s a step-by-step breakdown of what each command does:
| Step | Command | Action | Why it is done |
|---|---|---|---|
| 1 | `cd ~` | Navigates to your user’s home folder (/home/tzah). | Ensures the project downloads into a clean, safe workspace instead of a restricted system folder. |
| 2 | `git clone https://github.com/ggerganov/llama.cpp` | Downloads the entire up-to-date source code repository from GitHub. | Creates a local copy of the llama.cpp files on your device in a new folder named llama.cpp. |
| 3 | `cd llama.cpp` | Changes your current terminal directory into the newly created project folder. | Moves you inside the codebase so you can run configuration and build tools on its files. |
| 4 | `cmake -B build` | Generates a custom build configuration and creates a folder named build. | Inspects your system (compiler, toolchain, and hardware features such as those of your SpacemiT processor) to generate a tailored compilation “recipe.” |
| 5 | `cmake --build build --config Release` | Compiles the raw C++ code into finished, ready-to-run executable binaries. | The `--config Release` flag requests an optimized Release build, tuned for raw speed and AI performance. |
- Once these steps finish running, we will have working compiled binaries inside the build/bin/ folder: llama-cli (named main in older releases) for running models, and llama-bench, the benchmark tool used below. We can then load quantized (compressed) AI models and generate text completely offline.
- After completing step 5, we will have a finely tuned local AI engine ready to run language models directly on our K3 Pico-ITX hardware.
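The five steps in the table can also be collected into one small script. The version below is a minimal sketch, not the script we ran: the `run` and `build_llama_cpp` helpers and the `DRY_RUN` switch are names made up for illustration. With `DRY_RUN=1` (the default here) it only prints the commands, so you can review them before anything is cloned or compiled.

```shell
#!/bin/bash
# Sketch: the five build steps from the table as one script.
# Assumption: DRY_RUN, run and build_llama_cpp are illustrative names,
# not part of llama.cpp. DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"          # show the command only
  else
    "$@"                 # execute the command in the current shell
  fi
}

build_llama_cpp() {
  run cd "$HOME" &&
  run git clone https://github.com/ggerganov/llama.cpp &&
  run cd llama.cpp &&
  run cmake -B build &&
  run cmake --build build --config Release
}

build_llama_cpp
```

Setting `DRY_RUN=0` before running the script executes the same five commands for real, stopping at the first one that fails.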
Running our test
We used the following Bash script to run our test. In plain language: what does it actually do?
It:
- Runs a full LLaMA performance benchmark on our SpacemiT K3 hardware.
- Measures prompt processing speed (pp512) and generation speed (tg128), both in tokens per second, using all CPU threads
- Saves the results into a Markdown file you can open anywhere
- Names the file after your device so you can compare it later with your Mac mini results
This script is specifically designed to let us use the same model, parameters, and benchmark tool to compare it with other devices such as:
- Mac mini (M1/M2/M4)
- Any Linux or ARM device
| Step | Command or action | What it does and why |
|---|---|---|
| 1 | `MODEL="$HOME/models/llama3-8b-q4_k_m.gguf"` | Sets the model file llama-bench will test. |
| 2 | `THREADS=$(nproc)` | Automatically uses all CPU cores on your device. |
| 3 | `OUTFILE="llama_bench_results_$(hostname).md"` | Creates a results file named after your machine (e.g., llama_bench_results_spacemit-k3.md). |
| 4 | Writes a Markdown header with hostname, CPU model, and thread count | Makes the results readable and comparable across devices. |
| 5 | `mkdir -p ~/models` | Creates a directory to store our model (done once, before running the script). |
| 6 | `wget -O ~/models/llama3-8b-q4_k_m.gguf https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/resolve/main/Meta-Llama-3-8B.Q4_K_M.gguf` | Downloads LLaMA-3 8B Q4_K_M (GGUF) (done once, before running the script). |
| 7 | Runs llama-bench with the following parameters: • 512 prompt tokens • 128 generation tokens • all CPU threads • 2048 batch size • Markdown output | This is the actual LLaMA performance benchmark. |
| 8 | Appends the benchmark output to the Markdown file | Saves everything in one clean report. |
| 9 | Prints “Benchmark complete…” | Confirms the script finished. |
Our test script
Running the most popular Mac mini benchmark model: LLaMA-3 8B (Q4_K_M)
#!/bin/bash
MODEL="$HOME/models/llama3-8b-q4_k_m.gguf"           # model file to benchmark
LLAMABENCH="$HOME/llama.cpp/build/bin/llama-bench"   # binary built in the previous section
THREADS=$(nproc)                                     # use every CPU core
OUTFILE="llama_bench_results_$(hostname).md"         # report named after this machine
echo "# LLaMA Benchmark Results for $(hostname)" > "$OUTFILE"
echo "## CPU: $(lscpu | sed -n 's/^Model name:[[:space:]]*//p' | head -n 1)" >> "$OUTFILE"
echo "## Threads: $THREADS" >> "$OUTFILE"
echo "## Model: LLaMA-3 8B Q4_K_M" >> "$OUTFILE"
echo "" >> "$OUTFILE"
"$LLAMABENCH" \
-m "$MODEL" \
-p 512 \
-n 128 \
-t "$THREADS" \
-b 2048 \
-o md >> "$OUTFILE"
echo "" >> "$OUTFILE"
echo "Benchmark complete. Results saved to $OUTFILE"
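If the model has not been downloaded yet or the build failed, the script above still writes a header and then stops with a rather unhelpful error. A pre-flight check avoids that. The following is a minimal sketch; `check_prereqs` is a hypothetical helper name, not something llama.cpp provides, and the two paths are the ones used in our script.

```shell
#!/bin/bash
# Sketch: fail early if the model file or the llama-bench binary is missing.
# check_prereqs is a hypothetical helper, not part of llama.cpp.
check_prereqs() {
  local model=$1 bench=$2
  if [ ! -f "$model" ]; then
    echo "Model file not found: $model" >&2
    return 1
  fi
  if [ ! -x "$bench" ]; then
    echo "llama-bench not found or not executable: $bench" >&2
    return 2
  fi
  return 0
}

MODEL="$HOME/models/llama3-8b-q4_k_m.gguf"
LLAMABENCH="$HOME/llama.cpp/build/bin/llama-bench"

if check_prereqs "$MODEL" "$LLAMABENCH"; then
  echo "Ready to benchmark."
else
  echo "Fix the paths above before running the benchmark."
fi
```

The function returns a different exit code for each missing piece, so a calling script can tell a missing download from a missing build.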
Why did we pick this model?
We chose this model because it’s one of the most widely benchmarked models among Mac mini users, which makes comparison data easy to find. It shows up in nearly every:
- GitHub llama.cpp benchmark thread
- Reddit r/LocalLLaMA performance post
- Apple Silicon comparison
- M1 vs M2 vs M4 benchmark
- CPU vs Metal backend test
Why it’s our favorite:
- Fits comfortably in 16 GB RAM
- Strong real world performance
- Good balance of speed + quality
- Works perfectly with llama.cpp
- Ideal for comparing different CPUs (like our K3)
LLaMA-3 8B Q4_K_M — Spacemit K3 vs Mac mini (16 GB RAM) + Estimated Prices (USD)
| Device | Threads | Prompt Speed (pp512) | Generation Speed (tg128) | Estimated Price (USD) | Notes |
|---|---|---|---|---|---|
| K3 Pico-ITX (our result) | 8 | 9.04 t/s | 3.05 t/s | ~$300 | Low-cost RISC-V SBC |
| Mac mini M1 (16 GB) | 8 | ~55–65 t/s | ~35–45 t/s | $650–$800 (used) | Best budget Apple Silicon |
| Mac mini M2 (16 GB) | 8 | ~70–85 t/s | ~45–55 t/s | $900–$1,100 | Faster memory + CPU |
| Mac mini M4 (16 GB) | 8 | ~110–140 t/s | ~70–90 t/s | $1,200–$1,400 | Latest generation |
Final Conclusions:
What do the results mean?
✔ Our device can run an 8B model, but it’s a bit slow—about 3 tokens per second.
✔ In terms of value for the money, the SpacemiT K3 Pico-ITX is a winner.
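✔ The Mac mini is much faster: going by the table above, roughly 6×–15× at prompt processing and 11×–30× at generation, depending on the chip.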
✔ For interactive use, a 2B–4B model will likely feel much smoother.
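The speed gap can be worked out directly from the comparison table by dividing each Mac mini figure by the matching K3 figure. The snippet below is a small sketch of that arithmetic; `speedup` is a helper name chosen for illustration, and the inputs are the low end of the M1 range and the high end of the M4 range from the table.

```shell
#!/bin/bash
# Sketch: speedup ratios computed from the comparison table.
# speedup is a hypothetical helper; awk handles the floating-point division.
speedup() {
  awk -v other="$1" -v ours="$2" 'BEGIN { printf "%.1f\n", other / ours }'
}

K3_PP=9.04   # K3 prompt speed, pp512 (t/s)
K3_TG=3.05   # K3 generation speed, tg128 (t/s)

echo "Prompt:     $(speedup 55 "$K3_PP")x (M1, low end) to $(speedup 140 "$K3_PP")x (M4, high end)"
echo "Generation: $(speedup 35 "$K3_TG")x (M1, low end) to $(speedup 90 "$K3_TG")x (M4, high end)"
```

Keep in mind that the Mac mini numbers in the table are approximate ranges, so the ratios are only as precise as those estimates.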
What kinds of AI language models can operate on this device?
After CPU power, the next big bottleneck for running LLMs is the device’s RAM capacity. As the saying goes, the more, the better. So, can you run 30B models on this mini-PC as SpacemiT claims? The answer is yes, but only if the model is compressed into INT4 format. In plain English, INT4 is a type of compression (quantization) designed for AI models, and it is needed because you can’t cram a 60GB or even 30GB model into a device with just 16GB of RAM! It’s like trying to fit a full-size refrigerator into a small car: it’s just not happening.
How big the model is in each format
| Format | Size of a 30B model | Fits in 16GB RAM? |
|---|---|---|
| FP16 | ~60 GB | ❌ No |
| INT8 | ~30 GB | ❌ No |
| INT4 | ~15 GB | ⚠️ Borderline |
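As a rough rule of thumb, the weights-only size of a model is its parameter count multiplied by the bits per weight, divided by 8 to get bytes. The sketch below applies that rule to a 30B model; `model_size_gb` is a hypothetical helper, and the result ignores file metadata, the KV cache, and the mixed precision that real GGUF quantizations use, so actual files tend to be somewhat larger.

```shell
#!/bin/bash
# Sketch: weights-only size estimate = parameters (billions) * bits / 8, in GB.
# model_size_gb is a hypothetical helper; real GGUF files are usually a bit larger.
model_size_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

# Estimate a 30B model at 16, 8 and 4 bits per weight
while read -r name bits; do
  echo "$name: ~$(model_size_gb 30 "$bits") GB"
done <<'EOF'
FP16 16
INT8 8
INT4 4
EOF
```

The same function can be called with other sizes, for example `model_size_gb 8 4` for an 8B model at 4 bits.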
What does this mean from a user’s standpoint?
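That, once quantized to INT4, 30B-class models become candidates for this device, although at this size they are borderline on 16GB of RAM (see the compatibility table below). Examples include: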
- LLaMA-2 30B (INT4)
- LLaMA-3 30B (INT4)
- Qwen 32B (INT4)
- Baichuan 30B (INT4)
- Mixtral 8x7B (INT4)
Unified LLM Compatibility Table (16GB RAM Pico ITX)
Smaller models (7B, 13B) run even more smoothly. The expanded table below lists which models fit on a device with 16GB of RAM, based on model size, disk space, RAM use, and quantization level. It includes Alibaba Qwen, Google Gemma, and Gemini Nano.
| Model / Family | Runs on 16GB? | Disk Size (Q4_K_M) | RAM Use | INT / Quantization |
|---|---|---|---|---|
| Qwen 0.5B | ✅ | ~0.5GB | ~1GB | INT4 / INT8 |
| Qwen 1.8B | ✅ | ~1GB | ~2GB | INT4 / INT8 |
| Qwen 4B | ✅ | ~2GB | ~3GB | INT4 |
| Qwen 7B | ✅ | ~3.5–4GB | ~5–6GB | INT4 |
| Qwen 9B | ✅ | ~4–5GB | ~6–7GB | INT4 |
| Qwen 14B | ✅ | ~7–8GB | ~9–10GB | INT4 |
| Qwen 22B | ⚠️ | ~10–11GB | ~12–13GB | INT4 |
| Qwen 27B | ⚠️ | ~13–14GB | ~15–16GB | INT4 |
| Qwen 32B | ❌ | ~15–16GB | ~17–18GB | Exceeds RAM |
| Qwen 72B | ❌ | 30GB+ | 40GB+ | Exceeds RAM |
| Gemma 4 2B | ✅ | ~1–1.5GB | ~2–3GB | INT4 (edge-optimized) |
| Gemma 4 4B | ✅ | ~2–3GB | ~3–4GB | INT4 (edge-optimized) |
| Gemma 3 4B | ✅ | ~2.3GB | ~2.3–2.6GB | INT4 (QAT) |
| Gemma 3 12B | ⚠️ | ~6.9GB | ~7–8GB | INT4 (QAT) |
| Gemma 3 27B | ⚠️ borderline | ~15.5GB | ~15–17GB | INT4 (QAT) — fits but tight |
| Gemma 4 26B | ❌ | 16GB+ | 18GB+ | Too large for 16GB |
| Gemma 4 31B | ❌ | 18GB+ | 20GB+ | Exceeds RAM |
| Gemini Nano 1 | ✅ | ~1GB | ~1–2GB | INT4 / mobile-optimized |
| Gemini Nano 2 | ✅ | ~2GB | ~2–3GB | INT4 / mobile-optimized |
| Llama class 7B | ✅ | ~4GB | ~6GB | INT4 |
| Llama class 12B | ✅ | ~6–7GB | ~8–9GB | INT4 |
| Llama class 22B–24B | ⚠️ | ~11–12GB | ~13–14GB | INT4 |
| Llama class 27B | ⚠️ | ~13–14GB | ~15–16GB | INT4 |
| Llama class 30B | ⚠️ borderline | ~14–15GB | ~16GB+ | INT4 |
| 35B+ Models | ❌ | 16GB+ | 18GB+ | Exceeds RAM |
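Before downloading or loading a model from the table, it can help to compare its file size with the memory that is actually free on the device. The snippet below is a minimal sketch under stated assumptions: `fits_in_ram` is a hypothetical helper, and the default 2048 MB headroom is a guess loosely based on the gap between the disk size and RAM use columns above, not a measured value. It reads `MemAvailable` from `/proc/meminfo`, so it only applies to Linux.

```shell
#!/bin/bash
# Sketch: will a GGUF file fit in the memory that is currently available?
# fits_in_ram and the 2048 MB default headroom are assumptions for illustration.
fits_in_ram() {
  local model_mb=$1 avail_mb=$2 headroom_mb=${3:-2048}
  if [ $((model_mb + headroom_mb)) -le "$avail_mb" ]; then
    echo "fits"
  else
    echo "too large"
  fi
}

MODEL="$HOME/models/llama3-8b-q4_k_m.gguf"
if [ -f "$MODEL" ] && [ -r /proc/meminfo ]; then
  # File size in MB (GNU stat) and available memory in MB (kernel reports kB)
  model_mb=$(( $(stat -c %s "$MODEL") / 1024 / 1024 ))
  avail_mb=$(awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo)
  echo "Model: ${model_mb} MB, available: ${avail_mb} MB -> $(fits_in_ram "$model_mb" "$avail_mb")"
fi
```

A third argument overrides the headroom, for example `fits_in_ram 8000 9000 500` for a setup with a very small context window.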



