Local AI Hardware Performance Benchmarking

Edited on November 5: updated the Key findings section for better readability.

Overview

Running powerful generative AI models locally is becoming increasingly practical for developers and professionals. However, as model complexity grows, the performance of the underlying hardware is a critical factor for productivity. While many manufacturers publish specifications, it can be difficult to find objective performance data based on real-world AI workloads.

To address this, we ran comprehensive benchmarks comparing Olares One, a hardware solution that ships with the open‑source Olares OS, against high‑end systems from Apple, AMD, and NVIDIA. Our goal is to provide clear, quantitative data on how this range of local AI hardware handles common AI tasks.

The complete data and analysis for each test can be found in the dedicated results sections below.

The hardware we tested

The following systems were configured for the performance benchmarks. The specifications represent relevant high-end configurations available at the time of testing (October 22, 2025).

| System | Processor configuration | System memory (RAM) | Storage | Operating system | Price |
|---|---|---|---|---|---|
| Olares One | Intel Core Ultra 9 275HX, NVIDIA RTX 5090 Mobile (24 GB VRAM) | 96 GB | 2 TB | Olares | $2,999 |
| Mac Studio (M3 Ultra) | Apple M3 Ultra with 32-core CPU, 80-core GPU, 32-core Neural Engine | 96 GB unified | 1 TB | macOS | $5,499 |
| Mac Studio (M4 Max) | Apple M4 Max with 16-core CPU, 40-core GPU, 16-core Neural Engine | 64 GB unified | 512 GB | macOS | $2,699 |
| MacBook Pro (M4 Pro) | Apple M4 Pro with 14-core CPU, 20-core GPU, 16-core Neural Engine | 24 GB unified | 1 TB | macOS | $2,399 |
| Mac Mini (M4) | Apple M4 with 10-core CPU, 10-core GPU, 16-core Neural Engine | 16 GB unified | 512 GB | macOS | $799 |
| Beelink GTR9 Pro | AMD Ryzen AI Max+ 395, Radeon 8060S | 128 GB unified | 1 TB | Windows | $2,099 |
| NVIDIA DGX Spark | NVIDIA Blackwell architecture, 20-core Arm CPU (10 Cortex-X925 + 10 Cortex-A725) | 128 GB unified | 4 TB | NVIDIA DGX OS | $3,999 |

The AI tasks and metrics we measured

Our benchmarks were designed to measure performance in ways that directly reflect real-world user experiences, so we focused on two primary workload categories.

Large language model (LLM) inference

For LLMs, responsiveness is key. We measured this by focusing on the Token Generation Rate, reported in tokens per second (tok/s). This metric assesses the sustained speed at which the model produces text after the initial prompt is processed. Higher rates are better.

To evaluate performance under realistic conditions, we also measured this rate with 1, 2, 4, and 8 concurrent requests.
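
For reference, the sketch below shows one way such a measurement can be made against any OpenAI-compatible endpoint (vLLM, Ollama, and Llama.cpp all expose one). It is a minimal illustration rather than our actual EvalScope harness; the endpoint URL, model name, and prompt are placeholders.

    # Minimal concurrency throughput sketch (illustrative; not the EvalScope harness).
    # Assumes an OpenAI-compatible server is already running at BASE_URL.
    import asyncio
    import time
    from openai import AsyncOpenAI  # pip install openai

    BASE_URL = "http://localhost:8000/v1"  # placeholder endpoint
    MODEL = "qwen3-30b-a3b-instruct-2507"  # placeholder model name

    client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")

    async def one_request() -> int:
        """Send one chat request and return the number of generated tokens."""
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "Explain MoE models in one paragraph."}],
            max_tokens=512,
        )
        return resp.usage.completion_tokens

    async def measure(concurrency: int) -> float:
        """Run `concurrency` requests in parallel and return aggregate tok/s."""
        start = time.perf_counter()
        token_counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        return sum(token_counts) / (time.perf_counter() - start)

    async def main() -> None:
        # Note: this is end-to-end tok/s; a careful harness subtracts
        # time-to-first-token to isolate the sustained decode rate.
        for c in (1, 2, 4, 8):
            print(f"concurrency {c}: {await measure(c):.2f} tok/s")

    asyncio.run(main())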

Generative media creation

For image and video generation, the user experience is about waiting time. We measured this in two distinct scenarios that represent common creative workflows.

  • Time to first generation (cold start): The time required to create the first image or video. This test includes the one-time step of loading the AI model from storage into memory. Lower times are better.
  • Time for subsequent generation (warm cache): The time needed to create subsequent images or videos once the model is already in memory. This metric reflects the hardware’s raw processing speed during iterative work. Lower times are better.

Key concepts for local LLMs

To effectively run large language models on your own hardware, it’s important to understand three concepts: how a model uses memory, its underlying architecture, and the software that runs it.

Model size and VRAM requirements

A model’s “size” is determined by its number of parameters (e.g., a 70B model has 70 billion parameters). This directly affects the amount of Video RAM (VRAM) it requires.

The solution to high VRAM usage is quantization, a process that reduces the precision of the model’s parameters to make them smaller. Here is a simple guide to estimate VRAM needs:

  • Native precision (FP16/BF16): Requires ~2 bytes per parameter.
    • Example: A 30B model needs ~60 GB of VRAM.
  • 8-bit quantized (FP8): Requires 1 byte per parameter.
    • Example: A 30B model needs ~30 GB of VRAM.
  • 4-bit quantized (Q4): The most common format for local use, requiring ~0.5 bytes per parameter.
    • Example: A 30B model needs ~15 GB of VRAM.

On top of this, you must account for the KV Cache, which stores the context of your conversation. This can add another 20-30% to the total VRAM needed, especially for long conversations.
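
Putting those rules of thumb together, a back-of-the-envelope estimate looks like the sketch below. The bytes-per-parameter values and the 25% KV-cache allowance are the approximations described above, not exact figures.

    # Back-of-the-envelope VRAM estimate using the rules of thumb above.
    BYTES_PER_PARAM = {
        "fp16": 2.0,  # native precision (FP16/BF16)
        "fp8": 1.0,   # 8-bit quantized
        "q4": 0.5,    # 4-bit quantized
    }

    def estimate_vram_gb(params_billions: float, precision: str = "q4",
                         kv_cache_overhead: float = 0.25) -> float:
        """Rough VRAM need in GB: model weights plus a ~20-30% KV-cache allowance."""
        weights_gb = params_billions * BYTES_PER_PARAM[precision]
        return weights_gb * (1 + kv_cache_overhead)

    for precision in ("fp16", "fp8", "q4"):
        print(f"30B model @ {precision}: ~{estimate_vram_gb(30, precision):.0f} GB")
    # 30B @ fp16: ~75 GB, @ fp8: ~38 GB, @ q4: ~19 GB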

Model architecture

The internal structure of a model affects its speed and efficiency.

  • Dense models: Every parameter in the network is used to process every token. This approach is powerful and coherent, but it is slower and more resource-intensive.
  • Mixture-of-Experts (MoE) models: The model is composed of smaller sub-networks (“experts”), and only a fraction of them are activated for any given token. This makes inference much faster and more efficient: a very large model can decode at roughly the speed of a much smaller one, although all of its parameters must still fit in memory (see the rough sketch below).
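
For intuition, two models from our test suite make the contrast concrete: Gemma3-27B is dense, while the "A3B" in Qwen3-30B-A3B indicates that only about 3B of its roughly 30B parameters are active per token. The sketch below uses the common approximation of about 2 FLOPs per active parameter per decoded token; the numbers are illustrative, not measured.

    # Rough intuition: memory scales with total parameters, but per-token decode
    # compute scales with *active* parameters (~2 FLOPs per active parameter).
    def decode_flops_per_token(active_params_billions: float) -> float:
        return 2 * active_params_billions * 1e9

    dense_27b = decode_flops_per_token(27)  # Gemma3-27B: all 27B params active
    moe_30b = decode_flops_per_token(3)     # Qwen3-30B-A3B: ~3B active of ~30B total
    print(f"Dense 27B:   ~{dense_27b:.1e} FLOPs/token")
    print(f"MoE 30B-A3B: ~{moe_30b:.1e} FLOPs/token "
          f"(~{dense_27b / moe_30b:.0f}x less compute per token)")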

Inference frameworks

An inference framework (or “backend”) is the engine that runs the model. The choice of framework significantly impacts performance and ease of use. For our tests, we used the following:

  • Ollama: Known for its simplicity and ease of use, making it excellent for getting started quickly.
  • vLLM: A high-performance backend optimized for maximum throughput, widely used in server environments.
  • Llama.cpp: A versatile engine that began as a CPU-focused runtime and now supports GPU offloading, which is useful when a model does not fit entirely in VRAM.

LLM selection

To ensure our benchmark reflects real-world performance and user preference, we selected our models based on the LMSys Chatbot Arena Leaderboard.

The Chatbot Arena is a crowdsourced, open platform where models compete in anonymous, head-to-head battles, and thousands of human users vote on which response is better. This methodology provides a dynamic and robust measure of a model’s practical capabilities, making it a reliable benchmark for evaluating LLMs.

Our strategy was to select open-source models that offer the best balance of top-tier performance and resource efficiency, as the absolute highest-ranking models are often closed-source or too large to run on local consumer hardware.

| Rank | Model | Score | Notes & rationale |
|---|---|---|---|
| 1 | claude-sonnet-4-5-20250929-thinking-32k | 1453 | Closed-source, not available for local use. |
| 4 | GLM-4.6 | 1426 | A 357B MoE model, too large for our test hardware. |
| 11 | qwen3-235b-a22b-instruct-2507 | 1419 | A 235B MoE model, also too large for local deployment. |
| 11 | deepseek-r1-0528 | 1417 | A 671B parameter model, far too large for local deployment. |
| 11 | kimi-k2-0905-preview | 1416 | A massive 1T parameter model, not feasible for local hardware. |
| 21 | qwen3-next-80b-a3b-instruct | 1403 | The first model under 100B parameters, but still requires very high-end hardware. |
| 27 | longcat-flash-chat | 1399 | A 560B parameter model, too large for consumer hardware. |
| 33 | deepseek-r1 | 1394 | An older version of the DeepSeek model. Newer, higher-ranked models were prioritized for selection. |
| 39 | qwen3-30b-a3b-instruct-2507 | 1384 | Selected. While it can be run in its full BF16 version for maximum quality, it is also highly effective when quantized, allowing it to run efficiently on high-end consumer hardware. |
| 57 | gemma-3-27b-it | 1362 | Selected. The highest-ranked open-source dense model, providing a strong baseline for quality and multimodal performance. |
| 64 | gpt-oss-120b | 1347 | Selected. A unique, pre-quantized FP4 model packing 120B parameters into a ~60GB footprint. |
| 66 | gemma-3-12b-it | 1340 | Selected. An excellent smaller model for balanced performance and efficiency. |
| 91 | gpt-oss-20b | 1320 | Selected. A lightweight and efficient model for general tasks. |

The Arena ranks models using the Elo rating system, a method from the chess world that is perfect for one-on-one comparisons. A model’s score shows its relative strength. The difference in scores between two models predicts how often one would be preferred over the other in a blind test.

For example, at the time of our selection, the top-ranked model had a score of 1453, while our first selected model, Qwen3-30B-A3B-Instruct, had a score of 1384. This 69-point gap means that if both models were compared head-to-head, users would prefer the top model’s answers about 60% of the time.
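
The arithmetic behind that estimate is the standard Elo expected-score formula, which maps a rating gap to a predicted preference rate. A quick check (not part of the benchmark itself):

    # Elo expected score: probability that the higher-rated model is preferred.
    def elo_expected_score(rating_a: float, rating_b: float) -> float:
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    print(f"{elo_expected_score(1453, 1384):.1%}")  # ~59.7%, i.e. about 60%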

Based on this analysis, we curated a list of five models representing different use cases.

  • Qwen3-30B-A3B-Instruct: Delivers top-tier text quality among locally runnable models.
  • GPT-OSS-120B: This highly compressed FP4 model offers the reasoning capacity of a 120B parameter model while fitting within a 60GB VRAM budget, making it a unique choice for high-end local hardware.
  • Gemma3-27B: Its key strength is understanding both images and text. As the best open-source dense model, it provides a strong baseline for quality and coherence.
  • GPT-OSS-20B & Gemma3-12B: These lightweight models are fast and responsive, perfect for handling everyday, less demanding tasks with excellent efficiency.

Test methodology

To ensure our results are transparent and reproducible, we ran our benchmarks using a standardized set of models, software, and configurations. Here’s a breakdown of our setup.

Model sources and specifications

For full reproducibility, the specific model versions and sources used in our tests are detailed below.

| Model | Framework | Ollama model ID | Hugging Face source (for vLLM / Llama.cpp) |
|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | vLLM | N/A | cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit |
| Qwen3-30B-A3B-Instruct-2507 | Ollama | 19e422b02313 | N/A |
| gpt-oss:120b | Ollama | f7f8e2f8f4e0 | N/A |
| gpt-oss:120b | Llama.cpp | N/A | ggml-org/gpt-oss-120b-GGUF |
| gpt-oss:20b | vLLM | N/A | openai/gpt-oss-20b |
| gpt-oss:20b | Ollama | aa4295ac10c3 | N/A |
| gemma3:27b | vLLM | N/A | leon-se/gemma-3-27b-it-qat-W4A16-G128 |
| gemma3:27b | Ollama | a418f5838eaf | N/A |
| gemma3:12b | vLLM | N/A | abhishekchohan/gemma-3-12b-it-quantized-W4A16 |
| gemma3:12b | Ollama | f4031aab637d | N/A |

Software and configuration

To standardize our evaluation, we utilized the open-source framework EvalScope, developed by the ModelScope community.

The following frameworks were used to serve and run the models, which were then benchmarked by EvalScope.

LLM

Ollama setup (v0.12.5)

The service was launched with OLLAMA_NUM_PARALLEL=8 as the baseline setting. However, we made the following system-specific adjustments to optimize performance within each device's hardware constraints:

  • Mac Mini (M4): OLLAMA_NUM_PARALLEL = 1
  • Beelink GTR9 Pro (AI Max+ 395): OLLAMA_NUM_PARALLEL = 4
  • Olares One (for Gemma3-27B): OLLAMA_NUM_PARALLEL = 4

vLLM setup (v0.11.0)

This framework was our go-to for all LLM benchmarks on the Olares One, with one exception: the massive GPT-OSS-120B model.

Example startup arguments:

          args:
            - '--model'
            - cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit
            - '--max-model-len'
            - '8000'
            - '--tensor-parallel-size'
            - '1'
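
Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default) that EvalScope or any other client can target. The streaming sketch below is a quick smoke test that separates time-to-first-token (prefill) from the approximate decode rate; the prompt is a placeholder, and streamed chunks are treated as a rough proxy for tokens.

    # Quick smoke test against the vLLM server (default port 8000), separating
    # time-to-first-token (prefill) from the approximate decode rate.
    import time
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0  # streamed chunks, roughly one token each

    stream = client.chat.completions.create(
        model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
        messages=[{"role": "user", "content": "Summarize the benefits of local AI."}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1

    total = time.perf_counter() - start
    ttft = first_token_at - start
    print(f"time to first token: {ttft:.2f}s")
    print(f"approx. decode rate: {n_chunks / (total - ttft):.1f} tok/s")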

Llama.cpp setup

For the particularly demanding GPT-OSS-120B model on Olares One, we specifically used the full-cuda-b6755 version of Llama.cpp. This combination was necessary to successfully run the model within the system’s VRAM constraints.

Startup arguments:

          image: ghcr.io/ggml-org/llama.cpp:full-cuda-b6755
          args:
            - '--server'
            - '-m'
            - /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
            - '--host'
            - 0.0.0.0
            - '--port'
            - '8080'
            - '-c'
            - '8000'
            - '--flash-attn'
            - auto
            - '--jinja'
            - '--reasoning-format'
            - none
            - '--n-gpu-layers'
            - '999'
            - '--n-cpu-moe'
            - '24'
            - '--temp'
            - '1.0'
            - '--top-p'
            - '1.0'
            - '--top-k'
            - '0'
            - '--threads'
            - '20'
            - '--api-key'
            - olares

Image generation

The core workload parameters were kept identical across all platforms:

  • Model: Flux.1 dev
  • Resolution: 1024 x 1024
  • Steps: 20

Platform-specific notes

  • Olares One: Utilized the Nunchaku FP4 version of the Flux.1 dev model, which is specifically optimized for the NVIDIA 50-series GPU.
  • Apple Silicon: On these devices, FP4/INT4 quantization is not natively supported, and the model’s FP8 acceleration path is CUDA-specific. Therefore, we benchmarked the standard FP16 model variant to report fair, supported performance.
  • NVIDIA DGX Spark: The standard FP16 model variant was also benchmarked on this platform.

Our goal was to measure each system at its peak potential. While the quantization methods differ across platforms, the core workload parameters, including the prompts, were kept identical.
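
For reference, the sketch below reproduces the measured scenario with the Hugging Face diffusers library, timing both the cold start (model load plus first image) and a warm, cached generation. Our actual runs used each platform's own optimized pipeline (for example, the Nunchaku FP4 build on Olares One), so treat this as an illustration of the workload rather than the exact harness; the prompt is a placeholder.

    # Illustrative Flux.1 dev run with diffusers (not our exact harness).
    # Times the cold start (model load + first image) and a warm generation.
    import time
    import torch
    from diffusers import FluxPipeline  # pip install diffusers

    t0 = time.perf_counter()
    # Use "mps" instead of "cuda" on Apple Silicon; on smaller GPUs, consider
    # pipe.enable_model_cpu_offload() instead of moving the whole model.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    prompt = "A watercolor painting of a lighthouse at dawn"  # placeholder prompt
    pipe(prompt, height=1024, width=1024, num_inference_steps=20)
    print(f"cold start (load + first image): {time.perf_counter() - t0:.1f}s")

    t1 = time.perf_counter()
    image = pipe(prompt, height=1024, width=1024, num_inference_steps=20).images[0]
    print(f"warm generation: {time.perf_counter() - t1:.1f}s")
    image.save("flux_benchmark.png")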

Video generation

The video generation benchmark consisted of two separate tests using different models.

Test 1

  • Model: Wan 2.2 14B
  • Resolution: 672×480
  • Frames: 121
  • FPS: 24
  • Steps: 20

Note
This specific video generation model is not supported on Apple Silicon devices.

Test 2

  • Model: LTX-Video 2B 0.9.5
  • Resolution: 768×512
  • Frames: 97
  • FPS: 24
  • Steps: 30

Key findings

LLM

Qwen3-30B-A3B-Instruct-2507

As a high-scoring model in our selection, Qwen3-30B is a strong candidate for general-purpose tasks. Its size allows it to run effectively on most of the tested hardware, providing a solid baseline for performance comparison.

  • Olares One: Delivered the highest throughput across all tested configurations. Using the optimized vLLM framework, its performance ranged from 157.17 tok/s at concurrency 1 down to 81.26 tok/s at concurrency 8. With the more general-purpose Ollama, its performance ranged from 106.67 tok/s down to 31.94 tok/s.
  • Mac Studio (M3 Ultra): The highest-performing Apple device in this test. Its generation rate started at 84.09 tok/s and scaled down to 24.93 tok/s under maximum load.
  • Mac Studio (M4 Max): Performed very competitively, with a throughput slightly lower than the M3 Ultra, ranging from 81.18 tok/s down to 19.88 tok/s.
  • NVIDIA DGX Spark: Secured a strong position, with throughput ranging from 76.36 tok/s down to 17.13 tok/s, placing it just behind the Mac Studios.
  • Beelink GTR9 Pro (AI Max+ 395): Had the lowest throughput among the tested systems for this model, with its generation rate ranging from 61.21 tok/s down to 11.68 tok/s.
  • MacBook Pro M4 Pro and Mac Mini M4 were not tested with this model.

GPT-OSS-120B

This 120-billion parameter model is a demanding test of system memory capacity and bandwidth. Performance here is heavily influenced by a device’s ability to hold the entire model in VRAM or high-speed unified memory to avoid offloading.

  • Mac Studio (M3 Ultra): Delivered the best performance in this category, with its large memory pool enabling throughput that ranged from 69.39 tok/s down to 19.05 tok/s.
  • NVIDIA DGX Spark: Performed respectably, with throughput from 41.97 tok/s down to 12.29 tok/s.
  • Olares One: Tested with llama.cpp, its performance was significantly bottlenecked. The model’s size exceeded the 24 GB of VRAM, forcing layer offloading to slower system RAM and resulting in a much lower throughput, from 36.16 tok/s to 4.44 tok/s.
  • Beelink GTR9 Pro (AI Max+ 395): Throughput started at 33.97 tok/s but dropped sharply at high concurrency, ending at just 2.54 tok/s.
  • Other Apple devices were not tested with this model.

GPT-OSS-20B

This smaller, efficient model is designed to run well on a wider range of hardware, making it a more accessible option for on-device applications.

  • Olares One: Remained the top performer with both vLLM (139.00 tok/s to 72.20 tok/s) and Ollama (123.76 tok/s to 43.76 tok/s).
  • Mac Studio (M3 Ultra): Was the fastest Apple device, ranking just behind Olares One and maintaining a consistent lead over the M4 Max.
  • NVIDIA DGX Spark: Trailed the M3 Ultra but stayed ahead of the MacBook Pro and Beelink, particularly at higher concurrencies, from 58.83 tok/s to 19.98 tok/s.
  • MacBook Pro (M4 Pro): Completed all runs but was noticeably slower.
  • Mac Mini (M4) was not tested with this model.

The results confirm its versatility, with strong performance across most of the tested systems.

Gemma3-27B

As a dense model, Gemma3-27B is more demanding, resulting in lower generation rates across all systems compared to the Qwen3 model.

  • Olares One: Maintained its lead. With vLLM, it achieved a generation rate from 38.02 tok/s to 28.81 tok/s. On Ollama, it also outperformed all other devices, ranging from 29.75 tok/s down to 7.51 tok/s.
  • Mac Studio (M3 Ultra): Was the second-highest performer at low concurrency, starting at 27.51 tok/s, but performance dropped significantly under load to 4.64 tok/s.
  • Mac Studio (M4 Max): Followed a similar trend, starting with a strong 21.60 tok/s but decreasing to 3.12 tok/s at concurrency 8.
  • NVIDIA DGX Spark: This system demonstrated excellent performance scaling. Although its initial generation rate at concurrency 1 (11.46 tok/s) was lower than the Mac Studios, it sustained its performance much more effectively. At concurrency 8, its throughput of 6.27 tok/s was significantly higher than both the M3 Ultra’s (4.64 tok/s) and the M4 Max’s (3.12 tok/s).
  • Beelink GTR9 Pro (AI Max+ 395): Had the lowest throughput, with its generation rate ranging from 10.28 tok/s down to 1.26 tok/s.
  • The MacBook Pro (M4 Pro) and Mac Mini (M4) were not tested with this model.

Gemma3-12B

This lightweight model was the only one in our test suite that could run successfully on the Mac Mini (M4) with its 16 GB of unified memory, making it an important benchmark for entry-level hardware.

  • Olares One: Remained top with vLLM, from 71.94 tok/s to 61.17 tok/s, and also led under Ollama.
  • Mac Studio (M3 Ultra): Was the next fastest system, with a throughput ranging from 50.67 tok/s down to 9.75 tok/s.
  • Mac Studio (M4 Max): Delivered solid performance, with a generation rate from 42.53 tok/s down to 6.46 tok/s.
  • NVIDIA DGX Spark: Once again showcased superior performance scaling under load. Its initial throughput at concurrency 1 (26.76 tok/s) was slightly lower than the MacBook Pro (M4 Pro)’s, but it held its speed far better as concurrency increased.
  • MacBook Pro (M4 Pro): Started with a high generation rate of 27.28 tok/s, but this dropped sharply under load to 3.21 tok/s.
  • Beelink GTR9 Pro (AI Max+ 395): Its performance ranged from 21.72 tok/s down to 3.44 tok/s.
  • Mac Mini (M4): Completed the test with lower throughput than the others.

Image generation

We evaluated image generation performance using the Flux.1 dev model.

  • Olares One: Was significantly faster than all other devices, generating the first image in 15.51s and subsequent images in just 8.32s. This represents a 5.7x speedup over the M3 Ultra for the first image and an 8.8x speedup for subsequent ones.
  • NVIDIA DGX Spark: Was the clear runner-up, with a first-generation time of 72.27s and subsequent generations at 42.27s.
  • Mac Studio (M3 Ultra & M4 Max): These devices were considerably slower than the DGX Spark. The M3 Ultra (88.08s) was faster than the M4 Max (135.83s) for the first image.

The performance gap in this task is stark, with the dedicated, high-VRAM GPU in Olares One providing a substantial advantage.

Video generation

Wan 2.2 14B

This is a computationally demanding model that is also not supported on Apple Silicon devices. Therefore, it was only tested on the Olares One and NVIDIA DGX Spark.

  • Olares One: It completed the first generation in 142.03s and subsequent generations in 97.79s.
  • NVIDIA DGX Spark: This system was slower, recording a first-generation time of 208.34s and subsequent times of 157.49s.

LTX-Video 2B 0.9.5

This smaller model was compatible with a wider range of hardware, allowing for a broader comparison.

  • Olares One: It again led the field, recording a first-generation time of 45.38s and subsequent times of 32.21s.
  • NVIDIA DGX Spark vs. Mac Studio (M3 Ultra): These two devices had similar first-generation times (97.58s for DGX Spark, 98.56s for M3 Ultra). However, the DGX Spark showed a significant speedup on subsequent generations (35.98s), while the M3 Ultra’s time remained high (88.84s). 
  • Mac Studio (M4 Max): Remained behind the M3 Ultra on both first and subsequent passes.
  • MacBook Pro (M4 Pro): Completed the test with the longest generation times among the systems tested for this model.

Conclusions

Local AI performance

A device’s practical AI performance is determined by both its raw processing speed and its memory capacity, which dictates the size and number of models it can run. These two factors were the primary differentiators in our testing.

Olares One is one of only two platforms in our testing to complete all image and video benchmarks (along with the DGX Spark), and it was the faster of the two. This high performance is enabled by its hardware: 24 GB of VRAM, matching the desktop NVIDIA RTX 4090, and a 50-series GPU with native FP4 support for greater efficiency. Powered by the vLLM inference framework, it also delivers strong concurrent performance, making it an excellent fit for handling API requests and powering agent workflows.

Mac Studio (M3 Ultra & M4 Max), with its unified memory architecture, can load larger MoE models. However, performance declines rapidly as concurrency increases, and prefill speed is a notable disadvantage not shown here. These factors make the Mac Studio a better fit for single-user chat than for serving external services or powering agents. While the Mac Studio can run image and video generation workloads, its retail price limits its appeal for that purpose. Overall, based on our tests, the M4 Max offers a better price-to-performance ratio than the M3 Ultra.

Beelink GTR9 Pro (AI Max+ 395) is marketed as an AI PC, and it demonstrates real capability on MoE-style LLM inference. On dense models, however, decoding speed and concurrency degrade sharply, which makes it a poor fit for serving external API services. A critical limitation is its lack of CUDA. Some reviews of other systems built on the same AI Max+ 395 chip report single-concurrency results up to roughly 30% higher on GPT-OSS-120B; these differences likely stem from OEM power tuning, operating systems and drivers, inference frameworks, launch parameters, model variants, and evaluation methods. Such potential gains do not change the fundamental conclusion about its architectural limitations for broader AI workloads.

NVIDIA DGX Spark’s overall LLM performance is similar to that of the AI Max+ 395, though its results on GPT-OSS-120B are notably weaker and may be constrained by memory bandwidth. Its key advantage is excellent prefill speed, reportedly ahead of both Apple Silicon and the AI Max+ 395. As a CUDA-enabled device, it also completed all image and video benchmarks.

Accessibility

The value of a local AI device also depends on its ease of use and the ecosystem supporting it.

Olares One focuses on a streamlined user experience. Its Olares Market provides one-click installation for popular open-source applications including 3D asset generation and text-to-speech. It also generates a secure, unique URL, allowing users to access their applications and data from anywhere. For developers, Olares Studio enables local development and debugging.

NVIDIA DGX Spark is built for developers and researchers. It comes pre-configured with a customized version of Ubuntu 24.04, the complete CUDA toolkit, drivers, and Docker, significantly reducing setup time.

Portability

The physical form factor of these devices dictates their suitability for mobile use versus stationary desktop deployment.

MacBook Pro (M4 Pro) is the only true laptop in this comparison, best suited for offline, on-the-go tasks with mid-sized models such as GPT-OSS-20B. Based on our benchmark data, it is not well suited for local media asset generation.

Olares One has a desktop form factor and is not physically portable. However, this lack of physical portability is offset by its remote access capabilities (as mentioned under Accessibility), which allow users to connect to the device from a separate PC, tablet, or smartphone.

Mac Mini (M4) is a lightweight device that some users carry for on-the-go work with an external monitor. However, it is not designed for demanding, GPU-accelerated workloads or larger models and will struggle to deliver consistent performance for advanced local AI use.

Compatibility

Hardware and software compatibility determines which AI workloads a device can run.

The following table summarizes which models were successfully run on each hardware configuration during our testing. A checkmark (✓) indicates a successful run, while a cross (✗) indicates the hardware could not run the model, often due to memory limitations or lack of required software support (e.g., CUDA).

| Workload | Olares One | Mac Studio (M3 Ultra) | Mac Studio (M4 Max) | MacBook Pro (M4 Pro) | Mac Mini (M4) | Beelink GTR9 Pro (AI Max+ 395) | NVIDIA DGX Spark |
|---|---|---|---|---|---|---|---|
| LLM: Qwen3-30B-A3B-Instruct-2507 | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| LLM: GPT-OSS-120B | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| LLM: GPT-OSS-20B | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| LLM: Gemma3-27B | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ |
| LLM: Gemma3-12B | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Image generation: Flux | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
| Video generation: Wan 2.2 | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Video generation: LTX-Video | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |

Final notes

This work would not be possible without the vibrant open-source communities, whose tireless efforts allow us to bring powerful models to local devices. We are deeply grateful for their contributions.

We hope this analysis will be helpful as you evaluate the options and choose the ideal hardware for your local AI needs. Our benchmarking is not perfect: the local AI landscape is evolving at an incredible pace, and our goal was to provide a practical, real-world snapshot of performance on currently available hardware. We also did not overclock any systems, since overclocking is not a default configuration for most users. We look forward to covering more devices and tests in future benchmarks as this exciting field continues to grow.

If you have any questions about self-hosting these models on your own device with Olares, don’t hesitate to reach out to us.
