📑 Table of contents

antirez launches ds4: the local inference engine that makes DeepSeek V4 Flash usable on a Mac

Self-Hosting 🟢 Beginner ⏱️ 15 min read 📅 2026-05-18

antirez launches ds4: the local inference engine that makes DeepSeek V4 Flash usable on a Mac

🔎 The creator of Redis is back, and he has picked his side

Salvatore Sanfilippo — aka antirez — is no stranger to the developer ecosystem. He created Redis, one of the most widely used caching systems in the world. When this guy releases a new open-source project, people listen. Especially when that project explodes to 8000+ stars on GitHub in a matter of days.

His new baby is called ds4. It's a local inference engine written in pure C, exclusively optimized for Apple Silicon via Metal, and designed to run a single model: DeepSeek V4 Flash. Not a generic GGUF runner, not a wrapper around llama.cpp. A native, single-model engine, tailored for a specific architecture.

The message is clear: self-hosting frontier models on consumer hardware is no longer science fiction. It's compiled C running on your MacBook.


The essentials

  • ds4 is a pure C, Metal-optimized inference engine, created by antirez to run DeepSeek V4 Flash (284B parameter MoE) on Apple Silicon.
  • It achieves 26 t/s on an M3 Max with a 1M token context and thinking enabled, thanks to a proprietary asymmetric Q2 quantization.
  • The KV cache is stored on SSD, allowing it to handle a massive context without blowing up RAM.
  • A ds4-server component exposes OpenAI and Anthropic-compatible endpoints, allowing you to replace cloud APIs locally.
  • It is not a general-purpose tool: it's a dedicated engine, and that is precisely what makes it so performant.

Tool Main usage Price Ideal for
ds4 Local DeepSeek V4 Flash inference on Mac Free (open source) Apple Silicon users looking for the best perf
ds4 GGUF Specific ds4 quantizations Free Downloading model weights
Ollama Multi-model local LLM inference Free Generalist approach, simpler but less optimized
NVIDIA NIM DeepSeek V4 Flash cloud API inference Variable (check on build.nvidia.com) Users without a Mac, need to scale

What ds4 really is — and what it isn't

ds4 is an inference engine, not a generic library. It doesn't load just any GGUF. It doesn't plug into twenty different architectures.

It's a binary compiled in C that knows how to do exactly one thing: run DeepSeek V4 Flash on Apple M1/M2/M3/M4 chips with the maximum possible performance. The source code is available on the antirez/ds4 GitHub repo.

This single-model approach is a radical architectural choice. Most local inference projects (Ollama, llama.cpp, LM Studio) aim for generality: supporting the largest number of models possible. ds4 does the opposite. It sacrifices flexibility to gain in pure performance.

The result? An engine that squeezes every CPU cycle and every byte of memory bandwidth at the service of a single model.

Why a dedicated engine has a structural advantage

When you target a single model, you can optimize at every layer. You know the exact topology of the network, the dimensions of the matrices, the attention patterns. You don't have conditional branches to handle different architectures.

antirez exploited this knowledge to write an ultra-specific Metal pipeline. No generic kernel that adapts on the fly. Kernels written for the exact dimensions of DeepSeek V4 Flash.

It's the same philosophy that made Redis successful against Memcached: sacrificing generality for performance on a specific use case.


The numbers: 26 t/s on M3 Max with 1M tokens of context

The benchmarks published by Pasquale Pillitteri in his article technique sur ds4 are unequivocal. On an M3 Max, ds4 reaches 26 tokens per second in generation with thinking enabled.

This is impressive for a 284-billion parameter model using a MoE (Mixture of Experts) architecture. Let's remember that DeepSeek V4 Flash only activates a fraction of its parameters at each forward pass — that's the principle behind MoE — but the complete model still has to reside somewhere.

Configuration Generation speed Max context Thinking
M3 Max (ds4, Q2) ~26 t/s 1M tokens Enabled
M3 Max (Ollama, standard GGUF) Significantly lower Variable Variable
Cloud NVIDIA NIM Depends on the tier 1M tokens Enabled

A context of one million tokens changes the game. You can feed the model with dozens of source files, complete logs, entire documentation — and it keeps everything in memory during generation.

The SSD as extended RAM: on-disk KV cache

The most striking innovation of ds4 is its KV cache manager. The KV cache stores the attention keys and values computed for each token passed into the context. For 1M tokens, this won't fit into a Mac's RAM, even a well-equipped one.

ds4's solution: use the SSD as active cache. The most recent KV cache data stays in RAM, while the oldest is swapped to SSD with optimized sequential access.

Modern NVMe SSDs have sequential throughput of several GB/s. By accessing the KV cache in a predictive and sequential manner, the penalty compared to pure RAM remains acceptable. It's a smart trade-off: a bit slower than an all-in-RAM approach, but it allows scaling up to 1M tokens instead of 32K or 128K.


Asymmetric Q2 quantization: ds4's other secret

A 284B model in FP16 weighs about 568 GB. Even in INT4, we're talking about ~142 GB. No Mac can load that entirely into RAM.

This is where ds4's specific asymmetric Q2 quantization comes into play. The quantization recipe is detailed on Hugging Face, and it differs significantly from standard GGUF approaches.

Asymmetric Q2 vs standard symmetric Q2

In symmetric quantization, we divide the weights by a single factor and round. Simple, but we lose precision where the weight distribution is asymmetric.

ds4's asymmetric Q2 uses an offset in addition to the scale factor. This allows it to better capture the actual distribution of the model's weights, especially when it is centered around a non-zero value.

The result: output quality that is clearly superior to what a standard Q2 would produce, for an identical memory footprint. antirez literally created a custom quantization format for this model.

Why it's not a "normal" GGUF

The files distributed on antirez/deepseek-v4-gguf are in GGUF format, but be careful: they are not compatible with llama.cpp or Ollama. The format contains specific metadata and quantization tables that only ds4 knows how to read.

This is a crucial point. If you download these files thinking you'll use them with your existing Ollama setup, it won't work. It's a deliberate choice: the format is optimized for ds4's internal pipeline, not for interoperability.


ds4-server: replacing cloud APIs locally

ds4 is not limited to a CLI mode. The project includes ds4-server, a component that exposes HTTP endpoints compatible with the OpenAI and Anthropic APIs.

In practice, you launch ds4-server locally, and you can point any OpenAI-compatible client (Cursor, Continue, Claude Code, etc.) to http://localhost:PORT instead of the OpenAI or Anthropic API.

It's a game-changer for developers. You keep your existing workflow, your tools, your configurations — you just change the base URL. And instead of paying per token to OpenAI or Anthropic, you use your own hardware.

This approach is part of a broader trend: development with open source AI agents that run locally. ds4-server provides the underlying inference infrastructure, and the agents plug into it.

Current limitations of ds4-server

Let's stick to the facts: ds4-server is a young component. It handles basic chat and completion endpoints, but probably doesn't yet support all the features of cloud APIs (complex streaming, advanced function calling, embeddings, etc.).

The project is evolving rapidly — 8000 stars in a few days generates a lot of contributions — but if you need reliable function calling or structured output, check the current state of the repo before migrating your production pipeline.


ds4 vs Ollama vs llama.cpp : which one to choose?

The CometAPI guide on running DeepSeek V4 locally compares the different approaches available. Here is a practice-oriented summary.

Ollama : the comfortable generalist

Ollama is the tool I recommend most often for installing a local LLM. It is simple, well-documented, and handles hundreds of models. Its API is natively OpenAI-compatible.

But for DeepSeek V4 Flash specifically, Ollama uses generic GGUF kernels. It does not benefit from ds4's fine-grained Metal optimization, nor from Q2 asymmetric quantization, nor from SSD KV cache.

Ollama remains the best choice if you want to run several different models on your Mac. It's a Swiss army knife.

llama.cpp : the underlying engine

llama.cpp is the founding project of LLM inference on consumer CPU/GPU. Ollama, LM Studio and many others use it as a backend. It supports Metal, but in a generic way.

For DeepSeek V4 Flash, llama.cpp can load standard GGUF quantizations. The quality will be lower than ds4 (no asymmetric Q2) and the context will be limited by the available RAM (no SSD KV cache).

llama.cpp is perfect for experimentation and the development of new formats. But in production for this specific model, ds4 surpasses it.

ds4 : the specialist that wins on its own turf

If your need is unique — to run DeepSeek V4 Flash as fast as possible on a Mac with the largest possible context — ds4 has no competitor. It's a sniper rifle facing two Swiss army knives.

Criterion ds4 Ollama llama.cpp
Raw performance on V4 Flash Best Good Good
Max context 1M tokens Limited by RAM Limited by RAM
Number of supported models 1 Hundreds Hundreds
Ease of installation Medium Very simple Medium
OpenAI compat API Yes (ds4-server) Yes Not native
Proprietary quantization Q2 asymmetric Standard GGUF Standard GGUF

DeepSeek V4 Flash: why this model deserves a dedicated engine

DeepSeek V4 Flash is a 284B total parameter MoE model, with 28B parameters activated per token. Available on NVIDIA NIM, it is optimized for coding and agents.

In the June 2025 open-source LLM ranking, the DeepSeek V4 Flash variants (Max and High) placed 5th and 7th respectively with scores of 76 and 71. DeepSeek V4 Pro (Max) dominates the ranking with 88 points. The V4 family is clearly the current open-weight benchmark.

But Flash is the ideal candidate for local. With "only" 28B parameters activated per forward pass, it requires significantly less compute than Pro (745B total, 38B activated) while remaining extremely competent in code and reasoning.

This is precisely the perfect profile for a specialized engine: a model powerful enough to be useful daily, and lightweight enough at execution (thanks to MoE) to run on consumer hardware with the right optimizations.

The RunLocalAI guide on DeepSeek V4 confirms this position: V4 Pro is the open-weight leader in coding and mathematics for spring 2026, and V4 Flash is its "fast" version that retains much of the capabilities.


What this means for the self-hosting of frontier models

Eighteen months ago, running a 284B parameter model locally was the realm of laboratory experimentation. You needed a server with multiple NVIDIA GPUs, complex configuration, and the performance was mediocre.

Today, a C binary a few megabytes in size, downloaded from GitHub, runs this same model on a MacBook Pro at 26 t/s with a million tokens of context.

The change is structural, not incremental. The combination of several factors makes this possible:

The MoE architecture drastically reduces compute per token. DeepSeek V4 Flash activates 10% of its parameters at each pass.

Aggressive quantization (asymmetric Q2 here) reduces the memory footprint by an order of magnitude compared to FP16, with minimal quality loss thanks to the asymmetry.

SSD as memory eliminates the RAM bottleneck for long context. Modern SSDs are fast enough for sequential streaming of KV cache.

Apple Silicon provides unified memory bandwidth (100-400 GB/s depending on the chip) that is ideal for LLM inference, far superior to what a discrete NVIDIA GPU offers in the same budget.

When you add up these four factors, the result is that consumer hardware is catching up to datacenter hardware for specific use cases. Not for training, not for large-scale serving — but for an individual developer's use, yes.


Technical innovations in detail

The Metal pipeline

Apple Metal is Apple's graphics and compute API, the equivalent of CUDA but for M chips. Metal has long been criticized for its SDK being less mature than CUDA. But for a dedicated engine written in C, Metal primitives are more than sufficient.

antirez wrote specific Metal kernels for each operation of DeepSeek V4 Flash's forward pass: matmul, multi-head attention, activation, RMSNorm. Each kernel is sized for the model's exact shapes.

No complex JIT compilation, no runtime algo search. Just compiled code that does exactly what it needs to do.

Memory management

ds4 splits the model into two parts: the weights (which fit in RAM after Q2 quantization) and the KV cache (which overflows to SSD). This separation is not trivial — you have to manage transfers asynchronously so as not to block generation.

The engine preloads KV cache blocks from the SSD before they are needed by the attention mechanism. It's classic prefetching, applied to a new problem.

Thinking mode

DeepSeek V4 Flash supports "thinking" — an internal reasoning phase before generating the response. ds4 enables this mode by default, meaning the 26 t/s include thinking tokens (which are not displayed to the user).

This is important for comparing with other benchmarks that sometimes only count visible output tokens. Thinking consumes compute but significantly improves response quality on complex tasks.


❌ Common mistakes

Mistake 1: Confusing ds4 GGUFs with standard GGUFs

ds4 GGUF files contain proprietary quantization tables. Loading them in Ollama or llama.cpp will produce errors or inconsistent outputs. Download them only from antirez's Hugging Face repo and use them exclusively with ds4.

Mistake 2: Expecting similar performance on Intel Macs

ds4 is optimized for Apple Silicon via Metal. It will not work (or will work in an extremely degraded manner) on an Intel Mac. If you don't have an M1 chip or later, skip this and look into llama.cpp or Ollama.

Mistake 3: Underestimating SSD storage requirements

KV cache on SSD means your drive will be heavily used during generation with a long context. An internal NVMe SSD is essential. An external USB hard drive or a slow SATA SSD will make the experience unbearable.

Mistake 4: Comparing t/s without specifying the thinking mode

26 t/s with thinking enabled is not 26 t/s of visible tokens. If you disable thinking, the apparent output speed will be different. Always specify your configuration when reporting benchmarks.


❓ Frequently Asked Questions

Can ds4 run on a base M1 Mac with 8 GB of RAM?

Theoretically yes thanks to the SSD KV cache, but the experience will be very slow. The base M1 has a memory bandwidth of 68 GB/s and 8 GB of RAM shared between the system and the model. An M1 Pro/Max with 32 GB+ is the recommended minimum for comfortable use.

Can I use ds4 with my IDE (VS Code, Cursor)?

Yes, via ds4-server which exposes OpenAI-compatible endpoints. Configure your AI extension to point to the local URL of ds4-server. This is one of the most practical use cases of the project.

Will ds4 replace Ollama?

No. ds4 is a specialized tool for a specific model. Ollama remains the best choice for managing multiple models, experimenting, and having a unified interface. Both tools coexist: Ollama for versatile daily use, ds4 when you need maximum performance on DeepSeek V4 Flash.

Doesn't Q2 quantization degrade quality too much?

In standard symmetric quantization, Q2 is indeed very aggressive. But ds4's asymmetric Q2 significantly compensates for this loss. Preliminary user feedback suggests quality comparable to standard Q4 for this specific model. The format is tailored to the weight distribution of V4 Flash.

Is there a ds4 equivalent for NVIDIA PCs?

Not to date. The project is explicitly targeted at Apple Silicon and Metal. On NVIDIA, the ecosystem already exists (TensorRT-LLM, vLLM, llama.cpp with CUDA), and the need for a dedicated engine is lower because the memory bandwidth of NVIDIA GPUs is generally higher than that of M chips.


✅ Conclusion

ds4 proves that an inference engine tailor-made for a specific model on specific hardware can beat generalist solutions. antirez applies the philosophy of Redis to LLM inference: radical specialization, raw performance, clean C code. Self-hosting frontier models on Mac is no longer an experiment — it's an operational workflow. If you have a Mac with an M chip and you want the best LLM for coding locally, ds4 deserves your attention.