📑 Table of contents

DeepSeek V4: Two new models — Pro and Flash — change the game

LLM & Modèles 🟢 Beginner ⏱️ 11 min read 📅 2026-05-05

DeepSeek V4: two new models — Pro and Flash — change the game

DeepSeek has just published the weights of its two new flagship models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, on HuggingFace. This recent release of AI models is part of a frantic pace for the open-source ecosystem. In this article, we dissect their MoE architectures, analyze their real-world benchmarks, and evaluate their concrete impact on the competition with GPT-5, Claude 3.5 and Qwen 2.5 in our monthly comparison of the best LLMs.

The essentials

  • DeepSeek releases two open-weights models (V4-Pro and V4-Flash) that disrupt the hierarchy of proprietary and open-source LLMs.
  • V4-Pro rivals GPT-5 and Claude 3.5 Opus on complex reasoning, particularly in code and mathematics.
  • V4-Flash offers exceptional throughput (over 150 tok/s on RTX 4090) for a greatly reduced inference cost.
  • Both models integrate an optimized MoE architecture with MLA V2, allowing them to handle 128,000 context tokens.

Prerequisites

  • Mastery of Mixture of Experts (MoE) architecture and multi-head attention concepts
  • Basic knowledge of LLM inference (quantization, KV Cache, vLLM)
  • Understanding of standard benchmark metrics (MMLU, HumanEval, MATH-500)

Anatomy of DeepSeek V4: MoE architecture pushed to its limits

Unlike classic dense models (like the early generations of Llama), DeepSeek continues to bet heavily on the Mixture of Experts (MoE) architecture. V4 introduces major structural optimizations compared to V3, particularly regarding memory management and token routing.

Both models share the same architectural foundation but diverge in their scale:

  • DeepSeek-V4-Pro: 685 billion total parameters, with only 37 billion active parameters per token. It integrates 256 experts.
  • DeepSeek-V4-Flash: 460 billion total parameters, 32 billion active, optimized with 128 experts to reduce routing latency.

The key innovation: Multi-head Latent Attention (MLA) V2

V4 abandons standard attention in favor of an optimized version of MLA. The goal of this technique is to drastically compress the KV Cache without degrading performance. In a classic model, the size of the KV Cache explodes with the context length — a topic detailed in our guide on LLM billing. DeepSeek V4 uses a low-rank projection to absorb keys and values into a latent vector, reducing the required memory by 87% compared to standard attention.

It is this optimization that allows these massive models to run on consumer hardware through aggressive quantization, as explained in our guide to installing LLMs locally.

DeepSeek-V4-Pro: the flagship model for complex reasoning

Available now on HuggingFace, V4-Pro is positioned as the direct open-source alternative to top-tier proprietary models.

Technical specifications

  • Maximum context: 128,000 native tokens (tested up to 256k with YaRN)
  • Training: 14.8 trillion multilingual tokens, with a focus on code synthesis and formal mathematical proofs
  • Native support: Function Calling, structured JSON mode, and implicit Chain-of-Thought

Announced benchmarks

On paper, DeepSeek-V4-Pro catches up with proprietary models in very specific areas:

  • MMLU-Pro: 75.9% (compared to 77.2% for GPT-5)
  • HumanEval+: 91.2% (slightly higher than Claude 3.5 Opus)
  • MATH-500: 83.7%
  • GPQA Diamond: 67.8%

What strikes in these results is the consistency. Where other open-source models (like Llama 3.1 405B) lose ground on complex mathematical reasoning (GPQA), V4-Pro maintains a score above 65%, thanks to its routing mechanism for experts specialized in formal logic.

DeepSeek-V4-Flash: fast inference without compromise

The second model released addresses a specific need for developers: execution speed for automation pipelines and RAG (Retrieval-Augmented Generation).

Why is Flash so fast?

Flash's architecture relies on three pillars:

  1. Reduced depth: The number of Transformer layers is reduced from 62 (Pro) to 38 (Flash).
  2. Grouped MoE: Instead of routing each token to a global expert among thousands, Flash uses local routing restricted to groups of 4 experts, decreasing distribution latency.
  3. Optimized Prefill: The segmented attention mechanism allows prefill requests to be processed in parallel across multiple GPU cores.

Performance and throughput

In terms of throughput, V4-Flash reaches exceptional speeds on accessible hardware configurations:

  • On 1x RTX 4090 (FP8): ~152 tokens/second
  • On 2x RTX 3090 (INT4 quantized): ~85 tokens/second

On standard speed benchmarks (like the lm-evaluation-harness framework measuring time to first token and generation throughput), Flash surpasses Qwen 2.5 32B by 34% while displaying superior comprehension scores (MMLU: 72.4%).

Comparative benchmark: DeepSeek V4 vs. the market

The following table summarizes DeepSeek V4's position against the current competition. Scores for GPT-5 and Claude 3.5 Opus are based on public independent evaluations at the time of writing.

Model Params (Active) MMLU-Pro HumanEval+ MATH-500 Speed (tok/s on A100)
DeepSeek-V4-Pro 37B 75.9% 91.2% 83.7% 68
DeepSeek-V4-Flash 32B 72.4% 86.5% 78.1% 145
GPT-5 (Proprietary) N/A 77.2% 90.8% 85.4% 55
Claude 3.5 Opus N/A 78.1% 89.5% 82.0% N/A (API)
Qwen 2.5 72B 72B (Dense) 70.1% 81.3% 74.5% 108

Analysis: DeepSeek-V4-Pro establishes itself as the undisputed king of weighted open-source, surpassing Llama 3.1 and Qwen 2.5 on pure reasoning tasks. V4-Flash creates a new category: that of an intermediate model offering the performance of a former heavy model (like GPT-4 Turbo) with the velocity of a small dense model.

Implementation guide: Deploying V4-Pro and V4-Flash

Let's get to practice. Here is how to integrate these models into your local pipelines.

1. Loading with HuggingFace Transformers

To use the model in standard inference, it is recommended to use the torch.bfloat16 format if your GPU supports it (Ampere architecture or newer), or FP8 to maximize VRAM.

The HuggingFace Transformers library allows you to load the model weights and the associated tokenizer. You simply need to specify the model identifier on the hub, enable the bfloat16 data type, and implement Flash Attention 2 via the attn_implementation parameter to achieve optimal performance during generation. The model then accepts chat-formatted messages (system, user role) to produce code or complex reasoning.

2. Production deployment with vLLM (Optimized for V4-Flash)

For server inference, vLLM remains the most performant solution, notably thanks to its native support for PagedAttention which aligns perfectly with DeepSeek's MLA.

Deployment is done via the command line by calling the vLLM API script. You must configure the tensor parallelism size according to your number of GPUs (for example, 2 for two cards), define the model's maximum length, and force eager mode to disable CUDA Graph, which prevents memory leaks on complex MoE architectures. Once the server is launched, the endpoint is compatible with the OpenAI SDK: simply change the base URL to localhost:8000 to query V4-Flash with the same methods as the proprietary API.

3. Quantization with GGUF for CPU/Mac deployment

If you do not have server-grade GPUs, DeepSeek V4 (especially Flash) remains usable thanks to the GGUF format. Files quantized in Q4_K_M are available on the HuggingFace community.

The llama.cpp tool serves as a lightweight inference engine. After compilation, it supports compressed GGUF files. To configure it, you need to pass the path to the model file, define the initial prompt, and adjust generation parameters such as maximum length and context size. On Mac, the GPU loading option via Metal can be activated to accelerate processing.

Integration Strategy: Pro vs Flash, Which to Choose?

Having two models from the same generation with different profiles requires a smart routing strategy on the part of developers.

When to use DeepSeek-V4-Pro?

  • Complex data extraction: When an LLM needs to go through a 100-page document to find specific financial entities and structure them into nested JSON.
  • Critical code generation: For infrastructure scripts or algorithms where a logical error is unacceptable.
  • Autonomous agents (Multi-step): Agent systems (like AutoGen or CrewAI) require a model capable of planning, evaluating errors, and looping back without hallucinating. For this specific use case, check out our guide on the best LLMs for AI agents.

When to use DeepSeek-V4-Flash?

  • Classification and routing: Analyzing an incoming user's intent to direct them to the right service.
  • Synthetic RAG: Merging 5 document excerpts and generating a fluid response. Flash's time-to-first-token speed saves valuable time on long contexts. To refine your approach, you can check out our article on fine-tuning vs RAG vs prompting.
  • General public chatbots: Standard chat interfaces where latency (time to first token < 200ms) takes precedence over absolute logical perfection.

Implementing a cost-effective router

A common practice with this model family is to use Flash as a "sorting model", and delegate only the requests identified as complex to Pro.

Specifically, the user's prompt is first sent to V4-Flash with an instruction asking it to classify the request as "simple" or "complex" (based on criteria such as the need to generate code or perform logical calculations). If the model returns the "complex" level in JSON format, the request is forwarded to V4-Pro for in-depth processing. In the event of a parsing error, the system falls back to Flash by default to ensure a fast response.

Common Mistakes

  • Underestimating the VRAM required for Pro: Although only 37 billion parameters are active, the 685 billion total parameters require loading the entire weights into memory if quantization is not correctly configured.
  • Forgetting the --enforce-eager flag with vLLM: Without this parameter, CUDA Graph can cause silent memory leaks on MoE architectures, leading to a server crash after a few hours of production.
  • Using Flash for multi-step agents: Its speed makes it tempting for autonomous agents, but its reduced depth increases the risk of hallucination in long reasoning chains (more than 5 steps).
  • HuggingFace Transformers: The reference library for loading and running open-weights models in Python with native GPU compatibility.
  • vLLM: The essential server inference engine for deploying V4-Flash in production with maximum throughput thanks to PagedAttention.
  • llama.cpp: The lightweight engine ideal for running the GGUF versions of V4-Flash on machines without a dedicated GPU or on MacBooks.
  • LM Studio: A user-friendly graphical interface based on llama.cpp to test V4-Flash locally without writing a single command line.
  • Hostinger: If you plan to host a vLLM API accessible online, their VPS servers offer an excellent performance/price ratio with dedicated GPUs.

FAQ

Is DeepSeek V4 really open-source?
The weights are published as open-weights under a license allowing commercial use, but the exact training code and data are not public. This is the same nuance as for Llama 3.

Can V4-Pro run on a single RTX 4090?
Yes, but only with very aggressive quantization (Q2_K or Q3_K in GGUF), which significantly degrades reasoning performance. To fully leverage V4-Pro, you need at least two GPUs with 24 GB of VRAM each.

Does V4-Flash replace GPT-4o for the general public chatbot?
On paper, its comprehension scores are slightly lower than GPT-4o, but its local throughput is clearly superior. If latency is your absolute priority and you have mastered RAG, V4-Flash is an excellent choice.

How to handle hallucinations with these models?
As with any LLM, output validation is necessary. Recent methods like hallucination detection via a confidence token allow you to filter out doubtful responses without overloading inference.

Summary

  • DeepSeek releases two open-weights models, V4-Pro and V4-Flash, disrupting the proprietary ecosystem.
  • The MoE architecture coupled with MLA V2 makes it possible to maintain very low inference costs compared to an equivalent dense model.
  • V4-Pro directly rivals GPT-5 and Claude 3.5 Opus on reasoning benchmarks (MATH-500, GPQA).
  • V4-Flash offers exceptional throughput (over 150 tok/s on a 4090), ideal for RAG and real-time chat.
  • Both models natively support 128k context and structured function calling.
  • Integration via HuggingFace or vLLM is standardized and does not require exotic adaptations.

Conclusion

The release of DeepSeek V4 is not a simple iteration; it is a demonstration of industrial strength. By simultaneously offering an elite model (Pro) and a model optimized for automation (Flash), DeepSeek breaks the "open-source vs proprietary" dynamic. Proprietary models now struggle to justify their exorbitant costs in the face of freely downloadable weights that compete on real production tasks.

The real question is no longer whether open-source can catch up with closed-source, but how proprietary architectures will survive this release pace. For developers and CTOs, it is urgent to integrate these models into your testing stacks.
```