DeepSeek's DeepEP: the open source lib that optimizes GPU communication for large-scale MoE models
🔎 Why DeepEP is a game-changer for MoE training
Training Mixture-of-Experts (MoE) models with several hundred billion parameters hits a bottleneck that few people mention: inter-GPU communication. The computations are fast, but data transfers between experts distributed across thousands of cards kill performance.
DeepSeek has just made public the exact infrastructure that allows its models to train so efficiently. DeepEP has 9,700+ stars on GitHub, an MIT license, and optimizations that make MoE communication up to 10x faster than standard primitives.
This is the kind of release that moves the starting line for any lab or company that wanted to tackle large-scale MoE training but didn't have the resources of Google or Meta.
The key points
- DeepEP is an open source expert-parallel (EP) communication library, optimized for large-scale MoE models.
- It accelerates all-to-all dispatch/combine operations up to 10x compared to standard primitives, with FP8 support and low-latency kernels.
- Tested on DeepSeek-V3 16B MoE with PyTorch TorchTitan on B200, it enables 41% faster pre-training in MXFP8 without convergence degradation.
- Supports CUDA and ROCm/AMD via the MORI backend, making it portable beyond the NVIDIA ecosystem.
- It is the infrastructure underlying DeepSeek V3.1 and DeepSeek V4, now accessible to the entire community.
Recommended tools
| Tool | Main usage | Price (June 2025, check website) | Ideal for |
|---|---|---|---|
| DeepEP | EP communication for MoE | Open source (MIT) | Teams training large-scale MoEs |
| PyTorch TorchTitan | Distributed pre-training | Open source | DeepEP + MXFP8 integration on B200 |
| Megatron-LM | Distributed LLM training | Open source (NVIDIA) | Full DP/TP/EP pipeline |
| FairScale | Scalability components | Open source (Meta) | Prototyping parallelism strategies |
What is Expert Parallelism and why is it the bottleneck
Expert Parallelism (EP) involves distributing the experts of an MoE model across multiple GPUs. Rather than fitting all experts onto a single card, they are spread out. Each GPU processes the tokens that belong to its experts and sends the others to the relevant GPUs.
The central problem: at every forward and backward pass, a massive all-to-all exchange is required between all GPUs in the EP group. A token produced by GPU 0 might need to land on GPU 47, and vice versa. These all-to-all operations become the limiting factor well before the matrix computation itself.
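To make the all-to-all pattern concrete, here is a minimal pure-Python sketch that counts how many token copies each GPU must send to every other GPU. The even, contiguous expert-to-GPU mapping and the toy routing table are assumptions chosen for illustration; they are not taken from DeepEP.

```python
def expert_to_rank(expert_id, num_experts, ep_size):
    """Rank hosting an expert, assuming experts are sharded evenly and contiguously."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def traffic_matrix(assignments, num_experts, ep_size):
    """Count token copies each rank must send to every other rank.

    assignments[src] lists the expert id chosen for each (token, expert) pair
    that originates on rank `src`. Returns matrix[src][dst].
    """
    matrix = [[0] * ep_size for _ in range(ep_size)]
    for src, expert_ids in enumerate(assignments):
        for expert_id in expert_ids:
            dst = expert_to_rank(expert_id, num_experts, ep_size)
            matrix[src][dst] += 1
    return matrix


# Toy routing: 8 experts over 4 ranks, so 2 experts per rank (illustrative values).
assignments = [
    [0, 3, 7],  # rank 0
    [1, 1, 4],  # rank 1
    [6, 2, 5],  # rank 2
    [7, 0, 3],  # rank 3
]
matrix = traffic_matrix(assignments, num_experts=8, ep_size=4)
total = sum(sum(row) for row in matrix)
remote = total - sum(matrix[r][r] for r in range(4))
print(matrix)
print(f"{remote} of {total} token copies cross a GPU boundary")
```

Every off-diagonal entry is traffic that has to cross the interconnect, and nearly every (source, destination) pair is non-zero, which is why the exchange is called all-to-all.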
In a model like DeepSeek V4 Pro with its overall score of 88 and its massive MoE architecture, the amount of data exchanged at each step is colossal. If the communication is not hyper-optimized, the GPUs spend more time waiting than computing.
This is exactly the problem that DeepEP solves.
DeepEP architecture: what makes the difference
DeepEP does not reinvent the concept of expert parallelism. It optimizes every layer of the communication stack for the specific case of MoE models.
Low-latency dispatch and combine kernels
The dispatch operations (sending tokens to the right experts) and combine (retrieving and assembling the results) are the two key steps of the EP cycle. DeepEP implements custom CUDA kernels for these two operations.
Unlike generic all-to-all calls such as torch.distributed.all_to_all over NCCL, which treat data as opaque blocks, DeepEP kernels understand the structure of MoE data. They can fuse operations, reduce intermediate memory copies, and minimize latency per micro-batch.
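The semantics of dispatch and combine can be sketched in a few lines of single-process Python. This is only a reference model of what the two steps compute, with scalars standing in for hidden vectors and a made-up expert function; it says nothing about how DeepEP implements them on the GPU.

```python
def dispatch(tokens, routes, num_experts):
    """Group token values by destination expert.

    routes[i] is a list of (expert_id, gate_weight) pairs for token i.
    Returns one inbox per expert holding (token_index, gate_weight, value).
    """
    inbox = [[] for _ in range(num_experts)]
    for index, (value, route) in enumerate(zip(tokens, routes)):
        for expert_id, weight in route:
            inbox[expert_id].append((index, weight, value))
    return inbox


def run_experts(inbox):
    """Toy experts (an assumption for the demo): expert e multiplies its input by e + 1."""
    return [
        [(index, weight, value * (expert_id + 1)) for index, weight, value in items]
        for expert_id, items in enumerate(inbox)
    ]


def combine(expert_outputs, num_tokens):
    """Send results back to their token slots and sum them, weighted by the gate."""
    combined = [0.0] * num_tokens
    for results in expert_outputs:
        for index, weight, output in results:
            combined[index] += weight * output
    return combined


tokens = [1.0, 2.0, 3.0]
routes = [
    [(0, 0.5), (2, 0.5)],    # token 0 goes to experts 0 and 2
    [(1, 1.0)],              # token 1 goes to expert 1 only
    [(0, 0.25), (3, 0.75)],  # token 2 goes to experts 0 and 3
]
inbox = dispatch(tokens, routes, num_experts=4)
combined = combine(run_experts(inbox), num_tokens=len(tokens))
print(combined)
```

In a real expert-parallel run, each inbox lives on a different GPU, so both `dispatch` and `combine` become network transfers rather than list appends.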
FP8 and MXFP8 support
Support for FP8 precision is a major asset. An FP8 element takes 1 byte instead of the 2 bytes of BF16, so the token payload transferred between GPUs is roughly halved (scaling factors add a small overhead), which directly reduces the required bandwidth.
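A quick back-of-the-envelope calculation shows where the saving comes from. The token count, hidden size, and top-k below are illustrative assumptions, and the scale overhead assumes MX-style block scaling (one 8-bit shared scale per 32-element block); the exact wire format used by a given DeepEP build may differ.

```python
def dispatch_payload_bytes(num_tokens, hidden_size, top_k, bytes_per_element,
                           scale_bytes_per_block=0, block_size=32):
    """Bytes sent during dispatch: one hidden vector per (token, expert) pair,
    plus optional per-block scale factors."""
    elements = num_tokens * top_k * hidden_size
    scale_bytes = (elements // block_size) * scale_bytes_per_block
    return elements * bytes_per_element + scale_bytes


# Illustrative shapes, not taken from any published configuration.
shape = dict(num_tokens=4096, hidden_size=4096, top_k=8)

bf16_bytes = dispatch_payload_bytes(**shape, bytes_per_element=2)
mxfp8_bytes = dispatch_payload_bytes(**shape, bytes_per_element=1,
                                     scale_bytes_per_block=1)
ratio = mxfp8_bytes / bf16_bytes
print(f"BF16:  {bf16_bytes / 2**20:.0f} MiB")
print(f"MXFP8: {mxfp8_bytes / 2**20:.0f} MiB ({ratio:.1%} of BF16)")
```

Under these assumptions the 8-bit payload with scales is a little over half the BF16 volume, so "halved" is a close approximation rather than an exact figure.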
Experimentation conducted by the PyTorch team with TorchTitan on B200 GPUs showed that the combination of MXFP8 + DeepEP on a 16B MoE DeepSeek-V3 model yields pre-training that is 41% faster, with convergence equivalent to BF16. No measurable degradation, no need for complex fine-tuning.
Group-limited gating algorithm
DeepSeek-V3 introduced a specific routing algorithm: group-limited gating. Instead of sending each token to the top N experts among all available experts, the choice is limited to a subgroup. This reduces the number of possible destinations per token.
DeepEP is specifically optimized for this communication pattern. The kernels take advantage of the fact that destinations are pre-constrained to better organize buffers and reduce the fragmentation of sends.
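The routing idea can be sketched as a two-stage selection: pick the best groups first, then pick the top experts inside those groups only. The function below is a simplified illustration for one token. Scoring a group by its single best expert is an assumption made here for brevity, and the parameter names (`n_group`, `topk_group`) are illustrative; the rule used by DeepSeek-V3 may differ in detail.

```python
def group_limited_topk(scores, n_group, topk_group, top_k):
    """Select top_k experts for one token, restricted to the topk_group best groups.

    scores holds one router score per expert; experts are split into n_group
    contiguous groups of equal size.
    """
    group_size = len(scores) // n_group
    groups = [range(g * group_size, (g + 1) * group_size) for g in range(n_group)]

    # Stage 1: rank groups (here by their best expert, a simplification).
    group_scores = [max(scores[e] for e in group) for group in groups]
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]

    # Stage 2: ordinary top-k, but only over experts in the kept groups.
    candidates = [e for g in kept for e in groups[g]]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen)


scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.3, 0.4]  # 8 experts, toy values
limited = group_limited_topk(scores, n_group=4, topk_group=2, top_k=3)
unrestricted = group_limited_topk(scores, n_group=1, topk_group=1, top_k=3)
print("group-limited:", limited)
print("plain top-k:  ", unrestricted)
```

With groups mapped to nodes, the group-limited choice touches at most `topk_group` destinations per token, whereas plain top-k can scatter a token across any of them.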
Portability: CUDA and ROCm
This is an often underestimated point. DeepEP is not limited to NVIDIA CUDA. MORI backend support allows the same communication logic to run on AMD GPUs with ROCm.
For the open-source community, this is strategic. It means that DeepSeek-level MoE models can be trained without being locked into the NVIDIA ecosystem, with all the hardware flexibility that implies.
Benchmarks: the numbers that speak
The data published by the PyTorch blog (June 2025) and the official website deepep.org provide a clear picture of the performance gain.
DeepEP vs standard All-to-All primitives
| Scenario | Standard primitive | DeepEP | Gain |
|---|---|---|---|
| All-to-all dispatch (FP16) | Baseline | ~10x faster | 10x |
| All-to-all combine (FP16) | Baseline | ~10x faster | 10x |
| Full EP cycle (MXFP8, B200) | Baseline BF16 | 41% faster | 1.41x |
The 10x gain on raw primitives comes from the Perplexity AI source, which benchmarked DeepEP under controlled MoE communication conditions. The 41% gain on full pre-training is more conservative because it includes the entire pipeline (computation, optimizer, checkpointing), not just communication.
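The gap between a 10x primitive speedup and a much smaller end-to-end gain follows from Amdahl's law: only the communication share of a step gets faster. The sketch below is a simplified model that assumes communication and computation do not overlap; the fractions are illustrative, and it does not attempt to reproduce the 41% figure, which also includes the effect of MXFP8 on computation.

```python
def end_to_end_speedup(comm_fraction, comm_speedup):
    """Amdahl's law: overall speedup when only the communication share
    of the step time is accelerated by comm_speedup."""
    remaining = (1.0 - comm_fraction) + comm_fraction / comm_speedup
    return 1.0 / remaining


for fraction in (0.2, 0.4, 0.6):
    speedup = end_to_end_speedup(fraction, comm_speedup=10.0)
    print(f"communication = {fraction:.0%} of step time -> {speedup:.2f}x overall")
```

Even an infinitely fast all-to-all cannot exceed `1 / (1 - comm_fraction)`, which is why the full-pipeline number is always more modest than the raw kernel benchmark.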
Convergence: FP8 vs BF16
The PyTorch study is clear: on the DeepSeek-V3 16B MoE model, the loss curve in MXFP8 with DeepEP perfectly overlaps with that of BF16. No divergence, no premature plateau. The reduced precision does not impact the quality of the final model.
This is an important result because FP8 has long been considered too risky for pre-training. DeepEP demonstrates that in the specific context of EP communication, FP8 is not only safe but beneficial.
DeepEP in the ecosystem: comparison with Megatron and FairScale
The distributed training framework landscape is dominated by two names: Megatron-LM (NVIDIA) and FairScale (Meta). DeepEP does not seek to replace them but to fill a specific gap.
Megatron-LM
Megatron implements expert parallelism but in a relatively generic way. It uses standard NCCL primitives for all-to-all and does not offer custom kernels for MoE dispatch/combine.
DeepEP can in fact be integrated into a Megatron pipeline by replacing the EP communication layer. This is actually what the ecosystem is starting to do: keep Megatron's scheduling but plug DeepEP underneath for token transfer.
FairScale
FairScale offers modular components (FullyShardedDataParallel, sequence parallelism, etc.) but lacks advanced MoE specialization. Its design is geared towards research and rapid prototyping, not towards maximum performance at 1000+ GPUs.
DeepEP is the opposite: it is production code, tested at the scale of DeepSeek V4 Pro (Max), aiming for the last percent of performance.
DeepEP's position
DeepEP positions itself as a specialized communication library, not as a complete framework. It does one thing and does it extremely well: transferring tokens between MoE experts as fast as possible. It is meant to be integrated into a larger framework rather than used alone.
Practical use cases: who should use DeepEP
Training an MoE model from scratch
If your team plans to train an MoE model with more than 30B total parameters on a cluster of 8+ GPUs, DeepEP is not optional. Without dedicated EP optimization, communication will consume 40 to 60% of the total time. With DeepEP, this fraction drops drastically.
This is particularly relevant for teams working on architectures similar to Qwen3.6-27B or Qwen3.5-122B-A10B, which also use MoE architectures with a small number of active parameters.
Fine-tuning a large existing MoE
Distributed fine-tuning of models like DeepSeek V4 Pro or Kimi K2.6 also requires expert parallelism if the model does not fit on a single node. DeepEP accelerates exchanges during fine-tuning forward/backward passes, not just during pre-training.
Research on expert routing
Researchers experimenting with new gating algorithms (variable top-k, noise-based routing, expert choice) need a reliable and fast communication infrastructure. DeepEP provides this foundation without researchers having to rewrite low-level kernels.
Local runs for prototyping
For initial small-scale prototyping, you can use local LLMs with Ollama or LM Studio. But as soon as you move on to distributed training or fine-tuning, DeepEP becomes relevant. The local LLM installation guide remains the first step to understand MoE models before scaling up.
Technical integration: how to use DeepEP in practice
The most documented integration is with PyTorch and TorchTitan. The PyTorch blog details the steps to configure a DeepSeek-V3 16B MoE run with MXFP8 and DeepEP on B200s.
Prerequisites
- Multi-GPU cluster with InfiniBand or NVLink interconnect (network latency becomes the limiting factor with such optimized kernels).
- PyTorch compiled with CUDA or ROCm support.
- DeepEP cloned from the official GitHub repo and compiled.
Typical configuration
DeepEP integrates at the MoE layer level. Instead of calling torch.distributed.all_to_all for dispatch and combine, you call DeepEP functions which handle tensor formatting, optional FP8 quantization, and sending via the optimized kernels.
DeepSeek-V3's group-limited gating must be enabled in the model config for DeepEP to take full advantage of its optimizations. Without this group constraint, the gains are lower because the communication patterns are less predictable.
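Before launching a run, it can help to sanity-check the routing settings against the EP layout. The dataclass below is a hypothetical sketch: the field names (`n_group`, `topk_group`, `ep_size`) are illustrative and do not correspond to a documented TorchTitan or DeepEP config schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoERoutingConfig:
    """Hypothetical config fields; real frameworks name these differently."""
    num_experts: int
    ep_size: int         # number of GPUs in the expert-parallel group
    top_k: int           # experts activated per token
    n_group: int = 1     # number of expert groups (1 means no grouping)
    topk_group: int = 1  # groups a token may route into

    def problems(self):
        """Return human-readable issues; an empty list means the config looks consistent."""
        fields = (self.num_experts, self.ep_size, self.top_k, self.n_group, self.topk_group)
        if min(fields) < 1:
            return ["all fields must be positive integers"]
        issues = []
        if self.num_experts % self.ep_size:
            issues.append("num_experts is not divisible by ep_size")
        if self.num_experts % self.n_group:
            issues.append("num_experts is not divisible by n_group")
        if self.topk_group > self.n_group:
            issues.append("topk_group exceeds n_group")
        reachable = (self.num_experts // self.n_group) * min(self.topk_group, self.n_group)
        if self.top_k > reachable:
            issues.append("top_k exceeds the experts reachable through the selected groups")
        if self.n_group == 1:
            issues.append("group-limited gating is disabled (n_group == 1)")
        return issues


grouped = MoERoutingConfig(num_experts=64, ep_size=8, top_k=6, n_group=8, topk_group=3)
ungrouped = MoERoutingConfig(num_experts=64, ep_size=8, top_k=6)
print(grouped.problems())
print(ungrouped.problems())
```

Returning a list of issues rather than raising on the first one makes it easy to surface every misconfiguration in a single pass before a long job is queued.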
Monitoring
DeepEP exposes communication metrics (transferred volume, time per operation, bandwidth utilization) that allow you to verify that EP is not the bottleneck. If the metrics show that communication time exceeds 20% of the compute time, it means EP scaling has reached its limits for the given configuration.
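The 20% rule of thumb is easy to automate once per-step timings are available. The helper below is a minimal sketch that assumes you have already collected communication and compute times in milliseconds (for example from a profiler); how those numbers are gathered is outside its scope, and the function name is hypothetical.

```python
from statistics import mean


def ep_communication_report(comm_ms, compute_ms, threshold=0.20):
    """Compare average communication time to average compute time per step.

    comm_ms and compute_ms are per-step timings in milliseconds.
    Flags the run when communication exceeds `threshold` of compute time.
    """
    ratio = mean(comm_ms) / mean(compute_ms)
    return {"comm_to_compute": ratio, "ep_limited": ratio > threshold}


# Illustrative timings, not measurements.
healthy = ep_communication_report([12.0, 14.0, 13.0], [100.0, 98.0, 102.0])
saturated = ep_communication_report([30.0, 32.0], [100.0, 100.0])
print(healthy)
print(saturated)
```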
Impact on the democratization of MoE training
Until recently, only a few players (Google with GShard, Meta with its early internal MoE models) mastered very large-scale MoE training. Communication kernels were industrial secrets.
DeepSeek changes the game by publishing the exact same library it uses in production. The message is clear: MoE performance is not a mysterious art reserved for a few, it is engineering that anyone can reproduce.
For the best open source LLMs, this means that the next generation of MoE models will not only come from DeepSeek or Alibaba. University labs, startups, and open source communities can now access the same level of infrastructure optimization.
The release of DeepEP under the MIT license (like DeepSeek V3.1) is part of this strategy of democratization through infrastructure.
Limits and considerations
DeepEP is not a magic solution. There are prerequisites and limits to understand.
Break-even point
On a cluster of 2 to 4 GPUs, the overhead of installing and configuring DeepEP is probably not worth it. Gains become significant starting from 8 GPUs in EP, and spectacular at 64+ GPUs.
Hardware dependency
The optimized kernels are designed for modern data-center GPUs (A100, H100, B200). On older hardware or consumer GPUs, the gains will be reduced because the kernels leverage specific hardware features (Tensor Cores, NVLink, and native FP8 support, which only arrives with H100-class hardware and later).
Use case coverage
DeepEP is optimized for the specific communication pattern of DeepSeek-V3/V4 (group-limited gating, top-k routing). For very different MoE architectures (expert choice, routing based on the entire sequence), some of the optimizations might not apply directly.
Maturity
Despite the 9700+ stars, it is a relatively young project. The documentation is technical and assumes a good understanding of distributed parallelism. There is no step-by-step tutorial for beginners.
❌ Common mistakes
Mistake 1: Confusing DeepEP with a complete training framework
DeepEP does not handle the optimizer, data loading, checkpointing, or tensor parallelism. It is an EP communication component to be integrated into an existing framework (TorchTitan, Megatron, or your own pipeline). Using it alone is useless.
Mistake 2: Enabling DeepEP on a network-undersized cluster
If your GPUs are connected via standard Ethernet without InfiniBand or NVLink, DeepEP's optimized kernels will not be able to reach their full potential. The network link, not the kernels, will be the bottleneck. Check your interconnect before investing time in integration.
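A rough lower bound on transfer time makes the point: no kernel can move bytes faster than the link allows. The sketch below ignores latency and protocol overhead, and the payload size is an illustrative assumption. The two bandwidths are plain unit conversions (400 Gb/s is 50 GB/s, 10 Gb/s is 1.25 GB/s), not measurements of any particular cluster.

```python
def min_transfer_time_ms(num_bytes, bandwidth_gb_per_s):
    """Lower bound on transfer time in milliseconds, ignoring latency and overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


payload_bytes = 1_000_000_000  # 1 GB exchanged per step, illustrative

fast_link_ms = min_transfer_time_ms(payload_bytes, bandwidth_gb_per_s=50.0)
slow_link_ms = min_transfer_time_ms(payload_bytes, bandwidth_gb_per_s=1.25)
print(f"400 Gb/s link: at least {fast_link_ms:.0f} ms per step")
print(f"10 Gb/s link:  at least {slow_link_ms:.0f} ms per step")
```

If the slow-link figure is already comparable to your compute time per step, faster kernels will not help until the interconnect is upgraded.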
Mistake 3: Using FP8 without checking convergence
The PyTorch study shows that MXFP8 works without degradation on DeepSeek-V3 16B MoE. This does not mean it is guaranteed for any MoE model. Always test convergence in FP8 vs BF16 on your specific use case before committing to a long pre-training run.
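One simple way to run that check is to train both precisions for the same number of steps on the same data and compare the logged losses. The helper below is a minimal sketch; the 2% tolerance and the loss values are hypothetical, and a real comparison should also account for run-to-run noise across seeds.

```python
def max_relative_gap(reference, candidate, skip=0):
    """Largest relative difference between two loss curves of equal length.

    `skip` drops the first steps, where curves are noisy and losses change fast.
    """
    if len(reference) != len(candidate):
        raise ValueError("loss curves must have the same number of steps")
    pairs = list(zip(reference[skip:], candidate[skip:]))
    if not pairs:
        raise ValueError("no steps left to compare after skipping")
    return max(abs(cand - ref) / abs(ref) for ref, cand in pairs)


# Hypothetical logged losses at matching steps.
bf16_loss = [4.0, 3.0, 2.5, 2.0]
fp8_loss = [4.0, 3.03, 2.5, 2.02]

gap = max_relative_gap(bf16_loss, fp8_loss)
tolerance = 0.02  # hypothetical acceptance threshold
print(f"max relative gap: {gap:.2%}, within tolerance: {gap <= tolerance}")
```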
Mistake 4: Ignoring group-limited gating
DeepEP is optimized for the group-limited gating algorithm. If you use standard top-k routing without group constraints, you lose some of the optimizations. Make sure your model architecture is compatible with DeepEP's assumptions.
❓ Frequently Asked Questions
Does DeepEP replace NCCL?
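No. DeepEP replaces only the generic all_to_all calls used for expert-parallel dispatch and combine, substituting custom kernels that format and optimize the data before sending. The other collectives in a training job (all-reduce, all-gather) still go through NCCL. For its own inter-node transport, the upstream README lists NVSHMEM as the dependency, with NVLink used inside a node.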
Does DeepEP work with non-MoE models?
No, that is not its intended use. DeepEP is specifically designed for the communication patterns of Mixture-of-Experts models. For pure tensor parallelism or data parallelism, use standard tools.
What is the minimum number of GPUs to benefit from DeepEP?
In practice, 8 GPUs in an EP configuration. Below that, the integration overhead exceeds the gains. All significant benchmarks are at 64+ GPUs.
Does DeepEP support inference or only training?
The dispatch/combine kernels are used for both. The upstream README describes two kernel families: high-throughput kernels for training and inference prefill, where communication is repetitive and the volume is massive, and low-latency kernels for inference decoding. On the inference side, other optimizations (continuous batching, KV cache sharing) often matter at least as much.
Can DeepEP be used with AMD GPUs?
Yes, via the MORI backend which supports ROCm. This is one of DeepEP's strengths compared to NVIDIA-only solutions.
✅ Conclusion
DeepEP is the infrastructure component the open-source community was waiting for to seriously compete in large-scale MoE training. By releasing the same library that powers DeepSeek V4, DeepSeek isn't just sharing code: they are reshaping the open-source AI landscape. If you are considering a serious MoE project, DeepEP is now the mandatory starting point for your tech stack.