NVIDIA Nemotron 3 Ultra 550B: The most powerful open-source model in the US arrives at Computex

LLM & Modèles 🟢 Beginner ⏱️ 14 min read 📅 2026-06-04

NVIDIA Nemotron 3 Ultra 550B: the most powerful open-source model in the US arrives at Computex

🔎 Computex 2026 marks a turning point: the US re-enters the open-weights war

Jensen Huang took the stage on June 1, 2026, at Computex Taipei with a clear message. America is no longer ceding the open-source ground to Chinese models. Nemotron 3 Ultra, with 550 billion parameters, is NVIDIA's direct response to DeepSeek V4 Pro and MiniMax M3.

The geopolitical context is anything but trivial. Since late 2025, Chinese open-weights models have dominated the Artificial Analysis rankings. The US had nothing comparable in terms of combined power and accessibility. Nemotron 3 Ultra changes the game — and not just a little.

It is also a signal sent to Meta, whose first closed model from the Superintelligence Lab caused an earthquake in the open-source community. NVIDIA takes the opposite approach: open-weights, no compromises.

The essentials

550B parameters, 55B active in MoE architecture with 90% sparsity — an unprecedented power/cost ratio on the American side.
AA Intelligence Index score of 48, far ahead of Gemma 4 (39) and the previous Nemotron 3 Super (36). It is the most intelligent open-weights model in the US.
Pre-trained on 25T+ tokens with a context window of 1 million tokens and a throughput of 300+ tok/s.
Optimized for AI agents and integrated into the NVIDIA stack (NIM microservices, RTX Spark).
Weights in open-source, in a context where the US had fallen behind strategically compared to Chinese models.

Recommended Tools

Tool	Main Usage	Price (June 2026, check website)	Ideal for
NVIDIA NIM	Deployment and inference of Nemotron 3 Ultra	Free (self-hosted) / Variable cloud pricing	Enterprise developers
OpenRouter	API access to Nemotron 3 Ultra vs competitors	Pay-per-use	Comparisons and quick tests
RTX Spark	Local execution on NVIDIA GPUs	Included with NVIDIA drivers	Users with RTX GPU

MoE Architecture: How 55B active parameters rival models 10x heavier

Nemotron 3 Ultra uses a Mixture of Experts (MoE) architecture with a key feature: 90% sparsity. Specifically, out of the 550 billion total parameters, only 55 billion are activated at each inference.

This is the same principle behind the success of DeepSeek V4 Pro, but NVIDIA pushes the logic further with what they call the LatentMoE architecture. Instead of mechanically activating fixed experts, routing is done in a latent space, allowing for a more dynamic and precise allocation of computational resources.

The result is measurable: 300+ tokens per second in generation, a throughput that enables real-time interactions even with 1 million context tokens. For comparison, most models of this size peak below 100 tok/s in standard configuration.

This efficiency comes with a hardware cost, but it remains contained. A dense 550B model would require several A100s just to load into memory. Nemotron 3 Ultra, thanks to sparsity, runs on significantly more accessible configurations — which is precisely the strategic point.

Benchmarks : Nemotron 3 Ultra vs DeepSeek V4 Pro vs MiniMax M3

The real test is the confrontation with the models currently dominating the open-weights space. Artificial Analysis is publishing a direct comparison between Nemotron 3 Ultra and MiniMax M3, and the numbers speak for themselves.

Model	Parameters	Active	AA Intelligence Index	Context	Throughput
Nemotron 3 Ultra	550B	55B	48	1M tokens	300+ tok/s
MiniMax M3	456B	45B	~43	1M tokens	~200 tok/s
DeepSeek V4 Pro (Max)	—	—	88 (general)	128K tokens	~150 tok/s
Gemma 4	Variable	Variable	39	Variable	Variable
Nemotron 3 Super	Variable	Variable	36	Variable	Variable

Two things stand out. First, Nemotron 3 Ultra clearly dominates in the purely American open-weights category, with a 12-point lead over Gemma 4. Second, when facing Chinese models, the gap narrows but generally persists — DeepSeek V4 Pro (Max) reaches an 88 on the general ranking according to our benchmark data.

The important nuance: the "general" and "agentic" rankings do not measure the same thing. Nemotron 3 Ultra is specifically optimized for agentic tasks, whereas DeepSeek V4 Pro excels in pure general reasoning. The direct comparison is therefore more nuanced than the raw scores suggest.

Source: Artificial Analysis — Nemotron 3 Ultra announced and OpenRouter — Comparaison Nemotron 3 Ultra vs MiniMax M3

The American response to Chinese hegemony in open-weights

For over a year, the narrative has been relentless: China dominates open-weights, the US is locking itself behind proprietary models. DeepSeek, MiniMax, Kimi K2.6 — every month brought its batch of high-performance open-weights models.

Nemotron 3 Ultra is the first American response that measures up. Not a proprietary model in disguise, not "open-weights but with a restrictive license" — the weights are available, which is what matters to the community.

The timing is no coincidence. At a time when OpenSeeker-v2 breaks the monopoly of industrial search agents and ByteDance's DeerFlow pushes the open-source agent towards the long term, NVIDIA had to show that the American ecosystem can produce competitive open-weights models.

The geopolitical dimension goes beyond a simple techno-benchmark. Chinese open-weights models have become a tool of soft power. Every developer who adopts DeepSeek or MiniMax buys into an ecosystem controlled by Beijing. Nemotron 3 Ultra offers a credible alternative, integrated into the NVIDIA stack that millions of developers already use.

Designed for AI Agents: why it's strategic

NVIDIA isn't just releasing a large model for prestige. Nemotron 3 Ultra is explicitly optimized for multi-agent systems, as detailed by DataCamp in its analysis of the architecture.

What does that mean in practice? The model is trained to maintain consistency across long chains of actions, manage multiple subtasks simultaneously, and produce structured outputs (JSON, tool calls) with superior reliability. This is exactly what AI agent frameworks require.

This is where the connection with choosing the best LLMs for AI agents becomes critical. An agentic model must be fast (for iterative reasoning loops), reliable (no hallucinations on tool calls), and capable of handling long context (to maintain the state of a complex conversation).

Nemotron 3 Ultra checks all three boxes brilliantly: 300+ tok/s for speed, agent-specific training for reliability, and 1M tokens of context for memory. For developers building agents IA open source avec Ollama en local, this model is a game-changer — provided they have the hardware.

Integration with the NVIDIA stack: NIM, RTX Spark, and the ecosystem

An open-weights model is good. An open-weights model that natively integrates into an existing deployment ecosystem is better. This is exactly what NVIDIA did with Nemotron 3 Ultra.

NIM microservices allow you to deploy the model in production with a few command lines. No complex configuration, no hit-or-miss compatibility — NVIDIA controls the entire chain, from the model to the runtime. This is a massive competitive advantage over DeepSeek or MiniMax, which do not have this level of vertical integration.

RTX Spark is the other piece of the puzzle. NVIDIA is pushing the local execution of heavy models on consumer GPUs. Nemotron 3 Ultra, with its 55B active parameters, is theoretically executable on a multi-GPU RTX 5090 setup — a scenario that will directly interest those looking to installer un LLM en local.

For those who want to compare with the meilleurs LLM à run en local, it will be necessary to test under real-world conditions. The 300+ tok/s throughput is measured on server infrastructure, not on a desktop PC. But the direction is clear: NVIDIA wants Nemotron 3 Ultra to become the reference model for local developers.

How to access Nemotron 3 Ultra

Three access methods are available at launch.

Via NIM (recommended for production): Downloading the weights from the official NVIDIA Research page and deploying via NIM microservices. This is the most optimized method, but it requires compatible GPU infrastructure.

Via OpenRouter (for testing): Accessible via a pay-per-use API, which allows you to test the model without investing in hardware. The OpenRouter comparison page even allows for direct A/B testing against MiniMax M3.

Via RTX Spark (for local): Integrated into recent NVIDIA drivers, this option is aimed at advanced users with multi-GPU configurations. The complete developer guide from WowHow details the exact hardware requirements.

To be honest: Nemotron 3 Ultra is not a model you will run on a laptop. Even with 90% sparsity, 55B active parameters require at least 110-120 GB of VRAM in fp16, or 60-70 GB in 4-bit quantization. This is multi-GPU or server territory.

Nemotron 3 Ultra vs. the best LLMs on the market

To position Nemotron 3 Ultra in the global landscape, it must be compared to the proprietary and open-weights models that currently dominate.

On the agentic side, the top is occupied by closed models: GPT-5.5 (score 98.2), Gemini 3 Pro Deep Think (95.4), Claude Opus 4.7 Adaptive (94.3). Nemotron 3 Ultra, with its AA score of 48, does not directly compete with these monsters. But that is not its market.

Its true battleground is open-weights, and there, it takes the lead of the American pack. The meilleurs LLM gratuits like ChatGPT Free or Gemini offer easy access but without control. Nemotron 3 Ultra offers total control of the weights, which fundamentally changes the value proposition.

For developers who consult the comparatif mensuel des meilleurs LLM, Nemotron 3 Ultra will likely enter the top of the open-weights models in the next ranking. Its relevance will mainly depend on community adoption — open-source weights are only worth as much as the ecosystem built around them.

The Nemotron 3 Family: Nano, Super, Ultra

Nemotron 3 Ultra is not an isolated model. NVIDIA has structured a complete family of three models, each targeting a specific segment, as detailed on the NVIDIA Research page.

Nemotron 3 Nano: Lightweight model for execution on constrained devices (edge, mobile). Designed for simple classification and extraction tasks.

Nemotron 3 Super: The mid-range model, with an AA score of 36. Suited for standard reasoning tasks and deployment on single-GPU servers.

Nemotron 3 Ultra: The flagship, 550B, optimized for agents and complex tasks. This is the model driving NVIDIA's strategy.

This three-tier segmentation mirrors what Google does with Gemma or Meta with Llama, but with a major difference: each level is optimized for agentic use cases, not just for chat or text completion.

25T+ pre-training tokens: what it really means

25 trillion tokens. This figure, reported by MemeBurn, deserves some attention.

For comparison, Llama 3.1 was pre-trained on ~15T tokens. Nemotron 3 Ultra therefore goes 66% further in data volume. But quantity isn't everything — the quality of the dataset and the curriculum strategy (the order in which data is presented) are just as crucial.

NVIDIA did not detail the exact composition of the dataset, but Pasquale Pillitteri notes a significant proportion of synthetic data generated by previous NVIDIA models, along with code and structured reasoning data. This is consistent with the model's agentic orientation.

The massive pre-training also explains why the model achieves high performance despite a MoE architecture which, by nature, sees less data per expert than an equivalent dense model. The overcompensation in volume makes up for the under-exposure per expert.

90% Sparsity: The Key Technical Innovation

The 90% sparsity figure is repeated across all sources, but its technical importance is often underestimated. Kilo AI explains it clearly: this means that 9 out of 10 parameters are inactive at each forward pass.

The benefit is twofold. In terms of memory, only the activated experts need to be loaded, which drastically reduces VRAM requirements. In terms of compute, matrix multiplications only apply to 10% of the weights, which explains the 300+ tok/s throughput.

The challenge lies in the routing. A poor MoE router sends tokens to the wrong experts, and performance collapses. This is where NVIDIA's LatentMoE architecture makes a difference: instead of discrete routing (expert A or B), routing takes place in a continuous space, allowing for more nuanced combinations of experts.

This is an evolution over first-generation MoE architectures (like Mixtral's) and even over that of DeepSeek V3, which uses more conventional routing. NVIDIA has clearly learned from Chinese models to take things a step further.

Limitations and points of caution

Everything is not perfect. Nemotron 3 Ultra has limitations that must be understood before adopting it.

The hardware barrier remains high. Despite sparsity, 55B active parameters require serious infrastructure. This is not a model that the majority of individual developers will be able to run locally without significant investment.

The evaluation ecosystem is still young. Unlike Llama or Gemma, which benefit from thousands of fine-tunes and community evaluations, Nemotron 3 Ultra has just been released. NVIDIA's benchmarks are promising, but independent validation will take weeks.

The open-weights license is not open-source in the strict sense. "Open-weights" means that you can download and use the weights, but the commercial terms and usage restrictions may vary. You will need to read the license carefully before deploying in production.

Dependence on the NVIDIA stack is an advantage (smooth integration) but also a trap. If your infrastructure is not 100% NVIDIA, the experience will be degraded. AMD users or those using non-NVIDIA clouds will have to make compromises.

❌ Common mistakes

Mistake 1: Confusing open-weights and open-source

Nemotron 3 Ultra is open-weights, not open-source. The weights are downloadable, but the training code, datasets, and methodology are not public. This is an important distinction for open-source purists.

Mistake 2: Directly comparing the AA 48 score with general scores (80+)

Nemotron 3 Ultra's AA Intelligence Index (48) specifically measures capabilities within the open-weights category. Directly comparing it with DeepSeek V4 Pro's general score (88) makes no sense — they are not on the same scales nor the same evaluations.

Mistake 3: Underestimating hardware requirements

"55B active parameters, it should run on a 4090" — no. Even in 4-bit, 55B parameters require ~60 GB of VRAM. Plan for at least two high-end GPUs or a dedicated cloud instance.

Mistake 4: Ignoring the agentic orientation

Nemotron 3 Ultra is not optimized for casual chat or creative generation. If that's what you're looking for, the meilleurs LLM en français or generalist models will be more suitable. This model shines in structured reasoning chains and tool calls.

❓ Frequently Asked Questions

Is Nemotron 3 Ultra really open-source?

No, it is open-weights. You can download and use the model weights, but NVIDIA does not publish the training code or datasets. This is sufficient for deployment and fine-tuning, but not for reproducing the training.

Can Nemotron 3 Ultra run on a PC?

Theoretically yes, with a minimum of two NVIDIA RTX 5090 GPUs (24 GB each) in a multi-GPU configuration and aggressive 4-bit quantization. In practice, this is a rare setup and the throughput will be well below the announced 300+ tok/s.

Is Nemotron 3 Ultra better than DeepSeek V4 Pro?

Among purely American open-weights, yes. In absolute performance, DeepSeek V4 Pro (Max) still leads the overall rankings. But Nemotron 3 Ultra is specifically optimized for agents, a domain where the direct comparison is more nuanced.

When will the weights be available?

The weights are available starting June 1, 2026, via the NVIDIA Research page and NIM microservices. Deployment via OpenRouter is also active at launch.

Does Nemotron 3 Ultra handle French?

NVIDIA has not released any French-specific benchmarks. Being pre-trained on 25T+ multilingual tokens, it should perform decently, but for specifically Francophone tasks, checking out the best LLMs in French remains relevant.

✅ Conclusion

Nemotron 3 Ultra is the model the United States needed to release: a powerful open-weights model, optimized for agents, and integrated into the NVIDIA ecosystem that millions of developers already use. It doesn't beat DeepSeek V4 Pro in raw performance, but it closes a strategic gap that had dangerously widened. For developers building multi-agent systems who have the infrastructure, this is the new American reference model — and it needed to happen.

#intelligence-artificielle #deepseek-v4-pro #jensen-huang #modele-open-source #nvidia-nemotron-3-ultra-550b #computex-2026

📚 Related articles

LLM & Modèles 🟢 Débutant 12 min

July 17: Gemini 3.5 Pro and Shanghai's WAIC collide — the day AI officially goes bipolar

On July 17, 2026, the Gemini 3.5 Pro launch and Shanghai WAIC illustrate two opposing visions. Discover this key day for AI.

2026-07-14 17:03

LLM & Modèles 🟢 Débutant 14 min

GPT-Live : OpenAI launches full-duplex voice — AI agents can finally listen and speak at the same time

OpenAI launches GPT-Live with full-duplex voice. Discover how AI agents can finally listen and speak at the same time.

2026-07-13 15:04

LLM & Modèles 🟢 Débutant 11 min

Meta Muse Spark 1.1 : Meta launches its first paid model and enters the agentic coding battle

Discover Meta Muse Spark 1.1, Meta's first paid model. The giant enters the agentic coding battle and changes strategy.

2026-07-11 15:02

📑 Table of contents