📑 Table of contents

Ollama 0.30 switches to llama.cpp: the architectural revolution changing local AI

Self-Hosting 🟢 Beginner ⏱️ 12 min read 📅 2026-06-06

Ollama 0.30 switches to llama.cpp: the architectural revolution changing local AI

🔎 The legacy GGML backend is dead — and it's the best news for local AI

Ollama had dominated the local AI landscape for two years with a simple promise: one ollama run and you're off. Behind this simplicity, the engine used GGML, an inference backend created by Georgi Gerganov before he pivoted to llama.cpp. The problem? GGML had become a ghost. GGUF models (the successor format) were constantly evolving, and support in GGML was lagging behind. The result: models that wouldn't load, ignored architectures, and sluggish performance on certain machines.

In June 2026, Ollama 0.30 changes everything. The team ripped out the GGML backend to replace it natively with llama.cpp. This isn't just another update. It's a core engine overhaul that affects every user running an LLM on their machine. GGUF compatibility explodes, performance improves across a much wider range of hardware, and new models like the NVIDIA Nemotron 3 Ultra arrive in the ecosystem. The catch: tokenization bugs and community tension over open-source attribution that isn't going away anytime soon.


The Essentials

  • Ollama 0.30 drops the legacy GGML backend and migrates natively to llama.cpp for all GGUF inference.
  • Flash attention is enabled by default on Qwen and Gemma models, with measurable speed gains.
  • The MLX engine remains retained on Apple Silicon — Ollama isn't going all-llama.cpp, it chooses the best engine per platform.
  • NVIDIA Nemotron 3 Ultra (55B active, 550B total) arrives in cloud mode via Ollama, optimized for agent workflows.
  • Multi-byte tokenization bugs have been reported during the migration, particularly affecting certain Unicode characters.
  • Community tension over attribution between Ollama and the llama.cpp project resurfaces.

Outil Main usage Price (June 2026, check on ollama.com) Ideal for
Ollama 0.30 Local LLM inference Free (open source) All local AI users
Ollama vs LM Studio Local tools comparison Free Choosing your inference tool
Meilleurs modèles Ollama Model selection Variable Finding the right GGUF model
Agents IA Ollama Local agent workflows Free Automating with local LLMs
Hostinger Hosting for AI APIs Starting from 2,99 €/mo Deploying web interfaces

GGML → llama.cpp : what actually changes under the hood

Ollama hadn't changed its inference engine since its creation. GGML did the job, but with limitations that became unbearable in 2026. The GGUF format continued to evolve thanks to the work around llama.cpp, and GGML didn't keep up. Each new version of GGUF added features that the legacy backend didn't support.

The migration to llama.cpp solves this in one fell swoop. According to the official Ollama blog, this change improves GGUF compatibility across a much wider range of hardware. In practice, models that refused to load on certain GPU or CPU architectures now load without any issues. The v0.30.0 backend no longer uses the GGML format at all to load models. Everything goes through llama.cpp natively.

What's fascinating is that Ollama doesn't sacrifice its platform optimizations. On Apple Silicon, the MLX engine remains active. Ollama 0.30 intelligently selects the most suitable backend: MLX on Mac, llama.cpp everywhere else. It's pure pragmatism, not technical dogmatism.

For those who want to go further in choosing their setup, our local LLM installation guide details the recommended configurations based on your machine.


Flash-attention and performance gains: the numbers

Migrating to llama.cpp isn't just about compatibility. It unlocks optimizations that were previously out of reach. The most notable: flash-attention enabled by default on Qwen and Gemma models.

Flash-attention drastically reduces memory usage during token generation. Instead of storing the full attention matrix in VRAM, it computes it in blocks. The result? Faster generation and the ability to run larger models on GPUs with less VRAM.

According to the analysis by InsiderLLM, the gains are particularly visible on models from the Qwen3.5 family, such as the Qwen3.5-122B-A10B or the Qwen3.6-27B. On a GPU with 16 GB of VRAM, the generation time per token can drop by 15 to 25% depending on the context. It's not a doubling in speed, but it's the difference between a smooth experience and a painful wait.

The table below summarizes the improvements measured by the community since the release:

Model Before 0.30 (GGML) After 0.30 (llama.cpp) Improvement
Qwen3.6-27B Flash-attention disabled Flash-attention by default +15-20% tokens/s
Gemma 3 (via GGUF) Partial compatibility Full support Increased stability
Qwen3.5-122B-A10B Limited VRAM Better memory management Longer context possible
Multi-byte models Reliable tokenization Bugs reported Occasional regression

To make the most of these models, check out our selection of the best LLMs to run locally.


NVIDIA Nemotron 3 Ultra: the agent model arriving in Ollama

The migration to llama.cpp opens the door to model architectures that were previously incompatible. And NVIDIA seized the opportunity. At Computex 2026, the company introduced the Nemotron 3 Ultra, a 550-billion parameter model of which only 55 billion are active at inference thanks to MoE (Mixture of Experts).

Nemotron 3 Ultra is not a classic generalist model. It is explicitly designed for agent workflows. According to the Ollama registry sheet, it is optimized for high-throughput reasoning with hundreds of tool calls per session. This is exactly the usage profile that interests developers building open source AI agents with Ollama.

The key point: Nemotron 3 Ultra is available via Ollama in cloud mode. You don't run it locally on your laptop — nobody has 550B parameters in VRAM. But the integration into the Ollama ecosystem means you can call it with the same API as your local models. Same interface, same workflow, but with the power of a cloud model when you need it. It's a hybrid approach that makes sense for real-world use cases.

If you are hesitating between local and cloud models for your agents, our Claude, GPT, Gemini, Llama comparison can help you decide.


The Laguna architecture: an integrated llama.cpp patch

Ollama 0.30 didn't just migrate to llama.cpp as-is. The team contributed to the upstream project by adding support for a new architecture: Laguna, developed by poolside. The release notes on GitHub confirm that a patch was submitted and integrated into llama.cpp to support this architecture.

Laguna is a model designed specifically for code generation. Its integration shows that the relationship between Ollama and llama.cpp is not one-way: Ollama doesn't just consume llama.cpp, it contributes to the support of new architectures. This is a positive signal for the health of the open-source ecosystem around local inference.

For developers, this means that future models based on Laguna will be able to be launched with a simple ollama run as soon as they are released in GGUF. No manual compilation, no complex configuration. This is Ollama's promise becoming a reality thanks to this migration.


❌ Common mistakes

Mistake 1: Updating without clearing the model cache

After migrating to llama.cpp, some older GGUF models might use a corrupted cache. Symptoms: the model starts but produces incoherent text or tokenization errors.
Solution: Clear the cache with ollama rm then re-download the model. The GGUF files themselves haven't changed, but the inference cache must be regenerated.

Mistake 2: Confusing GGML and GGUF

GGML is the old backend (dead since 0.30). GGUF is the model file format. These are two different things. You can still use GGUF files — it's even the main format now. What disappeared is the GGML engine that read them.
Solution: If a tool asks you to choose between GGML and GGUF, always choose GGUF in 2026.

Mistake 3: Ignoring multi-byte tokenization bugs

The migration introduced regressions in the tokenization of multi-byte characters (accents, Asian characters, emojis). If you are working with non-ASCII text, systematically test before putting into production.
Solution: Follow the issues on Ollama's GitHub. A fix is currently in development based on community feedback.

Mistake 4: Thinking everything goes through llama.cpp

On Apple Silicon, MLX remains the default backend. Forcing the use of llama.cpp on Mac can yield worse performance than native MLX.
Solution: Do not touch the backend settings on Mac. Ollama automatically chooses the best engine.


Reported bugs: what's broken in this version

No major migration happens without breaking something. Ollama 0.30 is no exception, and the InsiderLLM analysis clearly lists the known issues.

The most impactful bug concerns multi-byte tokenization. Some users report that non-ASCII Unicode characters (like French accented letters, CJK characters, or certain emojis) are incorrectly split into tokens. The model receives corrupted tokens as input, which produces incoherent outputs. This is particularly problematic for French-speaking users working with natural text.

A second issue affects backward compatibility. GGUF models quantized with very old methods (pre-2024) may no longer load correctly. This is an acceptable trade-off: these models are obsolete anyway, and modern quantizations (IQ4_XS, Q4_K_M, etc.) work perfectly.

Finally, a few reports mention sporadic crashes on multi-GPU AMD configurations. ROCm support via llama.cpp is constantly improving, but it remains the weak link compared to CUDA on NVIDIA.

If these bugs concern you and you are looking for a more stable alternative, our Ollama vs LM Studio comparison can help you evaluate your options.


Community tension: Ollama vs llama.cpp

This architectural change reignites an old debate within the local open-source AI community. When llama.cpp was created by Georgi Gerganov, it was a fork of the GGML project. Ollama, on the other hand, built its success on GGML and then went a long time without migrating. Some llama.cpp contributors feel that Ollama benefited from community work without contributing enough in return.

The 0.30 migration changes the game. Ollama now integrates llama.cpp as its main dependency and contributes patches (such as for the Laguna architecture). But the timing raises questions: why wait so long when llama.cpp had already been the de facto standard for over a year?

The answer is likely technical. MLX on Mac worked well with the old architecture, and migrating a backend used by millions of users is a colossal task. But the fact that this migration coincides with the arrival of Nemotron 3 Ultra in the Ollama ecosystem also suggests commercial motivations. NVIDIA needs next-generation models to run everywhere, and Ollama had an interest in being compatible.

Regardless, the end user wins out. A unified engine, fewer compatibility bugs, and more available models. Open-source politics are important, but pragmatism wins out when it comes to running models on real hardware. For those who want to understand the broader issues surrounding inference engines, our article on OpenClaw, the AI agent that changes everything explores these same tensions between standardization and innovation.


The impact on local agent workflows

Where this migration truly makes sense is on the agent side. Agent workflows generate dozens, sometimes hundreds of LLM calls per session. Each call involves a system prompt, a context, and a generation. Inference speed and engine reliability become critical.

With llama.cpp and flash-attention by default, models like the Qwen3.6-27B or the Qwen3.5-122B-A10B become viable as local agent engines. The latency per call drops enough that an agent loop (reflection → action → observation) remains fluid. Our article on an AI agent that works while I sleep shows exactly this kind of workflow.

The arrival of Nemotron 3 Ultra in cloud mode via Ollama opens up another possibility: a local agent that delegates complex tasks to a cloud model, while keeping routine tasks local. Same API, same interface, but intelligent routing between local and cloud based on the complexity of the task. This is the hybrid architecture many developers have been waiting for.

To build these agents, the guide on open source AI agents with Ollama details the design patterns and available tools.


❓ Frequently Asked Questions

Do I need to reinstall all my models after the update?

No. GGUF files are compatible. Only the inference cache needs to be regenerated. Delete and re-download only the models that are causing issues, not all of them systematically.

Is Ollama 0.30 faster on all models?

Not universally. The gains are significant on Qwen and Gemma thanks to flash-attention. On other model families, the difference is minimal or non-existent depending on your hardware.

Can I still use GGML?

No. The GGML backend has been completely removed. If you have models in the (old) GGML format, you must convert them to GGUF using a tool like llama.cpp-convert.

Does Nemotron 3 Ultra run locally?

No. Its 550 billion parameters make it impossible to run on consumer hardware. It is accessible via Ollama in cloud mode using the same API as your local models.

Does the migration affect Mac users?

Yes and no. On Apple Silicon, the MLX backend remains the default. However, the overall management of GGUF models has changed, so some behaviors may differ. Multi-byte tokenization bugs also affect Macs.


✅ Conclusion

Ollama 0.30 is the most significant update since the project's inception: an engine change that resolves compatibility issues accumulated over two years, unlocks flash-attention on popular models, and opens the door to new architectures like Laguna and Nemotron 3 Ultra. The tokenization bugs are real but temporary. If you run LLMs locally, update Ollama and choose your models from our selection — the gain in stability and performance is worth it.