1-bit LLMs: when models fit on a smartphone
🔎 An 8B model in 1 GB of RAM is now a reality
Until recently, running a decent LLM on a smartphone was pure fantasy. An 8-billion parameter model needs about 16 GB in FP16 and still 4 to 7 GB once quantized to 4 bits, reserving it for recent Macs or well-equipped PCs. Two announcements changed the game: in April 2025, Microsoft published the results of BitNet b1.58 2B4T, trained on 4 trillion tokens, and in April 2026 the startup PrismML (spun out of Caltech) released Bonsai, a commercially viable 1-bit LLM that fits in 1 GB and runs on iPhone.
Why now? Because post-training quantization (GPTQ, AWQ, GGUF) is hitting its practical limits. Compressing a model after the fact is like recompressing a JPEG photo: past a certain point, the image visibly degrades. The real breakthrough is training directly in native 1-bit, with each weight in the network constrained to a handful of values: -1, 0, and +1 in the case of BitNet b1.58.
The key takeaways
- BitNet b1.58 (Microsoft) is the first open-source 1-bit LLM natively trained at scale: 2 billion parameters on 4 trillion tokens, with performance comparable to full-precision models of the same size.
- Bonsai (PrismML) pushes the concept further: an 8B model that takes up ~1 GB on disk and generates 131 tokens/sec on an M4 Pro chip, under the Apache 2.0 license.
- Post-training quantization (GPTQ, AWQ, GGUF) remains relevant for existing models, but cannot compete with native 1-bit training in terms of quality-to-size ratio.
- A 100B 1-bit model can run on a single CPU at 5-7 tokens/sec thanks to the bitnet.cpp framework, with an energy consumption reduction of 71.9% to 82.2%.
Recommended tools
| Tool | Main use case | Price (June 2025, check website) | Ideal for |
|---|---|---|---|
| bitnet.cpp | CPU inference for BitNet b1.58 models | Free (open-source) | Running 1-bit LLMs on CPU only |
| BitNet b1.58 2B4T | Open-source 1-bit model | Free (MIT) | Testing and research on 1-bit AI |
| Bonsai 8B | Lightweight commercial 1-bit LLM | Free (Apache 2.0) | Running on smartphones and constrained devices |
What 1-bit quantization really is
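1-bit quantization, in the strict sense, stores each weight of a neural network on a single bit, which allows only two values (-1 or +1). BitNet b1.58 uses a close variant in which each weight takes one of three possible values: -1, 0, or +1. This is called ternary representation (hence the "1.58" at Microsoft, since log₂(3) ≈ 1.58 bits of information per weight), and it is commonly grouped under the "1-bit" label.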
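To make the 1.58 figure concrete, here is a minimal sketch in Python. It computes the information content of a ternary weight and packs five ternary weights into one byte using base-3 digits (3⁵ = 243 fits in 256). The packing scheme and the function names `pack_trits` and `unpack_trits` are illustrative assumptions, not a description of how bitnet.cpp or Bonsai actually lay out weights in memory.

```python
import math

# Information content of one ternary weight, in bits.
BITS_PER_TERNARY = math.log2(3)

def pack_trits(trits):
    """Pack 5 ternary weights (-1, 0, +1) into one byte using base-3 digits."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1/0/+1 to digits 0/1/2
    return value  # always below 3**5 = 243, so it fits in a byte

def unpack_trits(byte):
    """Inverse of pack_trits: recover the 5 ternary weights from one byte."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

weights = [-1, 0, 1, 1, -1]
packed = pack_trits(weights)
assert unpack_trits(packed) == weights
print(f"{BITS_PER_TERNARY:.3f} bits of information per ternary weight")
print(f"{8 / 5:.2f} bits per weight with 5-per-byte packing")
```

With this toy packing, storage costs 1.6 bits per weight, very close to the theoretical 1.58.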
In a classic LLM, each weight is typically stored as a 16-bit floating-point number (FP16 or BF16). An 8B model therefore consumes about 16 GB in weights alone (8 billion weights × 2 bytes). 4-bit quantization via GGUF or AWQ brings this down to ~5 GB. Native 1-bit shatters this barrier: an 8B model in 1-bit fits in ~1 GB.
The crucial distinction is between post-training quantization and native 1-bit training. Post-quantization takes an FP16 model and compresses its weights a posteriori. Native training constrains the weights during learning, an approach the foundational BitNet paper published in JMLR presents as fundamentally superior: the weights learn directly to be efficient in this ternary space, instead of being brutally truncated after the fact.
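The sizes quoted above follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below reproduces it for an 8B model; the helper name `weight_memory_gb` is ours. It counts raw weights only, which is why real 4-bit files (which also carry scaling factors and keep some layers at higher precision) land nearer 5 GB than the 4 GB computed here.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Raw weight storage in gigabytes (10**9 bytes), ignoring any overhead."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # an 8-billion-parameter model

formats = [("FP16", 16), ("4-bit", 4), ("ternary, 1.58 bits", 1.58), ("1 bit", 1)]
for label, bits in formats:
    print(f"{label:>20}: {weight_memory_gb(N_PARAMS, bits):5.2f} GB")
```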
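As an illustration of what "truncating after the fact" means, here is a minimal sketch of ternary rounding with an absmean scale (divide by the mean absolute weight, round, clip to [-1, 1]), which is our understanding of the scheme described in the BitNet b1.58 paper. Applied once to already-trained weights, as below, it is the post-training case; native training applies the same kind of rounding at every training step so the model can adapt to it. The toy Gaussian weights and the function names are assumptions for the example.

```python
import random

def ternarize_absmean(weights, eps=1e-8):
    """Round weights to {-1, 0, +1} after scaling by their mean absolute value."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def mean_squared_error(original, ternary, scale):
    """Reconstruction error when ternary weights are rescaled back."""
    pairs = zip(original, ternary)
    return sum((w - t * scale) ** 2 for w, t in pairs) / len(original)

random.seed(0)
# Toy stand-in for one trained layer (assumed Gaussian weights).
fp_weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
ternary, scale = ternarize_absmean(fp_weights)

print("values used:", sorted(set(ternary)))
print("reconstruction MSE:", mean_squared_error(fp_weights, ternary, scale))
```

The nonzero reconstruction error is the information that post-training rounding discards and cannot recover.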
BitNet b1.58: the paper that triggered it all
Microsoft isn't just compressing: they are rewriting the rules of the game. The BitNet b1.58 2B4T technical report published on arXiv in April 2025 documents the first open-source 1-bit LLM trained at this scale — 2 billion parameters on 4 trillion tokens.
The numbers are impressive and verified. According to the detailed analysis on DEV Community in March 2026, bitnet.cpp makes it possible to run a 100-billion parameter BitNet model on a single CPU at 5-7 tokens/sec. This is comparable to human reading speed. The measured speedups on x86 range from 2.37x to 6.17x compared to standard FP16 inference, with an energy consumption reduction of 71.9% to 82.2%.
At a more modest scale, the Medium article from February 2026 shows that at 3B parameters, BitNet b1.58 rivals LLaMA in FP16 in perplexity and zero-shot accuracy, while consuming 3.55× less VRAM and running 2.71× faster. For those who want to test it, a French tutorial from OneDollarVPS details the step-by-step installation.
The BitNet b1.58 Reloaded study (arXiv, February 2026) confirms these results on smaller architectures, and InfoQ points out that Microsoft demonstrates performance comparable to FP16 models in real-world conditions. It's a paradigm shift: we no longer sacrifice quality for size, we change the very nature of computation.
These advances are part of a broader movement of model optimization. The Qwen3.6 family from Alibaba also illustrates this trend of making LLMs more accessible, even without going as far as native 1-bit.
Bonsai: the first commercially viable 1-bit LLM
If BitNet is pure research, Bonsai is its product incarnation. The startup PrismML, spun out of Caltech, announced in April 2026 what Forbes describes as the first commercially viable 1-bit LLM.
The key figure: an 8B model that weighs ~1 GB on disk. To put this in context, a standard 8B model in 4-bit GGUF takes up 4 to 7 GB. That's a 4x to 7x reduction. And Bonsai generates 131 tokens/sec on an Apple M4 Pro chip — a throughput that makes usage comfortable in real-time conversation.
The most striking part: Bonsai works on iPhone. The practical guide from Roborhythms confirms that the 8B model runs with only 1 GB of allocated RAM. The Apache 2.0 license allows for broad adoption, including commercial use. Créati.ai reports that PrismML positions Bonsai as a break from cloud dependency: 1-bit models make on-device AI viable without unreasonable compromises.
For users interested in local execution without going as far as 1-bit, our guide to installing local LLMs remains a reference for classic approaches with Ollama and LM Studio. And to compare what already exists locally, our comparison of the best local LLMs reviews the available options.
Post-training quantization: GPTQ, AWQ, GGUF — the classic state of the art
Before native 1-bit, post-training quantization was the only lever to reduce model size. It remains essential because it applies to all existing models — including current leaders like GPT-5.5, Claude Opus 4.7 or Gemini 3.1 Pro which are not available in 1-bit.
In 4-bit, an 8B model goes from about 16 GB to ~5 GB, roughly a 3x reduction. This makes execution possible on consumer GPUs or even on CPU. Toolhalla specifies that GPTQ is optimized for pure GPU, GGUF is hybrid CPU/GPU, and that FP8/FP4 are emerging as alternatives to integer quantization.
GPTQ: GPU-oriented compression
GPTQ quantizes weights layer by layer by minimizing the reconstruction error. It excels on dedicated GPUs but is not designed for CPU. The practical results from Johal.in show a 4x memory saving and a 3x speedup, with a perplexity loss of less than 2%.
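To show what "reconstruction error" refers to, the sketch below measures it for plain round-to-nearest 4-bit quantization on a toy layer. This is deliberately not GPTQ: it is the naive baseline whose output error GPTQ-style methods try to reduce by adjusting weights layer by layer. The layer shape, the random calibration inputs and the function names are assumptions.

```python
import random

def quantize_4bit(row):
    """Symmetric round-to-nearest quantization of one weight row to 4-bit levels."""
    scale = max(abs(w) for w in row) / 7 or 1.0  # integer range used here: -7..7
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return [v * scale for v in q]  # dequantized weights

def layer_output(weights, x):
    """y = W x for a dense layer stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

random.seed(0)
W = [[random.gauss(0, 0.05) for _ in range(64)] for _ in range(16)]
X = [[random.gauss(0, 1.0) for _ in range(64)] for _ in range(32)]  # calibration inputs
W_q = [quantize_4bit(row) for row in W]

# Layer-wise reconstruction error: mean squared difference between outputs.
err = sum(
    (a - b) ** 2
    for x in X
    for a, b in zip(layer_output(W, x), layer_output(W_q, x))
) / (len(X) * len(W))
print("output reconstruction MSE:", err)
```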
AWQ: protecting the 1% that matter
AWQ (Activation-aware Weight Quantization), described in the original article on arXiv, starts from a clever observation: not all weights are created equal. Protecting the 1% most important weights (the "salient" weights) drastically reduces quantization error. The approach is hardware-friendly and produces more robust models than GPTQ at the same compression level.
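The selection step can be sketched as follows: run some calibration data through the layer, average the activation magnitude per input channel, and flag the top 1%. This only illustrates the "activation-aware" idea of finding salient channels. The toy data, the 1% default and the name `salient_channels` are assumptions, and the sketch does not cover what AWQ then does to protect those channels.

```python
import random

def salient_channels(activations, fraction=0.01):
    """Return indices of input channels with the largest mean |activation|."""
    n_channels = len(activations[0])
    mean_abs = [
        sum(abs(sample[c]) for sample in activations) / len(activations)
        for c in range(n_channels)
    ]
    k = max(1, int(n_channels * fraction))
    ranked = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)
    return ranked[:k]

random.seed(0)
n_channels = 200
# Toy calibration data: channels 3 and 42 carry unusually large activations.
acts = [
    [random.gauss(0, 10.0 if c in (3, 42) else 1.0) for c in range(n_channels)]
    for _ in range(256)
]
print("channels to protect:", salient_channels(acts))
```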
GGUF: the go-to format for local inference
GGUF (the successor to the GGML format) is the reference format for local execution. It supports hybrid CPU/GPU computation, making it extremely flexible. It's the format used by Ollama, LM Studio and the majority of local tools. To use free models without sacrificing quality, GGUF is often the preferred download format.
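A GGUF file is self-describing: it opens with the magic bytes `GGUF`, followed by a version number and counts of tensors and metadata entries. The sketch below parses that fixed header, assuming the little-endian layout used by GGUF version 2 and later; the synthetic header stands in for a real file, and the counts in it are made up.

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed-size header at the start of a GGUF file (v2+ layout)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}

# Synthetic 24-byte header standing in for the start of a real model file.
fake_header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake_header))
```

For a downloaded model, the same function can be fed the first 24 bytes of the file read in binary mode.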
The inherent limits of post-quantization
The discussion on Hugging Face is unequivocal: integer-only inference is not yet the standard. Most solutions still use weight-only quantization, mixed weight/activation, or low-precision float (FP8/FP4). The fundamental reason is that compressing after the fact cannot recreate lost information. This is exactly why native 1-bit is a qualitative leap.
Real-world performance: what we gain and what we lose
Let's talk concrete figures. The following table summarizes verified data from available sources:
| Metric | FP16 Model (reference) | GGUF/AWQ 4-bit (post-quant) | BitNet b1.58 (native 1-bit) | Bonsai 8B (native 1-bit) |
|---|---|---|---|---|
| 8B model size | ~16 GB | ~5 GB | ~1.6 GB (theoretical, at 1.58 bits per weight) | ~1 GB |
| Memory reduction | Reference | ~3x | ~10x | ~16x |
| Speed (vs FP16) | Reference | ~3x faster | 2.71x faster (measured at 3B) | 131 tok/s (M4 Pro) |
| Quality loss (perplexity) | Reference | <2% | Comparable to same-size FP16 | Not publicly documented |
| Energy consumed | Reference | Moderate reduction | -71.9% to -82.2% | Not publicly documented |
Where native 1-bit clearly wins
Memory and energy are two undeniable victories. A 100B BitNet model running on CPU alone at 5-7 tokens/sec opens the door to edge deployment, to regions where GPUs are prohibitively expensive, and to cheap servers. For these use cases, the comparison of the best free LLMs will be worth revisiting once 1-bit models join the free offerings.
Where we need to be honest about limitations
BitNet b1.58 rivals a same-size LLaMA in FP16 at the 2B-3B scale. That's impressive, but it's not a 70B model. It doesn't have the reasoning of a GPT-5.5 (agentic score of 98.2) or a Claude Opus 4.7 (94.3). Native 1-bit at a very large scale (70B+) has not yet been demonstrated with competitive performance against current frontier models.
For AI vision and image analysis tasks, 1-bit has not been validated either. Multimodal architectures add a complexity that extreme quantization does not yet handle well.
Concrete impact: a 70B model in 2-4 GB, when is it coming?
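Let's do the math. A 70B model in FP16 = ~140 GB. In GGUF 4-bit = ~35-40 GB. In native ternary = theoretically ~14 GB (each weight on 1.58 bits instead of 16, meaning a ~10x reduction), or ~9 GB with a strict 1 bit per weight. Activations and context add to these figures rather than shrinking them, so the 2-4 GB zone would require an average of roughly 0.2 to 0.5 bits per weight: well below what ternary training delivers today, and only conceivable for a very aggressively optimized model.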
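The arithmetic can be checked in a few lines. The sketch below computes raw weight storage for a 70B model at several precisions, then inverts the formula to find how many bits per weight a 2 GB or 4 GB budget would allow. Function names are ours, and runtime overhead is deliberately left out.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Raw weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def bits_per_weight_for(n_params, target_gb):
    """Average bits per weight needed to fit the weights in target_gb."""
    return target_gb * 1e9 * 8 / n_params

N_PARAMS = 70e9

formats = [("FP16", 16), ("4-bit", 4), ("ternary (1.58 bits)", 1.58), ("1 bit", 1)]
for label, bits in formats:
    print(f"{label:>20}: {weight_memory_gb(N_PARAMS, bits):7.2f} GB")

for target in (2, 4):
    needed = bits_per_weight_for(N_PARAMS, target)
    print(f"{target} GB budget -> {needed:.2f} bits per weight")
```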
But there is a major caveat: no one has yet published a 70B model natively trained in 1-bit. BitNet b1.58 has been demonstrated at 2B-3B. Bonsai goes up to 8B. Scalability to 70B+ remains a theoretical hypothesis supported by trends, not a measured reality.
What is already real, however: an 8B model in 1 GB running on a smartphone. That's sufficient for many tasks — summarization, classification, information extraction, light conversational assistance. For more demanding tasks like code or in-depth research, cloud models remain indispensable.
For self-hosting such models, a server at Hostinger with enough RAM can do the trick for modestly sized 1-bit models.
The future of 1-bit inference
Several trends are taking shape for the coming months. First, the likely arrival of larger 1-bit models. If PrismML reached 8B and Microsoft proved the concept at 2B-3B, the race for a 30-70B 1-bit model is on. Next, the integration of 1-bit into multimodal architectures — currently, all known 1-bit models are text-only.
Microsoft's bitnet.cpp framework will also evolve. It already targets both x86 and ARM CPUs, so deeper optimization for ARM devices (smartphones, Raspberry Pi, low-cost servers) is a logical and necessary next step. The discussion on Hugging Face suggests that integer-only inference could become the standard within 2-3 years, replacing current mixed formats.
For AI agents, 1-bit is particularly promising: an agent running locally, permanently, without calling on the cloud, with minimal energy consumption. This is the scenario where 1-bit truly changes the game compared to simply compressing existing models.
❌ Common mistakes
Mistake 1: Confusing post-training 1-bit quantization with native 1-bit training
This is the most frequent error. Quantizing an FP16 model to 1-bit after the fact produces a degraded model, often unusable. BitNet and Bonsai are trained from the start in 1-bit, so the weights learn directly in this constrained space. The difference in quality is enormous. Do not confuse the two.
Mistake 2: Believing that an 8B 1-bit model replaces a GPT-5.5
An 8B model in 1-bit is excellent for its size category. It does not compete with the frontier models that dominate our monthly comparison of the best LLMs. GPT-5.5 (91 overall, 98.2 in agentic) and Claude Opus 4.7 (90 overall, 94.3 in agentic) remain in another category of capabilities. 1-bit compresses storage, not intelligence.
Mistake 3: Ignoring activation overhead
The model size on disk is not the only metric. Activations (the intermediate values computed during inference) and the key/value cache, which grows with context length, also consume memory. An 8B model in 1-bit may weigh 1 GB on disk but require 2-3 GB in RAM during execution, depending on context length. It's still remarkable, but don't count on running an 8B 1-bit on a device with only 1 GB of total RAM.
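A rough way to estimate this overhead is to size the key/value cache, usually the largest runtime buffer besides the weights: two tensors per layer, one entry per token of context. The configuration below (32 layers, 8 key/value heads, head dimension 128, FP16 cache) is a hypothetical 8B-class setup chosen for illustration, not Bonsai's published architecture.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory for cached keys and values: 2 tensors per layer, one entry per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical 8B-class configuration (assumed, not Bonsai's published architecture).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(context_len=ctx, **cfg) / 2**30
    print(f"context {ctx:>6}: {gib:.2f} GiB of KV cache on top of the weights")
```

Under these assumptions, a long context can cost more memory than the 1-bit weights themselves.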
Mistake 4: Using GPTQ for CPU deployment
GPTQ is optimized for GPU. If you only have CPU, use GGUF or bitnet.cpp. GPTQ on CPU will be slower than the original FP16 model in some cases. The choice of quantization format depends on your target hardware, not just compression quality.
❓ Frequently asked questions
Can a 1-bit LLM really run on a smartphone?
Yes. PrismML's Bonsai 8B demonstrates exactly this: ~1 GB on disk, 131 tokens/sec on M4 Pro, and it works on iPhone. The smaller BitNet b1.58 also runs on mobile devices via bitnet.cpp.
What quality loss compared to a classic model?
For models of the same size (e.g., 3B vs 3B), BitNet b1.58 is comparable in perplexity and zero-shot accuracy to FP16, according to the BitNet paper published in JMLR. But a 3B 1-bit model does not replace a 70B FP16 model: these are different categories.
GPTQ, AWQ, or GGUF: which one to choose?
GGUF if you want CPU/GPU flexibility (Ollama, LM Studio). AWQ if you have a dedicated GPU and want to preserve quality. GPTQ for maximum throughput on pure GPU inference. None of these three is native 1-bit.
Is BitNet b1.58 usable in production today?
For research and PoCs, yes. For a critical mainstream application, it's premature. The 2B model is too small for complex tasks, and the tool ecosystem is still young. Bonsai 8B is closer to practical use.
Will 1-bit models replace AI cloud computing?
No, but they will drastically reduce the cases where the cloud is necessary. For simple and medium tasks, local 1-bit will be enough. For complex reasoning, the best LLMs for research will remain in the cloud. The two are complementary, not substitutes.
✅ Conclusion
Native 1-bit is not just another compression technique — it's a paradigm shift that brings AI from datacenters back into our pockets. BitNet b1.58 proved the concept, Bonsai made it usable. The rest is a matter of months, not years. To follow these developments and compare models as they evolve, check out our monthly comparison of the best LLMs.