OpenAI Jalapeño: the custom inference chip with Broadcom that promises -50% on costs — the end of Nvidia dependency for serving

Deep Tech 🟢 Beginner ⏱️ 16 min read 📅 2026-06-26

🔎 Why OpenAI is building its own chip now

OpenAI spends about $14 billion a year serving its models on third-party GPUs, essentially all from Nvidia. This is the company's heaviest operational bill, far beyond salaries or R&D. Every ChatGPT query, every API call, every generated token passes through silicon that OpenAI does not own.

On June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first custom chip designed exclusively for LLM inference. The stated goal: halving the cost per token. This is not a distant research project. A prototype will be deployed in late 2026, production starts in 2027, and full-scale deployment is scheduled for the first half of 2028.

This timing is no coincidence. The inference war has become the true economic battle of AI. The marginal cost of a token determines whether a model can be profitable at scale. Jalapeño is OpenAI's answer to this equation.

The essentials

Jalapeño is an ASIC (application-specific integrated circuit) designed by OpenAI and Broadcom for LLM inference, promising a 50% reduction in cost per token compared to equivalent Nvidia GPUs.
The design cycle was only 9 months, a record for a chip of this complexity, with TSMC handling manufacturing, Broadcom handling silicon design, and Celestica handling rack integration.
Broadcom confirms its position as the king of AI ASICs: it is already the designer behind Google's TPUs and Meta's MTIAs, and soon the chips from ByteDance and Apple.
The threat to Nvidia is bounded but real: training remains on GPUs, but inference represents the daily and growing bill.
OpenAI has signed a commitment for 10 GW of computing power with Microsoft by 2029, a significant portion of which will run on Jalapeño.

Recommended tools

Tool	Main use	Price (June 2026, check website)	Ideal for
Hostinger	Web hosting for AI projects	From 2.99 €/month	Developers deploying LLM apps
ChatGPT (OpenAI)	Inference via API	Varies by model	Production integration
Claude (Anthropic)	Alternative inference	Varies by model	Use cases requiring long context

Jalapeño: an ASIC designed solely for inference

An ASIC is not a GPU. It is a circuit tailored for a specific task, without the flexibility of a programmable GPU, but with far superior energy efficiency and throughput for that task. Jalapeño is designed to do just one thing: run transformer forward passes, as fast and as cheaply as possible.

According to the official OpenAI announcement, the chip specifically optimizes the dense matrix operations and memory transfers that dominate LLM serving. No ray tracing, no physics simulation, no backward pass for training. Just inference.

This specialization explains the announced -50%. An Nvidia H100 or B200 GPU devotes a significant fraction of its silicon and memory to capabilities that serving does not need. Jalapeño eliminates this waste.

CNBC reports that the chip specifically targets models in the GPT-5.x range and beyond, with a particular focus on FP8 and INT4-based quantization formats.
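To see why low-precision formats matter so much for serving, here is a rough sketch of how the weight-memory footprint scales with the number format. The parameter count is a hypothetical placeholder (it is not an OpenAI figure), and the calculation deliberately ignores the KV cache, activations, and quantization scaling metadata.

```python
# Bits per weight for common formats; FP8 and INT4 are the ones named in the article.
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Hypothetical model size, for illustration only (not a published figure).
HYPOTHETICAL_PARAMS = 1e12

if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        gb = weight_memory_gb(HYPOTHETICAL_PARAMS, fmt)
        print(f"{fmt}: {gb:,.0f} GB of weights")
```

Halving the bits per weight halves both the memory the weights occupy and the bytes that must be moved per generated token, which is why an inference-only chip can afford to build its datapaths around these formats.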

Why not an LPU like Groq?

Groq has paved the way with its LPU (Language Processing Unit) architecture, but with a different approach: massive onboard SRAM at the expense of capacity per chip. Jalapeño opts for a different balance, likely with attached HBM, allowing it to serve larger models without fragmenting batching.

The fundamental difference: Groq sells compute to others. Jalapeño is an internal tool, optimized for OpenAI's specific models. This difference in target changes everything about the architecture.

$14 billion/year: the bill that makes customization inevitable

To understand Jalapeño, you have to look at the numbers. OpenAI handles hundreds of millions of daily requests through ChatGPT and its API. Each request consumes FLOPs, memory, network bandwidth, and above all, electricity.

Bloomberg points out that OpenAI's serving bill has reached $14 billion annually, a figure that includes hardware depreciation, energy, cooling, and network infrastructure. That is more than the GDP of several countries.

Even a 30% reduction (conservative compared to the announced 50%) represents about $4.2 billion in annual savings at full scale, and the full 50% would bring that to $7 billion. Either figure easily justifies the R&D investment, estimated at between $1 and $2 billion for the complete program.

The economic model is simple: an ASIC is expensive to design, but its cost per unit falls sharply at volume. The more tokens you serve, the faster you reach the break-even point. For OpenAI, that threshold was passed months ago.
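The arithmetic behind this break-even argument can be sketched in a few lines. The bill and the R&D range come from the figures quoted above; the simplifying assumptions are mine: the whole serving bill is treated as migrated, and per-chip manufacturing cost and the ramp-up period are ignored.

```python
ANNUAL_SERVING_BILL_USD = 14e9    # serving bill quoted in the article
RND_COST_RANGE_USD = (1e9, 2e9)   # estimated program cost quoted in the article


def annual_savings(bill_usd: float, reduction: float) -> float:
    """Yearly savings if the serving bill shrinks by the given fraction."""
    return bill_usd * reduction


def payback_months(rnd_cost_usd: float, yearly_savings_usd: float) -> float:
    """Months of savings needed to recoup the one-off design cost."""
    return 12 * rnd_cost_usd / yearly_savings_usd


if __name__ == "__main__":
    for reduction in (0.30, 0.50):
        saved = annual_savings(ANNUAL_SERVING_BILL_USD, reduction)
        fastest = payback_months(RND_COST_RANGE_USD[0], saved)
        slowest = payback_months(RND_COST_RANGE_USD[1], saved)
        print(f"{reduction:.0%} cut: ${saved / 1e9:.1f}B saved per year, "
              f"payback in {fastest:.1f} to {slowest:.1f} months")
```

Under these idealized assumptions the design cost is recouped within months, not years. A real deployment would take longer, since only part of the traffic migrates at first.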

The parallel with HBM memory

This dynamic is reminiscent of what is happening on the memory side. The rise of HBM4 shows that every component in the inference chain is the subject of furious optimization. Jalapeño fits into this trend: OpenAI wants to control the entire stack, from silicon to memory.

The record 9-month cycle: how it's possible

Designing a custom chip normally takes 2 to 3 years. Nvidia, Apple, Google plan their architectures on multi-year cycles. Jalapeño was designed in 9 months. How?

Three factors explain this acceleration. First, Broadcom brings a pre-validated ASIC platform. The design team doesn't start from scratch: it reuses IP blocks, memory controllers, and PCIe/CXL interfaces already tested on other projects (TPU, MTIA). Second, OpenAI doesn't need a general-purpose chip. The scope is narrow: transformer inference. Fewer features, less verification, fewer possible bugs. Finally, financial urgency focuses the teams.

According to TechTimes, OpenAI and Broadcom teams worked in a co-located mode, with weekly design iterations rather than quarterly ones. It's a project mode that looks more like software development than traditional microelectronics.

TSMC, Broadcom, Celestica: the division of labor

The value chain is clear:

TSMC: manufacturing on an advanced node (probably N4 or N3, exact details undisclosed)
Broadcom: silicon design, IP blocks, verification
Celestica: rack integration, cabling, cooling, system testing

Celestica is an interesting choice. The Canadian company is already a key partner of Google for TPUs and Meta for MTIAs. Broadcom systematically entrusts it with integration. It has become the industry's hidden bottleneck: having a good chip is not enough, you have to package it in a rack that can handle the power, thermal, and reliability requirements of a datacenter.

Broadcom: the hidden king of AI ASICs

The story of Jalapeño is also the story of Broadcom's rise in AI. While everyone is looking at Nvidia, Broadcom has quietly built a dominant position in ASIC design for tech giants.

The list is impressive: Google's TPU, Meta's MTIA, ByteDance's AI chips (in development), Apple's neural chips (in development), and now OpenAI's Jalapeño. Broadcom doesn't manufacture, it doesn't sell cloud. It designs the silicon that others use to free themselves from Nvidia.

This positioning is lucrative. Every ASIC contract generates design revenues (hundreds of millions) plus royalties per chip produced. It's a recurring and defensive model: once a client has invested in a Broadcom design, the switching cost is prohibitive.

Why not Marvell or others?

Marvell is the most cited alternative, with AWS (Trainium/Inferentia) and Microsoft (Maia) contracts. But Broadcom has an advantage: a broader IP library, particularly in high-speed network interconnects, which are the true bottleneck of large-scale inference systems.

Impact on Nvidia: bounded threat but disrupted strategy

We need to be precise about the threat. Jalapeño does not replace Nvidia GPUs for training. GPT-5.5, Claude Opus 4.7, and Gemini 3 Pro Deep Think models continue to train on H100/B200 clusters. Training requires higher precision (BF16, FP32), checkpointing capabilities, and above all a software ecosystem (CUDA) that is irreplaceable in the short term.

But inference is the daily bill. And that's where the volume is. A model trains once, it serves billions of times. The split in spending between training and inference has shifted: inference now represents 70 to 80% of the total cost of ownership for large-scale LLM deployment.

Nvidia knows this. That is why the company is aggressively pushing its inference-dedicated chips (L40S, inference-optimized B200, future chips from the "N" line). But a custom ASIC will always be more efficient than a GPU, even a "stripped-down" one, because it can eliminate the final compromises of generality.

What this changes for the GPU market

In the short term, nothing. OpenAI will continue to buy Nvidia GPUs massively for training and as inference backup. In the medium term (2028+), a portion of serving migrates to Jalapeño. In the long term, if others follow (Meta with MTIA, Google with TPU, Amazon with Inferentia), the inference GPU market contracts.

Nvidia remains dominant in training. But the company's future growth depends heavily on capturing inference value. Jalapeño and its equivalents are eating away at this prospect.

The surrounding ecosystem: memory, cooling, networking

A chip alone is useless. Jalapeño fits into an ecosystem that is evolving in parallel.

HBM memory is the first critical component. Each Jalapeño chip will need HBM3E or HBM4 to feed data to its compute cores. This is where players like Micron, SK Hynix, and Samsung are capturing a growing share of the value. The transition to HBM4, which is denser and more energy-efficient, is a direct enabler of Jalapeño's efficiency.

Cooling is the second challenge. High-density inference racks regularly exceed 100 kW per rack. Liquid cooling is no longer an option but a necessity. Celestica is likely integrating cold plate or immersion cooling solutions into the Jalapeño racks.

Networking is the third. The interconnection between Jalapeño chips within the same node, and between nodes in a datacenter, determines the effective batching. If the network cannot keep up, the chip waits and efficiency collapses.

10 GW with Microsoft: the energy context

The commitment of 10 GW of computing power with Microsoft by 2029 sets the scale of the deployment. To put this in context: a modern AI datacenter consumes between 500 MW and 1 GW. 10 GW equates to 10 to 20 massive datacenters.
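A back-of-the-envelope check of this scale, using only figures from the article: the 10 GW commitment, the 500 MW to 1 GW range per datacenter, and the 100 kW rack density mentioned in the cooling paragraph. The rack count is an upper bound under my own simplification that every watt reaches a rack, with no cooling or distribution overhead.

```python
COMMITMENT_GW = 10.0                # commitment with Microsoft, per the article
DATACENTER_GW_RANGE = (0.5, 1.0)    # consumption of one modern AI datacenter
RACK_KW = 100.0                     # high-density rack figure from the cooling paragraph


def datacenters_needed(total_gw: float, per_site_gw: float) -> float:
    """Number of sites of a given size needed to absorb the total power."""
    return total_gw / per_site_gw


def racks_upper_bound(total_gw: float, rack_kw: float) -> float:
    """Rack count if all power went to racks (1 GW = 1e6 kW, zero overhead)."""
    return total_gw * 1e6 / rack_kw


if __name__ == "__main__":
    small_site, large_site = DATACENTER_GW_RANGE
    print(f"Sites: {datacenters_needed(COMMITMENT_GW, large_site):.0f} "
          f"to {datacenters_needed(COMMITMENT_GW, small_site):.0f}")
    print(f"Racks (upper bound): {racks_upper_bound(COMMITMENT_GW, RACK_KW):,.0f}")
```

The result is on the order of a hundred thousand high-density racks at most, which gives a sense of why rack integration and power delivery are treated as first-class problems.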

This power will not come exclusively from Jalapeño. A portion will remain on Nvidia GPUs (training, critical inference). But the ASIC proportion will grow significantly. Microsoft has every interest in seeing Jalapeño succeed: as an investor and OpenAI's cloud partner, every dollar saved on serving improves the overall profitability.

This energy commitment also raises the question of electricity supply. 10 GW is the output of roughly ten large nuclear reactors or dozens of wind/solar farms. The PPAs (Power Purchase Agreements) that Microsoft and OpenAI are signing with energy producers are a prerequisite for this deployment.

Energy sovereignty as a competitive advantage

Whoever controls access to energy controls AI deployment. This is why Microsoft has signed nuclear agreements (Three Mile Island, Helion) and why OpenAI is investing in energy projects. Jalapeño is useless without the watts to power it.

What Jalapeño changes for developers and users

For the developer calling the OpenAI API, Jalapeño should be transparent. Same endpoint, same response format, same or better latency. That's the goal: hardware abstraction. OpenAI handles the routing between GPUs and ASICs on the backend.
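To make the idea of backend routing concrete, here is a minimal sketch of a router that prefers an ASIC pool and falls back to GPUs. Everything in it is hypothetical: OpenAI has not published how its routing works, and the `Pool` class, the pool names, and the model names are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical pool of accelerators serving a fixed set of models."""
    name: str
    supported_models: set[str]
    capacity: int        # maximum concurrent requests
    in_flight: int = 0   # requests currently being served

    def can_serve(self, model: str) -> bool:
        return model in self.supported_models and self.in_flight < self.capacity


def route(model: str, pools: list[Pool]) -> Pool:
    """Return the first pool, in preference order, that supports the model and has room."""
    for pool in pools:
        if pool.can_serve(model):
            pool.in_flight += 1
            return pool
    raise RuntimeError(f"no capacity available for {model}")


if __name__ == "__main__":
    # Preference order: cheaper ASIC pool first, flexible GPU pool as fallback.
    pools = [
        Pool("asic-pool", {"model-a"}, capacity=1),
        Pool("gpu-pool", {"model-a", "model-b"}, capacity=10),
    ]
    for model in ("model-a", "model-a", "model-b"):
        print(model, "->", route(model, pools).name)
```

The caller only ever names a model. Whether the request lands on an ASIC or a GPU is decided behind the endpoint, which is what keeps the hardware change invisible at the API level.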

But the indirect effects are real. If the cost per token drops by 50%, OpenAI has two options: lower its prices to squeeze the competition (Anthropic, Google, xAI), or maintain prices and improve its margins. A combination of both is likely.

For ChatGPT end users, the impact will be increased reliability during peak load and potentially more generous request limits. For companies integrating the API, lower costs could transform the profitability of products built on top of GPT.

The models that benefit the most

The heaviest models are the first candidates for migration to Jalapeño. GPT-5.5, with its 98.2 points on the agentic benchmark, is the flagship model whose serving is the most expensive. Migrating its inference to ASIC is the number one priority.

Lighter models such as GPT-5.3 Codex have different cost profiles and could remain on GPUs longer, where the flexibility of multi-model batching is an advantage.

The competitive landscape: Groq, Cerebras, and the inference chip market

OpenAI is not the first to want to break Nvidia's stranglehold on inference. Groq, which raised $650 million and pivoted to neocloud, offers a similar approach with its LPU chips. Cerebras is pushing its wafer-scale engines. SambaNova, d-Matrix, and others are targeting the same niche.

The difference with Jalapeño: OpenAI does not sell chips. OpenAI does not sell inference cloud. OpenAI consumes its own chips internally. This is a major structural advantage. Groq has to convince external customers to adopt its stack, to rewrite their pipelines, to accept a less mature ecosystem. OpenAI has none of these problems: it controls the model, the API, and now the hardware.

Cerebras, Groq, and the others remain relevant for enterprises that want to deploy open-source models (DeepSeek V4 Pro, GLM-5) on dedicated hardware. But for the serving of OpenAI's proprietary models, Jalapeño is a vertically integrated solution that no third party can match.

Neoclouds facing verticalization

Neoclouds (CoreWeave, Lambda, Together) built their model on GPU arbitrage: buying H100s/B200s, renting them out with a margin. If the major model providers (OpenAI, Google, Meta) massively migrate to internal ASICs, the neocloud market finds itself reduced to the serving of open-source models and training for enterprise customers. This is a smaller and more competitive market.

The risks of the Jalapeño project

A chip project of this scale has inherent risks that the enthusiastic announcement does not mention.

The first risk is manufacturing yield. An advanced TSMC node has defect rates that must be managed. If the yield is low, the effective cost per chip skyrockets and the promised savings evaporate. Broadcom has the experience to manage this, but every new design is a gamble.
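The link between yield and cost per chip can be illustrated with the simple Poisson yield model, a textbook approximation in which yield falls exponentially with die area times defect density. All numbers below (die area, defect densities, wafer cost, dies per wafer) are hypothetical placeholders, since no such figures have been disclosed for Jalapeño.

```python
import math


def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the simple Poisson model: exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


def cost_per_good_die(wafer_cost_usd: float, gross_dies_per_wafer: int,
                      yield_fraction: float) -> float:
    """Wafer cost spread over the dies that actually work."""
    return wafer_cost_usd / (gross_dies_per_wafer * yield_fraction)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    die_area_cm2 = 6.0
    wafer_cost_usd = 20_000.0
    gross_dies = 90
    for defect_density in (0.1, 0.2):
        y = poisson_yield(die_area_cm2, defect_density)
        cost = cost_per_good_die(wafer_cost_usd, gross_dies, y)
        print(f"D0={defect_density}/cm2: yield {y:.0%}, ${cost:,.0f} per good die")
```

Because the yield term is exponential, a modest rise in defect density on a large die raises the cost per working chip sharply, which is the mechanism behind the "savings evaporate" scenario.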

The second risk is software. Nvidia's CUDA is not just a driver, it is a complete ecosystem (cuDNN, TensorRT, Triton). OpenAI must build the equivalent for Jalapeño: compiler, runtime, scheduler, monitoring. This is a considerable amount of work, even when starting from existing frameworks (PyTorch, Triton).

The third risk is obsolescence. LLMs evolve quickly. Reasoning models such as o1-preview have different inference patterns than classic chat models (long chains of thought generated token by token, variable context usage). If Jalapeño is too optimized for the GPT-5.x architecture and GPT-6 changes paradigms, the chip loses its advantage.

The risk of Broadcom dependency

By leaving Nvidia, OpenAI becomes dependent on Broadcom. If Broadcom increases its design prices, if it prioritizes other clients (Google, Apple), or if a contractual dispute erupts, OpenAI has no Plan B in the short term. This is a risk that the company consciously accepts, but it must be noted.

❌ Common mistakes

Mistake 1: Confusing inference and training

Thinking that Jalapeño replaces Nvidia GPUs for everything. The chip only does inference (forward pass). Training (backward pass, optimization, checkpointing) remains on GPUs. These are two distinct markets with different technical requirements.

Mistake 2: Taking the -50% for granted

Figures announced by companies are usually best-case scenarios. The -50% is likely measured on a specific workload, with ideal batching, on a particular model. In real-world conditions, with varied workloads and non-optimal utilization, the reduction will be more modest. 30-40% remains a major result, but do not take 50% as a guarantee.
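One simple reason the fleet-wide figure lands below the headline number: only part of the traffic runs on the new chip. A minimal sketch follows, assuming the per-chip reduction holds exactly for migrated traffic and that non-migrated traffic costs the same as before. The 70% migration share is a hypothetical value.

```python
def blended_reduction(migrated_share: float, chip_reduction: float) -> float:
    """Fleet-wide cost reduction when only a share of serving cost moves to the new chip.

    New cost = (1 - s) * C + s * (1 - r) * C = (1 - s * r) * C, so the reduction is s * r.
    """
    if not (0.0 <= migrated_share <= 1.0 and 0.0 <= chip_reduction <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return migrated_share * chip_reduction


if __name__ == "__main__":
    # Hypothetical: 70% of serving cost migrated, with the announced 50% per-chip reduction.
    print(f"Fleet-wide reduction: {blended_reduction(0.7, 0.5):.0%}")
```

Even if the chip delivers the full 50% on the workloads it handles, the overall bill only falls by 50% once all serving has migrated.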

Mistake 3: Ignoring the total cost of ownership

Focusing on the cost per token without considering the development cost ($1-2 billion), the yield risk, the software cost (compilers, runtime), and the Broadcom lock-in. The actual TCO will only be known after 2-3 years of production.

Mistake 4: Believing that Jalapeño is available now

The announcement is June 2026. The prototype arrives in late 2026. Production in 2027. Full scale H1 2028. Between the announcement and a significant impact on costs, there are 18 to 24 months. Nvidia GPUs will remain the foundation of OpenAI's serving until at least 2028.

❓ Frequently Asked Questions

Will Jalapeño replace all GPUs at OpenAI?

No. Training remains on Nvidia GPUs. Only inference is progressively migrating to Jalapeño, and even there, part of the serving will keep GPUs for flexibility and fallback. The transition will be gradual over 2027-2029.

What is the connection between Jalapeño and the on-premise Codex with Dell?

Jalapeño and the on-premise Codex initiative are parallel projects. On-premise Codex addresses the enterprise demand for data sovereignty. Jalapeño addresses serving cost optimization. Ultimately, an on-premise product could integrate Jalapeño chips, but that is not the initial plan.

Is Broadcom becoming a direct competitor to Nvidia?

Partially. Broadcom competes with Nvidia in the ASIC inference segment (alongside Google TPUs, Meta MTIA, and Jalapeño). But Broadcom does not sell GPUs and is not targeting training. The two companies operate fundamentally different economic models: Nvidia sells high-margin generic products, Broadcom sells custom design on demand.

Will reasoning models like o1-preview work on Jalapeño?

That is an open question. Reasoning models have atypical inference patterns (long chains of thought, variable use of context tokens). The study on o1-preview shows that these models generate significantly more internal tokens than a standard model. Jalapeño will need to support these patterns to remain relevant as reasoning models become dominant.

What will the impact be on OpenAI API pricing?

Hard to predict. OpenAI could maintain its prices and improve its margins, or lower its prices to accelerate adoption. The most likely combination: gradual price decreases for models whose serving has migrated to Jalapeño, maintained or increased prices for models still on GPU.

✅ Conclusion

Jalapeño won't kill Nvidia, but it marks the beginning of the end of the de facto monopoly on LLM inference. When OpenAI, Google, Meta, Amazon, and soon Apple and ByteDance all have their custom chips, the inference GPU market finds itself caught between a fragmented open-source ecosystem and internal ASICs that aren't for sale. The real question is no longer whether inference will migrate to ASICs, but how quickly the promised -50% will become the new market standard.
