AI Inference (1/3) - the AI-age utility
A deep dive into AI inference: how it works, the hardware powering it, and how it’s becoming the backbone of modern intelligence.
👋 Welcome to episode 3 of the Lay of the Land. Each month, I (why?) pick a technology trend and undertake outside-in market research into the opportunities it presents.
This is article 1/3 of Topic 2: AI inference. We’ll discuss how inference could be the new electricity, the wiring behind the technology, the different silicon plays, and their inherent tradeoffs.
Hope you find this useful enough to follow along!
Also, in case you missed Topic 1 (strongly recommended if you are new to language models):
In 2024 alone, we consumed over 30,000 terawatt-hours of electricity (a terawatt-hour is a trillion watt-hours). Yet most of us give it zero thought. Flip a switch in LA, and a power plant in Arizona might ramp up to feed your lightbulb. It’s seamless, almost magical: the global power grid quietly orchestrates millions of transformers, miles of wire, and countless megawatts so you can watch a late-night movie without ever seeing a spark.
I believe that AI inference is the next invisible utility, only it delivers intelligence instead of electricity. Models churning away in distant data centers act like virtual “power stations,” supplying billions of predictions per second to power translation, code generation, and more. Sure, some tasks run on smaller, battery-operated devices - the kind of on-device inference you carry in your pocket. But, just like real-life electricity, most of the heavy lifting comes from colossal GPU and AI accelerator farms, humming 24/7 to keep your apps smart.
Today, we’ll discuss this new utility and the infrastructure that delivers it.
You thought you could just train and chill?
Embeddings, attention, matrix multiplications, backpropagation, and repeat - you remember how models are trained (if not, Topic 1 has you covered). It’s a very long, compute-heavy process. But it’s all worth it now that you have an easy-to-use, plug-and-play tool that will churn out prediction after prediction in perpetuity. Right? Right?
Turns out that using an LLM is pretty similar to the forward pass involved in training the model. You tokenize your query, feed it forward through the network’s layers, use a function called softmax to convert the output scores into probabilities, and pick the most likely token as your prediction (a minimal sketch of this decode loop follows the list below). This process - inference - is also compute-heavy and, as you can imagine, pretty demanding to execute at scale. The scale required to service:
ChatGPT’s 180M Monthly Active Users
1.5M developers using GitHub Copilot or 360k+ vibe coders on Cursor
Millions of BPO-killing AI support agents
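To make that concrete, here’s a minimal, heavily simplified sketch of the decode loop described above - toy vocabulary, fake “model,” greedy sampling - just to show the tokenize → forward pass → softmax → pick-the-most-likely-token cycle. Everything here (the vocabulary, the random logits) is made up for illustration; a real model would run billions of learned weights in that forward pass.

```python
import numpy as np

# Toy vocabulary and tokenizer - stand-ins for a real BPE tokenizer.
VOCAB = ["<eos>", "what", "should", "i", "get", "for", "lunch", "?", "biryani"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text: str) -> list[int]:
    return [TOKEN_ID[w] for w in text.lower().replace("?", " ?").split()]

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - logits.max())      # subtract max for numerical stability
    return exps / exps.sum()

def forward_pass(token_ids: list[int]) -> np.ndarray:
    # Stand-in for the real model: embeddings, attention, MLPs, etc.
    # Here we just return random logits over the toy vocabulary.
    rng = np.random.default_rng(seed=len(token_ids))
    return rng.normal(size=len(VOCAB))

def generate(prompt: str, max_new_tokens: int = 5) -> str:
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(forward_pass(ids))    # logits -> probabilities
        next_id = int(np.argmax(probs))       # greedy: pick the most likely token
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)                   # feed the prediction back in
    return " ".join(VOCAB[i] for i in ids)

print(generate("What should I get for lunch?"))
```

Every new token requires another full pass through the model, which is why serving this loop to hundreds of millions of users gets expensive fast.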
Today, roughly 20% of all AI data center capacity powers inference and 80% powers training. A Big 3 planning lead told Alvarez & Marsal that this split will flip in the years to come. I think so, too. You pre-train once and then fine-tune intermittently to keep the model up to date, but inference is an operating cost that scales alongside model usage. Thank God data center capacity is growing fast.
The demand for inference is only going to increase with test-time-compute reasoning models and the agentic applications we will build with them. For the o1 model, one estimate suggests it may use around 50 times more tokens for reasoning than what is visible in the output. This is based on an example I saw on Reddit where a 110-token output corresponded to roughly 5,500 total tokens used - about 5,400 hidden reasoning tokens.
More about the market for Inference in article 2. Now, it’s time to get technical.
Inference under the hood
To discuss the complicated process flow that we call inference, let’s oversimplify and trace the steps of a sample query.
Arjun: “What should I get for lunch?”
The internet internets and takes this request to a data center server, where a load balancer routes it to an available AI cluster. A CPU tokenizes the request while the GPU fetches the model weights from an attached memory store and moves them into a faster memory store called HBM (High Bandwidth Memory) over an interconnect called PCIe. The GPU then uses the model weights to calculate attention for the tokenized request and runs matrix multiplications on its tensor cores to produce output tokens, which are converted back into words. The CPU re-enters the stage to do some post-processing (AI safety), meter usage for monetization purposes, and recommend a steaming hot biryani to Arjun (which he would have gotten anyway) - all in milliseconds.
IRL, queries are mostly batched together to maximize GPU utilization. Sometimes the model is too big for one GPU and is split (a few layers here, a few layers there) across multiple GPUs (model parallelism) connected by specialized networking hardware like NVIDIA NVLink. And sometimes the HBM retains the attention keys and values for tokens it has already processed so they don’t have to be recomputed (KV caching) - a toy sketch of this follows below.
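To see why KV caching matters, here’s a toy sketch in plain numpy (single attention head, made-up dimensions, random weights): with a cache, each new token only needs its own key and value computed and appended, instead of recomputing keys and values for the entire sequence at every step.

```python
import numpy as np

D = 64                                   # toy head dimension
rng = np.random.default_rng(0)
W_k, W_v = rng.normal(size=(D, D)), rng.normal(size=(D, D))

k_cache, v_cache = [], []                # the "KV cache" kept in GPU memory between steps

def attend(query: np.ndarray, new_token_embedding: np.ndarray) -> np.ndarray:
    # Only the NEW token's key/value are computed; earlier ones are reused.
    k_cache.append(new_token_embedding @ W_k)
    v_cache.append(new_token_embedding @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)

    scores = K @ query / np.sqrt(D)      # attention scores against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # weighted sum of cached values

# Decoding loop: one new token per step, cache grows by one entry each time.
for step in range(5):
    token_embedding = rng.normal(size=D)
    context = attend(query=token_embedding, new_token_embedding=token_embedding)
    print(f"step {step}: cache holds {len(k_cache)} key/value pairs")
```

The tradeoff is memory: the cache grows with sequence length and batch size, which is one reason long conversations eat HBM.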
Sometimes, we don’t use GPUs at all. We instead use AI accelerators like the fast and efficient Groq LPU (Language Processing Unit), the big and mighty Cerebras Wafer-Scale Engine, the super-specialized Google TPU (Tensor Processing Unit), or even FPGAs (Field-Programmable Gate Arrays) from Intel - each bringing tradeoffs in cost, efficiency, and speed.
The infer-structure stack
The central pillar of AI inference is undoubtedly the accelerator system-on-chip (GPUs + memory + CPUs + networking).
Semiconductors
CPUs - the proverbial brains of the computer, processing general-purpose tasks sequentially at very high speed.
GPUs - parallel processing units originally designed to render graphics for Need for Speed that turned out to be the perfect fit for the many simultaneous small computations powering AI training and inference. They are supported by middleware like NVIDIA’s CUDA, which helps developers access and manipulate the many compute cores in the hardware (a tiny matmul sketch at the end of this list shows the idea).
Other accelerators - the Google TPU, Groq LPU, and Cerebras WSE-3 are examples of tailor-made accelerators that employ novel architectures. Groq, for instance, brings down cost and improves inference speed - albeit with caveats - using a deterministic architecture that does not rely on high-bandwidth memory (more on this in the next article).
These semiconductors are designed by experts at companies like NVIDIA and are manufactured at semiconductor fabs like TSMC using technologies like lithography. These fabs are national treasures that cost $50B+ in capex and are extremely complicated to operate.
Memory hardware - another important semiconductor in AI acceleration. HBM is supplied by a handful of players, including Samsung, and plays the crucial role of feeding model weights to the accelerator at very high bandwidth - 3 TB/s+. These memory components make up as much as 30% of the COGS of an H100 and are often a major supply bottleneck.
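As a tiny illustration of the GPU point above: libraries like PyTorch sit on top of CUDA, so moving a large matrix multiplication (the core workload of both training and inference) from CPU to GPU is essentially a one-line change. This is a rough sketch that assumes a machine with PyTorch and a CUDA-capable GPU; actual speedups vary wildly with hardware.

```python
import time
import torch

# Two large matrices - matmul is the bread and butter of neural network compute.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# CPU: a handful of cores chew through the multiply-adds with limited parallelism.
start = time.time()
c_cpu = a @ b
print(f"CPU matmul: {time.time() - start:.3f}s")

# GPU: the same operation is spread across thousands of CUDA/tensor cores.
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()    # copy tensors over PCIe into GPU memory
    torch.cuda.synchronize()
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()             # wait for the kernel to actually finish
    print(f"GPU matmul: {time.time() - start:.3f}s")
```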
These SoCs are powered (pun intended) by a strong supporting cast.
Data centers
AI accelerators require a lot of energy to run and generate a lot of heat in the process. They also need networking equipment and storage devices. To capture economies of scale, the efficient way to house these machines is to put them all in a massive warehouse somewhere with decent real estate prices and cheap electricity, and to cool it with water, pumps, and cooling towers. Finally, you need both physical and digital security to keep your precious silicon and weights safe.
Today, most AI computing, like high-performance computing before it, is served by hyperscalers - operators of large data center fleets that can scale resources up and down on demand, with facilities built by data center companies like Equinix and run by Azure, AWS, GCP, Meta, Oracle, and others. An average data center has a go-live lead time of two years and capital expenses exceeding a billion dollars a pop. Once built, these mammoths consume hundreds of millions in OpEx a year on electricity, water, labor, equipment maintenance, and land leases.
While training can be done in remote locations, inference needs to be closer to civilization in order to lower latency, bumping up real estate costs.
Cloud serving and workload orchestration
The hardware is in place. We are ready for business.
Hyperscalers and startups in the space create / lease dedicated AI clusters, load them up with models, and lay out API infrastructure to serve inference. This also includes additional services like metering, payments, authentication, security, storage, and reporting.
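For a concrete picture of what “serving inference over an API” looks like from the customer’s side, here’s a minimal sketch of a chat-completions-style request in the OpenAI-compatible format that many providers expose. The endpoint URL, model name, and API key below are placeholders, not any specific provider’s values.

```python
import os
import requests

# Placeholder endpoint and model - many serverless providers expose a similar
# OpenAI-compatible chat completions API.
API_URL = "https://api.example-inference-provider.com/v1/chat/completions"
API_KEY = os.environ["INFERENCE_API_KEY"]

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "some-hosted-model",
        "messages": [{"role": "user", "content": "What should I get for lunch?"}],
        "max_tokens": 100,
    },
    timeout=30,
)
response.raise_for_status()
data = response.json()
print(data["choices"][0]["message"]["content"])  # the provider meters these tokens for billing
```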
These serverless AI providers most often compete on 1) price, 2) ecosystem integration, and sometimes 3) additional bells and whistles like fine-tuning tools. Price, the biggest lever, comes down to minimizing costs. Oversimplifying and listing some methods (a toy quantization sketch follows the list):
Maximizing GPU utilization with larger batch sizes
Balancing load between different models and clusters
Inference optimization with techniques like quantization and KV caching
VC money, eheheh
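Of these, quantization is the easiest to show in a few lines: store the weights in a lower-precision format (say, int8 instead of float32) so they take less memory and move faster, at the cost of a little accuracy. A toy numpy sketch of simple symmetric int8 quantization for one layer’s weights (real serving stacks use fused, hardware-aware kernels, not this):

```python
import numpy as np

# Pretend these are one layer's weights in full precision (float32).
weights = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
w_int8 = np.round(weights / scale).astype(np.int8)

# Dequantize when computing (real kernels fuse this into the matmul itself).
w_restored = w_int8.astype(np.float32) * scale

print(f"memory: {weights.nbytes / 1e6:.1f} MB -> {w_int8.nbytes / 1e6:.1f} MB")
print(f"max absolute error: {np.abs(weights - w_restored).max():.4f}")
```

Roughly 4x less memory per layer means more of the model (or a bigger batch) fits in HBM, which is exactly the lever these providers pull.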
Enterprises stand to benefit from picking Azure AI Foundry or Amazon Bedrock as their serverless AI provider when these platforms come bundled tightly with the rest of their ecosystem tools. Providers also offer adjacencies like fine-tuning and RAG systems as add-ons to make the serverless offerings more attractive.
Different shapes and sizes
Keeping the above principles and inputs intact, tweaks in architecture, memory usage, and hardware can specialize inference for specific applications. Here are a few of the popular flavors (a back-of-the-envelope batching sketch follows at the end of this list):
Low-Latency, Single-Query Inference (Real-Time / Interactive AI) - chatbots, copilots, and real-time applications that require ultra-low latency. These setups use medium-complexity models and small batch sizes.
High-Throughput, Batched Inference (Enterprise AI / Bulk Processing) - Content moderation, document summarization. Capable models, medium latency, and large batch sizes.
Extreme-Scale, Asynchronous Inference (Offline AI / Background Processing) - large-scale document processing (OCR), AI Video Generation, etc. Very large batch jobs with massive models for applications that can live with high latency.
Reasoning-Intensive, Test-Time Compute (Next-Gen AI Agents) - Advanced AI agents (AutoGPT), scientific research models, code generation with deep reasoning. Multi-step reasoning with unpredictable latency and multiple inference loops.
Special mentions:
Edge / On-Device Inference (Low-Power AI)
Federated / Decentralized Inference (Privacy-Preserving AI) - Healthcare AI, On-Prem LLM Deployment (Banking, Finance, Legal AI). Update model weights without sharing data.
Specialized Hardware-Accelerated Inference (AI on Exotic Chips) - AI-powered quantum computing, optical AI accelerators, and neuromorphic chips are WIP inference tech that could fundamentally change the technology.
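The common thread across these flavors is the batch-size dial: bigger batches raise throughput (tokens served per second across all users) but also raise the latency any single user sees. A back-of-the-envelope sketch with made-up numbers, just to show the shape of the tradeoff:

```python
# Toy model of the batch-size tradeoff (all numbers are invented for illustration).
# Assume one decode step costs a fixed overhead plus a small per-request cost.
FIXED_STEP_MS = 20.0        # weight loading / kernel launch overhead per step
PER_REQUEST_MS = 0.5        # incremental cost of each extra request in the batch

for batch_size in (1, 8, 64, 256):
    step_ms = FIXED_STEP_MS + PER_REQUEST_MS * batch_size
    tokens_per_sec = batch_size * 1000.0 / step_ms      # throughput across the batch
    print(f"batch {batch_size:>3}: {step_ms:6.1f} ms per token per user, "
          f"{tokens_per_sec:7.0f} tokens/s total")
```

Interactive AI sits near the top of that table, bulk and offline workloads near the bottom.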
Okay, Substack is giving me that “Near email length” alert. Guess that’s enough content for today.
Next up - Market landscape, risks, and more
I hope that was a useful intro to AI inference as a category. You know the drill. In the next articles, we will chart out the Competitive Landscape for the AI inference value chain, explore some Extrinsic Risks, and finally make some predictions for the space.
Thanks so much for reading! See you soon with a new post!
If you liked the content and would like to support my efforts, here’s how you can help:
Subscribe! It’ll help me get new updates to you easily
Please spread the word - it’s always super encouraging when more folks engage with your project. Any kind of S/O would be much appreciated.
Drop your constructive thoughts in the comments below - point out mistakes, suggest alternate angles, request future topics, or post other helpful resources