The broad strokes of how to build a great AI training cluster are pretty settled: Get as many GPUs together as you can, pack them densely with fast networking, and pump in as much data as possible.

But as AI prepares to move into its inferencing era, what a data center built for that challenge should look like is a matter of great debate.

Co-packaged silicon photonic networking switches from Nvidia – Nvidia

Some generative AI approaches have gone for distilling models down to their smallest form so they can fit on a device such as a phone, removing the need to interact with a data center except to ping back more complex requests, along with data for future training.

Other models fit on single GPUs or partial racks, with companies hoping that the rising tide of AI will help provide the Edge with its much-needed killer app.

But for Nvidia’s SVP of networking, Kevin Deierling, the generative AI industry’s embrace of reasoning models points to a different approach to inference – one that looks a lot like the tightly packed training clusters of today.

“I'm not sure the [data center industry] fully groks the impact of reasoning models,” he tells us. “There are certainly AI workloads like robots or [autonomous] cars that absolutely need the fastest inferencing and therefore need to be in close proximity to wherever that's happening. For those workloads, absolutely it will be right at the Edge or on device, and it can be relatively small compute and then a lot of them all over the place.”

But inference for large reasoning models will require significant scale.

Nvidia, which obviously benefits from an increased need for ever greater scale, believes that the generative AI market has evolved into three phases of workload.

First is the pre-training phase, “where we train foundation models,” Deierling says. From there, we move to “post-training, where models go to think,” he continues. “So teaching your models how to think: You start off with 20 petabytes of data, and then you move to hundreds of petabytes of data, or even a trillion in terms of the model parameters. We can even post-train multimedia models with visualization, and so all of that is expanding the scale.”

But the most interesting phase comes next. Known as test-time scaling, it essentially dedicates substantially more computational resources during the inference phase to improve performance. The model can simulate multiple solution paths or responses and select the best one, as it reasons out an answer.
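As a rough illustration of the idea, here is a minimal sketch of one simple form of test-time scaling, best-of-N sampling, in Python. The generate and score functions are hypothetical placeholders standing in for a model's sampling and answer-verification steps, not any particular API.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical placeholder: a real system would sample a full reasoning trace here.
    return f"candidate answer to {prompt!r} (draw={random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    # Hypothetical placeholder: a real system might use a verifier or reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Spend roughly n times the inference compute and keep the best-scoring answer."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

print(best_of_n("What is the cheapest way to route this shipment?"))
```

Every extra candidate is another full pass through the model, which is why the compute bill grows with the quality of the answer rather than just the length of the question.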

“What’s interesting there is we’re moving from one-shot inference, where you ask a question and it immediately gives you an answer out of a pre-trained foundation model, to these reasoning models,” Deierling explains.

“The reasoning model figures it out, it thinks through the tokens. And the point there is the amount of compute for inferencing is massive. Even the ‘smaller’ models that we're talking about - 671 billion parameters for DeepSeek R1, it’s an enormous model, it doesn't fit on a single GPU. It takes a dozen GPUs.”
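A back-of-envelope calculation, using assumed precision and per-GPU memory figures rather than anything from Nvidia, shows why a model that size spans multiple accelerators:

```python
params = 671e9           # DeepSeek R1's parameter count
bytes_per_param = 1      # assumption: FP8 weights, one byte per parameter
weights_gb = params * bytes_per_param / 1e9

hbm_per_gpu_gb = 141     # assumption: an H200-class GPU with 141 GB of HBM3e
usable_fraction = 0.7    # assumption: ~30% held back for KV cache, activations, buffers

gpus_for_weights = weights_gb / (hbm_per_gpu_gb * usable_fraction)
print(f"Weights alone: ~{weights_gb:,.0f} GB, or roughly {gpus_for_weights:.0f} GPUs")
# ~671 GB of weights already needs around seven GPUs before serving overheads;
# add batching headroom and redundancy and a dozen GPUs is a plausible figure.
```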

When this is then used for agentic reasoning, where each agent has its own database of information and autonomously completes complex tasks (as used in OpenAI’s Deep Research), the GPU requirement grows ever further.
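In skeleton form, an agentic workflow looks something like the sketch below, with every name hypothetical; the point is simply that each agent consults its own store and each step can mean another full reasoning-model inference.

```python
from typing import Callable

def make_agent(name: str, knowledge: dict) -> Callable[[str], str]:
    """Each agent holds its own (toy) database and answers tasks from it."""
    def agent(task: str) -> str:
        # Placeholder for "retrieve from this agent's store, then run a reasoning model".
        return f"{name}: {knowledge.get(task, 'no local data')}"
    return agent

agents = [
    make_agent("researcher", {"market size": "gathers source material"}),
    make_agent("analyst", {"market size": "cross-checks the numbers"}),
]

def run_task(task: str) -> list:
    # One user request fans out into an inference call per agent, per step.
    return [agent(task) for agent in agents]

print(run_task("market size"))
```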

“We see the scaling of inferencing as something that people hadn't anticipated,” Deierling says. “Even a year ago at Nvidia, I was sort of fighting this battle that people said, 'oh, it's going to be one shot, single node: Do the inferencing, get an answer.' And we're seeing that's absolutely not the case.”

This, he claims, is already the direction in which Nvidia’s largest AI customers are moving. “They build training clusters, which are very, very large scale but that is on the expense side of their income statement. Inferencing is on the revenue side, that's where they make money.

“So there was this idea that you're going to have a training cluster and then an inferencing cluster, but we don't see that in the field. What we see is they build a giant cluster that was built to train their foundation models, and then they take all or some of that and start using it for inferencing.”

At first, he says, people thought “that's going to be an overkill, we don't need networking for inferencing, we can just run that on individual boxes,” but, he says, “it turns out, inferencing needs a ton of networking for all kinds of reasons.”

Now, he believes: “We're going to see training clusters used for inferencing.”

That, of course, means that ever larger inference data centers will face the same problem as training ones - that of power.

“If we have 100,000 GPUs, or a million, scaling for inferencing, we have to really pay attention to the power budget,” Deierling says. “So that's where the CPO comes in, it's actually providing a huge power benefit.”

Announced earlier this year, Nvidia's co-packaged optics (CPO) switches with integrated silicon photonics are pitched at radically reducing the power demands of networking gear - conveniently opening up more capacity for more GPUs.

“What we find is some of our customers have plenty of money,” Deierling says, describing a problem other companies would love to have. “They can afford to buy more stuff, but the constraint is actually that power budget.”

While the industry is on the hunt for more locations with power, and is ploughing money into energy infrastructure, CPO aims to do more with what’s available. “To the extent we can save 30 or 50 percent of the interconnect power by moving from traditional optics to CPO, that’s the driving force,” Deierling says. “We're talking about tens of megawatts of power savings on a giant data center.”
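Some illustrative arithmetic shows how such savings reach that scale. The per-transceiver wattage below is an assumption; the six-transceivers-per-GPU figure comes from Nvidia's Gilad Shainer, quoted further down.

```python
transceivers_per_gpu = 6     # figure Shainer cites below
watts_per_transceiver = 15   # assumption: a ~15 W pluggable optic

for gpus in (100_000, 1_000_000):
    optics_mw = gpus * transceivers_per_gpu * watts_per_transceiver / 1e6
    print(f"{gpus:>9,} GPUs: ~{optics_mw:.0f} MW in pluggable optics, "
          f"~{0.3 * optics_mw:.1f}-{0.5 * optics_mw:.1f} MW saved at 30-50 percent")
# At a million GPUs, a 30-50 percent saving on the optics alone lands
# in the tens of megawatts.
```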

A second-order benefit is that it should increase reliability, Nvidia believes. Deierling’s fellow networking SVP Gilad Shainer explains: “On every GPU that you put into a data center, you need to install six transceivers. So the number of components in the data center is growing very fast and, obviously, the more components that you have, the more things that can fail over time.”

Those components that fail will need to be replaced, but that too poses a risk. “For example, when you replace a transceiver, one of the faults that you have as a human is that your fingers have a dimension,” Shainer says. “When you place a transceiver in a very dense computing infrastructure, you're going to touch other transceivers, and when you touch other things, you can actually create issues on other elements.”

With the CPO, the external transceiver is moved into the package itself, reducing the number of lasers by four times. “So there are hundreds of thousands of transceivers that are now not needed.”
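The reliability side of that argument is simple counting; the annual failure rate in the sketch below is an assumed figure for illustration, not a vendor number.

```python
gpus = 100_000
transceivers_per_gpu = 6                     # Shainer's figure above
pluggable_optics = gpus * transceivers_per_gpu
co_packaged_lasers = pluggable_optics // 4   # article: CPO cuts the laser count ~4x

annual_failure_rate = 0.01                   # assumption: 1 percent of optics fail per year
for label, count in (("pluggable optics", pluggable_optics),
                     ("co-packaged lasers", co_packaged_lasers)):
    print(f"{label:>18}: {count:,} components, "
          f"~{count * annual_failure_rate:,.0f} expected failures per year")
```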

He adds: “You reduce the human touch, you reduce the number of components dramatically, increase resiliency, and can increase the complete capacity because they reduce the power consumption.”

For now, Nvidia's silicon photonics ambitions are squarely at the switch level for rack-to-rack communications. "We will do copper as long as we can [within the rack] with NVLink and scale up," Deierling says. "At some point in the future, you can imagine that everything becomes optical."

Across optical data center interconnects, data centers are already communicating with each other for multicluster training runs.

"What we see is that, in the largest data centers in the world, there's actually a data center and another data center and another data center," he says. "Then the interesting discussion becomes – do I need 100 meters? Do I need 500 meters? Do I need a kilometer interconnect between data centers?"

What that limit is, Deierling wouldn't say, although the speed of light is the ultimate limiter for both training and inference at scale.

For inference, latency is often given as a reason for keeping GPUs close to the user. "But if there's 200 milliseconds of latency, we don't even notice it," Deierling says, downplaying the concerns. "Humans don't notice it."

Where it matters is for the agentic workloads, where AIs talk to AIs. "We need sub-millisecond sorts of latencies, network connectivity between the devices," he says, something that can be achieved within the same facility.

"That last hop from a centralized data center can matter a little bit,” he says. You can't compound that.

“I can't have an agentic workflow of inferencing that's going back and forth from California to New York a dozen times because that latency adds up. So I think the agentic inferencing workloads will happen in a centralized data center, but we can handle one hop to a user and live with that."
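The underlying math is straightforward; the distances and fiber speed below are approximations rather than measured routes.

```python
fiber_km_per_ms = 200    # light in fiber covers roughly 200 km per millisecond
ca_to_ny_km = 4_100      # approximate straight-line distance; real fiber routes are longer

round_trip_ms = 2 * ca_to_ny_km / fiber_km_per_ms
print(f"One CA<->NY round trip: ~{round_trip_ms:.0f} ms")
print(f"A dozen agentic round trips: ~{12 * round_trip_ms:.0f} ms")
print(f"Round trip within a ~100 m facility: ~{2 * 0.1 / fiber_km_per_ms * 1000:.0f} µs")
# A single ~41 ms cross-country hop is imperceptible; a dozen of them is not,
# while hops inside one facility stay comfortably sub-millisecond.
```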