Let There Be Light to Build a Better AI Supercomputer

18th April, 2024

Leaders in AI, including OpenAI, believe that fresh advances in machine intelligence will necessitate new types of computer hardware. One idea is to use light to connect GPUs.

Most artificial intelligence experts seem to agree that taking the next big leap in the field will depend at least partly on building supercomputers on a once unimaginable scale. At an event hosted by the venture capital firm Sequoia last month, the CEO of a startup called Lightmatter pitched a technology that might well enable this hyperscale computing rethink by letting chips talk directly to one another using light.

Data today generally moves around inside computers (and, in the case of training AI algorithms, between chips inside a data center) via electrical signals. Sometimes parts of those interconnections are converted to fiber-optic links for greater bandwidth, but converting signals back and forth between optical and electrical creates a communications bottleneck.

Instead, Lightmatter wants to directly connect hundreds of thousands or even millions of GPUs—those silicon chips that are crucial to AI training—using optical links. Reducing the conversion bottleneck should allow data to move between chips at much higher speeds than is possible today, potentially enabling distributed AI supercomputers of extraordinary scale.

Lightmatter’s technology, which it calls Passage, takes the form of optical—or photonic—interconnects built in silicon that allow its hardware to interface directly with the transistors on a silicon chip like a GPU. The company claims this makes it possible to shuttle data between chips with 100 times the usual bandwidth.

For context, GPT-4, OpenAI’s most powerful AI algorithm and the brains behind ChatGPT, is rumored to have run on more than 20,000 GPUs. Nick Harris, Lightmatter’s CEO, says Passage, which will be ready by 2026, should allow for more than a million GPUs to run in parallel on the same AI training run.

One audience member at the Sequoia event was Sam Altman, CEO of OpenAI, who has at times appeared obsessed with the question of how to build bigger, faster data centers to further advance AI. In February, The Wall Street Journal reported that Altman has sought up to $7 trillion in funding to develop vast quantities of chips for AI, while a more recent report by The Information suggests that OpenAI and Microsoft are drawing up plans for a $100 billion data center, codenamed Stargate, with millions of chips. Since electrical interconnects are so power-hungry, connecting chips together on such a scale would require an extraordinary amount of energy—and would depend on there being new ways of connecting chips, like the kind Lightmatter is proposing.

Lightmatter has previously disclosed a deal with GlobalFoundries, which manufactures chips for companies including AMD and General Motors. Harris says his company is “working with the largest semiconductor companies in the world as well as the hyperscalers,” referring to the biggest cloud providers, such as Microsoft, Amazon, and Google.

If Lightmatter or another company can rewire massive AI projects, it could remove a major obstacle to the creation of more intelligent algorithms. The development of ChatGPT was driven largely by the use of more compute, and many AI experts believe that further scaling of hardware will be essential to future advances in the field, as well as to hopes of ever reaching the vaguely specified goal of artificial general intelligence, or AGI, meaning programs that can match or exceed biological intelligence in every way.

According to Lightmatter CEO Nick Harris, connecting a million processors together using light could make possible algorithms many generations beyond the current state of the art. “Passage is going to enable AGI algorithms,” he asserts confidently.

The large data centers that are needed to train giant AI algorithms typically consist of racks filled with tens of thousands of computers running specialized silicon chips and a spaghetti of mostly electrical connections between them. Maintaining training runs for AI across so many systems—all connected by wires and switches—is a huge engineering undertaking. Converting between electronic and optical signals also places fundamental limits on chips’ abilities to run computations as one.

Lightmatter’s approach is designed to simplify the complex communications inside AI data centers. “You normally have a bunch of GPUs, and then a layer of switches, and a layer of switches, and a layer of switches,” Harris says of what it takes for two GPUs to communicate. In a data center linked by Passage, he says, every GPU would have a fast connection to every other chip.
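To get a rough feel for the layered-switch picture Harris describes, the sketch below counts how many switches a message might cross in a simple tree of switches. It is an illustration only, not a model of Passage or of any real data center: the function names are hypothetical, the switch radix of 64 ports is an assumed figure, and real networks use more elaborate topologies than a plain tree.

```python
def switch_tiers(num_gpus: int, radix: int) -> int:
    """Smallest number of switch tiers whose tree can reach num_gpus endpoints.

    Assumes a plain tree in which every switch has `radix` downward ports
    (an illustrative simplification, not a description of a real fabric).
    """
    tiers = 1
    while radix ** tiers < num_gpus:
        tiers += 1
    return tiers


def worst_case_switch_hops(num_gpus: int, radix: int) -> int:
    """Switches crossed between the two most distant GPUs in the tree.

    The message climbs through every tier to the top switch, then descends
    through the remaining tiers: tiers + (tiers - 1) switches in total.
    """
    return 2 * switch_tiers(num_gpus, radix) - 1


RADIX = 64  # assumed ports per switch, chosen only for illustration

for gpus in (20_000, 1_000_000):
    hops = worst_case_switch_hops(gpus, RADIX)
    print(f"{gpus:>9,} GPUs: up to {hops} switches between two GPUs")
```

Under these assumptions, growing from roughly 20,000 GPUs to a million adds another tier of switches to the worst-case path, which is the kind of layering a direct chip-to-chip optical link would aim to avoid.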

Lightmatter’s work on Passage is an example of how AI’s recent flourishing has inspired companies large and small to try to reinvent key hardware behind advances like OpenAI’s ChatGPT. Nvidia, the leading supplier of GPUs for AI projects, held its annual conference last month, where CEO Jensen Huang unveiled the company’s latest chip for training AI: a GPU called Blackwell. Nvidia will sell the GPU in a “superchip” consisting of two Blackwell GPUs and a conventional CPU, all connected using the company’s new high-speed communications technology called NVLink-C2C.

The semiconductor industry is known for finding ways to squeeze more processing power out of chips without making them bigger, but Nvidia chose to defy that trend. Although the Blackwell GPUs in the company’s superchip are twice as powerful as their predecessors, each is built by joining two processors together, so they consume much more power. That trade-off, along with Nvidia’s efforts to link its chips together with fast connections, suggests that improvements to other key components of AI supercomputers, like the one Lightmatter proposes, could become more important.