NextSilicon says the Maverick-2 delivers 4x the performance per watt vs. Blackwell GPU

NextSilicon has shared some internal benchmark results for its latest AI and HPC chip, the Maverick-2, and the claims are nothing if not bold. The company claims that its new chip, which is shipping now, can essentially reconfigure itself to changing workloads, thus providing 10x the computational performance of the latest NVIDIA GPUs while using only 60% of the power.

Trillions of dollars are being spent to build massive AI factories loaded with the most powerful GPUs. The latest NVIDIA Blackwell GPU, the standard-bearer for high-performance computing, offers 20 petaflops of FP4 performance. However, Blackwell also draws 1,400 watts of power, which is reshaping how data centers are powered and cooled.
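
The performance-per-watt comparisons in this story are easy to misread, so here is the arithmetic laid out explicitly, using only the numbers quoted in the article (the exact multiple depends on which claim and workload is being compared). A minimal Python sketch:

```python
# Back-of-the-envelope math using only the figures cited in this article.
BLACKWELL_FP4_PFLOPS = 20   # cited FP4 throughput for a Blackwell GPU
BLACKWELL_TDP_W = 1_400     # cited power draw

per_watt = BLACKWELL_FP4_PFLOPS * 1e15 / BLACKWELL_TDP_W
print(f"Blackwell: {per_watt / 1e12:.1f} TFLOPS (FP4) per watt")  # ~14.3

# NextSilicon's claim of 10x the performance at 60% of the power composes
# into a single performance-per-watt multiple:
print(f"Claimed multiple: {10.0 / 0.6:.1f}x perf per watt")       # ~16.7x
```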

The folks at NextSilicon envision a different path. Although GPUs and CPUs have helped power major scientific and societal advances in the HPC and AI fields, they face a future of diminishing returns. Rather than spending huge sums indefinitely on AI factories filled with ever more powerful GPUs (as well as more advanced power and cooling systems), the founders of NextSilicon decided to try something different.

NextSilicon founder and CEO Elad Raz notes that, while the 80-year-old von Neumann architecture has given us a universally programmable foundation for computing, it comes with a lot of overhead. He says that 98% of the silicon is dedicated to control overhead tasks, such as branch prediction, out-of-order logic, and instruction handling, while only 2% performs the actual computation at the heart of the application.

Chips built with traditional von Neumann architectures waste up to 98% of their capacity on overhead tasks, NextSilicon says.

Raz and his team envisioned a new architecture, dubbed the Intelligent Compute Architecture (ICA), whereby the chip can essentially configure itself to adapt to changing workloads, thereby minimizing overhead and maximizing the computing horsepower available to perform the math behind AI and HPC applications. This was the basis of NextSilicon’s patent, entitled “Runtime Optimization for Reconfigurable Hardware,” and the blueprint for the non-von Neumann dataflow architecture used in the Maverick-2 processor.
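
NextSilicon has not published the internals of the telemetry algorithm Raz describes, but the core idea, profile continuously, identify the small set of code paths doing most of the work, and dedicate hardware to them, can be sketched in a few lines of Python. All names and thresholds below are invented for illustration; this is not NextSilicon code.

```python
from collections import Counter

class HotPathMonitor:
    """Illustrative sketch of profile-and-promote: count how often each
    code path executes, then flag the hottest ones for acceleration."""

    def __init__(self, promote_threshold: float = 0.80):
        self.counts = Counter()
        self.promote_threshold = promote_threshold  # share of total executions

    def record(self, path_id: str) -> None:
        """Called by the runtime each time a code path executes."""
        self.counts[path_id] += 1

    def hot_paths(self) -> list[str]:
        """Smallest set of paths covering promote_threshold of executions."""
        total = sum(self.counts.values())
        covered, hot = 0, []
        for path, n in self.counts.most_common():
            hot.append(path)
            covered += n
            if covered / total >= self.promote_threshold:
                break
        return hot

monitor = HotPathMonitor()
for path in ["matmul"] * 900 + ["reduce"] * 80 + ["io_setup"] * 20:
    monitor.record(path)
print(monitor.hot_paths())  # ['matmul'], the dominant path gets the silicon
```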

Raz explained the core insight behind NextSilicon’s chip design philosophy during a media presentation earlier this week.

“We’ve seen that for compute-intensive applications, a small chunk of code runs most of the time,” Raz said. “Instead of running every piece of code, every instruction, the same way, we thought, let’s focus on what matters. We developed a smart software algorithm that constantly monitors your application. It pinpoints the paths of code that run the most, freeing you to focus on innovation, not overhead.”

NextSilicon’s dataflow architecture is built on top of a graph structure. Instead of processing instructions one by one, à la von Neumann, a dataflow processor consists of a grid of computational units, known as ALUs (arithmetic logic units), that are interconnected in a graph structure. Each ALU handles a specific type of operation, such as multiplication or logical operations. When input data arrives, the calculation is automatically triggered, and the result flows to the next unit in the graph.
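
The firing rule described here, a node computes as soon as all of its operands arrive and pushes the result downstream, is the textbook dataflow model. A minimal software simulation of that idea might look like the following; this is a generic illustration, not Maverick-2’s actual microarchitecture.

```python
import operator

class DataflowNode:
    """One ALU bound to a single operation. It fires as soon as all of
    its inputs have arrived and forwards the result to consumers."""

    def __init__(self, op, n_inputs, consumers=None):
        self.op = op
        self.n_inputs = n_inputs
        self.inputs = []
        self.consumers = consumers or []  # downstream DataflowNode objects

    def receive(self, value):
        self.inputs.append(value)
        if len(self.inputs) == self.n_inputs:   # all operands present: fire
            result = self.op(*self.inputs)
            self.inputs = []
            for node in self.consumers:
                node.receive(result)

# Graph for (a * b) + (c * d): two multipliers feed one adder.
result_sink = DataflowNode(print, 1)            # "output" node just prints
adder = DataflowNode(operator.add, 2, [result_sink])
mul1 = DataflowNode(operator.mul, 2, [adder])
mul2 = DataflowNode(operator.mul, 2, [adder])

# Operands arriving trigger execution automatically; no scheduler needed.
mul1.receive(2); mul1.receive(3)   # fires: 6 flows to the adder
mul2.receive(4); mul2.receive(5)   # fires: 20 flows in, adder prints 26
```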

Elad Raz, CEO of NextSilicon (Image courtesy of NextSilicon)

This novel approach brings a huge advantage over serial processing: the chip no longer needs to handle instruction fetching, decoding, or scheduling, overhead tasks that eat up compute cycles.

“For parallel workloads, which define virtually all modern AI and HPC applications, dataflow’s ability to exploit the natural parallelism in computational graphs means hundreds of operations can be executed simultaneously, limited only by data dependencies rather than artificial instruction serialization,” Raz wrote in a white paper published today alongside the Maverick-2 launch.

According to NextSilicon, this approach gives the Maverick-2 the flexibility and performance often found in ASICs, but without the long and expensive development timeline. In terms of programmability, the company says the Maverick-2 is CUDA compatible, allowing it to act as a drop-in replacement for NVIDIA GPUs, and that Python, C++, and Fortran code will also run without rewriting.

“We get 10x faster performance at half the power consumption. And because we have a drop-in alternative, we can penetrate the market much faster,” Raz said during the presentation. “Our technology seamlessly runs CPU and GPU supercomputing tasks, HPC workloads, and advanced AI machine learning models easily out of the box.”

Maverick-2 feeds and speeds (Image courtesy of NextSilicon)

The Maverick-2 comes in single-die and dual-die configurations. The single-die Maverick-2 includes 32 RISC-V cores, is built on TSMC’s 5nm process, and runs at 1.5GHz. The card supports PCIe Gen5 x16, features 96GB of HBM3E memory, and provides 3.2TB/s of memory bandwidth. It sports 128MB of L1 cache, features a 100 Gigabit Ethernet (GbE) NIC, operates within a thermal design power (TDP) of 400W, and is air-cooled. The dual-die Maverick-2 effectively doubles all of these capabilities, but it plugs into an OAM (OCP Accelerator Module) socket, features dual 100GbE NICs, can be air- or liquid-cooled, and operates with a TDP of 750 watts.
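
To make the single-die versus dual-die comparison concrete, here is the article’s spec list tabulated in a short Python sketch. The doubling of every listed resource is an assumption taken from the “effectively doubles” phrasing above; TDP is kept separate because the stated dual-die figure (750W) is less than twice the single-die 400W, which is precisely what improves performance per watt.

```python
# Spec figures as stated in the article; doubling per "effectively doubles."
single_die = {
    "RISC-V cores": 32,
    "HBM3E (GB)": 96,
    "Memory BW (TB/s)": 3.2,
    "L1 cache (MB)": 128,
    "100 GbE NICs": 1,
}
dual_die = {k: v * 2 for k, v in single_die.items()}

single_tdp, dual_tdp = 400, 750   # watts, stated separately in the article
for key in single_die:
    print(f"{key:18} single: {single_die[key]:>6}  dual: {dual_die[key]:>6}")
print(f"{'TDP (W)':18} single: {single_tdp:>6}  dual: {dual_tdp:>6}")
# Doubling resources at less than double the power raises perf per watt.
```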

NextSilicon shared some internal benchmark data for the Maverick-2. In terms of giga-updates per second (GUPS), the Maverick-2 delivered 32.6 GUPS at 460 watts, which the company says is 22x faster than a CPU and 6x faster than a GPU. On HPCG (High Performance Conjugate Gradients), the Maverick-2 delivered 600 Gflops at 750 watts, which the company says is comparable to leading GPUs at half the power.
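
For context, the per-watt efficiency implied by those internal numbers works out as follows. The figures are the company’s claims, not independent measurements:

```python
# Efficiency math for the internal benchmark figures quoted above.
gups, gups_watts = 32.6, 460          # giga-updates/s and power, GUPS run
hpcg_gflops, hpcg_watts = 600, 750    # HPCG result and power

print(f"GUPS per watt: {gups / gups_watts * 1000:.1f} mega-updates/s/W")  # ~70.9
print(f"HPCG: {hpcg_gflops / hpcg_watts:.2f} Gflops per watt")            # ~0.80
# "Comparable to leading GPUs at half the power" would imply roughly 2x
# those GPUs' HPCG Gflops-per-watt, per the company's framing.
```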

“What we discussed in detail today is more than a chip,” said Eyal Nagar, NextSilicon’s vice president of R&D. “It’s a foundation, a new way of thinking about computing. It opens up a whole new world of possibilities and optimizations for engineers and scientists.”

Maverick-2 ICA (Image courtesy of NextSilicon)

NextSilicon’s focus on optimizing resources makes sense, said Steve Conway, HPC and AI industry analyst at Intersect360 Research. “Traditional CPU and GPU architectures are often constrained by high-order pipelines and limited scalability,” Conway said. “There is a clear need to reduce energy waste and unnecessary computations within HPC and AI infrastructures. NextSilicon addresses these critical issues with Maverick-2, a novel architecture aimed at the unique demands of HPC and AI.”

Maverick is already deployed at Sandia National Laboratories as part of the Vanguard-2 supercomputer and will be used in the upcoming Spectra supercomputer. “Our partnership with NextSilicon, which began four years ago, has been an excellent example of how NNSA laboratories can work with industry partners to mature novel emerging technologies,” said James H. Laros III, Vanguard program lead at Sandia National Laboratories. “NextSilicon’s continued focus on HPC makes them a prime candidate for the Vanguard program.”

Platforms change every 10 to 15 years, and each transition unlocks new types of problems, said Ilan Tayari, co-founder and VP of architecture at NextSilicon. “We saw it as the world moved from mainframes to PCs, from client-server to the cloud, and from CPUs to GPUs.”

“Each transition normalizes previously impossible requests,” Tayari said. “But here’s the thing: each transition has happened because the previous platform hit its fundamental limits in solving these problems. The next transition will allow scientists to reach beyond what’s possible today.”

“Imagine biologists researching cancer using molecular dynamics across thousands of human cells. Consider astronomers probing dark matter in galaxies with extraordinary granularity to uncover fundamental properties of the universe, or forecasters accurately predicting natural disasters like floods, hurricanes, and wildfires,” he said. “Our vision has never been more critical: to free us from the confines of computing.”


This article was first published on HPCwire.


About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He lives in the San Diego area.
