The AI arms race has made “GPU” and “gigawatt” household words, and for good reason: What’s happening with the scale of compute is unprecedented. But what about the underlying storage layer? How are organizations going to store all the data for AI and feed it to hungry GPUs? It turns out that there is a revolution in storage for HPC and AI as well.
Welcome to our special HPCwire series about the future of storage for HPC and AI. In this first story, we present the current state of storage for AI and HPC and highlight some of the broader challenges facing organizations. In future pieces, we’ll dig into various aspects of the HPC and AI storage industry and offer our best data-driven bets on where it’s all headed.
For starters, some things have changed with AI and HPC storage, but some things haven’t. On the hardware front, while solid-state drives (SSDs) based on NVMe flash media have become dominant, there are still roles for spinning disk and even tape in the storage mix. Support for RDMA, whether over InfiniBand or Ethernet, along with NVIDIA’s GPUDirect technologies, is helping keep GPUs fed with data.
Gigawatt-scale data centers require plenty of storage (Source: Shutterstock)
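As a concrete illustration of what GPUDirect means for application code, here is a minimal sketch of a GPUDirect Storage read using the RAPIDS kvikio library as one illustrative binding. The file path is hypothetical, and a GDS-capable system with kvikio and CuPy installed is assumed:

```python
import cupy as cp
import kvikio

# Allocate the destination buffer directly in GPU memory.
buf = cp.empty(1024 * 1024, dtype=cp.uint8)  # 1 MiB on the device

# Read straight from the file into GPU memory. With GPUDirect Storage
# enabled, the DMA goes NVMe -> GPU without bouncing through host RAM;
# otherwise kvikio transparently falls back to a staged copy.
f = kvikio.CuFile("/mnt/pfs/train/shard-000.bin", "r")  # hypothetical path
nbytes = f.read(buf)
f.close()

print(f"read {nbytes} bytes directly into GPU memory")
```

This is roughly the pattern that storage clients follow under the hood when they advertise GPUDirect support: the GPU buffer, not a host buffer, is the I/O target.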
From a software perspective, there are a wide variety of file systems and object stores in use. Parallel file systems that have powered traditional HPC workloads, such as Lustre, PanFS, and IBM Storage Scale (formerly Spectrum Scale and GPFS), are experiencing a renaissance thanks to AI build-outs. Training large AI models is similar in some ways to traditional HPC workloads such as modeling and simulation: both require moving lots of data in relatively large block sizes at high speed to the GPU and its associated memory, which traditional parallel file systems are good at.
At the same time, some organizations are setting up their AI storage on network-attached storage (NAS) systems that use NFS or Parallel NFS (pNFS), though only a handful of software vendors in the NFS and pNFS world are finding success. Many storage vendors, whether they use traditional parallel file systems or pNFS, and whether they are software-only plays or appliance vendors, integrate S3-compatible object storage into the mix, primarily to serve AI inference workloads. Ethernet and InfiniBand are the dominant interconnects in AI and HPC, with RDMA used to accelerate data transfer over both.
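As a simple illustration of that S3 integration, an inference service typically pulls model artifacts out of the object tier onto node-local flash before serving. Here is a minimal sketch using boto3; the endpoint, bucket, key, and local path are all hypothetical:

```python
import boto3

# Any S3-compatible store (AWS S3, MinIO, a vendor's S3 gateway) works by
# pointing the client at the appropriate endpoint.
s3 = boto3.client("s3", endpoint_url="https://s3.example-storage.local")

# Stage model weights from object storage onto node-local NVMe
# before the inference server loads them.
s3.download_file(
    Bucket="model-artifacts",
    Key="checkpoints/model-00001.safetensors",
    Filename="/local/nvme/model-00001.safetensors",
)
```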
What has changed is the scale of storage and the ways it is used. A petabyte of storage used to be considered “big data,” but thanks to today’s ultra-dense flash, organizations can store an exabyte of data in a single rack. The gigawatt-sized data centers being built by the likes of Meta, OpenAI, and Google will contain thousands of storage servers across thousands of racks, sitting alongside compute clusters containing hundreds of thousands of GPUs. Some of these will include the latest proprietary networking technology from NVIDIA, such as its NVLink interconnect.
Emerging AI workloads bring slightly different requirements than HPC, including more data ingestion, labeling, preparation, and sorting before the actual work (model training) even begins. Once the model is trained, inference workloads bring a different set of performance and capacity requirements. File sizes range from large to small, and input to a chatbot or agentic AI interaction may call upon different types of data from different systems. Data orchestration becomes an issue, as do features such as security, privacy, and data residency requirements.
Emerging tech, such as NVIDIA NVSwitch, which harnesses multiple GPUs together using NVIDIA’s NVLink technology so they behave like a single giant GPU, will push storage to its limits.
Although commercial organizations share infrastructure between scientific computing and AI computing and storage, the workloads have different needs, said Addison Snell, CEO of analyst firm Intersect360 Research. “And there’s a wide gap between what customers are asking for and what vendors are providing,” he said.
It used to be that there were two storage tiers: disk and tape. “Now you get five, six, seven levels in most of these environments,” Snell added. “And now performance is not so much about how much bandwidth I have with that one tier. It’s about how I optimize it, what data goes on what tier.”
All companies pursuing the HPC and AI storage market need to provide the basic infrastructure to support the core capabilities, said Mark Nossokoff, a storage industry analyst at Hyperion Research. “But that’s just the baseline and the table stakes,” he told HPCwire. “You need capabilities on top of that to be able to really manage and understand what’s going on with the data that’s being transferred and stored, and get it to the right place at the right time.”
AI training clusters often include special flash tiers called “burst buffers” to help smooth out I/O spikes during training. On the inference side, many storage vendors have integrated key-value (KV) caches into their storage platforms, allowing them to maintain state over the lifetime of an AI conversation, or even store the conversation’s components for later use.
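Vendor implementations differ, but the idea behind these conversation caches can be sketched in a few lines. The class below is purely conceptual, not any vendor’s API: it keys cached inference state by conversation ID and evicts the least recently used sessions, which a real storage platform would spill to a flash tier rather than discard:

```python
from collections import OrderedDict

class ConversationKVCache:
    """Conceptual LRU cache of per-conversation inference state."""

    def __init__(self, max_entries: int = 1000):
        self._store = OrderedDict()   # conversation_id -> serialized state
        self._max = max_entries

    def put(self, conversation_id: str, kv_state: bytes) -> None:
        self._store[conversation_id] = kv_state
        self._store.move_to_end(conversation_id)       # mark most recent
        if len(self._store) > self._max:
            self._store.popitem(last=False)            # evict LRU entry

    def get(self, conversation_id: str):
        if conversation_id in self._store:
            self._store.move_to_end(conversation_id)   # refresh recency
            return self._store[conversation_id]
        return None

# Usage: reuse cached attention state instead of recomputing the prompt.
cache = ConversationKVCache()
cache.put("session-42", b"...serialized KV tensors...")
state = cache.get("session-42")
```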
Integrating data and metadata is an emerging problem in HPC and AI storage
Metadata management has become a big deal with AI storage, especially when data is spread across multiple systems, both on-premises and in the cloud. Cataloging, managing, and governing the metadata in even a single exabyte-scale storage cluster is a challenge, and each vendor seems to implement this capability differently.
“AI wants access to all data in all places, and that’s not how storage was typically built. So to me that’s what’s happening with organizations,” says Molly Presley, Hammerspace’s SVP of global marketing. “Users don’t know how to put all these pieces together. There’s a lot of new application technology that they’ve never worked with. And how do they decide which piece of the whole stack to use?”
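One way to picture the problem Presley describes is a unified catalog that records, for every dataset, which system actually holds the bytes and what governance rules apply. The sketch below uses a hypothetical schema purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    path: str          # logical path presented to users and AI pipelines
    system: str        # which backend actually holds the bytes
    size_bytes: int
    tags: dict         # governance labels: residency, sensitivity, owner

catalog = [
    CatalogEntry("/datasets/train/shard-000", "on-prem-lustre", 2**30,
                 {"residency": "EU", "sensitivity": "internal"}),
    CatalogEntry("/datasets/train/shard-001", "s3://cloud-bucket", 2**30,
                 {"residency": "US", "sensitivity": "internal"}),
]

# Governance query: find data that must stay in the EU before
# scheduling a training job in a US region.
eu_only = [e for e in catalog if e.tags.get("residency") == "EU"]
print([e.path for e in eu_only])
```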
Surveys indicate that many (if not most) HPC organizations are already using their clusters to run AI workloads, whether in direct support of traditional modeling and simulation workloads or in other use cases, such as data analysis, literature review, hypothesis generation, or assisting in scientific experiments. Although there are similarities between the two types of workloads, there are significant differences.
“HPC is like a big zoo. You can pick an HPC application that does anything,” says DDN senior vice president James Coomer, who started in the HPC business 30 years ago as a PhD researcher.
“Whatever you’re interested in, whether it’s fluid dynamics or crash simulation or cosmology or quantum mechanics modeling or whatever, you’re going to find an application that does some weird thing to storage in a different way, whereas AI is actually, in that sense, simpler,” Coomer says. “The training… loads these models, loads the datasets, the checkpoints. It’s a lot.”
The future of AI and HPC storage is bright
The challenges with fitting storage to AI are different. “We have customers who spend literally $1 billion,” Coomer continues. “Thirty percent is spent on data center, cooling, and power infrastructure; 50% to 60% is on GPUs; 10% on networking; and basically 5% on storage. But if you spend 5% of your budget on the wrong storage, you can really kill the productivity of that whole pie. You can get 25% less productivity because you’re waiting for that data to move.”
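The arithmetic behind that warning is easy to check. Using the rough budget split Coomer quotes (the figures below are illustrative, not customer data):

```python
total_budget = 1_000_000_000          # a $1B AI cluster build-out
storage_spend = 0.05 * total_budget   # ~5% on storage  = $50M
gpu_spend = 0.55 * total_budget       # ~50-60% on GPUs = $550M

# If the wrong storage leaves GPUs waiting on data, Coomer's 25%
# productivity loss strands a quarter of the GPU investment.
stranded_gpu_value = 0.25 * gpu_spend

print(f"storage spend:      ${storage_spend / 1e6:.0f}M")
print(f"stranded GPU value: ${stranded_gpu_value / 1e6:.0f}M")
# roughly $138M of GPU value idled by a $50M storage decision
```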
WEKA CTO Shimon Ben David says storage for AI is changing rapidly, and yesterday’s concepts don’t apply to tomorrow’s problems. “In the past, you only talked about storage: you sold storage for backup, shared storage, block devices. That’s not something that’s going to be very sustainable, because customers are honestly expecting a lot more.”
Nobody wants to buy storage today; everyone wants to buy the result, Ben David continued. “So you can’t just say, here’s my storage environment. What you have to show is: I have an environment that makes your training five times, 10 times faster. Or I have an environment that fully saturates your GPUs. Or an environment that already has vector databases and RDBMS databases that you can just use.”
According to NetApp Vice President Jeff Baxter, Gartner recently published a report predicting that 60 percent of AI projects will be abandoned by 2026 due to a lack of AI-ready data. More and more customers are running into that problem, he said: the models are great and the data science is sound, but there isn’t easily accessible, AI-ready data to drive these experiments.
It’s quite a time to be in the high-end storage business, according to Eric Salo, vice president of marketing for VDURA, the original developer of the PanFS parallel file system. “It’s just the biggest arms race I’ve ever seen in my entire career,” Salo says. “A few years ago, it was unusual for me to see an RFQ for a terabyte per second of bandwidth. Now I’m seeing four, five, eight, nine terabytes a second for these systems. They’re just getting bigger.”
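To put those RFQ numbers in perspective, here is a back-of-envelope sizing sketch. The per-drive and per-server throughput figures are assumptions chosen for illustration; real systems vary widely:

```python
target_tbps = 8.0    # high end of the RFQs Salo describes, in TB/s
drive_gbps = 12.0    # assumed sustained read per PCIe Gen5 NVMe drive, GB/s
node_gbps = 100.0    # assumed per-server network limit (e.g., 2x 400 GbE), GB/s

drives = target_tbps * 1000 / drive_gbps   # ~667 drives
nodes = target_tbps * 1000 / node_gbps     # ~80 storage servers

print(f"~{drives:.0f} NVMe drives and ~{nodes:.0f} storage servers, "
      f"minimum, to sustain {target_tbps:g} TB/s")
```

Even with generous assumptions, multi-terabyte-per-second targets imply hundreds of drives and dozens of servers working in parallel, which is exactly why these systems keep getting bigger.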
Stay tuned for our next article in this series.
This article first appeared on our sister publication, HPCwire.