Arista delivers holistic AI solutions in collaboration with NVIDIA

Arista Networks has announced a technology demonstration of AI Data Centers, developed in collaboration with NVIDIA, that aligns compute and network domains as a single managed AI entity. To build optimal generative AI networks with lower job completion times, customers can configure, manage, and monitor AI clusters uniformly across key building blocks including networks, NICs, and servers. The demonstration is a first step toward a multi-vendor, interoperable ecosystem that enables control and coordination between AI networking and AI compute infrastructure.

Need for Uniform Controls

As AI clusters and large language models (LLMs) grow, so do the complexity and sheer number of parts that have to fit together. GPUs, NICs, switches, optics, and cables must all work together to form a holistic network. Customers need uniform controls between their AI servers hosting NICs and GPUs, and the AI network switches at different tiers. All these elements rely on each other for proper AI job completion, yet they operate independently. This can lead to misconfiguration or misalignment between parts of the overall ecosystem, such as between NICs and the switch network. Such problems can dramatically impact job completion time, particularly because network issues can be very difficult to diagnose. Large AI clusters also require coordinated congestion management to avoid packet drops and under-utilization of GPUs, as well as coordinated management and monitoring to optimize compute and network resources in tandem.
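The kind of NIC-to-switch mismatch described above can be illustrated with a minimal sketch. This is not Arista's or NVIDIA's implementation, and it does not use any EOS or BlueField API. The setting names (`mtu`, `ecn_enabled`, `pfc_priorities`) and the function `find_mismatches` are hypothetical, chosen only to show what comparing two independently managed configurations might look like.

```python
from typing import Any


def find_mismatches(
    switch_port: dict[str, Any], nic: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return every setting whose value differs between the two sides.

    Each entry maps a setting name to (switch_value, nic_value).
    A setting missing on one side is reported with None for that side.
    """
    keys = sorted(set(switch_port) | set(nic))
    return {
        key: (switch_port.get(key), nic.get(key))
        for key in keys
        if switch_port.get(key) != nic.get(key)
    }


# Hypothetical settings for one switch port and the NIC attached to it.
switch_port = {"mtu": 9000, "ecn_enabled": True, "pfc_priorities": [3]}
nic = {"mtu": 9000, "ecn_enabled": True, "pfc_priorities": [4]}

mismatches = find_mismatches(switch_port, nic)
for setting, (switch_value, nic_value) in mismatches.items():
    print(f"{setting}: switch={switch_value} nic={nic_value}")
```

In this made-up example the two sides agree on everything except the flow-control priority, a small difference that would be easy to miss when the switch and the NIC are configured through separate tools.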

Introducing the Arista AI Agent

At the heart of this solution is an Arista EOS-based agent enabling the network and the host to communicate with each other and coordinate configurations to optimize AI clusters. Using a remote AI agent, EOS running on Arista switches can be extended to directly attached NICs and servers to allow a single point of control and visibility across an AI Data Center as a holistic solution. This remote AI agent, hosted directly on an NVIDIA BlueField-3 SuperNIC or running on the server and collecting telemetry from the SuperNIC, allows EOS, on the network switch, to configure, monitor, and debug network problems on the server, ensuring end-to-end network configuration and QoS consistency. AI clusters can now be managed and optimized as a single homogeneous solution.

“Arista aims to improve efficiency of communication between the discovered network and GPU topology to improve job completion times through coordinated orchestration, configuration, validation, and monitoring of NVIDIA accelerated compute, NVIDIA SuperNICs, and Arista network infrastructure,” said John McCool, Chief Platform Officer for Arista Networks.

End-to-End AI Communication and Optimization

This new technology demonstration highlights how an Arista EOS-based remote AI agent allows the combined, interdependent AI cluster to be managed as a single solution. EOS running in the network can now be extended to servers or SuperNICs via remote AI agents to enable instantaneous tracking and reporting of performance degradation or failures between hosts and networks, so they can be rapidly isolated and the impact minimized. Since EOS-based network switches are constantly aware of accurate network topology, extending EOS down to SuperNICs and servers with the remote AI agent further enables coordinated optimization of end-to-end QoS between all elements in the AI Data Center to reduce job completion time.

“Best-of-breed Arista networking platforms with NVIDIA’s compute platforms and SuperNICs enable coordinated AI Data Centers. The new ability to extend Arista’s EOS operating system with remote AI agents on hosts promises to solve a critical customer problem of AI clusters at scale, by delivering a single point of control and visibility to manage AI availability and performance as a holistic solution,” said Zeus Kerravala, Principal Analyst at ZK Research.
