Meta, the company behind popular social media platforms, is not just focused on developing cutting-edge AI models and tools but also on building an infrastructure that can support the future of AI. From hardware to publicly released models and new generative AI tools, Meta is investing in all levels of AI development.

At a recent conference, Networking at Scale 2023, Meta’s engineers and researchers described their efforts to design and operate a network infrastructure that can handle Meta’s massive AI workloads, which range from ranking and recommendation algorithms to increasingly complex GenAI models. They discussed topics such as network design, routing and load balancing solutions, performance tuning, and workload simulation.

One of the key challenges Meta faces is the scale and complexity of GenAI models. These large language models require a robust and flexible network infrastructure to support both training and inference. To address this, Meta has moved from CPU-based to GPU-based training and deployed distributed systems whose GPUs are interconnected over the network.

The network fabric Meta currently uses for training models is based on RoCE (RDMA over Converged Ethernet) and arranged in a Clos topology. Leaf switches connect to the GPU hosts, and spine switches provide scale-out connectivity between GPUs in the cluster. The fabric is designed to maximize raw performance and consistency, both of which are crucial for AI workloads.
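To make the leaf and spine roles concrete, here is a minimal Python sketch of a two-tier fabric. It is an illustration only, not Meta's design: the function names, the host naming scheme, and the assumption that every leaf connects to every spine are all hypothetical choices made for the example.

```python
def build_leaf_spine(num_spines, num_leaves, hosts_per_leaf):
    """Build a toy two-tier leaf-spine fabric (hypothetical sizes and names).

    Returns the spine names, the set of links, and a host -> leaf mapping.
    """
    spines = [f"spine{s}" for s in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    links = set()

    # Assumption for this sketch: every leaf has one uplink to every spine.
    for leaf in leaves:
        for spine in spines:
            links.add((leaf, spine))

    # Each GPU host attaches to exactly one leaf switch.
    host_to_leaf = {}
    for i, leaf in enumerate(leaves):
        for h in range(hosts_per_leaf):
            host = f"gpu-host{i * hosts_per_leaf + h}"
            host_to_leaf[host] = leaf
            links.add((host, leaf))

    return spines, links, host_to_leaf


def paths_between(src, dst, spines, host_to_leaf):
    """List the shortest paths between two hosts in the toy fabric."""
    src_leaf, dst_leaf = host_to_leaf[src], host_to_leaf[dst]
    if src_leaf == dst_leaf:
        # Hosts under the same leaf never need to cross a spine.
        return [[src, src_leaf, dst]]
    # Otherwise there is one equal-length path through each spine.
    return [[src, src_leaf, spine, dst_leaf, dst] for spine in spines]


spines, links, host_to_leaf = build_leaf_spine(
    num_spines=4, num_leaves=2, hosts_per_leaf=2
)
cross_leaf_paths = paths_between("gpu-host0", "gpu-host3", spines, host_to_leaf)
same_leaf_paths = paths_between("gpu-host0", "gpu-host1", spines, host_to_leaf)
```

In this toy model, adding spines adds parallel paths between hosts on different leaves, which is what gives a fabric of this shape its scale-out capacity.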

In addition to tackling scalability and performance, Meta is also focused on optimizing traffic engineering for AI training networks. Through centralized traffic engineering, Meta dynamically distributes traffic across available paths in a load-balanced manner, ensuring job performance consistency.
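The idea of a central controller spreading traffic across available paths can be sketched in a few lines of Python. This is not Meta's algorithm: the greedy "largest flow first, least-loaded path" heuristic, the function name `assign_flows`, and the flow and path names are all assumptions chosen to illustrate load-balanced placement.

```python
def assign_flows(flows, paths):
    """Greedy, centralized flow placement (an illustrative heuristic).

    flows: list of (flow_id, demand) pairs, demand in arbitrary units.
    paths: list of path names available between the endpoints.
    Returns (assignment, load), where assignment maps each flow to a path
    and load maps each path to its total assigned demand.
    """
    load = {p: 0.0 for p in paths}
    assignment = {}

    # Place the largest flows first so that small flows can fill the gaps.
    for flow_id, demand in sorted(flows, key=lambda f: f[1], reverse=True):
        # Pick the path that currently carries the least traffic.
        best = min(paths, key=lambda p: load[p])
        assignment[flow_id] = best
        load[best] += demand

    return assignment, load


# Hypothetical flows between two GPU groups and two candidate spine paths.
flows = [("a", 100), ("b", 100), ("c", 50), ("d", 50)]
paths = ["via-spine0", "via-spine1"]
assignment, load = assign_flows(flows, paths)
```

Because the controller sees every flow and every path at once, it can even out the load in a way that per-switch hashing cannot guarantee; a production system would also have to react to link failures and changing demand, which this sketch ignores.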

To make network communication observable, Meta has introduced tools such as ROCET, the PARAM benchmarks, and the Chakra ecosystem. Together they provide top-down observability, from the training job down to the network, so engineers can analyze job performance and identify network-related issues.

Furthermore, Meta has developed Arcadia, an end-to-end AI system performance simulator. Arcadia simulates the compute, memory, and network performance of AI training clusters, which helps with design and optimization at multiple levels of the system. It gives insight into how future AI models are likely to perform and helps Meta’s engineers make data-driven decisions about infrastructure improvements.

By investing in robust network infrastructure, traffic engineering solutions, network observability tools, and performance simulators, Meta is paving the way for the future of advanced AI systems. These efforts will not only benefit Meta’s own AI development but also contribute to the advancement of the field as a whole.

