New Rules of the Game: Groq’s Deterministic LPU™ Inference Engine with Software-Scheduled Accelerator & Networking

Igor Arsovski (Groq)
Gates B12


Thu, Jan 18 2024, 4:30pm

About the talk: In this talk, we will demonstrate Groq’s approach to synchronous, software-scheduled AI acceleration and networking and showcase how we use it to unlock state-of-the-art performance and latency on Large Language Models (LLMs), including Llama-2 70B, scaled to over 500 GroqChip™ Language Processors™.

Traditional HPC systems and data centers use dynamic time- and space-sharing, where platforms dynamically coordinate the use of compute, memory, and network resources among threads or workloads. This is a natural solution for arbitrary compute workloads, whose unpredictability makes such mediation a prerequisite. Unfortunately, this results in compounding inefficiency and complexity at all layers of the stack: processor architecture, memory, networking, and more. Modern AI workloads, however, have a predictable structure allowing for efficient static scheduling of compute and network resources.

Groq has changed the rules of this game by making components deterministic from the ground up. We have developed large-scale synchronous compute platforms that empower software to make more orchestration decisions statically.

Unlike traditional networks, where packets can collide and congestion can develop, all traffic in the Groq network is completely pre-planned by the Groq™ Compiler, with zero network collisions. This maximizes not only the utilization of the links but also the number of minimal paths that can be taken between chips.

Deterministic compute and static orchestration introduce new software and hardware challenges, as well as co-optimization opportunities, which we will discuss in this talk. Overcoming these challenges unlocks opportunities for greater compute and power efficiency on AI workloads. Groq’s software-scheduled networks offer key advantages, including: (1) global network load balancing via compiler-driven network traffic scheduling; (2) high network bandwidth efficiency via low control overhead; and (3) low-latency chip-to-chip communication via a direct topology. We will showcase these advantages by demonstrating state-of-the-art performance on LLMs, including Llama-2 70B, scaled to over 500 Language Processors.

About the speaker: Igor Arsovski is Chief Architect & Fellow at Groq.

Prior to Groq, Igor was at Google, where he was responsible for SoC technology strategy for Google Cloud and Infrastructure and led and managed the technology and custom PD effort for the TPU.

Prior to Google, Igor was the CTO of the GF/Marvell ASIC business unit, where he set ASIC strategy for a team of more than 900 people while leading technical solutions for Data Center and Automotive ASICs.

Igor is an IBM Master Inventor with more than 100 patents and more than 30 IEEE papers, and has presented at premier system (SC, AI HW Summit, Linley, ML and AI Dev Com) and circuit (ISSCC, VLSI, CICC, ECTC) conferences. He currently serves on the VLSI Technology Program Committee.

SystemX Alliance (EE310)