Prof. Zhihao Jia, Carnegie Mellon University, USA
Title: Building Systems for Fast, Affordable, and Efficient Generative Large Language Models

Abstract: The high computational and memory requirements of generative large language models (LLMs) make it challenging to train and serve them quickly and cheaply. In this talk, I will present two systems for enabling fast, affordable, and efficient LLM computation. First, SpecInfer is an LLM serving system that accelerates autoregressive LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine multiple collectively boost-tuned small speculative models to jointly predict an LLM's outputs and verify their correctness against the LLM using a tree-based parallel decoding mechanism. Compared to existing LLM serving systems, SpecInfer reduces the number of LLM decoding steps by 4.4x and the end-to-end LLM inference latency by 2.4x, while preserving LLMs' generative quality.
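The core idea can be illustrated with a toy sketch (not the SpecInfer implementation, and using a single linear draft rather than a token tree): a cheap draft model proposes a run of tokens, and the target model verifies them, accepting the longest agreeing prefix plus one corrected token. One verification pass can thus commit several tokens, which is why the number of target-model decoding steps drops. The two model functions below are hypothetical stand-ins.

```python
def small_model(prefix):
    # Hypothetical fast draft model: cheap next-token guess.
    return (prefix[-1] + 1) % 10 if prefix else 0

def large_model(prefix):
    # Hypothetical target model: authoritative next token.
    # It agrees with the draft except when the guess is 7.
    guess = (prefix[-1] + 1) % 10 if prefix else 0
    return 3 if guess == 7 else guess

def speculative_decode(prompt, num_tokens, draft_len=4):
    out = list(prompt)
    steps = 0  # number of target-model verification passes
    while len(out) - len(prompt) < num_tokens:
        # 1. Draft phase: the small model proposes draft_len tokens.
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            t = small_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify phase: one target-model pass accepts the longest
        #    agreeing prefix, then substitutes its own corrected token.
        steps += 1
        ctx = list(out)
        for t in draft:
            correct = large_model(ctx)
            if t == correct:
                out.append(t)
                ctx.append(t)
            else:
                out.append(correct)  # target model's correction
                break
    return out[len(prompt):][:num_tokens], steps
```

With these toy models, generating 8 tokens takes 3 verification passes instead of 8 sequential decoding steps, and the output is identical to decoding with the target model alone, mirroring the "fewer steps, same quality" claim. SpecInfer generalizes the single draft sequence to a token tree verified in parallel.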

Second, Parcae is a system that enables low-cost LLM training and fine-tuning on preemptible instances by optimizing liveput, a novel metric that measures the expected training throughput of an LLM job under potential preemption scenarios. Parcae considers both the throughput of a job and its robustness under preemptions, achieving near-optimal performance for training LLMs under frequent preemptions. Parcae includes an availability predictor to forecast future preemptions and a liveput optimizer to discover optimal strategies to parallelize DNN training under predicted preemptions. Parcae reduces LLM training and fine-tuning cost by up to 10x compared to existing solutions.
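The liveput idea can be sketched in a few lines (an illustrative toy, not Parcae's actual optimizer): instead of picking the parallelization strategy with the highest preemption-free throughput, weight each candidate's throughput by the predicted probability of each preemption scenario and pick the best expectation. The strategy names and throughput models below are hypothetical.

```python
def choose_strategy(strategies, scenarios):
    # Liveput of a strategy: expected throughput over predicted
    # (probability, surviving_nodes) preemption scenarios.
    def liveput(s):
        return sum(p * s["throughput"](n) for p, n in scenarios)
    return max(strategies, key=liveput)

# Hypothetical throughput models for a job on 8 preemptible nodes:
# - "pipeline-heavy": fastest with all nodes, but a single preemption
#   stalls an entire pipeline, so throughput collapses.
# - "replica-heavy": lower peak throughput, but losing a node only
#   removes one data-parallel replica.
strategies = [
    {"name": "pipeline-heavy",
     "throughput": lambda n: 100.0 if n == 8 else 20.0},
    {"name": "replica-heavy",
     "throughput": lambda n: 10.0 * n},
]

# Availability predictor's output: (probability, surviving nodes).
scenarios = [(0.5, 8), (0.3, 7), (0.2, 6)]

best = choose_strategy(strategies, scenarios)
```

Here the pipeline-heavy strategy wins on preemption-free throughput (100 vs. 80), but the replica-heavy one has higher liveput (73 vs. 60) once predicted preemptions are priced in, so it is the better choice under frequent preemptions.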
