
Efficient Large Language Model Inference with HiRE and Tandem Transformers

Speaker: Praneeth Netrapalli (Google India Pvt. Ltd., Bengaluru)
When: Aug 20, 2024, 04:00 PM to 05:00 PM
Where: LH-006, Ground Floor
COLLOQUIUM TALK

Title: Efficient Large Language Model Inference with HiRE and Tandem Transformers

Abstract: We will first give an overview of the Transformer architecture and Large Language Models (LLMs) and explain the key bottlenecks in LLM inference. After a bird's-eye view of the various approaches proposed in the literature for speeding up LLM inference, we explain in detail two approaches that we have developed recently.
The first approach exploits the inherent sparsity in different layers of LLMs. Although these layers exhibit significant sparsity, exploiting it efficiently is hindered by the lack of accelerator support for unstructured sparsity and by the computational cost of identifying the important elements. We introduce HiRE, a novel technique that uses dimensionality reduction and quantization to predict the significant elements with high recall, followed by focused exact computation via an efficient approximate top-k operator. Applied to the softmax and a group-sparse FFN layer, HiRE significantly reduces computational cost while preserving accuracy, leading to improved end-to-end inference latency.
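To make this concrete, here is a minimal NumPy sketch of the high-recall-then-exact pattern, not the authors' implementation: the sizes d_model, d_ffn, rank, and k, the oversampling factor, and the use of an unquantized truncated SVD as the cheap estimator are all illustrative assumptions (HiRE additionally quantizes the estimator).

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, rank, k = 512, 2048, 64, 128  # illustrative sizes

# A dense FFN up-projection; the goal is to compute only its
# largest-magnitude output units exactly.
W = rng.standard_normal((d_model, d_ffn)) / np.sqrt(d_model)

# Offline: a cheap low-rank estimator of W (truncated SVD here); HiRE
# additionally quantizes these factors to make the estimate even cheaper.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (d_model, rank)
B = Vt[:rank, :]             # (rank, d_ffn)

def hire_ffn(x, overshoot=2):
    # 1. Cheap pass: estimate all d_ffn activations with the low-rank proxy.
    approx = (x @ A) @ B
    # 2. Approximate top-k: oversample candidates to keep recall high.
    cand = np.argpartition(-np.abs(approx), overshoot * k)[: overshoot * k]
    # 3. Focused exact computation on the candidate columns only.
    out = np.zeros(d_ffn)
    out[cand] = x @ W[:, cand]
    return out

x = rng.standard_normal(d_model)
dense = x @ W
top = np.argpartition(-np.abs(dense), k)[:k]   # true top-k units
recall = np.mean(hire_ffn(x)[top] != 0)
print(f"recall of the true top-{k} units: {recall:.2f}")

The exact pass touches only overshoot * k of the d_ffn columns, which is where the savings come from; the oversampling factor trades a little extra compute for higher recall of the true top-k.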
Second, we tackle the inherently sequential generation bottleneck of LLMs with Tandem Transformers. This architecture combines a small autoregressive model with a large model operating in block mode, where the small model leverages the large model's representations for improved accuracy. The result is better prediction quality, faster inference, and the option of a verification step to ensure output quality. Our approach outperforms standalone models and addresses the limitations of existing parallel decoding techniques.
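The following is a minimal Python sketch of such a tandem decoding loop under illustrative assumptions: large_model, small_model, the block length GAMMA, and the greedy acceptance rule are placeholders standing in for the actual models and the paper's verification criterion.

import numpy as np

rng = np.random.default_rng(0)
VOCAB, GAMMA = 100, 4  # illustrative vocabulary size and draft-block length

def large_model(tokens):
    # Stand-in for the large model: one parallel ("block mode") pass that
    # returns a representation and a next-token logit vector per position.
    reps = rng.standard_normal((len(tokens), 16))
    logits = rng.standard_normal((len(tokens), VOCAB))
    return reps, logits

def small_model(tokens, large_reps):
    # Stand-in for the small autoregressive drafter; in a tandem model it
    # attends to large_reps rather than merely receiving them.
    return int(rng.integers(VOCAB))

def tandem_decode(prompt, steps=3):
    tokens = list(prompt)
    for _ in range(steps):
        reps, _ = large_model(tokens)  # one parallel pass over the prefix
        # The small model drafts GAMMA tokens autoregressively, reusing
        # the large model's representations of the accepted prefix.
        draft = []
        for _ in range(GAMMA):
            draft.append(small_model(tokens + draft, reps))
        # Verification: the large model scores the draft in one parallel
        # pass; the longest agreeing prefix is kept (greedy acceptance).
        _, check = large_model(tokens + draft)
        base = len(tokens)
        for i, t in enumerate(draft):
            best = int(np.argmax(check[base + i - 1]))
            tokens.append(t if t == best else best)
            if t != best:
                break
    return tokens

print(tandem_decode([1, 2, 3]))

Each iteration costs roughly one parallel large-model pass plus GAMMA cheap small-model steps, so whenever several drafted tokens are accepted, the expensive model runs far fewer times than in plain autoregressive decoding.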
Based on joint work with Yashas Samaga B L, Varun Yerram, Aishwarya P S, Pranav Nair, Srinadh Bhojanapalli, Chong You, Toby Boyd, Sanjiv Kumar and Prateek Jain.

Speaker Bio: Praneeth Netrapalli is a research scientist at Google Research India, Bengaluru. He is also an adjunct professor at CMInDS, IIT Bombay and TIFR, Mumbai, and a faculty associate of ICTS, Bengaluru. Prior to this, he was a researcher at Microsoft Research. He obtained an MS and a PhD in ECE from UT Austin, and a B.Tech in EE from IIT Bombay. He is a co-recipient of the IEEE Signal Processing Society Best Paper Award 2019 and the Indian National Science Academy (INSA) Medal for Young Scientists 2021, and was an associate of the Indian Academy of Sciences (IASc) from 2019 to 2022. His current research interests are in making training and inference of large language models more efficient.