SparQ Attention – Bandwidth-Efficient LLM Inference

Hosted by the Interactive AI Centre for Doctoral Training 

Speaker: Luke Hudlass-Galley, Graphcore

Abstract: Large language model inference is bottlenecked by memory bandwidth, because the KV cache must be transferred between memory and the processor for every generated token. We analyse properties of the tensors generated by LLMs and derive from them a strategy for sparsely accessing the KV cache, reducing total data transfer by up to 8x while retaining the entire sequence history, which helps maintain statistical performance across a wide variety of tasks.
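To make the idea concrete, below is a minimal single-head, single-query NumPy sketch of this kind of two-stage sparse read, based on the published SparQ Attention method; the function name and the r/k settings are illustrative, not a definitive implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sparq_attention(q, K, V, r=16, k=64):
    """Two-stage sparse attention for one head and one query token.

    q: (d,) query vector; K, V: (S, d) key/value caches.
    Only r of the d key components are read in stage 1, and only
    k of the S cached positions in stage 2.
    """
    S, d = K.shape
    # Stage 1: approximate the scores using only the r key components
    # matching the largest-magnitude query components.
    i = np.argsort(-np.abs(q))[:r]
    # Temperature compensates for the reduced dot-product dimensionality.
    tau = np.sqrt(d * np.abs(q[i]).sum() / np.abs(q).sum())
    s_hat = softmax(q[i] @ K[:, i].T / tau)
    # Stage 2: fetch full keys and values only for the top-k positions.
    j = np.argsort(-s_hat)[:k]
    w = softmax(q @ K[j].T / np.sqrt(d))
    y = w @ V[j]
    # Reassign the score mass of the skipped positions to the mean value
    # (kept as a running statistic in practice, not recomputed like here).
    alpha = s_hat[j].sum()
    return alpha * y + (1 - alpha) * V.mean(axis=0)

# Illustrative usage with random tensors.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((4096, 128))
V = rng.standard_normal((4096, 128))
out = sparq_attention(q, K, V)
```

The bandwidth saving comes from reading only r of the d key components in the first pass and only k of the S cached positions in the second, while the mean-value term preserves a contribution from the rest of the history.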

Contact iai-cdt@bristol.ac.uk if you would like lunch, served between 13.30 and 14.00. Talks will begin at 14.00.

Contact information

Enquiries to Interactive AI CDT Admin Mailbox <iai-cdt@bristol.ac.uk>