Hosted by the Interactive AI Centre for Doctoral Training
Speaker: Luke Hudlass, Graphcore
Abstract: Large language models are bottlenecked by memory bandwidth during inference, because the KV cache must be transferred between memory and processor for every generated token. We analyse properties of the tensors produced by LLMs and derive a strategy for sparsely accessing the KV cache, reducing total data transfer by 8x while retaining the entire history of the sequence, which helps maintain statistical performance across a wide variety of tasks.
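The abstract does not specify the mechanism, but to give a flavour of the idea, below is a minimal sketch of one generic family of approaches: approximating attention scores cheaply, then fetching full key/value rows for only the top-scoring cache positions, so most of the cache is never read for a given token while the full history remains available. The function name, the score-approximation heuristic, and the parameters r and k are illustrative assumptions, not the speaker's algorithm.

```python
# Illustrative sketch only: one way to sparsely read a KV cache.
# The heuristic below (approximate scores from the r largest-|q| query
# components) is an assumption for demonstration, not the talk's method.
import numpy as np

def sparse_kv_attention(q, K, V, r=16, k=64):
    """Single-head, single-query attention that reads only part of the cache.

    q: (d,) current query; K, V: (seq, d) cached keys/values.
    r: number of query components used to approximate scores.
    k: number of cache positions whose full K/V rows are fetched.
    """
    seq, d = K.shape
    # 1) Approximate attention scores using only the r largest-|q| query
    #    components, so only r columns of K are read from memory.
    idx = np.argsort(-np.abs(q))[:r]
    approx_scores = K[:, idx] @ q[idx]
    # 2) Fetch full keys/values only for the k highest-scoring positions;
    #    the rest of the cache stays in memory untouched (history retained).
    top = np.argsort(-approx_scores)[: min(k, seq)]
    scores = (K[top] @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[top]

# Toy usage: per token, roughly seq*r + 2*k*d elements are transferred
# instead of the dense 2*seq*d.
rng = np.random.default_rng(0)
seq, d = 1024, 128
out = sparse_kv_attention(rng.normal(size=d), rng.normal(size=(seq, d)),
                          rng.normal(size=(seq, d)))
print(out.shape)  # (128,)
```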
Contact iai-cdt@bristol.ac.uk if you would like lunch, which is served between 13.30 and 14.00. Talks will begin at 14.00.