Lucy Farnik
-
Working Project Title:
Towards interpretable and controllable deep language modeling
-
Academic Background
BEng Computer Science with Innovation, University of Bristol (2020 - 2023)
General Profile:
I’m a researcher and software engineer trying to push the future of AI in a slightly more positive direction. I started coding at age 7, which allowed me to become a senior software developer at a tech startup at age 18. I worked in that role for 4 years while completing high school and a Bachelor’s degree at the University of Bristol. I started working on AI research at the end of January 2023. Since then I’ve submitted an inverse RL paper to NeurIPS as second author alongside collaborators from Oxford and UC Berkeley, worked on fine-tuning language models using fMRI data to encourage more human-like feature spaces, completed multiple research placements in GPT interpretability, and worked on a few other smaller projects. My research interests broadly include anything that can make AI safer and more robust to distribution shifts, as well as research aimed at helping policymakers create more sensible regulation around frontier models. More concretely, I’m currently interested in mechanistic interpretability and model evaluations, but I expect this to change over time.
Research Project Summary:
Large language models (LLMs) such as ChatGPT have rapidly become ubiquitous in the technology sector and are being applied in a wide variety of increasingly important domains, including customer service, automated source code generation, and augmenting the efficiency of the UK civil service. However, these models are "black boxes": no one, not even the researchers who created them, understands how these AIs "think". This makes it effectively impossible to give any guarantees about their behaviour, including their safety, their resilience to malicious inputs, and their lack of dangerous capabilities. At best, organizations such as the UK government's AI Safety Institute can perform empirical evaluations of these properties, but the results of these evaluations may not generalize beyond the distribution of inputs on which the model was tested.
The PhD project will develop novel methods for overseeing, steering, and controlling these increasingly powerful and influential systems, enabling real-time automated oversight solutions based on a model's internal states. It will begin by examining the application of sparse autoencoders (SAEs) to the analysis of these models' latent spaces. SAEs have become broadly popular in the field of mechanistic interpretability due to their ability to locate certain "concepts" within a model, but it remains unclear whether they have sufficient range and flexibility to discover the full range of causally relevant concepts.
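To make the SAE idea concrete, the sketch below (in PyTorch) shows the basic recipe this line of work builds on: reconstruct a batch of LLM activations through an overcomplete hidden layer while penalising the hidden features for being active. All names, dimensions, and hyperparameters are illustrative assumptions, not details of the project.

    # Minimal sparse autoencoder (SAE) sketch for LLM activations.
    # All dimensions, names, and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)  # activations -> overcomplete feature space
            self.decoder = nn.Linear(d_hidden, d_model)  # features -> reconstructed activations

        def forward(self, x):
            features = torch.relu(self.encoder(x))       # non-negative, encouraged to be sparse
            reconstruction = self.decoder(features)
            return reconstruction, features

    # One illustrative training step on a batch of activations
    # (in practice these would be collected from a chosen layer of an LLM).
    sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
    activations = torch.randn(1024, 768)                 # stand-in for real model activations

    optimizer.zero_grad()
    reconstruction, features = sae(activations)
    l1_coeff = 1e-3                                      # sparsity penalty strength (assumed)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    loss.backward()
    optimizer.step()

The L1 penalty on the feature activations is what pushes each activation to be explained by only a handful of features, which is what makes the individual features candidates for human-interpretable "concepts".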
As an example of what such oversight could enable: if a model's internal state suggests that it is helping a user engage in illegal activity, the interaction can be flagged for manual review. The hypothesis is that there are types of concepts which SAEs are not well suited to capturing, and that any conclusions about the safety properties of a model based on SAE-powered analysis may therefore be misleading. The project will identify concrete case studies in which SAEs fail to accurately represent important concepts, develop novel techniques for discovering the relevant representations in those cases, and then attempt to generalize from these into a methodology for analyzing model internals which outperforms the current state of the art.
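Building on the SAE sketch above, the snippet below illustrates how a single learned feature could drive the kind of flagging described in this example. The feature index and threshold are purely hypothetical; in practice they would have to be identified and calibrated empirically.

    # Hypothetical monitoring sketch built on the SAE above: flag an interaction when a
    # feature assumed to track the concept of interest fires strongly on any token.
    ILLEGAL_ACTIVITY_FEATURE = 1234   # hypothetical index of a feature found to track the concept
    THRESHOLD = 5.0                   # hypothetical activation threshold, calibrated on held-out data

    def flag_for_review(token_activations: torch.Tensor, sae: SparseAutoencoder) -> bool:
        """Return True if the monitored feature exceeds the threshold on any token."""
        _, features = sae(token_activations)             # features: (num_tokens, d_hidden)
        return bool((features[:, ILLEGAL_ACTIVITY_FEATURE] > THRESHOLD).any())

If SAEs systematically miss certain kinds of concepts, a monitor of this form would silently fail on exactly those concepts, which is why the case studies of SAE failures matter for safety conclusions.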
In later stages of the project, this knowledge about internal representations will be leveraged to investigate using LLMs to oversee and supervise the work of other LLMs. This can be loosely compared to how employees organize into company hierarchies in order to prevent insider fraud, but with the additional safety guarantees afforded by the interpretability techniques. This would enable much more reliable and robust AI systems, since failures of one LLM can be caught by another, allowing these systems to be deployed with much more confidence and trust. Novel multi-agent oversight methods inspired by social science fields (e.g. insider threat robustness, management science, and adversarial risk analysis), in which LLMs effectively monitor each other's work, will be developed. This work will be predominantly empirical, with the core objective being the evaluation of the safety properties of various oversight methods, and of how these properties may change with model scale.
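As a loose illustration of what an LLM-overseeing-LLM loop might look like, the sketch below passes a "worker" model's answer to an "overseer" model before releasing it. Both llm_worker and llm_overseer are placeholder callables (prompt in, text out); no particular model API or oversight protocol from the project is assumed.

    # Hypothetical two-agent oversight loop: a "worker" LLM produces an answer and an
    # "overseer" LLM reviews it before release. llm_worker and llm_overseer are
    # placeholder callables (prompt -> text); no specific model API is assumed.
    def overseen_generate(task: str, llm_worker, llm_overseer, max_retries: int = 3) -> str:
        for _ in range(max_retries):
            answer = llm_worker(task)
            verdict = llm_overseer(
                f"Task: {task}\nProposed answer: {answer}\n"
                "Does this answer violate any safety policy? Reply APPROVE or REJECT."
            )
            if verdict.strip().upper().startswith("APPROVE"):
                return answer                            # overseer approved: release the answer
        return "Escalated to human review."              # repeated rejections fall back to a human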
The deliverables of the PhD are to: publish a concrete case study of an area where the current SAE-based interpretability methodology fails; develop novel methods for locating a wider range of concepts within the internal representations of LLMs; and develop novel techniques for game-theoretic oversight of LLM agents, inspired by social science, in order to improve their robustness and safety.
Supervisors:
- Dr Laurence Aitchison, School of Engineering Mathematics and Technology
- Dr Conor Houghton, School of Engineering Mathematics and Technology
Website: