Kevin Flanagan

Working PhD Project Title:

Grounding Words Using Vision

Academic Background:

MRes Astrophysics, University College Dublin, 2018-2020; BSc Physics with Astronomy and Space Science, University College Dublin, 2014-2018

General Background:

I completed my undergraduate degree in Physics with Astronomy and Space Science at University College Dublin in 2018. I then stayed on for an MRes, which focused on using convolutional neural networks to identify distinctive ring-shaped signals in data from a high-energy telescope. This project involved handling a large amount of citizen science data and determining how best to use it to train models to detect those signals. From this experience I decided that I wanted to pursue AI research beyond astrophysics, which has led me here to Bristol. I enjoy the cross-disciplinary nature of many applications of AI, and I have a particular interest in computer vision.

Research Project Summary:

My PhD project focuses on creating models that allow words in sentences to be grounded, or localised, within corresponding videos. Relating language to vision matters because an interactive model can better understand a user's language input when it links that input to the surrounding visual environment. A model that learns to ground words from sentences within corresponding videos would be able to identify where each word in the sentence is present in the video. This can be done both spatially and temporally. Temporal grounding involves determining which frames of the video contain each word, while spatial grounding involves determining where within each of those frames the word appears. Ideally, a model would ground words both spatially and temporally, giving accurate locations in the video for each word. It would also ideally ground words it has not seen before, using context from the other words in the sentence and information from the video itself to gain an understanding of the new word.
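As a purely illustrative sketch of what a spatio-temporal grounding for one word could look like as data (not the project's actual representation), the Python below stores one bounding box per frame in which the word appears. The class name `WordGrounding`, the box convention and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class WordGrounding:
    """Where a single word is located in a video (illustrative only)."""
    word: str
    # Maps a frame index to the box locating the word in that frame.
    boxes: Dict[int, Box] = field(default_factory=dict)

    def temporal_grounding(self) -> List[int]:
        """Frames in which the word is present, in ascending order."""
        return sorted(self.boxes)

    def spatial_grounding(self, frame: int) -> Optional[Box]:
        """Box for the word in the given frame, or None if it is absent."""
        return self.boxes.get(frame)


# Made-up example: the word "cup" is visible in frames 12 and 13.
cup = WordGrounding("cup", {13: (42, 61, 92, 121), 12: (40, 60, 90, 120)})
frames = cup.temporal_grounding()
box_in_13 = cup.spatial_grounding(13)
box_in_5 = cup.spatial_grounding(5)
```

Here the keys of `boxes` answer the temporal question (which frames contain the word) and the values answer the spatial one (where in each frame it is).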

This work builds on previous research on grounding words in still images and on temporally grounding words in video. Little work has yet been done on combining these two aspects for full spatio-temporal grounding in video. The project will initially focus on assessing the feasibility of spatio-temporal grounding and the methods by which it can be achieved.