Daniel Whettam

Supervisors:

Website:

General Profile:

During my BSc in Computer Science I developed an interest in AI and wanted to take it further. My AI journey began with an internship at The Hartree Centre, where I researched speech recognition. After my internship, I enrolled in the Data Science MSc at The University of Edinburgh, which in turn led to my PhD with the Interactive AI CDT at Bristol. My research interests lie at the intersection of Deep Learning, Signal Processing and Computer Vision.
I love learning things and picking up new skills. To that end, my interests and hobbies outside of academia range from powerlifting, to Brazilian Jiu Jitsu, to juggling.

Research Project Summary:

Recent work has demonstrated the importance of incorporating audio into video-based egocentric action recognition [1]. Building upon this, and upon our own work demonstrating that audio is an effective modality independent of video for action recognition, we are interested in self-supervised approaches to audio-visual fusion. In particular, we are investigating how fusing the audio and visual modalities can improve self-supervised representation learning and its downstream applications to egocentric action recognition.
 
Our previous work investigated the utility of audio as a stand-alone modality for action recognition. Whilst audio is, as expected, less informative than video, we demonstrated that it is feasible to build an effective audio-only model. Combined with other work showing that audio is an informative modality for multi-modal action recognition, this result provides a strong justification for further work on audio-visual action recognition, as well as for exploring audio-visual fusion in a wider context, such as representation learning.
 
The self-supervised approach considers audio-visual fusion in the context of representation learning. In representation learning, a contrastive loss is used to learn a representation in which data points of the same class are similar to each other and those of different classes are dissimilar. Such a representation is learnt in a self-supervised manner, without access to data labels: two views of the same data point are created, one a transformed version of the other (positives), and the model learns a representation in which these views are similar while pairings drawn from different data points (negatives) are dissimilar. This formulation extends naturally to the audio-visual setting, where a positive example is an audio-visual pairing: the model learns similar representations for the audio and visual components of the same video clip, and dissimilar representations for audio-visual pairings drawn from different clips. Through the introduction of multiple modalities, we can enhance the representation learnt through this process, with the aim of producing representations that are informative for downstream tasks. In particular, we are interested in the utility of audio-visual representation learning for enhancing the performance of egocentric action recognition.
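To make the idea concrete, below is a minimal sketch of an audio-visual contrastive (InfoNCE-style) objective of the kind described above, written in PyTorch. The encoders, feature dimension, and temperature are illustrative assumptions, not the project's actual setup; matched audio-visual clip pairs in a batch act as positives, and all other cross-pairings act as negatives.

```python
# Sketch of a cross-modal contrastive loss for audio-visual pairs.
# Assumes PyTorch; `audio_feats` and `visual_feats` stand in for the outputs
# of hypothetical audio and visual encoders applied to the same batch of clips.
import torch
import torch.nn.functional as F


def audio_visual_nce_loss(audio_feats, visual_feats, temperature=0.07):
    """Contrastive loss over a batch of audio-visual clip pairs.

    audio_feats, visual_feats: (batch, dim) embeddings for the same clips.
    Row i of each is the positive pairing; every other cross-pairing in the
    batch is treated as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = F.normalize(audio_feats, dim=1)
    v = F.normalize(visual_feats, dim=1)

    # (batch, batch) similarity matrix: entry (i, j) compares audio i with video j.
    logits = a @ v.t() / temperature

    # The diagonal entries correspond to the true audio-visual pairings.
    targets = torch.arange(a.size(0), device=a.device)

    # Symmetric cross-entropy: audio-to-video and video-to-audio directions.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    audio = torch.randn(8, 128)
    video = torch.randn(8, 128)
    print(audio_visual_nce_loss(audio, video).item())
```

In this formulation the labels are simply the batch indices, so no annotation is needed; the learnt encoders can then be transferred to downstream egocentric action recognition.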
 
[1] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.