Fahd Abdelazim
-
Working Project Title
Understanding Object States in Vision-Language Models
-
Academic Background
MSc Robotics, University of Bristol (2017-19); BSc Mechanical Engineering, University of Southampton (2014-17)
General Profile:
My first experience with machine learning was during my Robotics master's, where I took AI modules and later chose a thesis with a machine learning component. My thesis used bio-inspired sensor data and machine learning to detect defects in composite materials. Through it, I experienced firsthand the impact machine learning can have on real-world applications and decided to pursue a career in machine learning.
I then worked as a machine learning engineer in the telecom industry for three years, building AI-powered applications to support business operations. During that period, I realized that there is still no clear way to build AI systems that can assist humans across different tasks, and decided I wanted to carry out research on how humans can fully capitalize on AI capabilities.
My research interests are mainly in computer vision and Explainable AI, and in building more transparent AI systems that can provide real value to all their users.
Research Project Summary:
This research will introduce improvements to Vision-Language models that allow better linking of specific concepts or attributes (such as colour or state) to physical objects (such as an apple), helping models recognize and understand the properties of objects in images. It will address the following:
(1) Assess the ability of various Vision-Language models to recognize and understand the physical states of objects.
(2) Determine how well these models can distinguish between different object states.
(3) Identify the limitations and challenges faced by existing Vision-Language models in accurately recognizing object states. This includes understanding the shortcomings in their architecture, training data, and the effectiveness of their learning objectives.
(4) Address specific areas where improvements can be made to Vision-Language models, such as refining the architecture for better object-state understanding and developing better training objectives.
Research Questions:
Do Vision-Language models effectively recognize and encode the physical states of objects?
What are the limitations of current Vision-Language models in recognizing object states?
What architectural changes can improve object-state understanding in Vision-Language models?
What training objectives are necessary for Vision-Language models to improve object-state understanding?
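The first two questions could be probed with a simple zero-shot protocol: embed an image and a set of state-describing captions, then check whether the model prefers the caption with the correct state. The sketch below illustrates the idea with placeholder embeddings and a stdlib cosine similarity; in a real evaluation the vectors would come from a VLM's image and text encoders (e.g. CLIP), and the prompt names here are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_state(image_emb, state_prompts):
    """Return the state caption whose embedding is closest to the image.

    state_prompts maps a state caption (e.g. "a sliced apple") to its
    text embedding. Both embeddings would come from a VLM in practice;
    here they are hand-made vectors for illustration.
    """
    return max(state_prompts, key=lambda s: cosine(image_emb, state_prompts[s]))

# Toy 3-d embeddings (illustrative only, not real model outputs).
prompts = {
    "a whole apple":  [1.0, 0.1, 0.0],
    "a sliced apple": [0.1, 1.0, 0.0],
}
image = [0.2, 0.9, 0.1]   # stand-in for the embedding of a sliced-apple photo
print(predict_state(image, prompts))  # → a sliced apple
```

Aggregating this prediction over a labeled set of images gives a per-state accuracy, which directly measures how well a given model distinguishes object states.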
To enhance object-state understanding in Vision-Language models, several strategies will be used, each focusing on a different aspect of the models. First, the effect of attending to individual objects within images, rather than the whole image, will be tested; this can help models associate concepts with specific objects. The next stage will focus on explicit training objectives, such as tasks that differentiate between similar objects based on their attributes, which can reinforce object-state understanding.
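One common way to realize such an explicit objective is an InfoNCE-style contrastive loss where the negatives are captions that differ from the correct one only in object state (so-called hard negatives). The sketch below is a minimal stdlib illustration under that assumption, not the project's actual implementation; the similarity inputs stand in for cosine similarities from a VLM such as CLIP.

```python
import math

def state_contrastive_loss(sim_pos, sims_neg, temperature=0.07):
    """InfoNCE-style loss over state-specific captions.

    sim_pos:  similarity between the image and its correct-state caption
              (e.g. "a sliced apple").
    sims_neg: similarities to hard negatives differing only in state
              (e.g. "a whole apple", "a peeled apple").
    Returns the negative log-probability of the correct caption under a
    temperature-scaled softmax (computed in a numerically stable way).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_norm)

# The loss is small when the correct state is clearly preferred...
low = state_contrastive_loss(0.9, [0.1, 0.2])
# ...and larger when the model cannot separate the states.
high = state_contrastive_loss(0.3, [0.3, 0.3])
print(low < high)  # True
```

Because every negative shares the same object and differs only in state, minimizing this loss pressures the model to encode the state attribute itself rather than relying on object identity alone.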
By implementing these strategies, it is anticipated that Vision-Language models' ability to recognize object states will be enhanced, leading to improved downstream understanding and reasoning capabilities.
The research will lead to improved object-state recognition that can enhance robotic systems' ability to interact with their environment. For instance, robots could better understand whether an object is whole, sliced, or in another state, allowing more effective manipulation and task execution in activities such as cooking or assembly. In Augmented Reality and Virtual Reality applications, accurate recognition of object states can enable more immersive and interactive experiences; for example, virtual assistants could provide context-aware information based on the current state of physical objects in the user's environment. In fields such as video editing and content creation, understanding object states can facilitate automated workflows; for example, video editing software could automatically identify and suggest edits based on the states of objects within the footage.
Supervisors:
- Dr Michael Wray, School of Computer Science
- Professor Dima Damen, School of Computer Science
Website: