Image Question Answering and Video Question Answering are two tasks that involve building models capable of analyzing the visual content of an image or a video and producing a meaningful answer to questions about that content. Both tasks require spatial, frame-level reasoning. Video Question Answering additionally requires temporal, video-level reasoning, which further raises the difficulty of the task. Solving these tasks would demonstrate the ability to train models that jointly analyze and reason about visual and textual content at a human level: such models would learn to isolate and pinpoint objects of interest in a video (or image), and to identify and reason about their interactions in both the spatial and temporal domains. Image and Video Question Answering thus represent a challenging but fundamental problem for both the Computer Vision and Natural Language Processing communities.
During my Ph.D. I will work on the Video Question Answering task, focusing on videos recorded from an egocentric perspective. In addition to the temporal and spatial reasoning aspects, this task also requires the analysis of several egocentric cues. Finally, Egocentric Video Question Answering will be useful in several fields, for example as a visual aid that helps a worker develop new skills and improve existing ones.