From Brain Science to Intelligent Machines

Multi-modal aggregation explains why a picture is worth a thousand words

Date: Tuesday 2/2/2016

Venue: MS105 (Boardroom)

Time: 2.45 pm

Speaker: Prof. Khurshid Ahmad

Affiliation:  School of Computer Science and Statistics, Trinity College, The University of Dublin

Project Co-ordinator, EU FP7 Slandail Project (2014-2017)

 

Multi-modal aggregation explains why a picture is worth a thousand words

By Prof. Khurshid Ahmad

School of Computer Science and Statistics, Trinity College, The University of Dublin

Abstract

The central issue in the use of images, both still and video, for decision making is retrieving images from an arbitrary collection in a timely manner and with as great a precision as possible. This question is as relevant to a photo (and now video) album as it is to the work of disaster managers. The research question that follows is this: can we make a corpus of images as accessible to retrieval, browsing and summarization as a text corpus is now? This is a revolutionary notion that would change all multi-modal environments, including the World Wide Web.

The barrier to this notion is that, whereas the codes designed for expressing and retrieving information in texts are well understood in computational terms, analogous codes for images, if they exist at all, are not. What we currently know about human and animal behaviour points to the use of one modality (say language, sound or spatial perception) in understanding information received in another (say vision): the collateral use of texts, sounds and spatial perception in interpreting visual signals appears very valuable. Visual enumeration (subitization), aspects of synaesthesia, the appreciation of (abstract) art, and observations of humans engaged in critical decision making all show this multi-modality at work. The codes that have to be designed for expressing and retrieving information in images may therefore be abstracted from data that is collateral to an image.

I will describe how we can begin to index images and videos automatically using image features (shape, colour and texture distribution) together with features of a text that is collateral to an image. The collateral information has many sources: for example, manual annotation, a linguistic description of an image embedded in a text, or the sound track associated with a video. The image and text vectors can be used to train a system that associates image features with a set of single and compound words that have a high probability of co-occurring with those features. Such a system can, in principle, not only perform picture naming but also word illustration. A neural-network-based architecture will be presented that has been implemented, with a modicum of success, for automatically naming pictures and illustrating words.
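To make the image-to-term association concrete, below is a minimal sketch, not the speaker's implementation, of training a small network that maps hypothetical image feature vectors (e.g. colour/texture histogram bins) to probabilities of collateral text terms. The feature dimensions, term vocabulary and toy training data are all illustrative assumptions; predicting terms from image features corresponds to "picture naming", while ranking images by a term's score would correspond to "word illustration".

```python
# Sketch: associate image feature vectors with collateral text terms
# using a single hidden-layer network trained by backpropagation.
# All data below is synthetic and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 images, each with an 8-dim feature vector
# (stand-in for colour/texture histogram bins) and 4 possible terms
# drawn from collateral text (1 = term applies to the image).
image_features = rng.random((6, 8))
term_labels = rng.integers(0, 2, size=(6, 4)).astype(float)

# Weights of a single hidden-layer network: image features -> term probabilities.
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.5
for epoch in range(2000):
    # Forward pass: hidden representation, then per-term probabilities.
    hidden = np.tanh(image_features @ W1)
    term_probs = sigmoid(hidden @ W2)

    # Cross-entropy gradient, backpropagated through both weight matrices.
    d_out = (term_probs - term_labels) / len(image_features)
    d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)
    W2 -= lr * hidden.T @ d_out
    W1 -= lr * image_features.T @ d_hidden

# "Picture naming": score each term for a previously unseen image.
new_image = rng.random(8)
predicted = sigmoid(np.tanh(new_image @ W1) @ W2)
print("term association scores:", np.round(predicted, 2))
```

In this sketch the text side is reduced to binary term indicators; richer collateral text (annotations, captions, transcribed sound tracks) would be vectorised first, but the association step between the two modalities follows the same pattern.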