Multi-modal aggregation explains why a picture is worth a thousand words
The barrier to this notion is that, whereas the codes designed for the expression and retrieval of information in texts are well understood in computational terms, analogous codes for images, if they exist at all, are not. What we currently know about human and animal behaviour points to the use of one modality (say language, sound, or spatial perception) in understanding information received in another (say vision): the collateral use of texts, sounds, and spatial perception in interpreting visual signals appears very valuable. Visual enumeration or subitization, aspects of synaesthesia, the appreciation of (abstract) art, and observations of humans engaged in critical decision making, for example, all show this multi-modality at work. The codes needed for the expression and retrieval of information in images may therefore be abstracted from data that is collateral to an image.
I will describe how we can begin to index images and videos automatically using image features (shape, colour, and texture distributions) together with features of a text that is collateral to an image. The collateral information has many sources: for example, manual annotation, a linguistic description of an image embedded in a text, or the sound track associated with a video. The image and text vectors can be used to train a system that associates image features with a set of single and compound words having a high probability of co-occurring with those features. Such a system, in principle, can perform not only picture naming but also word illustration. A neural-network-based architecture will be presented that has been implemented with a modicum of success in automatically naming pictures and in illustrating words.
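To make the association concrete, the following is a minimal sketch in Python of the kind of image-word association such a system might learn. All names, dimensions, and the toy data are illustrative assumptions, and a single-layer logistic model stands in for the neural network architecture mentioned above: picture naming reads off the words most strongly associated with an image's feature vector, and word illustration ranks images by their association with a given word.

```python
import numpy as np

# Hypothetical toy setup: image feature vectors (e.g. concatenated shape,
# colour, and texture histograms) paired with bag-of-words vectors built
# from the collateral text. Dimensions and vocabulary are illustrative.
rng = np.random.default_rng(0)
n_images, img_dim = 200, 64                      # assumed feature dimensionality
vocab = ["sky", "grass", "water", "building", "person"]  # toy vocabulary

X = rng.random((n_images, img_dim))              # image feature vectors
Y = (rng.random((n_images, len(vocab))) > 0.7).astype(float)  # word co-occurrence

# One-layer association network: learn weights W so that sigmoid(X @ W)
# approximates the probability that each word co-occurs with an image.
W = np.zeros((img_dim, len(vocab)))
lr = 0.1
for _ in range(500):
    P = 1.0 / (1.0 + np.exp(-(X @ W)))           # predicted word probabilities
    W -= lr * X.T @ (P - Y) / n_images           # cross-entropy gradient step

def name_picture(x, k=3):
    """Picture naming: the k words most strongly associated with features x."""
    scores = 1.0 / (1.0 + np.exp(-(x @ W)))
    return [vocab[i] for i in np.argsort(scores)[::-1][:k]]

def illustrate_word(word, k=3):
    """Word illustration: the k images whose features best match the word."""
    j = vocab.index(word)
    scores = 1.0 / (1.0 + np.exp(-(X @ W[:, j])))
    return np.argsort(scores)[::-1][:k]          # indices of best-matching images

print(name_picture(X[0]))                        # e.g. ['grass', 'sky', 'person']
print(illustrate_word("sky"))                    # indices of candidate images
```

The point of the sketch is only that the same learned association matrix supports both directions of query: rows are indexed by image features for naming, columns by words for illustration.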