Aman's AI Journal • Multimodal Machine Learning • Multimodal Challenges

Introduction

Multimodality is a new AI paradigm where various modalities (text, speech, videos, images) are combined with multiple intelligence processing algorithms to achieve higher performance.
Multimodal applications currently include various discriminative tasks such as information retrieval, mapping and fusion.

There are several ways to achieve multi-modality in industry.
Let’s start by first mentioning different uses of multi-modal representations:
- Ensure that similarity in the representation space implies similarity in the corresponding concepts.
- Support various discriminative tasks such as retrieval, mapping, and fusion.
- Fill in a missing modality, given the other modalities that are present.
The five core challenges in multimodal ML are – representation, translation, alignment, fusion, and co-learning. Let’s start looking through each of these individually.

The greatest challenge of multimodal data is to summarize the information from multiple modalities (or views) so that complementary information is combined while the redundant parts of the modalities are filtered out.
Due to the heterogeneity of the data, some challenges naturally arise, including different kinds of noise, alignment of modalities (or views), and techniques to handle missing data. (source)
We will continue to study multimodal representations using two broad approaches: Joint and Coordinated representations.

The coordination between modalities can range from very weak (where their spaces do not overlap) to very strong (which ends up being a joint representation).
Structured coordinated embeddings:

Coordinated Representations involve projecting each modality into its own space, with those spaces coordinated through a constraint.
This kind of an approach is more useful for modalities which are fundamentally very different and might not work well in a joint space.
Due to the variety of modalities in nature, Coordinated Representations have a huge advantage over Joint Representations, which gives us reason to believe that coordination using constraints is the way to go in the field of multimodal representation.
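
As a minimal sketch of a coordinated representation, the snippet below projects an image vector and a text vector into their own spaces and ties them together with a similarity constraint. The choice of cosine similarity with a margin-based ranking loss is an assumption made for illustration, and the names `project`, `cosine_similarity`, and `coordination_loss`, the weight matrices, and the toy feature values are all hypothetical.

```python
import math

def project(vector, weights):
    """Linear projection: multiply a feature vector by a weight matrix (list of rows)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def coordination_loss(image_emb, text_emb, mismatched_text_emb, margin=0.2):
    """Hinge-style constraint: the matched pair should be more similar
    than the mismatched pair by at least `margin` (an assumed value)."""
    positive = cosine_similarity(image_emb, text_emb)
    negative = cosine_similarity(image_emb, mismatched_text_emb)
    return max(0.0, margin - positive + negative)

# Each modality keeps its own projection (hypothetical toy weights).
W_image = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
W_text = [[1.0, 0.0], [0.0, 1.0]]

image_emb = project([1.0, 0.0, 2.0], W_image)
matched_text_emb = project([0.5, 1.0], W_text)
mismatched_text_emb = project([2.0, -1.0], W_text)

print(coordination_loss(image_emb, matched_text_emb, mismatched_text_emb))
```

Because each modality has its own projection, either one can be embedded on its own at inference time; only the loss couples the two spaces during training.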

Joint Representations involve projecting all the modalities to a common space while preserving information from the given modalities.
Data from all modalities is required at training and inference time which can potentially make dealing with missing data hard.
In our study, we propose a recurrent model which can fuse different views of a modality at each time-step and finally use the joint representation to complete the task at hand (like classification, regression, etc.). (source)
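
A joint representation can be sketched very simply as concatenating the per-modality features and projecting them into one shared space. This is not the recurrent model mentioned above, only a simplified illustration; the function name `joint_representation`, the ReLU non-linearity, and the toy weights are assumptions.

```python
def joint_representation(image_features, text_features, weights):
    """Concatenate both modalities and project them into a single shared space.

    Raises ValueError when a modality is missing, which illustrates why joint
    representations make missing data hard to handle.
    """
    if image_features is None or text_features is None:
        raise ValueError("a joint representation needs every modality to be present")
    concatenated = list(image_features) + list(text_features)
    # One output unit per weight row, followed by a ReLU non-linearity.
    return [max(0.0, sum(w * x for w, x in zip(row, concatenated))) for row in weights]

# Hypothetical toy inputs: 2 image features, 1 text feature, 2 shared dimensions.
shared_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
print(joint_representation([1.0, 2.0], [3.0], shared_weights))
```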

For tasks in which all of the modalities are present at inference time, a joint representation is better suited.
On the other hand, if one of the modalities is missing, a coordinated representation is well suited.

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.
The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively.
These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. (source)
Below we can see an image captioning example gone hilariously wrong:

As we can see, the model above is unable to understand the visual scene and produce a grammatical sentence in sync. Doing both together is essential for a strong multimodal model.
Multimodal translation models come in two flavors: example-based and generative:

Example-based models store a dictionary of translations, as we see above, and use it to map from one modality to another.
- During inference, the model will fetch the closest match from the dictionary or create the translation by inferring from what the dictionary provides.
- These models thus need to store a lot more information and are very slow to run.
Generative models produce translations without referring to the training data at the time of inference.
- Generative models fall into three categories: grammar-based, transformer model, and continuous generation.
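
The example-based approach described above can be sketched as a nearest-neighbor lookup over a stored dictionary. The dictionary entries, the feature vectors, and the function name `retrieve_translation` are hypothetical, and Euclidean distance is just one possible choice of similarity measure.

```python
import math

def retrieve_translation(query_features, dictionary):
    """Return the stored target of the dictionary entry closest to the query.

    `dictionary` is a list of (source_features, target) pairs. Every entry is
    scanned, so inference cost grows with the size of the dictionary.
    """
    closest = min(dictionary, key=lambda pair: math.dist(query_features, pair[0]))
    return closest[1]

# Hypothetical image-feature -> caption dictionary.
caption_dictionary = [
    ([0.0, 0.0], "a cat sitting on a sofa"),
    ([1.0, 1.0], "a dog running on grass"),
]
print(retrieve_translation([0.9, 0.8], caption_dictionary))
```

The linear scan over stored pairs is what makes this style of model both memory-hungry and slow compared with a generative model, which discards the training data after learning.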

Multimodal alignment is finding relationships and correspondences between two or more modalities.

To align different modalities, a model has to measure similarities between them and has to deal with long-range dependencies.
Other difficulties involved in multimodal alignment include lack of annotated datasets, designing good similarity metrics between modalities, and the existence of multiple correct alignments.
There are two types of multimodal alignment:
- Explicit – whose goal is to find correspondences between modalities and to align data from different modalities of the same event.
  - E.g., aligning a speech signal to a transcript
- Implicit – alignment helps when solving a different task (for example, “Attention” models). It's a precursor to several downstream tasks like classification.
  - E.g., machine translation
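
One classic way to perform explicit alignment between two sequences of different lengths is dynamic time warping (DTW). The sketch below is an illustration under simplifying assumptions: each element of both sequences is reduced to a single number, the absolute difference is used as the distance, and the function name `dtw_align` is hypothetical.

```python
def dtw_align(seq_a, seq_b):
    """Align two numeric sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of (index_a, index_b) pairs.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of A with the first j of B.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = distance + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

# Toy stand-ins for, say, transcript units and speech frames.
total_cost, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
print(total_cost, path)
```

The returned path may map one element to several elements of the other sequence, which is how a single transcript unit can cover multiple speech frames.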

Multimodal fusion is probably one of the more important topics and challenges.
Fusion is the practice of joining information from two or more modalities to solve a classification or regression problem.
Using multiple modalities provides more robust predictions and allows us to capture complementary information.
Multimodal fusion models could still be used even if one of the modalities is missing.
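
One simple way to keep a fusion model usable when a modality is missing is late (decision-level) fusion: each modality produces its own class probabilities, and only the available ones are averaged. This is a sketch of one strategy among several, not a definitive recipe; the function name `late_fusion` and the toy probabilities are hypothetical.

```python
def late_fusion(predictions):
    """Average per-modality class probabilities, skipping missing modalities.

    `predictions` maps a modality name to a list of class probabilities,
    or to None when that modality is unavailable.
    """
    available = [probs for probs in predictions.values() if probs is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    num_classes = len(available[0])
    return [sum(probs[c] for probs in available) / len(available) for c in range(num_classes)]

# Both modalities present.
print(late_fusion({"audio": [0.6, 0.4], "video": [0.2, 0.8]}))
# Video missing: the prediction falls back to audio alone.
print(late_fusion({"audio": [0.6, 0.4], "video": None}))
```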

Co-learning is the challenge of transferring learnings or knowledge from one modality to another.
When building a model in a modality for which resources are limited (lack of annotated data, noisy input, or unreliable labels), transferring knowledge from a resource-rich modality is quite useful.