Introduction
- Multimodality is an AI paradigm in which several modalities (text, speech, video, images) are combined with modality-specific processing algorithms to achieve higher performance than any single modality alone.
- Current multimodal applications span various discriminative tasks such as information retrieval, cross-modal mapping, and fusion.
Multimodal representations

- There are several ways to achieve multimodality in industry.
- Let's start by mentioning the different uses of multimodal representations:
- We want similarity in the representation space to imply similarity between the corresponding concepts.
- We want them to be useful for various discriminative tasks such as retrieval, mapping, and fusion.
- We want to be able to fill in a missing modality, given the other modalities that are present.

- The five core challenges in multimodal ML are representation, translation, alignment, fusion, and co-learning. Let's look at each of these individually.

- One of the greatest challenges of multimodal data is to summarize the information from multiple modalities (or views) in such a way that complementary information is combined while the redundant parts of the modalities are filtered out.
- Due to the heterogeneity of the data, several challenges naturally spring up, including different kinds of noise, alignment of modalities (or views), and techniques to handle missing data. (source)
- We will continue to study multimodal representations using two broad approaches: Joint and Coordinated representations.

Coordinated representation
- The coordination between modalities can range from very weak (where their spaces barely interact) to very strong (which effectively becomes a joint representation).
- Structured coordinated embeddings are one common variant, where the coordination imposes additional structural constraints between the spaces:

- Coordinated Representations involve projecting each modality into its own space, but those spaces are coordinated using a constraint (for example, a similarity constraint).
- This kind of approach is more useful for modalities that are fundamentally very different and might not work well in a joint space.
- Given the variety of modalities encountered in practice, Coordinated Representations hold a notable advantage over Joint Representations, which suggests that coordination via constraints is a promising direction for multimodal representation.
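- To make this concrete, here is a minimal sketch of a coordinated representation: two hypothetical modality-specific encoders whose output spaces are tied together by a similarity constraint. The dimensions and the CLIP-style contrastive loss are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinatedEncoders(nn.Module):
    """Two modality-specific encoders whose output spaces are
    coordinated through a similarity (contrastive) constraint."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image -> its own space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text  -> its own space

    def forward(self, img_feats, txt_feats):
        # L2-normalize so cosine similarity reduces to a dot product
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Pull matching image/text pairs together, push mismatches apart."""
    logits = z_img @ z_txt.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0))      # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy batch of pre-extracted features
model = CoordinatedEncoders()
z_img, z_txt = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = contrastive_loss(z_img, z_txt)
```

- The contrastive loss plays the role of the coordination constraint: the two spaces remain separate, but are forced to agree on which image and which text describe the same concept.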
Joint representation
- Joint Representations involve projecting all the modalities to a common space while preserving information from the given modalities.
- Data from all modalities is required at both training and inference time, which can make dealing with missing data hard.
- As an example from the literature: "In our study, we propose a recurrent model which can fuse different views of a modality at each time-step and finally use the joint representation to complete the task at hand (like classification, regression, etc.)." (source)
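- For contrast, here is a minimal sketch of a joint representation, assuming pre-extracted audio and text features; the feature dimensions, fusion by concatenation, and the classifier head are placeholder choices, not the recurrent model from the quoted study.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Project all modalities into a single shared space by fusing them
    (here: simple concatenation) before a common encoder."""
    def __init__(self, audio_dim=128, text_dim=768, joint_dim=256, n_classes=5):
        super().__init__()
        self.joint_encoder = nn.Sequential(
            nn.Linear(audio_dim + text_dim, joint_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(joint_dim, n_classes)

    def forward(self, audio_feats, text_feats):
        # both modalities are required at train and inference time
        joint = self.joint_encoder(torch.cat([audio_feats, text_feats], dim=-1))
        return self.classifier(joint)

model = JointRepresentation()
logits = model(torch.randn(4, 128), torch.randn(4, 768))   # shape (4, 5)
```

- Note that the forward pass needs both modalities at once, which is exactly why missing data is harder to handle with joint representations.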
Coordinated vs. Joint representation
- For tasks in which all of the modalities are present at inference time, a joint representation is better suited.
- On the other hand, if one of the modalities may be missing, a coordinated representation is better suited.
Translation
- Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.
- The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit the audio and visual modalities respectively (audio for speech, visual for images and video).
- These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. (source)
- Below we can see an image captioning example gone hilariously wrong:

- As we can see, the model above fails to interpret the visual scene and to produce a grammatical sentence that matches it; keeping the two in sync is essential for a strong multimodal model.
- Multimodal translation models come in two flavors: example-based and generative:

- Example-based models store a dictionary of translations, as described above, and use it to map from one modality to another (a retrieval sketch appears after this list).
- During inference, the model fetches the closest match from the dictionary or constructs the translation by combining what the dictionary provides.
- These models therefore need to store a lot more information and can be slow to run.
- Generative models produce translations without referring to the training data at the time of inference.
- Generative models fall into three categories: grammar-based, encoder-decoder (e.g., transformer-based), and continuous generation models.
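- Below is a minimal sketch of the example-based flavor: a stored dictionary of (embedding, caption) pairs queried by nearest-neighbour retrieval. The embeddings, captions, and dimensions are toy placeholders rather than a real translation system.

```python
import torch
import torch.nn.functional as F

# toy "dictionary" of stored translations: image embedding -> caption
stored_embeddings = F.normalize(torch.randn(1000, 256), dim=-1)
stored_captions = [f"caption {i}" for i in range(1000)]

def retrieve_translation(query_embedding):
    """Example-based translation: return the caption of the closest
    stored example instead of generating a new sentence."""
    query = F.normalize(query_embedding, dim=-1)
    sims = stored_embeddings @ query        # cosine similarity to every example
    best = int(torch.argmax(sims))
    return stored_captions[best]

print(retrieve_translation(torch.randn(256)))
```

- The dictionary grows with the training data, which illustrates why example-based models store a lot of information and can be slow at inference.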
Alignment
- Multimodal alignment is finding relationships and correspondences between two or more modalities.

- To align different modalities, a model has to measure similarities between them and has to deal with long-range dependencies.
- Other difficulties involved in multimodal alignment include lack of annotated datasets, designing good similarity metrics between modalities, and the existence of multiple correct alignments.
- There are two types of multimodal alignment:
- Explicit – whose goal is to find correspondences between modalities and to align data from different modalities of the same event.
- Eg: Aligning speech signal to a transcript
- Implicit – alignment that helps when solving a different task (for example, attention models). It's often a precursor to several downstream tasks like classification.
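- As a minimal sketch of implicit alignment, the snippet below computes cross-modal attention between toy video-frame and word features; the feature sizes are arbitrary assumptions, and the attention weights act as a soft alignment between words and frames.

```python
import torch
import torch.nn.functional as F

# toy features: 20 video frames and 7 words, both projected to 64 dims
frames = torch.randn(20, 64)
words  = torch.randn(7, 64)

# implicit alignment: each word attends over all frames
scores = words @ frames.t() / (64 ** 0.5)   # (7, 20) similarity scores
alignment = F.softmax(scores, dim=-1)       # soft alignment weights per word

# aligned frame context for each word, usable by a downstream task
word_context = alignment @ frames           # (7, 64)
print(alignment.argmax(dim=-1))             # hardened alignment: best frame per word
```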
Fusion
- Multimodal fusion is arguably one of the most important of these challenges.
- Fusion is the practice of joining information from two or more modalities to solve a classification or regression problem.
- Using multiple modalities provides more robust predictions and allows us to capture complementary information.
- Multimodal fusion models can still be used even if one of the modalities is missing; late (decision-level) fusion, for instance, degrades gracefully, as the sketch below shows.
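- A minimal sketch contrasting early (feature-level) and late (decision-level) fusion; the feature dimensions and linear classifiers are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

audio = torch.randn(4, 128)   # toy audio features
video = torch.randn(4, 512)   # toy video features

# early fusion: concatenate features, then classify once
early_clf = nn.Linear(128 + 512, 3)
early_logits = early_clf(torch.cat([audio, video], dim=-1))

# late fusion: classify each modality separately, then combine decisions
audio_clf, video_clf = nn.Linear(128, 3), nn.Linear(512, 3)
late_logits = (audio_clf(audio) + video_clf(video)) / 2

# late fusion degrades gracefully if one modality is missing
audio_only_logits = audio_clf(audio)
```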

Co-learning
- Co-learning is the challenge of transferring learnings or knowledge from one modality to another.
- When building a model for a modality in which resources are limited (lack of annotated data, noisy input, unreliable labels), transferring knowledge from a resource-rich modality is quite useful.
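- A minimal sketch of such cross-modal transfer via distillation, assuming paired data and a teacher already trained on the resource-rich modality; the linear models, dimensions, and the choice of text teaching audio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(768, 10)   # trained on the resource-rich modality (e.g., text)
student = nn.Linear(128, 10)   # resource-poor modality (e.g., audio), few labels

# paired examples: the same events observed in both modalities
text_feats, audio_feats = torch.randn(16, 768), torch.randn(16, 128)

# distillation: the student mimics the teacher's soft predictions
with torch.no_grad():
    soft_targets = F.softmax(teacher(text_feats), dim=-1)
student_log_probs = F.log_softmax(student(audio_feats), dim=-1)
loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
loss.backward()   # knowledge flows from the text modality into the audio model
```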

References