• Multimodality is a new AI paradigm where various modalities (text, speech, videos, images) are combined with multiple intelligence processing algorithms to achieve higher performance.
  • Multimodal applications currently include various discriminative tasks such as information retrieval, mapping and fusion.

Multimodal representations

  • There are several ways to achieve multi-modality in industry.
  • Let’s start with what we want from multimodal representations:
    • We want similarity in the representation space to imply similarity in the corresponding concepts.
    • We want them to be useful for various discriminative tasks such as retrieval, mapping, and fusion.
    • We want to be able to fill in a missing modality, given the other modalities that are present.
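The first two goals above can be illustrated with a minimal retrieval sketch. It assumes both modalities have already been embedded into one shared space; the vectors, the dictionary keys, and the helper names `cosine_similarity` and `retrieve` are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, candidates):
    """Return the name of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Hypothetical, hand-made embeddings in a shared text/image space.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}
text_query = [0.8, 0.2, 0.1]  # stand-in for the embedding of the caption "a dog"

# Cross-modal retrieval: find the image closest to the text query.
best_match = retrieve(text_query, image_embeddings)
print(best_match)
```

If similarity in the space tracks similarity of concepts, the text query lands nearest to the image that depicts the same thing, which is what makes retrieval across modalities possible.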

  • The five core challenges in multimodal ML are – representation, translation, alignment, fusion, and co-learning. Let’s start looking through each of these individually.

  • The greatest challenge of multimodal data is to summarize the information from multiple modalities (or views) so that complementary information is combined while the redundant parts of the modalities are filtered out.
  • Due to the heterogeneity of the data, some challenges naturally arise, including different kinds of noise, alignment of modalities (or views), and techniques to handle missing data. (source)
  • We will continue to study multimodal representations using two broad approaches: Joint and Coordinated representations.

Coordinated representation

  • The degree of coordination between modalities ranges from very weak (where their spaces do not overlap) to very strong (which ends up being a joint representation).
  • Structured coordinated embeddings:

  • Coordinated Representations involve projecting each modality to its own space, with those spaces coordinated through a constraint.
  • This kind of approach is more useful for modalities that are fundamentally very different and might not work well in a joint space.
  • Due to the variety of modalities in nature, Coordinated Representations have a huge advantage over Joint Representations, which gives us reason to believe that coordination using constraints is the way to go in the field of multimodal representation.
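As a rough sketch of what "separate spaces coordinated by a constraint" can look like, the snippet below gives each modality its own linear projection and uses a margin-based ranking loss on cosine similarity as the constraint. The choice of constraint, the projection matrices `W_img` and `W_txt`, and all input vectors are assumptions made for illustration; the text above does not prescribe a specific one.

```python
import math

def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def coordination_loss(img_vec, txt_pos, txt_neg, W_img, W_txt, margin=0.2):
    """Hinge ranking loss: the matched image/text pair should be more
    similar than the mismatched pair by at least `margin`."""
    z_img = project(W_img, img_vec)   # image goes through its own projection
    z_pos = project(W_txt, txt_pos)   # text goes through a different projection
    z_neg = project(W_txt, txt_neg)
    return max(0.0, margin - cosine(z_img, z_pos) + cosine(z_img, z_neg))

# Hypothetical projections: a 3-d image feature and a 2-d text feature,
# each mapped by its own matrix into a 2-d coordinated space.
W_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_txt = [[1.0, 0.0], [0.0, 1.0]]

image = [1.0, 0.0, 0.5]
matching_text = [0.9, 0.1]
other_text = [0.1, 0.9]

loss_matched = coordination_loss(image, matching_text, other_text, W_img, W_txt)
loss_swapped = coordination_loss(image, other_text, matching_text, W_img, W_txt)
print(loss_matched, loss_swapped)
```

The loss is zero when the constraint is already satisfied and positive when the mismatched text sits closer to the image than the matching one. In training, gradients of such a loss would update each projection separately, so either modality can still be embedded on its own afterwards.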

Joint representation

  • Joint Representations involve projecting all the modalities to a common space while preserving information from the given modalities.
  • Data from all modalities is required at both training and inference time, which can make dealing with missing data hard.
  • In our study, we propose a recurrent model which can fuse different views of a modality at each time-step and finally use the joint representation to complete the task at hand (like classification, regression, etc.). (source)
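A minimal sketch of a joint representation, assuming the simplest possible design: concatenate the per-modality feature vectors and pass them through one shared linear layer with a ReLU. This is not the recurrent model mentioned above; the function name `joint_representation`, the weights, and the inputs are hypothetical.

```python
def joint_representation(text_vec, image_vec, weights, bias):
    """Concatenate both modalities, then apply one shared linear layer + ReLU."""
    fused_input = text_vec + image_vec  # list concatenation
    if len(fused_input) != len(weights[0]):
        # The shared layer expects features from every modality.
        raise ValueError("joint model needs every modality at inference time")
    return [
        max(0.0, sum(w * x for w, x in zip(row, fused_input)) + b)
        for row, b in zip(weights, bias)
    ]

# Hypothetical 2-d text feature and 1-d image feature mapped to a 2-d joint space.
weights = [[1.0, 0.0, 1.0], [0.0, -1.0, 0.0]]
bias = [0.0, 0.5]

joint = joint_representation([1.0, 2.0], [0.5], weights, bias)
print(joint)
```

Because the layer consumes the concatenated vector, leaving out one modality changes the input size and the call fails. This mirrors the point above that joint representations need all modalities at inference time unless the missing one is imputed somehow.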

Coordinated vs. Joint representation

  • For tasks in which all of the modalities are present at inference time, the joint representation is better suited.
  • On the other hand, if one of the modalities may be missing at inference time, the coordinated representation is better suited.


Multimodal translation

  • Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.
  • The most prominent tasks in this area are spoken language translation, which exploits the audio modality, and image-guided and video-guided translation, which exploit the visual modality.
  • These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. (source)
  • Below we can see an image captioning example gone hilariously wrong:

  • As we can see, the model above is unable to keep its understanding of the visual scene in sync with a grammatical sentence. Doing both together is essential for a strong multimodal model.
  • Multimodal translation models come in two flavors: example-based and generative:

  • Example-based models store a dictionary of translations, as we see above, and use it to map from one modality to another.
    • During inference, the model will fetch the closest match from the dictionary or create the translation by inferring from what the dictionary provides.
    • These models thus need to store a lot more information and are very slow to run.
  • Generative models produce translations without referring to the training data at the time of inference.
    • Generative models fall into 3 categories: grammar-based models, transformer models, and continuous generation models.
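The example-based flavor can be sketched as a nearest-neighbor lookup over stored translation pairs. The feature vectors, the captions, and the function name `nearest_translation` are hypothetical; real systems would use learned features and a far larger dictionary.

```python
def nearest_translation(query, dictionary):
    """Return the target text whose stored source features are closest
    to the query. `dictionary` is a list of (source_features, target_text)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Linear scan over every stored example.
    _, target = min(dictionary, key=lambda pair: squared_distance(query, pair[0]))
    return target

# Hypothetical dictionary mapping image features to captions.
translation_dictionary = [
    ([0.9, 0.1], "a dog on grass"),
    ([0.1, 0.9], "a car on a road"),
]

caption = nearest_translation([0.8, 0.3], translation_dictionary)
print(caption)
```

The scan touches every stored pair, so both memory and lookup time grow with the size of the dictionary, which is the cost noted above. A generative model would instead produce the caption from its parameters alone.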


Multimodal alignment

  • Multimodal alignment is finding relationships and correspondences between two or more modalities.

  • To align different modalities, a model has to measure similarities between them and has to deal with long-range dependencies.
  • Other difficulties involved in multimodal alignment include lack of annotated datasets, designing good similarity metrics between modalities, and the existence of multiple correct alignments.
  • There are two types of multimodal alignment:
    • Explicit – the goal is to find correspondences between modalities and to align data from different modalities of the same event.
      • E.g.: Aligning a speech signal to a transcript
    • Implicit – alignment is an intermediate step that helps when solving a different task (for example, “Attention” models). It’s a precursor to several downstream tasks like classification.
      • E.g.: Machine Translation
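For explicit alignment, one classic technique is dynamic time warping (DTW). The text above does not name it; it is used here only as a simple, well-known illustration. The sketch assumes each modality has been reduced to a sequence of scalar features and uses absolute difference as the similarity metric, both of which are simplifying assumptions.

```python
def dtw_align(seq_a, seq_b):
    """Dynamic time warping between two numeric sequences.
    Returns (total_cost, path) where path is a list of (i, j) index pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j],      # seq_b element is reused
                cost[i][j - 1],      # seq_a element is reused
                cost[i - 1][j - 1],  # both sequences advance
            )

    # Walk back from the end to recover which elements were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

# Hypothetical per-frame features: the second sequence lingers on the value 2.
total_cost, alignment = dtw_align([1, 2, 3], [1, 2, 2, 3])
print(total_cost, alignment)
```

Here element 1 of the first sequence is matched to two consecutive elements of the second, which is how DTW absorbs differences in speed between, say, audio frames and transcript units.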


Multimodal fusion

  • Multimodal fusion is probably one of the more important topics and challenges.
  • Fusion is the practice of joining information from two or more modalities to solve a classification or regression problem.
  • Using multiple modalities provides more robust predictions and allows us to capture complementary information.
  • Multimodal fusion models could still be used even if one of the modalities is missing.
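One way to see how fusion can tolerate a missing modality is late (decision-level) fusion, sketched below under the assumption that each modality has its own classifier producing class probabilities. The modality names and probability values are hypothetical, and averaging is only one of several possible combination rules.

```python
def late_fusion(predictions):
    """Average class probabilities over the modalities that are present.
    `predictions` maps a modality name to a probability list, or to None
    when that modality is missing."""
    available = [p for p in predictions.values() if p is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    n_classes = len(available[0])
    return [sum(p[c] for p in available) / len(available) for c in range(n_classes)]

# Hypothetical per-modality outputs for a two-class problem; text is missing.
fused = late_fusion({
    "audio": [0.6, 0.4],
    "video": [0.2, 0.8],
    "text": None,
})
predicted_class = fused.index(max(fused))
print(fused, predicted_class)
```

Because each modality is scored independently and combined only at the end, dropping one simply shrinks the average instead of breaking the model. Early fusion, which concatenates features before a single classifier, does not have this property without extra handling.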


Co-learning

  • Co-learning is the challenge of transferring learnings or knowledge from one modality to another.
  • When building a model in a modality for which resources are limited (lack of annotated data, noisy input, unreliable labels), transferring knowledge from a resource-rich modality is quite useful.