• In similarity search, a query record is compared against a stored database of records (documents, images, etc.). The main aim is to retrieve the set of database records most similar to the query record.
  • Similarity search works with semantic representations of data and finds similar items fast. When we represent images or pieces of text as vector embeddings, their semantic similarity is reflected by how close their vectors are in the vector space. Hence, what we want to look at is the distance between the objects' vectors.
  • The following article has been written by Arpita Vats.

Applications

  • Image retrieval, document retrieval, recommender systems, image recognition, similar movies/items etc.

FAISS

  • Facebook AI Similarity Search (FAISS) is a library developed by Facebook AI that enables efficient similarity search.
  • Given a set of vectors, we can index them using FAISS; then, using another vector (the query vector), we search for the most similar vectors within the index. FAISS not only lets us build an index and search it, but also dramatically speeds up search times, and it provides several similarity search methods that span a broad spectrum of usage trade-offs (a minimal usage sketch appears at the end of this section's bullets).
  • FAISS is optimized for memory usage and speed.
  • FAISS offers a state-of-the-art GPU implementation for the most relevant indexing methods.
  • To know more about the different and efficient indexing techniques used by FAISS, please refer to Efficient Indexing of Billion-Scale Datasets of Deep Descriptors by Babenko and Lempitsky (2016).
  • The figure below shows the internal mechanism of FAISS.

  • Once the vectors are extracted by learning machinery (from images, videos, text documents, and elsewhere), they’re ready to feed into the similarity search library.
  • Similarity search can be made orders of magnitude faster if we’re willing to trade some accuracy; that is, deviate a bit from the reference result. For example, it may not matter much if the first and second results of an image similarity search are swapped, since they’re probably both correct results for a given query. Accelerating the search involves some pre-processing of the data set, an operation that we call indexing.
  • This brings us to the three metrics of interest:
    • Speed: How long does it take to find the 10 (or some other number) most similar vectors to the query? Hopefully less time than the brute-force algorithm needs; otherwise, what’s the point of indexing?
    • Memory usage: How much RAM does the method require? More or less than the original vectors? FAISS supports searching only from RAM, as disk databases are orders of magnitude slower. Yes, even with SSDs.
    • Accuracy: How well does the returned list of results match the brute-force search results? Accuracy can be evaluated by counting the number of queries for which the true nearest neighbor is returned first in the result list (a measure called 1-recall@1), or by measuring the average fraction of 10 nearest neighbors that are returned in the 10 first results (the “10-intersection” measure).
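  • To make the indexing and accuracy discussion above concrete, the snippet below is a minimal sketch using the faiss Python package (assuming faiss-cpu is installed); the data is random and the parameters (nlist, nprobe, k) are purely illustrative. It builds an exact (brute-force) index as the reference and an approximate IVF index, then measures 1-recall@1 of the latter against the former.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                           # vector dimensionality
rng = np.random.default_rng(0)
xb = rng.random((100_000, d), dtype=np.float32)   # database vectors
xq = rng.random((1_000, d), dtype=np.float32)     # query vectors

# Exact (brute-force) baseline: IndexFlatL2 scans every database vector.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 1)                        # ground-truth nearest neighbors

# Approximate index: IVF partitions the space into nlist cells and only
# probes nprobe of them per query, trading accuracy for speed.
nlist = 1024
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                                     # learn the cell centroids
ivf.add(xb)
ivf.nprobe = 8                                    # cells visited per query
_, approx = ivf.search(xq, 1)

# 1-recall@1: fraction of queries whose true nearest neighbor is ranked first.
print("1-recall@1 =", (approx[:, 0] == gt[:, 0]).mean())
```

  • Raising nprobe (or using a flat index) pushes accuracy back toward the brute-force reference at the cost of query speed, which is exactly the trade-off captured by the three metrics above.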

Pros

  • FAISS is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
  • FAISS optimizes through various indexing techniques.

Cons

  • The large speedups come from trading away some accuracy: approximate indexes can deviate a bit from the exact (brute-force) reference results.

ScaNN

  • ScaNN (Scalable Nearest Neighbors), by Google, is a method for efficient vector similarity search at scale.
  • Modern ML models can transform inputs such as text and images into embeddings: high-dimensional vectors trained such that more similar inputs cluster closer together.
  • For a given query, we can therefore compute its embedding and find the database items whose embeddings are closest to the query's. A computational challenge remains: for a given query embedding, how does one quickly find the nearest dataset embeddings? The set of embeddings is often too large for exhaustive search, and its high dimensionality makes pruning difficult.
  • Embedding-based search is a technique that is effective at answering queries that rely on semantic understanding rather than simple indexable properties.
  • In this technique, machine learning models are trained to map the queries and database items to a common vector embedding space, such that the distance between embeddings carries semantic meaning, i.e., similar items are closer together.
  • To answer a query with this approach, the system must first map the query to the embedding space. It then must find, among all database embeddings, the ones closest to the query; this is the nearest neighbor search problem.
  • One of the most common ways to define the query-database embedding similarity is by their inner product; this type of nearest neighbor search is known as maximum inner-product search (MIPS). A brute-force MIPS baseline is sketched after this list.
  • The two-tower neural network model, illustrated in the figure below (source: Announcing ScaNN: Efficient Vector Similarity Search), is a specific type of embedding-based search where queries and database items are mapped to the embedding space by two respective neural networks. In this example, the model responds to natural-language queries for a hypothetical literary database.
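  • To ground what MIPS is computing, here is a brute-force NumPy sketch on toy data (illustrative only): score every database embedding by its inner product with the query and keep the top-k. The approximate methods discussed on this page exist precisely to avoid this full O(n·d) scan when n reaches millions or billions.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64)).astype(np.float32)   # database embeddings
q = rng.normal(size=(64,)).astype(np.float32)           # query embedding

# Exhaustive MIPS: inner product of the query with every database embedding,
# then take the indices of the k largest scores.
scores = db @ q
k = 10
topk = np.argsort(-scores)[:k]
print(topk, scores[topk])
```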

Anisotropic Vector Quantization in ScaNN

  • Anisotropic vector quantization allows ScaNN to better estimate the inner products that are likely to be in the top-k MIPS results and therefore achieve higher accuracy (a sketch of the underlying quantization objective follows this list).
  • On the glove-100-angular benchmark from ann-benchmarks.com, ScaNN outperformed eleven other carefully tuned vector similarity search libraries, handling roughly twice as many queries per second for a given accuracy as the next-fastest library (NGT-onng).
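  • As a rough sketch of the idea (following the anisotropic vector quantization paper behind ScaNN, Guo et al., 2020): when quantizing a datapoint \(x\) to \(\tilde{x}\), the residual \(x - \tilde{x}\) is split into a component parallel to \(x\) and a component orthogonal to it, and the parallel component is penalized more heavily, since for the high-inner-product pairs that determine the top-k MIPS results the query direction is close to the datapoint's direction:

\[
\ell(x, \tilde{x}) \;=\; h_\parallel \, \lVert r_\parallel(x, \tilde{x}) \rVert^2 \;+\; h_\perp \, \lVert r_\perp(x, \tilde{x}) \rVert^2, \qquad h_\parallel \ge h_\perp,
\]

    where \(r_\parallel\) and \(r_\perp\) are the parallel and orthogonal components of the residual; the precise weights come from the paper's score-aware quantization loss.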

  • ScaNN is open-source software and you can try it yourself on GitHub. The library can be installed directly via pip and has interfaces for both TensorFlow and NumPy inputs.
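  • The snippet below is a minimal usage sketch following the example in the ScaNN README (assuming the scann package is installed); the data is random and the tree/score_ah/reorder parameters are illustrative and should be tuned per dataset.

```python
import numpy as np
import scann  # pip install scann

rng = np.random.default_rng(0)
dataset = rng.random((100_000, 128), dtype=np.float32)   # database embeddings
queries = rng.random((100, 128), dtype=np.float32)       # query embeddings

# Build a searcher that partitions the dataset into a tree of leaves,
# scores candidates with asymmetric hashing (score_ah) using the
# anisotropic loss, and reranks the top candidates exactly (reorder).
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=25_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search_batched(queries)
print(neighbors.shape)   # (100, 10): top-10 neighbors per query
```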

Pros

  • Handles roughly twice as many queries per second for a given accuracy as the next-fastest library.
  • Anisotropic vector quantization allows ScaNN to better estimate inner products that are likely to be in the top-k MIPS results and therefore achieve higher accuracy.

Cons

  • Database sizes can easily be in the millions or even billions, so MIPS is often the computational bottleneck to inference speed and exhaustive search is impractical. This necessitates approximate MIPS algorithms that trade some accuracy for a significant speedup over brute-force search.

ANNOY

  • Annoy (Approximate Nearest Neighbors Oh Yeah), from Spotify, uses a forest of trees to power Spotify's music recommendations.
  • In order to construct the index, we create a forest (i.e., a bunch of trees). Each tree is constructed by picking two points at random and splitting the space into two by the hyperplane equidistant from them. We keep recursively splitting into subspaces until at most K items are left in each node (a toy sketch of this recursive split appears after this list).
    • It has the ability to use static files as indexes. In particular, this means you can share an index across processes.
    • Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly.
    • It tries to minimize memory footprint so the indexes are quite small.
  • To understand the data structures and algorithms that Annoy uses to do approximate nearest neighbor queries, please refer to Erik Bernhardsson's Nearest neighbors and vector models – part 2 – algorithms and data structures.
  • There are just two main parameters needed to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k (both appear in the usage sketch at the end of this section).
    • n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
    • search_k is provided at runtime and affects the search performance. A larger value will give more accurate results, but will take longer to return.
    • If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa.
    • It’s recommended to set n_trees as large as possible given the amount of memory you can afford, and it’s recommended to set search_k as large as possible given the time constraints you have for the queries.
  • From Erik Bernhardsson’s Nearest neighbors and vector models – part 2 – algorithms and data structures, the diagram below shows the binary tree (left) when the subspaces have been recursively split such that there are at most \(K\) items left in each node (right).
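  • The snippet below is a toy sketch of the recursive random-hyperplane split described above (simplified for illustration; it is not Annoy's actual implementation, which operates on a memory-mapped array of vectors).

```python
import random

def build_tree(points, K):
    # Leaf: stop splitting once at most K items remain in the node.
    if len(points) <= K:
        return {"leaf": points}
    # Pick two points at random; split by the hyperplane equidistant from
    # them (normal to the line joining them, through their midpoint).
    a, b = random.sample(points, 2)
    normal = [ai - bi for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))

    left, right = [], []
    for p in points:
        side = sum(n * pi for n, pi in zip(normal, p)) - offset
        (left if side <= 0 else right).append(p)
    if not left or not right:   # degenerate split; treat as a leaf
        return {"leaf": points}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, K), "right": build_tree(right, K)}

# A forest is simply many such trees built independently over the same points.
points = [tuple(random.gauss(0, 1) for _ in range(2)) for _ in range(1_000)]
forest = [build_tree(points, K=16) for _ in range(10)]
```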

  • In order to search the constructed index, the forest is traversed in order to obtain a set of candidate points from which the closest to the query point is returned.
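  • The snippet below is a minimal usage sketch with the annoy Python package (assuming it is installed); the dimensionality, n_trees, and search_k values are illustrative and should be tuned as recommended above.

```python
import random
from annoy import AnnoyIndex  # pip install annoy

f = 40                              # dimensionality of the item vectors
index = AnnoyIndex(f, "angular")    # "angular" is a cosine-style metric
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(50)                     # n_trees = 50: more trees -> bigger index, higher accuracy
index.save("items.ann")             # static file; can be shared across processes

# Loading is decoupled from building: any process can mmap the same file.
loaded = AnnoyIndex(f, "angular")
loaded.load("items.ann")

# 10 approximate nearest neighbors of item 0; larger search_k -> more
# accurate results but slower queries.
print(loaded.get_nns_by_item(0, 10, search_k=5_000))
```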

Pros

  • Decouples index creation from index loading, so you can pass around indexes as files and map them into memory quickly.
  • We can tune the parameters to change the accuracy/speed tradeoff.
  • It has the ability to use static files as indexes, which means you can share indexes across processes.

Cons

  • The exact nearest neighbor might lie across a splitting boundary in one of the neighboring cells and thus be missed.
  • No support for GPU processing.
  • No support for batch processing, so in order to increase throughput “further hacking is required”.
  • Can't incrementally add points to a built index (annoy2 tries to fix this).

Citation

If you found our work useful, please cite it as:

@article{VatsChadha2020DistilledSimilaritySearch,
  title   = {Similarity Search},
  author  = {Vats, Arpita and Chadha, Aman},
  journal = {Distilled AI},
  year    = {2022},
  note    = {\url{https://aman.ai}}
}