Content-Based Image Retrieval using Autoencoders

Published in Machine Learning


We have all used Google's Search by Image and been amazed by how well it performs. It is a form of content-based image retrieval (CBIR) system, which enables computers to find images similar to a query image within a large image dataset.

Broadly, there are two types of image retrieval systems -

  1. Search by text: Here, considerable manual effort is spent annotating images with text descriptors, which are then used to query for images.
  2. Search by content: Here, images are indexed by their visual content, such as color, texture, and shape.


One strategy would be to use a Convolutional Neural Network (CNN) trained for classification. For every query image, we get the predicted label and then query our database by that label. The major drawback here is that manually labeling a large dataset is very costly.
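To make the label-based strategy concrete, here is a minimal sketch of the database side: a mapping from class labels to image identifiers. The names (`add_image`, `query_by_label`) are hypothetical, and the classifier that produces the label is assumed to exist separately.

```python
from collections import defaultdict

# Hypothetical label index: maps each predicted class label to image ids.
label_index = defaultdict(list)

def add_image(image_id, label):
    """Index an image under the label a classifier assigned to it."""
    label_index[label].append(image_id)

def query_by_label(label):
    """Return every stored image that shares the query's predicted label."""
    return label_index.get(label, [])
```

Note that this only retrieves images with the exact same label; it cannot rank results by visual similarity, which is what the autoencoder approach below addresses.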

Bringing in Unsupervised Learning

A denoising autoencoder is a feed-forward neural network that learns to remove noise from images. We take a set of images, add random noise to each one, and train the autoencoder on the noisy versions. By feeding the autoencoder a noisy image and expecting it to return the original, we force it to learn the interesting features of the image. These features can then be used to retrieve similar-looking images.
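The noisy training inputs can be generated with a simple helper like the sketch below (a common recipe, assuming pixel values scaled to [0, 1]; the `noise_factor` value is an illustrative choice, not a prescribed one):

```python
import numpy as np

def add_noise(images, noise_factor=0.3, rng=None):
    """Add Gaussian noise and clip back to the valid [0, 1] pixel range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + noise_factor * rng.standard_normal(images.shape)
    return np.clip(noisy, 0.0, 1.0)
```

During training, `add_noise(x)` becomes the network's input while the clean `x` remains the target, so the loss rewards removing the noise.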

An autoencoder works by taking an input image and encoding it into a smaller representation. When decoding, it takes this smaller representation and tries to reconstruct the original image. Since the compression is lossy, the reconstruction is not perfect; the autoencoder's performance is measured by how well it reconstructs the original image.
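The encode/decode flow can be sketched with plain numpy. This is a shape-level illustration only: the weights below are random and untrained (in practice they would be learned by a framework such as Keras or PyTorch), but it shows the bottleneck and the reconstruction-error metric described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_enc):
    # Compress the flattened image into a smaller latent vector.
    return np.maximum(0.0, x @ w_enc)            # ReLU activation

def decode(z, w_dec):
    # Try to reconstruct the original image from the latent vector.
    return 1.0 / (1.0 + np.exp(-(z @ w_dec)))    # sigmoid keeps pixels in [0, 1]

input_dim, latent_dim = 784, 32                  # e.g. 28x28 images -> 32-d code
w_enc = rng.standard_normal((input_dim, latent_dim)) * 0.01  # untrained weights
w_dec = rng.standard_normal((latent_dim, input_dim)) * 0.01

x = rng.random((1, input_dim))                   # one toy "image"
z = encode(x, w_enc)                             # smaller representation
x_hat = decode(z, w_dec)                         # lossy reconstruction
loss = np.mean((x - x_hat) ** 2)                 # reconstruction error metric
```

Training would adjust `w_enc` and `w_dec` to minimize `loss`; the 784-to-32 bottleneck is what makes the compression lossy.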

Autoencoder Basic Flow

We can use this ability of autoencoders: given a noisy input image, the encoder produces a smaller representation that preserves only the most important features and discards the rest (like noise). When reconstructing, the decoder uses these important features to recover the original, denoised image. Training the autoencoder this way teaches it to extract important features, which can then be compared to find similar-looking images.
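Once the encoder is trained, retrieval reduces to comparing latent codes. A minimal sketch, assuming every database image has already been passed through the encoder (`retrieve_similar` and cosine similarity are one common choice here, not the only one):

```python
import numpy as np

def retrieve_similar(query_code, database_codes, k=5):
    """Rank database images by cosine similarity of their latent codes.

    query_code:     (d,) latent vector of the query image
    database_codes: (n, d) latent vectors of all indexed images
    Returns the indices of the k most similar images, best first.
    """
    q = query_code / np.linalg.norm(query_code)
    db = database_codes / np.linalg.norm(database_codes, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to each image
    return np.argsort(-sims)[:k]      # highest similarity first
```

Because the codes are small (e.g. 32 dimensions instead of 784 pixels), this comparison is cheap even over a large dataset, and no manual labels are needed.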

Hey, I'm not done yet. Check back later to see the complete story.