CLIP: Connecting Text and Images


CLIP is trained on a wide variety of images paired with the natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a wide variety of classification benchmarks without directly optimizing for any benchmark's performance.

  • Paper link: https://openai.com/blog/clip/
  • Model: Pre-training + zero-shot prediction.

Summary

The method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
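
This proxy task amounts to a symmetric contrastive objective over image-text similarities within a batch. Below is a minimal PyTorch sketch of that idea, with random tensors standing in for encoder outputs and a small batch of 8 standing in for the paper's 32,768 candidates; the function name, feature dimension, and temperature value are illustrative assumptions, not CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities (CLIP-style sketch)."""
    # L2-normalize embeddings so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits: [batch, batch]
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text; all other snippets are negatives
    targets = torch.arange(logits.size(0))

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Stand-in encoder outputs for a batch of 8 image-text pairs (hypothetical dim 512)
image_feats = torch.randn(8, 512)
text_feats = torch.randn(8, 512)
print(clip_contrastive_loss(image_feats, text_feats))
```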

Important: at test time CLIP acts as a zero-shot classifier. The candidate class names are embedded as natural-language prompts, and the prompt whose embedding best matches the image embedding is chosen as the prediction.
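
A minimal sketch of this zero-shot prediction step, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`); the image path, label set, and prompt template here are placeholders for illustration.

```python
import torch
import clip
from PIL import Image

# Load a pre-trained CLIP model (ViT-B/32 checkpoint)
model, preprocess = clip.load("ViT-B/32", device="cpu")

# Build a "classifier" purely from class names via a prompt template
labels = ["dog", "cat", "car"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in labels])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)

    # Cosine similarity between image and each prompt, softmaxed into class probabilities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

Because the classifier is built from text alone, swapping in a new label set requires no retraining: only the prompt list changes.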