CLIP: Connecting Text and Images
CLIP is trained on a wide variety of images paired with the natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a wide variety of classification tasks without directly optimizing for any benchmark's performance.
- Paper link: https://openai.com/blog/clip/
- Model: contrastive pre-training + zero-shot prediction.
Summary
The method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
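This proxy task amounts to a symmetric contrastive objective over a batch of (image, text) pairs. Below is a minimal PyTorch sketch, assuming precomputed image and text embeddings; the function name and fixed temperature value are illustrative (CLIP learns the temperature during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: [batch, dim] embeddings from the two encoders
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities between every image and every text snippet
    logits = image_features @ text_features.t() / temperature  # [batch, batch]

    # The text actually paired with image i sits at index i (the diagonal)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Classify the right text for each image, and the right image for each text
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```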
Important: the result is a zero-shot classifier. A benchmark's class names are converted into text prompts (e.g., "a photo of a dog"), and an image is assigned to the class whose text embedding it matches best, with no benchmark-specific training.
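A hedged sketch of this zero-shot prediction step using OpenAI's open-source clip package follows; the label set, prompt template, and image path are hypothetical.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical label set: class names become natural-language prompts
class_names = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Illustrative image path
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Rank the class prompts by similarity to the image
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```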