Computer Vision
2020
INTERMEDIATE · Featured

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. · 2020

Vision Transformer (ViT). Applies a pure Transformer to sequences of 16x16 image patches and, when pre-trained on large datasets, matches or exceeds state-of-the-art CNNs on ImageNet. The paper helped bring vision and language models onto a shared architecture.
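
To make the "image patches" idea concrete, here is a minimal NumPy sketch of the patch-splitting step only. It assumes a 224x224 RGB input and 16x16 patches; the function name patchify and the random placeholder image are illustrative, not taken from the paper's code. The learned linear projection, position embeddings, and the Transformer itself are left out.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Placeholder input: random values standing in for a real 224x224 RGB image.
image = np.random.default_rng(0).random((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Each row of tokens plays the role of one "word" in the sequence. In a full model, every row would then pass through a learned linear layer before entering the Transformer encoder.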

What you'll get

  • Outline: a plain-English breakdown of the paper's core idea, prerequisites, and the concepts you'll need to implement it.
  • Exercises: five to ten hands-on tasks, each with a concept card, a prompt, a starter code stub, and a collapsible reference solution.
  • Runnable notebook: a single .ipynb you can download and open in Jupyter or VS Code to work through every exercise.
  • Extensions: suggested follow-up experiments so you don't stop at a faithful reimplementation.