[Paper Review] ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

ICLR 2021
Citations: 53,519
https://arxiv.org/abs/2010.11929

Transformer review

[Paper Review] Transformer: Attention is All You Need (kongshin00.tistory.com)
Paper: https://arxiv.org/abs/1706.03762 (NeurIPS 2017, Citations: 170,606)

Vision Transformer (ViT) overview

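The starting point for replacing CNN-based architectures with the Transformer in image tasks
Achieves SOTA on image classification
Pre-train on a large dataset → transfer learning on small image datasets
- Substantially lower pre-training computation cost than comparable state-of-the-art CNNs, with strong performance
- Constraint: the model has to be pre-trained on a large dataset
  - When trained only on a mid-sized dataset such as ImageNet, accuracy is lower than that of a ResNet of comparable size
ViT - a model that uses only the Transformer encoder and predicts the class through a final classification head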

Paper

ViT Method

The model design follows the NLP Transformer as closely as possible
Image patches are fed as input instead of tokens

Vision Transformer (ViT)

Standard Transformer - input: 1D sequence of token embeddings
To handle 2D images, the image is reshaped
- $x\in\mathbb{R}^{H\times W\times C}$ → $x_p\in\mathbb{R}^{N\times (P^2\cdot C)}$
- Image → sequence of flattened 2D patches
- $(P,P)$: resolution of each image patch
- $N=HW/P^2$: number of patches
Transformer - uses a constant latent vector size $D$ through all of its layers
- The patches also have to be mapped to $D$ dimensions ⇒ trainable linear projection
- ⇒ Output: patch embeddings $\in \mathbb{R}^{N\times D}$; once the class token is prepended, $z_0\in \mathbb{R}^{(N+1)\times D}$
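
The reshape and projection above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the sizes (224x224x3 image, $P=16$, $D=768$) are assumed for the example, and the random matrix `E` only stands in for the trainable linear projection.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape((H // P) * (W // P), P * P * C)

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 768   # assumed, illustrative sizes
image = rng.standard_normal((H, W, C))

x_p = patchify(image, P)                          # (N, P^2 * C)
E = rng.standard_normal((P * P * C, D)) * 0.02    # stand-in for the trainable projection
patch_embeddings = x_p @ E                        # (N, D)

print(x_p.shape, patch_embeddings.shape)          # (196, 768) (196, 768)
```

With these sizes $N = 224\cdot 224/16^2 = 196$, and each flattened patch has $16\cdot 16\cdot 3 = 768$ values before projection.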

[Class token]

Like BERT's [class] token, an extra learnable token is prepended as the first input
- $z_0^0=x_{class}\in \mathbb{R}^D$
- The Transformer encoder output at the class token, $z_L^0$, serves as the image representation vector $y$
  - $y=LN(z_L^0)$
  - Used for image classification
Classification head - input: $y$ → output: final image class
- Pre-training stage - MLP with one hidden layer
- Fine-tuning stage - single linear layer

[Position embeddings]

standard learnable 1D position embeddings ($E_{pos}$)
- Sinusoidal embeddings are not used
- Adds position information to the sequence
- 2D-aware position embeddings did not bring a significant gain
⇒ The resulting embedding sequence is the input to the encoder
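
As a rough sketch of how the encoder input is assembled from the class token and the position embeddings: the arrays below are random or zero stand-ins for parameters that would be learned, and the sizes ($N=196$, $D=768$) are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                                   # assumed, illustrative sizes

patch_embeddings = rng.standard_normal((N, D))    # stand-in for the projected patches
x_class = np.zeros((1, D))                        # learnable [class] token (zeros here)
E_pos = rng.standard_normal((N + 1, D)) * 0.02    # learnable 1D position embeddings

# Prepend the class token, then add one position embedding per sequence slot.
z0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos

print(z0.shape)                                   # (197, 768)
```

Row 0 of `z0` corresponds to $z_0^0$, the slot whose encoder output is later read out as the image representation.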

[Transformer encoder]

MSA: Multiheaded self-attention
LN: Layer Normalization
Each encoder layer consists of an MSA block and an MLP block
LN is applied before each block, and a residual connection is added after each block

MSA block
- Self-attention (SA)
  - input sequence: $z\in \mathbb{R}^{N\times D}$
  - $SA(z)\in \mathbb{R}^{N\times D_h}$
- Multihead self-attention (MSA)
  - $k$ heads
  - $D_h$ is set to $D/k$
  - $MSA(z)\in \mathbb{R}^{N\times D}$
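
The shapes above can be checked with a small NumPy sketch of multihead self-attention. The names `U_qkv` and `U_msa` follow the paper's notation, but packing all heads into one fused $D\times 3D$ projection is an assumption made here for brevity, and the toy sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t, k):
    """(N, D) -> (k, N, D_h) with D_h = D / k."""
    N, D = t.shape
    return t.reshape(N, k, D // k).transpose(1, 0, 2)

def msa(z, U_qkv, U_msa, k):
    """Multihead self-attention over z of shape (N, D) with k heads."""
    N, D = z.shape
    D_h = D // k
    q, key, v = np.split(z @ U_qkv, 3, axis=-1)              # each (N, D)
    q, key, v = split_heads(q, k), split_heads(key, k), split_heads(v, k)
    A = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(D_h))   # (k, N, N) attention weights
    heads = A @ v                                            # (k, N, D_h): one SA(z) per head
    concat = heads.transpose(1, 0, 2).reshape(N, k * D_h)    # (N, D)
    return concat @ U_msa, A

rng = np.random.default_rng(0)
N, D, k = 5, 8, 2                                            # toy sizes
z = rng.standard_normal((N, D))
U_qkv = rng.standard_normal((D, 3 * D)) * 0.1
U_msa = rng.standard_normal((D, D)) * 0.1

out, A = msa(z, U_qkv, U_msa, k)
print(out.shape, A.shape)                                    # (5, 8) (2, 5, 5)
```

Because $D_h = D/k$, the concatenated heads have width $k\cdot D_h = D$, so the parameter count and compute stay roughly constant as $k$ changes.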

MLP block
- two layers with a GELU activation function
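
A minimal sketch of the MLP block wrapped in the LN and residual pattern described above. Assumptions: the layer norm omits its learnable scale and shift, `gelu` uses the common tanh approximation rather than the exact erf form, and the helper names and toy sizes are made up for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (learnable scale/shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    """Two linear layers with a GELU in between: D -> D_mlp -> D."""
    return gelu(x @ W1 + b1) @ W2 + b2

def pre_ln_residual(x, sublayer):
    """Apply LN before the sublayer and add the residual connection after it."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
N, D, D_mlp = 5, 8, 32                      # toy sizes
z = rng.standard_normal((N, D))
W1, b1 = rng.standard_normal((D, D_mlp)) * 0.02, np.zeros(D_mlp)
W2, b2 = rng.standard_normal((D_mlp, D)) * 0.02, np.zeros(D)

out = pre_ln_residual(z, lambda h: mlp(h, W1, b1, W2, b2))
print(out.shape)                            # (5, 8)
```

The same `pre_ln_residual` wrapper would apply to the MSA block, so one encoder layer is two such wrapped sublayers in sequence.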

ViT equations
- Classification head$(y)$ = final class
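
For reference, the forward pass described in the sections above can be written out in the notation of this post, following the paper's equations. Here $E$ denotes the trainable patch projection and $L$ the number of encoder layers.

$$
\begin{aligned}
z_0 &= [x_{class};\ x_p^1E;\ x_p^2E;\ \cdots;\ x_p^NE] + E_{pos}, & E\in\mathbb{R}^{(P^2\cdot C)\times D},\ E_{pos}\in\mathbb{R}^{(N+1)\times D} \\
z'_\ell &= MSA(LN(z_{\ell-1})) + z_{\ell-1}, & \ell = 1,\dots,L \\
z_\ell &= MLP(LN(z'_\ell)) + z'_\ell, & \ell = 1,\dots,L \\
y &= LN(z_L^0)
\end{aligned}
$$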

Inductive bias

Image-specific inductive bias: ViT << CNN
CNN
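- Locality and the 2D neighborhood structure are built into every layer
- Translation equivariance - the same features are extracted even when the object's position changes
ViT
- Learns global features through self-attention, but has no structure that encodes 2D neighborhood relations
- The 2D structure is used only when the image is split into patches; beyond that, spatial relations between patches have to be learned
- ⇒ Learnable embeddings are used instead of sinusoidal (fixed) embeddings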

Hybrid Architecture

Brings in part of the CNN's inductive bias ⇒ CNN + ViT
Input - CNN feature maps are used instead of raw image patches
feature map → flatten → Linear projection
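
The hybrid input path can be sketched as follows. The feature-map shape (14x14x1024) is a hypothetical value chosen for illustration, and the random `E` again stands in for the trainable projection; each spatial location of the feature map is treated as one token.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c, D = 14, 14, 1024, 768                    # hypothetical feature-map shape and width

feature_map = rng.standard_normal((h, w, c))      # stand-in for a CNN backbone output
tokens = feature_map.reshape(h * w, c)            # flatten the spatial dimensions
E = rng.standard_normal((c, D)) * 0.02            # stand-in for the trainable projection
embeddings = tokens @ E                           # (h*w, D), used like patch embeddings

print(tokens.shape, embeddings.shape)             # (196, 1024) (196, 768)
```

From here on the class token and position embeddings are added exactly as for raw image patches.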

Fine-Tuning and Higher Resolution

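ViT - pre-training on large datasets → fine-tuning on small datasets
- After pre-training, the prediction head is replaced with a new head sized for the new number of classes $K$
  - zero-initialized $D\times K$ feedforward layer
Fine-tuning at a higher resolution than the one used in pre-training often improves performance
- Fixing the train-test resolution discrepancy. In NeurIPS, 2019.
- Big Transfer (BiT): General visual representation learning. In ECCV, 2020.
- The patch size stays the same when the resolution grows ⇒ the sequence length increases
- Vision Transformer - can handle arbitrary sequence lengths (up to memory constraints)
  - The Transformer weights do not depend on $N$
- Pre-trained position embedding $\in \mathbb{R}^{(N+1)\times D}$ → may no longer be meaningful at the new sequence length
  - 2D interpolation of the pre-trained position embeddings, according to their location in the original image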
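
A minimal NumPy sketch of the 2D interpolation step. The paper does not pin down the kernel in this summary, so bilinear interpolation with aligned corners is an assumption, as is leaving the class-token embedding untouched; `resize_bilinear` and `interpolate_pos_embed` are hypothetical helper names.

```python
import numpy as np

def resize_bilinear(grid, new_h, new_w):
    """Bilinearly resize an (h, w, D) grid to (new_h, new_w, D), aligning corners."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]             # vertical blend weights
    wx = (xs - x0)[None, :, None]             # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def interpolate_pos_embed(E_pos, old_grid, new_grid):
    """Resize (1 + old_grid^2, D) position embeddings to (1 + new_grid^2, D)."""
    D = E_pos.shape[1]
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]           # class slot is kept as-is (assumption)
    grid = patch_pos.reshape(old_grid, old_grid, D)     # restore the 2D patch layout
    resized = resize_bilinear(grid, new_grid, new_grid)
    return np.concatenate([cls_pos, resized.reshape(new_grid * new_grid, D)], axis=0)

rng = np.random.default_rng(0)
P, D = 16, 8                                  # toy embedding width
old_grid, new_grid = 224 // P, 384 // P       # 14x14 patches -> 24x24 patches
E_pos = rng.standard_normal((old_grid**2 + 1, D))

E_pos_new = interpolate_pos_embed(E_pos, old_grid, new_grid)
print(E_pos.shape, E_pos_new.shape)           # (197, 8) (577, 8)
```

Going from 224x224 to 384x384 with $P=16$ raises the sequence length from $196+1$ to $576+1$, while every other Transformer weight is reused unchanged.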

ViT vs CNN

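Inductive bias

CNN - thanks to its inductive biases (locality, translation equivariance, etc.), it can be trained even on small datasets
ViT - matches or outperforms state-of-the-art CNNs once it is pre-trained on a large-scale dataset
- On small datasets it struggles to learn local patterns effectively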

Self-Attention

CNN - extracts features progressively from small regions using convolution filters
ViT - uses global information from the very first layer through self-attention
⇒ Can learn relations between patches that are far apart
