[Paper Review] NCSN: Generative Modeling by Estimating Gradients of the Data Distribution

[Paper]

Generative Modeling by Estimating Gradients of the Data Distribution
NeurIPS 2019
Citations: 4,139
https://arxiv.org/abs/1907.05600

[references]

https://yang-song.net/blog/2021/score/
- Recommended: study this blog first, then read the NCSN and SDE papers


Introduction

Generative modeling techniques ⇒ grouped by how they represent the probability distribution
- Likelihood-based models
  - Directly learn the probability density of the distribution via maximum likelihood
  - Autoregressive models, normalizing flow models, energy-based models (EBMs), VAEs
  - [Cons]
    - Computing the likelihood requires strong restrictions on the model architecture
      - to guarantee a tractable normalizing constant
    - or reliance on surrogate objectives to approximate maximum likelihood training
- Implicit generative models
  - The probability distribution is represented implicitly by the sampling process
  - GANs
  - [Cons]
    - Require adversarial training ⇒ unstable, can cause mode collapse

⇒ A different approach is proposed to avoid these limitations

(Stein) Score function - gradient of the log probability density function
- $\nabla_x \log p(x)$
Score-based models
- No tractable normalizing constant required
- Learned directly with score matching
- Modeling & estimating scores is useful for solving inverse problems
  - e.g. inpainting, colorization, compressive sensing, image reconstruction

The score function, score-based models, and score matching

Goal of generative modeling: fit a model to the data distribution
- this requires a way to represent a probability distribution

[Likelihood-based models]

Likelihood - directly model the pdf or pmf
$p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}$
$p_\theta$ - pdf & $f_\theta$ - unnormalized probabilistic model or energy-based model

$Z_\theta$ > 0, dependent on $\theta$: normalizing constant
$p_\theta$ is trained by maximizing the log-likelihood of the data
- but computing $p_\theta$ requires evaluating $Z_\theta$, which is generally an intractable quantity
- two approaches are used to make training feasible
1. restrict model architectures
   - e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models
2. approximate the normalizing constant ⇒ computationally expensive
   - e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence

[Score-based models] - $s_\theta(x)$

Use the score function instead of the density function
- sidesteps the difficulty of intractable normalizing constants
- score function: $\nabla_x \log p(x)$

Train so that $s_\theta(x)\approx\nabla_x \log p(x)$
- $s_\theta: R^D \to R^D$, parameterized by a neural network
- can be parameterized without a normalizing constant
- $Z_\theta$ does not depend on $x$ ⇒ $\nabla_x \log Z_\theta = 0$, so $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x f_\theta(x)$
- $s_\theta$ is independent of $Z_\theta$ !!!
- so no special architectures are needed to keep the normalizing constant tractable
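
As a quick numerical check of the point that $Z_\theta$ drops out of the score, here is a minimal sketch. The energy function `energy` and the two `log_Z` values are made up for illustration (they are not from the paper or blog); the score is taken by finite differences so nothing about the normalizer is hidden.

```python
import numpy as np

def energy(x):
    # Hypothetical unnormalized model: f(x) = x^4 / 4, so p(x) is proportional to exp(-f(x)).
    return x ** 4 / 4.0

def log_density(x, log_Z):
    # log p(x) = -f(x) - log Z; log_Z is a constant with respect to x.
    return -energy(x) - log_Z

def numerical_score(x, log_Z, h=1e-5):
    # Central finite difference of log p(x) with respect to x.
    return (log_density(x + h, log_Z) - log_density(x - h, log_Z)) / (2.0 * h)

x = np.linspace(-2.0, 2.0, 9)
score_a = numerical_score(x, log_Z=0.0)
score_b = numerical_score(x, log_Z=123.4)   # a very different (made-up) normalizer
analytic = -x ** 3                           # -d/dx f(x)
```

Both `score_a` and `score_b` agree with `analytic` up to finite-difference error, whatever value is plugged in for `log_Z`.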

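Train $s_\theta(x)$ by minimizing the Fisher divergence
- Fisher divergence: measures the discrepancy between two distributions through the difference of their score functions
- $E_{p(x)}[\|\nabla_x \log p(x) - s_\theta(x)\|_2^2]$
- ⇒ infeasible to compute directly - the unknown data score $\nabla_x \log p(x)$ is not accessible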

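Score matching - a way to minimize the Fisher divergence without the ground-truth data score
- designed for learning non-normalized statistical models from i.i.d. samples of an unknown data distribution
- $s_\theta(x)$ can be trained directly to match $\nabla_x \log p(x)$ without estimating $p(x)$
- minimizing the first objective, $\frac{1}{2}E_{p(x)}[\|s_\theta(x) - \nabla_x \log p(x)\|_2^2]$, is equivalent (up to a constant) to minimizing the second, $E_{p(x)}[tr(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|_2^2]$
- the trace computation makes this impractical to scale to high-dimensional data
  - ⇒ use denoising score matching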
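
To make the "no ground-truth score needed" point concrete, here is a minimal 1-D sketch under assumed toy settings: data drawn from $N(0, 2^2)$ and a linear score model $s(x) = a x$ (both are illustrative choices, not from the paper). In 1-D the trace term is just $a$, so the objective is quadratic in $a$ and can be minimized in closed form using samples only.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std = 2.0                                   # assumed toy data: x ~ N(0, 2^2)
x = rng.normal(0.0, data_std, size=100_000)

def score_matching_objective(a, x):
    # For s(x) = a * x in 1-D: tr(ds/dx) = a and ||s(x)||^2 = (a * x)^2.
    return np.mean(a + 0.5 * (a * x) ** 2)

# The objective is quadratic in a, so its minimizer is available in closed form.
a_hat = -1.0 / np.mean(x ** 2)
a_true = -1.0 / data_std ** 2                    # slope of the true score -x / std^2
```

`a_hat` lands close to `a_true` even though the true score was never used in the objective. In D dimensions the trace term needs the full Jacobian of $s_\theta$, which is what makes the exact objective expensive.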

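Denoising score matching
- uses a data distribution perturbed with noise ⇒ $q_\sigma(\tilde x|x)$, $\epsilon \sim N(0, \sigma^2 I)$
- $q_\sigma(\tilde x|x) = N(\tilde x; x, \sigma^2 I)$, $\tilde x=x+\sigma z$, $z\sim N(0,I)$
- the score of the perturbation kernel is available in closed form: $\nabla_{\tilde x} \log q_\sigma(\tilde x|x) = -\frac{\tilde x - x}{\sigma^2}$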
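
A minimal sketch of denoising score matching on assumed toy data (1-D Gaussian data, a linear score model $s(\tilde x) = a \tilde x$, and made-up values for `data_std` and `sigma`). The model is regressed onto the kernel score $-(\tilde x - x)/\sigma^2$, and the fitted slope should approach the score slope of the perturbed distribution $N(0, \text{data\_std}^2 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std, sigma = 1.0, 0.5                       # assumed toy values
x = rng.normal(0.0, data_std, size=200_000)      # clean samples x ~ p(x)
z = rng.normal(size=x.shape)
x_tilde = x + sigma * z                          # perturbed samples

# Regression target: score of the Gaussian kernel q_sigma(x_tilde | x).
target = -(x_tilde - x) / sigma ** 2

def dsm_loss(a):
    # Denoising score matching loss for the linear model s(x_tilde) = a * x_tilde.
    return 0.5 * np.mean((a * x_tilde - target) ** 2)

# Least-squares minimizer of dsm_loss over a.
a_hat = np.sum(x_tilde * target) / np.sum(x_tilde ** 2)
# Score slope of the perturbed marginal N(0, data_std^2 + sigma^2).
a_true = -1.0 / (data_std ** 2 + sigma ** 2)
```

Note that the model learns the score of the perturbed distribution, not of the clean one; the two are close only when `sigma` is small.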

[Summary]

No additional assumptions are placed on the form of $s_\theta(x)$, so there is a great deal of modeling flexibility
- Requirement: input & output must have the same dimensionality ⇒ easy to satisfy
Score-based models represent a distribution through its score function and can be trained with score matching using free-form architectures

Langevin dynamics

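Sample from the trained $s_\theta(x)\approx\nabla_x \log p(x)$ with Langevin dynamics, an iterative procedure
An MCMC procedure that samples from $p(x)$ using only $\nabla_x \log p(x)$

Initialize the chain from an arbitrary prior distribution, $x_0\sim\pi(x)$, then iterate
$x_{i+1} \leftarrow x_i + \epsilon \nabla_x \log p(x_i) + \sqrt{2\epsilon} z_i$, $i=0,1,...,K$
$z_i\sim N(0,I)$
When $\epsilon \to 0$ & $K \to \infty$, $x_K$ converges to a sample from $p(x)$ under some regularity conditions
- In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large

⇒ since $p(x)$ can be accessed through $\nabla_x \log p(x)$ alone, sampling is possible with $s_\theta(x)\approx \nabla_x \log p(x)$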
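
A minimal sketch of the Langevin update, using the form $x \leftarrow x + \epsilon \nabla_x \log p(x) + \sqrt{2\epsilon} z$. To keep it checkable, the target is an assumed 1-D Gaussian whose score is known analytically, standing in for a trained $s_\theta$; the function name `langevin_sample`, the step size, and the number of steps are illustrative choices.

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * z."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.normal(size=x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * z
    return x

# Assumed toy target: p(x) = N(mean, std^2), whose score is -(x - mean) / std^2.
mean, std = 3.0, 0.5
score_fn = lambda x: -(x - mean) / std ** 2

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=5000)           # arbitrary prior pi(x)
samples = langevin_sample(score_fn, x0, step_size=1e-3, n_steps=2000, rng=rng)
```

Although the chains start far from the target, `samples.mean()` and `samples.std()` end up near 3.0 and 0.5. A larger `step_size` mixes faster but adds discretization bias.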

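Markov chain ⇒ even without i.i.d. samples, analogues of the C.L.T. and L.L.N. hold
- Ergodic Theorem (Ergodic LLN)
- Markov Chain Central Limit Theorem (Markov Chain CLT)
- In the update formula the score is evaluated at the current state, i.e. $p(x_i)$ rather than $p(x)$! ⇒ Markov chain
- these theorems are worth knowing for non-i.i.d. settings as well
- Langevin dynamics - one Markov chain method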

Naive score-based generative modeling and its pitfalls

Key challenge: $s_\theta(x)$ is inaccurate in low density regions, where few data points are available

Score matching - minimizes the Fisher divergence
- the $l_2$ difference is weighted by $p(x)$ ⇒ largely ignored in low density regions where $p(x)$ is small
- ⇒ leads to subpar results

When the data lives in a high-dimensional space, the initial sample for Langevin dynamics is very likely to fall in a low density region
With an inaccurate score-based model, Langevin dynamics, which relies on $s_\theta(x)$, pushes samples in the wrong direction, so sampling can be derailed from the very first steps
- ⇒ hard to generate high quality samples

Score-based generative modeling with multiple noise perturbations

Bypass the difficulty in regions of low data density ⇒ perturb data points with noise (during training)
If the noise magnitude is large enough ⇒ it populates low data density regions and improves the accuracy of the estimated scores

How to choose an appropriate noise scale
- Larger noise - covers more low density regions, giving better score estimation
  - but over-corrupts the data and departs significantly from the original distribution
- Smaller noise - less corruption of the original distribution
  - but cannot cover the low density regions

⇒ use multiple scales of noise perturbations simultaneously

[noise-perturbed distribution]

Perturb the data with isotropic Gaussian noise ⇒ adding noise acts like smoothing the distribution
Use L noise levels with increasing standard deviations $\sigma_1<...<\sigma_L$
- ⇒ $N(0,\sigma^2_iI), i=1,2,...,L$
$p_{\sigma_i}(x)$ - noise-perturbed distribution
$p_{\sigma_i}(x) = \int p(y) N(x; y, \sigma_i^2 I) dy$, where $y \sim p(y)$ is a clean data point
$x_{\sigma_i} = x+\sigma_iz$, $z\sim N(0,I)$ ⇒ $p(x|y) = N(x; y, \sigma_i^2I)$
- $p_{\sigma_i}(x)$ is easy to sample from: draw $x\sim p(x)$ and add the noise
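
Sampling from $p_{\sigma_i}(x)$ can be sketched as below. The 2-D uniform "data" and the three noise levels are made up for illustration; the helper name `perturb` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100_000, 2))    # assumed toy "data" in 2-D
sigmas = np.array([0.1, 0.5, 1.0])               # hypothetical noise levels

def perturb(x, sigma, rng):
    # Draw x_sigma = x + sigma * z with z ~ N(0, I), i.e. a sample from p_sigma.
    return x + sigma * rng.normal(size=x.shape)

perturbed = [perturb(x, s, rng) for s in sigmas]
# Per-coordinate variance grows as Var[x] + sigma^2.
variances = [p.var(axis=0).mean() for p in perturbed]
```

Because the noise is independent of the data, each entry of `variances` is close to $1/3 + \sigma_i^2$ (the variance of Uniform(-1, 1) is 1/3), which is the "spreading out" of the distribution as $\sigma_i$ grows.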

[Estimate the score ft of each noise-perturbed distribution]

Estimate $\nabla_x \log p_{\sigma_i}(x)$ ⇒ train a Noise Conditional Score-Based Model $s_\theta(x,i)$
- = Noise Conditional Score Network (NCSN)
- $s_\theta(x,i) \approx \nabla_x \log p_{\sigma_i}(x), i=1,2,...,L$

As $\sigma_i$ grows, the data distribution spreads out ⇒ an accurate score function can be learned even in low density regions

[Training objective for $s_\theta(x,i)$]

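Weighted sum of Fisher divergences for all noise scales
- $\sum_{i=1}^{L} \lambda(i) E_{p_{\sigma_i}(x)}[\|\nabla_x \log p_{\sigma_i}(x) - s_\theta(x,i)\|_2^2]$
- $\lambda(i)\in R_{>0}$, usually chosen as $\lambda(i)=\sigma_i^2$
- optimized with score matching, exactly as for the naive (unconditional) score-based model $s_\theta(x)$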
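
A minimal sketch of the weighted objective, implemented with denoising score matching at each noise level and $\lambda(i)=\sigma_i^2$. The function name `weighted_dsm_loss`, the noise levels, and the toy data are assumptions. The data here is a point mass at the origin, chosen only because then $p_{\sigma_i} = N(0, \sigma_i^2 I)$ and the exact score gives zero loss; for real data the denoising loss has a nonzero floor even at the optimum.

```python
import numpy as np

def weighted_dsm_loss(score_fn, x, sigmas, rng):
    """Sum over noise levels of lambda(i) * DSM loss, with lambda(i) = sigma_i^2."""
    per_level = []
    for i, sigma in enumerate(sigmas):
        z = rng.normal(size=x.shape)
        x_tilde = x + sigma * z
        target = -z / sigma                       # score of N(x_tilde; x, sigma^2 I)
        diff = score_fn(x_tilde, i) - target
        per_level.append(sigma ** 2 * 0.5 * np.mean(np.sum(diff ** 2, axis=1)))
    return float(np.sum(per_level)), per_level

rng = np.random.default_rng(0)
sigmas = np.array([0.01, 0.1, 1.0])               # hypothetical noise levels
x = np.zeros((50_000, 2))                         # toy data: point mass at the origin

exact_score = lambda xt, i: -xt / sigmas[i] ** 2  # score of N(0, sigma_i^2 I)
zero_score = lambda xt, i: np.zeros_like(xt)      # an untrained "model"

total_exact, _ = weighted_dsm_loss(exact_score, x, sigmas, rng)
total_zero, levels_zero = weighted_dsm_loss(zero_score, x, sigmas, rng)
```

With the zero model, every entry of `levels_zero` is about D/2 = 1.0 regardless of $\sigma_i$. This is the practical motivation for $\lambda(i)=\sigma_i^2$: without it the small-noise terms, which scale like $1/\sigma_i^2$, would dominate the sum.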

[Sampling from $s_\theta(x,i)$]

Noise-conditional score-based model - $s_\theta(x,i)$
Generate samples by running Langevin dynamics with $s_\theta(x,i)$ for $i=L,L-1, ..., 1$ in sequence
- Annealed Langevin dynamics

Early, high-noise stage ($\sigma_L$) - the data is spread widely, so accurate scores are available even in low density regions
Late, low-noise stage ($\sigma_1$) - reflects the high density regions where the data concentrates, so the final samples represent the original distribution well
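
Annealed Langevin dynamics can be sketched as follows. The per-level step size $\alpha_i = \epsilon \sigma_i^2 / \sigma_1^2$ and the update $x \leftarrow x + \frac{\alpha_i}{2} s(x,i) + \sqrt{\alpha_i} z$ are taken, as far as I recall, from Algorithm 1 of the paper with the indices flipped to match $\sigma_1<...<\sigma_L$, so treat them as an assumption; the update has the same form as the Langevin step above with $\alpha_i = 2\epsilon$. The Gaussian data distribution and its exact noise-conditional score stand in for a trained NCSN.

```python
import numpy as np

def annealed_langevin(score_fn, x0, sigmas, eps, n_steps, rng):
    """Run Langevin dynamics from the largest noise level down to the smallest.

    sigmas is increasing (sigma_1 < ... < sigma_L), matching the ordering above.
    """
    x = np.array(x0, dtype=float)
    for i in reversed(range(len(sigmas))):        # zero-based: L-1, ..., 0
        alpha = eps * sigmas[i] ** 2 / sigmas[0] ** 2   # assumed step-size schedule
        for _ in range(n_steps):
            z = rng.normal(size=x.shape)
            x = x + 0.5 * alpha * score_fn(x, i) + np.sqrt(alpha) * z
    return x

# Assumed toy data distribution N(mean, std^2); its perturbed version is
# N(mean, std^2 + sigma_i^2), so the exact noise-conditional score is known.
mean, std = 2.0, 0.5
sigmas = np.geomspace(0.01, 1.0, 10)
score_fn = lambda x, i: -(x - mean) / (std ** 2 + sigmas[i] ** 2)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=5000)
samples = annealed_langevin(score_fn, x0, sigmas, eps=2e-5, n_steps=100, rng=rng)
```

The large-noise levels take large steps and move the chains to the right region quickly; the small-noise levels take tiny steps and only refine. The final samples end up close to $N(2, 0.5^2)$, slightly wider because the last levels do not fully mix in 100 steps.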

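[Some practical recommendations]

$\sigma_1<...<\sigma_L$ - geometric progression
- $\sigma_1$ sufficiently small, $\sigma_L$ comparable to the maximum pairwise distance $\max d(x_i, x_j)$ between training data points
- L - on the order of hundreds or thousands
$s_\theta(x,i)$ - U-Net with skip connections
At test time, use an exponential moving average (EMA) of the score-based model's weights
- EMA - averages out weight fluctuations, giving more stable and better-generalizing performance at test time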
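
A minimal sketch of an EMA over model weights. The dictionary-of-arrays parameter format, the function name `ema_update`, and the decay values are assumptions for illustration; the post does not state which decay was used.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Return the updated EMA of each parameter array (decay is an assumed value)."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

# Hypothetical one-parameter "model" whose weight jumps around during training.
params = {"w": np.array([0.0])}
ema = {k: v.copy() for k, v in params.items()}
for step in range(1, 101):
    params["w"] = np.array([1.0 if step % 2 else -1.0])   # noisy training weights
    ema = ema_update(ema, params, decay=0.9)
```

The raw weight keeps oscillating between +1 and -1, while the EMA copy stays near 0. At test time the EMA copy would be used in place of the raw weights.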

Experiments

T: number of Langevin updates run at each $\sigma_i$
L: number of noise levels $\sigma_i$

[Setup]

MNIST, CelebA, CIFAR-10
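L = 10, T = 100, $\epsilon = 2 \times 10^{-5}$
$\sigma_1=1,..., \sigma_{10}=0.01$, a geometric sequence (note: the noise levels here are indexed in decreasing order, following the paper, the reverse of $\sigma_1<...<\sigma_L$ used above)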
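
The noise schedule in this setup can be generated with NumPy's `geomspace`; the variable names are illustrative.

```python
import numpy as np

L = 10
sigmas = np.geomspace(1.0, 0.01, num=L)          # sigma_1 = 1, ..., sigma_10 = 0.01
ratios = sigmas[1:] / sigmas[:-1]                 # constant for a geometric sequence
```

Each level is the previous one multiplied by the same factor, $0.01^{1/9}$, roughly 0.6.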
