[Paper Review] StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

PGGAN

Progressive growing of GANs for improved quality, stability, and variation
ICLR 2018
Citations: 8796
Later GAN papers refer to it as ProGAN or PGAN
StyleGAN is the same model with only the generator changed
https://arxiv.org/abs/1710.10196

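Progressive growing of GANs

PGGAN - a GAN that starts training at low resolution and progressively adds higher-resolution layers as training proceeds
- Early stage - learns large-scale structure from low-resolution images
- Later stages - gradually learn finer-scale detail

Uses a fade-in technique, so each new layer is blended in smoothly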
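
A minimal sketch of the fade-in blend, assuming the usual formulation in which a weight `alpha` ramps from 0 to 1 while a new higher-resolution layer is introduced. The helper names `upsample_nearest` and `fade_in` and the toy images are illustrative, not taken from the PGGAN code.

```python
def upsample_nearest(img):
    """Double the resolution of a 2D list by repeating each pixel 2x2."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fade_in(old_low_res, new_high_res, alpha):
    """Blend the upsampled old output with the new layer's output.

    alpha = 0 keeps only the old (upsampled) path,
    alpha = 1 keeps only the newly added layer.
    """
    up = upsample_nearest(old_low_res)
    return [
        [(1.0 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
        for up_row, new_row in zip(up, new_high_res)
    ]


old = [[0.0, 1.0], [1.0, 0.0]]          # 2x2 output of the existing layers
new = [[0.5] * 4 for _ in range(4)]     # 4x4 output of the newly added layer
start = fade_in(old, new, alpha=0.0)    # same as the upsampled old image
end = fade_in(old, new, alpha=1.0)      # same as the new layer's output
```

During training `alpha` would be increased gradually, so the network never sees an abrupt change when a resolution is added.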

Initial stage: 512 → 512x4x4 fully connected layer & 4x4 conv

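[Advantages]

Stable - low-resolution images carry little class information and have few modes, so training is stable
- class information - the main structural features of an image - e.g. large features such as the positions of a person's eyes, nose, and mouth
Reduced training time - G and D are trained against each other starting at low resolution, so training is 2-6x faster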

StyleGAN Overview

A Style-Based Generator Architecture for Generative Adversarial Networks
CVPR 2019
Citations: 13,527
https://arxiv.org/abs/1812.04948

Contribution

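Generates high-resolution images at high quality with a style-based generator
Proposes two metrics for measuring disentanglement
Releases FFHQ, a 1024x1024 high-resolution dataset of human faces

Replaces the PGGAN generator with a style-based generator
- Problem with PGGAN - it cannot control fine-grained attributes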

Mapping Network added
- A non-linear mapping network that maps $z$, sampled from the latent space $Z$ (Gaussian), to $w$ in the intermediate latent space $W$
- $z \sim \mathcal{N}(0, I)$
- non-linear mapping network $f: Z \to W$, $w \in W$, $W$ - intermediate latent space
- $f$ - 512 → 512 & 8-layer MLP
- A network for disentangling features
- Disentanglement: breaking the correlation between features
  - 'each attribute is well separated', 'content and style are well separated'
  - ⇒ the latent space consists of linear subspaces
  - Fully disentangling every attribute is nearly impossible, but the space can be made as little entangled as possible
- In earlier models the latent code is sampled from a fixed distribution (Gaussian or Uniform), and its sampling density has to follow the training data distribution
- Mapping to the intermediate space lets the space be organized in a less entangled way
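
A minimal sketch of the mapping network as an 8-layer MLP with equal input and output width. The weights are random (untrained), the leaky ReLU slope of 0.2 is an assumption, and the width is shrunk from 512 to 16 so the example runs instantly. The names `make_mapping_network` and `mapping_network` are hypothetical.

```python
import random


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope is an assumed value."""
    return x if x >= 0.0 else slope * x


def make_mapping_network(dim, num_layers=8, seed=0):
    """Random (untrained) weights for an MLP f: Z -> W with equal widths."""
    rng = random.Random(seed)
    scale = (1.0 / dim) ** 0.5
    return [
        {
            "weight": [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)],
            "bias": [0.0] * dim,
        }
        for _ in range(num_layers)
    ]


def mapping_network(z, layers):
    """Map a latent code z to an intermediate latent code w."""
    h = list(z)
    for layer in layers:
        h = [
            leaky_relu(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b)
            for row, b in zip(layer["weight"], layer["bias"])
        ]
    return h


dim = 16                                        # the post states 512; kept small here
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # z ~ N(0, I)
layers = make_mapping_network(dim)
w = mapping_network(z, layers)
```

Because $w$ is the output of a learned function rather than a sample from a fixed distribution, its density is free to differ from the Gaussian that $z$ follows.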

Synthesis network
- The network that generates an image from the $w$ vector
- PGGAN - a progressive growing structure with 9 resolution levels, from 4x4 to 1024x1024
- Synthesis network - a progressive growing structure of 9 style blocks, from 4x4 to 1024x1024
  - The last layer converts to RGB - matches the 3 image channels
  - style block - takes the previous style block's output feature map as input & applies 2 conv operations
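
The block and layer counts can be checked with a few lines of arithmetic. This sketch assumes the resolution doubles at every block and that each block contributes two layers, as described above; `resolution_schedule` is an illustrative helper name.

```python
def resolution_schedule(start=4, end=1024):
    """List the resolutions visited when doubling from start up to end."""
    resolutions = []
    r = start
    while r <= end:
        resolutions.append(r)
        r *= 2
    return resolutions


resolutions = resolution_schedule()
num_blocks = len(resolutions)                 # one style block per resolution
num_conv_layers = 2 * num_blocks              # two layers per style block
# Resolution handled by each of the layers, indexed from 0.
layer_resolution = [r for r in resolutions for _ in range(2)]
```

With this indexing, layers 0-7 cover $4^2$ to $32^2$ and layers 8-17 cover $64^2$ to $1024^2$.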

Style Modules (AdaIN)
- $w$ from the latent space $W$ passes through a learned affine transform at each layer's conv and is applied via AdaIN
  - It is applied at every conv, so 18 times in total
  - A - transforms $w$ into the style vector $y_i=[y_{s,i},y_{b,i}]$, the style scaling & bias factors
- AdaIN: as in instance normalization, normalizes per instance (each image) and per channel, then applies the style factors
  - n - the number of output channels of the conv layer
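
A minimal sketch of AdaIN for a single image, following the standard formulation $\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i-\mu(x_i)}{\sigma(x_i)} + y_{b,i}$. The nested-list layout, the `eps` term, and the toy values are assumptions made for illustration.

```python
import math


def adain(feature_maps, y_s, y_b, eps=1e-8):
    """AdaIN for one image.

    feature_maps: list of channels, each channel a 2D list (H x W).
    y_s, y_b: per-channel style scale and bias, one value per channel.
    """
    out = []
    for x_i, scale, bias in zip(feature_maps, y_s, y_b):
        values = [v for row in x_i for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)            # eps avoids division by zero
        out.append([[scale * (v - mean) / std + bias for v in row] for row in x_i])
    return out


x = [
    [[1.0, 2.0], [3.0, 4.0]],        # channel 0
    [[10.0, 10.0], [20.0, 20.0]],    # channel 1
]
styled = adain(x, y_s=[2.0, 0.5], y_b=[1.0, -1.0])
```

After the call, each channel has mean close to its `y_b` value and standard deviation close to its `y_s` value, whatever the statistics of the input were.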

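Constant input
- Convolution starts from a learned constant input rather than from the random noise $z$; $w$ enters only through the AdaIN styles
  - Better performance
  - Every sample starts from the same base, which makes training more stable and generation more consistent
  - The $W$ space provides disentangled properties, enabling more intuitive and efficient style manipulation
  - Reduces unnecessary randomness and can improve the quality of generated images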

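Stochastic variation
- High-resolution images are generated with the progressive growing structure and AdaIN
- Noise is added to introduce fine-grained stochastic attributes
- Gaussian noise is added to the feature map output by each conv, producing stochastic variation
  - The B block scales the noise to fit the feature maps ⇒ B is a learned parameter (per-channel scaling factors)
  - High-level attributes produced by the style blocks and AdaIN are not affected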
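
A minimal sketch of the noise injection step, assuming one single-channel Gaussian noise image per layer that is shared across channels and scaled by a learned per-channel factor (the B block). The scale values here are made up for illustration.

```python
import random


def inject_noise(feature_maps, noise, noise_scale):
    """Add one single-channel noise image to every channel.

    feature_maps: list of channels, each H x W.
    noise: single H x W image of Gaussian noise, shared by all channels.
    noise_scale: per-channel scaling factors (the "B" block).
    """
    return [
        [[v + b * n for v, n in zip(row, noise_row)] for row, noise_row in zip(ch, noise)]
        for ch, b in zip(feature_maps, noise_scale)
    ]


rng = random.Random(0)
height, width, channels = 4, 4, 3
features = [[[1.0] * width for _ in range(height)] for _ in range(channels)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
scales = [0.0, 0.1, 1.0]          # hypothetical learned values, one per channel
noisy = inject_noise(features, noise, scales)
```

A channel whose scale is 0 ignores the noise entirely, which is how the network can learn to use noise only where it helps.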

Paper

0. Abstract

Enables intuitive, scale-specific control of the synthesis
SOTA in terms of traditional distribution quality metrics
- leads to demonstrably better interpolation properties
Better disentangles the latent factors of variation
Proposes two new methods that can be applied to any generator
- they quantify interpolation quality & disentanglement
Introduces a new, highly varied and high-quality dataset of human faces

1. Introduction

[Motivation]

The resolution and quality of images produced by GANs ⇒ rapid improvement
- but generators still operate as black boxes
- understanding of various aspects of image synthesis is still lacking
- latent space interpolations of different generators could not be compared quantitatively

[StyleGAN architectural change]

Presents a new way to control the synthesis process
Starts from a learned constant input and adjusts the "style" at each conv layer
- ⇒ directly controls the strength of image features

constant input + noise injection
- automatically separates high-level attributes from stochastic variation
- enables intuitive scale-specific mixing & interpolation operations

The discriminator & loss function are not modified
⇒ orthogonal to the ongoing discussion about GAN loss functions, regularization, hyperparameters, etc.
In other words, the proposal concerns only the generator

input latent code → intermediate latent space (embedding)
- input latent space - must follow the density of the training data
  - leads to unavoidable entanglement
- intermediate space - free from that restriction
  - can be disentangled

Flickr-Faces-HQ Dataset (FFHQ)
- provides a high-quality image dataset of human faces
- 70,000 images & 1024x1024 resolution

2. Style-based generator

[Traditional generator]

latent code - fed into the generator's input layer

[Style-based generator]

[Mapping Network]
- the traditional input layer is dropped; synthesis starts from a learned constant
- $z \in Z$
- non-linear mapping network $f: Z \to W$, $w \in W$
- $f$ - 512 → 512 & 8-layer MLP

[Style adjustment]
- affine transformations $w$ → $y=(y_s, y_b)$
  - control adaptive instance normalization (AdaIN)
  - the AdaIN operation is applied after each conv. layer
  - each feature map $x_i$ is normalized separately ⇒ i.e. per channel
  - then scaled and biased by the style $y$
  - the dim of $y$ is 2 × the number of feature maps (channels)
- Unlike style transfer, the style is obtained from $w$ rather than from an image
  - the style comes from a latent vector instead of an example image ⇒ more intuitive & consistent
  - various styles can be controlled directly in the latent space
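
A minimal sketch of the per-layer affine transform that turns $w$ into a style with twice as many entries as the layer has channels. The weights are random (untrained), and the bias initialization (scale half at 1, shift half at 0) is an assumption chosen so the initial style is close to an identity modulation. `make_affine` and `style_from_w` are hypothetical names.

```python
import random


def make_affine(w_dim, num_channels, seed=0):
    """Random (untrained) affine map from w to 2 * num_channels style values."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.01) for _ in range(w_dim)] for _ in range(2 * num_channels)]
    # Assumption: scale half starts at 1, shift half starts at 0.
    bias = [1.0] * num_channels + [0.0] * num_channels
    return weight, bias


def style_from_w(w, affine, num_channels):
    """Compute y = (y_s, y_b) for one layer from the latent code w."""
    weight, bias = affine
    y = [sum(a * b for a, b in zip(row, w)) + c for row, c in zip(weight, bias)]
    return y[:num_channels], y[num_channels:]


w_dim, num_channels = 8, 4        # toy sizes; the post states a 512-dimensional w
w = [0.5] * w_dim
affine = make_affine(w_dim, num_channels)
y_s, y_b = style_from_w(w, affine, num_channels)
```

Each layer would own a separate affine map, so the same $w$ yields a different style at every layer.
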
[Synthesis network]
- 4x4 → 1024x1024 & 18 layers
- output of the last layer - converted to RGB ⇒ 1x1 conv with 3 channels
- trainable parameters: 23.1M (traditional generator) → 26.2M (style-based generator)
- explicit noise inputs add stochastic detail to the generated image
  - single-channel images consisting of uncorrelated Gaussian noise
- a dedicated noise image is fed to each layer
- the noise is broadcast to all feature maps using learned per-channel scaling factors (B)
  - and then added to the conv. output

2.1 Quality of generated images

Experimentally, this generator does not compromise image quality and in fact improves it considerably
FID scores are computed from 50,000 images
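
For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below ($\mu_r, \Sigma_r$ for real images, $\mu_g, \Sigma_g$ for generated ones) are the usual notation and are not defined elsewhere in this post.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms are zero when the two feature distributions match, so lower values indicate generated images that are statistically closer to the real ones.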

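A - the PGGAN paper's setup used as is (baseline)
B - bilinear up/downsampling when changing image resolution ⇒ smoother than nearest-neighbor up/downsampling
- + longer training, hyperparameter tuning
C - adds the Mapping Network & AdaIN operations
D - the synthesis network takes a learnable 4x4x512 constant tensor as input instead of $z$
- quite remarkable that results remain good even though the network receives input only through the styles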

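The truncation trick is used to avoid sampling from the extreme regions of $W$
- truncation trick - restricts $w$ to a limited range by pulling it toward the average $w$, trading diversity for higher-quality, more stable images
- truncation is applied only at $4^2-32^2$, which stabilizes the overall structure & keeps quality high
- high-resolution details are left untouched so they retain their diversity
- FID values are computed without the truncation trick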
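
A minimal sketch of truncation applied only to the low-resolution layers, assuming the common form $w' = \bar{w} + \psi (w - \bar{w})$ with $\bar{w}$ the average of $w$. The value `psi=0.7`, the `cutoff` of 8 layers (two layers each for $4^2$ through $32^2$), and the toy vectors are illustrative assumptions.

```python
def truncate_styles(w_per_layer, w_avg, psi=0.7, cutoff=8):
    """Pull the first `cutoff` per-layer codes toward the average w.

    w' = w_avg + psi * (w - w_avg); layers at index >= cutoff are unchanged.
    """
    out = []
    for i, w in enumerate(w_per_layer):
        if i < cutoff:
            out.append([a + psi * (v - a) for v, a in zip(w, w_avg)])
        else:
            out.append(list(w))
    return out


num_layers = 18                    # two layers per resolution, 4^2 .. 1024^2
w = [2.0, -1.0, 0.5]               # toy 3-dimensional w
w_avg = [0.0, 0.0, 0.5]            # assumed running average of w
styles = truncate_styles([w] * num_layers, w_avg)
```

Setting `psi=1.0` disables truncation, while `psi=0.0` collapses the affected layers onto the average code.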

3. Properties of the style-based generator

Image synthesis is controlled through scale-specific modifications to the styles

The $w$ obtained from the mapping network goes through an affine transform at each layer
- $A_i(w) = y_i$ - the style of the i-th layer ⇒ style parameter
- low resolution layers - AdaIN adjusts large-scale features
- high resolution layers - AdaIN adjusts fine details
- Why $W$ instead of $Z$ ⇒ disentangled attributes let a specific attribute be adjusted independently
  - style control - intuitive & efficient
A novel image is generated from a collection of styles ⇒ by the synthesis network
⇒ the effect of each style is localized in the network

[Why styles are localized]

Each channel is normalized to zero mean and unit variance
New per-channel statistics come only from $y_s, y_b$ (the style)
As a result they do not depend on the previous statistics
⇒ each style controls only one conv. before being overridden by the next AdaIN operation
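
A short derivation makes the independence from earlier statistics concrete. Suppose an earlier layer or style rescales and shifts a channel, so AdaIN receives $x_i' = a\,x_i + c$ with $a > 0$ (the constants $a$ and $c$ are introduced here only for illustration):

```latex
\begin{aligned}
\mu(x_i') &= a\,\mu(x_i) + c, \qquad \sigma(x_i') = a\,\sigma(x_i) \\
\frac{x_i' - \mu(x_i')}{\sigma(x_i')}
  &= \frac{a\,x_i + c - a\,\mu(x_i) - c}{a\,\sigma(x_i)}
   = \frac{x_i - \mu(x_i)}{\sigma(x_i)} \\
\mathrm{AdaIN}(x_i', y)
  &= y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}
   = \mathrm{AdaIN}(x_i, y)
\end{aligned}
```

The per-channel mean and scale set by an earlier style are therefore erased by the next normalization. Only the spatial pattern of $x_i$ passes through, which is consistent with each style acting on a single convolution.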

3.1 Style mixing

Mixing regularization further encourages the styles to localize
- during training, a given percentage of images are generated from two latent codes $z_1, z_2$ ⇒ $w_1, w_2$

Style mixing - switch from one latent code to another at a randomly selected point in the synthesis network
- $w_1$ is used before the crossover point, $w_2$ after it
- crossover point - a layer index drawn at random (e.g. with randint)
- ⇒ prevents the network from assuming that adjacent styles are correlated
- without mixing, the $w$ derived from a single $z$ determines the style parameters of every layer
- ⇒ with style mixing, layers before the crossover point use $w_1$ and layers after it use $w_2$, so the network learns without a strong correlation between consecutive layers
- ⇒ e.g. the overall shape of a face and its fine texture can be adjusted independently
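
A minimal sketch of how a crossover point could be drawn during training. The mixing probability of 0.9 is an assumed value, the strings stand in for real latent vectors, and `mix_styles` is a hypothetical helper.

```python
import random


def mix_styles(w1, w2, num_layers=18, mixing_prob=0.9, rng=random):
    """Return one latent code per layer, optionally switching from w1 to w2."""
    if rng.random() < mixing_prob:
        # Layers [0, crossover) use w1, the remaining layers use w2.
        crossover = rng.randint(1, num_layers - 1)
    else:
        crossover = num_layers          # no mixing: w1 is used everywhere
    per_layer = [w1 if i < crossover else w2 for i in range(num_layers)]
    return per_layer, crossover


rng = random.Random(0)
w1, w2 = "w1", "w2"                     # stand-ins for two mapped latent vectors
styles, crossover = mix_styles(w1, w2, rng=rng)
```

Drawing the crossover between 1 and `num_layers - 1` guarantees that both codes contribute whenever mixing is active.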

In StyleGAN each layer adjusts its style independently through its own affine transformation
Applying mixing regularization on top reinforces this independence

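FID score - lower is better
percentage - the fraction of training examples to which mixing regularization is applied
- mixing during training uses two latent codes
test - FID is measured on images generated while varying the number of latent codes
Enabling mixing regularization improves the localization of styles when images are generated by mixing several latent vectors
- ⇒ natural, high-quality images can be generated (better, i.e. lower, FID)

If style information is not applied independently, multiple styles are not reflected naturally in the image ⇒ FID gets worse (higher)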

The first row and the first column show the images generated from each latent source, A and B
Coarse Styles (low resolution, $4^2$→$8^2$)
- B's low-resolution styles are applied to A
- high-level aspects change - pose, general hair style, face shape, eyeglasses, etc.
- but colors and finer features resemble A
Middle Styles (middle resolution, $16^2$→$32^2$)
- smaller-scale facial features, hair style, eyes open/closed come from B
- pose, face shape, eyeglasses come from A
Fine Styles (high resolution, $64^2$→$1024^2$)
- skin texture, color scheme, microstructure come from B
- large-scale features keep A's style
3.2 Stochastic variation

Per-pixel noise is added after each convolution ⇒ noise is injected independently at each layer
- specific fine details vary in a diverse, natural way
- increases the diversity of generated images

Overall appearance stays almost identical
- ⇒ global aspects are unaffected by stochastic variation

a) noise applied to all layers
- all details appear naturally, and fine features come out well
b) no noise at any layer
- without noise, fine texture is missing ⇒ a "painterly" look
c) noise applied only to high resolution layers ($64^2$→$1024^2$)
- adds fine texture & small details such as skin pores and fine strands of hair
d) noise applied only to low resolution layers ($4^2$→$32^2$)
- adds randomness only to large structure ⇒ large curls of hair
⇒ $w$ (via mixing) controls global features, while noise controls stochastic features
- the effect of noise is localized at each layer

3.3 Separation of global effects from stochasticity

StyleGAN
- Style - affects the entire image through AdaIN ⇒ global effect
  - pose, lighting, background style, etc. are controlled coherently
- Noise - added independently to each pixel ⇒ using it for global attributes would produce spatially inconsistent decisions
  - these would be penalized by the discriminator, so the network uses noise only for local effects
- ⇒ the global & local roles are each learned naturally, without any explicit guidance

StyleGAN - uses $w$ and applies an affine transformation at each layer, so every layer operates from the same style information
This yields globally coherent style effects together with natural stochastic variation
⇒ quality up

4. Disentanglement studies

Disentanglement - the latent space consists of linear subspaces
- each of which controls one factor of variation

In the latent space $Z$, the sampling probability has to correspond to the density of the training data
- ⇒ this rules out full disentanglement
Intermediate latent space $W$ - does not have to support sampling according to any fixed distribution
- its sampling density is induced by the learned piecewise continuous mapping $f(z)$
- ⇒ the mapping can "unwarp" $W$ → factors of variation become more linear
- since the factors of variation are not known in advance, a less entangled $W$ is learned in an unsupervised setting
Proposes two metrics for quantifying disentanglement that can be computed for any image dataset and generator

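[Figure description]

a. Feature distribution of the training set
- e.g. long-haired males ⇒ a combination that is absent from the training data cannot be represented
b. mapping from Z to image features
- $z$ is sampled from a simple distribution such as a normal distribution
- but the mapping becomes curved (warped in certain regions) to prevent sampling invalid combinations absent from the training dataset
c. mapping from W to image features
- $z$ → $w$ through the mapping network
- the mapping can undo much of the warping → styles are learned to be more independent and disentangled
- ⇒ with less warping, even combinations that are rare in the training dataset can be rendered more naturally and coherently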

Others

Provides the FFHQ dataset

License: Attribution, NonCommercial, NoDerivatives

kongshin's Lab
