NL-046, Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation (2019-ACL)

0. Abstract

Previous approaches to unpaired text style transfer disentangle content and style in a latent space.
However, current approaches of this kind have two major problems.

It is difficult to completely strip the style information from the semantics of a sentence.
RNN-based encoder-decoder models that work through a latent representation handle long-term dependencies poorly, so the non-stylistic semantic content is hard to preserve.

This paper proposes the Style Transformer, which makes no assumption about the latent representation of the source sentence and relies on the attention mechanism of the Transformer to do both style transfer and content preservation well.

1. Introduction

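Text style transfer changes a stylistic attribute of a text (e.g., positive vs. negative sentiment) while keeping the style-independent content of the context.
Because the definition of style is vague, it is hard to construct paired sentences that have the same content but different styles.

People usually don't put it this way; they say paired data is expensive to build. Still, the argument here seems reasonable in its own way.
Then again, by this logic the delete, retrieve, generate approach looks ambiguous too, since it implies that even humans struggle to build such data...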

So research on text style transfer focuses on the unpaired setting.
The mainstream approach to text style transfer is the encoder-decoder framework.

The encoder extracts a style-independent latent representation (in vector form).
The decoder then combines the encoder's output vector with a different style to generate text.
The result has the same content but a different style.

These methods focus on how to disentangle content and style in the latent space.

The latent representation has to capture the meaning of the text well while leaving the style out.
Because paired sentences are lacking, an adversarial loss is applied to the latent representation to discourage it from encoding style information.
A disentangled latent representation offers better interpretability, but the paper goes on to describe the problems with it.

Problem 1

It is hard to judge the quality of the disentanglement.
The paper "Adversarial Removal of Demographic Attributes from Text Data" reports that style information can still be recovered from the latent representation even after adversarial training.
So disentangling the stylistic properties from the semantics of a sentence is not easy.

Problem 2

Disentanglement is not strictly necessary.
The paper "Multiple-Attribute Text Rewriting" argues that a good decoder can generate text in the desired style from an entangled latent representation by overwriting the original style.

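Problem 3

When a sentence is represented as a vector, the limited capacity of the vector (presumably because its size is fixed) makes it hard to hold rich semantic information, especially for long sentences.
Recent machine translation work shows that generating the target sentence from the latent representation alone, without referring back to the original sentence (as attention-based seq2seq does), is not easy.

Problem 4

To disentangle content and style, the input sentence has to be encoded into a fixed-size latent vector.
As a result, these methods cannot make effective use of an attention mechanism to preserve the input sentence.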

Problem 5

Most encoder-decoder models use RNN variants.
But RNNs are weak at long-range dependencies.
Also, without referring to the original text, it is very hard for an RNN-based decoder to preserve content.
Generation of long sentences is hard to control.

Because of these concerns about disentangled models for style transfer, the paper takes a different route and proposes the Style Transformer.

It is based on the Transformer.

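Contributions of this paper

A novel training algorithm that makes no assumption about a disentangled latent representation.
The first application of the Transformer architecture to the style transfer task.
Experiments show better performance than other methods on two datasets (Yelp, IMDB).
If I had to name weak points, they would be the code release for this paper and the fact that experiments use only two datasets..?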

2. Related Work

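Omitted (too much overlap with the papers I reviewed earlier.. I'll write it up when I have time)

3. Style Transformer

Description of the style transfer task

3.1 Problem Formalization

Omitted (too much overlap with the papers I reviewed earlier.. I'll write it up when I have time)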

3.2 Model Overview

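The goal is to learn $f_{\theta}(\textbf{x}, \textbf{s})$, where x is a sentence and s is a style control variable.
The main challenge is that the data is non-parallel, so the transfer model cannot be trained with direct supervision.
This is why the discriminator described in 3.4 is used.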

3.3 Style Transformer Network

Input sentence: $\textbf{x} = (x_1, x_2, \cdots, x_n)$
Transformer encoder: $Enc(\textbf{x}; \theta_{E})$

Output, a continuous representation: $\textbf{z} = (z_1, z_2, \cdots, z_n)$

Transformer decoder: $Dec(\textbf{z}; \theta_{D})$

Output sentence: $\textbf{y} = (y_1, y_2, \cdots, y_n)$
The probability of y is factorized auto-regressively over the output tokens.

At each time step t, the next-token distribution is the softmax of the decoder output.

$\textbf{o}_t$ is the logit vector output by the decoder.
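
The equation images did not survive on this page, so the following is a reconstruction from the definitions above rather than a copy of the paper's typesetting. It should correspond to what the text later calls Eq. 1 and Eq. 2; the upper limit n follows the notation used here for y.

$$
p_{\theta}(\textbf{y} \mid \textbf{x}) = \prod_{t=1}^{n} p_{\theta}(y_t \mid \textbf{z}, y_1, \cdots, y_{t-1})
$$

$$
p_{\theta}(y_t \mid \textbf{z}, y_1, \cdots, y_{t-1}) = \mathrm{softmax}(\textbf{o}_t)
$$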

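To make style control possible in the standard Transformer framework, an extra style embedding is fed to the Transformer encoder in addition to the sentence.

So when the network computes the output probability, it conditions on both the input sentence x and the style control variable s.
There is no separate trick for disentanglement here, but feeding s to the encoder seems to be done in the hope of getting that effect.
It is probably also because there is no delete model.
In my own experiments I feed the style to the decoder instead.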
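
A minimal sketch of how the style embedding could enter the encoder as one extra token. It assumes the style token is placed in front of the sentence and, as noted in the training details, gets no positional encoding. The function names, the plain-list embedding tables, and the sinusoidal encoding are illustrative choices, not the authors' code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def positional_encoding(pos: int, dim: int) -> Vector:
    """Sinusoidal positional encoding of the standard Transformer."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def build_encoder_input(
    token_ids: Sequence[int],
    style_id: int,
    token_emb: Sequence[Vector],
    style_emb: Sequence[Vector],
) -> List[Vector]:
    """Return the encoder input: one style vector followed by the word vectors."""
    dim = len(style_emb[0])
    # Assumption: the style token comes first and has no positional encoding.
    rows = [list(style_emb[style_id])]
    for pos, tok in enumerate(token_ids):
        pe = positional_encoding(pos, dim)
        rows.append([e + p for e, p in zip(token_emb[tok], pe)])
    return rows
```

Calling `build_encoder_input` with the same sentence but a different `style_id` changes only the first row, which is the only place the target style is injected in this sketch.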

3.4 Discriminator Network

When $\textbf{s} \neq \hat{\textbf{s}}$, the sentence the model should generate is unknown, so $f_{\theta}(\textbf{x},\hat{\textbf{s}})$ cannot be supervised directly.
So a discriminator is used to learn from the non-parallel corpora.
What can be supervised directly is $f_{\theta}(\textbf{x},\textbf{s})$, and learning it has a single optimum solution.

That optimum solution is to reproduce the input sentence.
Since the input is not altered in any way here, this is a plain autoencoder objective.
Training like this is usually considered not very meaningful, so people tend to train with denoising instead.

On top of this, supervision for the case where the style differs is built in the following two ways.

For content preservation: feed the sentence x to get the transferred sentence $\hat{\textbf{y}} = f_{\theta}(\textbf{x}, \hat{\textbf{s}})$, then feed it back into the Style Transformer network together with s (the original style) and train the output to be x (cycle loss, the back-translation technique).
For style control: train a discriminator network and use it to control the style of the generated sentence.

In short, the discriminator network is another Transformer encoder that learns to tell the style of a sentence.

The Style Transformer network then receives style supervision from the discriminator.
One thing puzzles me: why supervise through the discriminator's output?
We already know which attribute the input sentence has, so exact supervision should be possible..?

3.4.1 Conditional Discriminator

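The paper offers two ways of using a discriminator.

That is, it proposes two model variants that differ in which discriminator is used.
As far as I know the 3.4.2 version is the common one, but the paper says conditional GANs for images use the 3.4.1 approach.

The discriminator makes its decision conditioned on an input style.
A sentence x and a style s go into the discriminator $d_{\phi}(\textbf{x}, \textbf{s})$, which is asked whether x has the style s.

The usual setup is for D to take a sentence and output a probability for each style, but this one works differently.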

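Discriminator training stage

Real sentences x from the dataset are used to reconstruct $\textbf{y} = f_{\theta}(\textbf{x},\textbf{s})$ and to generate the transferred sentence $\hat{\textbf{y}} = f_{\theta}(\textbf{x}, \hat{\textbf{s}})$.
The discriminator is then trained on these generated sentences and their attributes?
It seems the discriminator is trained not only on real sentences but on generated ones as well.. I should check the algorithm listing later in the paper.

Style Transformer network training stage

The network $f_{\theta}$ is trained using the result of feeding $f_{\theta}(\textbf{x}, \hat{\textbf{s}})$ and $\hat{\textbf{s}}$ into the discriminator.
That is, it seems the transferred sentence goes into the discriminator and training pushes the discriminator to judge it as having style $\hat{\textbf{s}}$.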

3.4.2 Multi-class Discriminator

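Unlike the previous variant, only the sentence goes into the discriminator $d_{\phi}(\textbf{x})$, which decides which class it belongs to.
More concretely, the discriminator distinguishes K+1 classes: K classes for the different styles, and one remaining class that stands for transferred sentences $f_{\theta}(\textbf{x}, \hat{\textbf{s}})$, i.e. the fake class.
In the discriminator training stage, the reconstruction $\textbf{y} = f_{\theta}(\textbf{x},\textbf{s})$ is produced and the discriminator is trained to predict its labeled style.
The transferred sentence $f_{\theta}(\textbf{x},\hat{\textbf{s}})$ is mapped to class 0.

The Style Transformer network is then trained so that $f_{\theta}(\textbf{x},\hat{\textbf{s}})$ is classified as style $\hat{\textbf{s}}$.
That is, it is trained with a loss that comes through the discriminator.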
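
A small sketch of how the K+1 outputs of the multi-class discriminator could be read. Class 0 is the fake class as stated above; assigning the K styles to indices 1..K in list order is my assumption, and the function names are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

FAKE_CLASS = 0  # transferred sentences are mapped to class 0


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits: Sequence[float], style_names: Sequence[str]) -> Tuple[str, List[float]]:
    """Map K+1 logits to a label: 'fake' or one of the K style names."""
    if len(logits) != len(style_names) + 1:
        raise ValueError("expected K+1 logits for K styles")
    probs = softmax(logits)
    labels = ["fake"] + list(style_names)  # assumption: styles occupy 1..K
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

With two sentiment styles the discriminator has three outputs, so `interpret([0.0, 2.0, 0.0], ["negative", "positive"])` picks the label at index 1.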

3.5 Learning Algorithm

Training consists of two steps.

3.5.1 Discriminator Learning

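The discriminator is trained to tell real sentences x and reconstructed sentences $f_{\theta}(\textbf{x},\textbf{s})$ apart from transferred sentences $f_{\theta}(\textbf{x},\hat{\textbf{s}})$.
The paper gives the loss and an algorithm listing for this step.

In other words, it can simply be trained like a classifier that predicts the style.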
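
The labeling for one real sentence could be sketched as below. The multi-class labels follow the description in 3.4.2. For the conditional variant, treating the real sentence paired with the wrong style as a negative example is my reading of the paper's algorithm and should be taken as an assumption. `transfer` stands in for $f_{\theta}$ and is a hypothetical callable.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple


def discriminator_examples(
    x: str,
    s: Hashable,
    styles: Sequence[Hashable],
    transfer: Callable[[str, Hashable], str],
    conditional: bool,
    rng: random.Random = None,
) -> List[Tuple]:
    """Build labeled discriminator examples from one real sentence x with style s."""
    rng = rng or random.Random()
    s_hat = rng.choice([c for c in styles if c != s])  # sample a different style
    y = transfer(x, s)          # reconstruction
    y_hat = transfer(x, s_hat)  # transferred sentence
    if conditional:
        # ((sentence, style), label): 1 if the sentence has that style, else 0.
        # Assumption: (x, s_hat) is also used as a negative pair.
        return [((x, s), 1), ((y, s), 1), ((x, s_hat), 0), ((y_hat, s_hat), 0)]
    # (sentence, class): real and reconstructed keep their style, transferred is class 0.
    return [(x, s), (y, s), (y_hat, 0)]
```

For the multi-class case the style labels passed in are expected to be 1..K so that 0 stays free for the fake class.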

3.5.2 Style Transformer Learning

The Style Transformer is trained through $f_{\theta}(\textbf{x}, \hat{\textbf{s}})$, and the loss depends on whether $\textbf{s}$ and $\hat{\textbf{s}}$ are the same or different.

Self Reconstruction

When $\textbf{s} = \hat{\textbf{s}}$, the model is trained to reconstruct its own input, the way an autoencoder is trained.
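
The equation image is missing here. Reconstructed from the description, and written as a negative log-likelihood (the exact form in the paper may differ), the self reconstruction loss that the text later calls Eq. 6 would be:

$$
\mathcal{L}_{\text{self}}(\theta) = -\log p_{\theta}(\textbf{y} = \textbf{x} \mid \textbf{x}, \textbf{s})
$$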

Cycle Reconstruction

In this case $\textbf{s} \neq \hat{\textbf{s}}$. The generated sentence $\hat{\textbf{y}} = f_{\theta}(\textbf{x}, \hat{\textbf{s}})$ is fed back in to regenerate the original sentence, so training works like back-translation.
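
Under the same caveat (a reconstruction from the description, in negative log-likelihood form), the cycle reconstruction loss, Eq. 7 in the text's numbering, would read:

$$
\mathcal{L}_{\text{cycle}}(\theta) = -\log p_{\theta}(\textbf{y} = \textbf{x} \mid f_{\theta}(\textbf{x}, \hat{\textbf{s}}), \textbf{s})
$$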

Style Controlling

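If the model is trained only to reconstruct (i.e. only with Eq. 6 and 7), it is said to learn nothing more than copying the input to the output.
So a style controlling loss is added: the generated sentence is fed to the discriminator, and the model is trained so that the sentence really carries the target style.
With the conditional discriminator, this loss asks the discriminator to answer "yes" for the pair of the transferred sentence and $\hat{\textbf{s}}$.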
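
Assuming the conditional discriminator outputs a binary variable c, with c = 1 meaning "the sentence has the given style", the style controlling loss could be written as:

$$
\mathcal{L}_{\text{style}}(\theta) = -\log p_{\phi}(c = 1 \mid f_{\theta}(\textbf{x}, \hat{\textbf{s}}), \hat{\textbf{s}})
$$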

With the multi-class discriminator, this loss asks the discriminator to assign the transferred sentence to the class of $\hat{\textbf{s}}$.
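
For the multi-class variant, c ranges over the K+1 classes, and a matching reconstruction of the loss (again an assumption about the exact form) is:

$$
\mathcal{L}_{\text{style}}(\theta) = -\log p_{\phi}(c = \hat{\textbf{s}} \mid f_{\theta}(\textbf{x}, \hat{\textbf{s}}))
$$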

The paper also summarizes this step as an algorithm listing.

3.5.3 Summarization and Discussion

As with GANs, discriminator learning and Style Transformer learning alternate iteratively.
Each iteration runs $n_d$ steps of discriminator learning and then $n_f$ steps of Style Transformer learning.
The paper gives the overall procedure as an algorithm listing.
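
The alternating schedule can be sketched as a plain loop. The two step functions are placeholders for one update of the discriminator and one update of the Style Transformer; their names and the `num_iterations` argument are mine, not the paper's.

```python
from typing import Callable


def train(
    num_iterations: int,
    n_d: int,
    n_f: int,
    discriminator_step: Callable[[], None],
    transformer_step: Callable[[], None],
) -> None:
    """Alternate n_d discriminator updates with n_f Style Transformer updates."""
    for _ in range(num_iterations):
        for _ in range(n_d):
            discriminator_step()  # update phi with theta fixed
        for _ in range(n_f):
            transformer_step()    # update theta with phi fixed
```

Passing callables keeps the schedule separate from the model code, so the ratio between $n_d$ and $n_f$ can be tuned without touching either update.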
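The other point made in this section is that backpropagating through the discriminator on generated sentences runs into the problem that language lives in a discrete space.

Common ways around this are REINFORCE (Williams, 1992) and the Gumbel-Softmax trick (Kusner and Hernández-Lobato, 2016).
However, both suffer from high variance.
In the authors' experiments, the Gumbel-Softmax trick slowed convergence and brought little improvement in performance.
For this reason, a soft generated sentence (the softmax distribution over tokens) is fed to the downstream network so that the whole model can be trained end to end.
When this approximation is used, the decoder also switches from greedy decoding to continuous decoding.
This means the decoder no longer has to feed in the most probable token from the previous step; it can feed in the probability values themselves (as in Eq. 2).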
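
One common way to realize this is to feed the expected embedding under the softmax distribution instead of the embedding of the argmax token. The sketch below assumes that interpretation; the temperature argument mirrors the softmax temperature mentioned in the training details, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; lower temperature gives a sharper distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_input(logits: Sequence[float], embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Greedy decoding: next input is the embedding of the most probable token."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return list(embeddings[best])


def soft_input(
    logits: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Continuous decoding: next input is the probability-weighted mix of embeddings."""
    probs = softmax(logits, temperature)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]
```

Because `soft_input` is a smooth function of the logits, gradients from the discriminator can flow back into the generator, which the argmax in `greedy_input` would block. As the temperature goes to zero the two inputs coincide.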

4. Experiments

4.1 Datasets

Yelp and IMDB are used.

4.2 Evaluation

The goal is to generate a fluent, content-complete sentence in the target style.
Evaluation covers three aspects: 1) style control, 2) content preservation, and 3) fluency.

4.2.1 Automatic Evaluation

Style Control

Evaluated with a sentiment classifier trained with fastText.

Content Preservation

BLEU scores are computed with NLTK.
A high BLEU score means that many words of the sentence are kept, which is taken as a sign of content preservation.
Where human references are available, h-BLEU is measured as well.

ref-BLEU, human-BLEU
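
To make the "many words are kept" reading concrete, here is a simplified single-reference BLEU in plain Python. It is not NLTK's implementation (no smoothing, one reference, uniform weights); with NLTK one would normally call `nltk.translate.bleu_score.sentence_bleu` instead. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: Sequence[str], reference: Sequence[str], max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += math.log(overlap / total) / max_n
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_sum)
```

Scoring a transferred sentence against its own source rewards copying, while scoring against a human rewrite checks agreement with a reference, which is why the two numbers are reported separately.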

Fluency

A 5-gram LM is trained with KenLM and used to measure perplexity (PPL).
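
Perplexity follows directly from per-token log probabilities. The sketch below assumes base-10 logs, which is what n-gram LMs in ARPA format (including KenLM) report; the function name and input format are illustrative and no KenLM API is called.

```python
from typing import Sequence


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity from per-token log10 probabilities: 10 ** (-mean log10 p)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    avg = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg)
```

Lower perplexity means the language model finds the transferred sentence more predictable, which is used as a proxy for fluency.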

4.2.2 Human Evaluation

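Automatic evaluation alone is not sufficient, so human evaluation was carried out on both datasets.
100 samples per sentiment were drawn at random, and 3 people were shown the source input together with the transferred sentences from several models and asked to pick the best one under the following criteria.

There are three criteria: style control, content preservation, and fluency (so a best output is picked for each of the three).
A "no preference" option is also offered.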

4.3 Training Details

4-layer Transformer & 4 attention heads in each layer
hidden size, embedding size, and positional encoding size in Transformer are all 256 dimensions
Another embedding matrix with 256 hidden units is used to represent different style (it enters the encoder as an extra token of the input sentence)
positional encoding isn’t used for the style token
The discriminator uses a <cls> token.
Empirically, random word dropout helped when computing the self reconstruction loss (Eq. 6 is said to converge better and lead to better performance).
Adding a temperature parameter to the softmax, with a sophisticated temperature decay schedule, is also said to help get better results.
Actually trying all of this would take quite a lot of experiments.. (for example, how to set parameters such as the temperature)
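
Random word dropout on the input of the self reconstruction step could look like the sketch below. Dropping tokens outright (rather than replacing them with a mask token) and keeping at least one token are my assumptions; the notes above do not specify either.

```python
import random
from typing import List, Optional, Sequence


def word_dropout(tokens: Sequence[str], p: float, rng: Optional[random.Random] = None) -> List[str]:
    """Drop each token independently with probability p before reconstruction."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= p]
    # Assumption: never return an empty sentence; fall back to the first token.
    return kept if kept else list(tokens[:1])
```

The noisy sentence is fed to the encoder while the clean sentence stays the reconstruction target, which turns the plain autoencoder objective into a denoising one.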

4.4 Experimental Results

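Compared with previous methods, content preservation is better overall.
The conditional model does style control better than the multi-class model.
Both models have relatively low PPL.
Among the other models, those that are best on a single metric have clear weaknesses on the others.
Human evaluation compared against DeleteAndRetrieve and Controlled Generation.
The authors' model in this comparison was the multi-class model.
More than 400 people took part in the human evaluation.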

4.5 Ablation Study

An ablation study was run on the Yelp dataset.
Analysis of the results is omitted here.

5. Conclusions and Future Work

Style Transformer with a novel training algorithm for text style transfer task.
competitive or better performance compared to previous state-of-the-art approaches
because our proposed approach doesn’t assume a disentangled latent representation for manipulating the sentence style, our model can get better content preservation on both of two datasets
Future work: extend the method to multiple-attribute transfer.
The back-translation training method could also be applied (how is that different from cycle reconstruction?)

Reference

https://arxiv.org/pdf/1905.05621v2.pdf
