NL-045, Transforming Delete, Retrieve, Generate Approach for Controlled Text Style Transfer (2019-EMNLP)

0. Abstract

Text style transfer is the task of changing the stylistic attributes of a text while preserving its content (the non-stylistic information).
This paper studies the task on non-parallel data with the Generative Style Transformer (GST).
GST builds on a large unsupervised pre-trained language model with a Transformer architecture.
GST is part of the "Delete Retrieve Generate" framework, and the paper also proposes a new way of deleting style attributes from the source sentence.
The model reaches SoTA on five datasets covering sentiment, gender, and political slant transfer.
For automatic evaluation the paper uses the GLEU metric, which it reports to be closer to human ratings than the previously used BLEU score.

1. Introduction

Text style transfer can be applied in areas such as the following.

adapting conversational style in dialogue agents
obfuscating personal attributes (such as gender) to prevent privacy intrusion
altering texts to be more formal or informal
generating poetry

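The biggest problem is the lack of parallel corpora.

There are no data pairs that share the same content and differ only in style.
The task therefore has to be approached with non-parallel data.

Earlier methods adversarially learned a latent representation that disentangles style from content in a sentence.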

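These methods have five main drawbacks.

They are hard to train and take long to converge.
They must be re-trained to change the trade-off between content preservation and style transfer.
They suffer from the limitations of latent disentangled representations.
They produce poor-quality sentences (according to human ratings).
They offer no fine-grained control over the target attributes.

If the paper really solves all of these... we will see.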

Li et al. (the Delete, Retrieve, Generate paper) observe that only a small subset of the words in a sentence determines its style attribute. (In other words, a few words largely decide the attribute.)

Li et al. therefore proceed as follows.

delete only the set of attribute words from a sentence to give the content
retrieve attribute words from the target style corpus
use a neural editor (an encoder-decoder LSTM) to generate the final sentence from the content and retrieved attributes.

DRG produces better-quality output than earlier methods, but it is prone to the following problems.

removing core content words which would preserve crucial context
failing to remove source style attributes that should be replaced with target style attributes
the LSTM-based encoder-decoder model not being robust to errors made by the Delete and Retrieve models
generating sentences that are not fluent, by abruptly forcing retrieved attributes into the source sentence
failing on longer input sentences

This paper performs style transfer with an already pre-trained unsupervised language model (a Transformer).

The Transformer used here is called the Generative Style Transformer (GST).

It keeps the DRG framework of Li et al. but fixes several of its shortcomings.

Delete: instead of a TF-IDF-based method, use learned attention weights, taken from another Transformer referred to as the Delete Transformer (DT).
Generate: generate the sentence with GST, which does away with the need for (and consequent shortfalls of) a sequence-to-sequence encoder-decoder architecture using LSTMs.

The model reaches SoTA on sentiment, gender, and political slant transfer.

The method has the advantage of being simple, and it shows that a Transformer can also be used this way.

2. Our Approach

Dataset: $D = \{ (x_1, s_1), ...,(x_m, s_m) \}$
The goal is to learn $P(y|x,s^{tgt})$, where

$Style(y)=s^{tgt}$

The DRG framework has three steps.

Delete model

In $P(c,a|x)$, c and a are the non-stylistic and stylistic components of x, respectively.
$Style(c) \notin S$ (because c carries no style information.)
That is, x must be separable into c and a, and conversely x must be reconstructable from c and a.

Retrieve model

This step is optional.
Target attributes $a^{tgt}$ are retrieved from $D_{s^{tgt}}$, specifically from sentences whose content resembles the content c of the input x.
$D_{s^{tgt}}$ is the subset of the data that has the target style.

Generate model

The generate model comes in two variants, depending on whether Retrieve is used.
a) Without Retrieve, it learns the target distribution $P(y|c,s^{tgt})$.
b) With Retrieve, it learns the target distribution $P(y|c,a^{tgt})$.
In both cases $Style(y)=s^{tgt}$.

2.1 Delete

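Example input sentence: "The restaurant was big and spacious"

Style transfer task: positive to negative
Delete model: deletes "big" and "spacious"

Attribute deletion is based on the "input reduction" method (Feng, 2018).

Once the attributes a are deleted from the sentence, the style classifier should become confused about its style.
To that end, each token of x is assigned an importance score, and these scores are used to separate the style attributes from the content.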

2.1.1 Delete Transformer

attention-based style classifier

v[i] is the tensor that encodes token x[i].
$\alpha$ holds the attention weights: $\alpha[i]$ is the weight given to v[i] when deciding the style.

Here the Delete Transformer (DT) is built as a BERT-based classifier.

However, the DT has multiple attention heads and multiple blocks (layers), so obtaining a single set of attention weights $\alpha$ is a non-trivial task.
It is complicated by the fact that each layer and head encodes the input from the viewpoint of different semantic and linguistic structures.
The paper therefore proposes a novel method that picks out a specific attention head and layer combination that encodes style information, and uses its attention weights as importance scores.

Attribute extraction

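As in BERT, a [CLS] token is prepended to the sentence tokens.
Because BERT classifies from the [CLS] token, the attention that [CLS] pays to the other tokens indicates which tokens matter for the style decision.
For each head-layer pair <h, l>, attention scores are extracted from the [CLS] row of that head's attention.

That is, each layer has several heads, and for each head the attention score between [CLS] and every other token w is computed from QK^T.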
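
A minimal sketch of the [CLS]-to-token attention for a single head. The 2-dimensional query and key vectors, the token list, and the helper name `cls_attention` are all illustrative assumptions, not values or code from the paper; a real DT would read these weights from a trained BERT classifier.

```python
import math

def cls_attention(q_cls, keys):
    """Softmax attention weights of the [CLS] query over every key vector."""
    d_k = len(q_cls)
    # Scaled dot products q_cls . k_w / sqrt(d_k), one per token w.
    logits = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d_k) for key in keys]
    # Numerically stable softmax over the tokens.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vectors for one head-layer pair <h, l>.
tokens = ["[CLS]", "the", "food", "was", "awful"]
q_cls = [1.0, 0.5]
keys = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.3], [0.0, 0.0], [2.0, 1.0]]
scores = cls_attention(q_cls, keys)
top = tokens[max(range(len(tokens)), key=scores.__getitem__)]
```

With these made-up vectors the largest weight falls on "awful", which is the kind of token the Delete step is meant to flag.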

As a reminder, the Transformer computes attention from queries Q, keys K, and values V (scaled dot-product attention).
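
For reference, this is the standard scaled dot-product attention of the Transformer, with $d_k$ the key dimension. It is the textbook definition rather than a formula copied from the reviewed paper; the attention scores used for deletion correspond to the [CLS] row of the softmax term.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$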
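The $\gamma |x|$ tokens of x with the highest attention scores are then deleted.

This is called the reduction process.
$\gamma$ controls how much is deleted.
$\gamma$ is a parameter that depends on the dataset.
$|x|$ is the number of tokens in x.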
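
The reduction step can be sketched as below. The importance scores are made up, the function name `reduce_sentence` is hypothetical, and rounding $\gamma |x|$ up with `ceil` is an assumption, since these notes do not say how the paper rounds.

```python
import math

def reduce_sentence(tokens, scores, gamma):
    """Delete the ceil(gamma * |x|) highest-scoring tokens; return (content, attributes)."""
    n_delete = math.ceil(gamma * len(tokens))  # rounding up is an assumption
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    deleted = set(ranked[:n_delete])
    # Keep the original word order in both outputs.
    content = [t for i, t in enumerate(tokens) if i not in deleted]
    attributes = [t for i, t in enumerate(tokens) if i in deleted]
    return content, attributes

tokens = ["the", "restaurant", "was", "big", "and", "spacious"]
scores = [0.05, 0.10, 0.05, 0.35, 0.05, 0.40]  # hypothetical importance scores
content, attributes = reduce_sentence(tokens, scores, gamma=0.3)
```

For this toy input the content becomes "the restaurant was and" and the attributes are "big" and "spacious".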

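The sentence that remains after deletion is written $x'_{h,l}$.
From $x'_{h,l}$ a score $z(x'_{h,l})$ is computed using Eq. 4 of the paper.
Here $\lambda$ is a smoothing parameter.
s is the style to which the classifier assigns the highest probability, and s' = S - {s}. That is, the s' term is the sum of the scores of all styles other than the most probable one.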
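
Eq. 4 itself is not reproduced in these notes. From the description in this section (a comparison of the probability of s against that of the remaining styles, smoothed by $\lambda$), it presumably has the following form. This is a reconstruction under that assumption, not a quotation, and should be checked against the paper.

$$z(x'_{h,l}) = \frac{p(s \mid x'_{h,l}) + \lambda}{p(s' \mid x'_{h,l}) + \lambda}, \qquad p(s' \mid x'_{h,l}) = \sum_{t \in S \setminus \{s\}} p(t \mid x'_{h,l})$$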
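Take the positive/negative task and assume the classifier is accurate. What if, after deleting a few attributes, the z score for positive is still high?
Then the positive probability is still many times the negative one, which means the wrong tokens were deleted.
The goal, after all, is that once the attributes are deleted the classifier can no longer tell positive from negative.
$<h_s, l_s>$ is the <h, l> pair whose average Eq. 4 score is the smallest among all <h, l> pairs.

In short, it is not known in advance which head and layer's attention scores should drive the deletion, so all of them are tried and the one that confuses the style classifier the most is chosen.
The resulting $x'_{h_s, l_s}$ becomes the content c, and the deleted tokens become the attributes a.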
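
A small sketch of this search over head-layer pairs. Everything here is illustrative: the classifier outputs are invented, `z_score` uses an assumed smoothed-ratio form of the score (probability of s over the summed probability of the other styles, each plus $\lambda$), and taking s from the classifier's prediction on the unreduced sentence is also an assumption.

```python
def z_score(probs, s, lam=0.1):
    """Smoothed ratio between the probability of style s and all other styles."""
    p_s = probs[s]
    p_rest = sum(p for style, p in probs.items() if style != s)
    return (p_s + lam) / (p_rest + lam)

def select_head_layer(reduced_probs, styles, lam=0.1):
    """Return the <h, l> pair whose reduced sentences have the lowest mean z score.

    reduced_probs[(h, l)][i] is the classifier's style distribution for sentence i
    after deleting tokens ranked by the attention of head h in layer l;
    styles[i] is the style assumed for the unreduced sentence i.
    """
    def mean_z(pair):
        zs = [z_score(p, s, lam) for p, s in zip(reduced_probs[pair], styles)]
        return sum(zs) / len(zs)
    return min(reduced_probs, key=mean_z)

# Hypothetical classifier outputs for two sentences and three <h, l> pairs.
styles = ["pos", "neg"]
reduced_probs = {
    (0, 0): [{"pos": 0.95, "neg": 0.05}, {"pos": 0.10, "neg": 0.90}],
    (3, 9): [{"pos": 0.55, "neg": 0.45}, {"pos": 0.40, "neg": 0.60}],
    (7, 11): [{"pos": 0.80, "neg": 0.20}, {"pos": 0.30, "neg": 0.70}],
}
best = select_head_layer(reduced_probs, styles)
```

Here the pair (3, 9) wins because deleting by its attention leaves the classifier closest to a coin flip.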

Evaluation of Extracted Attributes

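Human evaluation was carried out on Amazon Mechanical Turk.
200 randomly sampled sentences were used.
The paper's method deleted all style attributes in 89% of cases and wrongly deleted non-style words in 12% of cases.
The method of Li et al., used for comparison, scored 67% and 29% respectively.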

2.2 Retrieve

Retrieval relies on a distance metric d between sentence representations; cosine similarity was used for the comparison.
Sentence representations that were tried:

TF-IDF weighted
GloVe embeddings averaged over all tokens of a sentence
Universal Sentence Encoder (USE)

Of these, TF-IDF worked best.
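
A toy version of TF-IDF retrieval with cosine similarity, written with the standard library only. The sentences, the function names, and the smoothed IDF formula are assumptions for illustration; the paper's exact weighting and preprocessing are not given in these notes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weights per document, with smoothed IDF: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def retrieve(content, target_contents):
    """Index of the target-corpus content closest to `content` under cosine similarity."""
    vecs = tfidf_vectors([content] + target_contents)
    query, candidates = vecs[0], vecs[1:]
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# Hypothetical content strings (style attributes already deleted).
source_content = "the staff at this restaurant was".split()
target_contents = [
    "the pizza arrived".split(),
    "the staff at the restaurant was".split(),
    "my phone case is".split(),
]
best = retrieve(source_content, target_contents)
```

The attributes that were deleted from the retrieved target sentence (index 1 here) would then serve as $a^{tgt}$.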

2.3 Generate

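The method leverages transfer learning:
an unsupervised LM trained on a large corpus is adapted to the task.
GPT is used as the unsupervised LM; it is a 'decoder-only' Transformer.
GST (Generative Style Transformer) is built on top of GPT.

GST uses masked attention heads and processes the sequence from left to right.
That is, a token cannot see the tokens to its right.
The design follows the observation that pre-trained models reach SoTA in many areas (as with BERT).

So how is the input encoded? There is no separate encoder: the attribute and the content are given as a prefix, and the sentence is generated token by token after them.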

2.3.1 Variants of GST (B-GST and G-GST)

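The paper provides two versions of GST:
B-GST (Blind Generative Style Transformer) and G-GST (Guided Generative Style Transformer).
They differ in how they are trained given a sentence x, the source style $s^{src}$, and the target attributes $a^{tgt}$.
B-GST

This corresponds to the variant of Li et al. without Retrieve; it is used when the target attributes $a^{tgt}$ are not available.
Only c and a style label enter the model: $s^{src}$ when reconstructing the input during training, $s^{tgt}$ when generating a target-style sentence.
It is called blind because the specific desired target attributes are unknown.

G-GST

The model takes c and $a^{tgt}$ as input and generates a target-style sentence as output.
Here the desired attributes obtained through Retrieve guide the generation.
The paper gives two main reasons why this is useful.

If the target corpus contains sentences similar to those in the source corpus, retrieval reduces the sparsity of target attributes that the model has to cope with.
Specifying the target attributes at generation time gives fine-grained control over the output. (This also works without the Retrieve component; presumably one simply skips retrieval and supplies the desired attributes directly.)

The second point is an important capability that other style transfer methods based on latent representations lack.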

2.3.2 Input Representation and Output Decoding

As in BERT, special tokens are used, including ones that mark the target style.
B-GST

a) target style: $s^{tgt}$
b) start of content
c,d) start of output, followed by all target tokens up to step t-1
Example) "the food here is very tasty"
style: positive, target: negative, content: the food here is very
input at t=2) [(a) negative, (b) the, (output_1) food]
input at t=3) [(a) negative, (b) the, (output_1) food, (output_2) here]

G-GST

The same, except that the target style is replaced by the target attributes.
Example) "the food here is very tasty"
style: positive, target attribute: tasteless, content: the food here is very
input at t=2) [(a) tasteless, (b) the, (output_1) food]
input at t=3) [(a) tasteless, (b) the, (output_1) food, (output_2) here]

The figure in the paper shows G-GST; B-GST is identical except that it has no Retrieve component.
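
One way to picture the input layout is as a single flat token sequence. In the sketch below the special token names (`<NEG>`, `<ATTR_START>`, `<CONTENT_START>`, `<START>`) and the helper `build_input` are hypothetical placeholders, not the tokens used in the authors' code.

```python
def build_input(conditioning, content, generated):
    """Flatten conditioning tokens, content, and the output so far into one sequence."""
    return (
        list(conditioning)
        + ["<CONTENT_START>"] + list(content)
        + ["<START>"] + list(generated)
    )

content = ["the", "food", "here", "is", "very"]

# B-GST: condition on a target-style token.
b_gst = build_input(["<NEG>"], content, generated=["the", "food"])

# G-GST: condition on retrieved target attribute words instead.
g_gst = build_input(["<ATTR_START>", "tasteless"], content, generated=["the", "food"])
```

At each step the decoder sees this whole prefix and predicts the next output token, which is then appended to `generated`.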

So at timestep t, GST estimates the following.

B-GST: $p(y_t|c, s^{tgt}, y_1,y_2,\cdots,y_{t-1})$
G-GST: $p(y_t|c, a^{tgt}, y_1,y_2,\cdots,y_{t-1})$
A softmax layer is applied on top of the topmost Transformer block.

Training uses teacher forcing (or guided approach).
At test time beam search is used.

look-left window of 1
beam width 5
Of the sentences produced this way, the one with the highest target-style match score under the Delete Transformer is selected.
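
The final selection among beam candidates can be sketched as a simple argmax. The scorer below is a stand-in: it counts words from a tiny made-up lexicon, whereas the paper uses the Delete Transformer's target-style score. The candidate sentences and all names are hypothetical.

```python
def pick_best(candidates, style_score):
    """Return the beam candidate with the highest target-style score."""
    return max(candidates, key=style_score)

# Stand-in scorer: fraction of tokens found in a tiny negative-word lexicon.
NEGATIVE_WORDS = {"bland", "cold", "rude", "awful"}

def toy_negative_score(sentence):
    tokens = sentence.split()
    return sum(tok in NEGATIVE_WORDS for tok in tokens) / len(tokens)

# Five hypothetical candidates, matching a beam width of 5.
beam = [
    "the food here is very good",
    "the food here is very bland",
    "the food here is very bland and cold",
    "the food here is fine",
    "the food here is okay",
]
best = pick_best(beam, toy_negative_score)
```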

2.3.3 Training

B-GST

G-GST

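When training G-GST, plain reconstruction is trivial,
because generating the sentence x by recombining $c_x$ and $a_x$ is not hard.
What training should teach the model is to place target attributes into the context of the source content and produce a fluent target-style sentence.
That is a non-trivial problem, so noise is added to the attributes during training.
For 10% of the examples the attributes are chosen at random (5% from the source, 5% from the target); the resulting attributes are written $a'_x$.
This is similar to the approach of Li et al.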
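
A minimal sketch of this noising step, under one possible reading of the terse description: with probability 5% the attributes are swapped for those of a random source-style sentence, with another 5% for those of a random target-style sentence, and otherwise kept. The attribute pools and the function name `noise_attributes` are hypothetical.

```python
import random

def noise_attributes(attrs, source_pool, target_pool, rng, p_source=0.05, p_target=0.05):
    """Return a'_x: the original attributes, or a random replacement with small probability."""
    r = rng.random()
    if r < p_source:
        return rng.choice(source_pool)
    if r < p_source + p_target:
        return rng.choice(target_pool)
    return attrs

rng = random.Random(0)  # fixed seed so the run is reproducible
source_pool = [["great"], ["delicious"]]  # hypothetical attribute sets, source style
target_pool = [["awful"], ["bland"]]      # hypothetical attribute sets, target style
original = ["friendly"]
noised = [noise_attributes(original, source_pool, target_pool, rng) for _ in range(10_000)]
replaced = sum(a != original for a in noised) / len(noised)
```

Over many draws `replaced` comes out close to 0.10, so about nine in ten training examples still see their own attributes.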

2.3.4 Model Details and Pre-training

Implemented in PyTorch with HuggingFace.
GST has a maximum sequence length of 512, 12 blocks (layers), and 12 attention heads per block.
All internal states (keys, queries, values, word embeddings, positional embeddings) are 768-dimensional.
Tokenization uses BPE.

3. Experiments

3.1 Datasets

3.1.1. Yelp, Amazon & ImageCaption:

from https://github.com/lijuncen/Sentiment-and-Style-Transfer/tree/master/data

3.1.2. Political

from http://tts.speech.cs.cmu.edu/style_models/political_data.tar
Comments on the Facebook posts of members of the US Senate and House, labeled Republican or Democrat.

3.1.3. Gender

from http://tts.speech.cs.cmu.edu/style_models/gender_data.tar
Reviews of Yelp food businesses, labeled male or female.

3.2 Comparison to previous Works

Yelp, Amazon, Captions

StyleEmbedding (SE) (Fu et al., 2018)
MultiDecoder (MD) (Fu et al., 2018)
CrossAligned (CA) (Shen et al., 2017)
DeleteOnly (D)
DeleteAndRetrieve (D&R) of Li et al. (2018)
Human reference gold standards (H)

Political and Gender

Prabhumoye et al. (BT) (2018)

4. Evaluation of Results

Content preservation of the nonstylistic parts of the source sentence
Style transfer strength of the stylistic attributes to the target style
Fluency and correct grammar of the generated target sentence (Mir et al., 2019).

Both human and automatic evaluation were carried out.

4.1 Human Evaluation

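The Political and Gender datasets were not evaluated by humans, because the authors judged that MTurkers would find it hard to assess those styles. (Presumably because political and gender cues are something only locals would pick up on.)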

4.2 Automatic Evaluation

Target style strength

A style classifier was trained with FastText on the datasets in Table 1.
On the test sets it reaches 98%, 86%, 80%, 92%, and 82% accuracy for Yelp, Amazon, Captions, Political, and Gender, in that order.

Content preservation

BLEU score between the generated sentence and the source sentence (this appears to be self-BLEU).
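
To make the idea concrete, here is the clipped n-gram precision that BLEU is built from, computed between an output and its source. It is a simplified building block, not full BLEU (no brevity penalty, no averaging over n-gram orders), and the sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate n-grams also found in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

source = "the food here is very tasty".split()
output = "the food here is very bland".split()
p1 = clipped_precision(output, source, 1)       # 5 of 6 unigrams match
p2 = clipped_precision(output, source, 2)       # 4 of 5 bigrams match
copy_p2 = clipped_precision(source, source, 2)  # copying the source matches everything
```

Note that an unchanged copy of the source gets a perfect score, so overlap with the source measures content preservation only and says nothing about whether the style changed.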

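Fluency

A large pre-trained language model, OpenAI GPT-2, is fine-tuned and used to measure perplexity.
The paper points out that its generation model is based on GPT-1, so the evaluator is a different model. Still, since GPT-2 builds on GPT-1, my guess is that this evaluation gives the paper's models at least a small advantage.
It was also trained on the Table 1 datasets; its perplexity (PPL) on the test sets is 24, 33, 34, 63, and 81 for Yelp, Amazon, Captions, Political, and Gender, in that order.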
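
As a reminder of what PPL measures, the sketch below computes perplexity from per-token log-probabilities using the standard definition, the exponential of the negative mean log-probability. The token probabilities are invented; in the paper they would come from the fine-tuned GPT-2.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities assigned by a language model.
fluent = [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)]
disfluent = [math.log(p) for p in (0.25, 0.01, 0.25, 0.01)]
ppl_fluent = perplexity(fluent)
ppl_disfluent = perplexity(disfluent)
```

Lower is better: the sentence whose tokens the model finds unlikely ends up with a perplexity five times higher in this toy case.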

GLEU

Reported in order to compare an automatic metric with human judgements.
GLEU (Generalized Language Evaluation Understanding metric) was originally proposed for grammatical error correction (GEC).
BLEU only looks at the relation between the target reference and the generated output.
GLEU additionally takes the source sentence into account.
The paper explains why GLEU suits style transfer as follows.

penalizes words of the source that were wrongly changed in the generated sentence
rewards words that were successfully changed
rewards those that were successfully retained from the source sentence to match those in the reference sentence.

The paper runs the GLEU implementation provided by Napoles et al. (2015). To understand why GLEU works well for style, I still need to look into how GLEU is computed.

4.3 Result Analysis

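According to the authors, the sentences generated by their models are realistic and natural-sounding, and clearly better than those of the other systems.
G-GST comes out worse than B-GST in these results, which the authors attribute to the weak Retrieve mechanism.
With a better Retrieve algorithm, or with properly chosen guiding attributes, G-GST should be able to generate with the attributes under control.
Also, as Li et al. noted, human evaluation correlates poorly with model-based metrics such as PPL and accuracy.

If a style transfer system simply returned a sentence picked from the target training corpus, it would score very high on these metrics.
For example, the BT model in Table 5 changes the style well, yet its BLEU score is lower than that of B-GST.
The point is to explain why the paper's models were rated better by humans while falling behind other models on PPL or accuracy in the automatic evaluation.
The reason: a model that at test time just emits one of the target training sentences changes the style perfectly, and its PPL is low as well, because the evaluation LM was trained on that same data.
Its BLEU score, however, can hardly be high. Seen this way, automatic metrics and human scores are hard to correlate.
So no automatic metric should be read in isolation!

Human references are also available, and scoring them with the automatic metrics gives poor results, as the paper's tables show.

This shows that classifier accuracy disagrees with human ratings, so it cannot be trusted on its own.
The same happens with BLEU: a system that simply copies the source sentence to the output gets a high BLEU score.
But that is not real style transfer, so the BLEU score should not be read in isolation either.

GLEU, in contrast, takes the source into account and appears to balance target style match against content retention.

A detailed statistical correlation study of GLEU is left to future work.
The thing to remember is that GLEU does not share the weaknesses of the other automatic metrics described above.

Even so, the automatic metrics remain useful as indicators.

In Tables 4 and 5 the BLEU scores are SoTA,
which indicates that the non-stylistic content is well preserved.

Figure 2 shows that, compared with the earlier D&R system, the length of the sentences generated by B-GST tracks the length of the input sentences more closely.

B-GST achieves reasonable PPL scores regardless of the dataset.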

5. Related Work

Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation (2019) (worth reading)

This paper leverages a Transformer in an adversarial generator-discriminator setting.
It could not be compared against, because neither code nor outputs had been released.

Multiple-attribute text style transfer

I have already posted a review of this paper; it could not be compared against for the same reason.

6. Conclusion

SoTA on sentiment, gender, and political slant transfer.
The approach leverages the DRG framework together with a pre-trained LM based on the Transformer network.

Reference

Paper: https://arxiv.org/pdf/1908.09368.pdf
Code: https://github.com/agaralabs/transformer-drg-style-transfer
