NL-033, Style Transfer in Text: Exploration and Evaluation (2018-AAAI)

0. Abstract

Style transfer is an important capability for the progress of AI, for both text and images.
In language style transfer, however, research is still not very active because of the lack of parallel data and of reliable evaluation metrics.
This paper tackles the lack of parallel data by learning style transfer from non-parallel data. (Earlier papers were all non-parallel as well.)
The key idea is to learn content representations and style representations separately through adversarial networks. (Earlier papers all do this too; it is the standard recipe.)
To address the lack of principled evaluation metrics, it proposes two novel evaluation metrics (that is, two aspects to measure).

transfer strength 관점
content preservation 관점

Two style transfer tasks are carried out.

paper-news title transfer
positive-negative review transfer

The results show that the proposed content preservation metric correlates highly with human judgments.
They also show that the proposed models transfer style better than a plain auto-encoder.

1. Introduction

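Style transfer matters in AI (NLP, CV) from the standpoint of systems that generate novel content.
It applies to many NLP tasks. For example, automatically turning paper titles into news titles would reduce the human effort behind academic news reports.
In a task such as poetry generation, changing the style would make it possible to generate poems in other styles.
So far, though, style transfer in language processing lags behind because of the lack of parallel data and of reliable evaluation metrics.
Seq2Seq has been applied successfully to translation, dialogue systems, image captioning, and more.
But those results relied on large amounts of parallel data, which is hard to obtain in practice.
For example, few papers have a matching academic news report. (Exactly which task this means... let's find out later.)
The paper therefore works with non-parallel data.
The major challenge is separating style from content, something that has already been proposed for CV.
In NLP it is still under study, and how to do the separation is an open research problem.
Translation and summarization are evaluated with the BLEU and ROUGE metrics.
In style transfer generation, however, parallel data is also lacking at test time, a problem that CV shares.
The authors therefore propose a general evaluation metric for style transfer in NLP.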

The proposed metric covers two aspects.

Transfer strength
Content preservation

Text style transfer models face two problems.

lacking parallel training data
hard to separate the style from the content

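Two models are proposed here.

The first is a multi-decoder seq2seq: the sentence is encoded into content, and one of several decoders decodes it into the desired style.
The second is a seq2seq with style embedding: the sentence is encoded into content, combined with a separate style embedding, and then decoded. There is a single decoder, and the style fed in at the middle is what changes. (This seems to be the method from the paper I reviewed earlier.)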

2. Contribution

The paper lists three contributions.

A paper-news title dataset built for language style transfer.
Two general evaluation metrics for style transfer. The proposed evaluation metric correlates highly with human evaluation.
Two models for learning style transfer when parallel data is lacking.

3. Related Work

3.1 Style Transfer in Computer Vision

In images, content and style are separated and then recombined to produce a new image.
These methods represent style from a single image, whereas in NLP a single sentence (or a short article) is not enough to carry style information.
CycleGAN is a representative method for images.
Li et al. proposed treating style transfer as a domain adaptation problem.
In theory, matching Gram matrices is said to be equivalent to minimizing the MMD (Maximum Mean Discrepancy).
For text, however, no comparable metric is said to exist.

3.2 Style Transfer in Natural Language Processing

Jhamtani et al. attempted the task of converting English into Shakespearean English.
They used a Seq2Seq model trained on parallel data.
A Pointer Network was used to enrich the candidate words.
But parallel data is costly to build, and in general none is available.
Work on non-parallel data includes Jaakkola's (I should probably read it...) and NL-032. (NL-031 exists too, though it is not cited.)

However, Jaakkola's work has no meaningful evaluation of style transfer.
The evaluation used in NL-032 is classification accuracy.
So this paper raises the evaluation problem and proposes new evaluation metrics for it!

Other related lines of work are the following.

style prediction
CRNN(conditioned recurrent linguistic style)
I should look into what these are at some point...
These differ from style transfer, though, in that there is no source sentence to be changed.

3.3 Adversarial Networks for Domain Separation

Quite a few studies apply adversarial training to the domain separation problem.
Ganin and Lempitsky try to learn domain-invariant features.

It seems to have been done on digits such as MNIST / SVHN; worth a quick look.
The model is trained on source domain data and unlabeled target domain data.
That is, it learns a shared representation of the two domains. (It has a slight continuous learning feel to it.)
So the representation does not carry the individual features of each domain.

Chen et al. use a multi-task framework to produce shared and private representations of sentences.

It did reinforcement learning with an adversarial net.

Long, Wang, and Jordan propose a joint adaptation network that tries to maximize the joint maximum mean discrepancy.
What sets this paper apart from these models is that they do not generate new sentences.
The related work in 3.3 looks like research from the representation angle, but it all seems worth reading... (when, though...)

4. Model

The paper proposes two models.

Multi-decoder
Style-embedding 모델

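What the two models have in common is that both learn content information.
The difference is that the multi-decoder model has one decoder per style: the encoder extracts the content, and the decoder for the desired style generates the sentence!
The style-embedding model learns style embeddings in addition to the content from the encoder. A single decoder then generates a sentence in the desired style by swapping in a different style embedding.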

4.1 Background: Auto-encoder Seq2Seq Model

This part introduces the AE, so I skip it.
The Seq2Seq AE is used as the base model reportedly because the output should be generated without changing the input sentence much.

4.2 Encoder

Both the encoder and the decoder are GRUs.
I skip the GRU explanation.

The encoder is expressed as in Eq. (5). (theta_e denotes the encoder parameters.)

4.3 Decoder

The last state of the encoder is taken as the start of the generation process.
Generation uses a softmax and follows Eq. (6).

The loss function minimizes the NLL and is given by Eq. (7). (M is the size of the training data.)

4.4 Multi-decoder Model

In other words, an auto-encoder with several decoders.
As mentioned, the representation extracted by the encoder should carry only content information and reflect no style.
The style-specific decoder then turns this content representation into the desired sentence.
If a plain AE is trained as is, the encoder's representation reflects both content and style, so how can it be made to reflect content only!?
In Chen's work (from the related work above), an adversarial network is used to separate shared and private features.
This paper similarly uses an adversarial network to separate the content representation c from the style.
The adversarial network consists of two parts.

Classifying the style of x given the representation learned by the encoder

The loss function minimizes the negative log probability of the style labels, as in Eq. (8).
That is, the encoder's representation is fed into a classification MLP, which is trained on it.
Theta_c denotes the classifier parameters.

Making the classifier unable to identify the style of x by maximizing the entropy (minimizing the negative entropy) of the predicted style labels

In other words: raise the entropy so that the encoder's representation cannot be classified correctly!

With these two parts, the encoder ends up unable to carry style information.

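An example)
Suppose the input is x="그 영화는 재밌다" ("That movie is fun"), labeled positive.
Suppose there is a classifier that can tell encoder(x) is positive. -(1)
Our goal is that when encoder(x) is fed into this classifier, the classifier cannot tell whether it is positive or negative. -(2)
(1) and (2) correspond to part 1 and part 2: (1) trains the classifier, and (2) raises the entropy so that the classifier cannot tell the styles apart.
So Eq. (8) trains the classifier and therefore updates theta_c, while Eq. (9) updates theta_e to fool the classifier!

Eq. (10) is the loss that lets each decoder generate sentences in its specific style.
It can be seen as similar to Eq. (7).

The total loss is the sum of Eqs. (8), (9), and (10), simply unweighted according to the paper. (Personally I suspect tuning the weights was too much of a hassle. Well, the idea is what matters...)
Training then proceeds with Eq. (11)!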
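
To make the two adversarial parts concrete, here is a minimal sketch in plain Python. It assumes the style classifier outputs one logit per style; the function names and the example logits are made up for illustration, and the real model computes these losses over an MLP on top of the GRU encoder rather than over hand-written numbers.

```python
import math

def softmax(logits):
    """Turn raw classifier scores into a probability distribution over styles."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_loss(logits, label):
    """Part 1 (Eq. 8 in these notes): negative log-probability of the true style.
    Minimizing this trains the classifier parameters (theta_c)."""
    return -math.log(softmax(logits)[label])

def negative_entropy(logits):
    """Part 2 (Eq. 9 in these notes): negative entropy of the predicted styles.
    Minimizing this with respect to the encoder (theta_e) pushes the prediction
    toward uniform, i.e. the classifier can no longer tell the style."""
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

# Hypothetical classifier outputs for one encoded sentence (index 0 = positive).
confident = [4.0, -4.0]  # classifier is sure the sentence is positive
confused = [0.0, 0.0]    # classifier cannot decide

print(classifier_loss(confident, 0), classifier_loss(confused, 0))
print(negative_entropy(confident), negative_entropy(confused))
```

The classifier prefers the confident case (lower Eq. 8 loss), while the encoder prefers the confused case (lower negative entropy), which is the tug-of-war described above.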

4.5 Style-embedding Model

The second model is said to be inspired by Li et al.

Li's work is a dialogue model: in persona conversation, personal information is encoded as a vector representation.
An RNN conditioned on the other content and the style then generates the sentence.
That paper is well known too, but I haven't... read it... let's read it when there's time.

The encoder in this model works exactly as in the multi-decoder model; it produces the content representation.
In addition, style embeddings $\textbf{E}\in R^{N \times d_s}$ represent the styles, where N is the number of styles and $d_s$ is the dimension of the style embedding.

So there is a matrix called the style embedding.
For example, with three styles in total, "past", "present", and "future", E is a $3 \times d_s$ matrix.
Conditional GAN or NL-031 can be thought of as using a one-hot encoding for this style embedding.
Here a one-hot encoding is not used; the style embedding matrix is learned instead!

The concatenation [c;e] goes into the single decoder, which generates sentences in different styles.
The total loss is given by Eq. (12).

L_gen2 takes the same form as Eq. (7) but includes the style embedding parameters.
L_adv1 and L_adv2 are the same as Eqs. (8) and (9).
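
The [c;e] step can be sketched as a lookup plus a concatenation. This is only an illustration under assumptions: the matrix is randomly initialized here (in the model it is learned), the helper names are hypothetical, and the dimensions are toy values.

```python
import random

def make_style_embedding(num_styles, dim, seed=0):
    """Build an N x d_s style matrix E with small random values.
    In the model these values are trained; here they are placeholders."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_styles)]

def decoder_input(content, style_matrix, style_id):
    """Concatenate the content vector c with row style_id of E, giving [c; e]."""
    return list(content) + list(style_matrix[style_id])

E = make_style_embedding(num_styles=3, dim=4)   # e.g. past / present / future
c = [0.5, -0.2]                                  # toy content representation
as_past = decoder_input(c, E, 0)
as_future = decoder_input(c, E, 2)
print(len(as_past), as_past[:2] == as_future[:2])
```

The content half of the input stays the same and only the style row changes, which is why a single decoder can produce several styles.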

4.6 Parameter Estimation

Adadelta optimizer
Learning rate 0.0001
Batch size 128
Training runs for 50 epochs on paper-news and 10 epochs on positive-negative, and the parameters with the lowest PPL are selected.
The multi-decoder model is trained on data that alternates between styles (positive, negative, positive, negative, ...).
The style-embedding model is trained on randomly shuffled data.
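
The two data orderings could look roughly like this. It is a sketch under assumptions: the notes do not say whether the alternation happens per batch or per example, so batches are assumed here, and both function names are hypothetical.

```python
import random
from itertools import chain, zip_longest

def alternate_by_style(style_a_batches, style_b_batches):
    """Interleave batches from two styles: a, b, a, b, ...
    Leftover batches of the longer list are appended at the end."""
    pairs = zip_longest(style_a_batches, style_b_batches)
    return [batch for batch in chain.from_iterable(pairs) if batch is not None]

def shuffle_all(style_a_batches, style_b_batches, seed=0):
    """Mix batches of both styles into one randomly ordered list."""
    mixed = list(style_a_batches) + list(style_b_batches)
    random.Random(seed).shuffle(mixed)
    return mixed

positive = ["pos-1", "pos-2"]
negative = ["neg-1", "neg-2", "neg-3"]
print(alternate_by_style(positive, negative))
print(shuffle_all(positive, negative))
```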

5. Evaluation

BLEU

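An evaluation method for translation models

ROUGE

Evaluation for text summarization

NIST / Meteor

Methods used in NLP

AM-FM

A way to evaluate translation models without ground truth
It still needs BLEU
The source and target outputs are turned into sentence embeddings with SVD, and their cosine similarity is computed

RUBER

An evaluation method for dialogue systems
It scores a referenced part and an unreferenced part separately; the referenced part computes the cosine distance between the sentence embeddings of the model output and the ground truth.

This paper presents general evaluation metrics from the two angles of transfer strength and content preservation.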

5.1 Transfer Strength

A classifier is used to measure the degree of transfer strength.
Shen (NL-032) did the same.
I expected something novel, but this evaluation turns out to be the existing method...
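
As a rough sketch of the idea, transfer strength can be read as the fraction of transferred sentences that the classifier assigns to the target style. The function name is hypothetical, and the predictions below are made-up labels standing in for the output of a trained style classifier.

```python
def transfer_strength(predicted_styles, target_styles):
    """Share of outputs whose classifier-predicted style equals the target style."""
    if len(predicted_styles) != len(target_styles):
        raise ValueError("need one prediction per target")
    if not target_styles:
        raise ValueError("need at least one sentence")
    hits = sum(1 for p, t in zip(predicted_styles, target_styles) if p == t)
    return hits / len(target_styles)

# Four positive reviews were transferred toward "neg"; the classifier
# still reads one of the outputs as positive.
predicted = ["neg", "neg", "pos", "neg"]
targets = ["neg", "neg", "neg", "neg"]
print(transfer_strength(predicted, targets))
```

Because the score depends entirely on the classifier, any classifier error leaks directly into the metric.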

5.2 Content Preservation

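It measures the similarity between the source text and the target text.
The source and target sentences are embedded, and the cosine distance between the embeddings is computed.
So how is the embedding obtained??

~~Min, mean, and max pooling are applied here, but it is a pity that nothing explains why this is a content embedding.~~
~~If this really captured content, it could be used in place of the model that extracts content embeddings.~~
It was explained in the Evaluation Settings after all...
My own guess) Word embeddings carry word meaning, so taking the min, mean, and max over a sentence's word embeddings amounts to picking out what represents the sentence, and the authors presumably judged that to be the content.
Transfer strength can be understood as the existing approach: a classifier checks whether the output really has the target style.

GloVe-100 is used for the word embeddings.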
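
A minimal sketch of the pooling-plus-cosine idea follows. It assumes the sentence embedding is the concatenation of the min, mean, and max pooled word vectors and that the score is their cosine similarity; the 3-dimensional vectors are toy values invented for this example, not GloVe, and the function names are hypothetical.

```python
import math

def sentence_embedding(words, vectors):
    """Concatenate min, mean, and max pooling over the known word vectors."""
    rows = [vectors[w] for w in words if w in vectors]
    if not rows:
        raise ValueError("no known words in sentence")
    columns = list(zip(*rows))  # one tuple per embedding dimension
    mins = [min(col) for col in columns]
    means = [sum(col) / len(col) for col in columns]
    maxs = [max(col) for col in columns]
    return mins + means + maxs

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def content_preservation(source_words, target_words, vectors):
    """Higher means the transferred sentence stays closer to the source."""
    return cosine_similarity(sentence_embedding(source_words, vectors),
                             sentence_embedding(target_words, vectors))

# Toy word vectors (assumption: stand-ins for 100-dim GloVe vectors).
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.2], "movie": [0.9, 0.3, -0.4], "film": [0.8, 0.4, -0.3],
    "was": [0.0, 0.1, 0.1], "fun": [0.2, 0.9, 0.5], "dull": [0.1, -0.8, 0.4],
    "battery": [-0.7, 0.2, 0.9],
}

source = ["the", "movie", "was", "fun"]
print(content_preservation(source, ["the", "film", "was", "fun"], TOY_VECTORS))
print(content_preservation(source, ["the", "battery", "was", "dull"], TOY_VECTORS))
```

With these toy vectors the near-paraphrase scores much higher than the sentence about a different topic, which is the behavior the metric is after.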
The two metrics, transfer strength and content preservation, could be merged into an F1 score, but sometimes style transfer matters more.
So an F1 score is said not to be always the best choice; ideally a suitably weighted integration would be better, which is left as future work.
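
For illustration only, here is what an F1-style combination looks like, together with one possible weighting. The F-beta form is my own assumption borrowed from precision/recall, not something the paper proposes; here beta > 1 gives more weight to transfer strength.

```python
def f_beta(transfer_strength, content_preservation, beta=1.0):
    """Weighted harmonic mean of the two scores.
    beta = 1 is the plain F1; beta > 1 emphasizes transfer strength."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta ** 2 * content_preservation + transfer_strength
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * content_preservation * transfer_strength / denominator

# A model that transfers strongly but preserves little content.
print(f_beta(0.9, 0.3))            # plain F1
print(f_beta(0.9, 0.3, beta=2.0))  # transfer strength weighted more
```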

6. Experimental Setup

6.1 Datasets

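Two datasets are used.

Paper-news title dataset

Built by the authors

Positive-negative review dataset

Uses the dataset created by He and McAuley

2,000 examples are used for val and test, and the rest is used as training data.
Preprocessing

Sentences longer than 20 words are dropped.
All characters are lowercased.
Every number is replaced with <NUM>.
Should I do this kind of preprocessing too?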
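
The three preprocessing rules are easy to sketch. Assumptions in this example: whitespace tokenization, a simple digit regex, and the name `preprocess`; the notes do not say how the authors tokenized or matched numbers.

```python
import re

MAX_WORDS = 20
NUM_TOKEN = "<NUM>"
# Digit runs, optionally with . or , separators inside (e.g. 3.5 or 1,000).
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def preprocess(sentence):
    """Lowercase, replace numbers with <NUM>, and drop sentences over 20 words.
    Returns a list of tokens, or None when the sentence is too long."""
    tokens = sentence.lower().split()
    if len(tokens) > MAX_WORDS:
        return None
    # Digit runs inside a longer token (e.g. "v2") are replaced as well.
    return [_NUMBER.sub(NUM_TOKEN, token) for token in tokens]

print(preprocess("Deep Learning in 2018"))
print(preprocess(" ".join(["word"] * 21)))
```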

6.2 Paper-News Title Dataset

Paper titles come from academic websites.

ACM Digital Library, Arxiv, Springer, ScienceDirect and Nature

News titles come from the UC Irvine Machine Learning Repository.

Of the 422,937 titles there, only the 108,503 that fall under science and technology are kept after filtering.

6.3 Positive-Negative Review Dataset

It contains Amazon product reviews and was released in 2016.
The data covers 142,800,000 product reviews on Amazon from 1996 to 2014.
The domains include books, electronics, movies, and so on; 400,000 positive and negative reviews were sampled at random to build the dataset.

6.4 Model Settings

The best parameters were reportedly found by experiment. (So for Korean, other values would probably be best.)
Paper-news title

word embedding size 64
encoder hidden vector size among {32, 64, 128}
style embedding size among {32, 64, 128}

Positive-negative review

word embedding size 64 for multi-decoder
{64, 128} for style-embedding model
encoder hidden vector size among {16, 32, 64}
style embedding size among {16, 32, 64}

6.5 Evaluation Settings

Measuring transfer strength requires an LSTM-sigmoid classifier. (Why this classifier? A CNN classifier or a SOTA model should work as well.)

The hidden state dimension is 128.
It is trained for 2 epochs and then used for evaluation.
On paper-news title transfer it reaches 98.8% accuracy on validation.
On the positive-negative validation set it reaches 84.8% accuracy.

Content preservation metric

Pre-trained 100-dim GloVe is used.
Sentiment words are filtered out before the content preservation metric is applied.
Positive and negative word dictionaries are used for the filtering.
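
The filtering step might look like the sketch below. The tiny lexicon is invented for the example; the actual positive and negative dictionaries are not named in these notes, and `strip_sentiment_words` is a hypothetical helper.

```python
def strip_sentiment_words(words, sentiment_lexicon):
    """Drop words found in the positive/negative dictionaries, so the content
    score is not penalized for the very words the transfer is meant to change."""
    return [w for w in words if w.lower() not in sentiment_lexicon]

# Toy lexicon (assumption), standing in for the real sentiment dictionaries.
LEXICON = {"great", "terrible", "love", "hate"}

print(strip_sentiment_words(["the", "movie", "was", "great"], LEXICON))
print(strip_sentiment_words(["the", "movie", "was", "terrible"], LEXICON))
```

After filtering, a positive review and its negative rewrite reduce to the same content words, so a perfect sentiment flip is not counted as a loss of content.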

7. Results and Analysis

To find out how well the proposed metrics measure text style transfer, human judgments were collected first.

7.1 Comparison with Human Judgments

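The proposed metric is compared against human judgments. (A correlation is computed.)
200 test samples are drawn at random, and annotators rate how well the style-embedding model's generated sentences turned out, with 0, 1, or 2 points.
0 means not similar, 1 slightly similar, and 2 very similar.
The rating was done on Amazon Mechanical Turk.
Different people may rate the 200 sampled items differently, so the ratings are averaged for each pair.
Each item then has a human score and a content preservation metric value.
Spearman's coefficient between these two values is computed.

Think of it as a way to measure the relationship between two sets of values.
The correlation score is 0.5656 with p-value<0.0001, which shows a high correlation.
So the claim is that the proposed metric carries a meaning close to human judgment.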
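
For reference, Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python version with hypothetical function names; the scores are made-up numbers, not the paper's data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values all receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

human_scores = [0, 1, 1, 2]            # made-up averaged human ratings
metric_scores = [0.2, 0.5, 0.6, 0.9]   # made-up content preservation values
print(spearman(human_scores, metric_scores))
```

Ties matter here because human ratings on a 0/1/2 scale repeat often, even after averaging.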

7.2 Model Performances

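Figure 2 shows that transfer strength and content preservation are inversely related.

The measurements were taken while varying the parameters.

That is, changing the style a lot tends to lose some of the content.

I take this as a limitation of the models so far.
The content preservation metric was shown above to be meaningful, but I also think the metric itself has limits.
I doubt the relationship is always inverse. My own guess is that if content preservation and transfer strength were both scored by humans, the relationship would look different...

Changing the parameters barely changes the auto-encoder's transfer strength, whereas the proposed models respond well. In other words, they balance the two aspects well.

How is an auto-encoder supposed to change style at all??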

7.3 Paper-News Title Transfer

Each model does well on a different aspect (left of Figure 2), so none can be called always best.
Across a range of scenarios, though, the multi-decoder and style-embedding models are better (since they change the style well).

7.4 Positive-Negative Review Transfer

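In the pos-neg task, the auto-encoder's style does not appear fully preserved (its transfer strength is not 0) because the classifier is imperfect. (This is what I wondered about above.)

The classifier used here reportedly has 84% accuracy on the pos-neg task.

So the transfer strength measurements on the pos-neg task cannot be fully trusted.
The multi-decoder model transfers style better than the style-embedding model.

Isn't that obvious, given that each style has its own decoder?
Personally, though, the style-embedding approach feels like the more research-worthy one.

Style changes more readily on paper-news than on the pos-neg task.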

7.5 Analysis in a Multi-task Learning View

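Parameters can be shared among models that learn different tasks (positive / negative).
The style-embedding model shares both the encoder and the decoder, while the multi-decoder model shares only the encoder.
The auto-encoder does not share in this way, so it cannot change the style.
The style-embedding model shares all parameters (enc/dec), so its style change is weak.
The multi-decoder model has separate decoders, so its style change is strong.
For content preservation, more parameters need to be shared.
The style-embedding model needs less training data (because the model is shared), but encoding style information is not easy.

Couldn't it just use one-hot?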

7.6 Lower Bound for Content Preservation

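This is to find the lower bound of content preservation.

That is, to get the value that a model must at least exceed!

2,000 sentence pairs are sampled at random.

These pairs are scored with the content preservation metric.
paper-news: 0.609
positive-negative: 0.863
In practice most of these pairs will not share the same content. If so, shouldn't a good metric score them below 0.5 at least?

The proposed models score above these values, so the authors argue that the models learned to preserve content.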
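
The lower-bound procedure can be sketched as averaging a score over randomly drawn, unrelated pairs. Several things here are assumptions: that the reported numbers are means over the 2,000 pairs, that one sentence is drawn from each style, and the scorer itself, which is a simple word-overlap stand-in rather than the paper's embedding-based metric.

```python
import random

def jaccard_overlap(sentence_a, sentence_b):
    """Stand-in scorer (NOT the paper's metric): shared words / all words."""
    words_a, words_b = set(sentence_a.split()), set(sentence_b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def random_pair_baseline(side_a, side_b, score_fn, num_pairs=2000, seed=0):
    """Mean score over randomly paired sentences, used as a floor that a
    content-preserving model should beat."""
    rng = random.Random(seed)
    scores = [score_fn(rng.choice(side_a), rng.choice(side_b))
              for _ in range(num_pairs)]
    return sum(scores) / len(scores)

papers = ["a survey of neural style transfer", "graph kernels for molecules"]
news = ["new chip doubles battery life", "a survey shows phones get faster"]
print(random_pair_baseline(papers, news, jaccard_overlap))
```

Swapping `jaccard_overlap` for an embedding-based scorer gives the kind of baseline reported above; pooled embeddings of unrelated sentences can still point in similar directions, which is one plausible reason the floor sits well above 0.5.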

7.7 Qualitative Study

The auto-encoder outputs its input unchanged.
The other two models change a few words or phrases while preserving the content.
It works fairly well on pos-neg, but seems to work less well on paper-news.

8. Conclusion

Two models proposed
Two evaluation metrics proposed
Two datasets put together
Experiments show that the content preservation evaluation tracks human evaluation
In the future... they plan to build a comprehensive evaluation metric that also accounts for sentence fluency.

Reference

Paper: https://arxiv.org/pdf/1711.06861.pdf
