■ Comment

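I was going to read the whole paper, but the same authors published a follow-up at ACL 2018, which I expect to be an upgraded version of this paper. So let's read this one only briefly.
The method applied here is broadly a VAE; more precisely, it is a conditional VAE (CVAE).
The condition is the context, which includes the previous utterances, the conversational floor, and meta features.
Adding knowledge guidance on top of this gives kgCVAE, where the "knowledge" can be thought of as the dialog act.
The dialog acts appear to have been labeled separately, using a simple dialog-act recognizer trained on the dataset.
In any case, since it is broadly a VAE, it can generate multiple utterances from different latent representations.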

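The paper emphasizes this point and says the model captures discourse-level diversity.
VAEs are generally hard to optimize, and this paper also covers that, but I did not read that part.
(As far as I know) optimization is still a problem these days, so my understanding is that VAEs are not used much in recent NLG research..

Overall, it felt similar to VAE-based training in style transfer.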

0 Abstract

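Recent neural encoder-decoder models show remarkable performance in open-domain conversation, but they often generate dull and generic responses.
Unlike previous work, which tried to solve this by diversifying the decoder's output at the word level, we capture discourse-level diversity in the encoder with a new framework based on conditional VAEs.
Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders.
We further develop a novel variant that is integrated with linguistic prior knowledge for better performance.
Finally, the training procedure is improved by introducing a bag-of-word loss.
Our proposed models are shown to generate significantly more diverse responses than baseline approaches and to exhibit competence in discourse-level decision-making.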

1 Introduction

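The dialog manager is one of the key components of a dialog system and is responsible for modeling the decision-making process.
Specifically, it typically takes a new utterance and the dialog context as input and generates discourse-level decisions.
Advanced dialog managers usually have a list of potential actions.

This action list lets them behave in diverse ways during a conversation (e.g., different strategies to recover from non-understanding).

However, the conventional approach to designing a dialog manager does not scale well to open-domain conversation models because of the vast number of possible decisions.
Thus there is growing interest in applying encoder-decoder models to open-domain conversation.
The basic approach treats a conversation as a transduction task, where the dialog history is the source sequence and the next response is the target sequence.

In other words, it is approached as a task of transducing the source (= dialog history) into the target (= next utterance).

The model is trained end-to-end on large conversation corpora with the MLE objective, without the need for manual crafting.
However, recent research has found that encoder-decoder models tend to generate generic and dull responses instead of meaningful and specific answers.
There have been many attempts to explain and solve this limitation, and they fall broadly into two categories.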

(1) the first category argues that the dialog history is only one of the factors that decide the next response.
Other features should be extracted and provided to the models as conditionals in order to generate more specific responses (Xing et al., 2016; Li et al., 2016a);
(2) the second category aims to improve the encoder-decoder model itself, including decoding with beam search and its variations (Wiseman and Rush, 2016), encouraging responses that have long-term payoff (Li et al., 2016b), etc.

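Building on previous work on dialog managers and encoder-decoder models, the key idea of this paper is to model dialogs as a one-to-many problem at the discourse level.
Previous studies indicate that many factors determine the next response in open-domain dialogs, and that it is not easy to extract all of them.
Intuitively, given a similar dialog history (and other observed inputs), there may exist many valid responses (at the discourse level).

Each of them corresponds to a certain configuration of latent variables that are not present in the input.

To uncover the potential responses, we need to model a probability distribution over the distributed utterance embeddings of the potential responses using a latent variable (Figure 1).
This lets us generate diverse responses by drawing samples from the learned distribution and reconstructing their words through a decoder neural network.
Specifically, our contributions are three-fold.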

1. We present a novel neural dialog model adapted from conditional variational autoencoders (CVAE) (Yan et al., 2015; Sohn et al., 2015), which introduces a latent variable that can capture discourse-level variations as described above
2. We propose Knowledge-Guided CVAE (kgCVAE), which enables easy integration of expert knowledge and results in performance improvement and model interpretability.
3. We develop a training method in addressing the difficulty of optimizing CVAE for natural language generation (Bowman et al., 2015).

We evaluate our models on human-human conversation data and obtain promising results in:

(a) generating appropriate and discourse-level diverse responses, and
(b) showing that the proposed training method is more effective than the previous techniques.

2 Related Work

2.1 Encoder-decoder Dialog Models

2.2 Conditional Variational Autoencoder

3 Proposed Models

3.1 Conditional Variational Autoencoder (CVAE) for Dialog Generation

Each conversation is represented by three random variables.

the dialog context c (context window size k − 1)
the response utterance x (the k-th utterance)
a latent variable z
The latent variable z is used to capture the latent distribution over valid responses.

Furthermore, c consists of the dialog history and the following.

the preceding k-1 utterances
conversational floor (1 if the utterance is from the same speaker of x, otherwise 0)
meta features m (e.g. the topic).
In short, c is everything that carries context information.
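
The conversational floor bit can be computed directly from speaker labels. A minimal sketch, assuming each utterance in the context window comes with a speaker ID (the function and argument names are hypothetical, not from the paper):

```python
def conversational_floor(context_speakers, response_speaker):
    """Return 1 for each context utterance spoken by the speaker of x, else 0."""
    return [1 if s == response_speaker else 0 for s in context_speakers]


# Three context utterances by A, B, A; the response x is spoken by B.
floors = conversational_floor(["A", "B", "A"], response_speaker="B")
print(floors)  # [0, 1, 0]
```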

We then define the conditional distribution as follows.

p(x, z|c) = p(x|z, c)p(z|c)
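Our goal is to approximate p(z|c) and p(x|z, c) with deep neural networks (parameterized by θ).

pθ(z|c) is the prior network and pθ(x|z, c) is the response decoder.
That is, drawing x and z given c means first drawing z and then using it to generate x, as in Figure 2(a).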

1. Sample a latent variable z from the prior network pθ(z|c).
2. Generate x through the response decoder pθ(x|z, c).

CVAE is trained to maximize the conditional log likelihood of x given c.

This likelihood involves an intractable marginalization over the latent variable z.

As proposed in (Sohn et al., 2015; Yan et al., 2015), CVAE can be trained efficiently with the Stochastic Gradient Variational Bayes (SGVB) framework by maximizing the variational lower bound of the conditional log likelihood.
We assume z follows a multivariate Gaussian distribution with a diagonal covariance matrix and introduce a recognition network qφ(z|x, c) to approximate the true posterior distribution p(z|x, c).

What is the role of the recognition network here?

Sohn et al. (2015) have shown that the variational lower bound can be written as:
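
In the notation above, the standard CVAE lower bound reads as follows. It is written out here from the definitions in this section (recognition network qφ, prior network and response decoder pθ), so treat the exact typesetting as my reconstruction rather than a copy of the paper's equation:

$$
\mathcal{L}(\theta, \phi; x, c) = -\mathrm{KL}\big(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\big) + \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] \le \log p(x \mid c)
$$

The first term keeps the recognition network close to the prior network, and the second is the reconstruction term computed by the response decoder.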

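In other words, training just means maximizing this lower bound.
This seems to take the VAE idea over as-is.

Figure 3 shows an overview of our model.
The utterance encoder is a bidirectional RNN (biRNN) with GRU cells that encodes each utterance into a fixed-size vector by concatenating the last hidden states of the forward and backward RNNs.

For the response utterance, this concatenated vector is the x in the figure.

The context encoder is a 1-layer GRU network that encodes the preceding k-1 utterances by taking u1:k−1 and the corresponding conversational floor as inputs.
The last hidden state $h^c$ of the context encoder is concatenated with the meta features m to give: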

$c=[h^c,m]$

Since we assume z follows an isotropic Gaussian distribution, the recognition network is qφ(z|x, c) ∼ N(µ, σ²I) and the prior network is pθ(z|c) ∼ N(µ', σ'²I).

As in VAE, the reparameterization trick is used to obtain samples of z either from N(z; µ, σ²I) predicted by the recognition network (training) or from N(z; µ', σ'²I) predicted by the prior network (testing).
Finally, the response decoder is a 1-layer GRU network with initial state s0 = Wi[z, c] + bi. The response decoder then predicts the words in x sequentially.
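
To make these pieces concrete, here is a rough sketch in plain Python of (1) reparameterized sampling of z, (2) the closed-form KL divergence between two diagonal Gaussians, as needed for KL(qφ(z|x, c) ‖ pθ(z|c)), and (3) the decoder's initial state s0 = Wi[z, c] + bi. The function names, the log-variance parameterization, and the list-of-rows matrix layout are my own assumptions, not the paper's code.

```python
import math
import random


def reparameterize(mu, logvar, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))))."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def initial_decoder_state(z, c, W_i, b_i):
    """Return s0 = W_i [z, c] + b_i, with W_i given as a list of rows."""
    zc = list(z) + list(c)  # the concatenation [z, c]
    return [sum(w * v for w, v in zip(row, zc)) + b
            for row, b in zip(W_i, b_i)]
```

Because the noise eps is sampled outside the network, z stays a differentiable function of µ and σ, which is what allows gradient-based training through the sampling step.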

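To summarize, based on panel (b) of the figure:

u1 through uk-1 are the previous utterances.
In the bottom row, each previous ui goes through the utterance encoder and then the context encoder, and the meta information is appended to the output to form c.
Here the 0/1 input indicates whether the speaker is the same as the speaker of the next response.

That is, it is 1 if the utterance was spoken by the same person as uk, and 0 otherwise.

meta is, conceptually, a vectorized topic.
In the top row, the next utterance uk is known (at training time), so its utterance-encoder output is called x.
The prior network pθ(z|c) then produces z as in Eq. (3).
The recognition network qφ(z|x, c) also produces z, as in Eq. (2).
The recognition network exists so that a loss can express the idea that these two z should match: the KL term pulls the two distributions toward each other.
Then the response decoder pθ(x|z, c) generates the sentence.
At test time uk is unknown, so naturally z drawn from the prior network pθ(z|c) is used; at training time, z drawn from the recognition network qφ(z|x, c) is used instead, since it can see the true response x (presumably for accuracy?).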
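
The train/test asymmetry in the last step can be sketched as below. This is a toy illustration only: `prior_net` and `recognition_net` are hypothetical callables standing in for the two networks, each assumed to return a (mean, log-variance) pair.

```python
import math
import random


def sample_z(c, prior_net, recognition_net, x=None, rng=random):
    """Sample z from q(z|x, c) when the response x is known (training),
    otherwise from the prior p(z|c) (testing)."""
    if x is not None:
        mu, logvar = recognition_net(x, c)
    else:
        mu, logvar = prior_net(c)
    # Reparameterized draw: z = mu + sigma * eps, eps ~ N(0, I).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]
```

Calling `sample_z` several times with the same c at test time gives different z values, and hence different responses, which is where the discourse-level diversity comes from.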

3.2 Knowledge-Guided CVAE (kgCVAE)

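In practice, training CVAE is a hard optimization problem and requires a large amount of data.
On the other hand, past research on spoken dialog systems and discourse analysis suggests that many linguistic cues capture crucial features of natural conversation.
For example, dialog acts have been widely used in dialog managers to represent the propositional function of the system.
We therefore conjecture that it helps the model learn a meaningful latent z if discourse features are explicitly extracted and provided during training.
To incorporate linguistic features into the basic CVAE model, we first denote the linguistic features as y.

We then assume that the generation of x depends on c, z, and y (the response decoder part).

y depends on z and c, as in Figure 2.

In other words, apart from the given y, the model predicts y' from z and c and compares the two.

Specifically, during training the initial state of the response decoder is s0 = Wi[z, c, y] + bi and the input at every step is [et, y] (right side of Figure 2(b)).

Here et is the word embedding of the t-th word of x.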

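In addition, y' is obtained with an MLP: y' = MLPy(z, c) (right side of Figure 2(b)).
At test time, the predicted y' is used by the response decoder instead of the oracle decoders.
We call this model knowledge-guided CVAE (kgCVAE); developers can add whichever discourse features they want the latent variable to capture.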
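
As a rough sketch of the two kgCVAE additions, the code below predicts y' from [z, c] with a small MLP and builds the per-step decoder input [et, y]. The single tanh hidden layer, the softmax output, and all parameter names are assumptions made for illustration; the notes do not specify the MLP's architecture.

```python
import math


def mlp_predict_y(z, c, W1, b1, W2, b2):
    """y' = softmax(W2 tanh(W1 [z, c] + b1) + b2), a distribution over dialog acts."""
    zc = list(z) + list(c)  # the concatenation [z, c]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, zc)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step_input(e_t, y_vec):
    """Per-step decoder input [e_t, y]: the word embedding concatenated with y."""
    return list(e_t) + list(y_vec)
```

During training `y_vec` would come from the given label y, while at test time it would be derived from the predicted y'.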
The kgCVAE model is trained by maximizing:
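
In the notation of Section 3.1, the objective extends the CVAE lower bound with an extra term for y. The form below is my reconstruction from the definitions in this section; in particular, I am assuming that the recognition network also conditions on y:

$$
\mathcal{L}(\theta, \phi; x, c, y) = -\mathrm{KL}\big(q_\phi(z \mid x, c, y) \,\|\, p_\theta(z \mid c)\big) + \mathbb{E}_{q_\phi(z \mid x, c, y)}\big[\log p(x \mid z, c, y)\big] + \mathbb{E}_{q_\phi(z \mid x, c, y)}\big[\log p(y \mid z, c)\big]
$$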

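The first loss is the KL divergence that compares the two z distributions.
The second loss would be the MLE term, and the third loss, which makes y' match y, is what is added on top of CVAE.
But reading up to this point, y must actually be given as a dialog act during training.
The paper does not say so here, but the later dataset section explains that dialog acts were labeled as follows.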
Furthermore, a subset of SW was manually labeled with dialog acts (Stolcke et al., 2000).
We extracted dialog act labels based on the dialog act recognizer proposed in (Ribeiro et al., 2015).
The features include the uni-gram and bi-gram of the utterance, and the contextual features of the last 3 utterances.
We trained a Support Vector Machine (SVM) (Suykens and Vandewalle, 1999) with linear kernel on the subset of SW with human annotations.
There are 42 types of dialog acts and the SVM achieved 77.3% accuracy on held-out data.
Then the rest of SW data are labelled with dialog acts using the trained SVM dialog act recognizer.
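
The feature set described above (uni-grams and bi-grams of the utterance plus context from the last 3 utterances) could be extracted roughly as follows. This is only a sketch: whitespace tokenization, the `prev{i}:` feature prefix, and the use of uni-grams as the contextual features are all my assumptions, since the passage does not spell out the exact features. The resulting counts would then be vectorized and fed to a linear SVM.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dialog_act_features(utterance, history, context_size=3):
    """Count uni/bi-grams of the utterance plus prefixed uni-grams of the
    last `context_size` utterances (an assumed form of contextual features)."""
    tokens = utterance.lower().split()
    feats = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    # prev1 is the most recent utterance, prev2 the one before it, and so on.
    for i, prev in enumerate(reversed(history[-context_size:]), start=1):
        for tok in prev.lower().split():
            feats[f"prev{i}:{tok}"] += 1
    return feats


feats = dialog_act_features("do you like jazz", ["hi there", "hello"])
```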

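Since the reconstruction of y is now part of the loss function, kgCVAE can encode y-related information into z more efficiently than by discovering it from the surface-level x and c alone.
Another advantage of kgCVAE is that it can output a high-level label (e.g., the dialog act) along with the word-level response, which makes the model's output easier to interpret.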

3.3 Optimization Challenges

Optimization issues that come up in actual training.. skipped.

4 Experiment Setup

5 Results

6 Conclusion and Future Work

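In conclusion, we identified the one-to-many nature of open-domain conversation and proposed two novel models that perform better at generating diverse and appropriate responses at the discourse level.
While the current paper addresses diversifying responses with respect to dialog acts, this work is (in fact) part of a larger research direction.

Namely, it is part of a larger research direction that leverages both past linguistic findings and the learning power of deep neural networks to learn better representations of the latent factors in dialog.

The output of the novel neural dialog models is easier for humans to explain and control.
In addition to dialog acts, we plan to apply the kgCVAE model to other linguistic phenomena such as sentiment and named entities.
Finally, the recognition network serves as a foundation for designing a data-driven dialog manager that automatically discovers useful high-level intents.
All of the above point to a promising research direction.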

Reference

https://arxiv.org/pdf/1703.10960.pdf


Short-009, Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders (2017-ACL)
