■ Comment

Like other response selection papers, this one improves performance by combining self-supervised tasks.
My personal impression is that the overall idea is much the same as in those works.

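That is, it constructs self-supervised tasks from the given dialogue set, similar to how BERT is trained, and uses them for training.
These additional tasks push the model to learn the relations between the utterances in a dialogue.
In more detail, one option is post-training, where a pre-trained BERT is pre-trained further,
and another, as in this paper, is to add the tasks to the loss and train them jointly as auxiliary tasks.
Depending on the task, it can also be used for augmentation, as in NL-113.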

So the difference between papers mostly comes down to which tasks they construct.

Four tasks are mentioned here.
1) next session prediction

What differs from the usual next sentence prediction is that it works on sessions.
The dialogue is split into two sessions, and the model decides whether the two sessions are consecutive.

2) utterance restoration

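This looks like an extension of the idea of predicting [MASK] tokens:
an entire utterance is replaced with [MASK] tokens (keeping the same length),
and the model predicts the masked tokens from the context on both sides.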

3) incoherence detection

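One utterance is randomly replaced with another one,
and the model predicts whether each utterance has been replaced or not.
In other words, a replaced utterance is incoherent with the context on both sides, while unreplaced data is coherent.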

4) consistency discrimination

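Two utterances (u, v1) spoken by the same speaker are drawn from the dialogue and given a matching score.
An utterance (v2) is then randomly sampled from a different dialogue, and (u, v2) is given a matching score as well.
The model is trained with a hinge loss so that the score of the same-speaker pair is higher.
The idea is not bad, but I have some doubt about whether it really helps.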

Among the baselines compared in this paper, the following descriptions are worth keeping in mind.

BERT-VFT (Whang et al., 2020):

Before fine-tuning, the model is post-trained on the training corpus in the same way BERT is trained.

SA-BERT (Gu et al., 2020):

Follows BERT-VFT and additionally incorporates speaker-aware embeddings.

0 Abstract

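Building an intelligent dialogue system that can select a proper response according to a multi-turn context is challenging.
Existing studies design context-response matching models with various neural architectures or PLMs and typically train them on a single response prediction task.
These approaches overlook many potential training signals contained in dialogue data, which might be beneficial for context understanding and produce better features for response prediction.
Besides, the responses retrieved from existing dialogue systems supervised in the conventional way still face critical challenges, including incoherence and inconsistency.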

To address these issues, this paper proposes learning a context-response matching model with self-supervised tasks designed for dialogue data, on top of pre-trained language models.
Specifically, we introduce four self-supervised tasks:

next session prediction
utterance restoration
incoherence detection
consistency discrimination

We then jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.

In this way, the auxiliary tasks guide the learning of the matching model toward a better local optimum and help it select a more proper response.
Experimental results on two benchmarks indicate that the proposed auxiliary self-supervised tasks bring significant improvements for multi-turn response selection,
and our model achieves new SoTA results on both datasets.

1. Introduction

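Designing a dialogue system that can converse with humans naturally and meaningfully is one of the challenging problems on the way to high-level artificial intelligence, and it has been drawing increasing interest from both academia and industry.
Most existing dialogue systems are either generation-based or retrieval-based.
Given the dialogue context, generation-based approaches synthesize a response word by word with a conditional language model,

while retrieval-based methods select a proper response from a candidate pool.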

In this paper, we focus on retrieval-based approaches, which excel at providing informative responses and have been applied in several well-known commercial products (XiaoIce from Microsoft and AliMe Assist from Alibaba).
We consider the response selection task in multi-turn dialogues, where the retrieval model has to select the most proper response by measuring the matching degree between a multi-turn dialogue context and a number of response candidates.
Earlier studies concatenate the context into a single utterance and compute the matching score with utterance-level representations.
Later, most response selection models perform context-response matching within the representation-matching-aggregation paradigm, where each turn of utterance is represented individually and sequential information is aggregated over a sequence of utterance-response matching features.
To further improve response selection, several recent approaches propose more complicated interaction mechanisms between the context and the response, considering multiple granularities (or layers) of representations.
Recently, many studies have shown that pre-trained LMs (BERT, XLNet, RoBERTa) can learn universal language representations, which help various downstream NLP tasks and remove the need to train a new model from scratch.
To apply pre-trained models to multi-turn response selection, Whang et al. (2020) and Gu et al. (2020) made attempts to use BERT to learn a matching model, where the context and the candidate response are first concatenated and then fed into the PLM to compute the final matching score.
Through their multiple transformer layers, these pre-trained LMs capture inter-utterance and intra-utterance interaction information well.
Although PLM-based response selection models show superior performance thanks to their strong representation ability, it is still challenging for them to effectively learn task-related knowledge during training, especially when the training corpus is limited.
Naturally, these studies typically train the response selection model with the context-response matching task alone and overlook many potential training signals contained in dialogue data.
Such training signals might be beneficial for context understanding and produce better features for response selection.
Besides, the responses retrieved from existing dialogue systems supervised in the conventional way still face critical challenges, including incoherence and inconsistency.

In view of these issues, instead of constructing a more complicated context-response matching model, this paper proposes learning the context-response matching model with auxiliary self-supervised tasks designed for dialogue data, on top of pre-trained LMs.

Specifically, we introduce four self-supervised tasks, namely next session prediction, utterance restoration, incoherence detection and consistency discrimination, and jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.
On the one hand, these auxiliary tasks help improve the capability of the response selection model to understand the dialogue context and to measure the semantic relevance, consistency or coherence between the context and the response candidates.

On the other hand, they guide the matching model to effectively learn task-related knowledge from a fixed amount of training corpus and to produce better features for response prediction.
We conduct experiments on two benchmark datasets for multi-turn response selection:

Ubuntu Dialog Corpus (Lowe et al., 2015) and the E-commerce Dialogue Corpus (Zhang et al., 2018).

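Experimental results show that the proposed approach achieves new SoTA results on both datasets.
Compared with the previous SoTA methods, our model improves $R_{10}@1$ by 2.9% on the Ubuntu dataset and by 4.8% on the E-commerce dataset.
Furthermore, we apply the proposed self-supervised learning schema to several non-PLM-based response selection models (dual LSTM and ESIM).

The results indicate that our learning schema consistently and significantly improves the performance of existing matching models.
Surprisingly, with self-supervised learning, even a simple ESIM performs better than BERT on the Ubuntu dataset, which shows that our approach benefits various matching architectures.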

In summary, our contributions are three-fold:

We propose learning a context-response matching model with multiple auxiliary self-supervised tasks to fully utilize various training signals in the multi-turn dialogue context.
We design four self-supervised tasks, aiming at enhancing the capability of a PLM-based response prediction model in capturing the semantic relevance, coherence or consistency.
We achieve new state-of-the-art results on two benchmark datasets. Besides, with the help of auxiliary self-supervised tasks, a simple ESIM model can even achieve better performance than BERT on the Ubuntu dataset.

2. Model

2.1. Task Formalization

Suppose we have a multi-turn dialogue dataset $D = \{(c_i, r_i, y_i)\}^{N}_{i=1}$.

$c_i = \{u_{i,1}, u_{i,2}, \cdots, u_{i,m_i}\}$ denotes the dialogue context,
where $u_{i,t}$ is the utterance of the t-th turn.
$r_i$ denotes a response candidate.
$y_i \in \{0, 1\}$ is the label, and $y_i = 1$ means that $r_i$ is a proper response for $c_i$.

The task is to learn a matching model $g(\cdot, \cdot)$ from $D$ such that, for any new context $c = \{u_1, u_2, \ldots, u_m\}$ and response candidate $r$, $g(c, r) \in [0, 1]$ measures the matching degree between $c$ and $r$.

2.2. Matching with PLMs

We build the context-response matching model on pre-trained LMs, which are trained on large amounts of unlabelled data and provide strong universal representations.

They can be fine-tuned on task-specific training data to achieve good performance on downstream tasks.

Following previous studies, we choose BERT as the base model for a fair comparison.
Specifically, given a context $c = \{u_1, u_2, \ldots, u_m\}$, the t-th utterance $u_t = \{w_{t,1}, \ldots, w_{t,l_t}\}$ is a sequence of $l_t$ words.

Likewise, a response candidate $r = \{r_1, r_2, \ldots, r_{l_r}\}$ consists of $l_r$ words and comes with a label $y \in \{0, 1\}$.
We first concatenate all utterances of the context and the response into one continuous token sequence, separated by special tokens, which can be formulated as $x = \{[CLS], u_1, [EOT], u_2, [EOT], \ldots, [EOT], u_m, [EOT], [SEP], r, [SEP]\}$.

Here [CLS] and [SEP] are the classification symbol and the segment separation symbol of BERT,

and [EOT] is the "End Of Turn" tag designed for multi-turn contexts.
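The layout of the input sequence can be illustrated with a small sketch. It works on already-tokenized utterances (lists of strings); the helper name `build_input` is made up for this note and is not taken from the paper's code.

```python
from typing import List

CLS, SEP, EOT = "[CLS]", "[SEP]", "[EOT]"

def build_input(context: List[List[str]], response: List[str]) -> List[str]:
    """Flatten a tokenized context and response into one BERT-style sequence."""
    tokens = [CLS]
    for utterance in context:
        tokens.extend(utterance)
        tokens.append(EOT)      # every turn of the context ends with [EOT]
    tokens.append(SEP)          # closes the context segment
    tokens.extend(response)
    tokens.append(SEP)          # closes the response segment
    return tokens

example = build_input([["hi"], ["how", "are", "you"]], ["fine", "thanks"])
print(example)
```

In a real pipeline [EOT] would also have to be registered as a special token in the tokenizer vocabulary, which this sketch leaves out.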

For each token in $x$, its token, position and segment embeddings are summed and fed into the pre-trained transformer layers (a.k.a. BERT), which gives us the contextualized embedding sequence $\{E_{[CLS]}, E_2, \ldots, E_{l_x}\}$.
$E_{[CLS]}$ is an aggregated representation vector that carries the semantic interaction information of the context-response pair.
We then pass $E_{[CLS]}$ through a multi-layer perceptron to obtain the final matching score as follows.

where $W_{\{1,2\}}$ and $b_{\{1,2\}}$ are trainable parameters for the response prediction task, $f(\cdot)$ is a tanh activation function, and $\sigma(\cdot)$ stands for a sigmoid function.
Finally, the cross-entropy loss function is used as the training objective of the context-response matching task:
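The two equations referred to above did not survive in these notes. A reconstruction consistent with the surrounding description (a two-layer perceptron with tanh applied to $E_{[CLS]}$, a sigmoid output, and a binary cross-entropy loss) would be the following; the loss symbol $\mathcal{L}_{crm}$ is a name assumed here for illustration.

$$
g(c, r) = \sigma\big(W_2 \, f(W_1 E_{[CLS]} + b_1) + b_2\big)
$$

$$
\mathcal{L}_{crm} = -\, y \log g(c, r) - (1 - y) \log\big(1 - g(c, r)\big)
$$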

Before this fine-tuning on the context-response matching task, we follow previous studies and run domain-adaptive post-training to incorporate in-domain knowledge into BERT, for a fair comparison.
In the remaining sections, we first introduce the four proposed auxiliary self-supervised tasks and then present the final learning objective.

2.3. Self-Supervised Tasks

Aiming at a matching model that can effectively learn domain knowledge from a fixed training corpus and produce better features for response prediction,

we design four auxiliary self-supervised tasks:
1) next session prediction, 2) utterance restoration, 3) incoherence detection and 4) consistency discrimination.

These self-supervised tasks aim to strengthen the model's ability to measure the semantic relevance, coherence and consistency between the context and the response candidate.
At the same time, they guide the training of the model toward a better local optimum.
Figure 2 illustrates the four types of self-supervised tasks.

2.3.1. NEXT SESSION PREDICTION

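Because of the natural sequential relationship between dialogue turns, later turns usually show strong semantic relevance to the earlier turns of the context.
Motivated by this property, we design a more general response prediction task on the dialogue context, named next session prediction (NSP).

It makes full use of the sequential relationship in dialogue data and strengthens the model's ability to measure semantic relevance.

Specifically, the next session prediction task asks the model to predict whether two sequences are consecutive and relevant.
However, instead of matching a context with a response utterance, the model computes the matching degree between two pieces of a dialogue session.

Formally, given a context $c = \{u_1, u_2, \ldots, u_m\}$, we randomly split $c$ into two consecutive pieces $c_{left} = \{u_1, \ldots, u_t\}$ and $c_{right} = \{u_{t+1}, \ldots, u_m\}$.
Then, with 50% probability, we replace $c_{left}$ or $c_{right}$ with a piece of context sampled from the whole training corpus.
If one of the two pieces has been replaced, we set the label $y_{nsp} = 0$, otherwise $y_{nsp} = 1$.
The next session prediction task requires the model to decide whether $c_{left}$ and $c_{right}$ come from one continuous context.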
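The sampling procedure described above can be sketched as follows. This is only an illustration: the function name `make_nsp_example`, the toy corpus, and the way the replacement piece is cut out of another context are assumptions, since the notes do not say how the sampled piece is chosen.

```python
import random
from typing import List, Tuple

Context = List[str]  # one string per utterance

def make_nsp_example(
    context: Context, corpus: List[Context], rng: random.Random
) -> Tuple[Context, Context, int]:
    """Return (c_left, c_right, y_nsp) for one context with at least two turns."""
    t = rng.randint(1, len(context) - 1)      # split after the t-th utterance
    c_left, c_right = context[:t], context[t:]
    if rng.random() < 0.5:                    # keep the true consecutive pair
        return c_left, c_right, 1
    # Assumption: the replacement piece is cut from another random context.
    other = rng.choice([c for c in corpus if c is not context])
    s = rng.randint(1, len(other) - 1)
    if rng.random() < 0.5:
        c_left = other[:s]                    # replace the left piece
    else:
        c_right = other[s:]                   # replace the right piece
    return c_left, c_right, 0

corpus = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"], ["c1", "c2"]]
rng = random.Random(0)
nsp_examples = [make_nsp_example(corpus[0], corpus, rng) for _ in range(100)]
```

Generating the pairs on the fly like this matches the later remark that training instances of the auxiliary tasks are generated dynamically.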

To train PLMs with the proposed self-supervised task, we first concatenate all utterances of each piece into a single sequence and append [EOT] to the end of every utterance.

As in the main task, we feed the two segments into BERT and obtain $E^{nsp}_{[CLS]}$, the aggregated representation of the piece pair.

We then compute the final matching score $g_{nsp}(c_{left}, c_{right})$ with a non-linear transformation.
Finally, the objective function of the next session prediction task is as follows.
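The loss itself is missing from these notes. Since the task is a binary decision on $y_{nsp}$ scored by $g_{nsp}$, it is presumably the same cross-entropy form as the main task; the symbol $\mathcal{L}_{nsp}$ is a name assumed here.

$$
\mathcal{L}_{nsp} = -\, y_{nsp} \log g_{nsp}(c_{left}, c_{right}) - (1 - y_{nsp}) \log\big(1 - g_{nsp}(c_{left}, c_{right})\big)
$$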

2.3.2. UTTERANCE RESTORATION

As one of the self-supervised tasks used in PLMs, token-level masked language modeling is commonly used to guide the model to learn semantic and syntactic features of word sequences from the bidirectional context.
Here, we go a step further and introduce utterance-level masked language modeling.

That is, the utterance restoration (UR) task encourages the model to be aware of the semantic connections among the utterances of a context.

Specifically, we mask all tokens of a randomly sampled utterance in a dialogue session and let the model restore it from the rest of the context.

For example, given u1, u2, umask, u4, the model should recover umask = u3.

By learning to predict a proper utterance that fits the surrounding dialogue context, the model can produce representations better suited to dialogues, which is similar to the idea of the continuous bag-of-words model.
Formally, given a context $c = \{u_1, u_2, \ldots, u_m\}$, we randomly select an utterance $u_t$ and replace all of its tokens with the special token [MASK].

The model is required to restore $u_t$ based on $\hat{c} = \{u_1, \ldots, u_{t-1}, u_{mask}, u_{t+1}, \ldots, u_m\}$.
To fit this task to BERT, we take $x_{ur} = \{[CLS], u_1, [EOT], \ldots, u_{mask}, [EOT], \ldots, u_m, [EOT], [SEP]\}$ as the input of the BERT encoder, where $u_{mask}$ consists only of [MASK] tokens and has the same length as $u_t$.
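A minimal sketch of the masking step, assuming utterances are already tokenized into lists of strings. The helper `mask_utterance` and the toy dialogue are invented for this note; in training the index `t` would be drawn at random.

```python
from typing import List, Tuple

MASK = "[MASK]"

def mask_utterance(
    context: List[List[str]], t: int
) -> Tuple[List[List[str]], List[str]]:
    """Replace every token of utterance t by [MASK]; return (masked context, target)."""
    target = context[t]
    masked = [list(u) for u in context]   # copy so the original context stays intact
    masked[t] = [MASK] * len(target)      # u_mask keeps the length of u_t
    return masked, target

dialogue = [["hi"], ["any", "ideas", "?"], ["try", "apt", "update"], ["thanks"]]
masked_context, target = mask_utterance(dialogue, t=2)
print(masked_context)
```

The returned `target` tokens are what the prediction head has to recover at the masked positions.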

After processing by BERT, the top layer outputs the representation sequence $E_{ur} = \{E_{[CLS]}, E_{1,1}, \ldots, E_{1,l_1}, E_{[EOT]}, \ldots, E_{m,1}, \ldots, E_{m,l_m}, E_{[EOT]}, E_{[SEP]}\}$,

where $l_t$ is the length of the t-th utterance.

The model predicts the masked utterance conditioned on the contextualized representation of each word.
The probability distribution of each masked word is computed as follows.

where $W_{ur}$, $W'_{ur}$, $b_{ur}$, $b'_{ur}$ are trainable parameters, $w_{t,j}$ is the j-th token of the t-th utterance, and GELU(·) is an activation function.

Then, the training objective of the utterance restoration task is to minimize the following negative log-likelihood (NLL):
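Neither equation is reproduced in these notes. Going by the parameter list above (two weight matrices, two biases, and an activation, taken here to be GELU), a plausible reconstruction is a one-hidden-layer prediction head over the vocabulary followed by a token-averaged NLL. Here $E_{t,j}$ denotes the output embedding at the j-th masked position; the normalization by $l_t$ and the symbol $\mathcal{L}_{ur}$ are assumptions.

$$
p(w_{t,j} \mid \hat{c}) = \mathrm{softmax}\big(W'_{ur}\, \mathrm{GELU}(W_{ur} E_{t,j} + b_{ur}) + b'_{ur}\big)
$$

$$
\mathcal{L}_{ur} = -\frac{1}{l_t} \sum_{j=1}^{l_t} \log p(w_{t,j} \mid \hat{c})
$$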

2.3.3. INCOHERENCE DETECTION

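Inspired by the concept of discourse coherence in linguistics, we introduce the incoherence detection (ID) task.

It requires the model to recognize the incoherent utterance within a dialogue session, in order to strengthen the model's ability to capture the sequential relationship among utterances and to select coherent response candidates.

Specifically, given a dialogue context $c = \{u_1, \ldots, u_m\}$, we randomly select one utterance $u_k \in \{u_1, \ldots, u_m\}$ and replace it with an utterance randomly sampled from the whole training corpus.

The model is then asked to find the incoherent utterance in the context.

For each sample, we define a one-hot label $\{z_1, \ldots, z_m\}$,

where $z_t = 1$ if $t = k$, meaning that the t-th utterance has been replaced, and $z_t = 0$ otherwise.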
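The corruption and labelling step can be sketched like this. The function name `make_id_example` and the toy data are made up for illustration, and the sketch assumes the replacement pool does not contain the original utterances.

```python
import random
from typing import List, Tuple

def make_id_example(
    context: List[str], utterance_pool: List[str], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace one random utterance and return (corrupted context, one-hot labels z)."""
    k = rng.randrange(len(context))               # index of the utterance to replace
    corrupted = list(context)
    corrupted[k] = rng.choice(utterance_pool)     # utterance sampled from the corpus
    labels = [1 if t == k else 0 for t in range(len(context))]
    return corrupted, labels

context = ["u1", "u2", "u3", "u4"]
pool = ["x1", "x2", "x3"]
corrupted, z = make_id_example(context, pool, random.Random(1))
print(corrupted, z)
```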

To model this task, the BERT encoder takes the input $x_{id} = \{[CLS], u_1, [EOT], \ldots, u_m, [EOT], [SEP]\}$ and outputs $E_{id} = \{E_{[CLS]}, E_{1,1}, \ldots, E_{m,l_m}, E_{[SEP]}\}$,

where $E_{t,j}$ is the contextualized embedding of the j-th word in the t-th utterance and $l_t$ is the length of the t-th utterance.

We compute the aggregated representation of the t-th utterance by combining the mean and the max of the embedding sequence $\{E_{t,1}, \ldots, E_{t,l_t}\}$,
which is formulated as follows.

Then, the model makes a prediction based on the aggregated representation of each utterance. The probability of the t-th utterance being replaced is

where $W_{id}$ and $b_{id}$ are trainable parameters.

Finally, the learning objective of the incoherence detection task is defined as
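The three equations of this subsection are missing from these notes. A sketch consistent with the text (mean and max pooling concatenated, a linear layer with $W_{id}$ and $b_{id}$, and a one-hot target over the $m$ utterances) might read as follows; the softmax taken over the utterance index and the symbol $\mathcal{L}_{id}$ are assumptions.

$$
E_t = \big[\, \mathrm{mean}_{j}\, E_{t,j} \,;\ \max_{j}\, E_{t,j} \,\big]
$$

$$
p(z_t = 1 \mid u_1, \ldots, u_m) = \mathrm{softmax}_t\big(W_{id} E_t + b_{id}\big)
$$

$$
\mathcal{L}_{id} = -\sum_{t=1}^{m} z_t \log p(z_t = 1 \mid u_1, \ldots, u_m)
$$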

2.3.4. CONSISTENCY DISCRIMINATION

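Selecting a response that is consistent with the dialogue context is one of the major challenges in building dialogue agents.
However, most previous studies focus on modeling the semantic relevance between the context and the response candidate.
Intuitively, utterances in the same dialogue session tend to share similar topics, and utterances from the same interlocutor tend to share the same personality or style.
Based on these properties, we try to strengthen the response prediction ability by having the model measure consistency through self-supervised discriminative training,

where the self-supervised learning exploits the natural structure of dialogue data.

Formally, given a dialogue context $c = \{u_1, u_2, \ldots, u_m\}$, we sample two utterances from the same speaker and denote them by $u$ and $v$.

Then we randomly sample an utterance $\tilde{v}$ from another context in the training corpus.
The model measures the consistency degree of $\langle u, v \rangle$ and $\langle u, \tilde{v} \rangle$ and is trained to assign the higher score to $\langle u, v \rangle$.

Since $u$ and $v$ come from the same speaker and are not consecutive in the dialogue context, the model is pushed to capture features of consistency between the two sequences, such as topic, personality and style, rather than semantic relevance or coherence.
To compute the consistency score of the sequence pair $\langle u, v \rangle$, we first concatenate the two utterances into $x_{cd} = \{[CLS], u, [SEP], v, [SEP]\}$ and feed it into BERT.

As described for the previous tasks, BERT returns the aggregated representation $E^{cd}_{[CLS]}$.
The consistency score $g_{cd}(u, v)$ is then computed from $E^{cd}_{[CLS]}$ with a non-linear transformation.
In the same way, we obtain the consistency score $g_{cd}(u, \tilde{v})$ for $\langle u, \tilde{v} \rangle$.

Finally, we define the following hinge loss so that $g_{cd}(u, v)$ exceeds $g_{cd}(u, \tilde{v})$ by at least a margin $\Delta$.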
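The loss is not shown in these notes. A margin requirement of this kind is usually written as $\max\{0, \Delta - g_{cd}(u, v) + g_{cd}(u, \tilde{v})\}$, and the sketch below assumes that standard form. The function name is made up, and the scores are plain floats rather than model outputs.

```python
def consistency_hinge_loss(score_pos: float, score_neg: float, margin: float) -> float:
    """Hinge loss that is zero once score_pos exceeds score_neg by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)

# Same-speaker pair already beats the sampled pair by more than the margin: no loss.
print(consistency_hinge_loss(0.9, 0.1, margin=0.6))
# Both pairs score the same: the loss equals the margin.
print(consistency_hinge_loss(0.5, 0.5, margin=0.6))
```

The margin 0.6 used in the example is the value the implementation details report as the best choice for $\Delta$.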

2.4. Learning Objective

We adopt a multi-task learning manner and define the final objective function as:
where α is a hyper-parameter as a trade-off between the objective of the main task and those of the auxiliary tasks.
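The objective itself is not shown in these notes. With a single trade-off weight $\alpha$ between the main task and the auxiliary tasks, it presumably has the following form; the per-task loss symbols are names assumed here.

$$
\mathcal{L}_{final} = \mathcal{L}_{crm} + \alpha \big( \mathcal{L}_{nsp} + \mathcal{L}_{ur} + \mathcal{L}_{id} + \mathcal{L}_{cd} \big)
$$

Under this form, the reported choice $\alpha = 1.0$ means the five losses are simply summed.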
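In this way, all tasks are learned jointly, so the model can make effective use of the training corpus and learn both the characteristics of dialogue text and the implicit knowledge contained in dialogue data.
The auxiliary tasks can be regarded as regularization in model estimation that improves the generalization ability of the model.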

3. Experiments

3.1. Datasets and Evaluation Metrics

We evaluate the proposed method on two benchmarks for multi-turn dialogue response selection.
The first dataset is the Ubuntu Dialogue Corpus (v1.0), which consists of multi-turn English dialogues about technical support collected from chat logs of the Ubuntu forum.

We use the copy shared by Gu et al., in which numbers, paths and URLs have been replaced by placeholders.
The Ubuntu dataset has 1 million context-response pairs for training and 0.5 million pairs for validation and test.
The ratio of positive to negative candidates is 1:1 in the training set and 1:9 in the validation and test sets.

The second dataset is the E-commerce Dialogue Corpus, which consists of real-world multi-turn dialogues between customers and customer service staff on Taobao,

a very large e-commerce platform in China.
The E-commerce dataset has 1 million context-response pairs for training and 10 thousand pairs for validation and test.
The positive-to-negative ratio is 1:1 in training and 1:9 in validation and test.

Following Lowe et al. and Zhang et al., we use $R_n@k$ as the evaluation metric, where $R_n@k$ is the recall at position k in n candidates: it measures the probability that the positive response is ranked in the top k positions among n candidates.
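For a single context, the metric can be computed as in the sketch below; the dataset-level number is the average over all test contexts. The function name and the toy scores are made up for illustration.

```python
from typing import List

def recall_at_k(scores: List[float], labels: List[int], k: int) -> float:
    """R_n@k for one context: share of positive candidates ranked in the top k of n."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = sum(labels)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / positives if positives else 0.0

# n = 10 candidates with one positive (index 2), which gets the second-highest score.
scores = [0.10, 0.80, 0.65, 0.05, 0.30, 0.20, 0.15, 0.40, 0.25, 0.35]
labels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
r10_at_1 = recall_at_k(scores, labels, k=1)
r10_at_2 = recall_at_k(scores, labels, k=2)
print(r10_at_1, r10_at_2)
```

With the 1:9 positive-to-negative ratio of the validation and test sets, each context has exactly one positive among ten candidates, so the per-context value is either 0 or 1.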

3.2. Baseline Models

We compared BERT-SL with the following models:
DualLSTM (Lowe et al., 2015):

the model concatenates all utterances in the context to form a single sequence and calculates a matching score based on the representations produced by an LSTM.

Multi-View (Zhou et al., 2016):

the model measures the matching degree between the context and the response candidate in both a word view and an utterance view.

SMN (Wu et al., 2017):

the model lets each utterance in the context interact with the response candidate, and the matching vectors of all utterance-response pairs are aggregated with an RNN to calculate a final matching score.

DUA (Zhang et al., 2018):

the model formulates previous utterances into context using a deep utterance aggregation model, and performs context-response matching similar to SMN.

DAM (Zhou et al., 2018):

the model is similar to SMN, but utterances in the context and the response candidate are represented with stacked self-attention and cross-attention layers. The matching vectors are aggregated with a 3-D CNN.

MRFN (Tao et al., 2019b):

the model employs multiple types of representations for context-response interaction, where each type encodes semantics of units from a kind of granularity or dependency among the units.

ESIM (Chen & Wang, 2019):

the model first concatenates all utterances in the context into a single sequence, and then employs ESIM structure derived from NLI for context response matching.

IMN (Gu et al., 2019):

following Wu et al. (2017), the model enhances the representations at both the word- and sentence-level and collects matching information of utterance-response pairs bidirectionally.

IoI (Tao et al., 2019a):

the model lets the context-response matching process go deep along the interaction block chain via representations in an iterative fashion.

MSN (Yuan et al., 2019):

the model utilizes a multi-hop selector to select the relevant utterances in context and then matches the filtered context with the response candidate to obtain a matching score.

BERT (Whang et al., 2020):

the model fine-tunes the BERT with the concatenation of the context and the response candidates as the input.

BERT-VFT (Whang et al., 2020):

before fine-tuning, the model is post-trained on the training corpus in the same way BERT is trained.

SA-BERT (Gu et al., 2020):

the model follows BERT-VFT and additionally incorporates speaker-aware embeddings.

3.3. Implementation Details

Following Gu et al. (2020), we select English uncased BERTbase (110M) as the context-response matching model for the Ubuntu dataset and Chinese BERTbase model for the E-commerce dataset.

We implement the models with the code in https://github.com/huggingface/transformers.

The maximum lengths of the context and response were set to 448 and 64 as the maximum length of input sequence in BERT is 512.
Intuitively, the last tokens in the context and the previous tokens in the response candidate are more important, so we cut off the previous tokens for the context but do the cut-off in the reverse direction for the response candidate if the sequences are longer than the maximum length.
We choose 32 as the size of mini-batches for training.
On both the Ubuntu dataset and the E-commerce dataset, we applied domain adaptive post-training before the fine-tuning procedure following the settings of Whang et al. (2020).
Training instances of auxiliary tasks are generated dynamically.
We select ∆ (Equation (9)) in {0.2, 0.4, 0.6, 0.8} and find that 0.6 is the best choice.
We vary α (Equation (10)) in {0.1, 0.2, 0.5, 1.0} and choose α = 1.0 as the trade-off between the learning objectives.
The model is optimized using the Adam optimizer with a learning rate of 3e-5.
Early stopping on the validation data is adopted as a regularization strategy.
All the experiment results except ours are cited from previous works.
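The truncation rule described above (keep the last tokens of the context, keep the first tokens of the response) can be sketched with list slicing. The function name is made up, and whether the special tokens count toward the 448 and 64 limits is not stated in these notes, so the sketch ignores them.

```python
from typing import List, Tuple

def truncate(
    context_tokens: List[str],
    response_tokens: List[str],
    max_context: int = 448,
    max_response: int = 64,
) -> Tuple[List[str], List[str]]:
    """Cut the context from the front and the response from the back."""
    kept_context = context_tokens[-max_context:]     # the most recent tokens survive
    kept_response = response_tokens[:max_response]   # the leading tokens survive
    return kept_context, kept_response

long_context = [f"c{i}" for i in range(500)]
long_response = [f"r{i}" for i in range(100)]
ctx, resp = truncate(long_context, long_response)
print(len(ctx), ctx[0], len(resp), resp[-1])
```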

3.4. Evaluation Results

3.5. Discussions

4. Related Works

5. Conclusion

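In this paper, we learn a context-response matching model together with four auxiliary self-supervised tasks designed for dialogue data.
Trained jointly with these auxiliary tasks, the matching model can effectively learn the task-related knowledge contained in dialogue data, reach a better local optimum, and produce better features for response selection.
Experimental results on two benchmarks show that the proposed auxiliary self-supervised tasks bring significant improvements for multi-turn response selection in retrieval-based dialogues, and that our PLM-based model achieves new SoTA results.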

Reference

https://arxiv.org/pdf/2009.06265.pdf

NL-115, Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues (2021-AAAI)
