NL-063, E2E NLG Challenge: Neural Models vs. Templates (2018-INLG)

■ Comment

This paper also discusses two approaches: a template-based system and a neural system.
It explains the key parts of TGen, the baseline, quite well.
Note the input preprocessing used in Model-D:

Treat each attribute value as a single token
Filter out long training sentences

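Model-T is the template-based approach. Its automatic metric scores lag somewhat behind Model-D, but the authors say they chose it as the final submission because its output is grammatical and covers all of the input information.
It apparently also fell into one of the better groups in the human evaluation, and the authors seem to argue from this that simple approaches deserve a place alongside complex data-driven neural networks.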

0. Abstract

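The E2E NLG Challenge is a task of generating restaurant descriptions from sets of key-value pairs.

This paper describes the results of our participation in the challenge.

We present a simple but effective neural encoder-decoder model.

It produces more fluent restaurant descriptions than a strong baseline.

We analyze the data provided by the organizers and conclude that the task can also be approached with a template-based model developed in just a few hours.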

1 Introduction

NLG is the task of generating natural language utterances from structured data representations.

The E2E NLG Challenge is a shared task focused on end-to-end data-driven NLG methods.

These methods are attracting a lot of attention.

The reason is that they jointly learn textual structure and surface realization patterns from non-aligned data.
This reduces much of the human annotation effort needed to create NLG corpora (Wen et al., 2015; Mei et al., 2016; Dušek and Jurčíček, 2016; Lampouras and Vlachos, 2016).

Our submission to the challenge makes the following contributions.

(1) we show how exploiting data properties allows us to design more accurate neural architectures
(2) we develop a simple template-based system which achieves performance comparable to neural approaches.

1.1 Task Definition

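The organizers provide a crowdsourced dataset of 50k instances in the restaurant domain.
Each training instance consists of a dialogue act-based MR (meaning representation) and up to 16 references.
The data was collected using pictorial representations as stimuli, with the aim of obtaining more natural, informative, and varied human references than textual inputs would elicit.
The task is to generate an utterance for a given MR; the output is scored by its similarity to the human-generated reference texts and is also rated by humans.
Similarity is measured with standard evaluation metrics: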

BLEU (Papineni et al., 2002), NIST (Doddington, 2002)
METEOR (Lavie and Agarwal, 2007)
ROUGE-L (Lin, 2004)
CIDEr (Vedantam et al., 2015)

The final evaluation, however, is based on human ratings collected from crowd workers and experts.
To make it easier to evaluate the submitted approaches, the organizers provide TGen as the baseline E2E system.

TGen is a seq2seq neural system with attention.
It decodes with beam search and applies a reranker over the top k candidates, penalizing those that do not express all attributes of the MR.
TGen also includes a delexicalization module.
This module handles sparsely occurring MR attributes: in pre-processing it maps specific values to placeholder tokens, and in post-processing it substitutes the actual values back for the placeholders.
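
The delexicalization round trip described above can be sketched in a few lines. This is only an illustration of the idea, not TGen's actual module: the placeholder names (`X-name`, `X-near`), the helper functions, and the sample MR are all assumptions made for this sketch.

```python
# Hypothetical placeholder tokens; TGen's real module may use different ones.
PLACEHOLDERS = {"name": "X-name", "near": "X-near"}

def delexicalize(mr, text):
    """Pre-processing: replace sparse attribute values in the text with placeholders."""
    mapping = {}
    for key, placeholder in PLACEHOLDERS.items():
        value = mr.get(key)
        if value:
            text = text.replace(value, placeholder)
            mapping[placeholder] = value
    return text, mapping

def relexicalize(text, mapping):
    """Post-processing: put the real values back in place of the placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

# Illustrative MR and reference text (not taken from the dataset).
mr = {"name": "Cotto", "eatType": "coffee shop", "near": "The Bakers"}
reference = "Cotto is a coffee shop near The Bakers."

delex_text, mapping = delexicalize(mr, reference)
print(delex_text)
print(relexicalize(delex_text, mapping))
```

The model then only ever sees the placeholder tokens, so rare restaurant names do not have to be learned as separate vocabulary items.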

2 Our Approach

We describe two approaches.

The first, Model-D, is a "data-driven" approach: an encoder-decoder neural system.

It is similar to TGen but uses a more efficient encoder module.

The second, Model-T, is a simple template-based system built from the results of our data analysis.

2.1 Model-D

This model was motivated by two important properties of the data:

fixed number of unique MR attributes
low diversity of the lexical instantiations of the MR attribute values

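Each input MR contains a fixed number of unique attributes (between three and eight).

This makes it possible to associate a positional id with each attribute, so that the attribute names (or keys) themselves do not have to be encoded.

This shortens the encoded sequence and presumably makes the encoder easier to train.
It also unifies the length of the MR inputs, which allows a simpler and somewhat more efficient neural network to be used.

An efficient network here means one that processes the input sequence in one step rather than sequentially (e.g., an MLP).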

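One might think that using an MLP is complicated, because neither the number of active input MR keys nor the number of tokens in their values is fixed.
For example, the MR key price can take a one-token value such as "low" or a longer one such as “less than £10”.
In practice, however, the MR attribute values show low variability.

Six of the eight keys have fewer than seven unique values, and the other two keys (name, near) denote named entities, which are easy to delexicalize.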

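This means each value can be treated as a single token, even when it consists of several words (e.g. “more than £30”, “Fast food”).
In other words, we do not know in advance how many attributes an input will contain, but there are only eight attribute keys and most of them have fewer than seven possible values, so the input is easy to handle.

So even when an attribute value consists of several words, it can be treated as one token, because there are so few distinct values.
See Table 1 of the paper for an example.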

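Each predicted output is the text of a restaurant description.
According to the report by Novikova et al., the average number of words in a reference is 20.1.
We set the cut-off threshold to 50 and filtered out training instances with longer restaurant descriptions. (Probably because the GRU model used here is not well suited to long texts.)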
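
A minimal sketch of this length filter, under the assumptions that the threshold of 50 counts whitespace-separated tokens and that a description of exactly 50 tokens is kept (the notes do not say either way). The function name and the sample pairs are hypothetical.

```python
MAX_TOKENS = 50  # cut-off threshold mentioned above; the unit (tokens) is an assumption

def filter_long_instances(pairs, max_tokens=MAX_TOKENS):
    """Keep (mr, text) pairs whose text has at most max_tokens whitespace tokens."""
    return [(mr, text) for mr, text in pairs if len(text.split()) <= max_tokens]

# Illustrative training pairs: one short description, one artificially long one.
pairs = [
    ("name[Cotto]", "Cotto is a coffee shop ."),
    ("name[Cotto]", " ".join(["word"] * 60)),
]

kept = filter_long_instances(pairs)
print(len(kept))
```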
The system has an encoder-decoder architecture made up of three modules.

1) an embedding matrix
2) one dense hidden layer as an encoder
3) a RNN-based decoder with gated recurrent units (GRU)

The model input is built as follows.

The MR keys are taken in alphabetical order and each key is assigned a positional id.
Keys for which no value is given are filled with PAD.
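
The fixed-slot input can be sketched as follows. This is an illustration under assumptions: the key spellings follow these notes rather than the dataset files, the `<PAD>` and `<EOS>` symbols and the `encode_mr` helper are hypothetical, and the name is shown already delexicalized.

```python
# Eight MR keys in alphabetical order; the index in this list is the positional id.
MR_KEYS = sorted(["name", "eatType", "food", "price", "customerRating",
                  "area", "familyFriendly", "near"])
PAD, EOS = "<PAD>", "<EOS>"

def encode_mr(mr):
    """Return one token per key slot (PAD where the key is absent), followed by EOS."""
    return [mr.get(key, PAD) for key in MR_KEYS] + [EOS]

# Multi-word values such as "coffee shop" stay whole: one value, one token.
mr = {"name": "X-name", "eatType": "coffee shop", "food": "Fast food", "price": "low"}
tokens = encode_mr(mr)

for position, (key, token) in enumerate(zip(MR_KEYS, tokens)):
    print(position, key, token)
```

Because every MR is mapped to the same nine positions (eight slots plus EOS), the key names never need to appear in the input, and the encoder can look at all slots in one step.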

Given an instance of a (MR, text) pair, we decompose the MR into eight components (mrj in Figure 2), each corresponding to a value for a unique MR key, and add an end-of-sentence symbol (EOS) to denote the end of the encoded sequence. Each component is represented as a high-dimensional embedding vector. Each embedding vector is further mapped to a dense hidden representation via an affine transformation followed by a ReLu (Nair and Hinton, 2010) function. These hidden representations are further used by the decoder network, which in our case is a unidirectional GRU-based RNN with an attention module (Bahdanau et al., 2014). The decoder is initialized with an average of the encoder outputs.
The decoder generates a sequence of tokens, one token at a time, until it predicts the EOS token. Our model employs the greedy search decoding strategy and does not use any reranker module.
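
The two decoding details mentioned above (initializing from the average of the encoder outputs, and greedy search without a reranker) can be illustrated with a toy sketch. Nothing here is the authors' code: `average`, `greedy_decode`, `toy_step`, and the scripted score tables are hypothetical stand-ins for the GRU decoder.

```python
EOS = "<EOS>"

def average(vectors):
    """Element-wise mean, mimicking a decoder state initialized from encoder outputs."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def greedy_decode(step, state, max_len=50):
    """Emit the highest-scoring token at each step until EOS (no beam, no reranker)."""
    tokens = []
    for _ in range(max_len):
        scores, state = step(state, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens

# Toy replacement for the decoder: a fixed script of per-step token scores.
SCRIPT = [
    {"X-name": 0.7, "It": 0.2, EOS: 0.1},
    {"is": 0.8, "serves": 0.1, EOS: 0.1},
    {"cheap": 0.6, "pricey": 0.3, EOS: 0.1},
    {".": 0.6, EOS: 0.4},
    {EOS: 0.9, ".": 0.1},
]

def toy_step(state, tokens):
    """Return the scores for the current step and the next state (a step counter)."""
    return SCRIPT[state], state + 1

print(average([[1.0, 2.0], [3.0, 4.0]]))
print(greedy_decode(toy_step, 0))
```

Greedy search commits to the single best token at each step, which is cheaper than TGen's beam search but cannot recover from an early mistake.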

2.2 Model-T

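Given the low lexical variation of the MR attribute values, it is natural to ask whether a deterministic NLG system could handle this task.
We examined how MR attribute keys and values are verbalized in the training data and found that most text descriptions verbalize the MR attributes in a similar order.
Here [X] denotes an MR key.
This pattern becomes the core template of Model-T.
The pattern does not suit the verbalization of every MR attribute, however.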

For example, the key-value pair customerRating[3 out of 5] would be verbalized as “. . . has a 3 out of 5 customer rating”, which is not the best phrasing.
A better description would be “. . . has a customer rating of 3 out of 5”.

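We incorporate such variations into Model-T using a simple set of rules that modify the general template depending on the specific values of the MR attributes.
As mentioned in Section 2.1, each input instance can have up to eight MR attributes.
To account for this, we decompose the general template into smaller components, each corresponding to a specific MR attribute of the input.
We also create a set of rules that activate each component depending on whether its MR attribute is part of the input.
For example, if price is not in the input MR attribute set, the price component is left out of the general template.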
Finally, we also add a simple post-processing step to handle specific punctuation and article choices.
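
The component-and-rule design described above can be sketched roughly as below. The component wording, the `realize` and `article_for` helpers, the fallback word "restaurant", and the vowel-based article rule are all assumptions for illustration; the paper's actual template text and rules are not reproduced in these notes.

```python
def article_for(phrase):
    """Post-processing rule: pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"

def realize(mr):
    """Build a description from template components activated by the keys present in mr."""
    kind = " ".join(v for v in (mr.get("food"), mr.get("eatType", "restaurant")) if v)
    sentence = f"{mr['name']} is {article_for(kind)} {kind}"
    if "price" in mr:  # component is activated only when the key is in the input
        sentence += f" in the {mr['price']} price range"
    parts = [sentence + "."]
    if "customerRating" in mr:
        rating = mr["customerRating"]
        if "out of" in rating:  # value-specific rule: 'rating of 3 out of 5' reads better
            parts.append(f"It has a customer rating of {rating}.")
        else:
            parts.append(f"It has {article_for(rating)} {rating} customer rating.")
    location = []
    if "area" in mr:
        location.append(f"in the {mr['area']} area")
    if "near" in mr:
        location.append(f"near {mr['near']}")
    if location:
        parts.append("It is located " + ", ".join(location) + ".")
    return " ".join(parts)

# Illustrative MR without a price attribute, so the price component stays inactive.
mr = {"name": "Cotto", "eatType": "coffee shop", "food": "Indian",
      "customerRating": "3 out of 5", "area": "riverside"}
print(realize(mr))
```

Since every rule is deterministic, the same MR always yields the same text, and every attribute present in the input is guaranteed to appear in the output.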

3 Metric Evaluation

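The TGen results come from a single run rather than from multiple runs, so the score distributions cannot be compared. (The point is that neural approaches are non-deterministic.)
Model-T is a deterministic system, so the result of a single run is sufficient.
The results show that Model-D outperforms TGen on all five metrics.
Model-T scores lower than both TGen and Model-D. Because Model-T is not data-driven, the text it generates can differ from the reference outputs.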
Table 3 shows the error types and the number of mistakes found in each of the prediction files.
The error types should be self-explanatory (sample predictions are given in Appendix A.2).
In Table 3, Model-T makes no grammatical errors at all, since it is rule-based.
Most of Model-D's errors are incorrect verbalizations of input MR values or punctuation mistakes.
An easy solution to this problem is adding a postprocessing step which fixes punctuation mistakes before outputting the text.
Crucially, Model-D often drops or modifies some MR attribute values.
According to the organizers, 40% of the data by design contains additional or omitted information in the output (Novikova et al., 2017b).
Crowd workers were allowed to not lexicalize attribute values which they deemed unimportant.
We examined the training data to check whether Model-D had learned these discrepancies from the data.
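
As a rough illustration of what a dropped attribute value looks like, the check below flags MR values that never appear in a text. This is not the authors' analysis procedure, which these notes do not describe; `missing_values` is a hypothetical helper, and verbatim string matching is a crude assumption that would also flag legitimate paraphrases.

```python
def missing_values(mr, text):
    """Return the keys whose value string does not occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [key for key, value in mr.items() if value.lower() not in lowered]

# Illustrative MR and two candidate descriptions.
mr = {"name": "Cotto", "food": "Indian", "area": "riverside"}

print(missing_values(mr, "Cotto serves Indian food."))
print(missing_values(mr, "Cotto serves Indian food in the riverside area."))
```

Running the same check over the training references would give a first impression of how often the human-written texts themselves omit attribute values.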

4 Training Data Analysis

This section explains that the training data itself also contains errors. (Details omitted...)
Most mistakes come from ungrammatical constructions, e.g. incorrect phrase attachment decisions (“The price of the food is high and is located . . . ”), incorrect usage of articles (“located in riverside”), repetitive constructions (“Cotto, an Indian coffee shop located in . . . , is an Indian coffee shop . . . ”).
Some restaurant descriptions follow a tweet-style narration pattern which is understandable, but ungrammatical (“The Golden Palace Italian riverside coffee shop price range moderate and customer rating 1 out of 5”).

4.1 Final Evaluation

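We chose the predictions of Model-T as the final submission: its metric scores are lower than Model-D's, but its outputs are mostly grammatical and include the input information in the generated text.
The results of the final submission are given in Table 5.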
The E2E NLG Challenge focuses on end-to-end data-driven NLG methods, which is why systems like Model-T might not exactly fit into the task setup.
Nevertheless, we view such a system as a necessary candidate for comparison, since the E2E NLG Challenge data was designed to learn models that produce “more natural, varied and less template-like system utterances” (Novikova et al., 2017b)

5 Conclusion

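This paper presents the results of our participation in the E2E NLG Challenge.
We presented two approaches and evaluated their performance both quantitatively and qualitatively.
We show that approaching a problem with simple techniques is sometimes better than using complex data-driven models.
We hope our findings will lead to ways of overcoming the limitations of modern NLG approaches.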

Reference

https://www.aclweb.org/anthology/W18-6557.pdf
