NL-060, Document Similarity for Texts of Varying Lengths via Hidden Topics (2018-ACL)

■ Comment
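•I read this paper to learn how to compute document similarity.
•From what I could find, there are not many papers on document similarity.
•A quick search also turned up no papers that use BERT, the currently dominant approach.
•Rather than a learned model, the paper proposes an SVD-based method and uses the topic vectors it extracts.
•A topic here is a concept that makes up a document; it lives in a vector space, so it differs from an ordinary class label.
•The number of topics is a model parameter. The method builds on the idea of eigenvalues and proposes a way to measure each topic's importance; whether this measure really matters is said to be supported experimentally in the supplementary material.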

0. Abstract

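Measuring similarity between texts is important for many applications.
Available approaches fall short when the two documents in a pair have non-comparable lengths.

A typical example is a long document and its summary.

The difficulty comes from the gap between a long document rich in detail and a concise summary that carries only abstract information (contextual vs. abstraction).
The paper proposes a document matching approach that bridges this gap by comparing the texts in a common space of hidden topics.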

1. Introduction

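Measuring the similarity between documents is key to many NLP applications, including information retrieval, book recommendation, news categorization, and essay scoring.
A central assumption in these tasks is that the documents being compared have similar lengths.
Advances in NLU technologies such as text summarization and recommendation have created new requirements for document comparison.
For example, summarization techniques (extractive and abstractive) automatically turn a long document of hundreds of words into a condensed text of only a few words.

In doing so, they preserve the core meaning of the original text.

A related aspect of summarization is the bidirectional matching of a summary and a document.
The document similarity considered in the paper covers pairs that differ not only in length but also in abstraction level.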

abstract level: a definition of an abstract concept versus a detailed instance of that abstract concept

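Table 1 shows an example of the task of matching a Concept to a Project.
Concept

Heredity: Inheritance and Variation of Traits
All cells contain genetic information in the form of DNA molecules. Genes are regions in the DNA that contain the instructions that code for the formation of proteins.

Project

Pedigree Analysis: A Family Tree of Traits
Introduction: Do you have the same hair color or eye color as your mother? When we look at members of a family, it is easy to see that some physical characteristics or traits are shared. To start this project, you should draw a pedigree showing the different members of your family. Ideally it should include multiple people from at least three generations.
Materials and Equipment: Paper, pen, lab notebook. Procedure: Before starting this science project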

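A Concept represents a grade-level science curriculum item and plays the role of the summary.
A Project comes from a list of science projects and plays the role of the document.
Projects generally have long texts.

They include an introduction, materials, and procedures.

Science concepts, in contrast, are short: a title and a brief description.
Both are described in detail in Sec. 5.1.
The matching task is to automatically suggest projects for a given curriculum concept, so that the project helps reinforce the learner's basic understanding of the concept.
Conversely, given a science project, one may need to match it against the list of curriculum concepts to identify which concepts it covers.
This can be thought of in the context of an intelligent tutoring system.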
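The matching task described above poses the following challenges.
1) The relative lengths of the documents being compared do not match.

A long text (the document) and a short text (the summary) give rise to a vocabulary mismatch problem.
The document and the summary may not share the key terms.

2) A context mismatch arises because the document provides a reasonable amount of text from which to infer contextual meaning, whereas the summary provides only limited context and may not contain the same terms as the document.
The paper argues that these challenges make existing approaches (e.g., Doc2Vec) problematic.
Summaries are relatively noise-free, which makes Doc2Vec ill-suited to the comparison.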

3. Domain Knowledge

As will be described later, our distance metric for comparing a document and a summary relies on word embeddings. We show in this work, that embeddings trained on a science-domain corpus lead to better performance than embeddings on the general corpus (WikiCorpus).
Thus, we conclude that domain specific embeddings (here, science), is capable of incorporating domain knowledge into word representations. We use this observation in our document-summary matching system to which we turn next.

4. Model

The proposed model consists of three modules.

4.1. Preprocessing

The preprocessing module tokenizes texts and removes stop words and prepositions(전치사).
This step allows our system to focus on the content words without impacting the meaning of original texts.
That is, dropping stop words and prepositions keeps the model focused on content words while leaving the meaning of the original text intact.
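The preprocessing step can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the stop-word and preposition list `STOP_WORDS` and the function name `preprocess` are hypothetical, and the tokenizer is a plain regular expression.

```python
import re

# Hypothetical stop-word and preposition list, for illustration only;
# these notes do not say which list the paper actually uses.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "that",
              "of", "in", "on", "at", "to", "for", "with", "by", "from"}

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words and prepositions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Genes are regions in the DNA that contain the instructions.")
print(tokens)
```

Only the content words of the sentence remain, which is the input the later modules would turn into word vectors.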

4.2. Topic Generation from Documents

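A document is a long text and is assumed to be a structured collection of words, where the ‘structure’ comes from its composition of topics.

this ‘structure’ is represented as a set of hidden topics

The document is therefore assumed to be generated from certain hidden "topics" (similar to the modeling assumption of LDA).
Unlike LDA, however, the "topics" here are neither specific words nor distributions over words; they are essentially a set of vectors.

This means that the words making up a document (represented as vectors) can be generated from the hidden topic vectors.

Word vectors in the document: $\{\textbf{w}_1, \ldots, \textbf{w}_n\}$
Hidden topic vectors of the document: $\{\textbf{h}_1, \ldots, \textbf{h}_K\}$
Both are $d$-dimensional, with $d = 300$.
Word embeddings are known to reflect the compositional properties of words (the embedding of a phrase is nearly the sum of the embeddings of its component words).
Hence the words of a document can be reconstructed as linear combinations of the document's hidden topics while minimizing the reconstruction error.
The K topic vectors are stacked as a topic matrix $\textbf{H} = [\textbf{h}_1, \ldots, \textbf{h}_K]$ ($K < d$).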
We define the reconstructed word vector $\tilde{\textbf{w}}_i$ for the word $\textbf{w}_i$ as the optimal linear approximation given by topic vectors: $\tilde{\textbf{w}}_i = \textbf{H} \tilde{\alpha}_i$

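The coefficient $\tilde{\alpha}_i$ is the $\alpha_i$ for which $\textbf{H}\alpha_i$ best approximates $\textbf{w}_i$, i.e. the projection of $\textbf{w}_i$ onto the topic matrix $\textbf{H}$:

$\tilde{\alpha}_i = \arg\min_{\alpha_i \in \mathbb{R}^K} ||\textbf{w}_i - \textbf{H}\alpha_i||^{2}_{2}$ (1)

$\tilde{\textbf{w}}_i$ is what this $\tilde{\alpha}_i$ yields, i.e. the approximation of $\textbf{w}_i$ built from the fitted coefficients.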

The reconstruction loss is then:

$E = \sum_{i=1}^{n} ||w_i - \tilde{w}_i ||^{2}_{2}$

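The ultimate goal is to find the $\textbf{H}^{*}$ that minimizes this $E$:

$\min_{\textbf{H}} E(\textbf{H}) = \min_{\textbf{H}} \sum_{i=1}^{n} \min_{\alpha_i} ||\textbf{w}_i - \textbf{H}\alpha_i||^{2}_{2}$ (2)

$||\cdot||$ is the Frobenius norm of a matrix.
At this point both $\alpha_i$ and $\textbf{H}$ are unknown, and what we ultimately want is the optimal $\textbf{H}^{*}$.
Reading (1) and (2) together: for a given $\textbf{H}$, first find the $\alpha_i$ that best approximate each $\textbf{w}_i$.
Then sum the resulting errors (the reconstruction loss) over all words.
This sum changes as $\textbf{H}$ changes, and the $\textbf{H}$ that minimizes it is the $\textbf{H}^{*}$ we are looking for.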
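The two-level minimization can be made concrete with a small NumPy sketch. It assumes toy random vectors in place of real word embeddings, and the function name `reconstruction_error` is hypothetical; the inner minimization over the coefficients is solved by ordinary least squares.

```python
import numpy as np

def reconstruction_error(W, H):
    """E(H) = sum_i min_alpha ||w_i - H alpha_i||^2 for a d x n matrix W and a d x K matrix H."""
    # Inner step: least-squares coefficients for all words at once (shape K x n).
    alphas, *_ = np.linalg.lstsq(H, W, rcond=None)
    W_tilde = H @ alphas                      # reconstructed word vectors
    return float(np.sum((W - W_tilde) ** 2))

rng = np.random.default_rng(0)
d, n, K = 6, 20, 2                            # toy sizes, not the paper's d = 300
W = rng.normal(size=(d, n))                   # stand-in word vectors, one per column
H_random = rng.normal(size=(d, K))            # an arbitrary candidate topic matrix
H_full = np.eye(d)                            # spans the whole space (sanity check only)

err_random = reconstruction_error(W, H_random)
err_full = reconstruction_error(W, H_full)
print(err_random, err_full)
```

An arbitrary `H_random` leaves a positive error, while a basis of the whole space reconstructs every word exactly. The outer minimization searches for the best K-dimensional subspace between these two extremes.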

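Without loss of generality, we require $\{ \textbf{h}_i \}^{K}_{i=1}$ to be orthonormal ( $h^{T}_{i}h_{j}=1_{(i=j)}$ ).

Because (2) depends only on the linear space spanned by the topic vectors, their norms and linear dependencies do not matter.

This leads to the following constraint:

$\textbf{H}^{T}\textbf{H} = \textbf{I}$

Originally we only look for the $\alpha_i$ for which $\tilde{\textbf{w}}_i = \textbf{H}\alpha_i$ best approximates $\textbf{w}_i$; under the constraint, that optimal coefficient is $\tilde{\alpha}_i = \textbf{H}^{T}\textbf{w}_i$.
The reconstruction then becomes $\tilde{\textbf{w}}_i = \textbf{H}\textbf{H}^{T}\textbf{w}_i$, the orthogonal projection of $\textbf{w}_i$ onto the span of the topics. Note that the condition is $\textbf{H}^{T}\textbf{H} = \textbf{I}$; $\textbf{H}\textbf{H}^{T}$ equals $\textbf{I}$ only when $K = d$, in which case $\tilde{\textbf{w}}_i = \textbf{w}_i$ exactly.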
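The closed form for the coefficients follows from a short least-squares argument. The derivation below is a sketch that relies only on the orthonormality of the topic vectors; the symbol $J$ for the per-word objective is introduced here for convenience and does not come from the paper.

```latex
\begin{aligned}
J(\alpha_i) &= \|\mathbf{w}_i - \mathbf{H}\alpha_i\|_2^2 \\
\nabla_{\alpha_i} J &= -2\,\mathbf{H}^{T}(\mathbf{w}_i - \mathbf{H}\alpha_i) = 0
  \;\Rightarrow\; \mathbf{H}^{T}\mathbf{H}\,\alpha_i = \mathbf{H}^{T}\mathbf{w}_i \\
\mathbf{H}^{T}\mathbf{H} = \mathbf{I}
  &\;\Rightarrow\; \tilde{\alpha}_i = \mathbf{H}^{T}\mathbf{w}_i,
  \qquad \tilde{\mathbf{w}}_i = \mathbf{H}\tilde{\alpha}_i = \mathbf{H}\mathbf{H}^{T}\mathbf{w}_i
\end{aligned}
```

With orthonormal topics, no linear system has to be solved per word: a single matrix product gives all coefficients.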

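Stack the word vectors in the document as a matrix $\textbf{W} = [\textbf{w}_1, \ldots, \textbf{w}_n]$.

Equation (2) can then be written in matrix form.
Since (3) is simply (2) plus the constraint, it reads:

$\min_{\textbf{H}} ||\textbf{W} - \textbf{H}\textbf{H}^{T}\textbf{W}||^{2} \quad \text{s.t.} \quad \textbf{H}^{T}\textbf{H} = \textbf{I}$ (3)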

This problem can be solved with Singular Value Decomposition (SVD).

Briefly, on SVD:
$\textbf{W}$ is decomposed as $\textbf{W} = \textbf{U}\Sigma \textbf{V}^T$.
$\textbf{U}^T\textbf{U} = \textbf{I}$, $\textbf{V}^T\textbf{V} = \textbf{I}$, and $\Sigma$ is a diagonal matrix whose diagonal elements are arranged in decreasing order of absolute value.
Only K vectors are taken from the resulting matrix $\textbf{U}$.
The first K vectors in the matrix $\textbf{U}$ are exactly the solution $\textbf{H}^{*} = [\textbf{h}^{*}_{1}, \ldots, \textbf{h}^{*}_{K}]$.
That is, the columns of $\textbf{U}$ are orthonormal, so the K columns with the largest singular values can be taken as the solution for $\textbf{H}$.
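The SVD step can be sketched with NumPy as follows. Toy random vectors stand in for real word embeddings, and the function name `hidden_topics` is hypothetical. The last lines check a standard property of truncated SVD: the residual of the rank-K projection equals the sum of the discarded squared singular values.

```python
import numpy as np

def hidden_topics(W, K):
    """Return the d x K topic matrix (first K left singular vectors) and all singular values."""
    # np.linalg.svd returns the singular values sorted in decreasing order.
    U, sigma, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :K], sigma

rng = np.random.default_rng(0)
d, n, K = 8, 30, 3                       # toy sizes, not the paper's d = 300, K = 15
W = rng.normal(size=(d, n))              # stand-in word vectors, one per column
H, sigma = hidden_topics(W, K)

# Reconstruction error of objective (3) for this H.
error = float(np.sum((W - H @ H.T @ W) ** 2))
discarded = float(np.sum(sigma[K:] ** 2))
print(error, discarded)
```

The two printed numbers agree up to floating-point error, which is why keeping the leading singular vectors minimizes the reconstruction loss.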

We note that these topic vectors are not equally important, and we say that one topic is more important than another if it can reconstruct words with smaller error

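That is, a topic vector that reconstructs the words with smaller error is more important.
This seems to restate the idea behind the eigenvalues (singular values) once more.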

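$E_k$ is the reconstruction error when $\textbf{h}_{k}^{*}$ is the only topic vector used:

$E_k = ||\textbf{W} - \textbf{h}_{k}^{*}\textbf{h}_{k}^{*T}\textbf{W}||^{2}$

The paper then defines $i_k$ as the importance of topic $\textbf{h}_{k}^{*}$; it measures the topic's ability to reconstruct the words in the document.

The supplementary material shows that the larger $i_k$ is, the smaller the reconstruction error $E_k$.
In other words, K vectors $\textbf{h}_k$ are taken as above, but each one's importance is proportional to $i_k$, which is defined through reconstruction ability rather than read off directly as an eigenvalue.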
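The link between $i_k$ and $E_k$ can be checked by hand. The sketch below assumes the importance is defined as $i_k = \|\mathbf{h}_k^{*T}\mathbf{W}\|_2^2$ (the defining equation is not reproduced in these notes, so treat it as an assumption) and uses only $\|\mathbf{h}_k^{*}\|_2 = 1$.

```latex
\begin{aligned}
E_k &= \|\mathbf{W} - \mathbf{h}_k^{*}\mathbf{h}_k^{*T}\mathbf{W}\|_F^2 \\
    &= \|\mathbf{W}\|_F^2
       - 2\,\mathrm{tr}\!\left(\mathbf{W}^{T}\mathbf{h}_k^{*}\mathbf{h}_k^{*T}\mathbf{W}\right)
       + \mathrm{tr}\!\left(\mathbf{W}^{T}\mathbf{h}_k^{*}\mathbf{h}_k^{*T}\mathbf{h}_k^{*}\mathbf{h}_k^{*T}\mathbf{W}\right) \\
    &= \|\mathbf{W}\|_F^2 - \|\mathbf{h}_k^{*T}\mathbf{W}\|_2^2 \\
    &= \|\mathbf{W}\|_F^2 - i_k
\end{aligned}
```

Since $\|\mathbf{W}\|_F^2$ does not depend on $k$, a larger $i_k$ means a smaller $E_k$. Under the same assumption, when $\mathbf{h}_k^{*}$ is the $k$-th left singular vector we get $\mathbf{h}_k^{*T}\mathbf{W} = \sigma_k \mathbf{v}_k^{T}$, so $i_k = \sigma_k^2$, the $k$-th eigenvalue of $\mathbf{W}\mathbf{W}^{T}$.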

Next, $i_k$ is normalized to $\bar{i}_k$, giving an importance that does not scale with the norm of the word matrix $\textbf{W}$:

$\bar{i}_k = i_k / \sum_{j=1}^{K} i_j$

This also makes the importances of the K topics sum to 1.
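The effect of the normalization can be seen in a short NumPy sketch. It assumes the unnormalized importance is $i_k = \|h_k^{T} W\|_2^2$ (an assumption, since the defining equation is missing from these notes) and uses toy random data; `normalized_importance` is a hypothetical helper name.

```python
import numpy as np

def normalized_importance(W, H):
    """Return i_bar_k = i_k / sum_j i_j for each column h_k of H."""
    # Assumed definition (not reproduced in these notes): i_k = ||h_k^T W||_2^2.
    i = np.sum((H.T @ W) ** 2, axis=1)
    return i / i.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 30))                      # toy d x n word matrix
U, _, _ = np.linalg.svd(W, full_matrices=False)
H = U[:, :3]                                      # K = 3 topics

i_bar = normalized_importance(W, H)
i_bar_scaled = normalized_importance(10.0 * W, H) # same topics, rescaled word vectors
print(i_bar, i_bar_scaled)
```

Rescaling every word vector by a constant leaves the normalized weights unchanged, which is the point of dividing by the sum.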

K can be seen as a model parameter: a small K may fail to cover the key ideas of the document, while a large K may pull in trivial and noisy information.
Empirically, K = 15 was found to capture the important information in a document well.

4.3. Topic Mapping to Summaries

To recap, K topic vectors $\{ \textbf{h}_{k}^{*} \}$ were extracted from the document matrix $\textbf{W}$, and their importance is captured by $\{ \bar{i}_{k} \}_{k=1}^{K}$.
This module measures the relevance of a document-summary pair.
For this, a summary that matches a document should be related to the document's "topics".
Suppose the vectors of the words in a summary are stacked as a d × m matrix S = [s1, . . . , sm], where sj is the vector of the j-th word in a summary.
Just as with the document, the summary can be reconstructed using (3).
$\tilde{\textbf{s}}_{j}^{k}$ is the reconstruction of the summary word $\textbf{s}_{j}$ given a single topic $\textbf{h}_{k}^{*}$ ( $\tilde{\textbf{s}}_{j}^{k}=\textbf{h}_{k}^{*} \textbf{h}_{k}^{*T} \textbf{s}_{j}$ ).
The function $r$ defines the relevance between a topic vector $\textbf{h}$ and a summary word $\textbf{s}$.

Substituting $\tilde{\textbf{s}}_{j}^{k}=\textbf{h}_{k}^{*} \textbf{h}_{k}^{*T} \textbf{s}_{j}$ shows that it comes down to the magnitude of the inner product $\textbf{h}_{k}^{*T}\textbf{s}_{j}$, scaled by the norm of $\textbf{s}_{j}$.
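The reduction to an inner product can be worked out explicitly. The sketch assumes the word-level relevance $r(\mathbf{h}_k^{*}, \mathbf{s}_j)$ is the cosine similarity between $\mathbf{s}_j$ and its single-topic reconstruction $\tilde{\mathbf{s}}_j^k$ (an assumption, since the defining equation is missing from these notes), and that $\mathbf{h}_k^{*T}\mathbf{s}_j \neq 0$.

```latex
\begin{aligned}
\tilde{\mathbf{s}}_j^k &= \mathbf{h}_k^{*}\mathbf{h}_k^{*T}\mathbf{s}_j
  = (\mathbf{h}_k^{*T}\mathbf{s}_j)\,\mathbf{h}_k^{*},
  \qquad \|\tilde{\mathbf{s}}_j^k\|_2 = |\mathbf{h}_k^{*T}\mathbf{s}_j| \\
r(\mathbf{h}_k^{*}, \mathbf{s}_j)
  &= \frac{\mathbf{s}_j^{T}\tilde{\mathbf{s}}_j^k}{\|\mathbf{s}_j\|_2\,\|\tilde{\mathbf{s}}_j^k\|_2}
   = \frac{(\mathbf{h}_k^{*T}\mathbf{s}_j)^2}{\|\mathbf{s}_j\|_2\,|\mathbf{h}_k^{*T}\mathbf{s}_j|}
   = \frac{|\mathbf{h}_k^{*T}\mathbf{s}_j|}{\|\mathbf{s}_j\|_2}
\end{aligned}
```

By the Cauchy-Schwarz inequality and $\|\mathbf{h}_k^{*}\|_2 = 1$, this quantity always lies between 0 and 1.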

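Let $r(\textbf{h}^{*}_{k}, \textbf{S})$ be the relevance between a topic vector and the summary, defined to be the average similarity between the topic vector and the summary words:

$r(\textbf{h}^{*}_{k}, \textbf{S}) = \frac{1}{m}\sum_{j=1}^{m} r(\textbf{h}^{*}_{k}, \textbf{s}_{j})$ (9)

Because the word-level relevance is divided by a norm in the denominator, the relevance between a topic vector and a summary takes a value between 0 and 1.
However, (9) leaves out the importance of the topic, so the final quantity $r(\textbf{W}, \textbf{S})$, the relevance between the document $\textbf{W}$ and the summary $\textbf{S}$, corrects for this:

$r(\textbf{W}, \textbf{S}) = \sum_{k=1}^{K} \bar{i}_{k}\, r(\textbf{h}^{*}_{k}, \textbf{S})$

That is, $r(\textbf{W}, \textbf{S})$ is the sum of topic-summary relevance weighted by the importance of the topic.
The higher this value, the more relevant the pair.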
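Putting the pieces together, the whole score can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' code: it assumes $i_k = \|h_k^{T} W\|_2^2$ and a word-level relevance of $|h_k^{T} s_j| / \|s_j\|_2$, and the function name `document_summary_relevance` and the tiny hand-made matrices are hypothetical.

```python
import numpy as np

def document_summary_relevance(W, S, K):
    """Importance-weighted relevance between document W (d x n) and summary S (d x m)."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    H = U[:, :K]                                   # K hidden topics of the document
    i = np.sum((H.T @ W) ** 2, axis=1)             # assumed importance i_k = ||h_k^T W||^2
    i_bar = i / i.sum()                            # normalized importance, sums to 1
    # Assumed word-level relevance |h_k^T s_j| / ||s_j||, as a K x m array.
    word_rel = np.abs(H.T @ S) / np.linalg.norm(S, axis=0, keepdims=True)
    topic_rel = word_rel.mean(axis=1)              # average over summary words
    return float(i_bar @ topic_rel)                # weight topics by importance

# Toy document whose words live in the plane spanned by the first two axes.
W = np.zeros((5, 4))
W[0] = [3.0, 0.0, 2.0, 1.0]
W[1] = [0.0, 2.0, 1.0, -1.0]

S_match = np.eye(5)[:, [0, 1]]   # summary words inside the document's topic plane
S_other = np.eye(5)[:, [2, 3]]   # summary words orthogonal to that plane

r_match = document_summary_relevance(W, S_match, K=2)
r_other = document_summary_relevance(W, S_other, K=2)
print(r_match, r_other)
```

A summary whose words lie in the document's topic subspace scores clearly above zero, while one orthogonal to it scores essentially zero.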

Visualized as a figure, this looks as follows.

The two documents are from science projects: a genetics project, Pedigree Analysis:
This shows that the words in the summary (and hence the summary itself) can also be reconstructed from the hidden topics of documents that match the summary (and are hence ‘relevant’ to the summary).
In other words, the figure shows the summary words being projected onto the topic plane of the document.
(As a reminder, a topic here is a numeric vector, so it is not a class-like notion as in LDA.)

5. Experiments

5.2. Text Summarization

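Matching a summary to a document comes up often in real life (the paper uses Lee Sedol vs. AlphaGo as an example).
Such matching is an ideal testbed for evaluating methods that match texts of different sizes.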
Dataset

CL-SciSumm Shared Task
The dataset consists of 730 ACL Computational Linguistics research papers covering 50 categories in total.
Each category consists of a reference paper (RP) and around 10 citing papers (CP) that contain citations to the RP. A human-generated summary for the RP is provided and we use the 10 CP as being relevant to the summary. The matching task here is between the summary and all CPs in each category.

Evaluation

A typical summary is about 100 words long, while a paper has more than 2000 words.
All 730 papers were evaluated.
Retrieval was evaluated with precision@k, which considers the top-k matches.
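For reference, precision@k in its conventional form is the fraction of the top-k ranked items that are relevant. The sketch below uses that standard definition, since these notes do not spell out the exact formula; the function name `precision_at_k` and the paper IDs are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Hypothetical ranking of papers for one summary, best match first.
ranked = ["cp3", "x1", "cp1", "x2", "cp2"]
relevant = {"cp1", "cp2", "cp3"}          # the citing papers of this category

p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_3 = precision_at_k(ranked, relevant, 3)
p_at_5 = precision_at_k(ranked, relevant, 5)
print(p_at_1, p_at_3, p_at_5)
```

In the setting described here, the ranking would come from sorting all papers by their relevance to a summary, with the roughly 10 citing papers of the category treated as the relevant set.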
For each combination of the text-similarity approaches and embeddings, we show precision@k for different k’s in Figure 5.
We observe that our method with science embedding achieves the best performance compared to the baselines, once again showing not only the benefits of our method but also that of incorporating domain knowledge.

6. Discussion

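The paper presents a new method for matching documents and summaries.
The problem was to bridge the gap between a long text and its abstraction by means of hidden topics.
To improve performance, domain knowledge is incorporated into the matching system.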
Our approach has beaten two strong baselines in two downstream applications, concept-project matching and summary-research paper matching.

Reference

https://arxiv.org/pdf/1903.10675.pdf
