◼️ Comment

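The paper does not seem to explain its core idea very well. (I wish the examples had been more concrete.)
Anyway, to summarize my understanding: the core of the paper is a new pretrained LM suited to entity-related tasks.

In other words, embeddings for entities are learned jointly during pretraining.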

Pretraining

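Example sentence) I live in <Gyeonggi-do Pangyo: LOC>.
When pretraining on this sentence, conventional pretraining
randomly masks a word, as in "I live in Gyeonggi-do <MASK>.", and trains the model to predict <MASK> = Pangyo.
Here, the input is "I live in Gyeonggi-do <MASK1>." + "<MASK2>", so there are two masked tokens.
The model is trained to predict <MASK1> = Pangyo and <MASK2> = "Gyeonggi-do Pangyo".
To train <MASK2> to become "Gyeonggi-do Pangyo", we must already know that "Gyeonggi-do Pangyo" is a single entity. (Otherwise it could not be given as an input.)
This information comes from Wikipedia hyperlinks.
That is, if "Gyeonggi-do Pangyo" appears as a hyperlink in Wikipedia, it is treated as a single entity.
Of course, even though we know it is an entity, the pretraining corpus cannot tell us what kind of entity it is, so the training label for <MASK2> is the original entity itself, "Gyeonggi-do Pangyo".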
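
To make the example above concrete, here is a minimal sketch of how one masked training instance could be assembled. Everything in it (the `EntitySpan` class, the `build_masked_example` function, the tuple layout of the entity input) is my own illustration and not the paper's code. In actual pretraining the masked words and entities are chosen at random; here the positions are fixed by hand.

```python
from dataclasses import dataclass


@dataclass
class EntitySpan:
    title: str  # entity name taken from a hyperlink anchor (illustrative)
    start: int  # index of the first word of the mention
    end: int    # index one past the last word of the mention


def build_masked_example(words, span, masked_word_index):
    """Mask one word and one entity; return model inputs and labels."""
    word_inputs = list(words)
    word_labels = {masked_word_index: words[masked_word_index]}
    word_inputs[masked_word_index] = "[MASK]"

    # The entity token is replaced by the [MASK] entity but keeps the
    # positions of the words it covers, so the model knows where it is.
    entity_inputs = [("[MASK]", list(range(span.start, span.end)))]
    entity_labels = [span.title]
    return word_inputs, entity_inputs, word_labels, entity_labels


words = ["I", "live", "in", "Gyeonggi-do", "Pangyo", "."]
span = EntitySpan(title="Gyeonggi-do Pangyo", start=3, end=5)
word_inputs, entity_inputs, word_labels, entity_labels = build_masked_example(
    words, span, masked_word_index=4
)
print(word_inputs)
print(entity_inputs, entity_labels)
```

The word side is ordinary MLM (<MASK1>), while the entity side (<MASK2>) is a separate token whose label is the whole entity.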

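The paper does not say so explicitly, but in Figure 1 Beyonce is an entity that has not been replaced with a MASK token, so it does not appear to be used as a prediction target in that pretraining step.
That is, the entity-side embedding is probably trained only when the token replaced by MASK is an entity (as in MLM).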

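Entities that are not in the entity vocabulary are replaced with the [UNK] entity.
Otherwise, the training procedure is not much different from the earlier MLM; see Figure 1.
For the embeddings, an entity type embedding e (a single vector) is used, and the position embedding of an entity is the average over the positions of the tokens grouped into that entity.
In addition, the entity token embedding matrix is factorized into two matrices, as in ALBERT.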

With this kind of pretraining, the model can apparently learn information about entities in advance.

This is said to improve performance after fine-tuning.

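In addition, four query matrices Q are used when computing attention scores.

Which one is used depends on the type of each input, word or entity.
Without this distinction, training would be no different from plain BERT, so I suspect the benefit of recognizing token types would be marginal.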

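The model architecture is based on RoBERTa-LARGE.

So what matters is less that the model is significantly better than BERT and more how much better it is than RoBERTa.
It does improve on RoBERTa, but the gap does not look large.
It makes me wonder whether the cost of collecting data to exploit Wikipedia hyperlinks is really worth it.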

0 Abstract

Entity representations are useful in natural language tasks involving entities.
In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer.
The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them.
Our model is trained using a new pretraining task based on the masked language model of BERT.
The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia.
We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores.
The proposed model achieves impressive empirical performance on a wide range of entity-related tasks.
In particular, it obtains state-of-the-art results on five well-known datasets:

Open Entity (entity typing),
TACRED (relation classification),
CoNLL-2003 (named entity recognition),
ReCoRD (cloze-style question answering), and
SQuAD 1.1 (extractive question answering).

Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.

1 Introduction

Many NLP tasks involve entities,

e.g., relation classification, entity typing, named entity recognition (NER), and question answering (QA).

The key to solving such entity-related tasks is for the model to learn effective representations of entities.
Conventional entity representations assign each entity a fixed embedding vector that stores information about the entity in a knowledge base (KB).
Although these models capture rich information from the KB, they require entity linking to represent entities in a text, and cannot represent entities that do not exist in the KB.
In contrast, contextualized word representations (CWRs) based on the transformer, such as BERT and RoBERTa, provide effective general-purpose word representations trained with unsupervised pretraining tasks based on language modeling.
Many recent studies have tried to solve entity-related tasks using contextualized representations of entities computed from CWRs.
However, the architecture of CWRs is not well suited to representing entities, for the following two reasons.

(1) Because CWRs do not output span-level representations of entities, they typically need to learn how to compute such representations from a downstream dataset that is usually small.
(2) Many entity-related tasks (e.g., relation classification and QA) involve reasoning about the relationships between entities.

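Although the transformer can capture complex relationships between words by relating them to each other multiple times through the self-attention mechanism, it is difficult to perform such reasoning between entities because many entities are split into multiple tokens in the model.
Furthermore, the word-based pretraining task of CWRs is not suitable for learning representations of entities,

because predicting a masked word given the other words in the entity is much easier than predicting the entire entity.
e.g.) predicting Rings for the MASK in "The Lord of the [MASK]"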

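In this paper, we introduce LUKE (Language Understanding with Knowledge-based Embeddings), new pretrained contextualized representations of words and entities.
LUKE is based on the transformer and is trained on a large entity-annotated corpus obtained from Wikipedia.
An important difference between LUKE and existing CWRs is that it treats not only words but also entities as independent tokens, and computes intermediate and output representations for all tokens using the transformer (Figure 1).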

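Because entities are treated as tokens, LUKE can directly model the relationships between entities.
LUKE is trained using a new pretraining task that is a straightforward extension of BERT's masked language model (MLM).

The task involves randomly masking entities by replacing them with [MASK] entities, and trains the model by predicting the originals of these masked entities.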

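We use RoBERTa as the base pretrained model, and pretrain by simultaneously optimizing the objectives of the MLM and our proposed task.
When applied to downstream tasks, the resulting model can compute representations of arbitrary entities in the text using [MASK] entities as inputs.
Furthermore, if entity annotation is available in the task, the model can compute entity representations based on the rich entity-centric information encoded in the corresponding entity embeddings.
Another key contribution of this paper is that it extends the transformer with our entity-aware self-attention mechanism.
Unlike existing CWRs, our model needs to deal with two types of tokens,

i.e., words and entities.

Therefore, we assume that it is beneficial to enable the mechanism to easily determine the types of tokens.
To this end, we enhance the self-attention mechanism by adopting different query mechanisms based on the attending token and the token attended to.
We validate the effectiveness of the proposed model through extensive experiments on five standard entity-related tasks:

entity typing, relation classification, NER, cloze-style QA, and extractive QA.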

Our model outperforms all baseline models, including RoBERTa, in all experiments, and obtains state-of-the-art results on five tasks:

entity typing on the Open Entity dataset (Choi et al., 2018),
relation classification on the TACRED dataset (Zhang et al., 2017),
NER on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003),
cloze-style QA on the ReCoRD dataset (Zhang et al., 2018a), and
extractive QA on the SQuAD 1.1 dataset (Rajpurkar et al., 2016).

We publicize our source code and pretrained representations at https://github.com/studio-ousia/luke.

Main contributions

We propose LUKE, new contextualized representations specifically designed to address entity-related tasks. LUKE is trained to predict randomly masked words and entities using a large entity-annotated corpus obtained from Wikipedia.
We introduce an entity-aware self-attention mechanism, an effective extension of the original mechanism of transformer. The proposed mechanism considers the type of the tokens (words or entities) when computing attention scores.
LUKE achieves strong empirical performance and obtains state-of-the-art results on five popular datasets: Open Entity, TACRED, CoNLL2003, ReCoRD, and SQuAD 1.1.

2 Related Work

3 LUKE

Figure 1 shows the architecture of LUKE. The model adopts a multi-layer bidirectional transformer (Vaswani et al., 2017).
It treats the words and entities in a document as input tokens, and computes a representation for each token.
Formally, given a sequence of m words $w_1, w_2, ..., w_m$ and n entities $e_1, e_2, ..., e_n$, our model computes D-dimensional word representations $\mathbf{h}_{w_1}$, $\mathbf{h}_{w_2}$, ..., $\mathbf{h}_{w_m}$, where $\mathbf{h}_{w} \in \mathbb{R}^D$, and entity representations $\mathbf{h}_{e_1}$, $\mathbf{h}_{e_2}$, ..., $\mathbf{h}_{e_n}$, where $\mathbf{h}_{e} \in \mathbb{R}^D$.
The entities can be Wikipedia entities (e.g., Beyonce in Figure 1) or special entities (e.g., [MASK]).

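Giving an entity as an input token means marking Beyonce or [MASK] as an entity, as shown here.
In Figure 1, an additional embedding e is added on the entity side; this is the entity type embedding.
Also, looking ahead, the token embedding used for entities seems to differ from the ordinary word token embedding.
In addition, the position embedding is written as (D5+D6)/2 because Los Angeles is a single entity. (Is this what makes entity-level pretraining possible?)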

3.1 Input Representation

The input representation of a token (word or entity) is computed using the following three embeddings:
Token embedding represents the corresponding token.

We define the word token embedding as $\mathbf{A} \in \mathbb{R}^{V_w \times D}$, where $V_w$ is the number of words in the vocabulary.
For computational efficiency, we represent the entity token embedding by decomposing it into two small matrices,
$\mathbf{B} \in \mathbb{R}^{V_e \times H}$ and $\mathbf{U} \in \mathbb{R}^{H \times D}$, where $V_e$ is the number of entities in the vocabulary.
The full matrix of the entity token embedding is obtained as $\mathbf{B}\mathbf{U}$.
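
A small NumPy sketch of this factorization, with toy sizes that I picked myself (the real sizes are given in Section 3.4). The function name `entity_token_embedding` is hypothetical. The point is that looking up a row of B and projecting it with U gives the same vector as a row of the full matrix BU, with far fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V_e, H, D = 1000, 16, 64  # toy sizes, much smaller than the paper's

B = rng.normal(size=(V_e, H))  # low-dimensional entity table
U = rng.normal(size=(H, D))    # projection up to the hidden size


def entity_token_embedding(entity_ids):
    """Look up rows of B, then project them to D dimensions with U."""
    return B[entity_ids] @ U


emb = entity_token_embedding(np.array([3, 7]))
full_params = V_e * D               # size of an unfactorized table
factored_params = V_e * H + H * D   # size of B plus U
print(emb.shape, full_params, factored_params)
```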

Position embedding represents the position of the token in a word sequence.

A word and an entity appearing at the i-th position in the sequence are represented as $\mathbf{C}_i \in \mathbb{R}^{D}$ and $\mathbf{D}_i \in \mathbb{R}^{D}$, respectively.
If an entity name contains multiple words, its position embedding is computed by averaging the embeddings of the corresponding positions (as in Figure 1).

Entity type embedding represents that the token is an entity.

The embedding is a single vector denoted by $\mathbf{e} \in \mathbb{R}^{D}$ .

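The input representation of a word is the sum of its token and position embeddings,

and that of an entity is the sum of its token, position, and entity type embeddings.

Following past work, we insert the special tokens [CLS] and [SEP] into the word sequence as the first and last words, respectively.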
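
Putting Section 3.1 together, the input representations could be sketched as follows. This is only my reading of the text: the sizes are toy values, the matrices are random stand-ins for learned parameters, and `word_input` / `entity_input` are names I made up. The matrix called $\mathbf{D}_i$ in the text is named `D_pos` here to avoid clashing with the dimension `D`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V_w, V_e, max_pos = 8, 4, 20, 10, 16  # toy sizes (assumed)

A = rng.normal(size=(V_w, D))          # word token embedding
B = rng.normal(size=(V_e, H))          # entity token embedding (factor 1)
U = rng.normal(size=(H, D))            # entity token embedding (factor 2)
C = rng.normal(size=(max_pos, D))      # word position embedding C_i
D_pos = rng.normal(size=(max_pos, D))  # entity position embedding D_i
e_type = rng.normal(size=(D,))         # entity type embedding e


def word_input(word_id, position):
    """Word: token embedding + position embedding."""
    return A[word_id] + C[position]


def entity_input(entity_id, positions):
    """Entity: token + averaged position + entity type embedding."""
    avg_position = D_pos[positions].mean(axis=0)
    return B[entity_id] @ U + avg_position + e_type


x_word = word_input(word_id=3, position=0)
# An entity covering positions 5 and 6, like (D5 + D6) / 2 in Figure 1.
x_entity = entity_input(entity_id=2, positions=[5, 6])
print(x_word.shape, x_entity.shape)
```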

3.2 Entity-aware Self-attention

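The self-attention mechanism is the foundation of the transformer, and relates tokens to each other based on the attention score between each pair of tokens.
Given a sequence of input vectors $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_k$, where $\mathbf{x}_i \in \mathbb{R}^D$, each of the output vectors $\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_k$, where $\mathbf{y}_i \in \mathbb{R}^L$,

is computed based on the weighted sum of the transformed input vectors.

Here, each input and output vector corresponds to a token (a word or an entity) in our model;

therefore, k = m + n.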

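The i-th output vector $\mathbf{y}_i$ is computed as:

$$\mathbf{y}_i = \sum_{j=1}^{k} \alpha_{ij} \mathbf{V}\mathbf{x}_j, \qquad e_{ij} = \frac{(\mathbf{K}\mathbf{x}_j)^\top \mathbf{Q}\mathbf{x}_i}{\sqrt{L}}, \qquad \alpha_{ij} = \mathrm{softmax}(e_{ij})$$

where $\mathbf{Q} \in \mathbb{R}^{L \times D}$, $\mathbf{K} \in \mathbb{R}^{L \times D}$, and $\mathbf{V} \in \mathbb{R}^{L \times D}$ denote the query, key, and value matrices, respectively.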

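Because LUKE handles two types of tokens (i.e., words and entities), we assume that it is beneficial to use information about the target token types when computing the attention scores ($e_{ij}$).
With this in mind, we enhance the mechanism by introducing an entity-aware query mechanism that uses a different query matrix for each possible pair of token types of $\mathbf{x}_i$ and $\mathbf{x}_j$.
Formally, the attention score $e_{ij}$ is computed as follows:

$$e_{ij} = \begin{cases} (\mathbf{K}\mathbf{x}_j)^\top \mathbf{Q}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are words} \\ (\mathbf{K}\mathbf{x}_j)^\top \mathbf{Q}_{w2e}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is a word and } \mathbf{x}_j \text{ is an entity} \\ (\mathbf{K}\mathbf{x}_j)^\top \mathbf{Q}_{e2w}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is an entity and } \mathbf{x}_j \text{ is a word} \\ (\mathbf{K}\mathbf{x}_j)^\top \mathbf{Q}_{e2e}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are entities} \end{cases}$$

where $\mathbf{Q}_{w2e}, \mathbf{Q}_{e2w}, \mathbf{Q}_{e2e} \in \mathbb{R}^{L \times D}$ are query matrices. (The scaling by $\sqrt{L}$ used in standard scaled dot-product attention is omitted here.)

In other words, there are four (2x2) kinds of Q, depending on the types of $\mathbf{x}_i$ and $\mathbf{x}_j$.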
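
To check my understanding of the 2x2 query selection, here is a single-head NumPy sketch. It is not the authors' implementation: the function names and the boolean `is_entity` mask are my own, and I compute all four score matrices and pick one per (i, j) pair, which is simple rather than efficient. The scores are scaled by the square root of L as in standard attention.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=axis, keepdims=True)


def entity_aware_attention(X, is_entity, Q, Q_w2e, Q_e2w, Q_e2e, K, V):
    """X: (k, D) inputs; is_entity: (k,) bool; all matrices: (L, D)."""
    L = K.shape[0]
    keys = X @ K.T    # row j is K x_j
    values = X @ V.T  # row j is V x_j

    def scores(query_matrix):
        # Entry (i, j) is (K x_j)^T (query_matrix x_i).
        return (X @ query_matrix.T) @ keys.T

    attending = is_entity[:, None]  # type of x_i (the query side)
    attended = is_entity[None, :]   # type of x_j (the key side)
    e = np.where(
        attending,
        np.where(attended, scores(Q_e2e), scores(Q_e2w)),
        np.where(attended, scores(Q_w2e), scores(Q)),
    )
    alpha = softmax(e / np.sqrt(L), axis=1)
    return alpha @ values  # row i is y_i


rng = np.random.default_rng(0)
k, D, L = 5, 8, 4  # 3 words + 2 entities, toy dimensions
X = rng.normal(size=(k, D))
is_entity = np.array([False, False, False, True, True])
Q, Q_w2e, Q_e2w, Q_e2e, K, V = (rng.normal(size=(L, D)) for _ in range(6))

Y = entity_aware_attention(X, is_entity, Q, Q_w2e, Q_e2w, Q_e2e, K, V)
print(Y.shape)
```

If all four query matrices are set to the same Q, this reduces to ordinary self-attention, which matches the paper's framing of the mechanism as an extension of the original one.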

3.3 Pretraining Task

To pretrain LUKE, we use the conventional MLM together with a new pretraining task, an extension of the MLM, that learns entity representations.
In particular, we treat hyperlinks in Wikipedia as entity annotations, and train the model using a large entity-annotated corpus retrieved from Wikipedia.

We randomly mask a certain percentage of the entities by replacing them with special [MASK] entities, and then train the model to predict the masked entities.

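Formally, the original entity corresponding to a masked entity is predicted by applying the softmax function over all entities in our vocabulary:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{B}\mathbf{T}\mathbf{m} + \mathbf{b}_o), \qquad \mathbf{m} = \mathrm{layer\_norm}\left(\mathrm{gelu}(\mathbf{W}_h \mathbf{h}_e + \mathbf{b}_h)\right)$$

where $\mathbf{h}_e$ is the representation corresponding to the masked entity,
$\mathbf{T} \in \mathbb{R}^{H \times D}$ and $\mathbf{W}_h \in \mathbb{R}^{D \times D}$ are weight matrices,
$\mathbf{b}_o \in \mathbb{R}^{V_e}$ and $\mathbf{b}_h \in \mathbb{R}^{D}$ are bias vectors,
gelu(·) is the gelu activation function, and
layer_norm(·) is the layer normalization function.
Here $\mathbf{m}$ is D-dimensional and $\hat{\mathbf{y}}$ is $V_e$-dimensional. The unusual(?) points are that layer_norm is used and that $\mathbf{B}$ is reused.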

Our final loss function is the sum of the MLM loss and the cross-entropy loss on predicting the masked entities, where the latter is computed identically to the former.
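
The entity prediction head and its loss can be sketched roughly like this. Assumptions on my side: toy sizes, random stand-in weights, a layer norm without learnable gain and bias, and the tanh approximation of gelu. The function names are mine, not the paper's.

```python
import numpy as np


def gelu(x):
    # tanh approximation of gelu (an assumption; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-12):
    # simplified: no learnable gain or bias
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def predict_masked_entity(h_e, B, T, W_h, b_h, b_o):
    """Return a probability distribution over the entity vocabulary."""
    m = layer_norm(gelu(W_h @ h_e + b_h))  # shape (D,)
    logits = B @ (T @ m) + b_o             # shape (V_e,), reuses B
    logits = logits - logits.max()         # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def cross_entropy(probs, gold_entity_id):
    return -np.log(probs[gold_entity_id])


rng = np.random.default_rng(0)
V_e, H, D = 50, 4, 8  # toy sizes (assumed)
B = rng.normal(size=(V_e, H))
T = rng.normal(size=(H, D))
W_h = rng.normal(size=(D, D))
b_h = np.zeros(D)
b_o = np.zeros(V_e)
h_e = rng.normal(size=(D,))  # output representation of a [MASK] entity

probs = predict_masked_entity(h_e, B, T, W_h, b_h, b_o)
entity_loss = cross_entropy(probs, gold_entity_id=7)
print(probs.shape, float(entity_loss))
```

The total pretraining loss would then be the MLM loss plus this entity loss, summed over the masked entities.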

3.4 Modeling Detail (translation)

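Our model configuration follows RoBERTa-LARGE (Liu et al., 2020), pretrained CWRs based on a bidirectional transformer and a variant of BERT (Devlin et al., 2019).
In particular, our model is based on a bidirectional transformer with D = 1024 hidden dimensions, 24 hidden layers, L = 64 attention head dimensions, and 16 self-attention heads.
The number of dimensions of the entity token embedding is set to H = 256.
The total number of parameters is approximately 483M, consisting of 355M from RoBERTa and 128M from the entity embeddings.
The input text is tokenized into words using RoBERTa's tokenizer with a vocabulary of $V_w$ = 50K words.
For computational efficiency, the entity vocabulary does not include all entities but only the $V_e$ = 500K entities that appear most frequently in the entity annotations.
The entity vocabulary also includes two special entities, [MASK] and [UNK].
The model is trained by iterating over Wikipedia pages in random order for 200K training steps.
To reduce training time, we initialize the parameters that LUKE has in common with RoBERTa (the transformer parameters and the word embeddings) using RoBERTa.
Following past work (Devlin et al., 2019; Liu et al., 2020), we mask 15% of all words and entities at random.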
If an entity does not exist in the vocabulary, we replace it with the [UNK] entity.

That is, as I read it, every token has a corresponding entity embedding,
and entities without a hyperlink are UNK.
But even in Wikipedia data, wouldn't most of them have no hyperlink..?

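We want to perform an ablation study of our entity-aware self-attention mechanism but cannot afford to run pretraining twice, so we pretrain using the original self-attention mechanism rather than the entity-aware one.
The query matrices of the entity-aware self-attention mechanism ($\mathbf{Q}_{w2e}$, $\mathbf{Q}_{e2w}$, and $\mathbf{Q}_{e2e}$) are learned using the downstream datasets.
Further details of the pretraining are described in Appendix A.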
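
The parameter counts quoted in this section can be roughly checked with the sizes given above. This is back-of-the-envelope arithmetic of my own; it counts only B and U on the entity side and ignores smaller entity-side parameters such as the position embeddings and the prediction head.

```python
V_e, H, D = 500_000, 256, 1024  # values quoted in this section

entity_table = V_e * H  # B
projection = H * D      # U
entity_params = entity_table + projection
unfactored = V_e * D    # size of a full V_e x D entity table

roberta_params = 355_000_000  # as quoted in the text
total = roberta_params + entity_params

print(f"entity embeddings: {entity_params / 1e6:.1f}M")
print(f"unfactorized table: {unfactored / 1e6:.1f}M")
print(f"total: {total / 1e6:.1f}M")
```

So almost all of the extra 128M parameters come from B, and without the factorization the entity table alone would be about four times larger.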

4 Experiments

4.1 Entity Typing

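We first conduct experiments on entity typing,

which is the task of predicting the types of an entity in a given sentence.
From what I found, the difference from NER is that not only the sentence but also the entity's tokens are given.
That is, the B/I span boundaries seem to be given as well (so NER is the harder task).

Following Zhang et al. (2019), we use the Open Entity dataset (Choi et al., 2018) and consider only nine general entity types.
Following Wang et al. (2020), we report loose micro-precision, recall, and F1, and use micro-F1 as the primary metric.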
Model

We represent the target entity using the [MASK] entity, and input the words and the entity of each sentence into the model.
We then classify the entity using a linear classifier based on the corresponding entity representation.
We treat the task as multi-label classification, and train the model using binary cross-entropy loss averaged over all entity types.
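
A minimal sketch of such a multi-label head, assuming nine types as in the Open Entity setup above. The weights here are zero placeholders, the function names are mine, and the 0.5 decision threshold is my own assumption (the text does not state one).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def entity_typing_loss(h_entity, W, b, labels, eps=1e-12):
    """Binary cross-entropy averaged over all entity types."""
    probs = sigmoid(W @ h_entity + b)
    bce = -(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
    return bce.mean(), probs


num_types, D = 9, 8                 # 9 types; D is a toy hidden size
W = np.zeros((num_types, D))        # untrained linear classifier
b = np.zeros(num_types)
h_entity = np.ones(D)               # representation of the [MASK] entity
labels = np.zeros(num_types)
labels[[0, 3]] = 1.0                # an entity can have several types

loss, probs = entity_typing_loss(h_entity, W, b, labels)
predicted_types = np.flatnonzero(probs > 0.5)  # threshold is an assumption
print(float(loss), predicted_types)
```

With an untrained (all-zero) classifier every probability is 0.5, so the loss equals log 2, which is a handy sanity check.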

Baselines

UFET (Choi et al., 2018) is a conventional model that computes context representations using the bidirectional LSTM. We also use BERT, RoBERTa, ERNIE, KnowBERT, KEPLER, and K-Adapter as baselines.

Results

Table 1 shows the experimental results. LUKE significantly outperforms our primary baseline, RoBERTa, by 2.0 F1 points, and the previous best published model, KnowBERT, by 2.1 F1 points. Furthermore, LUKE achieves a new state of the art by outperforming K-Adapter by 0.7 F1 points.

4.2 Relation Classification

4.3 Named Entity Recognition

We conduct experiments on the NER task using the standard CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003).
Following past work, we report the span-level F1.
Model

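Following Sohrab and Miwa (2018), we solve the task by enumerating all possible spans in each sentence as entity name candidates,
and classifying them into the target entity types or the non-entity type, which indicates that the span is not an entity.
For each sentence in the dataset, we input the words and the [MASK] entities corresponding to all possible spans.

(Here, the [MASK] entities are what gets predicted.)

The representation of each span is computed by concatenating the word representations of the first and last words in the span and the entity representation corresponding to the span.
We classify each span using a linear classifier on this representation, and train the model with cross-entropy loss.
For computational efficiency, we exclude spans longer than 16 words.
During inference, we first exclude all spans classified into the non-entity type.
To avoid selecting overlapping spans, we greedily select spans from the remaining ones in descending order of the logit of the predicted entity type, provided that the span does not overlap with those already selected.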
Following Devlin et al. (2019), we include the maximal document context in the target document.
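
The span enumeration and the greedy, non-overlapping selection described above could look roughly like this. It is only an illustrative sketch: the function names, the tuple layout, and the example scores are made up, and the classifier that would produce the logits is left out.

```python
def enumerate_spans(num_words, max_len=16):
    """All (start, end) spans with end exclusive and at most max_len words."""
    return [
        (start, end)
        for start in range(num_words)
        for end in range(start + 1, min(start + max_len, num_words) + 1)
    ]


def greedy_select(scored_spans):
    """scored_spans: (start, end, label, logit) tuples with non-entity
    spans already removed. Highest logits are picked first, and any span
    overlapping an already selected one is skipped."""
    chosen = []
    for start, end, label, logit in sorted(
        scored_spans, key=lambda span: span[3], reverse=True
    ):
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append((start, end, label, logit))
    return sorted(chosen)


print(enumerate_spans(3))

candidates = [
    (0, 2, "LOC", 5.0),
    (1, 3, "PER", 4.0),  # overlaps with (0, 2), so it is skipped
    (3, 4, "ORG", 1.0),
]
print(greedy_select(candidates))
```

Because every span gets its own [MASK] entity, the number of candidates grows quickly with sentence length, which is presumably why the 16-word cap is needed.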

Baselines

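LSTM-CRF (Lample et al., 2016) is a model based on a bidirectional LSTM with conditional random fields (CRF).
Akbik et al. (2018) address the task using a bidirectional LSTM with CRF enhanced with character-level contextualized representations.
Similarly, Baevski et al. (2019) use a bidirectional LSTM with CRF enhanced with CWRs based on a bidirectional transformer.
We also use ELMo, BERT, and RoBERTa as baselines.
To conduct a fair comparison with RoBERTa, we report its performance using the model described above, with the span representation computed by concatenating the representations of the first and last words of the span.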

Results

The experimental results are shown in Table 3. LUKE outperforms RoBERTa by 1.9 F1 points. Furthermore, it achieves a new state of the art on this competitive dataset by outperforming the previous state of the art reported in Baevski et al. (2019) by 0.8 F1 points.

4.4 Cloze-style Question Answering

4.5 Extractive Question Answering

5 Analysis

5.1 Effects of Entity Representations

5.2 Effects of Entity-aware Self-attention

5.3 Effects of Extra Pretraining

6 Conclusions

In this paper, we propose LUKE,

new pretrained contextualized representations of words and entities based on the transformer.

LUKE outputs contextualized representations of words and entities using an improved transformer architecture with a novel entity-aware self-attention mechanism.
The experimental results prove its effectiveness on various entity-related tasks.
Future work involves applying LUKE to domain-specific tasks, such as those in the biomedical and legal domains.

Reference

https://www.aclweb.org/anthology/2020.emnlp-main.523.pdf

AI Information

NL-114, LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention (2020-EMNLP)
