NL-050, TinyBERT, Distilling BERT for Natural Language Understanding (2019-Arxiv)

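■ Comment
I plan to read through papers on compressing BERT, and this is the first one I picked.
While looking for material before reading it, I found a well-organized summary, so I am attaching the link.

SlideShare: link
Those slides cover most of the model flow and explanation well, so if you do not need to implement it yourself or know every corner of the paper, the slides alone should be enough...
(Same for me..) so here I only summarize the beginning and the end of the paper..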

*Distilling

1. Distill a BERT pre-trained on a large corpus into the student.

2. Then additionally distill a BERT fine-tuned on task data into the student. (Data augmentation is also applied.)

0. Abstract

Language model pre-training, as in BERT, has brought effective performance gains on many NLP tasks.
The downside is that these models are computationally expensive and memory intensive.

As a result, they are hard to run efficiently on resource-restricted devices.
In other words, using BERT directly at production level is not easy!

To speed up inference and shrink the model while preserving accuracy, the paper proposes a Transformer distillation method.

The method is based on knowledge distillation (KD).
That is, a teacher BERT that holds a lot of knowledge transfers it to a small student, TinyBERT.

Training uses a two-stage learning framework: a pre-training stage and a task-specific learning stage.

pre-training refers to the general domain
task-specific refers to the target task

TinyBERT reaches more than 96% of the teacher BERT's performance on GLUE
7.5x smaller and 9.4x faster at inference
Compared with the previous SoTA on BERT distillation, it performs better while using only about 28% of the parameters and about 31% of the inference time.

1. Introduction

Among the many model compression techniques, two are used most often:

weights pruning
knowledge distillation (KD)

This paper uses KD, an idea from Geoffrey Hinton built on a teacher-student framework.
In KD, the knowledge embedded in a large teacher network is transferred to a student network, which learns to reproduce the teacher's behavior.
KD can be applied at the pre-training stage, at the fine-tuning stage, or at both, and the paper compares these options in a table.
The main contributions of this work are as follows:

1) We propose a new Transformer distillation method to encourage that the linguistic knowledge encoded in teacher BERT can be well transferred to TinyBERT.
2) We propose a novel two-stage learning framework with performing the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT.
3) We show experimentally that our TinyBERT can achieve comparable results with teacher BERT on GLUE tasks, while having much fewer parameters (∼13.3%) and less inference time (∼10.6%), and significantly outperforms other state-of-the-art baselines on BERT distillation.
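The "7.5x smaller, 9.4x faster" figures from the abstract and the "∼13.3% parameters, ∼10.6% inference time" figures in contribution 3 are the same numbers written two ways. A quick sanity check (plain arithmetic, the function name `ratio_to_percent` is just mine):

```python
def ratio_to_percent(times_smaller: float) -> float:
    """Convert 'k times smaller/faster' into 'percent of the original'."""
    return round(100.0 / times_smaller, 1)


# Model size relative to the teacher, from "7.5x smaller".
param_percent = ratio_to_percent(7.5)
# Inference time relative to the teacher, from "9.4x faster".
latency_percent = ratio_to_percent(9.4)

print(param_percent, latency_percent)
```

Both values line up with the percentages quoted in the contribution list.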

2. Preliminaries

2.1 Transformer Layers

See the BERT paper.

2.2 Knowledge Distillation

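KD defines a "behavior function" for a model: the part of the network that maps an input to some particular output.

For example, when classifying MNIST with a CNN, everything before the last FC layer can be seen as producing 2D features, so the behavior function that extracts those features is the stack of CNN layers.

In other words, we define a KD loss so that the student's behavior function does not differ from the teacher's, and train with it.
Of course, the inputs here are text.
The key question in KD is how to define an effective behavior function and loss function.

Unlike previous KD methods, this paper designs the loss so that it works well in both stages.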
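To make the behavior-function idea concrete, here is a minimal sketch in plain Python. It sums a per-input loss between the student's and the teacher's behavior functions. The choice of MSE, the names `kd_loss` and `mse`, and the toy teacher/student are all my own assumptions for illustration; the paper uses different losses for different parts of the network.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(
    f_student: Callable[[float], Vector],
    f_teacher: Callable[[float], Vector],
    inputs: Sequence[float],
    loss: Callable[[Vector, Vector], float] = mse,
) -> float:
    """Sum over inputs of loss(student behavior, teacher behavior)."""
    return sum(loss(f_student(x), f_teacher(x)) for x in inputs)


# Toy behavior functions (hypothetical): each maps a number to 2 features.
def teacher(x: float) -> Vector:
    return [2 * x, x + 1]


def student(x: float) -> Vector:
    return [2 * x, x]


inputs = [0.0, 1.0, 2.0]
mismatch = kd_loss(student, teacher, inputs)
perfect = kd_loss(teacher, teacher, inputs)
print(mismatch, perfect)
```

In a real setup the inputs would be text and the behavior functions would return things like attention matrices or hidden states, but the shape of the objective is the same.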

3. Method

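See the reference slides!
The thing to watch is how g(m) is defined.

That is, when the student has fewer layers than the teacher, the question is how to define the function g(m) that decides which teacher layer each student layer imitates.
The most basic option divides the number of teacher layers by the number of student layers and spreads the mapping uniformly.
The experiments section tests three options including this one. I wonder whether g(m) itself could be learned instead of fixed by hand?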
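A small sketch of the layer mapping g(m), assuming 1-indexed layers, a 12-layer teacher, and a 4-layer student (both layer counts are only illustrative). `uniform_mapping` follows the teacher-layers / student-layers rule described above. `top_mapping` and `bottom_mapping` are my assumption of what the other two strategies look like (last layers and first layers of the teacher), so check the paper's ablation for the exact definitions.

```python
def uniform_mapping(m: int, n_teacher: int, n_student: int) -> int:
    """Student layer m imitates every (n_teacher / n_student)-th teacher layer."""
    if not 1 <= m <= n_student:
        raise ValueError("m must be in 1..n_student")
    if n_teacher % n_student != 0:
        raise ValueError("this sketch assumes n_teacher is a multiple of n_student")
    return m * (n_teacher // n_student)


def top_mapping(m: int, n_teacher: int, n_student: int) -> int:
    """Assumed alternative: student layers imitate the last teacher layers."""
    if not 1 <= m <= n_student:
        raise ValueError("m must be in 1..n_student")
    return m + n_teacher - n_student


def bottom_mapping(m: int, n_teacher: int, n_student: int) -> int:
    """Assumed alternative: student layers imitate the first teacher layers."""
    if not 1 <= m <= n_student:
        raise ValueError("m must be in 1..n_student")
    return m


N_TEACHER, N_STUDENT = 12, 4  # illustrative layer counts
student_layers = range(1, N_STUDENT + 1)

uniform = [uniform_mapping(m, N_TEACHER, N_STUDENT) for m in student_layers]
top = [top_mapping(m, N_TEACHER, N_STUDENT) for m in student_layers]
bottom = [bottom_mapping(m, N_TEACHER, N_STUDENT) for m in student_layers]
print(uniform, top, bottom)
```

With these settings the uniform rule picks every third teacher layer, which is what "dividing and spreading uniformly" amounts to.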

4. Experiments

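See the reference slides!
Table 7 is the experiment on the mapping function mentioned above.

Performance changes depending on how the layers are mapped, which the paper takes as evidence that different parts of the teacher network play different roles.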

5. Conclusion and Future Work

The paper introduces a KD method for Transformer-based distillation and, going further, applies a two-stage framework to training TinyBERT.
TinyBERT shows meaningful experimental results while reducing model size and inference time.
This makes it practical to deploy BERT-based NLP applications on edge devices.
As future work, the authors plan to apply the TinyBERT approach to deeper and wider networks such as BERT-Large and XLNet-Large.
They also expect that jointly learning distillation with quantization/pruning could give even better compression.

6. Appendix

A. Data Augmentation Details

Reading the reference slides alongside the algorithm figure should make this easy to follow.
The paper sets pt = 0.4, N = 20, M = 15.
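A rough sketch of how those three settings could fit together, under my own reading of them: `p_t` is the probability threshold for replacing a word, `n_aug` (N) is how many augmented copies are made per sentence, and `m_cand` (M) is the size of the candidate set per word. The candidate generator is a hypothetical stand-in (a tiny synonym table); the paper draws candidates from a language model and word embeddings, which is not reproduced here.

```python
import random
from typing import Callable, Dict, List

CandidateFn = Callable[[List[str], int], List[str]]


def augment(
    tokens: List[str],
    candidates_for: CandidateFn,
    p_t: float = 0.4,
    n_aug: int = 20,
    m_cand: int = 15,
    seed: int = 0,
) -> List[List[str]]:
    """Create n_aug copies of tokens, replacing each word with probability p_t."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_aug):
        new_tokens = list(tokens)
        for i in range(len(tokens)):
            # Keep at most m_cand replacement candidates for position i.
            cands = candidates_for(new_tokens, i)[:m_cand]
            if cands and rng.random() <= p_t:
                new_tokens[i] = rng.choice(cands)
        augmented.append(new_tokens)
    return augmented


# Hypothetical candidate source: a hand-written synonym table.
SYNONYMS: Dict[str, List[str]] = {"good": ["great", "nice"], "movie": ["film"]}


def toy_candidates(tokens: List[str], i: int) -> List[str]:
    return SYNONYMS.get(tokens[i], [])


sentence = ["a", "good", "movie"]
samples = augment(sentence, toy_candidates)
print(len(samples), samples[0])
```

Words with no candidates ("a" here) are left alone, so every augmented sentence keeps the original length.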

Reference

https://arxiv.org/pdf/1909.10351v2.pdf
For reference, it was rejected at ICLR 2020 ㅠ (looking at the ratings, it seems like a bit of a shame..): OpenReview link
