◼️ Comment

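This is another fairly long paper, but I read it because it seems to be a foundational work for ERC (emotion recognition in conversation).
Rather than trying to understand the paper completely, I read it for its overall flow and key ideas.
The experiments are run on two multimodal datasets, IEMOCAP and AVEC.
The model is based on GRUs plus attention.

The part labeled "Summary" in Section 3.3 describes the flow of the model.
The core idea is to carry the party and the context along as memories, each through its own GRU.
So, when trying out a new model, a per-party flow is worth considering.

Also, the error analysis of the model shows the following.

Many misclassifications occur between similar emotions.
Because neutral dominates the emotion distribution, there are also many misclassifications involving neutral.
At the dialogue level, performance drops when a speaker's emotion changes.
This suggests an idea: a module that checks whether a given speaker's emotion has changed might yield higher performance.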

0. Abstract

Emotion detection in conversations is a necessary step for many applications, such as:

opinion mining over chat history
social media threads
debates
argumentation mining
understanding consumer feedback in live conversations
and so on.

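Current systems do not treat the participants in a conversation individually by adapting to the speaker of each utterance.
In this paper, we describe a new method based on recurrent neural networks (RNNs).

It tracks the state of each individual participant throughout the conversation and uses this information for emotion classification.

Our model outperforms the previous state of the art (SoTA) by a fairly large margin on two different datasets.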

1 Introduction

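Emotion detection in conversations has been attracting growing interest in the research community because of its use in many important tasks.

Such tasks include opinion mining over chat history and social media threads in YouTube, Facebook, Twitter, and so on.

In this paper, we present an RNN-based method that can address these needs by processing the large amount of available conversational data.
Current systems, including the SoTA, do not distinguish the participants in a conversation in a meaningful way.

They are not aware of the speaker of a given utterance.

In contrast, we model each individual participant with a party state that changes as the conversation flows, depending on the utterance, the context, and the current party state.
Our model is based on the hypothesis that there are three major aspects relevant to the emotion in a conversation: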

the speaker
the context from the preceding utterances
the emotion of the preceding utterances.

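These three aspects are not necessarily independent, but modeling them separately achieves SoTA results (Table 2).
In a dyadic conversation, the participants have distinct roles.
So, to extract the context, it is important to consider the preceding turns of both the speaker and the listener at a given moment (Figure 1).
The proposed DialogueRNN system applies this view by employing three gated recurrent units (GRUs).

The incoming utterance is fed into the global GRU and the party GRU to update the context and the party state, respectively.
The three GRUs are the global GRU, the party GRU, and the emotion GRU (see Figure 2).
While encoding an utterance, the global GRU also encodes the corresponding party information.

Attending over this GRU gives a contextual representation that carries information about all preceding utterances by the different parties in the conversation.

In other words, attention over the preceding utterances' states in the GRU seems to be what produces the contextual representation.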

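The speaker state depends on this context, obtained through attention, and on the speaker's previous state.
This ensures that, at time t, the speaker state directly gets information from the speaker's previous state and from the global GRU, which holds information about the preceding parties.

So it uses a global GRU that holds the flow of the conversation and a GRU that holds speaker-specific information?

Finally, the updated speaker state is fed into the emotion GRU to decode the emotion representation of the given utterance, which is used for emotion classification.

At time t, the emotion GRU cell receives the emotion representation of t-1 and the speaker state of t.
The emotion GRU, along with the global GRU, plays a pivotal role in modeling inter-party relations.

The party GRU, on the other hand, models the relation between two sequential states of the same party.
In DialogueRNN, all three types of GRU are connected in a recurrent manner.
We believe that DialogueRNN outperforms state-of-the-art contextual emotion classifiers such as (Hazarika et al. 2018; Poria et al. 2017) because of its better context representation.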
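Key points

The three GRUs are the global GRU, the party GRU, and the emotion GRU (see Figure 2).
The global GRU runs at every time step and holds the overall flow of the conversation.
The party GRU captures the emotional flow of the same speaker.
The emotion GRU takes the speaker state produced by the party GRU and is used to classify the emotion.
When modeling, would the party GRU approach alone already be effective?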

2 Related Work

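Emotion recognition has attracted attention in various fields such as natural language processing, psychology, and cognitive science (Picard 2010).
Ekman (1993) found correlations between emotion and facial cues.
Datcu and Rothkrantz (2008) fused acoustic information with visual cues for emotion recognition.
Alm, Roth, and Sproat (2005) introduced text-based emotion recognition, which was developed further in the work of Strapparava and Mihalcea (2010).
Wöllmer et al. (2010) used contextual information for emotion recognition in a multimodal setting.
Recently, Poria et al. (2017) successfully used RNN-based deep networks for multimodal emotion recognition, which was followed by other works (Chen et al. 2017; Zadeh et al. 2018a; 2018b).
Reproducing human interaction requires a deep understanding of conversation. Ruusuvuori (2013) states that emotion plays a pivotal role in conversations.
It has been argued that emotional dynamics in a conversation is an interpersonal phenomenon (Richards, Butler, and Gross 2003).
Hence, our model incorporates interpersonal interactions in an effective way.
Further, since conversations have a natural temporal nature, we adopt the temporal nature through a recurrent network (Poria et al. 2017). Memory networks (Sukhbaatar et al. 2015) have been successful in several NLP areas, including question answering (Sukhbaatar et al. 2015; Kumar et al. 2016), machine translation (Bahdanau, Cho, and Bengio 2014), and speech recognition (Graves, Wayne, and Danihelka 2014).
Thus, Hazarika et al. (2018) used memory networks for emotion recognition in dyadic conversations, where two distinct memory networks enable inter-speaker interaction, yielding state-of-the-art performance.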

3 Methodology

3.1 Problem Definition

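Let the participants in a conversation be p1, p2, ..., pM. (In the datasets used here, M=2.)
The task is to predict the emotion labels (happy, sad, neutral, angry, excited, and frustrated) of the constituent utterances u1, u2, ..., uN.

Here utterance $u_t$ is uttered by party $p_{s(u_t)}$, where s is the mapping from an utterance to the index of its corresponding party. (That is, $s(u_t)$ is the speaker who said $u_t$, so it is 1 or 2 here.)

Also, $u_t$ is a $D_m$-dimensional utterance representation, obtained with the feature extractors described below.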
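To make the notation concrete, here is a minimal sketch of how a dialogue could be stored. The `Utterance` class, its field names, the toy feature values, and the 0-based speaker index are all my own assumptions for illustration, not something the paper defines.

```python
from dataclasses import dataclass
from typing import List

# The six emotion labels listed in the problem definition.
EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

@dataclass
class Utterance:
    features: List[float]  # u_t: a D_m-dimensional utterance representation
    speaker: int           # s(u_t): index of the party who uttered u_t (0-based here)
    label: str             # gold emotion label of the utterance

def speakers_of(dialogue: List[Utterance]) -> List[int]:
    """Return s(u_1), ..., s(u_N) for a dialogue."""
    return [u.speaker for u in dialogue]

# A toy dyadic dialogue (M = 2) with made-up 2-dimensional features.
dialogue = [
    Utterance([0.1, 0.3], speaker=0, label="neutral"),
    Utterance([0.7, 0.2], speaker=1, label="frustrated"),
    Utterance([0.4, 0.9], speaker=0, label="angry"),
]
print(speakers_of(dialogue))
```

The model then only needs the feature sequence and the speaker sequence as input, with the labels as targets.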

3.2 Unimodal Feature Extraction

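For a fair comparison with the SoTA method, conversational memory networks (CMN) (Hazarika et al. 2018), we follow the same feature extraction procedure.
Textual Feature Extraction

We extract textual features with a CNN.
Following Kim (2014), we obtain n-gram features from each utterance using three distinct convolution filters of sizes 3, 4, and 5, each with 50 feature maps.
The outputs are max-pooled and then passed through a ReLU.
These activations are concatenated and fed to a 100-dimensional dense layer, whose output is taken as the textual utterance representation.
This network is trained at the utterance level with the emotion labels.
So it is not trained end-to-end; the textual features seem to be extracted from a network pre-trained on utterance-level emotion classification?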
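The shapes involved can be checked with a small dependency-free sketch: three filter widths (3, 4, 5) with 50 feature maps each give 150 pooled features, which a dense layer maps to 100 dimensions. The word-embedding size (8), the sentence length (12), and the random untrained weights are assumptions for illustration only; a real implementation would use a deep learning library and trained parameters.

```python
import random

def ngram_conv_max_pool(tokens, filters, width):
    """Apply each filter to every window of `width` tokens, max-pool over time, then ReLU."""
    pooled = []
    for f in filters:  # f is a flat list of width * dim weights
        best = float("-inf")
        for start in range(len(tokens) - width + 1):
            window = [x for tok in tokens[start:start + width] for x in tok]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        pooled.append(max(0.0, best))  # ReLU applied after max-pooling
    return pooled

def dense(vec, weights, bias):
    """Plain affine layer: weights @ vec + bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

rng = random.Random(0)
dim, n_maps, out_dim = 8, 50, 100  # embedding size is an assumption
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(12)]

features = []
for width in (3, 4, 5):
    filters = [[rng.uniform(-0.1, 0.1) for _ in range(width * dim)] for _ in range(n_maps)]
    features += ngram_conv_max_pool(tokens, filters, width)  # concatenate the pooled maps

W = [[rng.uniform(-0.1, 0.1) for _ in range(len(features))] for _ in range(out_dim)]
b = [0.0] * out_dim
utterance_repr = dense(features, W, b)  # textual utterance representation
print(len(features), len(utterance_repr))
```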

Audio and Visual Feature Extraction

Identical to Hazarika et al. (2018), we use 3D-CNN and openSMILE (Eyben, Wöllmer, and Schuller 2010) for visual and acoustic feature extraction, respectively.

3.3 Our Model

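Summary

There is no need to implement this, so here is just the flow as I understood it.
Overall, the model consists of speaker-state modeling plus a global-state flow.
Since there are two speakers, there are three flows in total.
While one speaker is talking, the other speaker is the listener.
Given the current utterance, the state of its speaker is updated with a GRU whose inputs are 1) the previous speaker state (the GRU output), 2) the context information (from the global state), and 3) the current utterance.
The context is a weighted sum, via attention, over the global state's output at each step. (This is called the Speaker GRU below.)
The speaker update for the current utterance is called the party GRU.
When the speaker talks, the listener's state could also be updated, but the experiments apparently showed little difference, so the listener's state is simply left unchanged.
Emotion classification is done by feeding the output of the party GRU into GRU_E and predicting from it.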
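The flow summarized above can be sketched in plain Python as follows. This is my own minimal reading of the description, not the authors' implementation: the bias-free `GRUCell`, the attention projection `W_att`, the hidden size, and the zero context for the first utterance are all assumptions, and the weights are random and untrained.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class GRUCell:
    """Minimal bias-free GRU cell (assumed convention: h_new = (1 - z) * h + z * c)."""
    def __init__(self, input_dim, hidden_dim):
        self.W = {g: rand_matrix(hidden_dim, input_dim) for g in "rzc"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim) for g in "rzc"}

    def __call__(self, h, x):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def run_dialogue(utterances, speakers, n_parties=2, d=4):
    d_u = len(utterances[0])
    gru_g = GRUCell(d_u + d, d)  # global GRU: input is u_t concatenated with the speaker state
    gru_p = GRUCell(d_u + d, d)  # party GRU: input is u_t concatenated with the context
    gru_e = GRUCell(d, d)        # emotion GRU: input is the updated speaker state
    W_att = rand_matrix(d, d_u)  # hypothetical attention projection
    q = [[0.0] * d for _ in range(n_parties)]  # party states start as null vectors
    g_prev, e_prev = [0.0] * d, [0.0] * d
    g_hist, emotions = [], []
    for u, s in zip(utterances, speakers):
        g_t = gru_g(g_prev, u + q[s])
        if g_hist:  # attend over the preceding global states
            query = matvec(W_att, u)
            alpha = softmax([sum(a * b for a, b in zip(query, g)) for g in g_hist])
            c = [sum(a * g[k] for a, g in zip(alpha, g_hist)) for k in range(d)]
        else:       # no history yet: use a zero context (assumption)
            c = [0.0] * d
        q[s] = gru_p(q[s], u + c)     # only the speaker's state changes; listeners stay as is
        e_prev = gru_e(e_prev, q[s])  # emotion representation of the current utterance
        emotions.append(e_prev)
        g_hist.append(g_t)
        g_prev = g_t
    return emotions, q

utts = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]]
emotions, q = run_dialogue(utts, speakers=[0, 0, 0])
print(len(emotions), q[1])
```

In the usage example only party 0 speaks, so the state of party 1 stays at its null initialization, which matches the "leave the listener unchanged" choice.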

We assume that the emotion of an utterance in a conversation depends on three major factors:

1. the speaker
2. the context given by the preceding utterances
3. the emotion behind the preceding utterances

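Our model, DialogueRNN, is shown in Figure 2a and models these three factors as follows.

Each party is modeled using a party state, which changes as and when that party utters an utterance.

This lets the model track the parties' emotional dynamics through the conversation, which are related to the emotion behind the utterances.
Furthermore, the context of an utterance is modeled using a global state, which provides the context representation needed for an accurate party state representation.

global state: called global, because of being shared among the parties
In the global state, the preceding utterances and the party states are jointly encoded for the context representation. (See the figure.)
It sounds complicated in words, but it can be understood as the figure shows: the global and party states interact and are used in encoding each other.

Finally, the model infers the emotion representation from the party state of the speaker, along with the preceding speakers' states as context.
This emotion representation is used for the final emotion classification.
We use GRU cells to update the states and representations.

Each GRU cell computes a hidden state.
We describe the GRU computation in detail in the appendix.

GRUs are efficient networks with trainable parameters $W^{\{r,z,c\}}_{*,\{h,x\}}$ and $b^{\{r,z,c\}}_{*}$.
We model the emotion representation of the current utterance as a function of the current speaker state and the emotion representation of the previous utterance.
Finally, the emotion representation is sent to a softmax layer for emotion classification.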
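For reference, the GRU cells used for all the state updates above can be written in the standard textbook form below. I have not checked this against the paper's appendix, so the gate convention in the last line and the use of $\tilde{h}_t$ for the candidate state (the parameters with superscript $c$) are assumptions.

$$
\begin{aligned}
z_t &= \sigma\left(W^{z}_{*,x} x_t + W^{z}_{*,h} h_{t-1} + b^{z}_{*}\right) \\
r_t &= \sigma\left(W^{r}_{*,x} x_t + W^{r}_{*,h} h_{t-1} + b^{r}_{*}\right) \\
\tilde{h}_t &= \tanh\left(W^{c}_{*,x} x_t + W^{c}_{*,h} (r_t \odot h_{t-1}) + b^{c}_{*}\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Here $x_t$ is the cell input, $h_{t-1}$ the previous hidden state, $\sigma$ the logistic sigmoid, and $\odot$ the element-wise product. The subscript $*$ stands for the particular cell (global, party, or emotion).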
Global State (Global GRU)

The global state aims to capture the context of a given utterance by jointly encoding the utterance and the speaker state.
Each state also serves as a speaker-specific utterance representation.
Attending on these states facilitates inter-speaker and inter-utterance dependencies to produce an improved context representation.

Party State (Party GRU)

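DialogueRNN keeps track of the state of the individual speakers using fixed-size vectors q1, q2, ..., qM throughout the conversation. (The q's seem to be the GRU outputs.)
These states represent the speakers' states in the conversation that are relevant to emotion classification.
We update these states based on each participant's current (time t) role in the conversation, either speaker or listener, and on the incoming utterance $u_t$.
These state vectors are initialized with null vectors for all participants.
The main purpose of this module is to make the model aware of the speaker of each utterance and handle it accordingly.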

Speaker Update (Speaker GRU)

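A speaker usually frames a response based on the context, which is the preceding utterances in the conversation.
Hence, we capture the context relevant to the utterance $u_t$ by attending over the preceding global states.
The attention scores α are computed over the global states of the preceding utterances.
Finally, in Eq. (4) the context vector $c_t$ is calculated by pooling the previous global states with α.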
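One way to write the attention step described above is sketched below. It is consistent with the description (softmax scores over the preceding global states, followed by a weighted pooling), but it is not a verified copy of the paper's equations; the projection matrix $W_\alpha$ and the exact scoring form are my assumptions.

$$
\alpha = \mathrm{softmax}\left(u_t^{\top} W_\alpha \, [g_1, g_2, \dots, g_{t-1}]\right), \qquad
c_t = \alpha \, [g_1, g_2, \dots, g_{t-1}]^{\top}
$$

Here $g_1, \dots, g_{t-1}$ are the preceding global states, so α holds one weight per preceding utterance and $c_t$ is their weighted sum. The speaker state would then be updated from its previous value with $u_t$ and $c_t$ as input to the party GRU.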

Listener Update

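The listener state models the change in the listeners' state caused by the speaker's utterance.
We tried two listener state update mechanisms:

Simply keep the listener's state unchanged.
Employ another GRU cell, GRU_L, to update the listener state based on the listener's visual cues (facial expressions) $v_{i,t}$ and the context $c_t$.

The second approach yields very similar results while increasing the number of parameters, so the simpler first approach is sufficient.
This is because a listener becomes relevant to the conversation only when he or she speaks.
In other words, a silent party has no influence on the conversation.
Now, when a party speaks, we update its state $q_i$ with the context $c_t$, which contains relevant information on all the preceding utterances, making an explicit listener state update unnecessary.
This is shown in Table 2.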

Emotion Representation (Emotion GRU)

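We infer the emotionally relevant representation $e_t$ of utterance $u_t$ from the speaker's state $q_{s(u_t),t}$ and the emotion representation of the previous utterance, $e_{t-1}$.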

Emotion Classification

We use a two-layer perceptron with a final softmax layer to calculate c = 6 emotion-class probabilities from the emotion representation $e_t$ of utterance $u_t$, and then we pick the most likely emotion class.
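The classification step above can be sketched as a hidden layer followed by a softmax and an argmax. The ReLU on the hidden layer, the parameter names, and the hand-picked toy weights are assumptions chosen so the result is easy to verify by hand.

```python
import math

EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

def affine(W, b, v):
    """Plain affine layer: W @ v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(e_t, W_l, b_l, W_smax, b_smax):
    """Two-layer perceptron: hidden layer with ReLU (assumed), then softmax over classes."""
    hidden = [max(0.0, x) for x in affine(W_l, b_l, e_t)]
    probs = softmax(affine(W_smax, b_smax, hidden))
    best = max(range(len(probs)), key=lambda i: probs[i])  # most likely class index
    return probs, best

# Toy parameters: identity hidden layer, and only the "neutral" row reacts to the input.
e_t = [1.0, -1.0]
W_l, b_l = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_smax = [[0.0, 0.0] for _ in EMOTIONS]
W_smax[2] = [2.0, 0.0]
b_smax = [0.0] * len(EMOTIONS)

probs, best = classify(e_t, W_l, b_l, W_smax, b_smax)
print(EMOTIONS[best])
```

With these toy weights the hidden vector is [1, 0] and the logits are [0, 0, 2, 0, 0, 0], so the predicted class is neutral.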

Training

We use categorical cross-entropy along with L2-regularization as the measure of loss (L) during training.

We used stochastic gradient descent based Adam (Kingma and Ba 2014) optimizer to train our network.
Hyperparameters are optimized using grid search (values are added to the supplementary material).
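As a rough illustration of the loss described in this section, the sketch below averages the negative log-probability of the gold class over utterances and adds an L2 penalty. The squared-norm form of the penalty, the averaging, and the names `probs`, `labels`, `params`, and `lam` are my assumptions; the paper's exact normalization may differ.

```python
import math

def training_loss(probs, labels, params, lam):
    """Categorical cross-entropy averaged over utterances, plus lam * sum of squared weights."""
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    l2 = sum(w * w for w in params)  # squared L2 norm of the parameters (assumed form)
    return nll + lam * l2

# Two utterances with two classes each, and two scalar parameters.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
params = [1.0, 2.0]
loss = training_loss(probs, labels, params, lam=0.1)
print(round(loss, 4))
```

In practice an optimizer such as Adam would minimize this quantity with respect to the network parameters.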

3.4 DialogueRNN Variants

We use DialogueRNN (Section 3.3) as the basis for the following models:

DialogueRNN + Listener State Update (DialogueRNN$_l$)
Bidirectional DialogueRNN (BiDialogueRNN)
DialogueRNN + attention (DialogueRNN+Att)
Bidirectional DialogueRNN + Emotional attention (BiDialogueRNN+Att)

The paper tried the variants above, and the experimental results are in Table 2.
As a result, BiDialogueRNN+Att performs best.
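The bidirectional idea can be sketched generically: run an encoder over the dialogue in both directions and concatenate the per-utterance outputs. Whether the paper builds BiDialogueRNN in exactly this way is my assumption, and `running_sum_encoder` is a toy stand-in for a full DialogueRNN pass, used only so the result can be checked by hand.

```python
def bidirectional(encode, utterances, speakers):
    """Concatenate forward and backward per-utterance representations."""
    fwd = encode(utterances, speakers)
    # Encode the reversed dialogue, then flip the outputs back to the original order.
    bwd = encode(utterances[::-1], speakers[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def running_sum_encoder(utterances, speakers):
    """Toy encoder (hypothetical): the state is the running sum of the first feature."""
    total, out = 0.0, []
    for u in utterances:
        total += u[0]
        out.append([total])
    return out

reps = bidirectional(running_sum_encoder, [[1.0], [2.0], [3.0]], [0, 1, 0])
print(reps)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

Each output now combines information from the past (first half) and from the future (second half) of the dialogue.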

4 Experimental Setting

4.1 Datasets Used

We use two datasets, IEMOCAP (Busso et al. 2008) and AVEC (Schuller et al. 2012), to evaluate DialogueRNN.
We split both datasets into train and test sets at an 80/20 ratio such that the partitions do not share any speaker.
Table 1 shows the distribution of train and test samples for both datasets.
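A speaker-independent split like the one described above can be sketched by grouping dialogues so that each speaker belongs to exactly one group and then holding out whole groups. The `session` key, the dictionary layout, and the toy data are hypothetical; note that the 80/20 ratio is applied to groups here, so the sample ratio is only approximate.

```python
from collections import defaultdict

def split_by_group(dialogues, test_fraction=0.2):
    """Hold out whole groups so that train and test never share a speaker."""
    groups = defaultdict(list)
    for d in dialogues:
        groups[d["session"]].append(d)
    ordered = sorted(groups)  # deterministic group order
    n_test = max(1, round(len(ordered) * test_fraction))
    test_keys = set(ordered[-n_test:])
    train = [d for k in ordered if k not in test_keys for d in groups[k]]
    test = [d for k in sorted(test_keys) for d in groups[k]]
    return train, test

# Toy data: 5 sessions, 2 dialogues each, one pair of speakers per session.
dialogues = [
    {"id": f"S{i}_d{j}", "session": i, "speakers": (f"S{i}F", f"S{i}M")}
    for i in range(1, 6) for j in range(2)
]
train, test = split_by_group(dialogues)
train_speakers = {s for d in train for s in d["speakers"]}
test_speakers = {s for d in test for s in d["speakers"]}
print(len(train), len(test), train_speakers & test_speakers)
```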

4.2 Baselines and State of the Art

c-LSTM (Poria et al. 2017)
Bidirectional LSTM (Hochreiter and Schmidhuber 1997) is used to capture the context from the surrounding utterances to generate context-aware utterance representation.
However, this model does not differentiate among the speakers.
c-LSTM+Att (Poria et al. 2017)
In this variant attention is applied to the c-LSTM output at each timestamp by following Eqs. (13) and (14).
This provides better context to the final utterance representation.
TFN (Zadeh et al. 2017)
This is specific to multimodal scenario.
Tensor outer product is used to capture inter-modality and intra-modality interactions.
This model does not capture context from surrounding utterances.
MFN (Zadeh et al. 2018a)
Specific to multimodal scenario, this model utilizes multi-view learning by modeling view-specific and cross-view interactions.
Similar to TFN, this model does not use contextual information.
CNN (Kim 2014)
This is identical to our textual feature extractor network (Section 3.2) and it does not use contextual information from the surrounding utterances.
Memnet (Sukhbaatar et al. 2015)
As described in Hazarika et al. (2018), the current utterance is fed to a memory network, where the memories correspond to preceding utterances.
The output from the memory network is used as the final utterance representation for emotion classification.
CMN (Hazarika et al. 2018)
This state-of-the-art method models utterance context from dialogue history using two distinct GRUs for two speakers.
Finally, utterance representation is obtained by feeding the current utterance as query to two distinct memory networks for both speakers.

5.1 Comparison with the State of the Art

This section just compares against the baseline models above, so I skip it.

5.2 DialogueRNN vs. DialogueRNN Variants

Skipped (it is mostly additional explanation of 3.4).

5.3 Multimodal Setting

5.4 Case Studies

5.5 Error Analysis

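A noticeable trend in the predictions is the high level of cross-predictions among related emotions.
Most of the model's misclassifications for the happy emotion go to the excited class.
Also, anger and frustrated are frequently misclassified as each other.
We think the subtle difference between the emotions in each pair makes them harder to disambiguate.
Another class with a high error rate is the neutral class.

The main reason is the class distribution over the considered emotions, in which neutral is the majority.

At the dialogue level, we observe that a significant share of the errors occurs at turns where the emotion changes from the same participant's previous turn.
Across all occurrences of such emotion shifts in the test set, our model predicts 47.5% of the instances correctly.

This is low compared to the 69.2% success rate at turns with no emotion shift.
In other words, it seems that predictions are harder to get right when an emotion shift occurs.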
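The shift versus no-shift comparison above can be reproduced on any set of predictions with a small helper. I am assuming that an "emotion shift" means the gold label differs from the gold label of the same speaker's previous turn; the function name and the toy dialogue are made up for illustration.

```python
def shift_accuracy(speakers, gold, pred):
    """Accuracy on turns with and without an emotion shift relative to the same speaker's last turn."""
    last = {}                                      # speaker -> gold emotion of their previous turn
    stats = {"shift": [0, 0], "no_shift": [0, 0]}  # [correct, total]
    for s, y, p in zip(speakers, gold, pred):
        if s in last:                              # a speaker's first turn has no reference emotion
            key = "shift" if y != last[s] else "no_shift"
            stats[key][0] += int(y == p)
            stats[key][1] += 1
        last[s] = y
    return {k: (c / n if n else None) for k, (c, n) in stats.items()}

speakers = [0, 1, 0, 1, 0]
gold = ["neutral", "neutral", "angry", "neutral", "angry"]
pred = ["neutral", "neutral", "neutral", "neutral", "angry"]
result = shift_accuracy(speakers, gold, pred)
print(result)  # {'shift': 0.0, 'no_shift': 1.0}
```

In the toy dialogue the only shift (speaker 0 going from neutral to angry) is missed, while both no-shift turns are predicted correctly.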

Changes in emotion in a conversation are a complex phenomenon governed by latent dynamics.
Further improvement on these cases remains an open area of research.

5.6 Ablation Study

6 Conclusion

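We present an RNN-based neural network for ERC.
In contrast to the SoTA method, CMN, our method treats each incoming utterance by taking the characteristics of its speaker into account, which gives the utterance a finer-grained context.
Our model achieves SoTA results on two datasets in both textual and multimodal settings.
Our method can be extended to the multi-party setting, and we plan to explore this in future work.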

Reference

https://arxiv.org/pdf/1811.00405.pdf


NL-092, DialogueRNN: An Attentive RNN for Emotion Detection in Conversations (2019-AAAI)
