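The following answers are based on the paper SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities.

1. Did they train a model that handles both text and audio?

Yes. The paper targets a multimodal LLM that can take both text and speech as input and produce both as output. The paper calls this “intrinsic cross-modal conversational abilities”.

Model input/output

According to the paper, the following combinations are supported.

Input	Output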
Text	Text
Text	Speech
Speech	Text
Speech	Speech

Notably, speech-to-speech dialogue is supported as well.

Structurally, the flow is:

Input speech → discrete speech unit sequence
Input text → ordinary text tokens
Output speech → discrete speech units are generated, then a vocoder reconstructs the waveform

2. During training, were both text and audio converted to discrete tokens and trained with next-token prediction?

Partly correct.

More precisely:

Speech is converted to discrete tokens (units)
Text uses the ordinary LLM tokenizer tokens
Autoregressive next-token prediction is then performed over a single shared vocabulary

Which tokenizers were used

Speech tokenizer

The speech tokenizer is an mHuBERT / HuBERT-based discrete unit extractor.

As described in the paper:

Extract intermediate HuBERT representations
Apply k-means clustering
Use the cluster index as the discrete unit token
Remove adjacent duplicate tokens

So speech is processed as:

waveform → HuBERT → cluster index → discrete speech tokens
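The last three steps of the list above (cluster assignment, using the cluster index as the unit, and removing adjacent duplicates) can be sketched in a few lines. This is only a toy illustration: the 2-dimensional "frames" and the three centroids are made-up stand-ins for real HuBERT features and trained k-means centroids, and the function names are hypothetical, not from the paper's code.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest (squared L2) to one feature frame."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to its cluster index, then collapse adjacent repeats."""
    raw = [nearest_centroid(f, centroids) for f in frames]
    deduped: List[int] = []
    for u in raw:
        if not deduped or deduped[-1] != u:
            deduped.append(u)
    return deduped


# Toy data (assumption): real inputs would be HuBERT intermediate features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.1], [3.8, 4.1], [4.2, 3.9], [0.0, 0.1]]
units = frames_to_units(frames, centroids)
print(units)
```

Note that only adjacent repeats are collapsed, so a unit may still appear again later in the sequence.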

Text tokenizer

The text tokenizer is not stated explicitly.

However, since the backbone is LLaMA, it is natural to assume that the LLaMA tokenizer was used.

Were both trained with next-token prediction over discrete tokens

Speech

Yes.

In Stage 1, next-token prediction is performed directly on the discrete speech units. The loss is also stated explicitly.

That is, the model learns:

P(u_i\mid u_{<i})
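As a rough illustration of what training on P(u_i | u_{<i}) means, the sketch below sums the negative log-probability of each unit given its prefix, which is the usual form of an autoregressive loss. The `toy_prob` model is a hypothetical hand-written distribution over a two-unit vocabulary, not anything from the paper; a real model would produce these probabilities from the LLM's softmax output.

```python
import math
from typing import Callable, List, Sequence


def next_token_nll(units: Sequence[int],
                   prob_fn: Callable[[int, Sequence[int]], float]) -> float:
    """Sum of -log P(u_i | u_{<i}) over every position of the unit sequence."""
    total = 0.0
    for i, u in enumerate(units):
        total -= math.log(prob_fn(u, units[:i]))
    return total


def toy_prob(unit: int, prefix: Sequence[int]) -> float:
    """Hypothetical model over units {0, 1}: prefers switching units."""
    if not prefix:
        return 0.5  # uniform over the two units at the first position
    return 0.8 if unit != prefix[-1] else 0.2


sequence: List[int] = [0, 1, 0]
nll = next_token_nll(sequence, toy_prob)
print(round(nll, 4))
```

Minimizing this quantity over a large unit corpus is what "next-token prediction on speech units" amounts to in practice.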

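Mixed text + speech

The vocabulary is expanded with speech unit tokens before Stage 1, so text tokens and speech unit tokens share one vocabulary space. From Stage 2 onward they also appear together in the same training sequences.

From the model's point of view:

[text tokens][speech unit tokens]

Everything is just a token in one sequence, modeled autoregressively.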
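A minimal sketch of the shared-vocabulary idea follows, assuming speech units are written as tokens like `<3>` and wrapped in `<sosp>` / `<eosp>` markers. These token names, the tiny text vocabulary, and the helper functions are illustrative assumptions; the notes above do not specify the exact token format.

```python
from typing import Dict, List, Sequence


def extend_vocab(text_vocab: Dict[str, int], num_units: int) -> Dict[str, int]:
    """Append speech marker and unit tokens after the existing text token ids."""
    vocab = dict(text_vocab)
    new_tokens = ["<sosp>", "<eosp>"] + [f"<{k}>" for k in range(num_units)]
    for name in new_tokens:
        if name not in vocab:
            vocab[name] = len(vocab)
    return vocab


def encode_mixed(text_tokens: Sequence[str], units: Sequence[int],
                 vocab: Dict[str, int]) -> List[int]:
    """Encode text tokens followed by a speech segment as one id sequence."""
    speech = ["<sosp>"] + [f"<{u}>" for u in units] + ["<eosp>"]
    return [vocab[t] for t in list(text_tokens) + speech]


# Hypothetical 3-token text vocabulary and 4 speech units.
text_vocab = {"<unk>": 0, "hello": 1, "world": 2}
vocab = extend_vocab(text_vocab, num_units=4)
ids = encode_mixed(["hello", "world"], [3, 0, 2], vocab)
print(len(vocab), ids)
```

Because the new ids are appended after the text ids, the pretrained text embeddings keep their positions and only the added rows need to be learned from scratch.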

3. Explain the training order

The paper uses an explicit three-stage training procedure.

Overall training order

Stage 1 — Modality-Adaptation Pre-training

Goal

Adapt the LLM to the discrete speech units added to its vocabulary

Initial state

Backbone: pretrained LLaMA-13B
Vocabulary expanded
Speech unit tokens added

So training starts from a text-pretrained backbone.

Speech and text were not jointly pretrained from scratch.

Training data

LibriLight
60K hours of unlabeled speech

Training method

Next-token prediction on speech unit sequences only.

In other words, this is a pure speech LM adaptation stage.

Stage 2 — Cross-modal Instruction Fine-Tuning

Goal

Speech-text alignment and instruction following

Data

The paper states that it mixes the following:

SpeechInstruct Cross-modal Instruction
moss-002-sft text instruction dataset

That is, it uses both:

speech↔text paired instructions
ordinary text instruction-tuning data

Amount of speech-text data

Cross-modal Instruction dataset:

9 million unit-text pairs

Ratio of text to audio

The paper does not report an exact mixing ratio.

Stage 3 — Chain-of-Modality Instruction Fine-Tuning

Goal

Strengthen speech reasoning and spoken dialogue

Example:

speech input
→ internal text reasoning
→ speech output
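The speech → text → speech flow above could be serialized into a single training string roughly as follows. This is a sketch under assumptions: the `[Human]` / `[Assistant]` labels, the `<sosp>` / `<eosp>` markers, the `;` separator, and the split of the "internal text reasoning" into a transcript plus a text reply are all hypothetical choices made for illustration, not the paper's confirmed template.

```python
from typing import Dict, Sequence


def units_to_str(units: Sequence[int]) -> str:
    """Render speech units as a token string, e.g. <sosp><5><9><eosp>."""
    return "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"


def build_com_example(speech_in: Sequence[int], text_in: str,
                      text_out: str, speech_out: Sequence[int]) -> Dict[str, str]:
    """Build one chain-of-modality sample: speech in, text steps, speech out."""
    prompt = f"[Human]: {units_to_str(speech_in)} [Assistant]: "
    # The target first spells out the intermediate text, then the speech reply.
    target = f"{text_in}; {text_out}; {units_to_str(speech_out)}"
    return {"prompt": prompt, "target": target}


example = build_com_example(
    speech_in=[5, 9],
    text_in="What is two plus two?",
    text_out="It is four.",
    speech_out=[7, 1],
)
print(example["prompt"] + example["target"])
```

The point of the design is that the model generates the text steps before the speech units, so the speech output is conditioned on explicit text.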

Training method

LoRA fine-tuning is used.

So this is parameter-efficient tuning rather than full fine-tuning.
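To make "parameter-efficient" concrete, here is a small pure-Python sketch of the general LoRA idea: the pretrained weight W stays frozen, and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) · B A. The matrix sizes, the rank, and the alpha value are arbitrary toy numbers; the notes above do not say which LoRA hyperparameters or target layers the paper used.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A; W is frozen, A and B are trainable."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy numbers (assumption): a 2x2 frozen weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]        # shape (d_out, r)
a = [[0.5, -0.5]]         # shape (r, d_in)
w_eff = lora_effective_weight(w, a, b, alpha=2.0, rank=1)
print(w_eff)
```

For a d × d weight, full fine-tuning updates d² values while the low-rank pair holds only 2·d·r, which is where the savings come from when r is much smaller than d.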

Data

Synthetic speech instruction dataset based on moss-002-sft
37,969 samples

Was text+audio trained jointly from the start?

No.

The order is:

pretrained text LLM (LLaMA)
add speech tokens to the vocabulary
speech-only adaptation
speech-text instruction tuning
chain-of-modality tuning

Is there post-training as well?

Yes.

From the paper's perspective:

Stage 2 = instruction tuning
Stage 3 = additional LoRA instruction tuning

These two stages can be seen as playing the post-training role.

Notably, there is no RLHF or preference optimization.

4. What form does the audio data take?

Both plain audio and audio-text pairs are used.

(1) Plain audio

Stage 1 uses:

an unlabeled speech corpus
no transcripts

So it is pure audio sequence modeling.

(2) Audio-text pair

Stage 2 is built on ASR datasets.

Datasets used:

GigaSpeech
Common Voice
LibriSpeech

All of them are ASR datasets with transcriptions.

So the training data consists of pairs of the form:

(audio, transcript)
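One plausible way to turn a single (audio, transcript) pair into speech↔text instruction samples is shown below, with the audio already converted to units. The instruction wording, the field names, and the `<sosp>` / `<eosp>` markers are hypothetical; the notes above only state that transcribed ASR data underlies the Stage 2 instruction set.

```python
from typing import Dict, List, Sequence


def make_cross_modal_samples(units: Sequence[int], transcript: str) -> List[Dict[str, str]]:
    """Build a speech-to-text and a text-to-speech sample from one ASR pair."""
    speech = "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"
    speech_to_text = {
        "instruction": "Transcribe the speech into text.",  # hypothetical wording
        "input": speech,
        "output": transcript,
    }
    text_to_speech = {
        "instruction": "Read the text aloud.",  # hypothetical wording
        "input": transcript,
        "output": speech,
    }
    return [speech_to_text, text_to_speech]


samples = make_cross_modal_samples([12, 4, 4, 30], "good morning")
for s in samples:
    print(s)
```

Using each pair in both directions is one way a fixed ASR corpus can supply instruction data for both understanding and generation.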

How speech itself is represented

Inside the model, the waveform is never used directly.

It is always converted first:

waveform
→ HuBERT discrete unit
→ token sequence

5. How is the model evaluated

Evaluation method

The paper relies mostly on human evaluation and case studies.

There are almost no quantitative benchmark scores.

Is each training stage evaluated?

No.

The paper evaluates the capabilities of the final SpeechGPT model.

It provides almost no per-stage ablations or intermediate evaluations.

Are both text and audio benchmarks evaluated?

Only partly.

The evaluation focuses on:

cross-modal instruction following
spoken dialogue

However, rigorous quantitative evaluation on the following is absent:

standard NLP benchmarks
standard ASR benchmarks
standard TTS benchmarks

So it reads more like a qualitative, demo-driven paper.

Evaluation examples

The paper provides examples of:

speech transcription
speech generation
speech dialogue

6. What are the paper's motivation and contributions

Motivation

The paper points out problems with existing speech-LLM approaches.

Existing approach:

Speech → ASR → LLM → TTS

cascade pipeline

Problems:

no modality alignment
no knowledge transfer between speech and text
prosody/emotion is lost
some systems can only generate speech and lack speech understanding

Core idea

Discretize speech into tokens so that

speech tokens + text tokens

can be modeled directly inside a single autoregressive LM.

Main contributions

Contributions claimed by the paper:

1. A multimodal LLM capable of speech input and output

2. Construction of the SpeechInstruct dataset

Claimed to be the first large-scale speech-text instruction-following dataset.

3. Chain-of-Modality training

Introduces a speech → text reasoning → speech generation structure.

4. Modality unification based on discrete representations

Brings speech into token space so that it integrates naturally into the LLM.

So there are no quantitative scores such as WER.

Reference

https://arxiv.org/pdf/2305.11000

Audio-011, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities, Findings of EMNLP 2023
