The following answers are based on the paper SPIRIT LM: Interleaved Spoken and Written Language Model.

1. Did they train a single model that handles both text and audio?

Yes. SPIRIT LM is a multimodal model that handles text and speech inside one autoregressive language model. The key point is that speech tokens and text tokens are mixed into a single token stream and trained with next-token prediction.

The paper says so explicitly:

“speech and text sequences are concatenated as a single stream of tokens”
“language model trained with next token prediction”

Model input/output

Input

The model accepts three kinds of input:

Text-only
Speech-only
Interleaved speech-text sequence

Examples:

[TEXT] hello world
[SPEECH][Hu34][Hu12]...
[TEXT] the cat [SPEECH]... [TEXT] on the mat
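
A minimal sketch of how the three input kinds above could be serialized into one stream. The `[TEXT]`, `[SPEECH]`, and `[Hu…]` markers follow the examples above; the `serialize` helper and the segment representation are assumptions made for illustration, not the paper's code.

```python
from typing import List, Tuple, Union

# A segment is ("text", "hello world") or ("speech", [34, 12, ...]), where the
# integers are HuBERT unit ids. This representation is hypothetical.
Segment = Tuple[str, Union[str, List[int]]]

def serialize(segments: List[Segment]) -> str:
    """Flatten modality-tagged segments into one marker-delimited stream."""
    parts = []
    for modality, content in segments:
        if modality == "text":
            parts.append("[TEXT] " + content)
        elif modality == "speech":
            parts.append("[SPEECH]" + "".join(f"[Hu{u}]" for u in content))
        else:
            raise ValueError(f"unknown modality: {modality}")
    return " ".join(parts)

text_only = serialize([("text", "hello world")])
speech_only = serialize([("speech", [34, 12])])
mixed = serialize([("text", "the cat"), ("speech", [3, 7]), ("text", "on the mat")])
print(text_only)
print(speech_only)
print(mixed)
```

Because all three cases reduce to the same kind of sequence, the language model itself needs no modality-specific input path in this sketch.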

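Output

The output side is just as flexible. The model can generate both text and speech tokens, so all of the following are possible:

text→speech
speech→text
speech→speech continuation
text→text continuation

In other words, it can switch modalities while generating.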

2. During training, are both text and audio converted to discrete tokens and trained with next-token prediction?

Yes.

SPIRIT LM converts both text and speech into discrete token sequences and then trains with standard autoregressive next-token prediction.

Text tokenizer

Text uses:

the default LLaMA tokenizer
subword BPE tokens

In the paper's words:

“text is encoded with subword BPE tokens”
“We tokenize text with the default LLaMA tokenizer”

Audio tokenizer

BASE model

Speech uses:

a HuBERT-based discrete unit tokenizer
a vocabulary size of 501

In other words, the pipeline is:

raw waveform
→ HuBERT SSL encoder
→ clustering
→ discrete speech unit
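
The clustering step can be sketched as nearest-centroid assignment: each encoder frame is replaced by the index of the closest cluster center. The centroids and frame features below are random toy data, and the 16-dimensional feature size is an arbitrary assumption; a real system would take the features from a HuBERT encoder and the centroids from a clustering fitted on such features.

```python
import random

def quantize(frames, centroids):
    """Return, for each frame, the index of the nearest centroid (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist2(frame, centroids[k]))
            for frame in frames]

rng = random.Random(0)
DIM, N_UNITS = 16, 501  # 501 matches the vocabulary size above; DIM is a toy value
centroids = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_UNITS)]

# Toy "encoder outputs": slightly perturbed copies of known centroids.
frames = [[c + rng.gauss(0, 0.001) for c in centroids[k]] for k in (34, 12, 12, 88)]

units = quantize(frames, centroids)
print("[SPEECH]" + "".join(f"[Hu{u}]" for u in units))
```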

EXPRESSIVE model

EXPRESSIVE adds two more token types on top of this:

pitch tokens
style tokens

So its speech tokens combine three types:

phonetic tokens (HuBERT)
pitch tokens
style tokens

3. Explain the training order

Did it start from a text backbone?

Yes.

SPIRIT LM starts from a pretrained LLaMA 2 7B.

So it does not do multimodal pretraining from scratch. The order is:

text LLM pretraining
then speech/text multimodal continual pretraining

In the paper's words:

“continuously pretraining a text-pretrained language model”

Were text and audio data trained together from the very beginning?

No.

Rather than joint training from scratch, it is continual pretraining that adds speech capability on top of an already trained text LLM (LLaMA 2).

Actual training data composition

There are three kinds of training data.

Data	Amount
Text-only	307B text tokens
Speech-only	458K hours / 28.2B speech tokens
Speech+Text aligned	111K hours / 7B speech + 1.4B text

Data ratio during training

The batch sampling ratio is:

speech-only 33.3%
speech+text 33.3%
text-only 33.3%

so the three sources are weighted essentially equally.

Because the token counts differ so much across sources, the paper explains that it adjusts the sampling weights so that each modality is seen a roughly equal number of times.
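
A rough sketch of what equal sampling weights imply, using the token counts from the table above. It assumes every batch carries the same number of tokens regardless of source, and the 100B total budget is a made-up figure for the arithmetic only; neither comes from the paper.

```python
import random

# Token counts from the table above, in billions.
SOURCE_TOKENS_B = {
    "text-only": 307.0,
    "speech-only": 28.2,
    "speech+text": 7.0 + 1.4,
}
# Equal sampling weights, matching the 33.3% / 33.3% / 33.3% split.
WEIGHTS = {name: 1 / 3 for name in SOURCE_TOKENS_B}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next batch according to WEIGHTS."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]

def passes_over_source(budget_b: float) -> dict:
    """Approximate passes over each source for a total token budget (billions)."""
    return {n: budget_b * WEIGHTS[n] / SOURCE_TOKENS_B[n] for n in SOURCE_TOKENS_B}

rng = random.Random(0)
counts = {n: 0 for n in WEIGHTS}
for _ in range(30_000):
    counts[sample_source(rng)] += 1
passes = passes_over_source(budget_b=100.0)  # hypothetical budget

print(counts)
print(passes)
```

Under these assumptions the small aligned corpus is revisited several times while only a fraction of the text corpus is consumed, which is the imbalance that the sampling weights are meant to manage.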

Training method

There is a single training objective:

autoregressive next-token prediction

No separate encoder-decoder ASR objective or CTC objective is used.
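
Written out, the objective is the usual negative log-likelihood over the mixed token sequence. The notation is ours, not the paper's: $x_1, \dots, x_T$ is one training sequence whose tokens may be text tokens, speech tokens, or modality markers, and $\theta$ are the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

The same sum applies whether $x_t$ is a text token or a speech token, which is why no modality-specific loss term is needed.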

Interleaving method

The core idea is to randomly switch between text spans and speech spans at word boundaries.

Example:

[TEXT] the cat
[SPEECH] sat on
[TEXT] the mat
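
A minimal sketch of word-boundary interleaving, assuming each word already comes with the speech units that cover it. The per-word switch probability `p_switch` is a hypothetical knob chosen for illustration; how the paper actually samples span lengths is not covered here.

```python
import random
from typing import List, Tuple

# One aligned word: (text, HuBERT unit ids covering that word).
AlignedWord = Tuple[str, List[int]]

def interleave(words: List[AlignedWord], rng: random.Random,
               p_switch: float = 0.3) -> str:
    """Emit each word as text or speech, switching modality only between words."""
    modality = rng.choice(["text", "speech"])
    out, prev = [], None
    for text, units in words:
        # Possibly flip modality, but only at a word boundary.
        if prev is not None and rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        # Emit a marker whenever a new span starts.
        if modality != prev:
            out.append("[TEXT]" if modality == "text" else "[SPEECH]")
            prev = modality
        out.append(text if modality == "text"
                   else "".join(f"[Hu{u}]" for u in units))
    return " ".join(out)

words = [("the", [1, 2]), ("cat", [3]), ("sat", [4, 5]),
         ("on", [6]), ("the", [7]), ("mat", [8, 9])]
example = interleave(words, random.Random(0))
print(example)
```

Each call with a different random state yields a different text/speech split of the same sentence, so one aligned utterance can produce many distinct training sequences.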

Is there any post-training?

No conventional instruction tuning, RLHF, or chat tuning was done.

The paper:

describes the model as a foundational model
separately notes that safety instruction tuning is needed

That said:

continual pretraining itself is performed
some ablations train with ASR/TTS special tokens

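4. What form does the audio data take?

Both plain speech-only data and aligned speech-text pair data are used.

1) plain speech-only data

A speech-only corpus is used.

The model trains on speech token sequences alone, without transcriptions.

Example: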

[SPEECH][Hu12][Hu88]...

2) aligned speech-text pair data

ASR-style speech-text aligned data is also used.

The important point, however, is that the paper does not train on it as a standard supervised ASR task. It uses it as an aligned corpus for speech/text interleaving.

Alignment is carried down to the word level, using either:

alignments provided with the dataset
a forced alignment tool
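
Once word timestamps exist, they can be turned into per-word unit spans by converting seconds to frame indices. This is a sketch under assumptions: the `units_for_words` helper is hypothetical, the timestamps are made up, and the 25 Hz frame rate is a toy value rather than a figure taken from the text above.

```python
from typing import List, Tuple

def units_for_words(
    units: List[int],
    word_times: List[Tuple[str, float, float]],
    frame_rate_hz: float,
) -> List[Tuple[str, List[int]]]:
    """Slice a frame-level unit sequence into per-word spans using timestamps."""
    spans = []
    for word, start_s, end_s in word_times:
        lo = int(round(start_s * frame_rate_hz))  # first frame of the word
        hi = int(round(end_s * frame_rate_hz))    # one past the last frame
        spans.append((word, units[lo:hi]))
    return spans

units = [5, 5, 9, 9, 9, 2, 7, 7, 1, 1]                # toy unit ids, one per frame
word_times = [("the", 0.0, 0.2), ("cat", 0.2, 0.4)]   # made-up alignment
aligned = units_for_words(units, word_times, frame_rate_hz=25.0)
print(aligned)
```

The resulting (word, units) pairs are the kind of input a word-boundary interleaver needs.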

So the goal is less plain supervised ASR learning and more learning cross-modal token alignment.

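5. How is the model evaluated?

Is each training stage evaluated?

The main results are reported for the final model.

Ablations are run, however:

no interleaving
speech-only
text-only
random init
ASR+TTS-only

among others.

The paper also analyzes, in part, how performance changes over training steps.

But it does not have:

a stage-wise checkpoint evaluation scheme
formal evaluation at each of pretrain→midtrain→posttrain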

Are both text and audio evaluated?

Yes.

The paper evaluates all of:

text benchmarks
speech benchmarks
cross-modal benchmarks

Text evaluation

Representative benchmarks:

MMLU
BLIMP
WUGGY
StoryCloze

Speech evaluation

The spoken counterparts are used, including:

sBLIMP
sWUGGY
spoken StoryCloze

Cross-modal evaluation

In addition, it evaluates:

speech→text StoryCloze
text→speech StoryCloze
few-shot ASR
few-shot TTS
speech intent classification

among others.

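Expressivity evaluation

This is one of the paper's core contributions.

It proposes a new benchmark:

STSP (Speech-Text Sentiment Preservation)

The benchmark gives the model a speech or text prompt that carries a sentiment and checks whether the generated output preserves that sentiment.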
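
One way to score this kind of check is plain accuracy: label each generated output with a sentiment classifier and count how often the label matches the prompt's sentiment. The sketch below assumes that setup; `toy_classifier` is a keyword stand-in for a real classifier, and the example data is invented.

```python
from typing import Callable, List, Tuple

def sentiment_preservation_accuracy(
    pairs: List[Tuple[str, str]],
    classify: Callable[[str], str],
) -> float:
    """Fraction of (prompt_sentiment, generated_output) pairs whose predicted
    sentiment for the output equals the prompt's sentiment."""
    if not pairs:
        raise ValueError("no examples to score")
    hits = sum(1 for label, output in pairs if classify(output) == label)
    return hits / len(pairs)

def toy_classifier(text: str) -> str:
    """Stand-in for a real sentiment classifier; keyword rules only."""
    lowered = text.lower()
    if "great" in lowered:
        return "positive"
    if "awful" in lowered:
        return "negative"
    return "neutral"

pairs = [
    ("positive", "this is great"),   # preserved
    ("negative", "this is awful"),   # preserved
    ("positive", "it is awful"),     # not preserved
]
accuracy = sentiment_preservation_accuracy(pairs, toy_classifier)
print(accuracy)
```

For speech outputs, `classify` would wrap a speech sentiment classifier instead of a text one; the scoring logic stays the same.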

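6. What are the paper's motivation and contributions?

Motivation

The motivation is very clear.

The existing approach is the pipeline:

Speech
 → ASR
 → Text LLM
 → TTS

The concern is that this pipeline:

is good at semantic generation
but loses speech expressivity, prosody, and emotion

Existing SpeechLMs, in turn, rely on:

speech-only training
task-specific training

so the paper considers them short of text-LLM-level generalization and few-shot ability.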

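Key contributions

The contributions fall into six broad points.

1. speech+text unified autoregressive LM

Proposes an LLM that unifies text and speech in a single token stream.

2. interleaving training

Proposes a word-level speech/text interleaving training method.

The paper argues that this is the key to cross-modal alignment.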

3. pretrained text LLM capability transfer

Shows that LLaMA 2's

semantic ability
few-shot learning

can be transferred to the speech modality.

4. expressive speech modeling

Goes beyond HuBERT tokens alone by adding:

pitch
style

tokens to perform expressive speech generation.

5. cross-modal generation

All of the following are possible:

speech→text
text→speech
speech→speech
text→text

6. STSP benchmark

Newly proposes a speech/text sentiment preservation benchmark.

Reference

https://arxiv.org/pdf/2402.05755

Audio-012, Spirit LM: Interleaved Spoken and Written Language Model, TACL 2025