The answers below are based on the Step-Audio paper.

1. Was the model trained to handle text and audio together?

Yes. Step-Audio is a unified multimodal model that handles text and audio together. The paper describes it as a “speech-text multi-modal model”.

The model consists of three main components:

speech tokenizer
LLM
speech decoder

Model input/output

Input

Audio input
- converted to speech tokens
Text input
(Step-Omni as a whole also takes images)

Output

Text output
Audio output (waveform generation)

The paper states explicitly that the LLM models “text and speech tokens” together.

The actual real-time voice dialogue system, however, works as follows:

Audio input
Text output
then speech is generated by the TTS decoder

That is, it adopts the AQTA + TTS form.

In short:

internally, it is a unified speech-text LM
the actual inference pipeline partly relies on an audio→text→speech structure
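
The AQTA + TTS flow described above can be sketched as three chained stages. This is only an illustrative sketch: the class name `VoiceChatPipeline`, its field names, and the toy lambda stand-ins are all hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoiceChatPipeline:
    """AQTA + TTS: audio question -> text answer -> synthesized speech."""
    tokenize_audio: Callable[[List[float]], List[int]]  # waveform -> speech tokens
    answer_in_text: Callable[[List[int]], str]          # speech tokens -> text reply (LLM)
    synthesize: Callable[[str], List[float]]            # text reply -> waveform (TTS decoder)

    def respond(self, waveform: List[float]) -> List[float]:
        speech_tokens = self.tokenize_audio(waveform)
        reply_text = self.answer_in_text(speech_tokens)
        return self.synthesize(reply_text)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
pipeline = VoiceChatPipeline(
    tokenize_audio=lambda wav: [int(abs(x) * 10) for x in wav],
    answer_in_text=lambda toks: f"heard {len(toks)} tokens",
    synthesize=lambda text: [0.0] * len(text),
)
reply_wave = pipeline.respond([0.1, -0.2, 0.3])
```

The point of the structure is that the LLM only has to emit text at inference time, while speech generation is delegated to a separate decoder.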

2. Were both text and audio converted to discrete tokens and trained with next token prediction?

Yes.

The core of the paper is exactly this:

convert speech into discrete tokens
run autoregressive next-token prediction on them together with text tokens

In particular, two kinds of audio tokens are interleaved:

semantic token
linguistic token

Which tokenizers were used?

Text

The paper does not name a separate text tokenizer in detail.
It appears to reuse the tokenizer of the Step-1 text LLM it builds on, but this is not stated explicitly.

Audio tokenizers

(1) Linguistic tokenizer

Based on the Paraformer encoder.

Characteristics:

focused on phoneme/linguistic information
16.7 Hz token rate
codebook size = 1024

(2) Semantic tokenizer

Uses the CosyVoice tokenizer.

Characteristics:

semantic + acoustic information
25 Hz token rate
codebook size = 4096
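
The two token rates, 16.7 Hz and 25 Hz, stand in a 2:3 ratio, so one natural way to interleave the streams is 2 linguistic tokens followed by 3 semantic tokens per chunk. The sketch below assumes that grouping. The function name and the `semantic_offset` shift (used here only to keep the two codebooks from colliding in one ID space) are hypothetical choices for illustration.

```python
from typing import List


def interleave_dual_codebook(
    linguistic: List[int],
    semantic: List[int],
    ling_per_chunk: int = 2,      # 16.7 Hz stream
    sem_per_chunk: int = 3,       # 25 Hz stream
    semantic_offset: int = 1024,  # assumed: semantic IDs placed after the 1024 linguistic IDs
) -> List[int]:
    """Merge two token streams chunk by chunk, dropping any incomplete tail."""
    n_chunks = min(len(linguistic) // ling_per_chunk, len(semantic) // sem_per_chunk)
    merged: List[int] = []
    for i in range(n_chunks):
        ling = linguistic[i * ling_per_chunk:(i + 1) * ling_per_chunk]
        sem = semantic[i * sem_per_chunk:(i + 1) * sem_per_chunk]
        merged.extend(ling)
        merged.extend(code + semantic_offset for code in sem)
    return merged


merged = interleave_dual_codebook([5, 6, 7, 8], [0, 1, 2, 3, 4, 5])
# merged == [5, 6, 1024, 1025, 1026, 7, 8, 1027, 1028, 1029]
```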

Is it next token prediction?

Yes.

The paper repeatedly mentions:

next token prediction perplexity
LM loss
autoregressive continuation

So the training objective is basically an autoregressive LM objective.
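
As a reminder of what an autoregressive LM objective computes, here is a minimal pure-Python sketch of the next-token cross-entropy loss and the perplexity derived from it. It is a generic textbook formulation, not the paper's training code, and the toy logits are made up for the example.

```python
import math
from typing import List


def next_token_loss(logits: List[List[float]], tokens: List[int]) -> float:
    """Mean cross-entropy of predicting tokens[t + 1] from the logits at position t."""
    total = 0.0
    for t in range(len(tokens) - 1):
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[tokens[t + 1]]
    return total / (len(tokens) - 1)


# Uniform logits over a 4-token vocabulary: every next token has probability 1/4.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
tokens = [0, 2, 1]
loss = next_token_loss(logits, tokens)   # ln(4)
perplexity = math.exp(loss)              # 4.0
```

The same loss applies whether a position holds a text token or an audio token, which is what lets one model train on both modalities in a single sequence.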

3. Explain the training order

Overall flow

Phase 1

continual pretraining on top of an existing text LLM (Step-1)

Phase 2

audio-text multimodal pretraining

Phase 3

additional ASR/TTS training

Phase 4

task-specific SFT

Phase 5

RLHF/PPO

Was the text & audio data trained on top of a text backbone?

Yes.

The paper says so explicitly:

“based on a pretrained text model”
“audio continual pretraining based on Step-1”

That is:

a text-only pretrained LLM exists first
an audio token vocabulary is added to it
multimodal continual pretraining is then performed

Was it trained on text & audio data from the start?

No.

It is not multimodal training from scratch. Instead:

start from a pretrained text LLM
add audio tokens
run continual pretraining

Data composition per training stage

Stage 1

Tasks

extend the text vocabulary
add 5120 audio tokens
train on pure audio continuation
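
The 5120 added audio tokens equal the sum of the two codebook sizes given earlier in these notes (1024 + 4096). Assuming the new IDs are simply appended after the text vocabulary, the mapping could look like the sketch below. `TEXT_VOCAB_SIZE`, the ordering of the two codebooks, and the helper names are hypothetical, since the notes do not give them.

```python
TEXT_VOCAB_SIZE = 65_536        # hypothetical; the real text vocabulary size is not given
LINGUISTIC_CODEBOOK = 1024
SEMANTIC_CODEBOOK = 4096
NUM_AUDIO_TOKENS = LINGUISTIC_CODEBOOK + SEMANTIC_CODEBOOK  # 5120
EXTENDED_VOCAB_SIZE = TEXT_VOCAB_SIZE + NUM_AUDIO_TOKENS


def linguistic_id(code: int) -> int:
    """Vocabulary ID for a linguistic codebook entry (assumed to come first)."""
    if not 0 <= code < LINGUISTIC_CODEBOOK:
        raise ValueError(f"linguistic code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def semantic_id(code: int) -> int:
    """Vocabulary ID for a semantic codebook entry (assumed to come second)."""
    if not 0 <= code < SEMANTIC_CODEBOOK:
        raise ValueError(f"semantic code out of range: {code}")
    return TEXT_VOCAB_SIZE + LINGUISTIC_CODEBOOK + code
```

With this layout the embedding table and output head grow by 5120 rows, and the rows for the original text tokens keep their pretrained weights.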

Data ratio

audio:text:image = 2:1:1

Audio data

only pure audio continuation is used

The text backbone used in Stage 1 is Step-1 (130B), a model built by the Step-Audio authors themselves

Stage 2

Tasks

add audio-text interleaved data

Data composition

audio continuation
audio-text interleaved

ratio = 1:1

The overall modality ratio stays at:
audio:text:image = 2:1:1

Stage 3

Tasks

add ASR + TTS data

Detailed ratio

audio continuation
audio-text interleaved
ASR
TTS

= 1:1:1:1

Modality ratio

audio:text:image = 4:3:3
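
To see what the stage ratios imply for each data source, the sketch below turns them into sampling weights. It assumes that the audio sub-task ratio splits the audio share of the modality ratio, which the notes do not state explicitly. The dictionary layout and function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict

STAGES = {
    "stage1": {"modality": {"audio": 2, "text": 1, "image": 1},
               "audio_tasks": {"continuation": 1}},
    "stage2": {"modality": {"audio": 2, "text": 1, "image": 1},
               "audio_tasks": {"continuation": 1, "interleaved": 1}},
    "stage3": {"modality": {"audio": 4, "text": 3, "image": 3},
               "audio_tasks": {"continuation": 1, "interleaved": 1, "asr": 1, "tts": 1}},
}


def normalize(ratios: Dict[str, int]) -> Dict[str, Fraction]:
    total = sum(ratios.values())
    return {name: Fraction(value, total) for name, value in ratios.items()}


def sampling_weights(stage: str) -> Dict[str, Fraction]:
    """Per-source weights, assuming audio tasks share the audio portion."""
    cfg = STAGES[stage]
    modality = normalize(cfg["modality"])
    tasks = normalize(cfg["audio_tasks"])
    weights = {f"audio/{task}": modality["audio"] * w for task, w in tasks.items()}
    weights["text"] = modality["text"]
    weights["image"] = modality["image"]
    return weights


stage3 = sampling_weights("stage3")
# Under this assumption each audio task gets 4/10 * 1/4 = 1/10 of Stage 3.
```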

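One thing to remember: in the data used in Stages 2 and 3, audio continuation and audio-text interleaved samples share the same special tokens.

In other words, after {text tokens} the next span can be either {the paired audio tokens} or {the audio tokens that should come next}, and there seems to be no special token that distinguishes the two cases.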

Total amount of data

Audio

audio continuation: 1.1T audio tokens
TTS: 113B tokens
ASR: 105B tokens
audio-text alternating: 350B tokens

Text

800B text tokens

Image

800B image-text tokens
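
A rough back-of-the-envelope conversion helps put the audio token counts in perspective. The sketch assumes every second of audio yields 16.7 + 25 = 41.7 tokens (both tokenizers applied to the same audio). For TTS, ASR, and alternating data the counts presumably include text tokens too, so those hour figures are upper bounds at best.

```python
TOKENS_PER_SECOND = 16.7 + 25.0  # assumed: linguistic + semantic tokens per second

AUDIO_TOKENS = {
    "audio continuation": 1.1e12,
    "TTS": 113e9,
    "ASR": 105e9,
    "audio-text alternating": 350e9,
}


def tokens_to_hours(num_tokens: float) -> float:
    """Convert an audio token count to hours under the 41.7 tokens/s assumption."""
    return num_tokens / TOKENS_PER_SECOND / 3600


total_audio_tokens = sum(AUDIO_TOKENS.values())  # about 1.67T tokens

for name, count in AUDIO_TOKENS.items():
    print(f"{name}: ~{tokens_to_hours(count) / 1e6:.2f}M hours")
```

Under this assumption the 1.1T continuation tokens correspond to roughly 7 million hours of audio.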

Is there post-training as well?

Yes.

The paper describes post-training in a separate section.

It includes:

TTS SFT
AQTA SFT
reward model training
PPO RLHF

4. What form does the audio data take?

Several kinds are mixed together.

(1) Pure audio continuation

Plain audio continuation data.

That is:

predicting the next speech token
audio-only continuation

(2) Audio-text interleaved data

Data in which audio and text alternate.

Examples:

speech + transcript
dialogue
mixed modality sequence

These are the likely forms, but the detailed format is not given.

(3) ASR data

Clear speech-text pairs.

audio → transcript.

(4) TTS data

text → speech pair.

(5) AQTA data

Audio Question → Text Answer.

That is:

a spoken user question
a text response

Is there data that is plain audio only?

Yes.
The pure audio continuation data in Stage 1 plays that role.

Overall, though, the data mixes all of the following:

plain audio
speech-text pair
instruction data
dialogue data

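5. How is the model evaluated?

Is it evaluated at every training stage?

Partly.

The paper evaluates the following separately:

pretrain model
final chat model

For example, ASR performance is compared between:

Step-Audio Pretrain
Step-Audio Chat

So some intermediate-stage evaluation exists.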

Is only the final model evaluated?

No.

Each of the following is evaluated:

pretrain model
TTS model
chat model
reward model (in part)

Are both text and audio benchmarks evaluated?

Yes.

ASR benchmark

Aishell
Librispeech
Wenetspeech

TTS benchmark

SEED TTS

Dialogue benchmark

StepEval-Audio-360
Llama Question
TriviaQA
Web Questions
ComplexBench
HSK-6

Evaluation methods

Automatic evaluation

CER
WER
speaker similarity
GPT-4o judging
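
CER and WER are both edit-distance rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below is a generic implementation of that definition and not the paper's evaluation script. It omits text normalization (casing, punctuation), and stripping spaces for CER is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

CER is the usual choice for Chinese test sets such as Aishell and Wenetspeech, where word boundaries are not marked, while WER is used for English sets such as Librispeech.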

Human evaluation

MOS
instruction following
chat quality

6. What are the paper's motivation and contributions?

Motivation

The paper sees the following problems in existing open-source speech models:

1. Understanding and generation are separated

Because the pipeline is ASR → LLM → TTS:

latency increases
errors propagate
end-to-end optimization is hard

2. Lack of high-quality speech data

Especially for:

dialect
emotion
singing
RAP

where building the data is expensive.

3. Lack of controllability

Existing models are weak at controlling:

emotion
speaker style
dialect
speech rate

4. Lack of tool use

Using external knowledge in real time is difficult.

Key contributions

1. Unified speech-text model

Proposes a 130B unified multimodal speech-text model.

2. Dual-codebook tokenizer

Proposes a tokenizer that uses both:

linguistic token
semantic token

3. Synthetic speech data engine

Generates synthetic TTS data using an LLM and Step-Audio.

4. Fine-grained controllable speech

emotion
dialect
RAP
singing
speaking style

can all be controlled.

5. RLHF 기반 speech chat alignment

For AQTA, the paper applies:

SFT
reward model
PPO

6. 새로운 evaluation benchmark

Proposes StepEval-Audio-360.

Reference

https://arxiv.org/pdf/2502.11946

Audio-014, Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, Preprint 2025