The answers below are compiled from the Kimi-Audio Technical Report.

1. Was the model really trained to handle text and audio together?

In short, yes.
Kimi-Audio is a unified audio foundation model that handles text and audio together.

The paper states that a single model covers the following tasks:

audio understanding
audio generation
speech conversation
audio-to-text chat

The architecture description is consistent with this:

Input: audio + text
Output: audio tokens + text tokens

Both are handled within the same model.

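What are the model's inputs and outputs?

Input

There are two kinds of input:

text tokens
audio representation
- discrete semantic audio tokens
- continuous acoustic features (Whisper features)

In other words, the raw waveform is not fed to the model directly. The audio is first converted into

semantic tokens
continuous features

and only then passed to the model.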
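The hybrid input can be pictured with a small toy example. The sketch below assumes the two audio streams are already aligned frame by frame and are merged by element-wise addition; the function name `embed_hybrid_input`, the dimensions, and the random embedding table are all hypothetical and only illustrate the idea.

```python
import random

def embed_hybrid_input(semantic_ids, acoustic_frames, embedding_table):
    """Combine discrete and continuous audio representations frame by frame.

    Assumption: both streams share the same frame rate and the two vectors
    are merged by element-wise addition.
    """
    if len(semantic_ids) != len(acoustic_frames):
        raise ValueError("semantic tokens and acoustic frames must be aligned")
    combined = []
    for token_id, frame in zip(semantic_ids, acoustic_frames):
        token_vec = embedding_table[token_id]
        combined.append([t + a for t, a in zip(token_vec, frame)])
    return combined

# Toy sizes, chosen only for illustration.
HIDDEN_DIM = 4
CODEBOOK_SIZE = 8
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)] for _ in range(CODEBOOK_SIZE)
]

semantic_ids = [3, 3, 5]  # discrete semantic tokens
acoustic_frames = [[0.1] * HIDDEN_DIM, [0.2] * HIDDEN_DIM, [0.3] * HIDDEN_DIM]
hybrid = embed_hybrid_input(semantic_ids, acoustic_frames, embedding_table)
```

Note that the first two frames share the same semantic token but still produce different input vectors, which is the point of keeping the continuous features alongside the discrete ones.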

Output

The output splits into two heads:

text head
- autoregressive text token generation
audio head
- discrete semantic audio token generation

A detokenizer then reconstructs a waveform from the generated audio tokens.

So the model is designed to support all of:

text→text
audio→text
text→audio
audio→audio

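2. Were both text and audio converted to discrete tokens and trained with next-token prediction?

Partly.

More precisely, the setup is a combination:

text: discrete tokens
audio output: discrete semantic tokens
audio input: discrete tokens + continuous features

So it is not the case that every input is represented by discrete tokens alone.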

Which tokenizer was used?

The key idea in the paper is a hybrid representation.

Audio tokenizer

The paper says it uses the tokenizer from GLM-4-Voice, which it describes as:

based on the Whisper encoder
with an added vector quantization (VQ) layer
producing semantic tokens at 12.5 Hz
using a single codebook

So the structure is:

Whisper encoder
→ VQ
→ discrete semantic token
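The VQ step can be illustrated with a nearest-neighbour lookup. This is a minimal sketch of generic single-codebook vector quantization, not the GLM-4-Voice implementation; the 2-D codebook and the frames are made-up values. The last lines work out what a 12.5 Hz token rate means in practice.

```python
def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]

# Hypothetical 2-D codebook with three entries (real codebooks are far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
encoder_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]
token_ids = quantize(encoder_frames, codebook)  # one token id per frame

TOKEN_RATE_HZ = 12.5
ms_per_token = 1000 / TOKEN_RATE_HZ              # each token covers 80 ms of audio
tokens_for_10_seconds = int(10 * TOKEN_RATE_HZ)  # 125 tokens for a 10 s clip
```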

Was next-token prediction used?

Text

Standard autoregressive next-token prediction.

Audio

Next-token prediction is also applied to the audio semantic tokens.

For example:

Audio-only task:
- predict the next audio semantic token
TTS:
- predict audio tokens conditioned on text
ASR:
- predict text tokens conditioned on audio
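One way to see how the three tasks differ is to look at which positions count as prediction targets. The sketch below is an illustration under the assumption that conditioning tokens are excluded from the loss; the report may handle this differently, and `build_example` and the token strings are hypothetical.

```python
def build_example(task, text_tokens, audio_tokens):
    """Return (sequence, loss_mask) for one toy training example.

    loss_mask[i] is 1 where the token at position i is a prediction target.
    Assumption: conditioning tokens are excluded from the loss.
    """
    if task == "audio_only":
        sequence = list(audio_tokens)
        mask = [1] * len(audio_tokens)
    elif task == "tts":  # text -> audio
        sequence = list(text_tokens) + list(audio_tokens)
        mask = [0] * len(text_tokens) + [1] * len(audio_tokens)
    elif task == "asr":  # audio -> text
        sequence = list(audio_tokens) + list(text_tokens)
        mask = [0] * len(audio_tokens) + [1] * len(text_tokens)
    else:
        raise ValueError(f"unknown task: {task}")
    return sequence, mask

text = ["t_hello", "t_world"]
audio = ["a_17", "a_4", "a_99"]
asr_sequence, asr_mask = build_example("asr", text, audio)
tts_sequence, tts_mask = build_example("tts", text, audio)
```

In all three cases the training objective is the same next-token prediction; only the ordering of modalities and the set of target positions change.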

3. Training order

The paper explains this part in considerable detail.

The overall order is:

initialization from a pretrained text LLM
multimodal continual pretraining
supervised fine-tuning
audio detokenizer training

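Did training start from a text backbone?

Yes.

The paper states that the model is

initialized from Qwen2.5 7B.

It also explains that the

shared transformer layers
text head

are initialized from the pretrained text LLM weights, whereas the

audio head is randomly initialized.

So the model was not trained jointly on audio and text from scratch.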
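The initialization scheme can be sketched with plain dictionaries standing in for weight tensors. The parameter names, the tiny sizes, and the 0.02 standard deviation are hypothetical choices for illustration, not values from the report.

```python
import random

def init_audio_llm(pretrained, shared_names, text_head_names, audio_head_shapes, seed=0):
    """Copy pretrained tensors where available; draw random values elsewhere."""
    rng = random.Random(seed)
    weights = {}
    for name in shared_names + text_head_names:
        weights[name] = list(pretrained[name])  # reuse text-LLM weights
    for name, size in audio_head_shapes.items():
        # Fresh random init; the std of 0.02 is an assumed value.
        weights[name] = [rng.gauss(0.0, 0.02) for _ in range(size)]
    return weights

# Hypothetical parameter names and tiny sizes, for illustration only.
pretrained_text_llm = {
    "shared.layer0": [0.5, -0.25],
    "text_head.layer0": [0.125, 0.75],
}
weights = init_audio_llm(
    pretrained_text_llm,
    shared_names=["shared.layer0"],
    text_head_names=["text_head.layer0"],
    audio_head_shapes={"audio_head.layer0": 2},
)
```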

Was the model trained on text+audio data from the very beginning?

No.

The order is:

Step 1

Start from an already trained text LLM (Qwen2.5 7B).

Step 2

Run continual pretraining on top of it, training on the following tasks together:

text-only
audio-only
audio-text mapping
audio-text interleaving

So rather than multimodal pretraining from scratch, the recipe is:

pretrained text LLM → multimodal continual pretraining

Pretraining task composition

According to Table 3 of the paper, the tasks fall into three categories.

1) unimodal

text only
audio only

2) audio-text mapping

ASR (audio→text)
TTS (text→audio)

3) audio-text interleaving

audio→semantic token
audio→text
audio→semantic+text

Data ratio and volume

Data volume

The paper reports using:

13 million hours of audio data
585B audio tokens
585B text tokens
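As a rough consistency check, the audio-token count lines up with the audio volume if one assumes every hour of audio is tokenized at the tokenizer's 12.5 Hz rate. This is back-of-the-envelope arithmetic, not a calculation presented in the report.

```python
AUDIO_HOURS = 13_000_000  # audio volume quoted above
TOKEN_RATE_HZ = 12.5      # semantic tokens per second of audio
SECONDS_PER_HOUR = 3600

estimated_audio_tokens = AUDIO_HOURS * SECONDS_PER_HOUR * TOKEN_RATE_HZ
print(f"{estimated_audio_tokens / 1e9:.0f}B audio tokens")  # prints "585B audio tokens"
```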

Task mixing ratio

Table 3 lists a weight for each task:

Task	Weight
Text only	7
Audio only	1
Audio→Text (ASR)	1
Text→Audio (TTS)	1
Audio→Semantic (interleaving)	1
Audio→Text (interleaving)	1
Audio→Semantic+Text (interleaving)	2

Text-only data therefore carries the largest weight.
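The weights sum to 14, so text-only data accounts for half of the mixture. The sketch below turns the weights into probabilities and draws tasks from them. It assumes the weights act as per-example sampling weights, which is one plausible reading; the dictionary keys are hypothetical identifiers.

```python
import random

TASK_WEIGHTS = {
    "text_only": 7,
    "audio_only": 1,
    "audio_to_text": 1,
    "text_to_audio": 1,
    "interleave_audio_to_semantic": 1,
    "interleave_audio_to_text": 1,
    "interleave_audio_to_semantic_text": 2,
}

total_weight = sum(TASK_WEIGHTS.values())
task_probs = {task: w / total_weight for task, w in TASK_WEIGHTS.items()}

def sample_tasks(n, seed=0):
    """Draw n task names with probability proportional to their weight."""
    rng = random.Random(seed)
    tasks = list(TASK_WEIGHTS)
    return rng.choices(tasks, weights=[TASK_WEIGHTS[t] for t in tasks], k=n)

batch_tasks = sample_tasks(8)
```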

Is there post-training?

Yes.

The paper describes:

supervised fine-tuning (SFT)
detokenizer fine-tuning

No alignment stage such as RLHF or DPO is described.

The paper only notes that reinforcement learning was used for Kimi-TTS.

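4. What form does the audio data take?

Both raw audio and audio-text pairs are used.

Is there plain audio?

Yes.

The paper states that most of the data is

raw audio only.

Examples include:

audiobook
podcast
interview
music
environmental sound
vocalization

So a large share is pure audio with no transcription.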

Are there audio-text pairs as well?

Yes.

One of the key points of the paper is that an automatic annotation pipeline is applied to the raw audio to generate

transcription
speaker
segmentation

and similar labels.

That is:

raw audio
→ diarization
→ ASR transcription
→ audio-text pair
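The flow from raw audio to audio-text pairs can be sketched as a small pipeline. Both `diarize` and `transcribe` are hypothetical stubs that return fixed values instead of running real models; only the order of the stages reflects the description above.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str = ""

def diarize(audio_id):
    """Stub diarizer: returns fixed speaker turns instead of running a model."""
    return [Segment("spk0", 0.0, 2.5), Segment("spk1", 2.5, 6.0)]

def transcribe(audio_id, segment):
    """Stub ASR: returns placeholder text instead of running a model."""
    return f"<transcript of {audio_id} {segment.start:.1f}-{segment.end:.1f}s>"

def annotate(audio_id):
    """raw audio -> diarization -> ASR transcription -> audio-text pairs."""
    pairs = []
    for segment in diarize(audio_id):
        segment.text = transcribe(audio_id, segment)
        pairs.append(segment)
    return pairs

pairs = annotate("clip_0001")
```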

Are the pairs ASR-based?

Yes.

Specifically, the paper says the transcriptions were generated with

Whisper-large-v3
Paraformer-Zh
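With two ASR models, a natural setup is to route each segment by language. The sketch below assumes English goes to Whisper-large-v3 and Mandarin to Paraformer-Zh, and that segments in other languages are dropped; the mapping and the drop rule are assumptions here, and the language codes and segment ids are made up.

```python
# Assumed language-to-model mapping; only the model names come from the text above.
ASR_MODEL_BY_LANGUAGE = {
    "en": "Whisper-large-v3",
    "zh": "Paraformer-Zh",
}

def pick_asr_model(language):
    """Return the ASR model name for a segment, or None if unsupported."""
    return ASR_MODEL_BY_LANGUAGE.get(language)

segments = [("seg_a", "en"), ("seg_b", "zh"), ("seg_c", "fr")]
routed = {seg_id: pick_asr_model(lang) for seg_id, lang in segments}
# Keep only segments that have a matching ASR model.
kept = {seg_id: model for seg_id, model in routed.items() if model is not None}
```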

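5. How is the model evaluated?

Is each training stage evaluated?

The paper contains almost no

intermediate evaluation during pretraining
stage-wise ablation evaluation

It focuses mainly on

evaluation of the final model
benchmark comparison

So benchmark results for each training stage are not provided; that information is simply not in the paper.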

Which benchmarks are used?

Both text and audio are evaluated.

ASR evaluation

LibriSpeech
AISHELL
WenetSpeech

and others.

metric: word error rate (WER)
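ASR benchmarks of this kind are commonly scored with word error rate (WER), the word-level edit distance divided by the reference length. The sketch below is a minimal reference implementation for whitespace-tokenized text; Chinese test sets are usually scored at the character level, which this sketch does not cover, and real toolkits also apply text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```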

Audio understanding evaluation

MMAU
MELD
VocalSound
TUT2017

and others.

Audio-to-text chat evaluation

OpenAudioBench
VoiceBench

Speech conversation evaluation

Human evaluation is used.

Evaluation dimensions include:

emotion control
empathy
accent
style control
speed control

An evaluation toolkit was also built

One of the paper's contributions is the release of a

standardized evaluation toolkit.

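6. Motivation and contributions of the paper

Motivation

The paper summarizes the limitations of existing audio LLMs as follows.

Existing models:

handle only a specific task
- ASR only
- audio understanding only
- speech generation only
lack audio pretraining
are closed-source

The goal is therefore to build a

universal audio foundation model.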

Key contributions

According to the paper, the key contributions are the following.

1) Universal audio model

A single model supports all of:

ASR
audio understanding
audio generation
speech conversation

2) Hybrid audio representation

discrete semantic tokens
continuous acoustic features

are used together.

This is one of the paper's central architectural contributions.

3) Large-scale audio pretraining

13M hours of audio are used.

The paper emphasizes this point heavily.

4) audio-text interleaving pretraining

Beyond plain ASR/TTS, the paper designs

interleaved multimodal pretraining.

5) streaming speech conversation

Supports real-time, chunk-wise generation.
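Chunk-wise generation can be pictured as cutting the audio token stream into fixed-size pieces and handing each piece to the detokenizer as soon as it is ready. The sketch below is a generic illustration: the chunk size of 12 is hypothetical, and `detokenize_chunk` is a stub that only reports how much audio a chunk would cover at 12.5 tokens per second.

```python
def iter_chunks(tokens, chunk_size):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

def detokenize_chunk(chunk):
    """Stub detokenizer: returns the seconds of audio the chunk would cover."""
    token_rate_hz = 12.5
    return len(chunk) / token_rate_hz

audio_tokens = list(range(30))
CHUNK_SIZE = 12  # hypothetical; not a value taken from the report
chunk_durations = [detokenize_chunk(c) for c in iter_chunks(audio_tokens, CHUNK_SIZE)]
```

With these numbers the listener can start hearing audio after the first 12 tokens rather than waiting for all 30.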

6) Release of an evaluation toolkit

Provides an evaluation framework for comparing audio foundation models.

One-line summary

Kimi-Audio can be understood as

a unified audio foundation model that
builds on a pretrained text LLM (Qwen2.5),
uses discrete semantic audio tokens + continuous Whisper features,
and is trained with massive audio-text continual pretraining and interleaving training.

Reference

https://arxiv.org/pdf/2504.18425

Audio-017, Kimi-Audio Technical Report, Preprint 2025
