Paper history (3)
Reading papers on non-text modalities (audio, omni models)
To read
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, ICLR 2025
- Qwen2.5-Omni Technical Report
- Qwen3-Omni Technical Report
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
- Model Spec Midtraining: Improving How Alignment Training Generalizes
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models, NeurIPS 2025
1. Tokenizer
- ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model, Preprint 2024 [post]
- OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation, NeurIPS 2024 [post]
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations, NeurIPS 2025 [post]
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training, Preprint 2025 [post]
2. Multi / Omni Models
2.1 Omni Models
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, Preprint 2024 [post]
- Ola: Pushing the Frontiers of Omni-Modal Language Model, Preprint 2025 [post]
- Emu3: Next-Token Prediction is All You Need, Preprint 2024 [post]
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, Preprint 2024 [post]
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
- BAGEL: Emerging Properties in Unified Multimodal Pretraining, Preprint 2025 [post]
- Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models, Preprint 2025 [post]
- Emu3.5: Native Multimodal Models are World Learners, Preprint 2025 [post]
- LongCat-Flash-Omni Technical Report, Preprint 2025 [post]
2.2 Audio-Language Models
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020 [post]
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, TASLP 2021 [post]
- SoundStream: An End-to-End Neural Audio Codec, TASLP 2021 [post]
- EnCodec: High Fidelity Neural Audio Compression, TMLR 2023 [post]
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, OpenAI 2022 [post]
- CLAP: Contrastive Language-Audio Pretraining, Preprint 2022 [post]
- AudioPaLM: A Large Language Model That Can Speak and Listen, Preprint 2023 [post]
- Resurfacing Paralinguistic Awareness in Large Audio Language Models, Preprint 2026 [post]
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens, Preprint 2026 [post]
2.2.1 Discrete Token
- VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, Preprint 2023 [post]
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities, Findings of EMNLP 2023 [post]
- Spirit LM: Interleaved Spoken and Written Language Model, TACL 2025 [post]
- Moshi: a speech-text foundation model for real-time dialogue, Preprint 2024 [post]
- GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot, Preprint 2024 [post]
- Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, Preprint 2025 [post]
- Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling, Preprint 2025 [post]
2.2.2 Discrete Token + Continuous Feature
2.3 Vision-Language Models
- Chameleon: Mixed-Modal Early-Fusion Foundation Models, Preprint 2024