Paper history (3)
Reading papers on non-text modalities (audio, omni models)
To read
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, ICLR 2025
- Qwen2.5-Omni Technical Report
- Qwen3-Omni Technical Report
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
- Model Spec Midtraining: Improving How Alignment Training Generalizes
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models, NeurIPS 2025
1. Tokenizer
- ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model, Preprint 2024 [post]
- OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation, NeurIPS 2024 [post]
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations, NeurIPS 2025 [post]
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training, Preprint 2025 [post]
2. Multi / Omni Models
2.1 Omni Models
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, Preprint 2024 [post]
- Ola: Pushing the Frontiers of Omni-Modal Language Model, Preprint 2025 [post]
- Emu3: Next-Token Prediction is All You Need, Preprint 2024 [post]
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, Preprint 2024 [post]
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
- BAGEL: Emerging Properties in Unified Multimodal Pretraining, Preprint 2025 [post]
- Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models, Preprint 2025 [post]
- Emu3.5: Native Multimodal Models are World Learners, Preprint 2025 [post]
- LongCat-Flash-Omni Technical Report, Preprint 2025 [post]
2.2 Audio-Language Models
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020 [post]
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, TASLP 2021 [post]
- SoundStream: An End-to-End Neural Audio Codec, TASLP 2021 [post]
- EnCodec: High Fidelity Neural Audio Compression, TMLR 2023 [post]
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, OpenAI 2022 [post]
- CLAP: Contrastive Language-Audio Pretraining, Preprint 2022 [post]
- AudioPaLM: A Large Language Model That Can Speak and Listen, Preprint 2023 [post]
- Resurfacing Paralinguistic Awareness in Large Audio Language Models, Preprint 2026 [post]
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens, Preprint 2026 [post]
2.2.1 Discrete Token
- VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, Preprint 2023 [post]
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities, Findings of EMNLP 2023 [post]
- Spirit LM: Interleaved Spoken and Written Language Model, TACL 2025 [post]
- Moshi: a speech-text foundation model for real-time dialogue, Preprint 2024 [post]
- GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot, Preprint 2024 [post]
- Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, Preprint 2025 [post]
- Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling, Preprint 2025 [post]
2.2.2 Discrete Token + Continuous Feature
2.3 Vision-Language Models
- Chameleon: Mixed-Modal Early-Fusion Foundation Models, Preprint 2024