Paper history (3)

Reading papers on non-text modalities (audio, omni-models)

To read

  • MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
  • Emu3.5: Native Multimodal Models are World Learners, Preprint 2025
  • WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, ICLR 2025

1. Audio

  • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020 [post]
  • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, TASLP 2021 [post]
  • SoundStream: An End-to-End Neural Audio Codec, TASLP 2021 [post]
  • EnCodec: High Fidelity Neural Audio Compression, TMLR 2023 [post]
  • Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, OpenAI 2022 [post]
  • CLAP: Contrastive Language-Audio Pretraining, Preprint 2022 [post]

2. Multi / Omni Models

  • Scaling Laws for Native Multimodal Models, ICCV 2025 [post]

2.1 Architecture

  • Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution, Preprint 2024 [post]
  • Ola: Pushing the Frontiers of Omni-Modal Language Model, Preprint 2025 [post]
  • Emu3: Next-Token Prediction is All You Need, Preprint 2024 [post]
  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, Preprint 2024 [post]
  • MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer, Preprint 2025
  • BAGEL: Emerging Properties in Unified Multimodal Pretraining, Preprint 2025 [post]
  • Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models, Preprint 2025 [post]

2.2 Tokenizer

  • OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation, NeurIPS 2024 [post]
  • Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations, NeurIPS 2025 [post]
