Audio Basics

https://sanghyu.tistory.com/category/Domain%20Knowledge/Speech

Keywords

  • Waveform
  • Sampling
  • STFT (Short-Time Fourier Transform)
  • Mel spectrogram
  • MFCC
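
The keywords above form a common feature-extraction chain: a sampled waveform is framed and transformed with an STFT, the power spectrum is pooled through a mel filterbank, and a DCT of the log-mel energies gives MFCCs. The following is a minimal NumPy-only sketch of that chain, not a reference implementation. The helper names, the 16 kHz sample rate, the 25 ms window / 10 ms hop (`n_fft=400`, `hop=160`), the 40 mel bands, the 13 coefficients, and the HTK-style mel formula are all illustrative assumptions; libraries such as librosa or torchaudio use their own defaults and normalizations.

```python
import numpy as np

def make_waveform(freq_hz=440.0, sr=16000, duration_s=1.0):
    """Sample a sine wave: x[n] = sin(2*pi*f*n/sr)."""
    n = np.arange(int(sr * duration_s))
    return np.sin(2 * np.pi * freq_hz * n / sr)

def stft_power(x, n_fft=400, hop=160):
    """Power spectrogram of Hann-windowed frames, shape (frames, n_fft//2 + 1)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    # Row t holds the sample indices of frame t.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=400, n_mels=40):
    """Triangular filters evenly spaced in mel, shape (n_mels, n_fft//2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct2(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

sr = 16000
wave = make_waveform(sr=sr)                                  # waveform / sampling
power = stft_power(wave)                                     # STFT power spectrum
log_mel = np.log(power @ mel_filterbank(sr=sr).T + 1e-10)    # log-mel spectrogram
mfcc = dct2(log_mel, n_out=13)                               # MFCC
```

With these settings each STFT bin is 40 Hz wide (16000 / 400), so the 440 Hz test tone falls on bin 11. Checking where the spectrum peaks is a quick sanity test before moving on to the mel and MFCC stages.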

Papers

  • Self-supervised speech representation learning
    • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020
    • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, 2021
  • Neural codec family
    • SoundStream: An End-to-End Neural Audio Codec, 2021
    • EnCodec: High Fidelity Neural Audio Compression, 2022
  • Large-scale speech recognition / ASR
    • Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, 2022
  • Representation alignment
    • CLAP: Contrastive Language-Audio Pretraining, 2022
    • SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing, ACL 2022
  • Audio as Language
    • AudioLM: a Language Modeling Approach to Audio Generation, 2023
    • SoundStorm: Efficient Parallel Audio Generation, 2023
  • LLM + Audio
    • VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, 2023
    • AudioPaLM: A Large Language Model That Can Speak and Listen, 2023
  • Recent scalable TTS / LLM
    • CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens, 2024
    • CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models, 2024
  • BYOL-A
  • Data2vec

















