Audio Basics
https://sanghyu.tistory.com/category/Domain%20Knowledge/Speech
Keywords
- Waveform
- Sampling
- STFT(Short-Time Fourier Transform)
- Mel spectrogram
- MFCC
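The keywords above form a pipeline: a sampled waveform is cut into short windowed frames (STFT), the resulting power spectrum is pooled into mel bands, and a DCT of the log-mel energies gives MFCCs. The following is a minimal standard-library sketch of that chain, not a reference implementation. The sample rate, frame length, hop size, band counts, the 440 Hz test tone, and all function names are assumptions chosen for illustration; real projects would normally use an FFT-based library instead of the naive DFT used here.

```python
import cmath
import math

# Illustrative settings (assumptions, not values from the notes)
SAMPLE_RATE = 8000  # samples per second
FRAME_LEN = 256     # samples per STFT frame
HOP = 128           # samples between consecutive frame starts
N_MELS = 20         # number of mel bands
N_MFCC = 13         # number of cepstral coefficients kept


def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sampling: a waveform is a list of amplitudes taken sample_rate times per second."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def stft_power(signal, frame_len=FRAME_LEN, hop=HOP):
    """STFT: power spectrum of each Hann-windowed frame, using a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    # Precomputed complex exponentials exp(-2*pi*j*k/N); (k * i) % N indexes into them.
    twiddle = [cmath.exp(-2j * math.pi * k / frame_len) for k in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        x = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(x[i] * twiddle[(k * i) % frame_len] for i in range(frame_len))) ** 2
            for k in range(n_bins)
        ])
    return frames


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, frame_len=FRAME_LEN, sample_rate=SAMPLE_RATE):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def dct2(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


wave = sine_wave(440.0, 0.25)                 # waveform + sampling
power = stft_power(wave)                      # STFT power spectrogram
bank = mel_filterbank()
log_mel = [[math.log(sum(w * p for w, p in zip(row, frame)) + 1e-10) for row in bank]
           for frame in power]                # log-mel spectrogram
mfcc = [dct2(frame, N_MFCC) for frame in log_mel]  # MFCC

# The strongest bin of the first frame should sit near the 440 Hz tone.
peak_bin = max(range(len(power[0])), key=power[0].__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(power), "frames,", len(power[0]), "bins, peak near", peak_hz, "Hz")
```

Each stage reduces the representation: 256 samples per frame become 129 frequency bins, then 20 mel bands, then 13 cepstral coefficients. The frequency resolution is SAMPLE_RATE / FRAME_LEN, so the detected peak can only be located to the nearest bin.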
Papers
- Self-supervised speech representation learning
  - wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020
  - HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, 2021
- Neural codecs
  - SoundStream: An End-to-End Neural Audio Codec, 2021
  - EnCodec: High Fidelity Neural Audio Compression, 2022
- Large-scale speech recognition (ASR)
  - Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, 2022
- Representation alignment
  - CLAP: Contrastive Language-Audio Pretraining, 2022
  - SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing, ACL 2022
- Audio as Language
  - AudioLM: a Language Modeling Approach to Audio Generation, 2023
  - SoundStorm: Efficient Parallel Audio Generation, 2023
- LLM + Audio
  - VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, 2023
  - AudioPaLM: A Large Language Model That Can Speak and Listen, 2023
- Recent scalable TTS / LLM
  - CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens, 2024
  - CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models, 2024
BYOL-A
Data2vec