Startup Showcase:

AudioLDM 2


Pricing Model: Open-Source

Tags: Audio, AI, Open Source


Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can differ significantly from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that uses the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, which can be treated as a new "language of audio" and used as a proxy for audio generation. Any conditional input is first translated into the "language of audio" by a language model; a latent diffusion model then synthesizes audio conditioned on that representation. The proposed framework naturally brings advantages such as a reusable self-supervised pretrained model and in-context learning abilities. Experiments on the major benchmarks of text-to-audio (TTA), text-to-music (TTM), and text-to-speech (TTS) demonstrate new state-of-the-art results or performance competitive with previous approaches.
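
The two-stage flow described above (conditional input, then "language of audio", then audio) can be pictured with a minimal structural sketch. Everything in it is an assumption made for illustration: the names TwoStagePipeline, toy_translate, and toy_synthesize are hypothetical, and the toy functions are trivial stand-ins rather than the language model or latent diffusion model used by AudioLDM 2.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration; not names from the AudioLDM 2 code.
AudioTokens = List[float]  # stand-in for the intermediate "language of audio"
Waveform = List[float]     # stand-in for generated audio samples


@dataclass
class TwoStagePipeline:
    """Condition -> "language of audio" -> audio, mirroring the two steps in the abstract."""
    translate: Callable[[str], AudioTokens]        # stage 1: language model (stubbed here)
    synthesize: Callable[[AudioTokens], Waveform]  # stage 2: latent diffusion (stubbed here)

    def generate(self, condition: str) -> Waveform:
        tokens = self.translate(condition)  # any conditional input becomes audio tokens
        return self.synthesize(tokens)      # synthesis only ever sees the audio tokens


def toy_translate(condition: str) -> AudioTokens:
    # Toy stand-in: one value per word (word length, capped at 10, scaled to [0, 1]).
    return [min(len(word), 10) / 10.0 for word in condition.split()]


def toy_synthesize(tokens: AudioTokens, samples_per_token: int = 4) -> Waveform:
    # Toy stand-in: hold each token value for a fixed number of samples.
    return [t for t in tokens for _ in range(samples_per_token)]


pipeline = TwoStagePipeline(translate=toy_translate, synthesize=toy_synthesize)
audio = pipeline.generate("a dog barking in the rain")
```

The point of the sketch is the interface: because the synthesis stage depends only on the intermediate representation, the same second stage could in principle serve text-to-audio, text-to-music, and text-to-speech by swapping only the first stage.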

