AudioLDM 2

Music

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of objectives and biases that can differ significantly from those of other types. To bring us closer to a unified perspective on audio generation, this paper proposes a framework that uses the same learning method for speech, music, and sound effect generation. The framework introduces a generic representation of audio, which can be treated as a new "language of audio" and used as a proxy for audio generation. Any conditional input is first translated into the "language of audio" with a language model, and audio is then synthesized by a latent diffusion model conditioned on that representation. The proposed framework naturally brings advantages such as a reusable self-supervised pretrained model and in-context learning abilities. Experiments on the major benchmarks of text-to-audio (TTA), text-to-music (TTM), and text-to-speech (TTS) demonstrate performance that is state-of-the-art or competitive with previous approaches.
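
The two-stage design described above can be illustrated with a toy sketch. This is not the AudioLDM 2 implementation: the function names (`translate_to_audio_language`, `synthesize`, `generate`) are hypothetical, a hash stands in for the language model, and a sine generator stands in for the latent diffusion model. It only shows how every conditional input passes through one shared intermediate representation before synthesis.

```python
import hashlib
import math
from typing import List

# Stand-in type for the intermediate "language of audio" representation.
AudioTokens = List[float]


def translate_to_audio_language(condition: str, length: int = 8) -> AudioTokens:
    """Stage 1 placeholder: map a conditional input to a fixed-length sequence.

    A hash replaces the language model here (assumption, for illustration only).
    """
    digest = hashlib.sha256(condition.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:length]]


def synthesize(tokens: AudioTokens, num_samples: int = 16) -> List[float]:
    """Stage 2 placeholder: turn the intermediate sequence into waveform samples.

    A sine generator replaces the latent diffusion model (assumption).
    """
    waveform = []
    for n in range(num_samples):
        token = tokens[n % len(tokens)]
        waveform.append(math.sin(2.0 * math.pi * token * n / num_samples))
    return waveform


def generate(condition: str) -> List[float]:
    """Chain both stages: condition -> "language of audio" -> audio samples."""
    return synthesize(translate_to_audio_language(condition))


if __name__ == "__main__":
    for prompt in ("a dog barking", "upbeat jazz piano"):
        print(prompt, [round(s, 2) for s in generate(prompt)[:4]])
```

Because the second stage only ever sees the intermediate sequence, the same synthesis step can in principle be reused for text-to-audio, text-to-music, and text-to-speech inputs, which is the reuse the abstract points to.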

Pricing Model

  • Open-Source

Tags

  • Audio
  • AI
  • Open Source

Website
