AudioLDM 2

Pricing Model: Open-Source

Tags: Audio, AI, Open Source

Description

Although speech, music, and sound effects share commonalities as audio, designing a generation model for each type requires careful consideration of objectives and biases that can differ significantly from those of the other types. To bring us closer to a unified perspective on audio generation, this paper proposes a framework that uses the same learning method for speech, music, and sound effect generation. Our framework introduces a generic representation of audio, which can be treated as a new "language of audio" and used as a proxy for audio generation. For any conditional input, we first translate it into the "language of audio" with a language model, then synthesize audio with a latent diffusion model conditioned on that representation. The proposed framework naturally brings advantages such as a reusable self-supervised pretrained model and in-context learning abilities. Experiments on the major benchmarks for text-to-audio (TTA), text-to-music (TTM), and text-to-speech (TTS) show performance that is either a new state of the art or competitive with previous approaches.
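The two-stage flow described above (conditional input, then "language of audio", then latent diffusion) can be pictured with a toy sketch. This is not the AudioLDM 2 implementation: every name here (`translate_to_loa`, `synthesize_latent`, `generate`, `LOA_LEN`) is hypothetical, a hash stands in for the language model, and a simple iterative refinement loop stands in for the latent diffusion model. It only illustrates how one shared intermediate representation lets the same second stage serve any kind of conditioning input.

```python
import hashlib
import random
from typing import List

# Hypothetical length of the intermediate "language of audio" sequence.
LOA_LEN = 8


def translate_to_loa(condition: str, length: int = LOA_LEN) -> List[float]:
    """Stage 1 stand-in: map a conditioning input to a fixed-length sequence.

    A real system would use a trained language model; this toy version
    hashes the text so that the output is deterministic and in [0, 1].
    """
    digest = hashlib.sha256(condition.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:length]]


def synthesize_latent(loa: List[float], steps: int = 50, seed: int = 0) -> List[float]:
    """Stage 2 stand-in: refine a noise latent step by step, guided by `loa`.

    A real latent diffusion model predicts and removes noise with a trained
    network; here each step simply moves the latent 20% closer to `loa`.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in loa]  # start from Gaussian noise
    for _ in range(steps):
        latent = [x + 0.2 * (target - x) for x, target in zip(latent, loa)]
    return latent


def generate(condition: str, steps: int = 50, seed: int = 0) -> List[float]:
    """Run both stages: condition -> "language of audio" -> latent."""
    loa = translate_to_loa(condition)
    return synthesize_latent(loa, steps=steps, seed=seed)


if __name__ == "__main__":
    for prompt in ("a dog barking", "soft piano music"):
        print(prompt, [round(v, 3) for v in generate(prompt)])
```

In this sketch, the second stage never sees the original prompt, only the intermediate sequence, which mirrors the description's point that the intermediate representation acts as a proxy for audio generation.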
