This repository provides inference code and open weights for the sound effect generative models developed at Sony AI. The current public release includes models addressing the text-to-audio (T2A) and video-to-audio (V2A) tasks:
Audio encoder/decoder (Woosh-AE): High-quality latent encoder/decoder providing latents for generative modeling and decoding audio from generated latents.
Text conditioning (Woosh-CLAP): Multimodal text-audio alignment model providing token latents for diffusion model conditioning.
T2A Generation (Woosh-Flow and Woosh-DFlow): Original and distilled LDMs generating audio unconditionally or from a given text prompt.
V2A Generation (Woosh-VFlow): Multimodal LDM generating audio from a video sequence with optional text prompts.
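The components above compose into a pipeline: the text encoder produces conditioning tokens, the LDM generates audio latents from them, and the autoencoder decodes the latents to a waveform. The sketch below illustrates that shape only; every class and method name in it is a hypothetical placeholder, not the repository's actual API (see the test_Woosh-*.py scripts for real usage).

```python
# Illustrative sketch of how the Woosh components compose for T2A generation.
# All class and method names are hypothetical stand-ins, not the real API.
import random

class WooshCLAP:
    """Stand-in for the text encoder: prompt -> conditioning tokens."""
    def encode_text(self, prompt: str) -> list[float]:
        rng = random.Random(len(prompt))          # toy determinism
        return [rng.random() for _ in range(8)]

class WooshFlow:
    """Stand-in for the latent diffusion model: tokens -> audio latents."""
    def generate(self, cond: list[float], steps: int = 4) -> list[float]:
        latent = [0.0] * len(cond)
        for _ in range(steps):                    # toy "denoising" loop
            latent = [l + c / steps for l, c in zip(latent, cond)]
        return latent

class WooshAE:
    """Stand-in for the autoencoder: latents -> waveform samples."""
    def decode(self, latent: list[float]) -> list[float]:
        return [2 * x - 1 for x in latent]        # map into [-1, 1]

def text_to_audio(prompt: str) -> list[float]:
    cond = WooshCLAP().encode_text(prompt)        # 1. text conditioning
    latent = WooshFlow().generate(cond)           # 2. latent generation
    return WooshAE().decode(latent)               # 3. decode to audio

audio = text_to_audio("glass shattering on a stone floor")
print(len(audio))  # one toy "sample" per latent dimension
```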
Start by installing uv:

```shell
pip install uv
```

and then set up the Woosh environment with either CPU support,

```shell
uv sync --extra cpu
```

or CUDA support:

```shell
uv sync --extra cuda
```
Open model weights are available for all Woosh models trained on public datasets. You can download and unzip the pretrained weights from the releases page, or alternatively with the GitHub CLI:

```shell
gh release download v1.0.0
unzip '*.zip'
```
The checkpoints should be located in folders named checkpoints/MODEL_NAME, each containing config and weight files.
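A quick way to confirm the layout after unzipping is to list each MODEL_NAME folder and its contents. The helper below is a sketch; the file names created in the demo (config.yaml, weights.ckpt) are illustrative assumptions, so adapt them to the actual contents of the release archives.

```python
# Sketch of a helper that checks the expected checkpoints/MODEL_NAME layout.
# The demo file names are hypothetical; replace with the real archive contents.
import tempfile
from pathlib import Path

def find_checkpoints(root: str = "checkpoints") -> dict[str, list[str]]:
    """Map each MODEL_NAME folder under root to the files it contains."""
    base = Path(root)
    if not base.is_dir():
        return {}
    return {d.name: sorted(p.name for p in d.iterdir() if p.is_file())
            for d in sorted(base.iterdir()) if d.is_dir()}

# Demo against a mock layout; point root at the real unzipped checkpoints/.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "checkpoints" / "Woosh-AE"
    model_dir.mkdir(parents=True)
    (model_dir / "config.yaml").touch()    # hypothetical file names
    (model_dir / "weights.ckpt").touch()
    layout = find_checkpoints(str(Path(tmp) / "checkpoints"))

print(layout)  # {'Woosh-AE': ['config.yaml', 'weights.ckpt']}
```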
We provide audio samples to be used as inputs to our test_Woosh-*.py test scripts. You can download and unzip the file samples.zip from the releases page, or alternatively with the GitHub CLI:

```shell
gh release download v1.0.0 -p 'samples.zip'
unzip samples.zip
```
An inference test script is provided for every model. Just run any of the following:

```shell
uv run test_Woosh-AE.py
uv run test_Woosh-Flow.py
uv run test_Woosh-DFlow.py
uv run test_Woosh-VFlow.py
uv run test_Woosh-DVFlow.py
uv run test_Woosh-CLAP.py
```

and the generated output will be written to outputs/ as .wav audio or .mp4 video files.
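A standard-library helper can sanity-check generated .wav files (channel count, sample rate, duration). The snippet writes a short silent clip into outputs/ first so it is self-contained; point wav_info at real generated files instead.

```python
# Inspect a .wav file using only the standard library's wave module.
# The demo clip written below is a placeholder for a real generated output.
import wave
from pathlib import Path

def wav_info(path: str) -> dict:
    with wave.open(path, "rb") as w:
        return {"channels": w.getnchannels(),
                "sample_rate": w.getframerate(),
                "duration_s": w.getnframes() / w.getframerate()}

out = Path("outputs")
out.mkdir(exist_ok=True)
with wave.open(str(out / "demo.wav"), "wb") as w:
    w.setnchannels(1)                         # mono
    w.setsampwidth(2)                         # 16-bit PCM
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)        # one second of silence

info = wav_info(str(out / "demo.wav"))
print(info)  # {'channels': 1, 'sample_rate': 44100, 'duration_s': 1.0}
```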
Check our tech report on arxiv.org for a description of all models.
Two basic Gradio demos, for the Woosh-Flow and Woosh-DFlow models, are available. To launch a Gradio demo locally, run one of the following:

```shell
uv run gradio_Woosh-Flow.py
uv run gradio_Woosh-DFlow.py
```

Open a web browser on the same machine and access the demo at http://127.0.0.1:7860.
Woosh models can be served via our API server. Check the API folder for usage details.
For details about model architecture, training and evaluation, please check our tech report available on arxiv.org.
```bibtex
@misc{hadjeres2026,
  title={Woosh: A Sound Effects Foundation Model},
  author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
  year={2026},
  eprint={2604.01929},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.01929},
}
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
The Woosh-VFlow and Woosh-DVFlow models use adapted code from MM-AUDIO and MotionFormer. The code for these models is made available under Apache 2.0 license terms.