This repository provides inference code and open weights for the sound effect generative models developed at Sony AI. The current public release includes models addressing the text-to-audio (T2A) and video-to-audio (V2A) tasks:
Audio encoder/decoder (Woosh-AE): High-quality latent encoder/decoder providing latents for generative modeling and decoding audio from generated latents.
Text conditioning (Woosh-CLAP): Multimodal text-audio alignment model providing token latents for diffusion model conditioning.
T2A Generation (Woosh-Flow and Woosh-DFlow): Original and distilled LDMs generating audio unconditionally or from a given text prompt.
V2A Generation (Woosh-VFlow): Multimodal LDM generating audio from a video sequence with optional text prompts.
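The components above compose into a pipeline: the text encoder produces conditioning tokens, the LDM generates audio latents from them, and the autoencoder decodes the latents to a waveform. The sketch below illustrates that shape only; every class and method name in it is a hypothetical placeholder, not the repository's actual API (see the test_Woosh-*.py scripts for real usage).

```python
# Illustrative sketch of how the Woosh components compose for T2A generation.
# All class and method names are hypothetical stand-ins, not the real API.
import random

class WooshCLAP:
    """Stand-in for the text encoder: prompt -> conditioning tokens."""
    def encode_text(self, prompt: str) -> list[float]:
        rng = random.Random(len(prompt))          # toy determinism
        return [rng.random() for _ in range(8)]

class WooshFlow:
    """Stand-in for the latent diffusion model: tokens -> audio latents."""
    def generate(self, cond: list[float], steps: int = 4) -> list[float]:
        latent = [0.0] * len(cond)
        for _ in range(steps):                    # toy "denoising" loop
            latent = [l + c / steps for l, c in zip(latent, cond)]
        return latent

class WooshAE:
    """Stand-in for the autoencoder: latents -> waveform samples."""
    def decode(self, latent: list[float]) -> list[float]:
        return [2 * x - 1 for x in latent]        # map into [-1, 1]

def text_to_audio(prompt: str) -> list[float]:
    cond = WooshCLAP().encode_text(prompt)        # 1. text conditioning
    latent = WooshFlow().generate(cond)           # 2. latent generation
    return WooshAE().decode(latent)               # 3. decode to audio

audio = text_to_audio("glass shattering on a stone floor")
print(len(audio))  # one toy "sample" per latent dimension
```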
Start by installing uv:

```shell
pip install uv
```

and then set up the Woosh environment with either CPU support,

```shell
uv sync --extra cpu
```

or CUDA support:

```shell
uv sync --extra cuda
```
Open model weights are available for all Woosh models trained on public datasets. You can download and unzip the pretrained weights from the releases page, or alternatively with the GitHub CLI:

```shell
gh release download v1.0.0
unzip '*.zip'
```
The checkpoints should be located in folders named checkpoints/MODEL_NAME, each containing config and weight files.
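A quick way to confirm the layout after unzipping is to list each MODEL_NAME folder and its contents. The helper below is a sketch; the file names created in the demo (config.yaml, weights.ckpt) are illustrative assumptions, so adapt them to the actual contents of the release archives.

```python
# Sketch of a helper that checks the expected checkpoints/MODEL_NAME layout.
# The demo file names are hypothetical; replace with the real archive contents.
import tempfile
from pathlib import Path

def find_checkpoints(root: str = "checkpoints") -> dict[str, list[str]]:
    """Map each MODEL_NAME folder under root to the files it contains."""
    base = Path(root)
    if not base.is_dir():
        return {}
    return {d.name: sorted(p.name for p in d.iterdir() if p.is_file())
            for d in sorted(base.iterdir()) if d.is_dir()}

# Demo against a mock layout; point root at the real unzipped checkpoints/.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "checkpoints" / "Woosh-AE"
    model_dir.mkdir(parents=True)
    (model_dir / "config.yaml").touch()    # hypothetical file names
    (model_dir / "weights.ckpt").touch()
    layout = find_checkpoints(str(Path(tmp) / "checkpoints"))

print(layout)  # {'Woosh-AE': ['config.yaml', 'weights.ckpt']}
```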
We provide audio samples to be used as inputs to our test_Woosh-*.py test scripts. You can download and unzip the file samples.zip from the releases page, or alternatively with the GitHub CLI:

```shell
gh release download v1.0.0 -p 'samples.zip'
unzip samples.zip
```
An inference test script is provided for every model. Just run any of the following:

```shell
uv run test_Woosh-AE.py
uv run test_Woosh-Flow.py
uv run test_Woosh-DFlow.py
uv run test_Woosh-VFlow.py
uv run test_Woosh-DVFlow.py
uv run test_Woosh-CLAP.py
```

and the generated output will be written to outputs/ as .wav audio or .mp4 video files.
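A standard-library helper can sanity-check generated .wav files (channel count, sample rate, duration). The snippet writes a short silent clip into outputs/ first so it is self-contained; point wav_info at real generated files instead.

```python
# Inspect a .wav file using only the standard library's wave module.
# The demo clip written below is a placeholder for a real generated output.
import wave
from pathlib import Path

def wav_info(path: str) -> dict:
    with wave.open(path, "rb") as w:
        return {"channels": w.getnchannels(),
                "sample_rate": w.getframerate(),
                "duration_s": w.getnframes() / w.getframerate()}

out = Path("outputs")
out.mkdir(exist_ok=True)
with wave.open(str(out / "demo.wav"), "wb") as w:
    w.setnchannels(1)                         # mono
    w.setsampwidth(2)                         # 16-bit PCM
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)        # one second of silence

info = wav_info(str(out / "demo.wav"))
print(info)  # {'channels': 1, 'sample_rate': 44100, 'duration_s': 1.0}
```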
Check our tech report on arxiv.org for a description of all models.
Two basic Gradio demos, for the Woosh-Flow and Woosh-DFlow models, are available. To launch a Gradio demo locally, run one of the following:

```shell
uv run gradio_Woosh-Flow.py
uv run gradio_Woosh-DFlow.py
```

Open a web browser on the same machine and access the demo at http://127.0.0.1:7860.
Woosh models can be served via our API server. Check the API folder for usage details.
For details about model architecture, training and evaluation, please check our tech report available on arxiv.org.
```bibtex
@misc{hadjeres2026,
  title={Woosh: A Sound Effects Foundation Model},
  author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
  year={2026},
  eprint={2604.01929},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.01929},
}
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
The Woosh-VFlow and Woosh-DVFlow models use adapted code from MM-AUDIO and MotionFormer. The code for these models is made available under Apache 2.0 license terms.