Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack

Foley-Omni supports text-to-audio (TTA), text-to-music (TTM), text-to-speech (TTS), video-to-audio (V2A), and visual text-to-speech (VisualTTS) generation. It can also generate a complete video soundtrack in a single pass, jointly producing video-synchronized sound effects, speech, and music.

Complete Video Soundtrack

All videos below are re-dubbed by Foley-Omni.

Foley-Omni on VEO3 Videos

[AUDIO] A soft female voice with a crisp soda-can open, bright fizz, pouring sounds, and light background music.
[WORD] This drink sounds really fresh when it opens.
[MUSIC] Light piano and acoustic guitar background music.

[AUDIO] Mechanical keyboard typing, followed by a clear female voice, with soft background music.
[WORD] Hi, and welcome to the channel.
[MUSIC] Soft piano and ambient lifestyle background music.

[AUDIO] A clear elderly female voice speaks over sharp knife chops on a wooden cutting board, followed by louder sizzling as chopped green onions are dropped into a frying pan.
[WORD] I'll add this now.
[MUSIC] none

[AUDIO] A clear, neutral English-speaking voice with a car passing on a quiet urban street, soft tire rolling, and subtle daytime ambience.
[WORD] That car came by faster than I expected.
[MUSIC] none

V2ST-Bench

[AUDIO] A clear instructional female voice with rhythmic spoon-on-pot stirring sounds.
[WORD] ...and a yellow are totally combined. Okay. And then I'm gonna...
[MUSIC] none

[AUDIO] A clear, casual male voice speaking followed by two distinct, rhythmic, high-pitched metallic clacks and slides of a firearm action being cycled.
[WORD] This piece of the AK, you can barely see it. So you open up the action...
[MUSIC] none

[AUDIO] A calm male voice with low boat-engine rumble and rhythmic water splashes.
[WORD] And it's a tough ride back in no matter what kind of a boat you're in.
[MUSIC] none

[AUDIO] A casual female voice followed by loud sewing-machine buzzing and whirring.
[WORD] change this one's for the serger and then there's a light down there.
[MUSIC] none

[AUDIO] none
[WORD] Um, all right, who are my role models? Uh, I would say my dad for one. Uh, I've always just looked up to him for his work ethic. When he came to
[MUSIC] A very subtle ambient underscore with soft sustained pads or a gentle repeating piano motif, creating a calm and unobtrusive backdrop.

[AUDIO] none
[WORD] What makes my body regenerate better, for example, in the night? And there comes simple things.
[MUSIC] Sustained airy synth pads with sparse piano or bell-like notes, creating a very slow, reflective atmosphere.

[AUDIO] none
[WORD] Thanks for watching. If you like what you just saw, hit the subscribe button for more clips and full episodes.
[MUSIC] A gentle synthesized keyboard melody with a subtle electronic beat, keeping a relaxed medium tempo and a calm, pleasant mood.

[AUDIO] none
[WORD] about how I feel about being called a YouTuber. You know, at first, I loved being called a YouTuber. I was like, this is my dream. I grew up watching YouTube. YouTubers are my movie stars.
[MUSIC] Soft arpeggiated piano chords with a faint atmospheric underscore, creating a slow and reflective mood.

V2A

VisualTTS

GRID

[WORD] bin blue by p five please

[WORD] bin blue at d six now

[WORD] bin blue at i six soon

[WORD] bin blue at b nine now

Text-conditioned Generation

[AUDIO] Clip-clops gallop as the wind blows and thunder cracks

[AUDIO] A dog snoring loudly

[SPEECH] Among other things on which she cast her eyes was a small crucifix of solid silver, standing on a cabinet near the window.

[MUSIC] A classical sounding music piece which sounds like a music box being played through a tiny, distorted speaker of an ice cream truck. Low fidelity.
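
The captions on this page follow a simple tagged layout: one line per stream, prefixed with `[AUDIO]`, `[WORD]`, `[MUSIC]`, or `[SPEECH]`, with `none` marking an absent stream. For readers who want to work with these captions programmatically, the following is a minimal sketch that parses one caption block into a dictionary. It assumes the tags are only this page's display convention, not a documented Foley-Omni input format, and `parse_caption` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Tags observed in the captions on this page (assumption: this is a display
# convention, not a documented model input schema).
KNOWN_TAGS = ("AUDIO", "WORD", "MUSIC", "SPEECH")
TAG_LINE = re.compile(r"^\[(?P<tag>[A-Z]+)\]\s*(?P<text>.*)$")


def parse_caption(block: str) -> Dict[str, Optional[str]]:
    """Parse one caption block into {tag: text}; the literal 'none' becomes None."""
    fields: Dict[str, Optional[str]] = {}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        match = TAG_LINE.match(line)
        if match is None or match.group("tag") not in KNOWN_TAGS:
            raise ValueError(f"unrecognized caption line: {line!r}")
        text = match.group("text").strip()
        fields[match.group("tag")] = None if text.lower() == "none" else text
    return fields


example = """[AUDIO] A calm male voice with low boat-engine rumble and rhythmic water splashes.
[WORD] And it's a tough ride back in no matter what kind of a boat you're in.
[MUSIC] none"""
parsed = parse_caption(example)
print(parsed["MUSIC"])
```

Mapping `none` to `None` keeps "no music requested" distinct from an empty description, which is convenient when filtering samples by which streams are present.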