MMEdit: A Unified Framework for Multi-Type Audio Editing via Multimodal Language Model

Ye Tao†1,2, Wen Wu2, Chao Zhang2, Shuai Wang3, Mengyue Wu1, Xuenan Xu*1,2
1 MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
2 Shanghai AI Laboratory
3 Nanjing University
† Work done partially during internship at Shanghai AI Lab.
* Corresponding author.

Abstract

Text-guided audio editing requires modifying specific acoustic events while strictly preserving non-target content. Current approaches generally fall into two paradigms: training-based methods, which are often confined to a limited set of operation types, and training-free methods based on diffusion inversion, which frequently struggle with controllability and suffer signal degradation. To overcome these constraints, we propose MMEdit, a unified, large-model-driven framework that significantly expands the scope and precision of audio editing. We systematically extend task definitions to cover a comprehensive range of operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we construct a scalable synthetic pipeline that generates large-scale paired data with fine-grained event-level annotations. To capture complex editing semantics, we integrate the Qwen2-Audio encoder with an MMDiT-based generator, achieving precise cross-modal alignment and localized editing. Experiments demonstrate that our approach delivers superior edit localization, robust instruction following, and high fidelity in non-edited regions, highlighting the potential of large-scale models for controllable audio editing.

Scalable Data Pipeline

Data Construction Pipeline

To overcome the data scarcity bottleneck in audio editing, we construct a large-scale synthetic dataset comprising over one million triplets. Our pipeline begins by extracting clean, event-level segments from AudioCaps using automated grounding (TAG) and semantic filtering (CLAP). We then leverage the Scaper library to compositionally synthesize these segments into realistic soundscapes. This allows us to systematically simulate six diverse editing operations: Addition, Removal, Replacement, Reordering, Loudness Adjustment, and Speed Modification. This automated process ensures fine-grained event-level annotations, forming a solid foundation for training controllable editing models.
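The key idea above is that each editing operation can be expressed as a deterministic transformation of an event-level annotation list, from which the source and edited soundscapes are then rendered. The following is a minimal sketch of that pairing logic; the event fields mirror typical Scaper-style annotations, but the labels, values, and helper names are illustrative, not the paper's actual pipeline.

```python
import copy

# Event-level annotation of a source soundscape. Labels, times, and SNRs
# are illustrative placeholders.
source = [
    {"label": "rain", "onset": 0.0, "duration": 10.0, "snr": 0.0},
    {"label": "dog barking", "onset": 3.2, "duration": 2.0, "snr": 6.0},
]

def remove_event(events, label):
    """'Removal': drop the target event; all other events stay untouched."""
    edited = [e for e in events if e["label"] != label]
    return edited, f"Remove the {label}."

def adjust_loudness(events, label, delta_db):
    """'Loudness Adjustment': shift the target event's SNR by delta_db dB."""
    edited = copy.deepcopy(events)
    for e in edited:
        if e["label"] == label:
            e["snr"] += delta_db
    verb = "Turn up" if delta_db > 0 else "Turn down"
    return edited, f"{verb} the volume of the {label}."

def reorder_events(events, label_a, label_b):
    """'Reordering': swap the onsets of two events without altering them."""
    edited = copy.deepcopy(events)
    a = next(e for e in edited if e["label"] == label_a)
    b = next(e for e in edited if e["label"] == label_b)
    a["onset"], b["onset"] = b["onset"], a["onset"]
    return edited, f"Swap the order of {label_a} and {label_b}."

edited, instruction = remove_event(source, "dog barking")
```

Rendering both the source list and the edited list with the same synthesis seed would then yield a (source audio, instruction, edited audio) triplet in which only the targeted event differs.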

Method Overview

MMEdit Architecture

MMEdit combines the Qwen2-Audio multimodal encoder with an MMDiT-based diffusion backbone. Rather than conditioning through cross-attention alone, it uses joint attention to enable deep, token-level interaction between audio latents and editing instructions. Qwen2-Audio jointly encodes the source audio and the instruction into a high-level multimodal representation $\mathbf{H}$. The MMDiT backbone then predicts the clean audio latent $\mathbf{z}_0$ from a Gaussian latent, conditioned on $\mathbf{H}$ and the source-audio latent $\mathbf{z}_\mathrm{in}$. Finally, the VAE decoder converts $\mathbf{z}_0$ into the edited audio.
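The joint attention described above can be sketched as follows: audio-latent tokens and the condition tokens $\mathbf{H}$ keep stream-specific input projections but attend over one concatenated sequence, so every audio token can interact with every instruction token. This is a minimal PyTorch sketch under that assumption; dimensions, layer choices, and names are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """MMDiT-style joint attention sketch: separate projections per stream,
    shared attention over the concatenated token sequence."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.proj_audio = nn.Linear(dim, dim)  # stream-specific projections
        self.proj_cond = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z: torch.Tensor, h: torch.Tensor):
        # z: (B, N_audio, dim) noisy audio latent, already fused with z_in
        # h: (B, N_cond, dim)  multimodal representation H from the encoder
        x = torch.cat([self.proj_audio(z), self.proj_cond(h)], dim=1)
        out, _ = self.attn(x, x, x)  # token-level joint attention
        z_out, h_out = out.split([z.size(1), h.size(1)], dim=1)
        return z_out, h_out
```

Splitting the output back into its two streams lets each subsequent layer process audio and condition tokens separately while still exchanging information at every block.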

🎧 Audio Editing Samples

Select a task below to view examples.

➕ 1. Addition

Insert a new target sound into the audio while preserving the original context.

Instruction Source Audio AUDIT AudioEditer MMEdit (Ours)
Mix in dog barking in the middle.
Include typing as a background sound.
Incorporate gunshots fired three times from 3.2s to 10.0s.
Insert a pig oinks repeatedly sound in the middle.
Mix in loud train horn into the background.

➖ 2. Removal

Remove a specified sound event and keep the remaining audio natural and unchanged.

Instruction Source Audio AUDIT AudioEditer MMEdit (Ours)
Eliminate male giving a speech from the audio.
Remove the dog barking.
Silence the sound of a person is snoring loudly.
Erase the sound of rain falling from the background.

🔄 3. Replacement

Replace a target sound with a new one at the same time region while preserving the background.

Instruction Source Audio AUDIT AudioEditer MMEdit (Ours)
Take the sound of Quack and turn it into Bell.
Replace the electric shaver sound with a busy signal.
Swap the male voice on the dispatch radio for a car revving.
Find the goat bleats and change it to laughing from an individual.

🔀 4. Reordering

Change the temporal order of specified events without altering their content.

Instruction Source Audio MMEdit (Ours)
Swap the order of a man speaking and a drilling sound.
Swap the order of a person types on a computer keyboard quietly and emergency sirens sound.
Make a vehicle with a siren play before children cry together.
Reorder the static-filled speech and the fast sewing machine noise.

🔊 5. Loudness Adjustment

Increase or decrease the volume of a target event while keeping other sounds unchanged.

Instruction Source Audio MMEdit (Ours)
Turn down the volume of a bell ringing.
Turn up the sounds of race cars rev in the distance.
Soften the dog barking.
Turn up the volume of the woman speaking.

⏩ 6. Speed Modification

Speed up or slow down a target event, with its duration adjusted accordingly, while preserving overall audio coherence.

Instruction Source Audio MMEdit (Ours)
Slow down bell considerably.
Halve the duration of toilet flushing.
Extend train horn duration by 20%.
Speed up the sound of housefly.

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}