Tongyi Wanxiang VACE

Announcing the Open Source Video Editing Powerhouse

VACE (Wan2.1) by Tongyi Wanxiang: Open Source Now!

One model to master diverse video editing tasks, revolutionizing creative workflows.

Latest Updates on VACE Open Source

Tongyi Wanxiang VACE (Wan2.1) Open Sourced: A Unified Model for Diverse Video Editing

[May 15, 2025] Alibaba Cloud's Tongyi Wanxiang team has officially open-sourced its groundbreaking VACE (All-in-one Video Creation and Editing) model, a core component of the Wan2.1 series. VACE aims to provide a one-stop, efficient, and flexible video creation and editing experience by consolidating multiple complex tasks into a single, powerful model. This release includes VACE-1.3B, which supports 480P resolution, and VACE-14B, which supports both 480P and 720P.

Category: Latest Updates, Model Release | Learn More Technical Details »

Key Milestones:

  • Wan2.1 Series Release: VACE unveiled as a core component, showcasing its unified video processing capabilities.
  • VACE 1.3B & 14B Models Open Sourced: Powerful models supporting different resolutions (480P/720P) released to the community.
  • Active Community Feedback & Exploration: Developers begin secondary creation, evaluation, and exploring VACE's application potential across various industries.

One Model, Multiple Tasks: How VACE Simplifies Complex Video Editing

[May 15, 2025] With VACE, users can perform text-to-video generation, image-referenced generation, local video editing, and video duration/spatial extension without switching between different models or tools. This unified approach is set to significantly boost efficiency and flexibility in video content creation. A key highlight is its multimodal input system, which can simultaneously accept text, images, video clips, masks, and various control signals (such as human pose and depth maps).

Category: Core Features, Workflow Optimization | Source: Tongyi Wanxiang Official Announcement

VACE Core Features: Redefining Video Creation Boundaries

Powerful Controllable Inpainting & Generation

VACE addresses a long-standing pain point of traditional video creation: generated content is difficult to adjust after the fact. It supports highly controllable video inpainting and generation conditioned on human pose, motion optical flow, structure preservation, spatial motion paths, video recoloring, and more. It also supports generation based on reference subject images and background images, ensuring consistency of visual elements.

Unified & Powerful Multimodal Input System

Unlike traditional models that rely solely on text prompts, VACE provides a unified input system that integrates text, images (object reference images or video frames), videos (supporting regeneration after erasure or local extension), masks (0/1 binary signals specifying the editing region), and various control signals (depth maps, optical flow, layout, grayscale images, line art, poses, etc.).
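
To make this concrete, the sketch below models such a unified input as a single request object. It is a hypothetical illustration: the class and field names are not part of the released API, and it only shows how text, reference images, frames, masks, and control signals can coexist in one call.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class VaceRequest:
    """Hypothetical container for one unified VACE-style editing request.

    Field names are illustrative; the open-source repo defines its own CLI
    and Python entry points.
    """
    prompt: str                                    # text condition
    ref_images: List[np.ndarray] = field(default_factory=list)  # subject / background references
    frames: Optional[np.ndarray] = None            # (T, H, W, 3) source video, if any
    masks: Optional[np.ndarray] = None             # (T, H, W) 0/1 map: 1 = regenerate, 0 = keep
    control: Optional[np.ndarray] = None           # pose / depth / optical-flow / line-art maps
    control_type: Optional[str] = None             # e.g. "pose", "depth", "flow"


# Example: local editing driven by a pose sequence (all arrays are dummies here).
req = VaceRequest(
    prompt="a dancer in a red dress on a rooftop at sunset",
    frames=np.zeros((81, 480, 832, 3), dtype=np.uint8),
    masks=np.ones((81, 480, 832), dtype=np.uint8),
    control=np.zeros((81, 480, 832, 3), dtype=np.uint8),
    control_type="pose",
)
```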

Precise Spatio-temporal Editing Capabilities

VACE gives users fine-grained control over video content. In the time dimension, it can complete an entire video from any video segment, or from just the first and last frames. In the spatial dimension, it supports extended generation for image edges or background areas, for example background replacement: changing the background environment according to a new prompt while keeping the main subject's motion intact.
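
As a concrete illustration of the temporal case, the sketch below assembles the input for a hypothetical first-and-last-frame completion task: the two known frames are kept, every other position is a gray placeholder, and the mask marks which frames the model should generate. The function and array layout are illustrative only, not the repository's actual interface.

```python
import numpy as np


def first_last_frame_task(first: np.ndarray, last: np.ndarray, num_frames: int = 81):
    """Build a frame/mask pair for temporal completion between two known frames.

    Placeholder frames are mid-gray; mask value 1 means "generate this frame",
    0 means "keep it as given". Purely illustrative of the input layout.
    """
    h, w, c = first.shape
    frames = np.full((num_frames, h, w, c), 127, dtype=np.uint8)  # gray placeholders
    masks = np.ones((num_frames, h, w), dtype=np.uint8)           # generate everything...
    frames[0], frames[-1] = first, last                           # ...except the two anchors
    masks[0], masks[-1] = 0, 0
    return frames, masks
```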

Expert-Level Task Handling

VACE easily handles complex functions that traditionally require multiple expert models, such as image-referenced generation, video inpainting, and local editing.

Free Combination of Atomic Abilities

A revolutionary feature allowing the natural fusion of basic abilities such as text-to-video, pose control, and background replacement, without training a separate model for each function.

Multiple Resolution Support

The open-sourced VACE-1.3B supports 480P, while VACE-14B supports both 480P and 720P resolutions, catering to various video quality needs.

Unlocking Creativity: The Power of Combined Atomic Abilities

One of VACE's most revolutionary features is its support for freely combining single-task capabilities, breaking through the bottleneck of traditional expert models, which work in isolation and are hard to make cooperate. As a unified model, VACE can naturally fuse basic (atomic) abilities such as text-to-video generation, pose control, background replacement, and local editing, without training a new model for each individual function. This flexible combination mechanism not only significantly simplifies the video creation workflow but also greatly expands the creative boundaries of AI video generation.

Examples of Combined Capabilities:

  • Object Replacement in Video: Combine "Image Reference" + "Subject Reshaping" features.
  • Dynamic Pose Control for Static Images: Combine "Motion Control" + "First Frame Reference" features.
  • Expanding a Portrait Image to a Landscape Video with Referenced Elements: Combine "Image Reference" + "First Frame Reference" + "Background Extension" + "Duration Extension" features.
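
A purely illustrative sketch of the combination idea behind the list above: each atomic ability contributes one condition, and the combined task is simply the union of those conditions in a single call. Names, shapes, and values are hypothetical.

```python
import numpy as np

# Each "atomic ability" contributes one condition; the combined task unions them.
subject_ref = np.zeros((512, 512, 3), dtype=np.uint8)      # image reference ability
pose_maps   = np.zeros((81, 480, 832, 3), dtype=np.uint8)  # motion control ability
edit_mask   = np.ones((81, 480, 832), dtype=np.uint8)      # local editing ability

combined_task = {
    "prompt": "replace the skater with a cartoon robot, keep the original motion",
    "ref_images": [subject_ref],
    "control": pose_maps,
    "control_type": "pose",
    "masks": edit_mask,
}
```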

Tech Deep Dive: The "Magic" Behind VACE

Core Foundation: Unified Input Paradigm – Video Condition Unit (VCU)

To achieve flexible combination and efficient processing of multiple tasks, the VACE team analyzed and summarized the input forms of four common video tasks (text-to-video, image-to-video, video-to-video, and local video-to-video) and proposed a flexible, unified input paradigm: the Video Condition Unit (VCU). The core idea of the VCU is to generalize complex multimodal context inputs into three basic forms: text, a frame sequence, and a mask sequence. This design not only unifies the input format for the four core generation and editing tasks above; more crucially, the frame and mask sequences within a VCU can be mathematically overlaid and fused, which is what makes the later free combination and synergistic processing of multi-task capabilities possible.
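
The sketch below is a minimal data-structure reading of the VCU described above, assuming every task reduces to text plus a frame sequence plus a mask sequence. The overlay method shows one naive way such conditions could be composited; it is not the implementation used in the released code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VCU:
    """Illustrative Video Condition Unit: text + frame sequence + mask sequence."""
    text: str
    frames: np.ndarray   # (T, H, W, 3) pixel or control-signal frames
    masks: np.ndarray    # (T, H, W), 1 = regenerate, 0 = keep

    def overlay(self, other: "VCU") -> "VCU":
        """Naive composition: take the other unit's frames wherever its mask is 0
        (i.e., where it pins content), keep this unit's frames elsewhere."""
        keep = (other.masks == 0)[..., None]
        frames = np.where(keep, other.frames, self.frames)
        masks = np.minimum(self.masks, other.masks)
        return VCU(text=f"{self.text}; {other.text}", frames=frames, masks=masks)
```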

Key Step: Unified Encoding of Multimodal Inputs & DiT Integration

A major technical hurdle for VACE was how to uniformly encode the diverse multimodal inputs within a VCU (text, images, videos, masks, various control signals, etc.) into token sequences that a Diffusion Transformer (DiT) model can process efficiently. VACE's solution involves the following steps. First, the frame sequence in the VCU input is conceptually decoupled into two parts: the RGB pixels that must be preserved unchanged in the generated result (the invariant frame sequence), and the content that must be regenerated based on text prompts or other control signals (the variable frame sequence). Next, these three inputs (variable frames, invariant frames, and masks) are encoded into latent space separately: variable and invariant frames are encoded by a VAE (Variational Autoencoder) into a 16-channel latent space consistent with the DiT model's noise dimensions, while mask sequences are mapped to a spatio-temporally consistent 64-channel latent feature through carefully designed deformation and sampling operations. Finally, the latent features of the frame and mask sequences are combined and mapped, via a set of trainable parameters, into token sequences that the DiT model can process directly.
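
Under stated assumptions, here is a rough PyTorch-style sketch of this encoding flow: a stand-in encoder plays the role of the Wan2.1 VAE producing 16-channel latents, the mask gains 64 channels by folding an 8×8 spatial patch into the channel dimension (one plausible reading of the "deformation and sampling" step), and a single linear layer stands in for the trainable mapping into DiT tokens. None of these modules correspond line-for-line to the released code.

```python
import torch
import torch.nn as nn


class ContextEmbedder(nn.Module):
    """Illustrative encoder for (variable frames, invariant frames, masks) -> DiT tokens."""

    def __init__(self, vae_encoder: nn.Module, latent_ch: int = 16,
                 mask_ch: int = 64, dit_dim: int = 1536, patch: int = 2):
        super().__init__()
        self.vae = vae_encoder                      # stand-in for the Wan2.1 VAE encoder
        in_ch = 2 * latent_ch + mask_ch             # variable + invariant latents + mask feature
        self.proj = nn.Linear(in_ch * patch * patch, dit_dim)  # trainable mapping to tokens
        self.patch = patch

    def forward(self, variable, invariant, mask):
        # variable, invariant: (B, 3, T, H, W) pixel frames; mask: (B, 1, T, H, W) in {0, 1}
        z_var = self.vae(variable)                  # (B, 16, T', H/8, W/8)
        z_inv = self.vae(invariant)                 # (B, 16, T', H/8, W/8)

        # Fold each 8x8 spatial patch of the mask into channels -> 64-channel feature
        b, _, t, h, w = mask.shape
        m = mask.reshape(b, t, h // 8, 8, w // 8, 8).permute(0, 3, 5, 1, 2, 4)
        m = m.reshape(b, 64, t, h // 8, w // 8)
        m = m[:, :, : z_var.shape[2]]               # crude temporal alignment with the latents

        x = torch.cat([z_var, z_inv, m.float()], dim=1)  # (B, 96, T', H/8, W/8)

        # Patchify and project into DiT token space
        p = self.patch
        b, c, t, hh, ww = x.shape
        x = x.reshape(b, c, t, hh // p, p, ww // p, p).permute(0, 2, 3, 5, 1, 4, 6)
        tokens = self.proj(x.reshape(b, -1, c * p * p))  # (B, T'*(H'/p)*(W'/p), dit_dim)
        return tokens
```

For a quick smoke test, a dummy module such as `nn.Conv3d(3, 16, kernel_size=(4, 8, 8), stride=(4, 8, 8))` can stand in for the real VAE encoder.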

Optimization Strategy: Efficient Context Adapter Fine-tuning

In choosing a training strategy, the VACE team compared two approaches: full fine-tuning and context adapter fine-tuning. Experiments showed that while full fine-tuning (training all DiT model parameters) achieves faster inference, the context adapter scheme, whose core idea is to freeze the parameters of the original base model (e.g., Wan2.1) and only copy and train a subset of the original Transformer layers as additional adapters, converges faster. It also avoids the risk of "catastrophic forgetting" or degradation of the base model's core capabilities during fine-tuning. All VACE models in this open-source release were therefore trained with context adapter fine-tuning, ensuring model stability and efficiency.
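
A schematic PyTorch sketch of the adapter idea: the base DiT is frozen, and a subset of its Transformer blocks is cloned as trainable adapters. Which layers are copied and how their outputs are injected back into the trunk are assumptions made for illustration; the released training code defines its own wiring.

```python
import copy

import torch.nn as nn


def build_context_adapters(base_dit: nn.Module, copy_every: int = 2):
    """Freeze the base DiT and clone every `copy_every`-th block as a trainable adapter.

    Returns the frozen base model and a ModuleList of trainable copies. How the
    adapter outputs are injected back into the trunk is model-specific and
    omitted here; this only shows the parameter split.
    """
    for p in base_dit.parameters():
        p.requires_grad_(False)                      # base model stays fixed

    adapters = nn.ModuleList(
        copy.deepcopy(block)
        for i, block in enumerate(base_dit.blocks)   # assumes a `.blocks` ModuleList
        if i % copy_every == 0
    )
    for p in adapters.parameters():
        p.requires_grad_(True)                       # only the copies are trained
    return base_dit, adapters
```

In the forward pass, each cloned block would process the context tokens and add its hidden states back into the corresponding frozen layer, so gradients flow only through the copies (and the context projection), not through the base model.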

Performance Evaluation: Significant Improvements in Key Metrics, Outstanding Results

Comprehensive quantitative evaluations of the newly released VACE series show that, compared to the earlier 1.3B preview version, the new models achieve significant improvements across multiple key metrics, including the quality of generated video content, the controllability of the generation process, and the precision of editing results. This marks another solid step for VACE toward becoming a mature and powerful AI video editing and creation tool.

Get Started Now: Experience and Develop with VACE

For developers intrigued by the VACE model and eager to try it or build on it, getting started takes just a few steps:

  1. Visit the official GitHub repository: Download the core source code for Wan2.1.
  2. Obtain Model Weights: Download the weight files for your chosen VACE version from the HuggingFace community or the ModelScope (魔搭) platform (see the sketch after this list).
  3. Stay Updated with Official Channels: Keep an eye on the official Tongyi Wanxiang website; user-friendly VACE features and online experience portals will be launched there soon.
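
For step 2, weights hosted on HuggingFace can typically be fetched with the `huggingface_hub` library. The repository id below follows the Wan2.1 naming pattern but is an assumption, so verify it against the official model cards (or use the ModelScope equivalent).

```python
from huggingface_hub import snapshot_download

# Assumed repo id following the Wan2.1 naming pattern; verify against the
# official HuggingFace / ModelScope model cards before downloading.
local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-VACE-1.3B",
    local_dir="./Wan2.1-VACE-1.3B",
)
print("Weights downloaded to:", local_dir)
```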

Direct Links to Core Resources:

Official Tongyi Wanxiang Online Experience & Exploration Portals: