AI Video Glossary — T2V, I2V, CFG, Diffusion and More Explained

Glossary

AI Video Glossary

Plain-language definitions of every technical term you'll encounter when using Wan AI video models. Click any related term or model link to go deeper.

T2V (Text-to-Video): Text-to-Video (T2V) takes a written prompt and synthesizes a video clip from scratch — no input image required. You describe a scene, pose, lighting, and camera move, and the model renders motion directly from that description. All Wan video models support T2V. Quality improves significantly with detailed, specific prompts.
I2V (Image-to-Video): Image-to-Video (I2V) takes a reference image as the first frame and animates it forward in time. The model keeps the subject's identity, pose, and environment consistent while adding motion. Wan 2.2 and later (including Wan 2.5 and 2.6) support I2V. I2V produces more predictable results than T2V when you already have a strong source image.
Diffusion Model: A diffusion model generates content by learning to reverse a noise process: it starts from random noise and progressively refines it into a coherent image or video, guided by a text or image condition. Wan AI models are built on diffusion architectures. The quality and coherence of the output depend heavily on the number of denoising steps and the guidance scale used.
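To make the "start from noise, refine step by step" idea concrete, here is a deliberately simplified toy in plain Python. It is not how Wan or any real sampler works: a real diffusion model uses a trained network to predict the noise at each step, whereas this sketch cheats by already knowing the clean target. The names `toy_denoise`, `mean_abs_error`, and the `rate` parameter are hypothetical and exist only for illustration.

```python
import random


def toy_denoise(target, steps, rate=0.2, seed=0):
    """Toy iterative refinement: start from random noise, move toward `target`.

    Assumption for illustration only: the "predicted noise" is (x - target).
    A real model has to learn this prediction from data.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure random noise
    for _ in range(steps):
        # Remove a fixed fraction of the remaining "noise" on every step.
        x = [xi - rate * (xi - ti) for xi, ti in zip(x, target)]
    return x


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)
```

In this toy, running more steps leaves less residual noise, which loosely mirrors why step count affects output quality in a real generator.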
CFG Scale (Classifier-Free Guidance): The CFG (Classifier-Free Guidance) scale is a number that controls how strictly the model follows your prompt. A low value (e.g. 3–5) gives the model more creative freedom and tends to produce smoother motion, but the output may drift from the prompt. A high value (e.g. 10–15) forces the model to stick closer to your words but can produce over-saturated or artifact-heavy frames. Most generators on this site use a sensible default; adjust it only if you see obvious prompt drift or color blowout.
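For readers curious about what the CFG number does mathematically, the commonly used formulation blends two noise predictions at every denoising step: one made without the prompt and one made with it. The sketch below shows that blend on plain lists of floats. The function name `apply_cfg` is hypothetical, and whether the generator on this site uses exactly this formulation is an assumption.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance blend of two noise predictions.

    uncond: prediction made without the prompt
    cond:   prediction made with the prompt
    scale:  the CFG scale; 1.0 returns `cond` unchanged, larger values
            push the result further in the direction the prompt suggests
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

Because the result extrapolates past the prompt-conditioned prediction when the scale is above 1, very high values can exaggerate the output, which is consistent with the over-saturation described above.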
Denoise Steps (Sampling Steps): Each denoising step refines the output from noise toward a coherent result. More steps (e.g. 50) generally produce sharper, more detailed frames at the cost of longer generation time. Fewer steps (e.g. 20) are faster but may leave the output slightly blurry or inconsistent. For NSFW video generation, 20–30 steps is typically a good balance. The generator on this site manages step count automatically.
Prompt: A prompt is the text you write to tell the model what to generate. For video generation, a strong prompt typically includes: subject description (appearance, pose), action or motion, setting and lighting, camera angle or movement, and style cues. Wan models respond well to explicit, concrete descriptions. Vague prompts produce generic output; specific prompts produce controlled results. See the Prompt Guide for model-specific tips.
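One way to keep prompts specific is to fill in each of the components listed above separately and then join them. The helper below is a minimal sketch of that habit; `PromptParts` and `build_prompt` are hypothetical names, not part of any Wan tool, and the comma-separated output format is only an assumption about what reads well.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    subject: str        # appearance, pose
    action: str         # action or motion
    setting: str        # setting and lighting
    camera: str = ""    # camera angle or movement (optional)
    style: str = ""     # style cues (optional)


def build_prompt(parts: PromptParts) -> str:
    """Join the non-empty components into one comma-separated prompt."""
    fields = [parts.subject, parts.action, parts.setting, parts.camera, parts.style]
    return ", ".join(f.strip() for f in fields if f.strip())
```

For example, `build_prompt(PromptParts("a woman in a red coat", "walking through rain", "neon-lit street at night", camera="slow tracking shot"))` yields a single prompt string covering subject, motion, setting, and camera.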
Negative Prompt: A negative prompt is a second text input where you list things you do not want in the output — e.g. "blurry, watermark, extra limbs, distorted face". The model pushes generation away from these concepts. Effective negative prompts for NSFW video typically include common artifact descriptions. Not all interfaces expose the negative prompt field; the Wan generator on this site includes it under advanced settings.
Subject Consistency: Subject consistency measures whether a character's face, body, and appearance remain stable from one frame to the next. Poor consistency causes "identity drift" — the subject subtly changes between frames, breaking immersion. Wan 2.2 was Alibaba's first major improvement in this area. Wan 2.5 and 2.6 improve it further. Using I2V mode with a clear reference image generally produces better consistency than T2V alone.
FPS (Frames Per Second): FPS determines how smooth a video clip looks. Standard film is 24fps; most AI video models target 16–24fps. A 10-second clip at 16fps is 160 frames, and each frame costs compute. Wan 2.6 generates up to 15 seconds of video, which at 24fps means up to ~360 frames per generation. Higher FPS produces smoother motion but increases generation time and file size.
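The frame counts quoted above follow from multiplying clip length by frame rate. A small sketch of that arithmetic, using the hypothetical helper name `frame_count`:

```python
def frame_count(duration_seconds: float, fps: float) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return round(duration_seconds * fps)


print(frame_count(10, 16))  # 10-second clip at 16fps -> 160
print(frame_count(15, 24))  # 15-second clip at 24fps -> 360
```

The ~360-frame figure corresponds to a 15-second clip at 24fps; the same clip at 16fps would be 240 frames.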
Resolution: Resolution is the width × height of each video frame, expressed in pixels. Higher resolution captures finer detail but requires more compute and memory. Wan 2.2, 2.5, and 2.6 output native 1080p. Perceived sharpness also depends on the model's detail capacity: a 1080p frame from a lower-quality model may look softer than expected. Output files are typically MP4 at native resolution.
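To get a feel for why higher resolution costs more compute and memory, it helps to count pixels. The sketch below assumes 1080p means 1920 × 1080 (16:9) and that an uncompressed frame uses 3 bytes per pixel (8-bit RGB); both are assumptions for illustration, and the helper names are hypothetical. MP4 output is compressed, so real files are far smaller than these raw figures.

```python
def pixels_per_frame(width: int, height: int) -> int:
    """Total pixel count of a single frame."""
    return width * height


def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of one frame, assuming 8-bit RGB by default."""
    return pixels_per_frame(width, height) * bytes_per_pixel


print(pixels_per_frame(1920, 1080))  # 2073600 pixels in a 1080p frame
print(pixels_per_frame(1280, 720))   # 921600 pixels in a 720p frame
print(raw_frame_bytes(1920, 1080))   # 6220800 bytes uncompressed
```

Under these assumptions a 1080p frame has 2.25 times as many pixels as a 720p frame, so every frame carries that much more data for the model to produce.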