Generate Talking Video
Generate a talking video by animating a face image or video with speech. Provide either text + voice to generate audio, or supply your own audio file.
Input combinations:
- Image + text + voice — Generates speech from text using the specified voice, then animates the face in the image
- Image + audio — Uses the provided audio to animate the face in the image
- Video + text + voice — Generates speech from text, then lip-syncs the video
- Video + audio — Lip-syncs the video with the provided audio
Models:
- premium: High-quality avatar generation (VEED Fabric). Supports 480p and 720p.
- standard: Faster generation (WaveSpeed InfiniteTalk). Supports 480p and 720p.
Video generation is asynchronous. Use the Get Video endpoint to poll for results.
Note: Requires a paid plan. Image must be at least 512x512 pixels. Audio/video max 5 minutes.
Workflow
- Submit a talking video request using this endpoint
- Save the id from the response
- Poll the Get Video endpoint every 15–30 seconds until status is completed
- Download the video from the url field
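The polling part of this workflow can be sketched in Python as follows. This is a minimal illustration, not an official client: get_video is a hypothetical callable that wraps the Get Video endpoint and returns the parsed response as a dict, and the 30-minute timeout is an assumed safety limit that the documentation does not specify. Only the id, status, and url fields and the completed value come from the steps above.

```python
import time
from typing import Any, Callable, Dict


def wait_for_video(
    get_video: Callable[[str], Dict[str, Any]],  # hypothetical Get Video wrapper
    video_id: str,
    interval: float = 15.0,   # documented polling range is 15-30 seconds
    timeout: float = 1800.0,  # assumed limit, not from the documentation
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the value of the url field."""
    waited = 0.0
    while True:
        result = get_video(video_id)
        if result.get("status") == "completed":
            return result["url"]
        if waited >= timeout:
            raise TimeoutError(f"video {video_id} not completed after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Passing sleep as a parameter keeps the loop testable without real delays; in normal use the default time.sleep applies.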
Input combinations
| Input | Audio Source | Description |
|---|---|---|
| image | text + voiceID | Generates speech, then animates the face |
| image | audio | Animates the face with provided audio |
| video | text + voiceID | Generates speech, then lip-syncs the video |
| video | audio | Lip-syncs the video with provided audio |
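The four combinations above can be checked on the client before a request is sent. The sketch below is illustrative only: the function name build_inputs is hypothetical, and the JSON keys (image, video, text, voiceID, audio) are assumptions taken from the labels in the table, not confirmed field names. Rejecting a request that mixes audio with text or a voice ID is a choice made by this sketch; the documentation does not say how the API treats that case.

```python
from typing import Dict, Optional


def build_inputs(
    image: Optional[str] = None,
    video: Optional[str] = None,
    text: Optional[str] = None,
    voice_id: Optional[str] = None,
    audio: Optional[str] = None,
) -> Dict[str, str]:
    """Return a request-body fragment for one of the four valid combinations."""
    # Exactly one visual input: image or video.
    if (image is None) == (video is None):
        raise ValueError("provide exactly one of image or video")
    body: Dict[str, str] = {"image": image} if image is not None else {"video": video}

    if audio is not None:
        # Sketch-level choice: refuse ambiguous input instead of guessing.
        if text is not None or voice_id is not None:
            raise ValueError("provide either audio or text + voice ID, not both")
        body["audio"] = audio
    else:
        if text is None or voice_id is None:
            raise ValueError("text and voice ID are both required without audio")
        body["text"] = text
        body["voiceID"] = voice_id  # assumed key name
    return body
```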
Voice IDs
Use the Get TTS Voices endpoint to discover available voice IDs. Both ElevenLabs and OpenAI voices are supported.
Requirements
- Image: minimum 512x512 pixels
- Video: .mp4 or .mov format, 3–300 seconds
- Audio: max 5 minutes
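These limits can be checked locally before uploading anything. The following is a rough sketch under the assumption that the caller already knows the image dimensions and media durations (measuring them is out of scope here); check_requirements and its parameters are hypothetical names, and only the numeric limits and file extensions come from the list above.

```python
from typing import List, Optional, Tuple

MIN_IMAGE_SIDE = 512      # pixels, per side
MIN_VIDEO_SECONDS = 3
MAX_VIDEO_SECONDS = 300
MAX_AUDIO_SECONDS = 300   # 5 minutes
VIDEO_EXTENSIONS = (".mp4", ".mov")


def check_requirements(
    image_size: Optional[Tuple[int, int]] = None,
    video_name: Optional[str] = None,
    video_seconds: Optional[float] = None,
    audio_seconds: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    if image_size is not None:
        width, height = image_size
        if width < MIN_IMAGE_SIDE or height < MIN_IMAGE_SIDE:
            problems.append(f"image is {width}x{height}, minimum is 512x512")
    if video_name is not None and not video_name.lower().endswith(VIDEO_EXTENSIONS):
        problems.append("video must be .mp4 or .mov")
    if video_seconds is not None and not (
        MIN_VIDEO_SECONDS <= video_seconds <= MAX_VIDEO_SECONDS
    ):
        problems.append("video must be 3-300 seconds long")
    if audio_seconds is not None and audio_seconds > MAX_AUDIO_SECONDS:
        problems.append("audio must be at most 5 minutes long")
    return problems
```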
Authorizations
API key for authentication. Get yours at https://easy-peasy.ai/settings/api
Headers
Your API key
Body
URL of a face image to animate. Must be at least 512x512 pixels. Provide either image or video.
URL of a video to lip-sync. Supported formats: .mp4, .mov. Duration: 3–300 seconds. Provide either image or video.
Text to convert to speech. Required if audio is not provided.
Voice ID for text-to-speech. Get available voices from the Get TTS Voices endpoint. Required if text is provided and audio is not.
URL of an audio file to use directly (instead of generating from text). Max 5 minutes.
Avatar generation model. premium uses VEED Fabric (higher quality), standard uses WaveSpeed InfiniteTalk (faster). Only applies to image input.
Options: premium, standard
Output video resolution.
Options: 480p, 720p
Whether to generate captions on the video.
Highlight color for captions (hex code).
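The optional settings described above accept a small set of values, so they are easy to validate before sending. The sketch below is a hedged illustration: the key names model, resolution, captions, and captionColor are hypothetical placeholders (this page does not show the actual field names), and the #RRGGBB pattern is an assumed reading of "hex code". Only the allowed model and resolution values come from the descriptions above.

```python
import re
from typing import Dict, Optional, Union

MODELS = {"premium", "standard"}
RESOLUTIONS = {"480p", "720p"}
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")  # assumed format: #RRGGBB


def build_options(
    model: Optional[str] = None,
    resolution: Optional[str] = None,
    captions: Optional[bool] = None,
    caption_color: Optional[str] = None,
) -> Dict[str, Union[str, bool]]:
    """Validate optional settings and return them under hypothetical key names."""
    options: Dict[str, Union[str, bool]] = {}
    if model is not None:
        if model not in MODELS:
            raise ValueError(f"model must be one of {sorted(MODELS)}")
        options["model"] = model
    if resolution is not None:
        if resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
        options["resolution"] = resolution
    if captions is not None:
        options["captions"] = captions
    if caption_color is not None:
        if not HEX_COLOR.fullmatch(caption_color):
            raise ValueError("caption color must look like #RRGGBB")
        options["captionColor"] = caption_color
    return options
```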
Response
Talking video generation started
