POST /api/generate-talking-video
curl --request POST \
  --url https://easy-peasy.ai/api/generate-talking-video \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "image": "https://example.com/portrait.jpg",
  "text": "Hello! Welcome to our product demo.",
  "voiceID": "21m00Tcm4TlvDq8ikWAM",
  "avatarModel": "premium",
  "resolution": "720p"
}
'
{
  "id": 12345,
  "prompt": "Hello! Welcome to our product demo.",
  "image_url": "",
  "is_video": true,
  "created_at": "2025-01-15T10:30:00.000Z"
}

Workflow

  1. Submit a talking video request using this endpoint
  2. Save the id from the response
  3. Poll the Get Video endpoint every 15–30 seconds until status is completed
  4. Download the video from the url field

Talking video generation requires a paid plan. Processing typically takes 1–5 minutes, depending on audio length and resolution.
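
The polling step of the workflow can be sketched as below. This is a minimal sketch, not an official client: the Get Video endpoint's URL and full response shape are not given on this page, so the lookup is passed in as a hypothetical `fetch_video` callable, and only the `status` and `url` fields named in the workflow are read. The 10-minute timeout is an assumption chosen to sit above the typical 1–5 minute processing time.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_video(
    fetch_video: Callable[[int], Mapping[str, Any]],  # hypothetical: wraps the Get Video endpoint
    video_id: int,
    interval_s: float = 20.0,   # within the suggested 15-30 second polling range
    timeout_s: float = 600.0,   # assumption: give up after 10 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the value of the url field."""
    waited = 0.0
    while True:
        video = fetch_video(video_id)
        if video.get("status") == "completed":
            return video["url"]
        if waited >= timeout_s:
            raise TimeoutError(f"video {video_id} not completed after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; in normal use the default `time.sleep` applies.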

Input combinations

Input   Audio Source     Description
image   text + voiceID   Generates speech, then animates the face
image   audio            Animates the face with provided audio
video   text + voiceID   Generates speech, then lip-syncs the video
video   audio            Lip-syncs the video with provided audio
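
The four combinations above can be checked on the client before submitting. The helper below is an illustrative sketch: `describe_inputs` is a hypothetical name, and the choices to reject a payload that carries both `image` and `video`, and to let `audio` take precedence over `text`, are assumptions. This page does not state how the server handles those cases.

```python
from typing import Any, Mapping


def describe_inputs(payload: Mapping[str, Any]) -> str:
    """Return which documented input combination a payload matches, or raise ValueError."""
    has_image = bool(payload.get("image"))
    has_video = bool(payload.get("video"))
    if has_image == has_video:
        # Assumption: exactly one visual input is expected.
        raise ValueError("provide exactly one of 'image' or 'video'")
    visual = "image" if has_image else "video"

    if payload.get("audio"):
        # Assumption: provided audio is used directly, so text is not needed.
        return f"{visual} + audio"
    if payload.get("text") and payload.get("voiceID"):
        return f"{visual} + text/voiceID"
    raise ValueError("provide 'audio', or both 'text' and 'voiceID'")
```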

Voice IDs

Use the Get TTS Voices endpoint to discover available voice IDs. Both ElevenLabs and OpenAI voices are supported.

Requirements

  • Image: minimum 512x512 pixels
  • Video: .mp4 or .mov format, 3–300 seconds
  • Audio: max 5 minutes
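
These limits can be pre-checked locally once the media dimensions and durations are known. The sketch below assumes the caller has already measured them (it does not inspect files), treats the bounds as inclusive, and uses a hypothetical function name; none of this is part of the API itself.

```python
from typing import List, Optional, Tuple


def requirement_problems(
    image_size: Optional[Tuple[int, int]] = None,  # (width, height) in pixels
    video_name: Optional[str] = None,              # file name or URL path
    video_seconds: Optional[float] = None,
    audio_seconds: Optional[float] = None,
) -> List[str]:
    """Return a list of violated requirements; an empty list means none were found."""
    problems = []
    if image_size is not None and min(image_size) < 512:
        problems.append("image must be at least 512x512 pixels")
    if video_name is not None and not video_name.lower().endswith((".mp4", ".mov")):
        problems.append("video must be .mp4 or .mov")
    if video_seconds is not None and not 3 <= video_seconds <= 300:
        problems.append("video must be 3-300 seconds long")
    if audio_seconds is not None and audio_seconds > 300:
        problems.append("audio must be at most 5 minutes long")
    return problems
```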

Authorizations

x-api-key
string
header
required

API key for authentication. Get yours at https://easy-peasy.ai/settings/api
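
As a rough Python counterpart to the curl example, the request can be assembled with the standard library. This is a sketch under the assumption that a JSON body plus the `x-api-key` header is all the endpoint needs, as the curl command suggests; `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://easy-peasy.ai/api/generate-talking-video"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request with the x-api-key header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

Sending is then a matter of passing the result to `urllib.request.urlopen` and reading the JSON body of the response.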

Headers

x-api-key
string
required

Your API key

Body

application/json
image
string<uri>

URL of a face image to animate. Must be at least 512x512 pixels. Provide either image or video.

video
string<uri>

URL of a video to lip-sync. Supported formats: .mp4, .mov. Duration: 3–300 seconds. Provide either image or video.

text
string

Text to convert to speech. Required if audio is not provided.

voiceID
string

Voice ID for text-to-speech. Get available voices from the Get TTS Voices endpoint. Required if text is provided and audio is not.

audio
string<uri>

URL of an audio file to use directly (instead of generating from text). Max 5 minutes.

avatarModel
enum<string>
default:premium

Avatar generation model. premium uses VEED Fabric (higher quality), standard uses WaveSpeed InfiniteTalk (faster). Only applies to image input.

Available options: premium, standard

resolution
enum<string>
default:480p

Output video resolution.

Available options: 480p, 720p

generateCaptions
boolean

Whether to generate captions on the video.

captionColor
string

Highlight color for captions (hex code).
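
The optional fields can be validated against their documented values before sending. In the sketch below, `with_options` is a hypothetical helper; the accepted hex format (six digits with an optional leading `#`) and the choice to enable `generateCaptions` whenever a caption color is given are assumptions, since this page only says "hex code".

```python
import re
from typing import Any, Dict, Mapping, Optional

AVATAR_MODELS = ("premium", "standard")
RESOLUTIONS = ("480p", "720p")
HEX_COLOR = re.compile(r"#?[0-9A-Fa-f]{6}")  # assumption: the exact format is not documented


def with_options(
    payload: Mapping[str, Any],
    avatar_model: str = "premium",   # documented default
    resolution: str = "480p",        # documented default
    caption_color: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of payload with validated optional fields added."""
    if avatar_model not in AVATAR_MODELS:
        raise ValueError(f"avatarModel must be one of {AVATAR_MODELS}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    result = dict(payload, avatarModel=avatar_model, resolution=resolution)
    if caption_color is not None:
        if not HEX_COLOR.fullmatch(caption_color):
            raise ValueError("captionColor must be a hex code")
        # Assumption: a caption color only matters when captions are generated.
        result["generateCaptions"] = True
        result["captionColor"] = caption_color
    return result
```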

Response

Talking video generation started

id
integer

Video ID. Use this to poll for the result with the Get Video endpoint.

prompt
string

The prompt used for generation

image_url
string

Video URL. Empty string while processing.

model
string

The model used for generation

is_video
boolean

created_at
string<date-time>
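
The response fields above map naturally onto a small typed record. The sketch below is illustrative only: `TalkingVideoJob`, `parse_job`, and `is_ready` are hypothetical names, and `model` is treated as optional because the sample response at the top of this page does not include it.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TalkingVideoJob:
    id: int
    prompt: str
    image_url: str            # video URL; empty string while processing
    is_video: bool
    created_at: datetime
    model: Optional[str] = None

    @property
    def is_ready(self) -> bool:
        """True once image_url is no longer the empty string."""
        return bool(self.image_url)


def parse_job(body: str) -> TalkingVideoJob:
    """Parse the JSON body returned when generation has started."""
    data = json.loads(body)
    # Rewrite the trailing "Z" so datetime.fromisoformat accepts it on older Pythons.
    created = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
    return TalkingVideoJob(
        id=data["id"],
        prompt=data["prompt"],
        image_url=data.get("image_url", ""),
        is_video=data.get("is_video", False),
        created_at=created,
        model=data.get("model"),
    )
```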