Text-to-Speech

Convert text to lifelike speech. Three modes: synchronous (instant WAV response), streaming (chunked real-time audio), and async jobs (up to 50,000 chars with webhook delivery).

Synchronous Generation

Best for short text (max 500 chars). Use output_format to control the response shape: raw returns binary WAV (default), blob/url/hex return structured JSON with generation metadata.

POST /api/v1/tts

Generate speech from text. Response format depends on output_format parameter.

Request Body

Name	Type	Required	Default	Description
text	string	Required	—	Text to convert to speech. 1–5000 characters.
voice_id	string	Optional	—	Voice identifier. Defaults to the account's default voice.
language	string	Optional	auto	Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed	number	Optional	1.0	Playback speed multiplier. Range: 0.5–2.0.
quality	integer	Optional	32	Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale	number	Optional	2.0	Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise	boolean	Optional	true	Apply denoising to the generated audio.
output_format	string	Optional	raw	Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON).

200 Binary WAV (output_format=raw)

<binary audio/wav>

200 JSON (output_format=blob|url|hex)

{
  "generation_id": "gen_456",
  "status": "completed",
  "text": "Hello world",
  "voice_id": null,
  "language": "en",
  "chars": 11,
  "duration_ms": 1200,
  "audio": {
    "format": "url",
    "data": "https://r2.example.com/generations/...",
    "content_type": "audio/wav",
    "size_bytes": 52800,
    "expires_in": 3600
  },
  "created_at": "2026-01-15T10:30:00Z"
}
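The structured response above can be turned back into audio on the client. The following is a minimal Python sketch, not an official client. It assumes that audio.format echoes the requested output_format ("blob", "url", or "hex") and that audio.data holds base64 text, a presigned URL, or hex text respectively; only the "url" case appears in the example response, so the other two are assumptions. The helper name extract_audio is hypothetical.

```python
import base64


def extract_audio(response: dict):
    """Return ("bytes", wav_bytes) for blob/hex payloads or ("url", link) for url payloads.

    Assumption: audio["format"] is one of "blob", "url", "hex", mirroring output_format.
    """
    audio = response["audio"]
    fmt = audio["format"]
    data = audio["data"]
    if fmt == "url":
        # Presigned link; download it before expires_in seconds have passed.
        return ("url", data)
    if fmt == "blob":
        # Assumed to be standard base64-encoded WAV bytes.
        return ("bytes", base64.b64decode(data))
    if fmt == "hex":
        # Assumed to be hex-encoded WAV bytes.
        return ("bytes", bytes.fromhex(data))
    raise ValueError(f"unexpected audio format: {fmt!r}")
```

For blob and hex responses the returned bytes can be written directly to a .wav file; for url responses the link still has to be fetched.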

curl -X POST https://getvrex.com/api/v1/tts \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en", "speed": 1.0}' \
  --output speech.wav
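The parameter ranges in the request body table can be checked client-side before a request is sent. The sketch below is one way to do that in Python, under stated assumptions: build_tts_payload is a hypothetical helper, and max_chars defaults to the 5000-character limit from the table (the section text recommends about 500 characters for synchronous calls, so a lower value may be appropriate there).

```python
LANGUAGES = {"auto", "en", "vi", "fr", "de", "es", "ja", "ko", "zh"}
OUTPUT_FORMATS = {"raw", "blob", "url", "hex"}


def build_tts_payload(text, *, voice_id=None, language="auto", speed=1.0,
                      quality=32, guidance_scale=2.0, denoise=True,
                      output_format="raw", max_chars=5000):
    """Validate arguments against the documented ranges and return a JSON-ready dict."""
    if not 1 <= len(text) <= max_chars:
        raise ValueError(f"text must be 1-{max_chars} characters")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not isinstance(quality, int) or not 4 <= quality <= 64:
        raise ValueError("quality must be an integer between 4 and 64")
    if not 0.0 <= guidance_scale <= 4.0:
        raise ValueError("guidance_scale must be between 0.0 and 4.0")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    payload = {
        "text": text,
        "language": language,
        "speed": speed,
        "quality": quality,
        "guidance_scale": guidance_scale,
        "denoise": denoise,
        "output_format": output_format,
    }
    # Omit voice_id so the account's default voice applies.
    if voice_id is not None:
        payload["voice_id"] = voice_id
    return payload
```

The returned dict can be serialized with json.dumps and sent as the body of the POST request shown in the curl example.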

Streaming Generation

Takes the same request body as synchronous generation. Returns a chunked audio/wav stream, so playback can begin before generation finishes, which reduces perceived latency. Only output_format: "raw" (the default) is supported. Response headers include X-Generation-Id, X-Language, and X-Chars.

POST /api/v1/tts/stream

Stream audio chunks in real-time as speech is synthesized.

Request Body

Name	Type	Required	Default	Description
text	string	Required	—	Text to convert to speech. 1–5000 characters.
voice_id	string	Optional	—	Voice identifier. Defaults to the account's default voice.
language	string	Optional	auto	Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed	number	Optional	1.0	Playback speed multiplier. Range: 0.5–2.0.
quality	integer	Optional	32	Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale	number	Optional	2.0	Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise	boolean	Optional	true	Apply denoising to the generated audio.
output_format	string	Optional	raw	Response format. Only raw (binary WAV) is supported for streaming.

200 Chunked audio/wav stream (chunked transfer encoding)

<chunked binary stream>

curl -X POST https://getvrex.com/api/v1/tts/stream \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en"}' \
  --output stream.wav
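The same streaming call can be made from Python with only the standard library. This is a sketch rather than a reference client: copy_stream and stream_tts_to_file are hypothetical helper names, and the code assumes the endpoint and Bearer header from the curl example above. Chunks are written to disk as they arrive instead of being buffered in memory.

```python
import json
import urllib.request

STREAM_URL = "https://getvrex.com/api/v1/tts/stream"


def copy_stream(source, sink, chunk_size=4096):
    """Copy a readable binary stream to a writable one in chunks; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


def stream_tts_to_file(text, api_key, path, language="auto"):
    """POST to the streaming endpoint and write audio to `path` as it arrives."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    request = urllib.request.Request(
        STREAM_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # Documented response header; None if the server omits it.
        generation_id = response.headers.get("X-Generation-Id")
        return generation_id, copy_stream(response, out)
```

A player that accepts a byte stream could be substituted for the output file to start playback before the response completes.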

Async Jobs

For long-form content up to 50,000 characters. Submit a job and poll for status or receive a webhook callback on completion.

POST /api/v1/jobs

Submit a long-form TTS job for asynchronous processing.

Request Body

Name	Type	Required	Default	Description
text	string	Required	—	Text to convert to speech. 1–50,000 characters.
voice_id	string	Optional	—	Voice identifier. Defaults to the account's default voice.
language	string	Optional	auto	Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed	number	Optional	1.0	Playback speed multiplier. Range: 0.5–2.0.
quality	integer	Optional	32	Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale	number	Optional	2.0	Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise	boolean	Optional	true	Apply denoising to the generated audio.
output_format	string	Optional	raw	Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON).
webhook_url	string	Optional	—	URL to receive a POST callback when the job completes.

202 Job accepted and queued

{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "pending"
}
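Submitting a job from Python might look like the following sketch. It assumes the same host and Bearer authentication as the curl examples for the other endpoints; build_job_request and submit_job are hypothetical helper names, and the 50,000-character check reflects the limit stated for async jobs.

```python
import json
import urllib.request

JOBS_URL = "https://getvrex.com/api/v1/jobs"
MAX_JOB_CHARS = 50000  # async limit stated in this section


def build_job_request(text, api_key, webhook_url=None, **options):
    """Build (but do not send) a POST request for a long-form TTS job.

    `options` may carry other documented body fields, e.g. language="en".
    """
    if not 1 <= len(text) <= MAX_JOB_CHARS:
        raise ValueError(f"text must be 1-{MAX_JOB_CHARS} characters")
    payload = {"text": text, **options}
    if webhook_url is not None:
        payload["webhook_url"] = webhook_url
    return urllib.request.Request(
        JOBS_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit_job(request):
    """Send the request and return the parsed JSON body (job_id, generation_id, status)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.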

GET /api/v1/jobs/:id

Poll job status. Use ?output_format=url for a presigned download link.

Query Parameters

Name	Type	Required	Default	Description
output_format	string	Optional	raw	Response format: raw returns the R2 object key; url returns a presigned download URL (valid for 1 hour).

200 Completed job with audio URL (output_format=url)

{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "completed",
  "audio_url": "https://r2.example.com/generations/...",
  "chars": 1500,
  "duration_ms": 12000,
  "created_at": "2026-01-15T10:30:00Z",
  "audio_url_expires_in": 3600
}
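Polling can be sketched as a simple loop that stops once the status is "completed". The code below is illustrative only. The "pending" and "completed" statuses come from the examples above; the "failed" status, the Bearer header on the GET request, and the helper names fetch_job and wait_for_job are assumptions. The fetch function is passed in so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

JOBS_URL = "https://getvrex.com/api/v1/jobs"


def fetch_job(job_id, api_key):
    """GET the job status, asking for a presigned download URL."""
    url = f"{JOBS_URL}/{job_id}?output_format=url"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}  # assumed auth scheme
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, job_id, *, interval_s=2.0, max_polls=300, sleep=time.sleep):
    """Call fetch(job_id) until the job completes; return the final job dict."""
    for _ in range(max_polls):
        job = fetch(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumption: not shown in the examples above
            raise RuntimeError(f"job {job_id} failed")
        # Any other status (e.g. "pending") means keep waiting.
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")
```

A call such as wait_for_job(lambda jid: fetch_job(jid, "sk-your-api-key"), "abc123") would return the completed job, whose audio_url should be downloaded within audio_url_expires_in seconds. When a webhook_url is supplied at submission, polling may be unnecessary.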