Vrex API Docs

Text-to-Speech

Convert text to lifelike speech. Three modes: synchronous (instant WAV response), streaming (chunked real-time audio), and async jobs (up to 50,000 chars with webhook delivery).

Synchronous Generation

Best for short text (max 500 chars). Use output_format to control the response shape: raw returns binary WAV (default), blob/url/hex return structured JSON with generation metadata.

POST /api/v1/tts

Generate speech from text. The response format depends on the output_format parameter.

Request Body

Name | Type | Required | Default | Description
text | string | Required | | Text to convert to speech. 1–5000 characters.
voice_id | string | Optional | | Voice identifier. Defaults to the account's default voice.
language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0.
quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise | boolean | Optional | true | Apply denoising to the generated audio.
output_format | string | Optional | raw | Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON).

200 Binary WAV (output_format=raw)
<binary audio/wav>
200 JSON (output_format=blob|url|hex)
{
  "generation_id": "gen_456",
  "status": "completed",
  "text": "Hello world",
  "voice_id": null,
  "language": "en",
  "chars": 11,
  "duration_ms": 1200,
  "audio": {
    "format": "url",
    "data": "https://r2.example.com/generations/...",
    "content_type": "audio/wav",
    "size_bytes": 52800,
    "expires_in": 3600
  },
  "created_at": "2026-01-15T10:30:00Z"
}
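
The JSON envelope above could be unpacked along the following lines. The documented example only shows the url variant, so this sketch assumes that audio.data holds base64 text for blob and a hexadecimal string for hex. The helper name extract_audio is hypothetical and not part of the API.

```python
import base64
from typing import Union


def extract_audio(response: dict) -> Union[bytes, str]:
    """Return WAV bytes for blob/hex responses, or the presigned URL for url.

    Assumption: audio.data is base64 text for "blob" and a hex string for
    "hex"; only the "url" shape is shown in the documentation.
    """
    audio = response["audio"]
    fmt, data = audio["format"], audio["data"]
    if fmt == "url":
        # Presigned link; audio["expires_in"] is 3600 in the example,
        # presumably seconds.
        return data
    if fmt == "blob":
        return base64.b64decode(data)
    if fmt == "hex":
        return bytes.fromhex(data)
    raise ValueError(f"unexpected audio format: {fmt!r}")
```

For a url response the caller still has to download the file before the link expires. For blob and hex the returned bytes can be written straight to a .wav file.
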
curl -X POST https://getvrex.com/api/v1/tts \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en", "speed": 1.0}' \
  --output speech.wav
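
The same request can be sketched in Python using only the standard library. The function names build_tts_request and synthesize are hypothetical. The client-side range checks mirror the request body table and are an optional convenience, not something the API requires.

```python
import json
import urllib.request

API_BASE = "https://getvrex.com/api/v1"  # host taken from the curl example

# Allowed values copied from the request body table.
LANGUAGES = {"auto", "en", "vi", "fr", "de", "es", "ja", "ko", "zh"}
OUTPUT_FORMATS = {"raw", "blob", "url", "hex"}


def build_tts_request(api_key: str, text: str, **options) -> urllib.request.Request:
    """Validate options locally and build (but do not send) the POST request."""
    if not text:
        raise ValueError("text must not be empty")
    if options.get("language", "auto") not in LANGUAGES:
        raise ValueError("unsupported language code")
    if not 0.5 <= options.get("speed", 1.0) <= 2.0:
        raise ValueError("speed must be within 0.5-2.0")
    if not 4 <= options.get("quality", 32) <= 64:
        raise ValueError("quality must be within 4-64")
    if not 0.0 <= options.get("guidance_scale", 2.0) <= 4.0:
        raise ValueError("guidance_scale must be within 0.0-4.0")
    if options.get("output_format", "raw") not in OUTPUT_FORMATS:
        raise ValueError("unsupported output_format")
    body = json.dumps({"text": text, **options}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/tts",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize(api_key: str, text: str, path: str = "speech.wav", **options) -> None:
    """Send the request and save the body; intended for the default raw output."""
    request = build_tts_request(api_key, text, **options)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling synthesize("sk-your-api-key", "Hello world", language="en") would write speech.wav, matching the curl command above.
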

Streaming Generation

Takes the same request body as synchronous generation. Returns a chunked audio/wav stream, so playback can begin before generation finishes, which reduces perceived latency. Only output_format: "raw" (the default) is supported. Response headers include X-Generation-Id, X-Language, and X-Chars.

POST /api/v1/tts/stream

Stream audio chunks in real time as speech is synthesized.

Request Body

Name | Type | Required | Default | Description
text | string | Required | | Text to convert to speech. 1–5000 characters.
voice_id | string | Optional | | Voice identifier. Defaults to the account's default voice.
language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0.
quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise | boolean | Optional | true | Apply denoising to the generated audio.
output_format | string | Optional | raw | Response format. Only raw (binary WAV) is supported on this endpoint.

200 Chunked audio/wav stream (Transfer-Encoding: chunked)
<chunked binary stream>
curl -X POST https://getvrex.com/api/v1/tts/stream \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en"}' \
  --output stream.wav
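
A streaming client might read the response incrementally rather than waiting for the whole body. The sketch below uses Python's standard library, which decodes chunked transfer encoding transparently. The names copy_chunks and stream_tts and the 4096-byte read size are illustrative choices, not part of the API.

```python
import json
import urllib.request
from typing import BinaryIO


def copy_chunks(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy source to sink piece by piece; return the number of bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


def stream_tts(api_key: str, text: str, path: str = "stream.wav") -> dict:
    """POST to the streaming endpoint and write audio to disk as it arrives."""
    request = urllib.request.Request(
        "https://getvrex.com/api/v1/tts/stream",
        data=json.dumps({"text": text}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # Metadata headers listed in the section above.
        meta = {
            name: response.headers.get(name)
            for name in ("X-Generation-Id", "X-Language", "X-Chars")
        }
        meta["bytes_written"] = copy_chunks(response, out)
    return meta
```

To play audio while it downloads, the sink passed to copy_chunks could be an audio player's input pipe instead of a file.
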

Async Jobs

For long-form content up to 50,000 characters. Submit a job and poll for status or receive a webhook callback on completion.

POST /api/v1/jobs

Submit a long-form TTS job for asynchronous processing.

Request Body

Name | Type | Required | Default | Description
text | string | Required | | Text to convert to speech. 1–50,000 characters.
voice_id | string | Optional | | Voice identifier. Defaults to the account's default voice.
language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh.
speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0.
quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer.
guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise.
denoise | boolean | Optional | true | Apply denoising to the generated audio.
output_format | string | Optional | raw | Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON).
webhook_url | string | Optional | | URL to receive a POST callback when the job completes.

202 Job accepted and queued
{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "pending"
}
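
Submitting a job could look roughly like this in Python. The sketch assumes the 50,000-character limit from the section overview applies to the text field and that the endpoint uses the same Bearer authentication as the curl examples. The names build_job_request and submit_job are hypothetical.

```python
import json
import urllib.request
from typing import Optional

MAX_JOB_CHARS = 50_000  # limit stated in the Async Jobs overview


def build_job_request(
    api_key: str, text: str, webhook_url: Optional[str] = None, **options
) -> urllib.request.Request:
    """Build (but do not send) a job submission request."""
    if not 1 <= len(text) <= MAX_JOB_CHARS:
        raise ValueError(f"text must be 1-{MAX_JOB_CHARS} characters")
    payload = {"text": text, **options}
    if webhook_url is not None:
        payload["webhook_url"] = webhook_url  # callback on completion
    return urllib.request.Request(
        "https://getvrex.com/api/v1/jobs",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed, as in the curl examples
            "Content-Type": "application/json",
        },
    )


def submit_job(api_key: str, text: str, **kwargs) -> str:
    """Send the job and return the job_id from the 202 response."""
    with urllib.request.urlopen(build_job_request(api_key, text, **kwargs)) as response:
        return json.load(response)["job_id"]
```

The returned job_id is what the status endpoint expects in its path.
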

GET /api/v1/jobs/:id

Poll job status. Use ?output_format=url for a presigned download link.

Query Parameters

Name | Type | Required | Default | Description
output_format | string | Optional | raw | raw returns the R2 object key; url returns a presigned download URL (valid 1 hour).

200 Completed job with audio URL (output_format=url)
{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "completed",
  "audio_url": "https://r2.example.com/generations/...",
  "chars": 1500,
  "duration_ms": 12000,
  "created_at": "2026-01-15T10:30:00Z",
  "audio_url_expires_in": 3600
}
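
When no webhook is configured, a client can poll until the job finishes. The following is a minimal sketch under several assumptions: the status endpoint accepts the same Bearer header as the other examples, and only the "pending" and "completed" statuses shown in the responses above are handled, since other status values are not documented here. The names fetch_job and wait_for_job, the two-second interval, and the poll cap are illustrative.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable


def fetch_job(api_key: str, job_id: str) -> dict:
    """GET the job status, asking for a presigned URL on completion."""
    job_path = urllib.parse.quote(job_id, safe="")
    request = urllib.request.Request(
        f"https://getvrex.com/api/v1/jobs/{job_path}?output_format=url",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the job reports "completed" or max_polls is reached."""
    for _ in range(max_polls):
        job = fetch()
        if job["status"] == "completed":
            return job
        sleep(interval)  # wait before asking again
    raise TimeoutError("job did not complete within the polling budget")
```

A typical call would be wait_for_job(lambda: fetch_job(api_key, job_id)). The audio_url in the result should be downloaded within the audio_url_expires_in window.
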