Text-to-Speech
Convert text to lifelike speech. Three modes: synchronous (instant WAV response), streaming (chunked real-time audio), and async jobs (up to 50,000 chars with webhook delivery).
Synchronous Generation
Best for short text (max 500 chars). Use output_format to control the response shape: raw returns binary WAV (default), blob/url/hex return structured JSON with generation metadata.
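The structured formats can be unpacked on the client with a few lines of code. The sketch below is a minimal illustration, not the official client: the `decode_audio` helper is hypothetical, and it assumes that `audio.data` carries standard base64 text for blob and a plain hex string for hex, which the format names suggest but this page does not spell out.

```python
import base64
from typing import Optional


def decode_audio(response_json: dict) -> Optional[bytes]:
    """Return WAV bytes for blob/hex responses, or None for url responses."""
    audio = response_json["audio"]
    fmt = audio["format"]
    if fmt == "blob":
        # Assumption: "data" is standard base64 text.
        return base64.b64decode(audio["data"])
    if fmt == "hex":
        # Assumption: "data" is a hex string with no prefix or separators.
        return bytes.fromhex(audio["data"])
    if fmt == "url":
        # "data" is a presigned URL; the caller downloads it separately.
        return None
    raise ValueError(f"unexpected audio format: {fmt!r}")
```

For url responses the helper returns None so that the caller can decide when to fetch the presigned link before it expires.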
/api/v1/tts
Generate speech from text. Response format depends on the output_format parameter.
Request Body
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Required | — | Text to convert to speech. 1–5000 characters. |
| voice_id | string | Optional | — | Voice identifier. Defaults to the account's default voice. |
| language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh. |
| speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0. |
| quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer. |
| guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise. |
| denoise | boolean | Optional | true | Apply denoising to the generated audio. |
| output_format | string | Optional | raw | Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON). |
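Checking the documented ranges on the client can catch bad requests before they are sent. The following is a rough sketch under stated assumptions: `build_tts_payload` is a hypothetical helper, the character limit is passed in by the caller rather than hard-coded, and the server may count characters differently from Python's `len`.

```python
SUPPORTED_LANGUAGES = {"auto", "en", "vi", "fr", "de", "es", "ja", "ko", "zh"}
OUTPUT_FORMATS = {"raw", "blob", "url", "hex"}


def build_tts_payload(text, max_chars, voice_id=None, language="auto",
                      speed=1.0, quality=32, guidance_scale=2.0,
                      denoise=True, output_format="raw"):
    """Validate parameters against the table above and return a JSON-ready dict."""
    # len() counts Unicode code points; the server's counting rule is an assumption.
    if not 1 <= len(text) <= max_chars:
        raise ValueError(f"text must be 1 to {max_chars} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 4 <= quality <= 64:
        raise ValueError("quality must be between 4 and 64")
    if not 0.0 <= guidance_scale <= 4.0:
        raise ValueError("guidance_scale must be between 0.0 and 4.0")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    payload = {
        "text": text,
        "language": language,
        "speed": speed,
        "quality": quality,
        "guidance_scale": guidance_scale,
        "denoise": denoise,
        "output_format": output_format,
    }
    # Omit voice_id so the account's default voice applies.
    if voice_id is not None:
        payload["voice_id"] = voice_id
    return payload
```

The returned dict can be serialized with `json.dumps` and sent as the request body.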
<binary audio/wav>
{
  "generation_id": "gen_456",
  "status": "completed",
  "text": "Hello world",
  "voice_id": null,
  "language": "en",
  "chars": 11,
  "duration_ms": 1200,
  "audio": {
    "format": "url",
    "data": "https://r2.example.com/generations/...",
    "content_type": "audio/wav",
    "size_bytes": 52800,
    "expires_in": 3600
  },
  "created_at": "2026-01-15T10:30:00Z"
}
curl -X POST https://getvrex.com/api/v1/tts \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en", "speed": 1.0}' \
  --output speech.wav
Streaming Generation
Takes the same request body as synchronous generation. Returns a chunked audio/wav stream, so audio can begin playing before generation finishes, which reduces perceived latency. Only output_format: "raw" (the default) is supported. Response headers include X-Generation-Id, X-Language, and X-Chars.
/api/v1/tts/stream
Stream audio chunks in real time as speech is synthesized.
Request Body
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Required | — | Text to convert to speech. 1–5000 characters. |
| voice_id | string | Optional | — | Voice identifier. Defaults to the account's default voice. |
| language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh. |
| speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0. |
| quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer. |
| guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise. |
| denoise | boolean | Optional | true | Apply denoising to the generated audio. |
| output_format | string | Optional | raw | Response format. Streaming supports only raw (chunked binary WAV). |
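A client can write the stream to disk chunk by chunk instead of buffering the whole response. The sketch below uses only the Python standard library. The helper names `copy_chunks` and `stream_to_file` are hypothetical, and the 4096-byte chunk size is an arbitrary choice, not a value from this API.

```python
import json
import urllib.request
from typing import BinaryIO


def copy_chunks(source: BinaryIO, dest: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy source to dest in fixed-size reads and return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream has ended
            break
        dest.write(chunk)
        total += len(chunk)
    return total


def stream_to_file(payload: dict, api_key: str, path: str) -> dict:
    """POST to the streaming endpoint and save audio as it arrives."""
    req = urllib.request.Request(
        "https://getvrex.com/api/v1/tts/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urlopen decodes chunked transfer encoding, so read() yields audio bytes.
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        headers = {
            name: resp.headers.get(name)
            for name in ("X-Generation-Id", "X-Language", "X-Chars")
        }
        copy_chunks(resp, out)
    return headers
```

A player that needs low latency would hand each chunk to an audio sink instead of a file; `copy_chunks` accepts any object with a `write` method.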
<chunked binary stream>
curl -X POST https://getvrex.com/api/v1/tts/stream \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en"}' \
  --output stream.wav
Async Jobs
For long-form content up to 50,000 characters. Submit a job and poll for status or receive a webhook callback on completion.
/api/v1/jobs
Submit a long-form TTS job for asynchronous processing.
Request Body
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Required | — | Text to convert to speech. 1–50,000 characters. |
| voice_id | string | Optional | — | Voice identifier. Defaults to the account's default voice. |
| language | string | Optional | auto | Language code: auto, en, vi, fr, de, es, ja, ko, zh. |
| speed | number | Optional | 1.0 | Playback speed multiplier. Range: 0.5–2.0. |
| quality | integer | Optional | 32 | Inference quality steps. Range: 4–64. Higher values produce better audio but take longer. |
| guidance_scale | number | Optional | 2.0 | Controls how closely output matches the reference voice. Range: 0.0–4.0. Higher = more precise. |
| denoise | boolean | Optional | true | Apply denoising to the generated audio. |
| output_format | string | Optional | raw | Response format: raw (binary WAV), blob (base64 JSON), url (presigned URL JSON), hex (hex-encoded JSON). |
| webhook_url | string | Optional | — | URL to receive a POST callback when the job completes. |
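A job submission could be assembled as follows. This is a sketch, not a confirmed client: it assumes the endpoint accepts POST with the same Bearer authentication as the synchronous endpoint, and `build_job_request` is a hypothetical helper. It builds the request without sending it, so the body and headers can be inspected first.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://getvrex.com/api/v1"


def build_job_request(text: str, api_key: str,
                      webhook_url: Optional[str] = None) -> urllib.request.Request:
    """Build, but do not send, a request that submits an async TTS job."""
    body = {"text": text}
    # Leave webhook_url out entirely when the caller intends to poll.
    if webhook_url is not None:
        body["webhook_url"] = webhook_url
    return urllib.request.Request(
        f"{API_BASE}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # assumption: job submission uses POST
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; the response body can then be parsed with `json.load`.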
{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "pending"
}
/api/v1/jobs/:id
Poll job status. Use ?output_format=url for a presigned download link.
Query Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| output_format | string | Optional | raw | raw returns the R2 object key; url returns a presigned download URL (valid for 1 hour). |
{
  "job_id": "abc123",
  "generation_id": "gen_456",
  "status": "completed",
  "audio_url": "https://r2.example.com/generations/...",
  "chars": 1500,
  "duration_ms": 12000,
  "created_at": "2026-01-15T10:30:00Z",
  "audio_url_expires_in": 3600
}
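Polling can be sketched as a loop that re-fetches the job until it reaches a finished state. The code below is illustrative only: `fetch_job` and `wait_for_job` are hypothetical helpers, the GET method is assumed, and the interval and attempt limit are arbitrary. Only the "pending" and "completed" statuses appear on this page, so any other finished status (a failure state, for example) has to be supplied by the caller through `terminal_statuses`.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable


def fetch_job(job_id: str, api_key: str) -> dict:
    """Fetch one job status snapshot with a presigned download URL."""
    req = urllib.request.Request(
        f"https://getvrex.com/api/v1/jobs/{job_id}?output_format=url",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",  # assumption: status polling uses GET
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_s: float = 2.0,
                 max_attempts: int = 30,
                 terminal_statuses: Iterable[str] = ("completed",),
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status until the job reaches a terminal status."""
    terminal = set(terminal_statuses)
    for attempt in range(max_attempts):
        job = fetch_status()
        if job["status"] in terminal:
            return job
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")
```

A typical call would be `wait_for_job(lambda: fetch_job("abc123", api_key))`. Supplying a webhook_url at submission avoids polling altogether.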