Speech-to-Text Transcription

Metadata

Category: compute
SDK: @0glabs/0g-serving-broker ^0.6.5, ethers ^6.13.0
Activation Triggers: "transcribe", "speech-to-text", "Whisper", "audio transcription"

Purpose

Transcribe audio files using 0G Compute Network providers running Whisper Large V3. Supported input formats are mp3, wav, ogg, flac, and webm; output can be JSON, verbose JSON, plain text, or SRT subtitles.

Prerequisites

Node.js >= 22
@0glabs/0g-serving-broker and ethers installed
A funded account and an acknowledged provider that offers a speech-to-text service
Audio file in supported format (mp3, wav, ogg, flac, webm)
.env with PRIVATE_KEY, RPC_URL, PROVIDER_ADDRESS
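
The three environment variables can be checked up front so that a missing value fails fast instead of surfacing later as an opaque broker or RPC error. The following is a minimal sketch; `loadConfig` and `TranscriptionConfig` are hypothetical names used for illustration and are not part of the SDK.

```typescript
// Hypothetical helper: validates the variables listed in the prerequisites.
interface TranscriptionConfig {
  privateKey: string;
  rpcUrl: string;
  providerAddress: string;
}

function loadConfig(env: Record<string, string | undefined>): TranscriptionConfig {
  const required = ['PRIVATE_KEY', 'RPC_URL', 'PROVIDER_ADDRESS'] as const;
  // Collect every missing name so the error reports them all at once.
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    privateKey: env.PRIVATE_KEY as string,
    rpcUrl: env.RPC_URL as string,
    providerAddress: env.PROVIDER_ADDRESS as string,
  };
}
```

In a script this would typically be called once at startup as `loadConfig(process.env)`, after `dotenv/config` has been imported.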

Quick Workflow

Initialize broker
Get service metadata (endpoint, model)
Create FormData with audio file and parameters
Generate auth headers
Make transcription request
Extract ChatID from ZG-Res-Key header ONLY
Call processResponse(providerAddress, chatID, usageData)

Core Rules

ALWAYS

Use FormData for audio upload (not JSON)
Get ChatID from ZG-Res-Key header (no body fallback for speech)
Call processResponse() after every transcription
Use correct processResponse() param order: (providerAddress, chatID, usageData)
Include usage data if available in response
Acknowledge provider before first use

NEVER

Send audio as base64 in JSON body (use FormData)
Skip processResponse() after transcription
Try to get ChatID from response body for speech-to-text
Hardcode private keys
Use ethers v5 syntax
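
The header-only rule for the ChatID can be isolated in a small function. This is a sketch under the assumption that the response is a standard Fetch API `Response`; `extractChatID` is a hypothetical name, not an SDK function.

```typescript
// Hypothetical helper: reads the ChatID from the response headers only.
// Headers.get() matches header names case-insensitively, so a single
// lookup covers both "ZG-Res-Key" and "zg-res-key".
function extractChatID(headers: { get(name: string): string | null }): string | undefined {
  const value = headers.get('ZG-Res-Key');
  if (value === null) return undefined;
  const trimmed = value.trim();
  // Treat an empty header the same as a missing one.
  return trimmed === '' ? undefined : trimmed;
}
```

Returning `undefined` rather than `null` keeps the result usable as an optional string argument, for example `extractChatID(response.headers)`.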

Code Examples

Basic Transcription

import { ethers } from 'ethers';
import { createZGComputeNetworkBroker } from '@0glabs/0g-serving-broker';
import * as fs from 'fs';
import 'dotenv/config';

async function transcribe(audioPath: string): Promise<string> {
  const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
  const wallet = new ethers.Wallet(process.env.PRIVATE_KEY!, provider);
  const broker = await createZGComputeNetworkBroker(wallet);

  const providerAddress = process.env.PROVIDER_ADDRESS!;
  const { endpoint, model } = await broker.inference.getServiceMetadata(providerAddress);
  const headers = await broker.inference.getRequestHeaders(providerAddress);

  const formData = new FormData();
  const audioBuffer = fs.readFileSync(audioPath);
  const audioBlob = new Blob([audioBuffer]);
  formData.append('file', audioBlob, audioPath.split('/').pop());
  formData.append('model', model);
  formData.append('response_format', 'json');

  const response = await fetch(`${endpoint}/audio/transcriptions`, {
    method: 'POST',
    headers: { ...headers },
    body: formData,
  });

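  // Fail early on HTTP errors instead of parsing an error body as a transcript
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const data = await response.json();

  // ChatID comes from the header ONLY for speech-to-text.
  // Header lookup is case-insensitive; map a missing header (null) to undefined.
  const chatID = response.headers.get('ZG-Res-Key') ?? undefined;

  await broker.inference.processResponse(
    providerAddress,
    chatID,
    data.usage ? JSON.stringify(data.usage) : undefined,
  );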

  return data.text;
}

// Usage
const text = await transcribe('./audio/podcast.mp3');
console.log('Transcription:', text);

Transcription with Format Options

type OutputFormat = 'json' | 'text' | 'srt' | 'verbose_json';

async function transcribeWithFormat(
  audioPath: string,
  format: OutputFormat = 'json',
  language?: string,
): Promise<any> {
  const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
  const wallet = new ethers.Wallet(process.env.PRIVATE_KEY!, provider);
  const broker = await createZGComputeNetworkBroker(wallet);

  const providerAddress = process.env.PROVIDER_ADDRESS!;
  const { endpoint, model } = await broker.inference.getServiceMetadata(providerAddress);
  const headers = await broker.inference.getRequestHeaders(providerAddress);

  const formData = new FormData();
  const audioBuffer = fs.readFileSync(audioPath);
  formData.append('file', new Blob([audioBuffer]), audioPath.split('/').pop());
  formData.append('model', model);
  formData.append('response_format', format);
  if (language) formData.append('language', language);

  const response = await fetch(`${endpoint}/audio/transcriptions`, {
    method: 'POST',
    headers: { ...headers },
    body: formData,
  });

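  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  // Header lookup is case-insensitive; map a missing header (null) to undefined.
  const chatID = response.headers.get('ZG-Res-Key') ?? undefined;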

  if (format === 'text' || format === 'srt') {
    const text = await response.text();
    if (chatID) {
      await broker.inference.processResponse(providerAddress, chatID);
    }
    return text;
  }

  const data = await response.json();
  await broker.inference.processResponse(
    providerAddress,
    chatID,
    data.usage ? JSON.stringify(data.usage) : undefined,
  );

  return data;
}

// Usage
const srt = await transcribeWithFormat('./audio/meeting.mp3', 'srt', 'en');
fs.mkdirSync('./output', { recursive: true }); // make sure the output directory exists
fs.writeFileSync('./output/meeting.srt', srt);

Error Handling

async function safeTranscribe(audioPath: string): Promise<string | null> {
  const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
  const wallet = new ethers.Wallet(process.env.PRIVATE_KEY!, provider);
  const broker = await createZGComputeNetworkBroker(wallet);

  const providerAddress = process.env.PROVIDER_ADDRESS!;

  try {
    // Validate file exists
    if (!fs.existsSync(audioPath)) {
      throw new Error(`Audio file not found: ${audioPath}`);
    }

    // Validate file size (most providers have limits)
    const stats = fs.statSync(audioPath);
    const maxSize = 25 * 1024 * 1024; // 25MB typical limit
    if (stats.size > maxSize) {
      throw new Error(`File too large (${stats.size} bytes). Max: ${maxSize} bytes`);
    }

    const { endpoint, model } = await broker.inference.getServiceMetadata(providerAddress);
    const headers = await broker.inference.getRequestHeaders(providerAddress);

    const formData = new FormData();
    const audioBuffer = fs.readFileSync(audioPath);
    formData.append('file', new Blob([audioBuffer]), audioPath.split('/').pop());
    formData.append('model', model);
    formData.append('response_format', 'json');

    const response = await fetch(`${endpoint}/audio/transcriptions`, {
      method: 'POST',
      headers: { ...headers },
      body: formData,
    });

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${await response.text()}`);
    }

    const data = await response.json();
    // Header lookup is case-insensitive; map a missing header (null) to undefined.
    const chatID = response.headers.get('ZG-Res-Key') ?? undefined;

    await broker.inference.processResponse(
      providerAddress,
      chatID,
      data.usage ? JSON.stringify(data.usage) : undefined,
    );

    return data.text;
  } catch (error) {
    console.error('Transcription failed:', error);
    return null;
  }
}

Supported Audio Formats

Format	Extension	Notes
MP3	`.mp3`	Most common
WAV	`.wav`	Uncompressed
OGG	`.ogg`	Compressed
FLAC	`.flac`	Lossless
WebM	`.webm`	Web native
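
The extension can be checked before upload, and the same lookup can supply a MIME type for the `Blob`. This is a sketch only: `audioMimeType` is a hypothetical helper, and the MIME types are the commonly used ones for these formats, not values confirmed by any provider.

```typescript
// Assumed MIME types for the formats in the table above.
const AUDIO_MIME_TYPES: Record<string, string> = {
  '.mp3': 'audio/mpeg',
  '.wav': 'audio/wav',
  '.ogg': 'audio/ogg',
  '.flac': 'audio/flac',
  '.webm': 'audio/webm',
};

// Hypothetical helper: returns the MIME type for a supported file,
// or throws when the extension is not in the table.
function audioMimeType(filePath: string): string {
  const dot = filePath.lastIndexOf('.');
  const ext = dot === -1 ? '' : filePath.slice(dot).toLowerCase();
  const mime = AUDIO_MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported audio format "${ext}". Use mp3, wav, ogg, flac, or webm.`);
  }
  return mime;
}
```

The result can be passed as the `type` option when building the upload, for example `new Blob([audioBuffer], { type: audioMimeType(audioPath) })`.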

Output Formats

Format	Description
`json`	`{ "text": "..." }`
`text`	Plain text string
`srt`	SubRip subtitle format
`verbose_json`	Includes timestamps and segments
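
To illustrate what the `srt` output contains, the sketch below splits SubRip text into cues. It assumes the provider returns standard SubRip blocks (an index line, a `start --> end` timing line, then the text); `parseSrt` and `SrtCue` are hypothetical names for illustration.

```typescript
interface SrtCue {
  index: number;
  start: string; // "HH:MM:SS,mmm"
  end: string; // "HH:MM:SS,mmm"
  text: string;
}

// Hypothetical helper: parses standard SubRip text into cue objects.
function parseSrt(srt: string): SrtCue[] {
  return srt
    .replace(/\r\n/g, '\n') // normalize Windows line endings
    .trim()
    .split(/\n{2,}/) // cues are separated by blank lines
    .filter((block) => block.trim() !== '')
    .map((block) => {
      const [indexLine, timeLine, ...textLines] = block.split('\n');
      const match = /^(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})/.exec(timeLine ?? '');
      if (!match) {
        throw new Error(`Malformed SRT cue: ${block}`);
      }
      return {
        index: Number(indexLine),
        start: match[1],
        end: match[2],
        text: textLines.join('\n'),
      };
    });
}
```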

Cost Estimate

~0.0001 0G per minute of audio (varies by provider).
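
As a worked example of the arithmetic, the sketch below multiplies the duration in minutes by a per-minute rate. The default rate is the approximate figure quoted above and is an assumption, since actual pricing varies by provider; `estimateCost` is a hypothetical helper.

```typescript
// Assumed rate taken from the estimate above; real prices vary by provider.
const ASSUMED_RATE_PER_MINUTE = 0.0001; // in 0G

// Hypothetical helper: rough cost in 0G for a clip of the given length.
function estimateCost(durationSeconds: number, ratePerMinute: number = ASSUMED_RATE_PER_MINUTE): number {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError('durationSeconds must be a non-negative number');
  }
  return (durationSeconds / 60) * ratePerMinute;
}
```

At the assumed rate, a 90-minute recording (5400 seconds) comes to roughly 0.009 0G.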

Anti-Patterns

// BAD: Sending audio as JSON
const response = await fetch(endpoint, {
  body: JSON.stringify({ audio: base64Data }), // WRONG — use FormData
});

// BAD: Getting chatID from body
const chatID = data.id; // WRONG for speech — header only

// BAD: Missing processResponse
const data = await response.json();
return data.text; // processResponse() never called!

// BAD: Hardcoding private keys
const wallet = new ethers.Wallet('0xabc123...', provider); // NEVER do this

// BAD: ethers v5 syntax
const provider = new ethers.providers.JsonRpcProvider(url); // v5!

Common Errors & Fixes

Error	Cause	Fix
`Insufficient balance`	Sub-account empty	Transfer more funds
`unsupported format`	Wrong audio format	Use mp3, wav, ogg, flac, or webm
`file too large`	Audio file too big	Split into smaller segments
`Fee verification failed`	Missing chatID	Check `ZG-Res-Key` header
`Provider not acknowledged`	First-time provider	Call `acknowledgeProviderSigner()` before the first request
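
The table can be turned into a lookup that attaches a suggested fix to a caught error. This is a sketch: `suggestFix` is a hypothetical helper, and matching on these substrings assumes that real error messages contain the text shown in the Error column.

```typescript
// Error substrings and fixes taken from the table above (lowercased for matching).
const KNOWN_FIXES: Array<[string, string]> = [
  ['insufficient balance', 'Transfer more funds to the provider sub-account'],
  ['unsupported format', 'Use mp3, wav, ogg, flac, or webm'],
  ['file too large', 'Split the audio into smaller segments'],
  ['fee verification failed', 'Check that the ChatID was read from the ZG-Res-Key header'],
  ['provider not acknowledged', 'Call acknowledgeProviderSigner() before the first request'],
];

// Hypothetical helper: returns the suggested fix for a known error message,
// or undefined when the message matches nothing in the table.
function suggestFix(message: string): string | undefined {
  const lower = message.toLowerCase();
  const hit = KNOWN_FIXES.find(([pattern]) => lower.includes(pattern));
  return hit?.[1];
}
```

Inside a `catch` block this could be used as `suggestFix(String(error))` to log a hint alongside the original error.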

Related Skills

Provider Discovery — find speech providers
Account Management — fund accounts
Compute + Storage — transcribe and store
