NVIDIA NIM · Arazzo Workflow

NVIDIA NIM Voice Assistant Loop

Version 1.0.0

Transcribe an audio clip with Riva ASR, answer the transcript with an LLM, then synthesize the reply with Riva TTS.

1 workflow 2 source APIs 1 provider

View Spec View on GitHub AIArtificial IntelligenceInferenceMicroservicesLLMFoundation ModelsGPUKubernetesNVIDIAOpenAI CompatibleArazzoWorkflows

Provider

nvidia-nim

Workflows

voice-assistant-loop

Speech-to-text, chat answer, then text-to-speech in a single loop.

Transcribes an audio clip, generates a chat answer to the transcript, and synthesizes the answer back into audio.

3 steps inputs: apiKey, asrModel, audioFile, chatModel, ttsModel, voice outputs: audio, replyText, transcript

transcribeAudio

createTranscription

Transcribe the uploaded audio clip into text using a Riva ASR NIM via a multipart/form-data upload.

answerTranscript

createChatCompletion

Send the transcript to a chat model to generate a spoken-style reply.

synthesizeReply

createSpeech

Synthesize the chat reply back into audio bytes using a Riva TTS NIM.

Source API Descriptions

openapi

speechApi https://raw.githubusercontent.com/api-evangelist/nvidia-nim/refs/heads/main/openapi/nvidia-nim-speech-api-openapi.yml

openapi

chatCompletionsApi https://raw.githubusercontent.com/api-evangelist/nvidia-nim/refs/heads/main/openapi/nvidia-nim-chat-completions-api-openapi.yml

Arazzo Workflow Specification

arazzo: 1.0.1
info:
  title: NVIDIA NIM Voice Assistant Loop
  summary: Transcribe an audio clip with Riva ASR, answer the transcript with an LLM, then synthesize the reply with Riva TTS.
  description: >-
    A full speech-to-speech assistant loop built from NVIDIA Riva and LLM NIMs.
    An uploaded audio clip is transcribed to text by an ASR NIM (Parakeet /
    Canary), the transcript is answered by an OpenAI-compatible chat model, and
    the textual answer is synthesized back to audio by a TTS NIM (Magpie-TTS /
    FastPitch). The transcription step uses multipart/form-data per the spec.
    Every step spells out its request inline so the flow can be read and
    executed without opening the underlying OpenAPI description.
  version: 1.0.0
sourceDescriptions:
- name: speechApi
  url: ../openapi/nvidia-nim-speech-api-openapi.yml
  type: openapi
- name: chatCompletionsApi
  url: ../openapi/nvidia-nim-chat-completions-api-openapi.yml
  type: openapi
workflows:
- workflowId: voice-assistant-loop
  summary: Speech-to-text, chat answer, then text-to-speech in a single loop.
  description: >-
    Transcribes an audio clip, generates a chat answer to the transcript, and
    synthesizes the answer back into audio.
  inputs:
    type: object
    required:
    - apiKey
    - audioFile
    properties:
      apiKey:
        type: string
        description: NVIDIA developer API key (nvapi-...) sent as a Bearer token.
      audioFile:
        type: string
        format: binary
        description: WAV/FLAC/MP3 audio clip to transcribe.
      asrModel:
        type: string
        description: Riva ASR model id.
        default: nvidia/parakeet-ctc-1.1b-asr
      chatModel:
        type: string
        description: LLM model id used to answer the transcript.
        default: meta/llama-3.3-70b-instruct
      ttsModel:
        type: string
        description: Riva TTS model id.
        default: nvidia/magpie-tts
      voice:
        type: string
        description: TTS voice identifier.
        default: en-US.Female-1
  steps:
  - stepId: transcribeAudio
    description: >-
      Transcribe the uploaded audio clip into text using a Riva ASR NIM via a
      multipart/form-data upload.
    operationId: createTranscription
    parameters:
    - name: Authorization
      in: header
      value: Bearer $inputs.apiKey
    requestBody:
      contentType: multipart/form-data
      payload:
        file: $inputs.audioFile
        model: $inputs.asrModel
        language: en-US
        response_format: json
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      transcript: $response.body#/text
      detectedLanguage: $response.body#/language
  - stepId: answerTranscript
    description: >-
      Send the transcript to a chat model to generate a spoken-style reply.
    operationId: createChatCompletion
    parameters:
    - name: Authorization
      in: header
      value: Bearer $inputs.apiKey
    requestBody:
      contentType: application/json
      payload:
        model: $inputs.chatModel
        messages:
        - role: system
          content: You are a concise voice assistant. Reply in one or two short spoken sentences.
        - role: user
          content: $steps.transcribeAudio.outputs.transcript
        max_tokens: 256
        temperature: 0.4
        stream: false
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      replyText: $response.body#/choices/0/message/content
      totalTokens: $response.body#/usage/total_tokens
  - stepId: synthesizeReply
    description: >-
      Synthesize the chat reply back into audio bytes using a Riva TTS NIM.
    operationId: createSpeech
    parameters:
    - name: Authorization
      in: header
      value: Bearer $inputs.apiKey
    requestBody:
      contentType: application/json
      payload:
        model: $inputs.ttsModel
        input: $steps.answerTranscript.outputs.replyText
        voice: $inputs.voice
        response_format: mp3
        speed: 1.0
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      audio: $response.body
  outputs:
    transcript: $steps.transcribeAudio.outputs.transcript
    replyText: $steps.answerTranscript.outputs.replyText
    audio: $steps.synthesizeReply.outputs.audio