Hugging Face · Arazzo Workflow

Hugging Face Dataset Validate and Preview

Version 1.0.0

Check that a dataset is viewer-ready, list its splits, then preview the first rows.

1 workflow · 1 source API · 1 provider

Provider

hugging-face

Workflows

dataset-validate-and-preview
Validate a dataset, discover its first split, and preview its first rows.
Confirms a dataset is preview-capable, resolves its first subset and split, and reads the first rows of that split.
3 steps · inputs: dataset, hfToken · outputs: config, features, numRowsTotal, split

1. checkValidity (operation: isValid)
   Check whether the dataset is processed by the viewer and previewable. Branches to split discovery only when preview is available.
2. getSplits (operation: getSplits)
   List the subsets (configs) and splits for the dataset and take the first split as the preview target.
3. previewFirstRows (operation: getFirstRows)
   Fetch the first rows of the selected subset and split to inspect the dataset's feature schema and sample content.
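
The three steps above can be sketched as a plain Python function. This is a minimal sketch under stated assumptions, not the workflow runner itself: the names `validate_and_preview`, `http_fetch`, and the injectable `fetch` callable are hypothetical, the base URL is assumed rather than taken from the workflow, and the paths `/is-valid`, `/splits`, and `/first-rows` are assumed to correspond to the operations isValid, getSplits, and getFirstRows.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumed base URL of the Dataset Viewer API; the workflow itself does not state it.
BASE_URL = "https://datasets-server.huggingface.co"

# A fetch callable takes (path, query parameters, token) and returns parsed JSON.
Fetch = Callable[[str, dict, str], dict]


def http_fetch(path: str, params: dict, token: str) -> dict:
    """GET a JSON document, sending the token as a Bearer credential."""
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:  # raises HTTPError on 4xx/5xx
        return json.load(response)


def validate_and_preview(dataset: str, token: str, fetch: Fetch = http_fetch) -> Optional[dict]:
    # Step 1 (checkValidity): stop early when preview is unavailable.
    validity = fetch("/is-valid", {"dataset": dataset}, token)
    if not validity.get("preview"):
        return None

    # Step 2 (getSplits): take the first listed subset and split.
    first = fetch("/splits", {"dataset": dataset}, token)["splits"][0]
    config, split = first["config"], first["split"]

    # Step 3 (previewFirstRows): read the feature schema and sample rows.
    preview = fetch(
        "/first-rows", {"dataset": dataset, "config": config, "split": split}, token
    )
    return {
        "config": config,
        "split": split,
        "features": preview.get("features"),
        "numRowsTotal": preview.get("num_rows_total"),
    }
```

Passing `fetch` as a parameter keeps the control flow testable without network access; a stub that returns canned dictionaries is enough to exercise the early-stop branch.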

Source API Descriptions

datasetViewerApi (openapi): ../openapi/hugging-face-dataset-viewer-api.yml

Arazzo Workflow Specification

hugging-face-dataset-validate-and-preview-workflow.yml
arazzo: 1.0.1
info:
  title: Hugging Face Dataset Validate and Preview
  summary: Check that a dataset is viewer-ready, list its splits, then preview the first rows.
  description: >-
    A safe dataset onboarding flow over the Dataset Viewer API. The workflow first
    checks whether a dataset is valid and processed by the viewer, branching to
    stop early when preview is unavailable. When the dataset is previewable it
    lists the subsets and splits, selects the first split, and fetches its first
    rows for inspection. Every step spells out its request inline so the flow can
    be read and executed without opening the underlying OpenAPI description.
  version: 1.0.0
sourceDescriptions:
- name: datasetViewerApi
  url: ../openapi/hugging-face-dataset-viewer-api.yml
  type: openapi
workflows:
- workflowId: dataset-validate-and-preview
  summary: Validate a dataset, discover its first split, and preview its first rows.
  description: >-
    Confirms a dataset is preview-capable, resolves its first subset and split,
    and reads the first rows of that split.
  inputs:
    type: object
    required:
    - hfToken
    - dataset
    properties:
      hfToken:
        type: string
        description: Hugging Face access token used as a Bearer credential.
      dataset:
        type: string
        description: The dataset id on the Hugging Face Hub (e.g. squad).
  steps:
  - stepId: checkValidity
    description: >-
      Check whether the dataset is processed by the viewer and previewable.
      Branches to split discovery only when preview is available.
    operationId: isValid
    parameters:
    - name: Authorization
      in: header
      value: Bearer {$inputs.hfToken}
    - name: dataset
      in: query
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      preview: $response.body#/preview
      viewer: $response.body#/viewer
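    onSuccess:
    - name: previewable
      type: goto
      stepId: getSplits
      criteria:
      - context: $response.body
        condition: $.preview == true
        type: jsonpath
    # Stop explicitly when preview is unavailable instead of falling through.
    - name: notPreviewable
      type: end
      criteria:
      - context: $response.body
        condition: $.preview == false
        type: jsonpath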
  - stepId: getSplits
    description: >-
      List the subsets (configs) and splits for the dataset and take the first
      split as the preview target.
    operationId: getSplits
    parameters:
    - name: Authorization
      in: header
      value: Bearer {$inputs.hfToken}
    - name: dataset
      in: query
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      config: $response.body#/splits/0/config
      split: $response.body#/splits/0/split
  - stepId: previewFirstRows
    description: >-
      Fetch the first rows of the selected subset and split to inspect the
      dataset's feature schema and sample content.
    operationId: getFirstRows
    parameters:
    - name: Authorization
      in: header
      value: Bearer {$inputs.hfToken}
    - name: dataset
      in: query
      value: $inputs.dataset
    - name: config
      in: query
      value: $steps.getSplits.outputs.config
    - name: split
      in: query
      value: $steps.getSplits.outputs.split
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      features: $response.body#/features
      numRowsTotal: $response.body#/num_rows_total
  outputs:
    config: $steps.getSplits.outputs.config
    split: $steps.getSplits.outputs.split
    features: $steps.previewFirstRows.outputs.features
    numRowsTotal: $steps.previewFirstRows.outputs.numRowsTotal
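
The step `outputs` above select values from the response body with a JSON Pointer after the `#`, for example `$response.body#/splits/0/config`. A minimal sketch of that lookup follows, assuming RFC 6901 semantics; the helper `resolve_pointer` and the sample response body are hypothetical and only illustrate how the pointer picks the first split.

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer such as '/splits/0/config'."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"JSON Pointer must start with '/': {pointer!r}")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape per RFC 6901: '~1' becomes '/', then '~0' becomes '~'.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array elements are addressed by index
        else:
            current = current[token]  # object members are addressed by key
    return current


# Hypothetical getSplits response body, shaped the way the workflow's pointers expect.
sample_body = {"splits": [{"dataset": "squad", "config": "plain_text", "split": "train"}]}

config = resolve_pointer(sample_body, "/splits/0/config")
split = resolve_pointer(sample_body, "/splits/0/split")
print(config, split)  # prints: plain_text train
```

Because the pointers index `/splits/0`, a dataset whose `splits` array is empty would raise an `IndexError` in this sketch; a real runner would need to decide how to treat that case.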