Hugging Face · Arazzo Workflow

Hugging Face Dataset Size and Parquet Files

Version 1.0.0

Confirm a dataset on the Hub, read its size profile, then list its Parquet files.

1 workflow · 2 source APIs · 1 provider

Provider

hugging-face

Workflows

dataset-size-and-parquet
Verify a dataset, read its size profile, and list its Parquet files.
Confirms a dataset exists on the Hub, retrieves its size information from the Dataset Viewer, and lists its converted Parquet files.
3 steps · inputs: dataset, hfToken · outputs: configs, datasetNumRows, parquetFiles

1. confirmDataset · $sourceDescriptions.hubApi.getDataset
   Confirm the dataset exists on the Hub before querying the viewer for its size and Parquet files.
2. getSize · $sourceDescriptions.datasetViewerApi.getDatasetSize
   Read the dataset's size profile, including row counts and byte sizes for the full dataset and for each subset and split.
3. listParquet · $sourceDescriptions.datasetViewerApi.getParquetFiles
   List the auto-converted Parquet files for the dataset so a consumer can plan efficient bulk access.
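
The three steps above can be sketched as plain HTTP calls. This is a minimal illustration, not an Arazzo runner. It assumes the public endpoints `https://huggingface.co/api/datasets/{repo_id}` for the Hub and `https://datasets-server.huggingface.co/size` and `/parquet` for the Dataset Viewer; the workflow only references OpenAPI operations and does not name these hosts. The names `run_workflow` and `http_get_json` are hypothetical, and the fetcher is injectable so the flow can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URLs; the workflow names OpenAPI operations, not hosts.
HUB_API = "https://huggingface.co/api"
VIEWER_API = "https://datasets-server.huggingface.co"


def http_get_json(url, headers):
    """Fetch a URL and decode its JSON body (urllib raises HTTPError on 4xx/5xx)."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_workflow(dataset, hf_token, fetch=http_get_json):
    """Mirror confirmDataset -> getSize -> listParquet and return the workflow outputs."""
    auth = {"Authorization": f"Bearer {hf_token}"}
    query = urllib.parse.urlencode({"dataset": dataset})

    # Step 1: confirmDataset. As in the workflow, no Authorization header is sent here.
    repo_path = urllib.parse.quote(dataset, safe="/")
    fetch(f"{HUB_API}/datasets/{repo_path}", {})

    # Step 2: getSize. Row counts and byte sizes for the dataset, subsets, and splits.
    size = fetch(f"{VIEWER_API}/size?{query}", auth)

    # Step 3: listParquet. The auto-converted Parquet files.
    parquet = fetch(f"{VIEWER_API}/parquet?{query}", auth)

    # Same fields the workflow extracts with its JSON Pointer expressions.
    return {
        "datasetNumRows": size["size"]["dataset"]["num_rows"],
        "configs": size["size"]["configs"],
        "parquetFiles": parquet["parquet_files"],
    }
```

Calling `run_workflow("<dataset id>", "<token>")` would return a dictionary with the same three keys as the workflow's `outputs`. A failed existence check in step 1 surfaces as an exception from the fetcher, so the later steps never run.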

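Source API Descriptions

hubApi · ../openapi/hugging-face-hub-api.yml (openapi)
datasetViewerApi · ../openapi/hugging-face-dataset-viewer-api.yml (openapi)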

Arazzo Workflow Specification

hugging-face-dataset-size-and-parquet-workflow.yml
arazzo: 1.0.1
info:
  title: Hugging Face Dataset Size and Parquet Files
  summary: Confirm a dataset on the Hub, read its size profile, then list its Parquet files.
  description: >-
    A data-engineering planning flow that spans the Hub API and the Dataset
    Viewer API. The workflow first confirms a dataset exists on the Hub, then
    reads its size profile (row counts and byte sizes per subset and split) from
    the Dataset Viewer, and finally lists the auto-converted Parquet files so a
    consumer can plan efficient bulk access. Every step spells out its request
    inline so the flow can be read and executed without opening the underlying
    OpenAPI description.
  version: 1.0.0
sourceDescriptions:
- name: hubApi
  url: ../openapi/hugging-face-hub-api.yml
  type: openapi
- name: datasetViewerApi
  url: ../openapi/hugging-face-dataset-viewer-api.yml
  type: openapi
workflows:
- workflowId: dataset-size-and-parquet
  summary: Verify a dataset, read its size profile, and list its Parquet files.
  description: >-
    Confirms a dataset exists on the Hub, retrieves its size information from the
    Dataset Viewer, and lists its converted Parquet files.
  inputs:
    type: object
    required:
    - hfToken
    - dataset
    properties:
      hfToken:
        type: string
        description: Hugging Face access token used as a Bearer credential.
      dataset:
        type: string
        description: The dataset id on the Hugging Face Hub.
  steps:
  - stepId: confirmDataset
    description: >-
      Confirm the dataset exists on the Hub before querying the viewer for its
      size and Parquet files.
    operationId: $sourceDescriptions.hubApi.getDataset
    parameters:
    - name: repo_id
      in: path
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      datasetId: $response.body#/id
    onSuccess:
    - name: exists
      type: goto
      stepId: getSize
      criteria:
      - condition: $statusCode == 200
  - stepId: getSize
    description: >-
      Read the dataset's size profile including row counts and byte sizes for the
      full dataset and for each subset and split.
    operationId: $sourceDescriptions.datasetViewerApi.getDatasetSize
    parameters:
    - name: Authorization
      in: header
      value: Bearer {$inputs.hfToken}
    - name: dataset
      in: query
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      datasetNumRows: $response.body#/size/dataset/num_rows
      configs: $response.body#/size/configs
  - stepId: listParquet
    description: >-
      List the auto-converted Parquet files for the dataset so a consumer can
      plan efficient bulk access.
    operationId: $sourceDescriptions.datasetViewerApi.getParquetFiles
    parameters:
    - name: Authorization
      in: header
      value: Bearer {$inputs.hfToken}
    - name: dataset
      in: query
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      parquetFiles: $response.body#/parquet_files
  outputs:
    datasetNumRows: $steps.getSize.outputs.datasetNumRows
    configs: $steps.getSize.outputs.configs
    parquetFiles: $steps.listParquet.outputs.parquetFiles
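
To illustrate how the `parquetFiles` output might feed a bulk-access plan, the sketch below groups files by subset and split and totals their sizes. It assumes each entry carries `config`, `split`, `url`, and `size` (bytes) keys, following the shape of the Dataset Viewer's `/parquet` response; the workflow itself does not define this shape, and `plan_bulk_access` is a hypothetical helper.

```python
from collections import defaultdict


def plan_bulk_access(parquet_files):
    """Group Parquet files by (config, split) with file count, total bytes, and URLs."""
    plan = defaultdict(lambda: {"files": 0, "bytes": 0, "urls": []})
    for entry in parquet_files:
        # Assumed keys: config, split, size (bytes), url.
        key = (entry["config"], entry["split"])
        plan[key]["files"] += 1
        plan[key]["bytes"] += entry["size"]
        plan[key]["urls"].append(entry["url"])
    return dict(plan)
```

A consumer could pass the workflow's `parquetFiles` output straight into this function and then decide, per subset and split, whether to download everything or stream individual files.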