Hugging Face · Arazzo Workflow

Hugging Face Dataset Search and Statistics

Version 1.0.0

Resolve a dataset split, run a full-text search within it, then pull column statistics.

1 workflow · 1 source API · 1 provider

Provider

hugging-face

Workflows

dataset-search-and-statistics
Search a dataset split and pull descriptive statistics for the same split.
Discovers the first split of a dataset, performs a full-text search within it, and fetches column statistics for the split.
3 steps
Inputs: dataset, hfToken, length, query
Outputs: config, matchedRows, split, statistics

1. resolveSplit (operation: getSplits)
   Resolve the dataset's first subset and split to use as the search and statistics target.
2. searchSplit (operation: searchRows)
   Run a full-text search across the resolved split and return the matching rows.
3. getColumnStatistics (operation: getStatistics)
   Retrieve descriptive statistics for the columns of the resolved split to contextualize the search matches.
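
The three steps above can be sketched as plain HTTP calls. This is a minimal Python sketch under stated assumptions, not the workflow's reference implementation: the base URL `https://datasets-server.huggingface.co` and the `/splits`, `/search`, and `/statistics` paths are assumptions, because the Arazzo document only names operationIds and leaves paths to the OpenAPI description. The helper names `http_get_json` and `run_workflow` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Dataset Viewer API; the Arazzo file only references
# operationIds, so this value and the paths used below are assumptions.
BASE_URL = "https://datasets-server.huggingface.co"


def http_get_json(path, params, token):
    """Send an authenticated GET request and decode the JSON response body."""
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:  # raises HTTPError on 4xx/5xx
        return json.load(response)


def run_workflow(dataset, query, hf_token, length=100, fetch=http_get_json):
    """Chain the three steps; `fetch` is injectable so the flow can run offline."""
    if length > 100:
        raise ValueError("length must not exceed 100")

    # Step 1 (resolveSplit): take the first config/split pair the API reports.
    splits = fetch("/splits", {"dataset": dataset}, hf_token)
    first = splits["splits"][0]
    target = {"dataset": dataset, "config": first["config"], "split": first["split"]}

    # Step 2 (searchSplit): full-text search within the resolved split.
    search = fetch("/search", {**target, "query": query, "length": length}, hf_token)

    # Step 3 (getColumnStatistics): descriptive statistics for the same split.
    stats = fetch("/statistics", target, hf_token)

    # Mirror the workflow-level outputs declared in the Arazzo document.
    return {
        "config": first["config"],
        "split": first["split"],
        "matchedRows": search["rows"],
        "statistics": stats["statistics"],
    }
```

Calling `run_workflow(dataset, query, hf_token)` with the default `fetch` performs real network requests. Passing a stub function as `fetch` lets the step chaining be exercised without a token or network access.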

Source API Descriptions

datasetViewerApi (openapi): ../openapi/hugging-face-dataset-viewer-api.yml

Arazzo Workflow Specification

hugging-face-dataset-search-and-statistics-workflow.yml
arazzo: 1.0.1
info:
  title: Hugging Face Dataset Search and Statistics
  summary: Resolve a dataset split, run a full-text search within it, then pull column statistics.
  description: >-
    An analysis flow over the Dataset Viewer API. The workflow resolves the
    dataset's first subset and split, runs a full-text search across that split to
    find matching rows, and then retrieves descriptive column statistics for the
    split so the search results can be understood in the context of the underlying
    distributions. Every step spells out its parameters inline so the flow can be
    read without opening the underlying OpenAPI description; executing it still
    requires that description to resolve each operationId.
  version: 1.0.0
sourceDescriptions:
- name: datasetViewerApi
  url: ../openapi/hugging-face-dataset-viewer-api.yml
  type: openapi
workflows:
- workflowId: dataset-search-and-statistics
  summary: Search a dataset split and pull descriptive statistics for the same split.
  description: >-
    Discovers the first split of a dataset, performs a full-text search within
    it, and fetches column statistics for the split.
  inputs:
    type: object
    required:
    - hfToken
    - dataset
    - query
    properties:
      hfToken:
        type: string
        description: Hugging Face access token used as a Bearer credential.
      dataset:
        type: string
        description: The dataset id on the Hugging Face Hub.
      query:
        type: string
        description: Full-text search query string to run against the split.
      length:
        type: integer
        description: Number of matching rows to return (max 100).
        default: 100
        maximum: 100
  steps:
  - stepId: resolveSplit
    description: >-
      Resolve the dataset's first subset and split to use as the search and
      statistics target.
    operationId: getSplits
    parameters:
    - name: Authorization
      in: header
      value: "Bearer {$inputs.hfToken}"
    - name: dataset
      in: query
      value: $inputs.dataset
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      config: $response.body#/splits/0/config
      split: $response.body#/splits/0/split
  - stepId: searchSplit
    description: >-
      Run a full-text search across the resolved split and return the matching
      rows.
    operationId: searchRows
    parameters:
    - name: Authorization
      in: header
      value: "Bearer {$inputs.hfToken}"
    - name: dataset
      in: query
      value: $inputs.dataset
    - name: config
      in: query
      value: $steps.resolveSplit.outputs.config
    - name: split
      in: query
      value: $steps.resolveSplit.outputs.split
    - name: query
      in: query
      value: $inputs.query
    - name: length
      in: query
      value: $inputs.length
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      matchedRows: $response.body#/rows
      numRowsTotal: $response.body#/num_rows_total
  - stepId: getColumnStatistics
    description: >-
      Retrieve descriptive statistics for the columns of the resolved split to
      contextualize the search matches.
    operationId: getStatistics
    parameters:
    - name: Authorization
      in: header
      value: "Bearer {$inputs.hfToken}"
    - name: dataset
      in: query
      value: $inputs.dataset
    - name: config
      in: query
      value: $steps.resolveSplit.outputs.config
    - name: split
      in: query
      value: $steps.resolveSplit.outputs.split
    successCriteria:
    - condition: $statusCode == 200
    outputs:
      numExamples: $response.body#/num_examples
      statistics: $response.body#/statistics
  outputs:
    config: $steps.resolveSplit.outputs.config
    split: $steps.resolveSplit.outputs.split
    matchedRows: $steps.searchSplit.outputs.matchedRows
    statistics: $steps.getColumnStatistics.outputs.statistics
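
Each step-level `outputs` entry in the specification pairs `$response.body` with a JSON Pointer (RFC 6901) fragment, for example `$response.body#/splits/0/config`. The following is a rough Python sketch of how a runner might evaluate such an expression against a decoded response body. `resolve_body_expression` is a hypothetical helper, the sample payload is invented for illustration, and real Arazzo runners may handle edge cases differently.

```python
def resolve_body_expression(expression, body):
    """Evaluate a `$response.body#/json/pointer` expression against decoded JSON."""
    prefix = "$response.body"
    if not expression.startswith(prefix):
        raise ValueError(f"unsupported expression: {expression}")

    fragment = expression[len(prefix):]
    if fragment in ("", "#"):
        return body  # no pointer given: the expression refers to the whole body
    if not fragment.startswith("#/"):
        raise ValueError(f"malformed JSON Pointer fragment: {fragment}")

    value = body
    # Drop the leading "#" and the empty token before the first "/".
    for token in fragment[1:].split("/")[1:]:
        # RFC 6901 unescaping: "~1" becomes "/" first, then "~0" becomes "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(value, list):
            if not token.isdigit():
                raise ValueError(f"array index expected, got: {token}")
            value = value[int(token)]
        else:
            value = value[token]
    return value


# Invented sample shaped like the body the resolveSplit step reads from.
sample_body = {
    "splits": [
        {"dataset": "example/dataset", "config": "default", "split": "train"},
        {"dataset": "example/dataset", "config": "default", "split": "test"},
    ]
}

config = resolve_body_expression("$response.body#/splits/0/config", sample_body)
split = resolve_body_expression("$response.body#/splits/0/split", sample_body)
```

Because the pointer always selects index `0`, the workflow targets whichever config and split the API lists first, which is not necessarily the `train` split.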