Cross-Provider Workflow

Databricks Job Run to DataHub Lineage

Version 1.0.0

Run a Databricks job, then register the produced dataset and its lineage in DataHub.

1 workflow, 2 source APIs, 2 providers

Providers Orchestrated

databricks, datahub

Workflows

databricks-job-to-datahub-lineage
Run a Databricks job and register its output dataset lineage in DataHub.
Triggers a Databricks job run, then upserts the produced dataset entity into the DataHub catalog so its lineage and metadata are tracked.
2 steps
Inputs: datasetUrn, jobId, upstreamUrn
Outputs: registeredUrn, runId

1. run-databricks-job
   Operation: $sourceDescriptions.databricksApi.runJobNow
   Trigger a Databricks job run that produces a dataset.
2. register-lineage
   Operation: $sourceDescriptions.datahubApi.upsertEntities
   Upsert the produced dataset and its lineage into DataHub.
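
The two steps above can be sketched as plain HTTP calls. This is a minimal Python sketch, not the workflow runner itself. The function name `run_workflow`, the injected `post` transport, the host parameters, and both endpoint paths are assumptions made for illustration; the real paths are whatever the referenced OpenAPI documents define for `runJobNow` and `upsertEntities`. The request bodies and success checks mirror the Arazzo spec on this page.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport: takes (url, json_body), returns (status_code, parsed_body).
Post = Callable[[str, Any], Tuple[int, Any]]


def run_workflow(
    post: Post,
    databricks_host: str,
    datahub_host: str,
    job_id: int,
    dataset_urn: str,
    upstream_urn: str,
) -> Dict[str, Any]:
    """Run both workflow steps in order and return the workflow outputs."""
    # Step 1 (run-databricks-job): trigger the job run.
    # The path is an assumption; the spec only names the operation runJobNow.
    status, body = post(f"{databricks_host}/api/2.1/jobs/run-now", {"job_id": job_id})
    if status != 200:  # successCriteria: $statusCode == 200
        raise RuntimeError(f"run-databricks-job failed with status {status}")
    run_id = body["run_id"]  # outputs.runId: $response.body#/run_id

    # Step 2 (register-lineage): upsert the dataset with its upstream lineage.
    # The payload shape is copied from the spec; the path is an assumption.
    payload = [
        {
            "urn": dataset_urn,
            "aspects": {
                "upstreamLineage": {
                    "upstreams": [{"dataset": upstream_urn, "type": "TRANSFORMED"}]
                }
            },
        }
    ]
    status, body = post(f"{datahub_host}/openapi/v3/entity/dataset", payload)
    if status != 200:  # successCriteria: $statusCode == 200
        raise RuntimeError(f"register-lineage failed with status {status}")
    registered_urn = body[0]["urn"]  # outputs.registeredUrn: $response.body#/0/urn

    return {"runId": run_id, "registeredUrn": registered_urn}
```

Passing the transport in as a callable keeps authentication and the HTTP library out of the sketch. In practice `post` would likely wrap an HTTP client call that adds the bearer token each provider expects.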

Source API Descriptions

databricksApi (openapi): https://raw.githubusercontent.com/api-evangelist/databricks/refs/heads/main/openapi/databricks-openapi.yml
datahubApi (openapi): https://raw.githubusercontent.com/api-evangelist/datahub/refs/heads/main/openapi/datahub-openapi-openapi.yml

Arazzo Workflow Specification

data-databricks-job-to-datahub-lineage.yml
arazzo: 1.0.1
info:
  title: Databricks Job Run to DataHub Lineage
  summary: Run a Databricks job, then register the produced dataset and its lineage in DataHub.
  description: >-
    A data pipeline workflow that triggers a Databricks job to produce a dataset, then
    upserts the resulting dataset entity and its lineage metadata into the DataHub
    catalog. Demonstrates connecting a compute/processing platform to a metadata and
    lineage catalog for governance.
  version: 1.0.0
sourceDescriptions:
  - name: databricksApi
    url: https://raw.githubusercontent.com/api-evangelist/databricks/refs/heads/main/openapi/databricks-openapi.yml
    type: openapi
  - name: datahubApi
    url: https://raw.githubusercontent.com/api-evangelist/datahub/refs/heads/main/openapi/datahub-openapi-openapi.yml
    type: openapi
workflows:
  - workflowId: databricks-job-to-datahub-lineage
    summary: Run a Databricks job and register its output dataset lineage in DataHub.
    description: >-
      Triggers a Databricks job run, then upserts the produced dataset entity into the
      DataHub catalog so its lineage and metadata are tracked.
    inputs:
      type: object
      properties:
        jobId:
          type: integer
        datasetUrn:
          type: string
        upstreamUrn:
          type: string
    steps:
      - stepId: run-databricks-job
        description: Trigger a Databricks job run that produces a dataset.
        operationId: $sourceDescriptions.databricksApi.runJobNow
        requestBody:
          contentType: application/json
          payload:
            job_id: $inputs.jobId
        successCriteria:
          - condition: $statusCode == 200
        outputs:
          runId: $response.body#/run_id
      - stepId: register-lineage
        description: Upsert the produced dataset and its lineage into DataHub.
        operationId: $sourceDescriptions.datahubApi.upsertEntities
        requestBody:
          contentType: application/json
          payload:
            - urn: $inputs.datasetUrn
              aspects:
                upstreamLineage:
                  upstreams:
                    - dataset: $inputs.upstreamUrn
                      type: TRANSFORMED
        successCriteria:
          - condition: $statusCode == 200
        outputs:
          registeredUrn: $response.body#/0/urn
    outputs:
      runId: $steps.run-databricks-job.outputs.runId
      registeredUrn: $steps.register-lineage.outputs.registeredUrn
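
The spec above moves data between steps with runtime expressions such as `$inputs.jobId`, `$response.body#/run_id`, and `$steps.run-databricks-job.outputs.runId`. The following is a minimal Python sketch of how a runner might resolve just these three forms. The helper names `json_pointer` and `resolve` and the layout of the `ctx` dictionary are hypothetical, and the sketch covers only the expression forms used in this workflow, not the full Arazzo expression grammar.

```python
from typing import Any, Dict


def json_pointer(doc: Any, pointer: str) -> Any:
    """Resolve a JSON Pointer (RFC 6901) such as '/0/urn' against doc."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    current = doc
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: '~1' means '/', '~0' means '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array index, e.g. '0'
        else:
            current = current[token]  # object member, e.g. 'urn'
    return current


def resolve(expr: str, ctx: Dict[str, Any]) -> Any:
    """Resolve one runtime expression against a hypothetical context dict.

    ctx is assumed to hold:
      'inputs'        -> workflow inputs
      'response_body' -> parsed body of the current step's response
      'steps'         -> {stepId: {outputName: value}} for finished steps
    """
    if expr.startswith("$inputs."):
        return ctx["inputs"][expr[len("$inputs."):]]
    if expr.startswith("$response.body#"):
        return json_pointer(ctx["response_body"], expr[len("$response.body#"):])
    if expr.startswith("$steps."):
        step_id, _, name = expr[len("$steps."):].partition(".outputs.")
        return ctx["steps"][step_id][name]
    raise ValueError(f"unsupported expression: {expr}")
```

With this sketch, `$response.body#/0/urn` reads the `urn` field of the first element of the response array, which is how the `register-lineage` step derives `registeredUrn`.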