
Ingesting dbt Metadata into DataHub

Prerequisites

DataHub utilizes dbt artifacts to populate metadata. Before configuring DataHub, ensure that dbt artifacts are available in an S3 bucket.

These artifacts include:

  • catalog.json
  • manifest.json
  • run_results.json
  • sources.json
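
These files are typically written to dbt's target/ directory, for example by dbt docs generate (manifest.json and catalog.json), dbt build or dbt test (run_results.json), and dbt source freshness (sources.json), and must then be copied to S3. As a sketch, the bucket layout assumed by the sample configurations below (the bucket name and prefix are placeholders) looks like this:

  s3://<s3-bucket>/dbt_artifacts/
    catalog.json
    manifest.json
    run_results.json
    sources.json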

Configuring the dbt Source in DataHub

To ingest dbt metadata, configure a dbt ingestion source in DataHub. Refer to the official DataHub dbt source documentation for the full list of configuration options.

Sample Configuration

The following sample demonstrates how to configure a dbt ingestion source in DataHub.

[!Note] This configuration requires a DataHub secret (S3_secret_key) for secure access to S3. Ensure that this secret has been created in DataHub (typically on the Ingestion page, under Secrets) before proceeding.

source:
  type: dbt
  config:
    platform_instance: balboa
    target_platform: snowflake
    manifest_path: "s3://<s3-bucket>/dbt_artifacts/manifest.json"
    catalog_path: "s3://<s3-bucket>/dbt_artifacts/catalog.json"
    sources_path: "s3://<s3-bucket>/dbt_artifacts/sources.json"
    test_results_path: "s3://<s3-bucket>/dbt_artifacts/run_results.json"
    include_column_lineage: true
    aws_connection:
      aws_access_key_id: ABC.....
      aws_secret_access_key: "${S3_secret_key}"
      aws_region: us-west-2
    git_info:
      repo: github.com/datacoves/balboa
      url_template: "{repo_url}/blob/{branch}/transform/{file_path}"
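
The sample above only defines the source. If you run the recipe yourself with the DataHub CLI (datahub ingest -c recipe.yaml) rather than through a UI-managed ingestion source, a sink section is typically appended as well; the server URL and token below are placeholders, not values from this project:

sink:
  type: datahub-rest
  config:
    server: "https://<datahub-host>/gms"   # placeholder GMS endpoint
    token: "${DATAHUB_TOKEN}"              # placeholder DataHub access token secret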

Configuring DataHub for a Second dbt Source

When using Datacoves Mesh (also known as dbt Mesh), you can ingest metadata from multiple dbt projects.

[!Note] To prevent duplicate nodes, exclude the upstream project's nodes by specifying deny patterns in the node_name_pattern section, as shown in the sample below.

Sample Configuration

The following configuration demonstrates how to add a second dbt source in DataHub:

source:
  type: dbt
  config:
    platform_instance: great_bay
    target_platform: snowflake
    manifest_path: "s3://<s3-bucket>/dbt_artifacts_great_bay/manifest.json"
    catalog_path: "s3://<s3-bucket>/dbt_artifacts_great_bay/catalog.json"
    sources_path: "s3://<s3-bucket>/dbt_artifacts_great_bay/sources.json"
    test_results_path: "s3://<s3-bucket>/dbt_artifacts_great_bay/run_results.json"

    # Prevent duplication of upstream nodes
    entities_enabled:
      sources: No

    # Stateful ingestion settings
    stateful_ingestion:
      enabled: false
      remove_stale_metadata: true

    include_column_lineage: true
    convert_column_urns_to_lowercase: false
    skip_sources_in_lineage: true

    # AWS credentials (requires secret `S3_secret_key`)
    aws_connection:
      aws_access_key_id: ABC.....
      aws_secret_access_key: "${S3_secret_key}"
      aws_region: us-west-2

    # Git repository information
    git_info:
      repo: github.com/datacoves/great_bay
      url_template: "{repo_url}/blob/{branch}/transform/{file_path}"

    # Exclude upstream dbt project nodes to prevent duplication
    node_name_pattern:
      deny:
        - "model.balboa.*"
        - "seed.balboa.*"