Skip to content

Conversation

@roller100
Copy link

Summary

This PR completes the PostgreSQL migration for the dbt producer compatibility tests and validates full GitHub Actions CI/CD integration.

Changes

1. PostgreSQL Migration

  • ✅ Migrated from DuckDB to PostgreSQL 15 for dbt-ol compatibility
  • ✅ Added docker-compose.yml for local PostgreSQL testing
  • ✅ Updated profiles.yml, requirements.txt, and test scenarios
  • ✅ Renamed scenario: csv_to_duckdb_local → csv_to_postgres_local

2. GitHub Actions Integration

  • ✅ Added PostgreSQL 15 service container to workflow
  • ✅ Fixed validation path (producer-dir: 'producer' instead of 'producer/dbt')
  • ✅ Successfully tested with workflow_dispatch
  • ✅ Matrix testing: dbt 1.8.0 with OpenLineage 1.23.0 and 1.39.0

3. Test Results

  • 22 OpenLineage events generated successfully
  • ✅ All dbt operations complete (seed, run, test)
  • ✅ Core facets validate correctly (schema, dataSource, sql, columnLineage, etc.)
  • ✅ GitHub Actions workflow passing

4. Documentation

  • ✅ Updated README.md with local vs GitHub Actions testing workflows
  • ✅ Enhanced SPECIFICATION_COVERAGE_ANALYSIS.md with PostgreSQL details
  • ✅ Documented custom dbt facets and validation warnings
  • ✅ Added cross-references between documentation files

Known Validation Warnings (Expected)

The dbt integration emits custom facets that generate expected validation warnings:

  • dbt_version - Captures dbt-core version
  • dbt_run - Captures dbt execution metadata

These are vendor-specific extensions to OpenLineage with valid schemas. The warnings indicate the validator doesn't recognize these custom facets, not that the events are invalid. This is documented and accepted behavior.

Workaround (Temporary)

  • fail-for-new-failures temporarily disabled in main_pr.yml for feature branch testing
  • Reason: Baseline from main branch has no dbt entries, so all dbt results are flagged as 'new failures'
  • Resolution: Will be resolved when merged to main and baseline includes dbt results

Testing

Local Testing:

docker-compose up -d
python test_runner/cli.py run-scenario --scenario csv_to_postgres_local

GitHub Actions:

Compliance

  • ✅ All commits properly attributed: roller100 (BearingNode) contact@bearingnode.com
  • ✅ Follows automated test standards
  • ✅ Documentation comprehensive for contributors
  • ✅ PostgreSQL setup validated locally and in CI/CD

Feedback Welcome

This PR represents our implementation of dbt producer compatibility testing meeting the repository's automated test standards. Any feedback on the approach, documentation, or test coverage is greatly appreciated!

roller100 (BearingNode) and others added 19 commits September 22, 2025 11:19
- Add atomic test runner with CLI interface and validation
- Add OpenLineage event generation and PIE framework integration
- Add scenario-based testing structure for csv_to_duckdb_local
- Include comprehensive documentation and maintainer info
- Add gitignore exclusions for local artifacts and sensitive files

This implements a complete dbt producer compatibility test that validates:
- OpenLineage event generation from dbt runs
- Event schema compliance using PIE framework validation
- Column lineage, schema, and SQL facet extraction
- Community-standard directory structure and documentation

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
…mentation version testing framework as future

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
…enhance documentation, and improve testing framework

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
… testing and recommendations

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add version matrix and scenario constraints:
- versions.json: Define testable dbt-core (1.8.0) and OpenLineage (1.23.0) versions
- config.json: Add component_versions and openlineage_versions to scenario

Tested with get_valid_test_scenarios.sh - scenario correctly detected.

These version constraints allow the workflow to filter which scenarios
run for given version combinations, matching the pattern used by
spark_dataproc and hive_dataproc producers.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Add producer_dbt.yml workflow for automated CI/CD testing
- Add run-scenario command to CLI for per-scenario event generation
- Update releases.json to include dbt version tracking
- Fix requirements.txt syntax for pip compatibility

The workflow follows the official OpenLineage compatibility test framework:
- Uses get_valid_test_scenarios.sh for version-based scenario filtering
- Generates events in per-scenario directories as individual JSON files
- Integrates with run_event_validation action for syntax/semantic validation
- Produces standardized test reports for compatibility tracking

This addresses Steering Committee feedback on PR OpenLineage#180 to integrate
dbt producer tests with GitHub Actions workflows.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add 'schema: main' to source definition so dbt tests can find the
seed tables. Without this, source tests were looking for tables in
a non-existent 'raw_data' schema, causing 7 test failures.

Result: All 15 dbt tests now pass (PASS=15 WARN=0 ERROR=0)
Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Trivial change to test GitHub Actions dbt workflow execution.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Enable workflow_dispatch to allow manual testing of dbt workflow.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Trivial change to test dbt workflow with complete integration.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add missing dbt file change detection in pull request workflow.
This enables the dbt job to trigger when dbt producer files are modified.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Add workflow_dispatch event with component selection input
- Support manual workflow execution without requiring PRs
- Add conditional logic to handle both PR and manual triggers
- Add dbt job definition with matrix strategy
- Add dbt to collect-and-compare-reports dependencies

This eliminates the need for internal test branches and PRs.
Testing can now be done directly on feature branch with:
  gh workflow run main_pr.yml --ref feature/dbt-producer-compatibility-test

Resolves workflow testing complexity and branch proliferation issues.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Replace DuckDB adapter with PostgreSQL adapter in profiles.yml
- Update requirements.txt: dbt-postgres, psycopg2-binary
- Rename scenario csv_to_duckdb_local to csv_to_postgres_local
- Update scenario config.json lineage_level to postgres
- Add docker-compose.yml for local PostgreSQL container
- Update GitHub Actions workflow with PostgreSQL service container
- Fix README.md installation instructions (dbt-duckdb -> dbt-postgres)
- Update SPECIFICATION_COVERAGE_ANALYSIS.md to reference PostgreSQL

Tested locally: 22 OpenLineage events generated successfully with postgres namespace.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
The validation action constructs path as producer-dir/component/scenarios,
but we were passing producer-dir='producer/dbt' with component='dbt',
resulting in producer/dbt/dbt/scenarios (extra dbt).

Changed producer-dir to 'producer' to match other workflow patterns.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add dbt compatibility test results to baseline report.json.
Includes known validation warnings for custom dbt facets
(dbt_version, dbt_run) which are not yet in official OpenLineage spec.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
The collect-and-compare-reports check compares against main branch baseline,
which doesn't have dbt producer results yet. All dbt test "failures" are
expected validation warnings for custom dbt facets (dbt_version, dbt_run).

This should be re-enabled after:
- Merging to main and baseline includes dbt results, OR
- Upstream OpenLineage spec accepts dbt custom facets

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
@roller100 roller100 force-pushed the feature/dbt-producer-compatibility-test branch from de0a81c to dcdf61d Compare November 18, 2025 16:17
roller100 (BearingNode) added 2 commits November 18, 2025 17:09
Final documentation updates for PostgreSQL migration:

1. SPECIFICATION_COVERAGE_ANALYSIS.md:
   - Updated test configuration (PostgreSQL 15, 22 events, matrix testing)
   - Added comprehensive 'Known Validation Warnings' section
   - Documented dbt_version and dbt_run custom facets
   - Explained why warnings occur (vendor extensions vs official spec)
   - Clarified impact: tests pass, events valid, warnings expected
   - Listed resolution options and current workaround status

2. README.md:
   - Distinguished local vs GitHub Actions testing workflows
   - Added 'Custom dbt Facets and Validation Warnings' section
   - Cross-referenced SPECIFICATION_COVERAGE_ANALYSIS.md at two key points
   - Clarified that validation warnings are expected behavior

These docs ensure contributors understand:
- The difference between local Docker Compose and CI/CD testing
- Why dbt events generate validation warnings (custom facets)
- That warnings are documented, expected, and acceptable
- Where to find detailed technical analysis

Ready for upstream PR to OpenLineage compatibility-tests repo.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Created event_013.json to capture the START event for the stg_customers model with data quality assertions.
- Created event_014.json for the stg_orders model, including assertions for customer_id and order_id.
- Created event_015.json for the raw_customers model, detailing assertions for customer_id and email.
- Created event_016.json for the raw_orders model, with assertions for customer_id and order_id.
- Created event_017.json for the customer_analytics model, capturing the COMPLETE event with relevant assertions.
- Created event_018.json for the stg_customers model, detailing the COMPLETE event and assertions.
- Created event_019.json for the stg_orders model, capturing the COMPLETE event with assertions.
- Created event_020.json for the raw_customers model, detailing the COMPLETE event and assertions.
- Created event_021.json for the raw_orders model, capturing the COMPLETE event with assertions.
- Created event_022.json to log the completion of the dbt run with job details.

Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
@roller100 roller100 force-pushed the feature/dbt-producer-compatibility-test branch from f7666ee to 444e0ac Compare November 18, 2025 17:10
Copy link
Collaborator

@tnazarew tnazarew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR! I left some comments, mostly about cleanup because there are lot of files that I think should not be in the PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change the status of tests in here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appreciate examples but we don't want to keep output events in the repository :)

dbt Producer Compatibility Test
This test validates that dbt generates compliant OpenLineage events
when using local file transport with CSV → dbt → DuckDB scenario.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's old, since we migrated to Postgres

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file seems to be doing way too much stuff, the test should only be dbt execution with configured transport so we have events for validation step, no validation should actually happen here.

while in case of spark or hive we need to have some logic defined here, in case of dbt it could be enough to have shell script with some dbt-ol + argumets, we should only ensure that we produce ol events into some location that is available for, maybe we will need to split the events into separate files though

echo "Finished running all scenarios"
- name: Validation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you use validation script but there is also validation in the test scenario file, is there some reason for it?

bin/

# OpenLineage event files generated during local testing
openlineage_events.jsonl
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jsonl?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something here is duplicated

roller100 pushed a commit to BearingNode/openlineage-compatibility-tests that referenced this pull request Nov 19, 2025
roller100 pushed a commit to BearingNode/openlineage-compatibility-tests that referenced this pull request Nov 19, 2025
Follow-up to PR OpenLineage#186 feedback addressing final alignment issues:
- Rename csv_to_postgres_local to csv_to_postgres (removes 'local' qualifier)
- Remove README.md from scenario (community pattern uses scenario.md only)
- Update documentation to reflect CI/CD service container deployment model
- Correct residual DuckDB references to PostgreSQL

Refs: OpenLineage#186
Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
roller100 (BearingNode) added 2 commits November 19, 2025 16:27
Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Follow-up to PR OpenLineage#186 feedback addressing final alignment issues:
- Rename csv_to_postgres_local to csv_to_postgres (removes 'local' qualifier)
- Remove README.md from scenario (community pattern uses scenario.md only)
- Update documentation to reflect CI/CD service container deployment model
- Correct residual DuckDB references to PostgreSQL

Refs: OpenLineage#186
Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
@roller100 roller100 force-pushed the feature/dbt-producer-compatibility-test branch from ac60007 to 59d2f11 Compare November 19, 2025 16:28
@roller100
Copy link
Author

Hi @tnazarew,

Thank you very much for your prompt and thorough review! We really appreciate the detailed feedback. Here's a summary of what we've addressed:

✅ Completed in Commit 3ea4557 (Address PR #186 review feedback)

Cleanup Items

  1. test_output event files (#2541285905) - ✅ Deleted all 22 event output files, added .gitkeep
  2. test.py removal (#2541315324) - ✅ Deleted entire test.py (was violating community pattern - dbt doesn't need test.py since cli.py handles execution)
  3. docker-compose.yml - ✅ Removed (GitHub Actions provides PostgreSQL service container)
  4. future/ directory (#2541348656) - ✅ Removed duplicate scripts, kept design docs for potential TSC discussion
  5. generated-files/report.json (#2541272805) - ✅ Reverted manual changes (should be auto-generated by CI/CD)
  6. dbt_producer_report.json - ✅ Deleted (should not be committed)

Pattern Alignment

  • Duplicate validation (#2541319098) - ✅ Resolved by removing test.py; validation now only happens via validate_ol_events.py in CI/CD workflow
  • Followed community pattern: test scenarios don't need test.py when CLI runner handles execution

✅ Completed in Commit 59d2f11 (Align with community conventions)

  1. Scenario naming - ✅ Renamed csv_to_postgres_localcsv_to_postgres (aligned with Spark/Hive patterns)
  2. README.md removal - ✅ Removed scenario README (community uses scenario.md only, verified against Hive/Spark scenarios)
  3. Documentation updates (#2541289936) - ✅ Updated all docs to reflect CI/CD service container deployment, corrected PostgreSQL references

📊 Summary

  • Files Changed: 10 files (6 deleted, 3 modified, 1 created)
  • All 7 review comments: Addressed
  • Pattern Compliance: Now aligned with Spark/Hive community conventions

📝 Clarification on .gitignore patterns

.gitignore jsonl pattern (#2541321601) - We're ignoring both .json and .jsonl extensions intentionally:

Why both formats?

  1. OpenLineage file transport creates JSONL - When configured with type: file and append: true, the OpenLineage client writes events to a single file in JSONL (JSON Lines) format, where each event is one line (source)
  2. Our CLI then splits into individual JSON files - The validation framework expects separate JSON files (one per event), so our CLI reads the JSONL file and splits it into event_001.json, event_002.json, etc.

Process flow:

dbt-ol → openlineage_events.jsonl (JSONL format, multiple events)
       → cli.py splits into → event_001.json, event_002.json, ... (individual JSON files)
       → validate_ol_events.py validates individual JSON files

The .gitignore patterns cover both intermediate (.jsonl) and final (.json) output files generated during local testing.


Let us know if anything else needs attention. Thanks again for the guidance!

@roller100 roller100 requested a review from tnazarew November 19, 2025 18:28
@roller100
Copy link
Author

Hi @tnazarew,

Hope you're doing well.

We believe we have addressed all 7 review comments in commits 3ea4557 and 59d2f11:

  • ✅ Removed test output files, added .gitkeep
  • ✅ Deleted test.py (eliminated duplicate validation)
  • ✅ Removed docker-compose.yml (CI uses service container)
  • ✅ Cleaned up future/ directory
  • ✅ Reverted manual report.json changes
  • ✅ Renamed scenario to csv_to_postgres
  • ✅ Updated all PostgreSQL documentation

The PR has been ready for re-review for little while now. Would you please be able to take another look when you have a chance?

If there are any blockers or concerns we should be aware of, please let us know and we'll work to resolve them.

Thanks for your time and guidance on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants