-
Notifications
You must be signed in to change notification settings - Fork 3
feat(dbt): Migrate dbt producer to PostgreSQL and validate CI/CD integration #186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(dbt): Migrate dbt producer to PostgreSQL and validate CI/CD integration #186
Conversation
- Add atomic test runner with CLI interface and validation - Add OpenLineage event generation and PIE framework integration - Add scenario-based testing structure for csv_to_duckdb_local - Include comprehensive documentation and maintainer info - Add gitignore exclusions for local artifacts and sensitive files This implements a complete dbt producer compatibility test that validates: - OpenLineage event generation from dbt runs - Event schema compliance using PIE framework validation - Column lineage, schema, and SQL facet extraction - Community-standard directory structure and documentation Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
…mentation version testing framework as future Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
…enhance documentation, and improve testing framework Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
… testing and recommendations Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add version matrix and scenario constraints: - versions.json: Define testable dbt-core (1.8.0) and OpenLineage (1.23.0) versions - config.json: Add component_versions and openlineage_versions to scenario Tested with get_valid_test_scenarios.sh - scenario correctly detected. These version constraints allow the workflow to filter which scenarios run for given version combinations, matching the pattern used by spark_dataproc and hive_dataproc producers. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Add producer_dbt.yml workflow for automated CI/CD testing - Add run-scenario command to CLI for per-scenario event generation - Update releases.json to include dbt version tracking - Fix requirements.txt syntax for pip compatibility The workflow follows the official OpenLineage compatibility test framework: - Uses get_valid_test_scenarios.sh for version-based scenario filtering - Generates events in per-scenario directories as individual JSON files - Integrates with run_event_validation action for syntax/semantic validation - Produces standardized test reports for compatibility tracking This addresses Steering Committee feedback on PR OpenLineage#180 to integrate dbt producer tests with GitHub Actions workflows. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add 'schema: main' to source definition so dbt tests can find the seed tables. Without this, source tests were looking for tables in a non-existent 'raw_data' schema, causing 7 test failures. Result: All 15 dbt tests now pass (PASS=15 WARN=0 ERROR=0) Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Trivial change to test GitHub Actions dbt workflow execution. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Enable workflow_dispatch to allow manual testing of dbt workflow. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
[TEST] dbt Workflow Validation
Trivial change to test dbt workflow with complete integration. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
[TEST] Validate dbt Workflow
Add missing dbt file change detection in pull request workflow. This enables the dbt job to trigger when dbt producer files are modified. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Add workflow_dispatch event with component selection input - Support manual workflow execution without requiring PRs - Add conditional logic to handle both PR and manual triggers - Add dbt job definition with matrix strategy - Add dbt to collect-and-compare-reports dependencies This eliminates the need for internal test branches and PRs. Testing can now be done directly on feature branch with: gh workflow run main_pr.yml --ref feature/dbt-producer-compatibility-test Resolves workflow testing complexity and branch proliferation issues. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Replace DuckDB adapter with PostgreSQL adapter in profiles.yml - Update requirements.txt: dbt-postgres, psycopg2-binary - Rename scenario csv_to_duckdb_local to csv_to_postgres_local - Update scenario config.json lineage_level to postgres - Add docker-compose.yml for local PostgreSQL container - Update GitHub Actions workflow with PostgreSQL service container - Fix README.md installation instructions (dbt-duckdb -> dbt-postgres) - Update SPECIFICATION_COVERAGE_ANALYSIS.md to reference PostgreSQL Tested locally: 22 OpenLineage events generated successfully with postgres namespace. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
The validation action constructs path as producer-dir/component/scenarios, but we were passing producer-dir='producer/dbt' with component='dbt', resulting in producer/dbt/dbt/scenarios (extra dbt). Changed producer-dir to 'producer' to match other workflow patterns. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Add dbt compatibility test results to baseline report.json. Includes known validation warnings for custom dbt facets (dbt_version, dbt_run) which are not yet in official OpenLineage spec. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
The collect-and-compare-reports check compares against main branch baseline, which doesn't have dbt producer results yet. All dbt test "failures" are expected validation warnings for custom dbt facets (dbt_version, dbt_run). This should be re-enabled after: - Merging to main and baseline includes dbt results, OR - Upstream OpenLineage spec accepts dbt custom facets Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
de0a81c to
dcdf61d
Compare
Final documentation updates for PostgreSQL migration: 1. SPECIFICATION_COVERAGE_ANALYSIS.md: - Updated test configuration (PostgreSQL 15, 22 events, matrix testing) - Added comprehensive 'Known Validation Warnings' section - Documented dbt_version and dbt_run custom facets - Explained why warnings occur (vendor extensions vs official spec) - Clarified impact: tests pass, events valid, warnings expected - Listed resolution options and current workaround status 2. README.md: - Distinguished local vs GitHub Actions testing workflows - Added 'Custom dbt Facets and Validation Warnings' section - Cross-referenced SPECIFICATION_COVERAGE_ANALYSIS.md at two key points - Clarified that validation warnings are expected behavior These docs ensure contributors understand: - The difference between local Docker Compose and CI/CD testing - Why dbt events generate validation warnings (custom facets) - That warnings are documented, expected, and acceptable - Where to find detailed technical analysis Ready for upstream PR to OpenLineage compatibility-tests repo. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
- Created event_013.json to capture the START event for the stg_customers model with data quality assertions. - Created event_014.json for the stg_orders model, including assertions for customer_id and order_id. - Created event_015.json for the raw_customers model, detailing assertions for customer_id and email. - Created event_016.json for the raw_orders model, with assertions for customer_id and order_id. - Created event_017.json for the customer_analytics model, capturing the COMPLETE event with relevant assertions. - Created event_018.json for the stg_customers model, detailing the COMPLETE event and assertions. - Created event_019.json for the stg_orders model, capturing the COMPLETE event with assertions. - Created event_020.json for the raw_customers model, detailing the COMPLETE event and assertions. - Created event_021.json for the raw_orders model, capturing the COMPLETE event with assertions. - Created event_022.json to log the completion of the dbt run with job details. Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
f7666ee to
444e0ac
Compare
tnazarew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the PR! I left some comments, mostly about cleanup because there are lot of files that I think should not be in the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the status of tests in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appreciate examples but we don't want to keep output events in the repository :)
| dbt Producer Compatibility Test | ||
| This test validates that dbt generates compliant OpenLineage events | ||
| when using local file transport with CSV → dbt → DuckDB scenario. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's old, since we migrated to Postgres
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file seems to be doing way too much stuff, the test should only be dbt execution with configured transport so we have events for validation step, no validation should actually happen here.
while in case of spark or hive we need to have some logic defined here, in case of dbt it could be enough to have shell script with some dbt-ol + argumets, we should only ensure that we produce ol events into some location that is available for, maybe we will need to split the events into separate files though
| echo "Finished running all scenarios" | ||
| - name: Validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you use validation script but there is also validation in the test scenario file, is there some reason for it?
| bin/ | ||
|
|
||
| # OpenLineage event files generated during local testing | ||
| openlineage_events.jsonl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jsonl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something here is duplicated
Follow-up to PR OpenLineage#186 feedback addressing final alignment issues: - Rename csv_to_postgres_local to csv_to_postgres (removes 'local' qualifier) - Remove README.md from scenario (community pattern uses scenario.md only) - Update documentation to reflect CI/CD service container deployment model - Correct residual DuckDB references to PostgreSQL Refs: OpenLineage#186 Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
Follow-up to PR OpenLineage#186 feedback addressing final alignment issues: - Rename csv_to_postgres_local to csv_to_postgres (removes 'local' qualifier) - Remove README.md from scenario (community pattern uses scenario.md only) - Update documentation to reflect CI/CD service container deployment model - Correct residual DuckDB references to PostgreSQL Refs: OpenLineage#186 Signed-off-by: roller100 (BearingNode) <contact@bearingnode.com>
ac60007 to
59d2f11
Compare
|
Hi @tnazarew, Thank you very much for your prompt and thorough review! We really appreciate the detailed feedback. Here's a summary of what we've addressed: ✅ Completed in Commit
|
|
Hi @tnazarew, Hope you're doing well. We believe we have addressed all 7 review comments in commits 3ea4557 and 59d2f11:
The PR has been ready for re-review for little while now. Would you please be able to take another look when you have a chance? If there are any blockers or concerns we should be aware of, please let us know and we'll work to resolve them. Thanks for your time and guidance on this. |
Summary
This PR completes the PostgreSQL migration for the dbt producer compatibility tests and validates full GitHub Actions CI/CD integration.
Changes
1. PostgreSQL Migration
2. GitHub Actions Integration
3. Test Results
4. Documentation
Known Validation Warnings (Expected)
The dbt integration emits custom facets that generate expected validation warnings:
dbt_version- Captures dbt-core versiondbt_run- Captures dbt execution metadataThese are vendor-specific extensions to OpenLineage with valid schemas. The warnings indicate the validator doesn't recognize these custom facets, not that the events are invalid. This is documented and accepted behavior.
Workaround (Temporary)
fail-for-new-failurestemporarily disabled in main_pr.yml for feature branch testingTesting
Local Testing:
GitHub Actions:
Compliance
Feedback Welcome
This PR represents our implementation of dbt producer compatibility testing meeting the repository's automated test standards. Any feedback on the approach, documentation, or test coverage is greatly appreciated!