fix: Resolve unqualified columns in JOIN queries using schema info #50

Merged
mingjerli merged 1 commit into main from fix/unqualified-column-resolution on Jan 19, 2026

@mingjerli
Owner

Summary

  • Fixes incorrect column lineage when unqualified columns are used in JOIN queries
  • Uses sqlglot's qualify_columns optimizer with upstream table schemas to correctly resolve which table each column belongs to
  • Fixes _extract_select_from_query to use dialect when serializing SQL, preventing function argument reordering (e.g., DATE_TRUNC)

Problem

When a SQL query joins multiple tables and columns are unqualified (no table prefix), the lineage builder would default to the first table, which was often incorrect. For example:

  SELECT DATE_TRUNC(order_date, MONTH) as month
  FROM analytics.user_metrics
  JOIN staging.user_orders USING (user_id)

The order_date column would be incorrectly attributed to user_metrics instead of user_orders, causing the lineage edge staging.user_orders.order_date -> reports.monthly_revenue.month to be dropped.

Solution

  1. Added _convert_to_nested_schema() helper to convert flat schema format to sqlglot's nested format
  2. Added _qualify_sql_with_schema() helper to qualify unqualified columns using schema info before parsing
  3. Modified RecursiveLineageBuilder.__init__ to qualify SQL before building lineage
  4. Fixed _extract_select_from_query() to use dialect parameter when serializing SQL
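The schema conversion in step 1 might look roughly like the sketch below. The flat input format (dotted table name mapped to a column-to-type dict) is an assumption, since the PR description does not show it, and `convert_to_nested_schema` is a hypothetical stand-in rather than the actual `_convert_to_nested_schema()` helper.

```python
from typing import Any, Dict

# Assumed flat format: {"db.table": {"column": "TYPE"}}
FlatSchema = Dict[str, Dict[str, str]]


def convert_to_nested_schema(flat_schema: FlatSchema) -> Dict[str, Any]:
    """Turn dotted table names into nested dicts, e.g. {db: {table: {col: type}}}."""
    nested: Dict[str, Any] = {}
    for table_name, columns in flat_schema.items():
        parts = table_name.split(".")
        node = nested
        # Walk (or create) one dict level per qualifier: catalog, db, ...
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The last part is the table itself; copy so the input is not aliased.
        node[parts[-1]] = dict(columns)
    return nested


flat = {
    "analytics.user_metrics": {"user_id": "INT64"},
    "staging.user_orders": {"user_id": "INT64", "order_date": "DATE"},
}
nested = convert_to_nested_schema(flat)
print(nested)
```

sqlglot expects every table in a mapping schema to sit at the same nesting depth, so a real helper would likely need to normalize inputs that mix `table`, `db.table`, and `catalog.db.table` names; this sketch does not handle that.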

Test plan

  • Added 16 new tests in test_unqualified_column_resolution.py covering:
    • Schema conversion helpers
    • SQL qualification helpers
    • Single-query lineage with schema
    • Multi-query pipeline lineage (3-layer example)
    • Edge cases (columns in both tables, aggregates, fallback)
  • All 778 existing tests pass (no regressions)
  • Pre-commit checks pass

🤖 Generated with Claude Code

When a SQL query joins multiple tables and columns are unqualified (no
table prefix), the lineage builder previously defaulted to the first
table, which was often incorrect. For example, in:

  SELECT DATE_TRUNC(order_date, MONTH) as month
  FROM analytics.user_metrics
  JOIN staging.user_orders USING (user_id)

The `order_date` column would be incorrectly attributed to `user_metrics`
instead of `user_orders`, causing lineage edges to be dropped.

This fix:
1. Uses sqlglot's qualify_columns optimizer with upstream table schemas
   to add correct table prefixes before building lineage
2. Fixes _extract_select_from_query to use dialect when serializing SQL,
   which was causing DATE_TRUNC arguments to be reordered incorrectly

The 3-layer pipeline example now correctly shows:
- staging.user_orders.order_date -> reports.monthly_revenue.month
- trace_column_backward returns raw.orders.order_date

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@mingjerli mingjerli merged commit 98a894b into main Jan 19, 2026
8 checks passed