Skip to content

reconcile_schemas: Parameter "schema" column names should be lowercased before column differential check #18

@gavinpaes

Description

@gavinpaes

operators/s3_to_redshift_operator.py (Lines 189-191)

pg_query = \
            """
            SELECT column_name, udt_name
            FROM information_schema.columns
            WHERE table_schema = '{0}' AND table_name = '{1}';
            """.format(self.redshift_schema, self.table)
pg_schema = dict(pg_hook.get_records(pg_query))
incoming_keys = [column['name'] for column in schema]
diff = list(set(incoming_keys) - set(pg_schema.keys()))

In above snippet:

If "schema" column name contains any uppercase character, the column differential (diff) will erroneously be a non-empty set. This will in turn cause logic to attempt to insert a column that is already present in created table.

Example

Assume schema = {"name": "ColumnName", "type": _ }

pg_query will report column_name == "columnname" (automatically lowercased by redshift) but incoming keys will leverage column['name'] == "ColumnName" so:

In [1]: diff =  list(set(["ColumnName"]) - set(["columnname"]))
In [2]: diff
Out[2]: ['ColumnName']

This will cause subsequent logic to try to insert a new column called 'ColumnName' which will fail since 'columnname' already exists in created table.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions