
Add data validation and missing data handling practices#11

Open
nick-gorman wants to merge 1 commit intomainfrom
data-validation

Conversation

@nick-gorman (Member) commented Jan 29, 2026

A write-up of the proposed data validation practices we discussed at the 2026 planning workshop.

My notes on this session were a little light, so I mostly went off memory, and have added extra details where I thought they were warranted. As always please treat as draft and let me know what you think, or if you want to add details from your notes.

Addresses #5

Documents schema-based approach for handling missing data in templater
and translator modules, including schema enforcement at module boundaries,
testing requirements, and documentation standards.
@nick-gorman (Member Author)

Another thought for consideration: in the schemas, should we allow additional columns that are neither required nor optional, but drop them at enforcement time? I'm thinking of columns like "status" in the ecaa_generators table. It seems strange to make the status column required, or even to create it and fill it with NaNs if it isn't included. We could either specify a set of metadata columns that are allowed but dropped (silently) on schema enforcement, or just drop (silently) all columns that are neither required nor optional.
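To make the "drop silently" option concrete, here is a minimal sketch. The schema layout, function name, and column names are hypothetical stand-ins, not the project's actual API; only `ecaa_generators`' "status" column comes from the discussion above.

```python
# Hypothetical sketch of silently dropping columns that are neither
# required nor optional. SCHEMA and enforce_columns are illustrative.
import pandas as pd

SCHEMA = {
    "required": ["generator", "capacity_mw"],
    "optional": ["fuel_type"],
}

def enforce_columns(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Keep only required/optional columns, silently dropping everything else."""
    allowed = schema["required"] + schema["optional"]
    return df[[c for c in df.columns if c in allowed]]

df = pd.DataFrame({
    "generator": ["G1"],
    "capacity_mw": [100.0],
    "status": ["existing"],  # metadata column, dropped at enforcement time
})
enforced = enforce_columns(df, SCHEMA)
```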


### NaN values in compulsory columns

**Compulsory columns should not permit NaN values.** If a column can contain `NaN` values, the data it represents is effectively optional and the column should be defined as such. This distinction ensures the schema accurately reflects data requirements—compulsory columns guarantee complete data, while optional columns explicitly signal that missing values are acceptable and handled by the model.
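A minimal sketch of this rule, assuming pandas DataFrames; the function and table/column names are illustrative, not the project's actual API:

```python
# Illustrative check: compulsory columns must not contain NaN values.
import pandas as pd

def check_compulsory_no_nan(df: pd.DataFrame, compulsory: list, table: str) -> None:
    """Raise if any compulsory column contains NaN values."""
    for col in compulsory:
        if df[col].isna().any():
            raise ValueError(
                f"Table '{table}': compulsory column '{col}' contains NaN values"
            )

df = pd.DataFrame({"generator": ["G1", "G2"], "capacity_mw": [100.0, float("nan")]})
check_compulsory_no_nan(df, ["generator"], "generators")  # passes: no NaNs
# check_compulsory_no_nan(df, ["capacity_mw"], "generators")  # would raise
```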

I think if we stick with this it might help to shape other decisions around potential templater restructures - particularly about what tables should be combined in the templating step. I'm just imagining for example generation vs. storage assets having different sets of columns that are compulsory under this definition, so it wouldn't make sense to create one combined assets table.

And a follow up q that might clarify for me - is this definition specifically only referring to NaN values, or does it encompass other "empty" values as well? (e.g. "")

#### Optional Elements

- Missing optional tables are added as empty DataFrames with all schema-defined columns
- Missing optional columns are added and populated with `NaN` values
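The two rules above can be sketched as follows; the schema dict and function name are assumptions for illustration, and `np.nan` is used as the fill value pending the NaN-type question raised in the comment below:

```python
# Illustrative handling of missing optional elements per the rules above.
import numpy as np
import pandas as pd

TABLE_SCHEMAS = {
    "generators": ["generator", "capacity_mw", "fuel_type"],  # names assumed
}

def add_missing_optional(tables: dict, schemas: dict) -> dict:
    """Add missing optional tables as empty DataFrames with all schema
    columns, and missing optional columns populated with NaN."""
    for name, cols in schemas.items():
        if name not in tables:
            tables[name] = pd.DataFrame(columns=cols)
        else:
            for col in cols:
                if col not in tables[name].columns:
                    tables[name][col] = np.nan
    return tables

tables = add_missing_optional({}, TABLE_SCHEMAS)
```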

Do we need/want to specify a blanket NaN type to use in these cases (and elsewhere) as part of these docs? E.g. None vs pd.NA vs np.nan


#### Error message conventions

Error messages should identify the table name, column name (if applicable), and the nature of the violation. This consistency aids debugging and helps users quickly identify and resolve issues with their input data.
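One way this convention could look as a shared helper; the function name and signature are hypothetical, and the optional module info reflects the suggestion in the review comment below:

```python
# Hypothetical message formatter: table name, column (if applicable),
# nature of the violation, and optionally the module receiving the input.
def schema_error(table, violation, column=None, module=None):
    parts = [f"Table '{table}'"]
    if column is not None:
        parts.append(f"column '{column}'")
    msg = ", ".join(parts) + f": {violation}"
    if module is not None:
        msg += f" (input to '{module}')"
    return msg

msg = schema_error("generators", "missing required table", module="translator")
```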

And include module info too? e.g. for the first example - "Missing required table 'generators' as input to 'translator'" (or something). Maybe that's already assumed but could be good to lay out explicitly (addresses #6 too)

Schema enforcement must be tested to ensure:

- Compulsory tables and columns that are missing raise appropriate errors
- Optional tables and columns that are missing are correctly added
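A sketch of what tests covering these two requirements might look like; `enforce_schema` and its behaviour here are a stand-in for the project's real enforcement function, not its actual API:

```python
# Sketch of the testing requirements above. enforce_schema is a
# hypothetical stand-in: 'generators' compulsory, 'storage' optional.
import pandas as pd

def enforce_schema(tables: dict) -> dict:
    if "generators" not in tables:
        raise ValueError("Missing required table 'generators'")
    tables.setdefault("storage", pd.DataFrame(columns=["storage_unit"]))
    return tables

def test_missing_compulsory_table_raises():
    try:
        enforce_schema({})
    except ValueError as err:
        assert "generators" in str(err)
    else:
        raise AssertionError("expected ValueError for missing compulsory table")

def test_missing_optional_table_added():
    tables = enforce_schema({"generators": pd.DataFrame()})
    assert "storage" in tables and tables["storage"].empty

test_missing_compulsory_table_raises()
test_missing_optional_table_added()
```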

Is it useful to specify logging practices for stuff like this too, and include notes in this kind of docs about checking logs? I don't have a good sense of best practices for checking logging so that might be overkill


#### Empty DataFrame Handling

- Functions accept DataFrames with no rows without raising errors
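For instance, a downstream function should work unchanged on a zero-row input; the function and column names below are illustrative:

```python
# Illustrative zero-row handling: aggregation over an empty DataFrame
# returns a sensible default rather than raising.
import pandas as pd

def total_capacity(generators: pd.DataFrame) -> float:
    """Sum installed capacity; an empty column sums to 0, so no special
    casing is needed for zero-row inputs."""
    return float(generators["capacity_mw"].sum())

empty = pd.DataFrame(columns=["generator", "capacity_mw"])
result = total_capacity(empty)  # no error on a zero-row DataFrame
```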

For my clarity - this should be the case on the assumption that if the function requires a particular DataFrame to contain data, the error should have already been raised at the schema enforcement step?

| Benefit | Description |
|---------|-------------|
| **Data validation** | Invalid inputs are caught early with clear error messages |
| **Workflow flexibility** | Users can provide simplified datasets with only relevant tables/columns |
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all generators for greenfield optimisation) |

Suggested change
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all generators for greenfield optimisation) |
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all existing generators for greenfield optimisation) |

?
