
Add data validation and missing data handling practices#11

Open
nick-gorman wants to merge 1 commit intomainfrom
data-validation

Conversation

@nick-gorman (Member) commented Jan 29, 2026

A write-up of the proposed data validation practices we discussed at the 2026 planning workshop.

My notes on this session were a little light, so I mostly went off memory, and have added extra details where I thought they were warranted. As always please treat as draft and let me know what you think, or if you want to add details from your notes.

Addresses #5

Documents schema-based approach for handling missing data in templater
and translator modules, including schema enforcement at module boundaries,
testing requirements, and documentation standards.
@nick-gorman (Member Author)

Another thought for consideration: in the schemas, should we allow additional columns that are neither required nor optional, but drop them at enforcement time? I'm thinking of columns like "status" in the ecaa_generators table. It seems strange to make the status column required, or even to create it and fill it with NaNs if it isn't included. We could either specify a set of metadata columns that are allowed but dropped (silently) on schema enforcement, or just drop (silently) all columns that are neither required nor optional.
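To make the "drop silently" option concrete, here is a minimal sketch. The schema layout, function name, and column names are hypothetical stand-ins, not the project's actual API; only `ecaa_generators`' "status" column comes from the discussion above.

```python
# Hypothetical sketch of silently dropping columns that are neither
# required nor optional. SCHEMA and enforce_columns are illustrative.
import pandas as pd

SCHEMA = {
    "required": ["generator", "capacity_mw"],
    "optional": ["fuel_type"],
}

def enforce_columns(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Keep only required/optional columns, silently dropping everything else."""
    allowed = schema["required"] + schema["optional"]
    return df[[c for c in df.columns if c in allowed]]

df = pd.DataFrame({
    "generator": ["G1"],
    "capacity_mw": [100.0],
    "status": ["existing"],  # metadata column, dropped at enforcement time
})
enforced = enforce_columns(df, SCHEMA)
```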


### NaN values in compulsory columns

**Compulsory columns should not permit NaN values.** If a column can contain `NaN` values, the data it represents is effectively optional and the column should be defined as such. This distinction ensures the schema accurately reflects data requirements—compulsory columns guarantee complete data, while optional columns explicitly signal that missing values are acceptable and handled by the model.
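A minimal sketch of this rule, assuming pandas DataFrames; the function and table/column names are illustrative, not the project's actual API:

```python
# Illustrative check: compulsory columns must not contain NaN values.
import pandas as pd

def check_compulsory_no_nan(df: pd.DataFrame, compulsory: list, table: str) -> None:
    """Raise if any compulsory column contains NaN values."""
    for col in compulsory:
        if df[col].isna().any():
            raise ValueError(
                f"Table '{table}': compulsory column '{col}' contains NaN values"
            )

df = pd.DataFrame({"generator": ["G1", "G2"], "capacity_mw": [100.0, float("nan")]})
check_compulsory_no_nan(df, ["generator"], "generators")  # passes: no NaNs
# check_compulsory_no_nan(df, ["capacity_mw"], "generators")  # would raise
```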

I think if we stick with this it might help to shape other decisions around potential templater restructures - particularly about what tables should be combined in the templating step. I'm just imagining for example generation vs. storage assets having different sets of columns that are compulsory under this definition, so it wouldn't make sense to create one combined assets table.

And a follow up q that might clarify for me - is this definition specifically only referring to NaN values, or does it encompass other "empty" values as well? (e.g. "")

#### Optional Elements

- Missing optional tables are added as empty DataFrames with all schema-defined columns
- Missing optional columns are added and populated with `NaN` values
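The two rules above can be sketched as follows; the schema dict and function name are assumptions for illustration, and `np.nan` is used as the fill value pending the NaN-type question raised in the comment below:

```python
# Illustrative handling of missing optional elements per the rules above.
import numpy as np
import pandas as pd

TABLE_SCHEMAS = {
    "generators": ["generator", "capacity_mw", "fuel_type"],  # names assumed
}

def add_missing_optional(tables: dict, schemas: dict) -> dict:
    """Add missing optional tables as empty DataFrames with all schema
    columns, and missing optional columns populated with NaN."""
    for name, cols in schemas.items():
        if name not in tables:
            tables[name] = pd.DataFrame(columns=cols)
        else:
            for col in cols:
                if col not in tables[name].columns:
                    tables[name][col] = np.nan
    return tables

tables = add_missing_optional({}, TABLE_SCHEMAS)
```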

Do we need/want to specify a blanket NaN type to use in these cases (and elsewhere) as part of these docs? E.g. None vs pd.NA vs np.nan


#### Error message conventions

Error messages should identify the table name, column name (if applicable), and the nature of the violation. This consistency aids debugging and helps users quickly identify and resolve issues with their input data.
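One way this convention could look as a shared helper; the function name and signature are hypothetical, and the optional module info reflects the suggestion in the review comment below:

```python
# Hypothetical message formatter: table name, column (if applicable),
# nature of the violation, and optionally the module receiving the input.
def schema_error(table, violation, column=None, module=None):
    parts = [f"Table '{table}'"]
    if column is not None:
        parts.append(f"column '{column}'")
    msg = ", ".join(parts) + f": {violation}"
    if module is not None:
        msg += f" (input to '{module}')"
    return msg

msg = schema_error("generators", "missing required table", module="translator")
```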

And include module info too? e.g. for the first example - "Missing required table 'generators' as input to 'translator'" (or something). Maybe that's already assumed but could be good to lay out explicitly (addresses #6 too)

Schema enforcement must be tested to ensure:

- Compulsory tables and columns that are missing raise appropriate errors
- Optional tables and columns that are missing are correctly added
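A sketch of what tests covering these two requirements might look like; `enforce_schema` and its behaviour here are a stand-in for the project's real enforcement function, not its actual API:

```python
# Sketch of the testing requirements above. enforce_schema is a
# hypothetical stand-in: 'generators' compulsory, 'storage' optional.
import pandas as pd

def enforce_schema(tables: dict) -> dict:
    if "generators" not in tables:
        raise ValueError("Missing required table 'generators'")
    tables.setdefault("storage", pd.DataFrame(columns=["storage_unit"]))
    return tables

def test_missing_compulsory_table_raises():
    try:
        enforce_schema({})
    except ValueError as err:
        assert "generators" in str(err)
    else:
        raise AssertionError("expected ValueError for missing compulsory table")

def test_missing_optional_table_added():
    tables = enforce_schema({"generators": pd.DataFrame()})
    assert "storage" in tables and tables["storage"].empty

test_missing_compulsory_table_raises()
test_missing_optional_table_added()
```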

Is it useful to specify logging practices for stuff like this too, and include notes in this kind of docs about checking logs? I don't have a good sense of best practices for checking logging so that might be overkill


#### Empty DataFrame Handling

- Functions accept DataFrames with no rows without raising errors
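For instance, a downstream function should work unchanged on a zero-row input; the function and column names below are illustrative:

```python
# Illustrative zero-row handling: aggregation over an empty DataFrame
# returns a sensible default rather than raising.
import pandas as pd

def total_capacity(generators: pd.DataFrame) -> float:
    """Sum installed capacity; an empty column sums to 0, so no special
    casing is needed for zero-row inputs."""
    return float(generators["capacity_mw"].sum())

empty = pd.DataFrame(columns=["generator", "capacity_mw"])
result = total_capacity(empty)  # no error on a zero-row DataFrame
```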

For my clarity - this should be the case on the assumption that if the function requires a particular DataFrame to contain data, the error should have already been raised at the schema enforcement step?

| Benefit | Description |
|---------|-------------|
| **Data validation** | Invalid inputs are caught early with clear error messages |
| **Workflow flexibility** | Users can provide simplified datasets with only relevant tables/columns |
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all generators for greenfield optimisation) |

Suggested change
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all generators for greenfield optimisation) |
| **Robustness** | Functions handle edge cases gracefully (e.g., removing all existing generators for greenfield optimisation) |

?
