
The marine DA should not fail when the obs are corrupted #2008

@guillaumevernieres

Hardening prep_ocean_obs.py and Establishing Robust Ocean Observation Pre-Processing

Background and Motivation

Recent real-time parallel failures (see obsForge issue NOAA-EMC/obsForge#172) exposed remaining fragility in the GDAS ocean DA workflow when encountering:

  • Missing observations
  • Partially or fully corrupted NetCDF observation files

While the system is expected to tolerate the absence of observations, corrupted files can still pass early stages and trigger failures during DA execution.

This effort aims to harden the observation pre-processing stage so that bad inputs degrade gracefully instead of crashing the cycle.

In addition, this work will spearhead a broader refactor of ocean observation handling by introducing a dedicated pre-processing phase that:

  • Applies basic QC and thinning
  • Validates observations independently of the main DA
  • Uses simple, stable backgrounds derived from the World Ocean Atlas (WOA) climatology

The primary implementation target is:

ush/soca/prep_ocean_obs.py


Objectives

  1. Make real-time and retrospective GDAS ocean DA robust to:

    • Empty observation sets
    • Corrupted or partially unreadable IODA/NetCDF files
  2. Establish a defensive pre-processing layer for ocean observations that:

    • Applies basic QC and thinning
    • Identifies problematic observation sources early
    • Prevents a failure in a single observation source from crashing the full DA cycle

Failures must be non-fatal, clearly logged, and actionable.


Proposed Multi-Level Validation Strategy

Level 1 — Lightweight NetCDF Integrity Checks

Purpose: Detect obviously corrupted observation files before DA.

For each candidate NetCDF observation file:

  • Attempt to open the file using netCDF4 or xarray
  • Perform minimal structural validation:
    • File opens successfully
    • Required dimensions exist
    • Key variables are readable (no HDF, truncation, or I/O errors)

Behavior:

  • If a file fails validation:

    • Remove it from the list of files to be assimilated
    • Emit a clear warning identifying:
      • The file name
      • The failure reason
  • Continue processing remaining files

  • Do not abort the cycle; if all observation files are rejected, skip DA cleanly with explicit messaging

This check should be:

  • Fast
  • Conservative (better to skip than crash)
  • Explicitly non-fatal
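
The Level-1 check could be sketched roughly as follows. This is a minimal illustration, not the actual prep_ocean_obs.py code: the function names, the logger name, and the required dimension and variable names ("Location", "MetaData/latitude", "MetaData/longitude") are assumptions and should be replaced with whatever the IODA files in use actually require.

```python
import logging
from typing import Callable, Dict, Iterable, List, Optional, Tuple

logger = logging.getLogger("prep_ocean_obs")  # hypothetical logger name

# Assumed names: adjust to the dimensions/variables the real files must carry.
REQUIRED_DIMS = ("Location",)
KEY_VARIABLES = ("MetaData/latitude", "MetaData/longitude")


def check_netcdf_file(path: str) -> Optional[str]:
    """Return None if the file passes the checks, else a short failure reason."""
    import netCDF4  # lazy import: the filtering logic below does not depend on it

    try:
        with netCDF4.Dataset(str(path), "r") as ds:
            for dim in REQUIRED_DIMS:
                if dim not in ds.dimensions:
                    return f"missing required dimension '{dim}'"
            for name in KEY_VARIABLES:
                ds[name][:]  # force a real read to surface HDF/truncation errors
    except Exception as exc:  # deliberately broad: better to skip than crash
        return f"{type(exc).__name__}: {exc}"
    return None


def filter_valid_files(
    paths: Iterable[str],
    check: Callable[[str], Optional[str]] = check_netcdf_file,
) -> Tuple[List[str], Dict[str, str]]:
    """Split candidate files into (valid, rejected-with-reason); never raises on bad files."""
    valid: List[str] = []
    rejected: Dict[str, str] = {}
    for path in paths:
        reason = check(path)
        if reason is None:
            valid.append(path)
        else:
            rejected[str(path)] = reason
            logger.warning("Skipping observation file %s: %s", path, reason)
    return valid, rejected
```

The checker is passed in as a parameter so the filtering and logging logic can be unit-tested without real NetCDF files; the caller decides what to do when the returned list of valid files is empty.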

Level 2 — Per-Observation-Space DA Smoke Tests (WOA Background)

Purpose: Catch subtler failures that pass NetCDF checks but break DA.

After Level-1 filtering:

  • Run a set of low-resolution 3DVAR smoke tests
  • Each test uses:
    • A single observation space
    • Simplistic or static background error
    • A background generated from WOA climatology
    • Minimal resolution and iteration count

Assumptions:

  • The WOA-based background is stable and reproducible
  • A DA failure at this stage strongly suggests:
    • Observation corruption
    • Invalid values
    • Metadata inconsistencies not caught by Level 1

Behavior:

  • If a smoke test fails for a given observation space:
    • Assume the observation source is invalid
    • Exclude it from the full DA run
    • Emit a warning including:
      • Observation space name
      • Failure context (log excerpt if available)

This step acts as a functional validation of observations, not just file integrity.
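
One possible shape for the Level-2 driver is sketched below, assuming each smoke test is launched as a separate process whose exit code signals success. The function name, the `build_command` callback (which would return the command line for the low-resolution 3DVAR on one observation space), and the timeout and log-tail defaults are all hypothetical; the real executable and its configuration are not specified in this issue.

```python
import logging
import subprocess
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

logger = logging.getLogger("prep_ocean_obs")  # hypothetical logger name


def run_smoke_tests(
    obs_spaces: Iterable[str],
    build_command: Callable[[str], Sequence[str]],
    timeout_s: float = 600.0,
    log_tail_lines: int = 20,
) -> Tuple[List[str], Dict[str, str]]:
    """Run one smoke test per observation space.

    Returns (passed, failed), where failed maps the observation space name
    to a short failure context (exit code plus the tail of the output).
    """
    passed: List[str] = []
    failed: Dict[str, str] = {}
    for name in obs_spaces:
        try:
            proc = subprocess.run(
                list(build_command(name)),
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            ok = proc.returncode == 0
            tail = (proc.stdout + proc.stderr).splitlines()[-log_tail_lines:]
            context = f"exit code {proc.returncode}\n" + "\n".join(tail)
        except subprocess.TimeoutExpired:
            ok, context = False, f"timed out after {timeout_s} s"
        except OSError as exc:
            ok, context = False, f"could not launch smoke test: {exc}"

        if ok:
            passed.append(name)
        else:
            failed[name] = context
            logger.warning(
                "Smoke test failed for observation space '%s'; "
                "excluding it from the full DA run.\n%s",
                name,
                context,
            )
    return passed, failed
```

Running each observation space in its own process means a crash or hang in one test cannot take down the others, which matches the intent that a single bad source is excluded rather than fatal.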


Observation Pre-Processing Scope

This workflow is intended to evolve into a general ocean observation pre-processing stage, responsible for:

  • Input validation
  • Basic quality control
  • Thinning / subsampling
  • Early detection of problematic observation sources

The output of this stage should be a clean, minimal, DA-ready observation set that downstream workflows can safely consume.


Expected System Behavior

  • No observations available
    DA proceeds safely with zero observations.

  • Some observation files corrupted
    Corrupted files are skipped and DA continues.

  • All observation files corrupted
    DA is cleanly skipped with explicit messaging.

  • Corruption detected during smoke tests
    Offending observation space is excluded and flagged.

No silent failures.
No hard crashes.
Warnings should be loud, explicit, and traceable.
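
The cases above reduce to a small decision rule, sketched here under the assumption that the pre-processing stage knows how many candidate files it started with and how many survived validation. The enum and function names are illustrative only.

```python
from enum import Enum


class CycleAction(Enum):
    RUN_DA = "run DA with the validated observations"
    RUN_DA_NO_OBS = "run DA with zero observations"
    SKIP_DA = "skip DA and report that every input was rejected"


def decide_cycle_action(n_candidates: int, n_valid: int) -> CycleAction:
    """Map the outcome of validation onto the expected system behavior."""
    if n_candidates < 0 or not 0 <= n_valid <= n_candidates:
        raise ValueError("expected 0 <= n_valid <= n_candidates")
    if n_candidates == 0:
        # No observations available: DA proceeds with zero observations.
        return CycleAction.RUN_DA_NO_OBS
    if n_valid == 0:
        # Every candidate was rejected: DA is skipped with explicit messaging.
        return CycleAction.SKIP_DA
    # Some (or all) candidates survived: bad ones are skipped, DA continues.
    return CycleAction.RUN_DA
```

Keeping "nothing arrived" distinct from "everything was rejected" preserves the difference between the first and third cases listed above.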


Deliverables

  • Enhancements to prep_ocean_obs.py implementing:

    • Level-1 NetCDF integrity checks
    • Level-2 per-observation-space DA smoke tests using WOA backgrounds
  • Clear logging and warning messages

  • Inline documentation describing failure-handling logic

  • A foundation for future QC and thinning extensions


Non-Goals

  • Diagnosing the root cause of observation corruption
  • Heavy-weight DA configurations during pre-processing
  • Fixing obsForge itself (this work is defensive by design)

Original issue

For now, we want to fix obsForge and then restart things, but for actual operations, is this an EE2 thing where we'd not want gfs_marineanlvar to fail, even if there were obs file issues?

Originally posted by @JessicaMeixner-NOAA in #2007

I thought this was addressed, but apparently not.
We were relying on the filter

  missing file action: warn

but we clearly misunderstood the scope of what it does.
