feat: Add CMIP7 variable mapping workflow with compound names #247
Open
siligam wants to merge 8 commits into prep-release from
- Create Excel file with 987 CMIP7 variables pre-populated from data request
- Add conversion script to generate YAML from Excel
- Include comprehensive README with usage instructions
- Excel has color-coded columns: blue (CMIP7 metadata), green (model mappings), yellow (processing info)
- Supports collaborative mapping for FESOM, OIFS, REcoM, LPJ-Guess models
- Includes dropdown validation for status and priority fields
- Update Excel to use compound_name as primary key (1,974 rows vs 987)
- Handle duplicate variable names across different contexts
  * 414 variables appear in multiple variants (e.g., tas has 20 variants)
  * Different frequencies (mon, day, yr, 6hr, etc.)
  * Different regions (GLB, ATA, GRL, NH, SH, etc.)
  * Different methods (tavg, tmax, tmin, tpt, etc.)
- Add columns: compound_name, table, region, method_level_grid
- Each variant can have different model mappings and preprocessing
- Update conversion script to use compound names as keys in YAML
- Addresses issue where variable_id alone was ambiguous

Example: tas variants now properly distinguished:
- atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
- atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
- atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)
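To illustrate why variable_id alone is ambiguous, here is a minimal sketch that groups compound names by variable. It assumes the second dot-separated field is the variable_id, as the examples above suggest. The helper name `group_by_variable` is hypothetical and not part of the scripts in this PR.

```python
from collections import defaultdict


def group_by_variable(compound_names):
    """Group compound names by variable_id.

    Assumption: variable_id is the second dot-separated field,
    as in 'atmos.tas.tavg-h2m-hxy-u.day.GLB'.
    """
    groups = defaultdict(list)
    for name in compound_names:
        parts = name.split(".")
        if len(parts) < 2:
            raise ValueError(f"not a compound name: {name!r}")
        groups[parts[1]].append(name)
    return dict(groups)


# The three tas variants quoted in the commit message.
names = [
    "atmos.tas.tavg-h2m-hxy-u.day.GLB",
    "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
    "atmos.tas.tmax-h2m-hxy-u.day.GLB",
]
variants = group_by_variable(names)
print(len(variants["tas"]))  # one variable_id, several compound names
```

Any variable_id whose group holds more than one entry cannot serve as a unique key, which is the case the compound_name column is meant to handle.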
- Remove unused Path import
- Fix f-strings without placeholders
- Sort imports according to isort/black profile
- Remove trailing whitespace
- All linting checks now pass (flake8, black, isort)
- Extract priority levels (Core, High, Medium, Low) from dreq_v1.2.2.2.json
- Add dreq_priority column showing CMIP7 Data Request priorities per compound name
- Remove empty long_name column (reduced from 20 to 19 columns)
- Remove Excel table/filter formatting for better compatibility with Numbers on Mac
- Keep simple frozen panes (header row and first 3 columns)
- Maintain color-coded columns and dropdown validation
- Total: 1,974 compound names with priority distribution:
  * High: 1,038 variables
  * Medium: 469 variables
  * Core: 131 variables
  * Low: 112 variables
- Add dreq_v1.2.2.2.json and dreq_v1.2.2.2_metadata.json (required by create script)
- Update README with:
  - Information about compound names (1,974 entries covering 987 unique variables)
  - CMIP7 Data Request priority levels (Core, High, Medium, Low)
  - Updated column structure reflecting actual Excel columns
  - Instructions for fetching data request files using CMIP7-data-request-api
  - Clarification that JSON files are required by create_cmip7_variable_mapping.py
Member
Why did you choose to do this in Excel instead of something more universal like plain text?
Contributor
Author
Based on the conversation on Pycmor (old SEAMORE channel), I was under the impression that more than one person would be contributing data for FESOM/OIFS/REcoM/LPJ-Guess. An online service like Google Sheets, Airtable, or maybe HedgeDoc would be better suited for gathering contributions easily. As that takes a little more effort to set up, I settled for Excel as an exemplar. CSV, JSON, and plain text were my initial thought, but I quickly got carried away with the online service idea.
Summary
This PR adds a collaborative workflow for mapping CMIP7 variables to model-specific outputs using Excel and YAML conversion.
Key Features
- Compound names (e.g. atmos.tas.tavg-h2m-hxy-u.day.GLB) as unique identifiers

Files Added
- cmip7_variable_mapping.xlsx - Pre-populated Excel file (1,974 rows, 19 columns)
- create_cmip7_variable_mapping.py - Script to generate Excel from data request
- excel_to_yaml.py - Script to convert Excel to YAML
- CMIP7_VARIABLE_MAPPING_README.md - Comprehensive documentation

Structure
Excel Columns (19 total)
Identifiers (3): compound_name, table, variable_id
CMIP7 Metadata (7): standard_name, long_name, units, frequency, modeling_realm, region, method_level_grid
Model Mappings (4): fesom, oifs, recom, lpj_guess
Processing Info (5): preprocess, formula, comment, status, priority
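As a rough illustration of how rows with these columns could become a structure keyed by compound_name, here is a minimal sketch. It is not the code in excel_to_yaml.py: the function name `rows_to_mapping`, the nested `models` key, and the sample cell values are all assumptions made for this example. Rows are represented as plain dicts, as they might look after being read from the spreadsheet.

```python
# Column groups taken from the list above.
MODEL_COLUMNS = ("fesom", "oifs", "recom", "lpj_guess")
INFO_COLUMNS = ("preprocess", "formula", "comment", "status", "priority")


def rows_to_mapping(rows):
    """Build a dict keyed by compound_name, skipping empty cells.

    The output layout (a nested 'models' dict) is an assumption,
    not the layout produced by the PR's conversion script.
    """
    mapping = {}
    for row in rows:
        key = row["compound_name"]
        if key in mapping:
            raise ValueError(f"duplicate compound_name: {key}")
        entry = {"variable_id": row["variable_id"]}
        # Keep only models that actually have a mapping filled in.
        models = {m: row[m] for m in MODEL_COLUMNS if row.get(m)}
        if models:
            entry["models"] = models
        for col in INFO_COLUMNS:
            if row.get(col):
                entry[col] = row[col]
        mapping[key] = entry
    return mapping


# Hypothetical row: the OIFS name and status value are placeholders.
rows = [
    {
        "compound_name": "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
        "variable_id": "tas",
        "fesom": None,
        "oifs": "oifs_var_placeholder",
        "status": "placeholder_status",
    }
]
mapping = rows_to_mapping(rows)
print(mapping)
```

If PyYAML is available, a dict like this could be serialised with `yaml.safe_dump(mapping, sort_keys=False)`. Raising on duplicate keys is a design choice here, since compound_name is meant to be unique.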
Why Compound Names?
Example:
tas (temperature) has 20 variants, including:
- atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
- atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
- atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)
- atmos.tas.tmin-h2m-hxy-u.day.GLB (daily minimum)

Each variant may require different preprocessing, so they need separate mappings.
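The compound names in the example can be split into their parts with a few lines of Python. This is a minimal sketch: the field order (realm, variable, method/level/grid, frequency, region) is inferred from the examples above and is not confirmed by the PR. `CompoundName` and `parse_compound_name` are hypothetical names.

```python
from typing import NamedTuple


class CompoundName(NamedTuple):
    # Field names follow the Excel columns; the order is an assumption
    # inferred from names like 'atmos.tas.tmax-h2m-hxy-u.day.GLB'.
    modeling_realm: str
    variable_id: str
    method_level_grid: str
    frequency: str
    region: str


def parse_compound_name(name: str) -> CompoundName:
    """Split a compound name into its five dot-separated fields."""
    parts = name.split(".")
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 dot-separated fields, got {len(parts)}: {name!r}"
        )
    return CompoundName(*parts)


parsed = parse_compound_name("atmos.tas.tmax-h2m-hxy-u.day.GLB")
print(parsed.variable_id, parsed.frequency, parsed.method_level_grid)
```

Two variants of tas then differ in at least one of frequency, region, or method_level_grid, which is why each needs its own mapping row.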
Usage
- Edit cmip7_variable_mapping.xlsx
- Run python excel_to_yaml.py to generate YAML

Testing
Related
Addresses the need for collaborative CMIP7 variable mapping discussed in internal documentation.