Skip to content

Conversation

@cmungall
Copy link
Member

Summary

  • Add new schema modules: assays, expression_units, gene_perturbation, specimen_processing, transplantation, data_catalog (access, contributor_roles, relations, resource_types), imaging_platforms, preservation (events, fixity, digital_objects), and osti_record
  • Update existing enums across bio, medical, energy, and environmental_health domains
  • Include ontology cache updates and regenerated code/documentation

Test plan

  • Run just validate to ensure schema validity
  • Run just test to verify tests pass
  • Review new enum definitions for completeness and correctness

🤖 Generated with Claude Code

New schema modules added:
- bio/assays.yaml - Biological assay types (OBI-based)
- bio/expression_units.yaml - Gene expression measurement units
- bio/gene_perturbation.yaml - Gene perturbation methods
- bio/specimen_processing.yaml - Specimen processing methods
- bio/transplantation.yaml - Transplantation types and related terms
- data_catalog/ - Data catalog terms (access, contributor_roles, relations, resource_types)
- medical/imaging_platforms.yaml - Medical imaging platforms
- preservation/ - Digital preservation (events, fixity, digital_objects)
- publishing/osti_record.yaml - OSTI record types

Updates to existing enums:
- developmental_stages, structural_biology, uniprot_species
- pediatric oncology enums
- environmental health exposures
- energy/renewable enums
- fao_soil

Also includes ontology cache updates, generated code, and documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings December 22, 2025 15:13
@claude
Copy link
Contributor

claude bot commented Dec 22, 2025

Claude finished @cmungall's task —— View job


Pull Request Review

I've conducted a comprehensive review of this large PR that adds significant new value sets and expands existing enums. Overall, this is a well-structured addition that follows project conventions. Below are my findings:

Strengths

1. Schema Organization & Structure

  • Excellent domain-based organization with new modules: bio/assays, data_catalog/*, preservation/*, publishing/osti_record
  • Consistent LinkML schema structure across all new files
  • Proper namespace management with appropriate prefix declarations
  • Clear imports and dependency management

2. Ontology Integration

  • Strong adherence to project guideline of using meaning: for ontology mappings
  • Good use of OBO ontologies (OBI, CHMO, BAO, ENVO, NCIT, etc.)
  • Proper CURIE formatting with prefixes declared in headers
  • Rich ontological context from sources like PREMIS 3.0, DCAT 3.0, EU Vocabularies

3. Documentation & Metadata

  • Comprehensive descriptions for enums and permissible values
  • Proper attribution with contributor ORCIDs
  • Good use of see_also references to standards and specifications
  • Consistent status marking (DRAFT)

4. Code Generation

  • Generated Python classes follow consistent patterns
  • Proper use of RichEnum base class
  • Clean enum member definitions
  • Auto-generated files properly excluded from manual editing

⚠️ Areas for Improvement

1. OSTI Schema Values - Missing Descriptions

  • src/valuesets/schema/publishing/osti_record.yaml:71-100 - Many enum values are empty placeholders ({}) without descriptions
  • This violates the project's emphasis on comprehensive documentation
  • Recommendation: Add descriptions for all OSTI enum values or mark them as placeholders that need documentation

2. Naming Convention Inconsistencies

  • File: src/valuesets/schema/data_catalog/access.yaml:1 uses data-catalog-access (with hyphens) while directory uses underscores
  • Some enum values in bio modules mix underscore and hyphen conventions (e.g., THREE_D_CONFOCAL_IMAGING vs natural naming)
  • Recommendation: Standardize on underscore convention for consistency

3. Generated File Size

  • The datamodel files show significant size increases (1097 additions in some files)
  • This could impact performance for large-scale usage
  • Consider: Breaking down very large enums into sub-modules if they become unwieldy

🔒 Security & Performance

Security: ✅ No security concerns identified

  • No sensitive data or credentials in schema files
  • Proper use of public ontology references
  • No executable code in schema definitions

Performance: ✅ Generally good

  • Generated enums are efficient for Python usage
  • OAK validator config properly updated to include new ontologies
  • Caching system for ontology terms is well-maintained

🧪 Test Coverage

Current Coverage: Adequate

  • Basic enum validation and structure tests exist
  • OLS validator tests ensure ontology mappings work
  • Gap: No specific tests for new enum modules

Recommendations:

  • Add smoke tests for new enum modules to ensure they instantiate properly
  • Consider adding validation tests for critical enums like preservation events and assays

📊 Change Analysis

Scale: 101,730 additions, 1,854 deletions across 88 files

  • New Schema Modules: 14 new YAML schema files
  • Generated Code: Properly regenerated Python, JSON schema, documentation
  • Ontology Cache: 151 new ontology term entries added

Impact: Low risk - mostly additive changes following established patterns

🎯 Specific Recommendations

  1. High Priority: Complete OSTI enum documentation (lines 71-100 in osti_record.yaml)

  2. Medium Priority:

    • Standardize naming convention (data-catalog vs data_catalog)
    • Add basic tests for new enum modules
  3. Low Priority:

    • Consider breaking down very large enum modules if they become difficult to navigate
    • Add more specific aliases where domain experts might use alternative terms

Validation Commands

The PR author should run:

just validate  # Ensure schema validity
just test      # Verify all tests pass  
just site      # Check documentation generation

Final Assessment

Overall: ✅ Approve with minor fixes recommended

This PR significantly expands the project's coverage of important domains while maintaining high standards for schema design and ontological rigor. The additions are valuable and well-structured. The main concern is the incomplete OSTI enum documentation, which should be addressed before merge.

The generated artifacts are properly maintained, dependencies are correctly updated, and the overall architectural patterns are sound.


Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR significantly expands the valuesets schema by adding multiple new domain-specific modules across biological research, data preservation, data catalogs, and publishing metadata, alongside updates to existing enumerations with corrected ontology mappings.

Key Changes:

  • Addition of 12 new schema modules covering assays, expression units, gene perturbation, specimen processing, transplantation, imaging platforms, data catalog (access, contributor roles, relations, resource types), preservation (events, fixity, digital objects), and OSTI record metadata
  • Correction of ontology term mappings across multiple existing enums including NCIT, ENVO, CHMO, and EDAM identifiers
  • Updates to existing medical, energy, and environmental health domain schemas with improved standardization

Reviewed changes

Copilot reviewed 78 out of 88 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/valuesets/validators/oak_config.yaml Extended ontology adapter configuration to support preservation vocabularies (PREMIS, DataCite, DCAT) and data catalog standards
src/valuesets/schema/valuesets.yaml Added imports for 12 new schema modules spanning biological assays, preservation, and data catalog domains
src/valuesets/schema/publishing/osti_record.yaml New comprehensive OSTI E-Link 2 metadata schema with workflow statuses, access limitations, and identifier types
src/valuesets/schema/preservation/*.yaml Three new PREMIS-based schemas for preservation events, digital object metadata, and cryptographic hash functions
src/valuesets/schema/data_catalog/*.yaml Four new DataCite/DCAT-based schemas for resource types, relations, contributor roles, and access rights
src/valuesets/schema/bio/*.yaml Six new biological research schemas covering assays, expression units, gene perturbation, specimen processing, transplantation, and imaging platforms
src/valuesets/schema/medical/*.yaml Corrected NCIT term mappings for neuroblastoma staging and pediatric oncology diagnosis categories
src/valuesets/schema/energy/*.yaml Fixed ontology mappings for hydrogen, geothermal, and bioenergy enums
src/valuesets/schema/environmental_health/exposures.yaml Added HHEAR exposure assessment enum with 24 environmental exposure categories
Generated Python enums Auto-generated Python code for all new schemas with proper metadata and rich enum support
Comments suppressed due to low confidence (10)

src/valuesets/schema/medical/pediatric_oncology/staging/neuroblastoma.yaml:1

  • The titles for INRG stages have been changed from descriptive format (e.g., 'L1 - Localized, no IDRFs') to standard format ('INRG Stage L1'). While more standardized, the descriptive titles were more informative. Consider whether the additional context in the original titles was valuable for users.
    src/valuesets/schema/medical/pediatric_oncology/staging/neuroblastoma.yaml:1
  • The 'meaning' field has been removed from VERY_LOW, LOW, and INTERMEDIATE risk categories while retained for HIGH. This inconsistency suggests either incomplete ontology mapping or intentional distinction. Ensure all risk categories have ontology mappings if they exist in NCIT, or document why HIGH is the only mapped category.
    src/valuesets/schema/bio/structural_biology.yaml:1
  • The meaning for NUCLEIC_ACID changed from NCIT:C813 to NCIT:C706. The title also changed to 'Nucleic Acids' (plural). Verify that NCIT:C706 is the correct term and that the plural form is intentional, as this affects semantic consistency.
    src/valuesets/schema/bio/structural_biology.yaml:1
  • Multiple CHMO ontology term corrections throughout this file (NEUTRON_CRYSTALLOGRAPHY from CHMO:0000175 to CHMO:0000698, WAXS from CHMO:0000213 to CHMO:0000207, NEGATIVE_STAIN_EM from FBbi:00000568 to FBbi:00000399). While these corrections improve accuracy, they represent breaking changes for any downstream systems using these mappings. Consider documenting these mapping changes.
    src/valuesets/schema/bio/structural_biology.yaml:1
  • The 'meaning' field (previously CHMO:0001107) has been removed from ROTATING_ANODE, CCD, CMOS, PHOTOSTIMULABLE_PHOSPHOR, and several other enum values. This reduces semantic interoperability. Verify these terms lack appropriate ontology mappings or restore them if removed in error.
    src/valuesets/schema/data_catalog/access.yaml:1
  • The CONTINUOUS frequency value uses the generic 'dcterms:Frequency' as its meaning rather than a specific term. Consider using a more specific Dublin Core term if available, or document why the generic term is appropriate for this value.
    src/valuesets/schema/energy/renewable/bioenergy.yaml:1
  • SYNGAS changed from having 'meaning: CHMO:0001501' to 'close_mappings: CHMO:0001501', and added 'gasification' as an alias. This downgrade from exact to close mapping changes semantic precision. Document why this term is now considered a close rather than exact match.
    src/valuesets/schema/energy/renewable/bioenergy.yaml:1
  • The title for BUTANOL changed from 'Biobutanol' to 'butan-1-ol' (chemical name), with 'Biobutanol' moved to aliases. While the chemical name is more precise, 'Biobutanol' in the context of renewable energy may be more meaningful. Verify this naming change aligns with domain conventions.
    src/valuesets/schema/earth_science/fao_soil.yaml:1
  • Multiple ENVO ontology term corrections in this file (ANDOSOLS from ENVO:00002030 to ENVO:00002232, CRYOSOLS from ENVO:00000134 to ENVO:00002236, etc.). These represent significant mapping changes that could affect data integration. Ensure these corrections are documented and communicated to users.
    src/valuesets/schema/bio/plant_sex.yaml:1
  • The title for HERMAPHRODITIC moved from after the meaning to before it. This is a formatting change only and doesn't affect functionality, but maintaining consistent field ordering improves schema readability.

@dragon-ai-agent dragon-ai-agent merged commit d0af7a2 into main Dec 22, 2025
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants