
[RFD] FRU Functional Spec Proposal #105


Introduction

This document captures user stories and requirements for Field Replaceable Unit (FRU) tracking in a modern system management platform. It is intended to guide the implementation and evaluation of FRU features, controls, and APIs across all relevant infrastructure management components.


Background

What is a FRU?

A Field Replaceable Unit (FRU) is any modular hardware component designed to be replaced whole in the field when a defect or failure occurs. The scope of a FRU is not limited to components within a single compute node; it extends to components in the broader compute, networking, power, and cooling ecosystems.

The definition of a FRU is broad, encompassing granular components within a node like CPUs and GPUs, larger assemblies at the chassis level such as compute and switch blades, entire rack-level network switches, and even equipment such as pumps and controllers in liquid cooling systems.

Each FRU is uniquely identified and tracked throughout its lifecycle. FRU metadata is associated with its geolocation (e.g., xname), allowing the system to record all FRUs ever present in a physical location.

FRU Identification and Naming Convention

Each FRU is assigned a unique identifier (FRUID) using the format:
<Component Type>.<Manufacturer>.<Part Number>.<Serial Number>_<Node Designator>

For example: Node.HPE.P53321001B.JN22300084_0, where the part number and serial number are those of the physical component. This convention applies to all component types (CPUs, DIMMs, GPUs, NICs, switches), and each FRU is additionally associated with a physical location identifier. In CSM, for example, this is accomplished using a specific xname standard defined by its underlying hardware management services. A formal, cross-platform naming convention for OpenCHAMI has not yet been established.
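
As an illustration, the convention above reduces to simple string formatting. The sketch below is hypothetical (the struct, field names, and helper are not an existing OpenCHAMI or CSM API); it only demonstrates the documented layout.

```go
package main

import "fmt"

// fru carries the identity fields used by the naming convention above.
// All names here are illustrative, not part of any committed API.
type fru struct {
	ComponentType  string // e.g. "Node", "Processor", "Memory"
	Manufacturer   string // e.g. "HPE"
	PartNumber     string
	SerialNumber   string
	NodeDesignator string // e.g. "0" in the example above
}

// fruid renders <Component Type>.<Manufacturer>.<Part Number>.<Serial Number>_<Node Designator>.
func fruid(f fru) string {
	return fmt.Sprintf("%s.%s.%s.%s_%s",
		f.ComponentType, f.Manufacturer, f.PartNumber, f.SerialNumber, f.NodeDesignator)
}

func main() {
	fmt.Println(fruid(fru{"Node", "HPE", "P53321001B", "JN22300084", "0"}))
	// Output: Node.HPE.P53321001B.JN22300084_0
}
```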

FRU tracking maintains a real-time, authoritative record of all such components, their lifecycle events, locations, and relationships to the broader infrastructure. The system supports querying hardware history by geolocation (xname), showing all FRUs ever associated with a physical location, including event types (Detected, Removed) and timestamps.

Why FRU Tracking Matters

  • Scalability and Complexity Management: While simple tracking methods may suffice for small systems, they fail at the scale of modern data centers. An automated, authoritative FRU tracking system is essential for managing the lifecycle of thousands of similar components, reducing human error, and enabling efficient troubleshooting in large, complex environments.
  • Operational Visibility: Enables accurate, up-to-date inventory of all hardware assets, supporting troubleshooting, capacity planning, and lifecycle management.
  • Audit and Compliance: Provides immutable event history for each component, supporting regulatory, security, and warranty audits.
  • Automation and Integration: Serves as a foundation for automated workflows (e.g., incident response, asset decommissioning).
  • Risk Reduction: Reduces downtime and human error by enabling rapid identification and replacement of failed or at-risk components.

Operational Scenarios and Querying

Hardware history for any FRU or physical location can be queried using CLI tools (e.g., in CSM, cray hsm inventory hardware history describe or sat hwhist). Output includes the physical location (e.g., xname), FRUID, timestamp, and event type. Filtering by physical location shows all FRUs associated with a node and its subcomponents over time.
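
For discussion purposes, a single history entry could be modeled as below. The field names and JSON layout are assumptions for illustration, not the actual output format of the CSM tools named above.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// historyEvent is a hypothetical shape for one hardware-history record,
// carrying the four fields named above: location, FRUID, timestamp, event type.
type historyEvent struct {
	Location  string    `json:"location"` // physical location, e.g. an xname
	FRUID     string    `json:"fruid"`
	Timestamp time.Time `json:"timestamp"`
	EventType string    `json:"eventType"` // e.g. "Detected", "Removed"
}

func main() {
	ev := historyEvent{
		Location:  "x1000c0s0b0n0",
		FRUID:     "Node.HPE.P53321001B.JN22300084_0",
		Timestamp: time.Now().UTC(),
		EventType: "Detected",
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	_ = enc.Encode(ev)
}
```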


User Stories

Viewing FRU Details

As an infrastructure admin, I want to view a real-time inventory of all FRUs across the environment, so that I can plan maintenance, upgrades, and capacity.

  • Description: The admin can query the system for a complete, filterable list of all FRUs, including type, manufacturer, serial, location, status, and warranty.
  • Requirements:
    • API to list/filter FRUs by type, status, location, etc.
    • Export inventory for reporting
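
A hypothetical client-side sketch of the list/filter API from this story: the /fru/v1/frus path and parameter names are invented for illustration, since the OpenCHAMI API surface is still an open design question.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildFRUListURL assembles a filtered inventory query against a
// hypothetical endpoint; every path and parameter name is an assumption.
func buildFRUListURL(base string, filters map[string]string) (string, error) {
	u, err := url.Parse(base + "/fru/v1/frus")
	if err != nil {
		return "", err
	}
	q := u.Query()
	for k, v := range filters {
		q.Set(k, v)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := buildFRUListURL("https://api.example.com", map[string]string{
		"type":   "Memory",    // filter by component type
		"status": "Populated", // filter by status
	})
	fmt.Println(s) // https://api.example.com/fru/v1/frus?status=Populated&type=Memory
}
```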

As an infrastructure admin, I want to quickly identify a FRU in the field, so that I can verify its status and perform the correct maintenance.

  • Description: The infrastructure admin enters a FRU identifier and receives current status, location, and maintenance history.
  • Requirements:
    • API for FRU lookup by serial/asset tag

Tracking FRU Lifecycle Events

As an infrastructure admin, I want to record the installation or removal of a FRU, so that the system maintains an accurate, auditable record.

  • Description: The infrastructure admin updates the system when a FRU is installed, removed, or replaced, triggering event logging and (optionally) workflow automation.
  • Requirements:
    • API for recording FRU install/remove/replace
    • Event log with timestamps and user attribution
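
One way the install/remove/replace payload could look, assuming a JSON POST body; the endpoint path, field names, and action values are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// lifecycleEvent is a hypothetical request body for recording an
// install/remove/replace, including the timestamp and user attribution
// the requirements above call for.
type lifecycleEvent struct {
	FRUID     string    `json:"fruid"`
	Location  string    `json:"location"`
	Action    string    `json:"action"` // "install", "remove", or "replace"
	Timestamp time.Time `json:"timestamp"`
	User      string    `json:"user"`
}

// recordEvent POSTs the event to an assumed endpoint.
func recordEvent(base string, ev lifecycleEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	resp, err := http.Post(base+"/fru/v1/events", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("record event: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	ev := lifecycleEvent{
		FRUID:     "Node.HPE.P53321001B.JN22300084_0",
		Location:  "x1000c0s0b0n0",
		Action:    "replace",
		Timestamp: time.Now().UTC(),
		User:      "admin",
	}
	body, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(body)) // the payload recordEvent would send
}
```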

As an infrastructure admin, I want to track the movement and replacement history of each FRU, so that I can support audits and root cause analysis.

  • Description: The admin can view a timeline of all events (install, remove, move, fail, replace) for any FRU, with timestamps and attribution.
  • Requirements:
    • Event log per FRU (preferably immutable)
    • API to query FRU history
    • Audit logging and export
    • Events generated only when a FRU actually changes

As an infrastructure admin, I want to export FRU data for reporting purposes, so that I can generate inventory, compliance, or audit reports as needed.

  • Description: The infrastructure admin can export filtered FRU inventory and event data in standard formats (e.g., CSV, JSON) for use in external reporting, compliance, or audit processes.
  • Requirements:
    • API to export FRU inventory and event data (i.e., no pagination)
    • Support for standard export formats (CSV, JSON)
    • Filtering options for export (by type, status, date, etc.)
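
Export itself is straightforward with the standard library; the sketch below streams a filtered inventory as CSV in one pass (no pagination). The row shape is the hypothetical one used in the earlier examples.

```go
package main

import (
	"encoding/csv"
	"os"
)

// fruRow is a hypothetical flattened record for export.
type fruRow struct {
	FRUID, Type, Manufacturer, Serial, Location, Status string
}

// writeCSV emits the full, already-filtered inventory in a single
// response, matching the no-pagination requirement above.
func writeCSV(rows []fruRow) error {
	w := csv.NewWriter(os.Stdout)
	defer w.Flush()
	if err := w.Write([]string{"fruid", "type", "manufacturer", "serial", "location", "status"}); err != nil {
		return err
	}
	for _, r := range rows {
		if err := w.Write([]string{r.FRUID, r.Type, r.Manufacturer, r.Serial, r.Location, r.Status}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = writeCSV([]fruRow{{
		FRUID: "Node.HPE.P53321001B.JN22300084_0", Type: "Node",
		Manufacturer: "HPE", Serial: "JN22300084",
		Location: "x1000c0s0b0n0", Status: "Populated",
	}})
}
```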

Logging Maintenance Actions on FRUs

As a hardware engineer, I want to log maintenance actions or comments for a FRU, so that all rework or repair activities are recorded with a timestamp and user attribution.

  • Description: The hardware engineer can record a time-stamped comment or maintenance action for any FRU ID, even if the FRU identity does not change. This enables tracking of all repair, rework, or maintenance activities for audit and troubleshooting purposes.
  • Requirements:
    • API to log a maintenance action or comment for a FRU ID
    • Time-stamped and user-attributed maintenance log entries
    • Support for arbitrary user comments/messages
    • Maintenance actions do not require FRU identity change
    • Ability to query and export maintenance logs for FRUs
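
A maintenance log entry reduces to a small, append-only record; the structure below is a hypothetical sketch. Note that it references a FRUID without altering FRU identity, per the requirements above.

```go
package main

import (
	"fmt"
	"time"
)

// maintenanceEntry is a hypothetical record for a logged maintenance
// action: time-stamped, user-attributed, freeform comment, keyed by FRUID.
type maintenanceEntry struct {
	FRUID     string // referenced, never changed, by a maintenance action
	User      string
	Timestamp time.Time
	Comment   string // arbitrary freeform note
}

func main() {
	e := maintenanceEntry{
		FRUID:     "Node.HPE.P53321001B.JN22300084_0",
		User:      "hweng",
		Timestamp: time.Now().UTC(),
		Comment:   "Reseated DIMM after correctable-error burst; no parts swapped.",
	}
	fmt.Printf("%s %s %s: %s\n", e.Timestamp.Format(time.RFC3339), e.FRUID, e.User, e.Comment)
}
```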

Discovery and Configuration

As an infrastructure admin, I want to configure and manage discovery endpoints, so that the system can automatically discover FRUs from various sources.

  • Description: The admin can register discovery endpoints (Redfish BMCs, SNMP devices, etc.), configure their access credentials, set discovery schedules, and monitor discovery status.
  • Requirements:
    • API to configure discovery endpoints
    • Redfish-based discovery of FRUs and subcomponents
    • Support for other discovery methods (e.g. SNMP, file import)
    • Discovery endpoint authentication and credential management
    • Discovery triggers: API (manual), scheduled, event-based (e.g. node power-on)
    • Discovery scheduling and status monitoring
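
One possible configuration record for a discovery endpoint, covering the requirements above; the fields, kind values, and credential-reference scheme are assumptions for discussion, not a committed design.

```go
package main

import "fmt"

// discoveryEndpoint is a hypothetical configuration record for one
// discovery source; none of these fields are an existing API.
type discoveryEndpoint struct {
	Name          string // operator-chosen label
	Kind          string // e.g. "redfish", "snmp", "file-import"
	Address       string // BMC URL, SNMP host, or import path
	CredentialRef string // reference into a secret store, never an inline password
	Schedule      string // cron-style schedule; empty means manual/event-triggered only
	LastStatus    string // e.g. "ok", "auth-failed", "in-progress" for status monitoring
}

func main() {
	ep := discoveryEndpoint{
		Name:          "rack-x1000-bmcs",
		Kind:          "redfish",
		Address:       "https://x1000c0s0b0.example.com",
		CredentialRef: "vault:secret/bmc/x1000c0s0b0",
		Schedule:      "0 */6 * * *", // every six hours
	}
	fmt.Printf("%+v\n", ep)
}
```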

As an infrastructure admin, I want to manage custom metadata and properties for FRUs, so that I can track organization-specific information.

  • Description: The admin can define custom fields and properties for different FRU types, add location-specific attributes, and manage extensible metadata schemas.
  • Requirements:
    • Extensible metadata schema management
    • Custom property definitions for FRU types
    • Flexible location hierarchy configuration
    • Metadata validation and constraints
    • Extensible event metadata
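
A sketch of how per-type custom properties with validation might work: a schema maps allowed keys to validators, and unknown keys or invalid values are rejected. Everything here is hypothetical.

```go
package main

import "fmt"

// propertySchema is a hypothetical per-FRU-type schema: allowed custom
// keys, each paired with a validator for its value.
type propertySchema map[string]func(any) error

// validate checks custom properties against the schema, enforcing
// metadata validation and constraints.
func validate(s propertySchema, props map[string]any) error {
	for k, v := range props {
		check, ok := s[k]
		if !ok {
			return fmt.Errorf("unknown property %q", k)
		}
		if err := check(v); err != nil {
			return fmt.Errorf("property %q: %w", k, err)
		}
	}
	return nil
}

func main() {
	dimmSchema := propertySchema{
		"costCenter": func(v any) error {
			if _, ok := v.(string); !ok {
				return fmt.Errorf("must be a string")
			}
			return nil
		},
	}
	err := validate(dimmSchema, map[string]any{"costCenter": "HPC-OPS-42"})
	fmt.Println("valid:", err == nil) // valid: true
}
```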

As a system administrator, I want to configure data retention and quality policies, so that the system maintains clean, reliable FRU data over time.

  • Description: The system admin can set retention policies for FRU events, configure data quality rules, manage synthetic FRUID generation, and ensure secure access to FRU data.
  • Requirements:
    • Configurable retention and pruning policies for FRU history events
    • Incomplete data handling
    • FRUID construction and synthetic ID management
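
Retention can be expressed as a policy plus a pruning pass; below is a minimal sketch under assumed names. A real implementation would likely also preserve the most recent event per FRU/location so history remains coherent.

```go
package main

import (
	"fmt"
	"time"
)

// retentionPolicy is a hypothetical knob for event lifecycle management.
type retentionPolicy struct {
	MaxAge time.Duration // events older than this become prune candidates
}

type event struct {
	FRUID     string
	Timestamp time.Time
}

// prune drops events older than the cutoff. This naive version prunes
// purely by age; see the caveat above about preserving the latest event.
func prune(events []event, p retentionPolicy, now time.Time) []event {
	cutoff := now.Add(-p.MaxAge)
	kept := events[:0]
	for _, e := range events {
		if e.Timestamp.After(cutoff) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	now := time.Now().UTC()
	evs := []event{
		{"Node.HPE.P53321001B.JN22300084_0", now.Add(-400 * 24 * time.Hour)},
		{"Node.HPE.P53321001B.JN22300084_0", now.Add(-24 * time.Hour)},
	}
	kept := prune(evs, retentionPolicy{MaxAge: 365 * 24 * time.Hour}, now)
	fmt.Println("kept:", len(kept)) // kept: 1
}
```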

Requirements

ID | Requirement | Description
F1 | API to list/filter FRUs by type, status, location, etc. | Query/filter FRUs by type, status, location
F2 | Export inventory for reporting | Export inventory data for external use
F3 | API for FRU lookup by serial/asset tag | Fast field lookup of FRU status/history
F4 | API for recording FRU install/remove/replace | Event logging for maintenance actions
F5 | Event log with timestamps and user attribution | Track who performed what actions when
F6 | Event log per FRU (preferably immutable) | Track complete FRU lifecycle history
F7 | API to query FRU history | Retrieve FRU movement and event history
F8 | Audit logging and export | Support audits and compliance reporting
F9 | Events generated only when a FRU actually changes | Avoid redundant events, prune as needed
F10 | API to export FRU inventory and event data (no pagination) | Export inventory/event data for reporting
F11 | Support for standard export formats (CSV, JSON) | Multiple export format options
F12 | Filtering options for export (by type, status, date, etc.) | Configurable export filtering
F13 | API to configure discovery endpoints | Manage discovery endpoint configuration
F14 | Redfish-based discovery of FRUs and subcomponents | Walk Redfish endpoints for all relevant data
F15 | Support for other discovery methods (e.g. SNMP, file import) | Multiple discovery protocol support
F16 | Discovery endpoint authentication and credential management | Secure discovery endpoint access
F17 | Discovery triggers: API (manual), scheduled, event-based | Flexible discovery activation
F18 | Discovery scheduling and status monitoring | Automated discovery management
F19 | Extensible metadata schema management | Custom metadata framework
F20 | Custom property definitions for FRU types | Type-specific metadata
F21 | Flexible location hierarchy configuration | Configurable location structures
F22 | Metadata validation and constraints | Data integrity enforcement
F23 | Extensible event metadata | Custom fields, location mapping
F24 | Configurable retention and pruning policies for FRU events | Event lifecycle management
F25 | Incomplete data handling | Data integrity and completeness
F26 | FRUID construction and synthetic ID management | Identifier generation and management
F27 | API to log a maintenance action or comment for a FRU ID | Log maintenance actions/comments for FRUs
F28 | Time-stamped and user-attributed maintenance log entries | Track who performed maintenance and when
F29 | Support for arbitrary user comments/messages | Allow freeform maintenance notes
F30 | Maintenance actions do not require FRU identity change | Log actions without changing FRU identity

Design Considerations

Key Implementation Insights from CSM/HMS-SMD

  • Redfish-Based Discovery: FRU discovery walks specific Redfish endpoints (/Chassis, /Systems, /Managers) and their sub-collections (e.g., NICs, GPUs) to gather enclosure, node, and BMC data. Node-specific components are discovered via sub-URLs under /Chassis/<systemID>.
  • FRUID Construction: Each discovered FRU is assigned a unique identifier in the form <Type>.<Manufacturer>.<PartNumber>.<SerialNumber>. If required fields are missing, a synthetic FRUID is generated and flagged as incomplete.
  • Event Generation and Pruning: FRU history events are generated only when a FRU changes location (not on every scan) to avoid database bloat. A pruning mechanism ensures only meaningful change events are retained, supporting long-term retention without excessive growth. A minimal sketch of this change-only rule appears after this list.
  • Normalized Data Model: FRU and location data are stored separately and linked by FRUID, allowing for detachment/reattachment as hardware is moved or removed.
  • Discovery Triggers: Discovery can be triggered by manual API calls, endpoint addition/modification, or node power-on events.
  • Retention Policy: FRU history events are retained for at least one year, with redundant events pruned to keep the database manageable.
  • Basic Audit Trail: HMS-SMD provides FRU lifecycle audit capabilities through the hwinv_hist table, tracking hardware inventory events (Added, Removed, Scanned, Detected) with timestamps and event types.
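
To make the change-only event rule above concrete, here is a minimal sketch: an event is emitted only when the FRU observed at a location differs from the last one recorded there. The names and in-memory map are hypothetical; per the normalized data model above, a real service would back this with its data store.

```go
package main

import "fmt"

// lastSeen maps a physical location to the FRUID last recorded there.
var lastSeen = map[string]string{}

// observe emits an event only when the FRU at a location actually
// changes, so repeated scans of unchanged hardware stay silent.
func observe(location, fruid string) (event string, emitted bool) {
	prev, ok := lastSeen[location]
	if ok && prev == fruid {
		return "", false // same FRU as last scan: no event, no database growth
	}
	lastSeen[location] = fruid
	if ok {
		return fmt.Sprintf("Removed %s; Detected %s at %s", prev, fruid, location), true
	}
	return fmt.Sprintf("Detected %s at %s", fruid, location), true
}

func main() {
	loc := "x1000c0s0b0n0"
	fmt.Println(observe(loc, "Node.HPE.P53321001B.JN22300084_0")) // first scan: event
	fmt.Println(observe(loc, "Node.HPE.P53321001B.JN22300084_0")) // rescan: no event
	fmt.Println(observe(loc, "Node.HPE.P53321001B.JN99999999_0")) // swap: event
}
```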

Research and Design Questions for Discussion

Data Structures and Format

  • Data Structures and Format: Research what data was tracked in SMD and how it was used.
  • UUID-Based Identification: How should the service handle the transition from CSM's FRUID format to a UUID-based identification system?
  • Component Hierarchy Tracking: How should we encode parent-child relationships between components (e.g., DIMM → Node → Chassis)? Are there any modifications to SMD's data model required to store these relationships? (One possible encoding is sketched after this list.)
  • Extensible Properties: Do we need to support custom/extensible properties for both FRUs and locations to accommodate different hardware types and organizational needs?
  • Physical Location Identifier Scope: What identifier should represent a component's physical location? While xname is the traditional identifier, we need to decide if it's the right choice for this service. A key consideration is that this location identifier must be strictly limited to physical hardware only, ensuring it is never used for virtual resources (e.g., VMs, containers) that run on the hardware.
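
To anchor the hierarchy question, one minimal encoding is a parent reference per component, forming a tree that can be walked from leaf to root (DIMM → Node → Chassis). This is only one option among several (an adjacency list is shown; closure tables or path enumeration are alternatives), and all names are hypothetical.

```go
package main

import "fmt"

// component encodes hierarchy with a single optional parent reference
// (an adjacency list); identifiers here are illustrative xnames.
type component struct {
	ID     string
	Type   string
	Parent string // empty for roots such as a chassis
}

// pathToRoot walks parent links, e.g. DIMM -> Node -> Chassis.
func pathToRoot(byID map[string]component, id string) []string {
	var path []string
	for id != "" {
		c, ok := byID[id]
		if !ok {
			break
		}
		path = append(path, c.ID)
		id = c.Parent
	}
	return path
}

func main() {
	byID := map[string]component{
		"x1000c0":         {ID: "x1000c0", Type: "Chassis"},
		"x1000c0s0b0n0":   {ID: "x1000c0s0b0n0", Type: "Node", Parent: "x1000c0"},
		"x1000c0s0b0n0d0": {ID: "x1000c0s0b0n0d0", Type: "Memory", Parent: "x1000c0s0b0n0"},
	}
	fmt.Println(pathToRoot(byID, "x1000c0s0b0n0d0"))
	// [x1000c0s0b0n0d0 x1000c0s0b0n0 x1000c0]
}
```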

Discovery and Data Collection

  • Discovery Methods: Other than Redfish-based discovery for out-of-band data, what other discovery methods do we need to support? What security and access considerations apply to each method?
  • Discovery Triggers: Are there any additional discovery triggers we should support beyond manual API calls? How about component state changes? Node power-on events?
  • Discovery Coordination: How should discovery scans be coordinated to avoid conflicts between multiple discovery methods or overlapping scans of the same hardware?
  • Import/Export: Do we need to support importing FRU data? Are there any specific import formats we should consider? This would also be useful for backup and recovery.
  • Incomplete Data Handling: HMS-SMD generates synthetic FRUIDs when data is insufficient. Should we continue with this mechanism? If so, should we reuse the same format "FRUIDfor" + componentID?
  • Discovery Endpoint Management: How should discovery endpoints be configured, managed, and monitored? What authentication and configuration options are needed for different endpoint types?
  • Asynchronous Discovery: How should long-running discovery scans be handled? What status reporting and progress tracking mechanisms are needed?

Service Architecture

  • Independent service vs SMD: Should FRU tracking be its own service in OpenCHAMI, or should it be tightly integrated with existing SMD APIs? If independent, what kind of data relationships and synchronization are needed with SMD?
  • Configuration Management: If FRU tracking becomes its own service, what strategies are needed for managing discovery endpoints for SMD, and for the FRU tracking service from Magellan? Is there any FRU-tracking-specific configuration we would need, such as retention policy settings?
  • Monitoring and Observability: Are there any specific requirements for service health monitoring, metrics, and troubleshooting?
  • Deployment Patterns: How will the FRU tracking service be deployed?
  • Event Architecture: How should FRU lifecycle events be generated, stored, and potentially published internally and/or to external systems? What event sourcing or messaging patterns are needed?
  • Event Standardization: Should FRU events follow a standard format like CloudEvents to ensure interoperability with other systems and event-driven architectures? What are the tradeoffs between custom event formats and standardized schemas?

Security

  • Authentication and Authorization: What authentication and authorization mechanisms are needed for the FRU service? Should it align with existing SMD practices?
  • Audit Requirements: Do we need audit capabilities beyond basic lifecycle events (Added, Removed, Scanned, Detected)?

Scalability and Performance

  • Discovery Scheduling: What scheduling and queuing mechanisms are needed (if any) to handle long-running discovery scans?
  • Event Processing: How should the service handle high-volume event generation during large discovery scans while maintaining real-time responsiveness for critical events?
  • Data Retention: What retention policies and archival strategies are appropriate for different types of FRU data and events in OpenCHAMI environments?

UX

  • Magellan CLI Integration: The FRU service's APIs should be exposed through a CLI to provide a user experience inspired by the core functionality of existing system administration tools. The primary focus should be on inventory and historical tracking. Relevant sat commands that serve as a model for this functionality include:
    • sat hwinv: For displaying the current hardware inventory.
    • sat hwhist: For querying the lifecycle event history of FRUs.
  • Other utilities that exist in the ecosystem, such as hardware anomaly matching (sat hwmatch) or identifier translation (nid2xname), are considered distinct functionalities and are out of scope for this specific service.
  • OpenCHAMI API Patterns: Review existing SMD API endpoints and patterns.
  • Error Handling and User Experience: What error response formats and user feedback mechanisms should be provided? How should validation errors and system failures be communicated?
