Introduction
This document captures user stories and requirements for Field Replaceable Unit (FRU) tracking in a modern system management platform. It is intended to guide the implementation and evaluation of FRU features, controls, and APIs across all relevant infrastructure management components.
Background
What is a FRU?
A Field Replaceable Unit (FRU) is any modular hardware component across the entire infrastructure that is designed to be wholly replaced in the field if a defect or failure occurs. The scope of a FRU is not limited to components within a single compute node, but extends to include components in the broader compute, networking, power, and cooling ecosystems.
The definition of a FRU is broad, encompassing granular components within a node like CPUs and GPUs, larger assemblies at the chassis level such as compute and switch blades, entire rack-level network switches, and even components like pumps and controllers in liquid cooling infrastructure.
Each FRU is uniquely identified and tracked throughout its lifecycle. FRU metadata is associated with its geolocation (e.g., xname), allowing the system to record all FRUs ever present in a physical location.
FRU Identification and Naming Convention
Each FRU is assigned a unique identifier (FRUID) using the format:
`<Component Type>.<Manufacturer>.<Part Number>.<Serial Number>_<Node Designator>`
For example: `Node.HPE.P53321001B.JN22300084_0`. Part Number and Serial Number correspond to the part number and serial number of the component itself. This convention applies to all component types (CPUs, DIMMs, GPUs, NICs, switches), each of which is also associated with a physical location identifier. In CSM, for example, this is accomplished using a specific xname standard defined by its underlying hardware management services. A formal, cross-platform naming convention for OpenCHAMI has not yet been established.
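As a minimal sketch (in Go, which OpenCHAMI services are typically written in), FRUID assembly might look like the following; the function and field names are illustrative assumptions, not an established API:

```go
package main

import (
	"fmt"
	"strings"
)

// buildFRUID assembles a FRUID of the form
// <Component Type>.<Manufacturer>.<Part Number>.<Serial Number>_<Node Designator>.
// The parameter names here are illustrative, not a fixed schema.
func buildFRUID(componentType, manufacturer, partNumber, serialNumber, nodeDesignator string) string {
	id := strings.Join([]string{componentType, manufacturer, partNumber, serialNumber}, ".")
	if nodeDesignator != "" {
		id += "_" + nodeDesignator
	}
	return id
}

func main() {
	// Reproduces the example from the text: Node.HPE.P53321001B.JN22300084_0
	fmt.Println(buildFRUID("Node", "HPE", "P53321001B", "JN22300084", "0"))
}
```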
FRU tracking maintains a real-time, authoritative record of all such components, their lifecycle events, locations, and relationships to the broader infrastructure. The system supports querying hardware history by geolocation (xname), showing all FRUs ever associated with a physical location, including event types (Detected, Removed) and timestamps.
Why FRU Tracking Matters
- Scalability and Complexity Management: While simple tracking methods may suffice for small systems, they fail at the scale of modern data centers. An automated, authoritative FRU tracking system is essential for managing the lifecycle of thousands of similar components, reducing human error, and enabling efficient troubleshooting in large, complex environments.
- Operational Visibility: Enables accurate, up-to-date inventory of all hardware assets, supporting troubleshooting, capacity planning, and lifecycle management.
- Audit and Compliance: Provides immutable event history for each component, supporting regulatory, security, and warranty audits.
- Automation and Integration: Serves as a foundation for automated workflows (e.g., incident response, asset decommissioning).
- Risk Reduction: Reduces downtime and human error by enabling rapid identification and replacement of failed or at-risk components.
Operational Scenarios and Querying
Hardware history for any FRU or physical location can be queried using CLI tools (e.g., with CSM, `cray hsm inventory hardware history describe` or `sat hwhist`). Output includes physical location (e.g., xname), FRUID, timestamp, and event type. Filtering by physical location allows viewing all FRUs associated with a node and its subcomponents over time.
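As a sketch of the kind of record such a history query returns, using the attributes named above; the struct and JSON field names are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// HistoryEvent is a hypothetical shape for one hardware-history entry,
// carrying the attributes described above: location, FRUID, event type,
// and timestamp.
type HistoryEvent struct {
	Location  string    `json:"location"`  // physical location, e.g. an xname
	FRUID     string    `json:"fruid"`
	EventType string    `json:"eventType"` // e.g. "Detected", "Removed"
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	ev := HistoryEvent{
		Location:  "x1000c0s0b0n0",
		FRUID:     "Node.HPE.P53321001B.JN22300084_0",
		EventType: "Detected",
		Timestamp: time.Now().UTC(),
	}
	out, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(out))
}
```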
User Stories
Viewing FRU Details
As an infrastructure admin, I want to view a real-time inventory of all FRUs across the environment, so that I can plan maintenance, upgrades, and capacity.
- Description: The admin can query the system for a complete, filterable list of all FRUs, including type, manufacturer, serial, location, status, and warranty.
- Requirements:
- API to list/filter FRUs by type, status, location, etc.
- Export inventory for reporting
As an infrastructure admin, I want to quickly identify a FRU in the field, so that I can verify its status and perform the correct maintenance.
- Description: The infrastructure admin enters a FRU identifier and receives current status, location, and maintenance history.
- Requirements:
- API for FRU lookup by serial/asset tag
Tracking FRU Lifecycle Events
As an infrastructure admin, I want to record the installation or removal of a FRU, so that the system maintains an accurate, auditable record.
- Description: The infrastructure admin updates the system when a FRU is installed, removed, or replaced, triggering event logging and (optionally) workflow automation. A payload sketch follows the requirements below.
- Requirements:
- API for recording FRU install/remove/replace
- Event log with timestamps and user attribution
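A sketch of what a lifecycle-event payload for such an API might look like. The endpoint path `/fru/events` and all field names are hypothetical, not an established OpenCHAMI interface:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// LifecycleEvent is a hypothetical request body for recording an
// install/remove/replace action with timestamp and user attribution.
type LifecycleEvent struct {
	FRUID     string    `json:"fruid"`
	Location  string    `json:"location"`  // physical location, e.g. an xname
	Action    string    `json:"action"`    // "install", "remove", or "replace"
	Actor     string    `json:"actor"`     // user attribution
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	ev := LifecycleEvent{
		FRUID:     "Node.HPE.P53321001B.JN22300084_0",
		Location:  "x1000c0s0b0n0",
		Action:    "install",
		Actor:     "admin@example.com",
		Timestamp: time.Now().UTC(),
	}
	body, _ := json.Marshal(ev)
	// Hypothetical endpoint; a real service would define its own path and auth.
	resp, err := http.Post("http://localhost:8080/fru/events", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("post failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```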
As an infrastructure admin, I want to track the movement and replacement history of each FRU, so that I can support audits and root cause analysis.
- Description: The admin can view a timeline of all events (install, remove, move, fail, replace) for any FRU, with timestamps and attribution.
- Requirements:
- Event log per FRU (preferably immutable)
- API to query FRU history
- Audit logging and export
- Events generated only when the actual FRU changes
As an infrastructure admin, I want to export FRU data for reporting purposes, so that I can generate inventory, compliance, or audit reports as needed.
- Description: The infrastructure admin can export filtered FRU inventory and event data in standard formats (e.g., CSV, JSON) for use in external reporting, compliance, or audit processes. An example of export filtering appears after the requirements.
- Requirements:
- API to export FRU inventory and event data (i.e. no pagination)
- Support for standard export formats (CSV, JSON)
- Filtering options for export (by type, status, date, etc.)
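As an illustration, export filtering might be expressed as query parameters on a single unpaginated export call; the `/fru/export` path and parameter names below are assumptions:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Hypothetical query parameters for a filtered, unpaginated export;
	// parameter names are invented for illustration.
	q := url.Values{}
	q.Set("type", "DIMM")
	q.Set("status", "Removed")
	q.Set("since", "2024-01-01T00:00:00Z")
	q.Set("format", "csv")

	fmt.Println("/fru/export?" + q.Encode())
}
```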
Logging Maintenance Actions on FRUs
As a hardware engineer, I want to log maintenance actions or comments for a FRU, so that all rework or repair activities are recorded with a timestamp and user attribution.
- Description: The hardware engineer can record a time-stamped comment or maintenance action for any FRU ID, even if the FRU identity does not change. This enables tracking of all repair, rework, or maintenance activities for audit and troubleshooting purposes. A sketch of such a log entry appears after the requirements.
- Requirements:
- API to log a maintenance action or comment for a FRU ID
- Time-stamped and user-attributed maintenance log entries
- Support for arbitrary user comments/messages
- Maintenance actions do not require FRU identity change
- Ability to query and export maintenance logs for FRUs
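A minimal sketch of what a maintenance log entry could look like; the type and field names are assumptions:

```go
package fru

import "time"

// MaintenanceLogEntry is a hypothetical shape for a time-stamped,
// user-attributed maintenance note attached to a FRU ID. Recording a
// note does not alter the FRU's identity or location history.
type MaintenanceLogEntry struct {
	FRUID     string    `json:"fruid"`
	Actor     string    `json:"actor"`   // user attribution
	Comment   string    `json:"comment"` // freeform rework/repair note
	Timestamp time.Time `json:"timestamp"`
}
```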
Discovery and Configuration
As an infrastructure admin, I want to configure and manage discovery endpoints, so that the system can automatically discover FRUs from various sources.
- Description: The admin can register discovery endpoints (Redfish BMCs, SNMP devices, etc.), configure their access credentials, set discovery schedules, and monitor discovery status. An endpoint configuration sketch follows the requirements below.
- Requirements:
- API to configure discovery endpoints
- Redfish-based discovery of FRUs and subcomponents
- Support for other discovery methods (e.g. SNMP, file import)
- Discovery endpoint authentication and credential management
- Discovery triggers: API (manual), scheduled, event-based (e.g. node power-on)
- Discovery scheduling and status monitoring
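A sketch of a discovery endpoint configuration record under these requirements; every field name here is an assumption, and credentials are referenced indirectly rather than stored inline:

```go
package fru

import "time"

// DiscoveryEndpoint is a hypothetical configuration record for a
// discovery source; field names are illustrative only.
type DiscoveryEndpoint struct {
	Name         string        `json:"name"`
	Protocol     string        `json:"protocol"`     // e.g. "redfish", "snmp", "file"
	Address      string        `json:"address"`      // e.g. BMC URL or SNMP host
	CredentialID string        `json:"credentialId"` // reference into a secret store, never an inline secret
	Schedule     time.Duration `json:"schedule"`     // polling interval; zero means manual/event-driven only
	LastStatus   string        `json:"lastStatus"`   // e.g. "ok", "unreachable", "in-progress"
}
```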
As an infrastructure admin, I want to manage custom metadata and properties for FRUs, so that I can track organization-specific information.
- Description: The admin can define custom fields and properties for different FRU types, add location-specific attributes, and manage extensible metadata schemas (a property-definition sketch follows the requirements).
- Requirements:
- Extensible metadata schema management
- Custom property definitions for FRU types
- Flexible location hierarchy configuration
- Metadata validation and constraints
- Extensible event metadata
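One possible shape for a type-specific custom property definition, with a simple validation constraint; all names are illustrative:

```go
package fru

// PropertyDef is a hypothetical definition of a custom, type-specific
// FRU property, including a simple validation constraint.
type PropertyDef struct {
	FRUType  string `json:"fruType"` // e.g. "DIMM", "GPU"
	Name     string `json:"name"`    // e.g. "costCenter"
	Kind     string `json:"kind"`    // "string", "number", "bool"
	Required bool   `json:"required"`
	Pattern  string `json:"pattern"` // optional regex constraint for string values
}
```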
As a system administrator, I want to configure data retention and quality policies, so that the system maintains clean, reliable FRU data over time.
- Description: The system admin can set retention policies for FRU events, configure data quality rules, manage synthetic FRUID generation, and ensure secure access to FRU data. A retention policy sketch follows the requirements below.
- Requirements:
- Configurable retention and pruning policies for FRU history events
- Incomplete data handling
- FRUID construction and synthetic ID management
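A sketch of a retention policy configuration, assuming the one-year minimum noted under Design Considerations below; field names are hypothetical:

```go
package fru

import "time"

// RetentionPolicy is a hypothetical configuration for pruning FRU
// history; the one-year floor mirrors the CSM practice noted in the
// Design Considerations section.
type RetentionPolicy struct {
	MinRetention   time.Duration `json:"minRetention"`   // e.g. 365 * 24 * time.Hour
	PruneRedundant bool          `json:"pruneRedundant"` // drop scan events that record no actual change
	PruneInterval  time.Duration `json:"pruneInterval"`  // how often the pruner runs
}
```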
Requirements
| ID | Requirement | Description |
|---|---|---|
| F1 | API to list/filter FRUs by type, status, location, etc. | Query/filter FRUs by type, status, location |
| F2 | Export inventory for reporting | Export inventory data for external use |
| F3 | API for FRU lookup by serial/asset tag | Fast field lookup of FRU status/history |
| F4 | API for recording FRU install/remove/replace | Event logging for maintenance actions |
| F5 | Event log with timestamps and user attribution | Track who performed what actions when |
| F6 | Event log per FRU (preferably immutable) | Track complete FRU lifecycle history |
| F7 | API to query FRU history | Retrieve FRU movement and event history |
| F8 | Audit logging and export | Support audits and compliance reporting |
| F9 | Events generated only when the actual FRU changes | Avoid redundant events, prune as needed |
| F10 | API to export FRU inventory and event data (no pagination) | Export inventory/event data for reporting |
| F11 | Support for standard export formats (CSV, JSON) | Multiple export format options |
| F12 | Filtering options for export (by type, status, date, etc.) | Configurable export filtering |
| F13 | API to configure discovery endpoints | Manage discovery endpoint configuration |
| F14 | Redfish-based discovery of FRUs and subcomponents | Walk Redfish endpoints for all relevant data |
| F15 | Support for other discovery methods (e.g. SNMP, file import) | Multiple discovery protocol support |
| F16 | Discovery endpoint authentication and credential management | Secure discovery endpoint access |
| F17 | Discovery triggers: API (manual), scheduled, event-based | Flexible discovery activation |
| F18 | Discovery scheduling and status monitoring | Automated discovery management |
| F19 | Extensible metadata schema management | Custom metadata framework |
| F20 | Custom property definitions for FRU types | Type-specific metadata |
| F21 | Flexible location hierarchy configuration | Configurable location structures |
| F22 | Metadata validation and constraints | Data integrity enforcement |
| F23 | Extensible event metadata | Custom fields, location mapping |
| F24 | Configurable retention and pruning policies for FRU events | Event lifecycle management |
| F25 | Incomplete data handling | Data integrity and completeness |
| F26 | FRUID construction and synthetic ID management | Identifier generation and management |
| F27 | API to log a maintenance action or comment for a FRU ID | Log maintenance actions/comments for FRUs |
| F28 | Time-stamped and user-attributed maintenance log entries | Track who performed maintenance and when |
| F29 | Support for arbitrary user comments/messages | Allow freeform maintenance notes |
| F30 | Maintenance actions do not require FRU identity change | Log actions without changing FRU identity |
Design Considerations
Key Implementation Insights from CSM/HMS-SMD
- Redfish-Based Discovery: FRU discovery walks specific Redfish endpoints (`/Chassis`, `/Systems`, `/Managers`) and their sub-collections (e.g., NICs, GPUs) to gather enclosure, node, and BMC data. Node-specific components are discovered via sub-URLs under `/Chassis/<systemID>`.
- FRUID Construction: Each discovered FRU is assigned a unique identifier in the form `<Type>.<Manufacturer>.<PartNumber>.<SerialNumber>`. If required fields are missing, a synthetic FRUID is generated and flagged as incomplete.
- Event Generation and Pruning: FRU history events are generated only when a FRU changes location (not on every scan) to avoid database bloat. A pruning mechanism ensures only meaningful change events are retained, supporting long-term retention without excessive growth (a sketch of this change detection follows this list).
- Normalized Data Model: FRU and location data are stored separately and linked by FRUID, allowing for detachment/reattachment as hardware is moved or removed.
- Discovery Triggers: Discovery can be triggered by manual API calls, endpoint addition/modification, or node power-on events.
- Retention Policy: FRU history events are retained for at least one year, with redundant events pruned to keep the database manageable.
- Basic Audit Trail: HMS-SMD provides FRU lifecycle audit capabilities through the `hwinv_hist` table, tracking hardware inventory events (Added, Removed, Scanned, Detected) with timestamps and event types.
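A minimal sketch of the change-detection idea described above: a scan result produces history events only when the FRU occupying a location differs from what was last recorded. The function and callback shapes are assumptions, not SMD's actual implementation:

```go
package fru

// recordIfChanged emits history events only on real change: a scan
// that finds the same FRU in the same slot produces nothing, so
// repeated scans do not bloat the event log.
// lastSeen maps location (e.g. an xname) -> FRUID.
func recordIfChanged(lastSeen map[string]string, location, fruid string, emit func(location, fruid, eventType string)) {
	prev, seen := lastSeen[location]
	if seen && prev == fruid {
		return // same FRU in the same slot: no event to record or prune later
	}
	if seen {
		emit(location, prev, "Removed") // the previous occupant left this slot
	}
	emit(location, fruid, "Detected") // the new occupant arrived
	lastSeen[location] = fruid
}
```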
Research and Design Questions for Discussion
Data Structures and Format
- Data Structures and Format: Research what data was being tracked in SMD and how it was being used.
- UUID-Based Identification: How should the service handle the transition from CSM's FRUID format to a UUID-based identification system?
- Component Hierarchy Tracking: How should we encode parent-child relationships between components (e.g., DIMM → Node → Chassis)? Are there any modifications to SMD's data model required to store these relationships?
- Extensible Properties: Do we need to support custom/extensible properties for both FRUs and locations to accommodate different hardware types and organizational needs?
- Physical Location Identifier Scope: What identifier should represent a component's physical location? While `xname` is the traditional identifier, we need to decide if it's the right choice for this service. A key consideration is that this location identifier must be strictly limited to physical hardware only, ensuring it is never used for virtual resources (e.g., VMs, containers) that run on the hardware.
Discovery and Data Collection
- Discovery Methods: Other than Redfish-based discovery for out-of-band data, what other discovery methods do we need to support? What security and access considerations apply to each method?
- Discovery Triggers: Are there any additional discovery triggers we should support beyond manual API calls? How about component state changes? Node power-on events?
- Discovery Coordination: How should discovery scans be coordinated to avoid conflicts between multiple discovery methods or overlapping scans of the same hardware?
- Import/Export: Do we need to support importing FRU data? Are there any specific import formats we should consider? This would also be useful for backup and recovery.
- Incomplete Data Handling: HMS-SMD generates synthetic FRUIDs when data is insufficient. Should we continue with this mechanism? If so, should we reuse the same format, `"FRUIDfor" + componentID`?
- Discovery Endpoint Management: How should discovery endpoints be configured, managed, and monitored? What authentication and configuration options are needed for different endpoint types?
- Asynchronous Discovery: How should long-running discovery scans be handled? What status reporting and progress tracking mechanisms are needed?
Service Architecture
- Independent service vs SMD: Should FRU tracking be its own service in OpenCHAMI, or should it be tightly integrated with existing SMD APIs? If independent, what kind of data relationships and synchronization are needed with SMD?
- Configuration Management: If FRU tracking becomes its own service, what strategies are needed for managing discovery endpoints for SMD? What about for the FRU tracking service from Magellan? Is there any FRU-tracking-specific configuration we'd need, such as retention policy settings?
- Monitoring and Observability: Are there any specific requirements for service health monitoring, metrics, and troubleshooting?
- Deployment Patterns: How will the FRU tracking service be deployed?
- Event Architecture: How should FRU lifecycle events be generated, stored, and potentially published internally and/or to external systems? What event sourcing or messaging patterns are needed?
- Event Standardization: Should FRU events follow a standard format like CloudEvents to ensure interoperability with other systems and event-driven architectures? What are the tradeoffs between custom event formats and standardized schemas? (A CloudEvents-wrapped FRU event is sketched below.)
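For discussion, a FRU lifecycle event wrapped in a CloudEvents envelope using the CloudEvents Go SDK might look like the following; the event type and source strings are invented for illustration:

```go
package main

import (
	"fmt"
	"time"

	cloudevents "github.com/cloudevents/sdk-go/v2"
)

func main() {
	// A FRU "Detected" event carried in a standard CloudEvents envelope;
	// the type and source values are hypothetical, not an agreed schema.
	e := cloudevents.NewEvent()
	e.SetID("5c6f2e2a-0000-4000-8000-000000000000")
	e.SetSource("/openchami/fru-tracking")
	e.SetType("org.openchami.fru.detected")
	e.SetTime(time.Now().UTC())
	_ = e.SetData(cloudevents.ApplicationJSON, map[string]string{
		"fruid":    "Node.HPE.P53321001B.JN22300084_0",
		"location": "x1000c0s0b0n0",
	})
	fmt.Println(e.String())
}
```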
Security
- Authentication and Authorization: What authentication and authorization mechanisms are needed for the FRU service? Should it align with existing SMD practices?
- Audit Requirements: Do we need audit capabilities beyond basic lifecycle events (Added, Removed, Scanned, Detected)?
Scalability and Performance
- Discovery Scheduling: What scheduling and queuing mechanisms are needed (if any) to handle long-running discovery scans?
- Event Processing: How should the service handle high-volume event generation during large discovery scans while maintaining real-time responsiveness for critical events?
- Data Retention: What retention policies and archival strategies are appropriate for different types of FRU data and events in OpenCHAMI environments?
UX
- Magellan CLI Integration: The FRU service's APIs should be exposed through a CLI to provide a user experience inspired by the core functionality of existing system administration tools. The primary focus should be on inventory and historical tracking. Relevant `sat` commands that serve as a model for this functionality include:
  - `sat hwinv`: for displaying the current hardware inventory.
  - `sat hwhist`: for querying the lifecycle event history of FRUs.
- Other utilities that exist in the ecosystem, such as hardware anomaly matching (`sat hwmatch`) or identifier translation (`nid2xname`), are considered distinct functionalities and are out of scope for this specific service.
- OpenCHAMI API Patterns: Review existing SMD API endpoints and patterns.
- Error Handling and User Experience: What error response formats and user feedback mechanisms should be provided? How should validation errors and system failures be communicated?