I'm looking for help diagnosing an intermittent hardware issue on an HPC node and would appreciate any insight into whether this points to a motherboard/SATA/backplane problem, a power issue, or something else.
After rebooting this node a few weeks ago, we began seeing SATA errors like the following in the kernel log. Each time, the link is hard reset and appears to recover:
ata7.00: exception Emask 0x10 SAct 0x400000 SErr 0x280100 action 0x6 frozen
ata7.00: irq_stat 0x08000000, interface fatal error
ata7: SError: { UnrecovData 10B8B BadCRC }
ata7.00: failed command: READ FPDMA QUEUED
ata7.00: cmd 60/80:b0:68:98:72/00:00:14:00:00/40 tag 22 ncq dma 65536 in
res 40/00:b0:68:98:72/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
However, the node also occasionally powers off unexpectedly and later reboots. The IPMI SEL contains entries such as "Power Unit Pwr Unit Status | Power off/down | Asserted", followed by boot events.
The system also reports a POST error after every boot: System Firmwares POST Err Sensor | Unknown Error | Asserted
After a couple of these reboots, during which everything else appears fine, the node powers off completely.
The SSD itself appears healthy according to its SMART data, although its CRC_Error_Count increases with every reboot. The PSUs also appear stable: each draws ~750 W under load, roughly 35% of its 2130 W maximum output.
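For reference, the SErr value from the kernel log (0x280100) can be decoded by hand to confirm which link-level flags are set. The sketch below is only illustrative: the bit positions and short labels are my understanding of the SATA SError register layout as printed by the Linux libata driver, and `SERR_BITS` and `decode_serr` are names made up for this example, so please treat the table as an assumption rather than an authoritative reference.

```python
# Assumed SError bit layout (labels as libata appears to print them).
# Bits not listed here are treated as reserved/unknown in this sketch.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the labels of all known SError bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# SErr value taken from the "exception Emask" line in the log above.
print(decode_serr(0x280100))  # ['UnrecovData', '10B8B', 'BadCRC']
```

The result matches the kernel's own "SError: { UnrecovData 10B8B BadCRC }" line, i.e. the errors are reported at the link/encoding level (10b/8b decode and CRC) rather than by the drive's media, which fits with the rising CRC_Error_Count.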
In your experience, does this pattern point to power instability, or to a SATA connection problem that somehow causes the system to power off intermittently? Additionally, is there any way to decode the recurring IPMI POST error?
Thanks so much for your help!