Anonymous 18:40

SATA issues and random reboots

I'm looking for help diagnosing an intermittent hardware issue on an HPC node and would appreciate any insight into whether this points to a motherboard/SATA/backplane problem, a power issue, or something else.

Since rebooting this node a few weeks ago, we have been seeing SATA-related errors in the kernel logs. After each error, the SATA link seemingly recovers:

ata7.00: exception Emask 0x10 SAct 0x400000 SErr 0x280100 action 0x6 frozen
ata7.00: irq_stat 0x08000000, interface fatal error
ata7: SError: { UnrecovData 10B8B BadCRC }
ata7.00: failed command: READ FPDMA QUEUED
ata7.00: cmd 60/80:b0:68:98:72/00:00:14:00:00/40 tag 22 ncq dma 65536 in
                                    res 40/00:b0:68:98:72/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
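
For reference, the SErr and SAct values in the log above can be decoded by hand. The following is a minimal Python sketch, assuming the SATA SError bit layout and the short names that the Linux libata error handler prints. The helper names `decode_serr` and `active_tags` are hypothetical and only for illustration:

```python
# Bit positions of the SATA SError register, paired with the short names
# printed by the Linux libata error handler (assumed from the kernel source).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all SError bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in the SAct bitmap."""
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    # Values copied from the first log line: SErr 0x280100, SAct 0x400000.
    print(decode_serr(0x280100))  # ['UnrecovData', '10B8B', 'BadCRC']
    print(active_tags(0x400000))  # [22]
```

The decoded names match the `SError: { UnrecovData 10B8B BadCRC }` line, and tag 22 matches the failed command. Both 10B8B and BadCRC are detected on the SATA link itself, which fits the CRC_Error_Count that rises while the rest of the SMART data looks healthy.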

However, the node also occasionally powers off unexpectedly and later reboots. The IPMI SEL contains entries such as "Power Unit Pwr Unit Status | Power off/down | Asserted", followed by boot events.

The system also reports a POST error after every boot: System Firmwares POST Err Sensor | Unknown Error | Asserted

After a couple of such reboots where everything else is seemingly fine, the node powers off completely.

The SSD itself appears healthy according to its SMART data, although the CRC_Error_Count increases with every reboot. The PSUs also appear stable (each draws ~750 W under load, out of a max power output of 2130 W).

In your experience, does this pattern suggest that power instability issues are at play, or is it a SATA connection issue that somehow leads to the system powering off intermittently? Additionally, is there any way to decode the recurring IPMI POST error?

Thanks so much for your help!



Top Answer/Comment:

Comment: Contact vendor support, since only business-related questions or problems are on topic here. Normally, IMHO, you would consider replacing the device in a case like this before you just play around.
