IntroductionIn my role as family system administrator, I try to keep the NAS running -- don't want to lose all those billions of cat photos that the kids take. On my NAS, I run the following servers to collect and export machine measurements:
Bad Disk Investigation
When I awoke on Friday the 27th, I noticed the electric power must have went out during the night because the clock on the microwave was flashing. The NAS didn't reboot because I use a UPS. That is when I noticed a hard disk failed and now needs to be replaced. I replaced the drive two days later. But I wondered -- did the power event cause the drive to fail? The answer turned out to be no -- but I thought it would be useful to capture the data I used to make this conclusion.
- Fri, 20 Nov 2015 10:51:48 GMT:monitoring noticed drive failed
- Fri, 27 Nov 2015 05:21:15 GMT:very short power event
- Fri, 27 Nov 2015 12:00:00 GMT:noticed hard disk failed
- Sat, 28 Nov 2015 20:24:52 GMT: replaced failed hard disk
When did the drive fail?
I use smartctl to measure the temperature of my hard disks. I noticed that the temperature of one of my disks was missing. This measurement helped me determine when the drive failed.
When did the power event happen?
I poll the status of the UPS every 10 seconds. The power event had to be shorter than 10 seconds, because the monitoring never even noticed this event -- the UPS status was always "on mains". I found a spike in the UPS remaining capacity measurement -- that is the only clue I had about when the power event happened. Comparing these two graphs showed that the disk failed about a week before the power event.
Did any alerts fire?
Prometheus stores pending/firing alerts in the time series database. I wanted to know if there were any firing alerts that I missed -- did I miss an alert for a week about my bad disk? Nope -- the only firing alert was when I rebooted the NAS. I will need to add another alert for failing disks.
What did I learn?
- The FreeBSDism of using 'sysctl -n kern.disks' to find a list of disks in a machine appears to quietly drop failed or missing hard disks -- this is a feature for removable USB drives. I cannot use this measurement to track the disks expected to be alive on a machine.
- Polling the UPS might not give the required data for very short power events. I might want to enable SNMP traps to capture very short power events.
- I don't want to leave my raid array with one dead disk for a whole week. I am going to setup some alerts for this!