TFTS 1: What NOT to do when your server crashes randomly.

I recently built a little server whose main task is to serve files. I bought a motherboard that had a decent number of SATA ports, added two additional SATA controller cards and packed the whole case full of disks. As the server is at home, I didn’t bother with enterprise-grade hardware.

Fun was had, but soon I noticed it crashing when doing heavy I/O. First, I suspected the mainboard, and some quick googling suggested turning off the Enhanced Halt State (C1E), among some other new CPU features, as this apparently made the mainboard hiccup. When I did this, the system seemed to be stable. Happily, I went on.

But soon the crashes started again, and again mostly when heavy I/O was going on. As it’s not uncommon for cheap SATA chipsets to overheat, I now suspected one of the additional controllers to be the culprit; to be more exact, the one which was used when accessing the “big” RAID in this system (the other one had just a few small disks connected and never moved a lot of data). What should have put me off was that it wasn’t exactly cheap back in the day, I never had problems with it, and while it wasn’t a big name brand, the manufacturer is still there and enjoys a good reputation. But I just figured that its age (over 3 years) was taking its toll and bought a brand-new Adaptec controller.

Now the system was okay for a while, but as you may have guessed: crashes again. In desperation, I swapped out the second controller as well, but no dice.

Now that I had the controllers out of the picture, I wondered what else could crash the server that randomly. The RAM? Nah, c’mon, usually that’s solid. It either works or not. But still, can’t hurt to test.

After 2 minutes of memtest86+, I got errors left, right, and centre.

Moral of the story: Don’t believe the RAM you buy for your home systems has the same reliability as the enterprise RAM you use in your employer’s server room. And always check the RAM first if you can’t work out exactly what’s failing in your system.
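
For the curious, here is a rough Python sketch of the basic idea behind a memory pattern test: write a known bit pattern into a buffer, read it back, and report every byte that differs. It is only an illustration and not how memtest86+ is implemented. The function names (`find_mismatches`, `pattern_test`), the buffer size and the choice of patterns are my own assumptions. A user-space script like this can only touch the memory the operating system hands it, and CPU caches may hide real faults, so it is no substitute for booting a proper memory tester.

```python
def find_mismatches(buf, pattern):
    """Return the offsets of all bytes in buf that differ from pattern."""
    return [offset for offset, value in enumerate(buf) if value != pattern]


def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern in turn and read it back.

    Returns a list of (pattern, offset) tuples for every mismatch found.
    The size and patterns are illustrative assumptions, not tuned values.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern to every byte of the buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read everything back and record the bytes that changed.
        errors.extend((pattern, offset) for offset in find_mismatches(buf, pattern))
    return errors


if __name__ == "__main__":
    problems = pattern_test()
    if problems:
        print(f"{len(problems)} mismatches found, first: {problems[0]}")
    else:
        print("no mismatches found in this buffer")
```

An empty result here proves very little; a non-empty one is a strong hint to run memtest86+ right away.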


