Posted 2015-08-02 3:58 AM (#485871) Subject: 2015-08-02 Stabilized at last?
Location: Pittsburgh, PA
I think we may be stabilized on some backup hardware for now.
For those of you interested in the forensics of all this, here's what went down.
The UPS (uninterruptible power supply) blew out literally as I was boarding a plane out of town on 7/22. The normal workaround would be to cut over to the redundant equipment, but I was in the middle of replacing that with what will be the new main server, so that wasn't an option.
I cut the UPS out of the loop, but the power spike had blown out the boot flash on the main ESXi server. Got that corrected and booted on 7/26, which restored basic services.
As y'all probably experienced, what remained up & running was a slow, unpredictable mess -- nothing worked right as a whole, but every component passed its individual tests with no problem. This kept me up all night for the whole week (not kidding when I say I haven't had more than 1-2 hours of sleep a night for over a week).
Aside from the damaged hardware, what appeared to cause the funkiness was that the Trusted Root Cert (TRC) from the Forward Look C.A. expired while the server was powered down, so it had no chance to auto-renew (it's a @#$@!#$ 10-year cert, so what are the chances that it would happen to be down on that particular day??? Ugh.). Even once the TRC was renewed, the two DCs were both having trouble with expiring Kerberos tickets, even after I reset their machine accounts. Once that was resolved, the firewall could only intermittently contact the DCs.
So where did the problem lie? The firewall, the two switches, the VMware server, or the DC VMs? Spinning up the backup DC VMs didn't help. VMotioning to the other VMware server didn't help, and swapping out the switches had no effect either. That left the firewall, so I spun up the latest VM copy, but still no luck (still the Kerberos ticket issue?). While running on that, I reverted the firewall to a backup and cut back over -- and still no luck.
The only component not fully swapped out was the firewall hardware, so I configured an old Juniper and swapped that into place and presto -- we're up & running on a piece of hardware I almost threw out last month.
So I have to either rebuild or replace the firewall as well. It's been an expensive month -- I had to add a new UPS in addition to the new server -- and I may still need to replace the firewall. I think we now have the luxury of time again to get the rest sorted. Please let me know in the 'Help' section if there are any severe problems. I will try to post warnings of any known downtime, but it should be minimal.
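For anyone who wants to guard against the expired-cert surprise described above, here is a minimal sketch in Python of a days-until-expiry check. It is only an illustration, not what runs on these servers: the function names, the 30-day warning threshold, and the sample dates are all made up for the example. The only library call it relies on is `ssl.cert_time_to_seconds` from the Python standard library, which parses a certificate's notAfter text.

```python
import ssl
import time

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate expires.

    not_after is the certificate's notAfter field in its usual text
    form, e.g. "Jul 24 12:00:00 2015 GMT" (a hypothetical date).
    now is seconds since the epoch; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical dates, chosen only to show the arithmetic.
    check_time = ssl.cert_time_to_seconds("Jul 14 12:00:00 2015 GMT")
    expiry = "Jul 24 12:00:00 2015 GMT"
    print(days_until_expiry(expiry, now=check_time))
    print(needs_renewal(expiry, now=check_time))
```

Run on a schedule from a machine that stays powered on, a check like this would flag a cert that is about to expire even if the server that normally auto-renews it happens to be down.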