FlyerTalk Forums - View Single Post - Site Down Again?
Nov 18, 2005 | 11:40 am
  #90  
SPIT
 
Join Date: Nov 2003
Location: SEA
Posts: 655
Originally Posted by Sneezy
Some change(s) they made broke something. And therefore they had a "learning experience". The symptoms are classic: system doesn't work after scheduled downtime, and instead of just returning the system to the original configuration, they just bulled their way through to "success". And no one really wants to talk about the self-inflicted period of living hell.

I just hope that the lesson they learned was not "when doing X to an operating system, don't do Y", but rather "don't ever do X to an operating system until it's been thoroughly tested on a test system, and have an upgrade plan such that if X doesn't work even after being tested (as it sometimes won't), we can revert to the pre-upgrade setup very quickly". If they think the lesson is the first and not the second, they think they're smarter than they really are. Which really means they're just smart enough to be dangerous, because it means that they think they can always take into account every contingency instead of facing the reality that they can't. No one can.

When trying to do X, there's always the possibility that something will come up that either prevents you from doing X or stops X from working. Dismissing that possibility because you can't think of how that could happen is underestimating reality and overestimating your ability to comprehend reality, and unfortunately that's all too common in the IT field. Heck, I'm sure there are quite a few people reading this post thinking, "That won't ever happen to me because I'm smarter than that, and I do cover all bases."

In my experience, the best IT personnel are the ones that are smart enough to know how smart they really are and who never try to be smarter than that. No "Hey, I think this might work" cowboy-ops on operational systems, no competitions to see who can write the most obscure C++ or Perl code. And no untested changes to operational systems.
You're making a lot of assumptions here. As someone who is in IT management for a large company (large enough that you've all heard of it), I can tell you that you can't test everything before implementation without extreme costs. (Try simulating millions of customers from around the world hitting your web servers before live production.) You can test, test, test, and then still implement and have unforeseen issues, possibly out of your control. Not all changes are easy to back out.

All the best websites have experienced extended downtimes. How many resources you put towards a fix depends on the loss of revenue. In this case the loss of revenue isn't huge, so I doubt FT would spend what someone like Amazon.com would spend to get things up and running again.

Don't just assume the guys behind FT are a bunch of cowboys with no testing or change control skills. They provide this site at no cost to you... so do you really want to badmouth them?