Originally Posted by flatlander
I don't have inside knowledge, but based on some years of experience with this sort of thing in IT, and on the reported problems, my guess is that they have built a set of software components that talk to each other, and some of them are not fast enough, or do not keep the same speed as load increases (i.e. they are not linearly scalable). This is pretty common in modern software systems: non-linear effects mean the system works fine up to a certain load or traffic level, then stops working, spectacularly.
Examples:
- systems that assume all data fits in main memory, so performance falls off a cliff once the data set exceeds it;
- systems that allocate and free a lot of memory for each work item, which puts heavy stress on the memory allocator and garbage collector (Java is particularly prone to this);
- database systems whose processing scales with the number of items, or the number of items squared (O(N) or O(N*N)), when what you really want is O(1) or O(log N). A rough sketch of this last case is below.
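To make that last point concrete, here's a minimal, hypothetical Java sketch (the names, the "session" idea and the sizes are all made up by me, nothing to do with the actual system): one lookup done as a linear scan over a growing collection, the same lookup done through a hash map. The timing is naive, no warm-up or proper benchmarking, but it shows the shape of the problem: the O(N) path gets slower as the data grows, the O(1) path stays roughly flat.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example: finding one "session" by id in a list (O(N) scan)
// versus a HashMap (O(1) on average). Naive timing, not a real benchmark.
public class LookupScaling {

    record Session(long id, String user) {}

    public static void main(String[] args) {
        for (int n : new int[]{1_000, 10_000, 100_000, 1_000_000}) {
            List<Session> list = new ArrayList<>(n);
            Map<Long, Session> map = new HashMap<>(n * 2);
            for (long i = 0; i < n; i++) {
                Session s = new Session(i, "user" + i);
                list.add(s);
                map.put(i, s);
            }

            long target = n - 1L; // worst case for the linear scan

            long t0 = System.nanoTime();
            Session fromList = null;
            for (Session s : list) {             // O(N): cost grows with the data set
                if (s.id() == target) { fromList = s; break; }
            }
            long scanNs = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            Session fromMap = map.get(target);   // O(1): cost stays roughly flat
            long mapNs = System.nanoTime() - t1;

            System.out.printf("n=%,9d  linear scan: %,12d ns  hash lookup: %,8d ns  (%s / %s)%n",
                    n, scanNs, mapNs, fromList.user(), fromMap.user());
        }
    }
}
```

The absolute numbers don't matter; what matters is that one column grows roughly in step with n while the other doesn't. The same shape shows up in database queries that miss an index, or in per-request allocation churn: everything looks fine in testing with small data and light load, then falls over in production.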
Building linearly scalable systems is hard, and some modern software abstraction frameworks make it harder. I suspect this project had more new software design people and fewer grumpy old experienced people than it needed.
Experienced scalability engineers are usually somewhat grumpy. A couple of decades of telling people they're going to fail and how they will fail, then having to fix it when it goes how you said, will do that to you.