IT outsourcing prediction

Old May 29, 2017, 11:48 am
  #151  
 
Join Date: Dec 2015
Location: UK
Programs: BAEC Silver, *A, Marriott
Posts: 181
Originally Posted by toothy
More likely - the failover plan has never been tested
I find this astonishing if true, but some chatter in another forum seems to confirm that airlines tend to test failover very rarely, unlike financial institutions that have to comply with regulatory minimums and audits on DR compliance.

Talk about penny wise, one hundred million pound foolish!
Egoldstein is offline  
Old May 29, 2017, 1:18 pm
  #152  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by Egoldstein
I find this astonishing if true, but some chatter in another forum seems to confirm that airlines tend to test failover very rarely, unlike financial institutions that have to comply with regulatory minimums and audits on DR compliance.

Talk about penny wise, one hundred million pound foolish!
It's actually very risky executing a DR plan. You may leave yourself with nothing if it doesn't work. So....why test it if you'll probably never need it anyway?

And if you do end up needing it, you can fix the problems at that time.
Banana4321 is offline  
Old May 29, 2017, 2:13 pm
  #153  
 
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
Originally Posted by Banana4321
It's actually very risky executing a DR plan. You may leave yourself with nothing if it doesn't work. So....why test it if you'll probably never need it anyway?

And if you do end up needing it, you can fix the problems at that time.
As insurance that it is there WHEN you need it, because these days it is not a matter of IF. A well-thought-out, well-engineered DR plan carries a degree of risk, but that is why it must be well prepared, with back-out contingencies at every stage. You have systems and processes that have been tested rigorously to ensure that whatever backup you have does work, and all components have individual redundancy so that there is no domino effect.
techie is offline  
Old May 29, 2017, 2:19 pm
  #154  
A FlyerTalk Posting Legend
 
Join Date: Feb 2000
Location: Cambridge
Posts: 63,623
Originally Posted by Jimmie76
Whilst I agree, does the substance of what is alleged to have actually happened sound likely? i.e. doing patches on both data centres, not just one?
Yes, it could easily happen that way. Patching is managed by centralized software tools these days, and someone could easily have scheduled both sets to kick off if it was the high urgency patch for the recent WannaCry outbreak.

Usually there'd be change control procedures, but sometimes those get bypassed for high urgency patches.

Despite the mockery of some comments on TheRegister, reboots really do spike the power usage of servers, and there are often component failures that only show themselves upon initialization. Many's the time a working server has hung upon restart.
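To make that failure mode concrete (this is not how BA's tooling actually works - the host groups, patch names and the "urgent" bypass below are all invented for illustration), here's a toy patch scheduler where one high-urgency job skips the one-site-per-window rule and ends up rebooting both data centres at once:

```python
from datetime import datetime

# Hypothetical host groups - names are made up for illustration only.
HOST_GROUPS = {
    "dc-primary": ["app01", "app02", "db01"],
    "dc-backup":  ["app11", "app12", "db11"],
}

def schedule_patch(groups, patch_id, urgent=False):
    """Normal change control: patch one data centre per window, verify, then do the other."""
    if not urgent and len(groups) > 1:
        raise ValueError("Change control: one site per maintenance window.")
    for group in groups:
        for host in HOST_GROUPS[group]:
            print(f"{datetime.now():%H:%M} queuing {patch_id} + reboot on {host} ({group})")

# Routine patching targets a single site, so the other stays up as the failover target.
schedule_patch(["dc-primary"], "routine-2017-05")

# An 'urgent' flag (think: emergency WannaCry fix) bypasses that guard, and both
# sites reboot in the same window.
schedule_patch(["dc-primary", "dc-backup"], "MS17-010-emergency", urgent=True)
```

Once that second call goes out, there's no intact site left to fail over to.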
Originally Posted by Egoldstein
I find this astonishing if true, but some chatter in another forum seems to confirm that airlines tend to test failover very rarely, unlike financial institutions that have to comply with regulatory minimums and audits on DR compliance.

Talk about penny wise, one hundred million pound foolish!
Having worked in the financial industry, I assure you that the failover testing of financial institutions is on the level of "good enough for the auditors".

It's rarely "good enough to restore the business". In my time in the mutual fund part of a top-10 American bank, we could only restore the core systems within 8 hours. Portfolio managers could do their jobs, but no one would be answering calls and the web site wouldn't be updated. If we had to operate out of the DR site and restore 100% of business functionality, it'd take 3+ weeks at a minimum. And that still wouldn't be full functionality, because CSR, QA, document, legal, HR, etc. would not be online at all.

I've only ever worked at one place that was sufficiently dedicated to the idea of disaster recovery to be able to restore 100% of business functionality - including office space for the bare minimum of required business users. At that place we ran weekly minimal IT-only DR tests, validated by the QA team. Then monthly tests with more systems brought up. Then quarterly tests that required business users to put in synthetic transactions. And then a full annual DR test where at least one trade was executed from the DR site, with minimal staff on-site using the DR site's computers, to make sure that part worked as well.

This was only possible by devoting close to 30% of the entire IT budget to DR functionality. It was a quant hedge fund, and they understood the cost of lost time.

Every time I mention that DR site experience, it's met with shock, and that includes people who work at other top-10 banks.

Even then... it wasn't a really good DR site; the facility was less than 70 miles from the primary site.
Plato90s is offline  
Old May 29, 2017, 2:29 pm
  #155  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Good to get more reality on here, Plato90s.

There's obviously a perception amongst the general public that somehow every company has a perfect DR plan that is guaranteed to work perfectly every time. The reality is that it is a cost/risk decision like everything else.
Banana4321 is offline  
Old May 29, 2017, 3:57 pm
  #156  
 
Join Date: Sep 2004
Programs: BA Gold
Posts: 458
Originally Posted by Plato90s

It's rarely "good enough to restore the business". In my time in the mutual fund part of a top-10 American bank, we could only restore the core systems within 8 hours. Portfolio managers could do their jobs, but no one would be answering calls and the web site wouldn't be updated. If we had to operate out of the DR site and restore 100% of business functionality, it'd take 3+ weeks at a minimum. And that still wouldn't be full functionality, because CSR, QA, document, legal, HR, etc. would not be online at all.

I've only ever worked at one place that was sufficiently dedicated to the idea of disaster recovery to be able to restore 100% of business functionality - including office space for the bare minimum of required business users. At that place we ran weekly minimal IT-only DR tests, validated by the QA team. Then monthly tests with more systems brought up. Then quarterly tests that required business users to put in synthetic transactions. And then a full annual DR test where at least one trade was executed from the DR site, with minimal staff on-site using the DR site's computers, to make sure that part worked as well.

This was only possible by devoting close to 30% of the entire IT budget to DR functionality. It was a quant hedge fund, and they understood the cost of lost time.

Every time I mention that DR site experience, it's met with shock, and that includes people who work at other top-10 banks.

Even then... it wasn't a really good DR site; the facility was less than 70 miles from the primary site.
I don't think that's truly representative. Also working in the financial industry, at a top-10 US institution, we regularly (every 6 months) run DR/SR (Disaster Recovery / Sustained Resilience) failovers, which means we fail over to the DR site and systems, run our business from there for the next 6 months, then fail back. Therefore we don't really have the concept of a 'DR' site and system, since they run our production platform every 6 months.
I can see how this would be difficult to implement in the airline industry, since it does require some downtime at the weekends to flip-flop back and forth.
FliGuy is offline  
Old May 29, 2017, 4:05 pm
  #157  
A FlyerTalk Posting Legend
 
Join Date: Feb 2000
Location: Cambridge
Posts: 63,623
Originally Posted by FliGuy
I don't think that's truly representative. Also working in the financial industry, at a top-10 US institution, we regularly (every 6 months) run DR/SR (Disaster Recovery / Sustained Resilience) failovers, which means we fail over to the DR site and systems, run our business from there for the next 6 months, then fail back. Therefore we don't really have the concept of a 'DR' site and system, since they run our production platform every 6 months.
I can see how this would be difficult to implement in the airline industry, since it does require some downtime at the weekends to flip-flop back and forth.
It depends a lot on which components of the bank you're talking about. The parts of the business which are based on mainframe technology are indeed fully redundant, with geographic diversity, etc.

The parts of the business based on LUM (Linux-Unix-Microsoft) are almost never DR-ready. That meant that at the mutual fund unit, client data and holdings were on mainframe - fully redundant. But the PM could only access the mainframe data through an application, and the same was true for statements, client communication, etc. In the event of a total outage, the PM could still pick up a phone and call to have orders placed and holdings printed on paper for overnight delivery. But that's hardly 100% business functionality. From a customer perspective, the mutual fund would effectively be crippled until we could restore the mixed Solaris-Windows environment, no matter how functional the back office was.

Airlines like BA don't spend enough on IT to run their internal tech on mainframe technology. The reservation/ticketing systems are on mainframe and they're robust, but I'm confident that the BA-specific technology is based on LUM.
Plato90s is offline  
Old May 29, 2017, 4:06 pm
  #158  
 
Join Date: Mar 2014
Posts: 582
So happy I'm no longer in IT. Pity whoever was on call when this disaster happened!
AmaaiZeg is offline  
Old May 29, 2017, 4:08 pm
  #159  
 
Join Date: Sep 2004
Programs: BA Gold
Posts: 458
I am referring to our trading platforms: front office, middle office and back office. A reasonable mix of mainframe and Linux (who runs on Unix these days? :-) ) and cloud platforms, which simplifies the whole process even further.
FliGuy is offline  
Old May 29, 2017, 4:14 pm
  #160  
 
Join Date: Jun 2009
Location: UK
Programs: Lemonia. Best Greek ever.
Posts: 2,274
The IT spend in the financial businesses is huge. I don't have the Gartner numbers to hand, but c. 14% of turnover for merchant bank type businesses rings a bell. The timing-critical guys and girls see 25% of turnover.

BA are chiselling away at their IT. Anyone know what their spend is?

Maybe they were sold the outsourcing with the "good" folk fronting up, right down to the code writers and the spannerfolk. What most clients don't know is that these good folk move on after a couple of weeks and are replaced by, er, others. That is the culture of consultants and outsourcers worldwide.
Ancient Observer is offline  
Old May 29, 2017, 4:14 pm
  #161  
A FlyerTalk Posting Legend
 
Join Date: Feb 2000
Location: Cambridge
Posts: 63,623
Originally Posted by FliGuy
I am referring to our trading platforms: front office, middle office and back office. A reasonable mix of mainframe and Linux (who runs on Unix these days? :-) ) and cloud platforms, which simplifies the whole process even further.
This was some years ago when Solaris was still viable.

If your data was on mainframe and there were no intermediate stages holding data - just UI and middleware - it's a lot easier to fail over. Especially since there are market holidays every week.

Consider that Amazon has had outages of its cloud platform that lasted for days. Even for professional large-scale providers, the LUM ecosystem is just not designed for that level of redundancy, especially when there are near-continuous data streams.

I currently work with an environment where we have active-active for parts of it, but that's hugely expensive (reflected in what the client is charged), and there's still the headache of bad data getting replicated across to both sites.

(sigh) I miss working with a mainframe back end. You could depend on those guys to keep their promises of uptime.

Then again... it did cost an arm and a leg.
Originally Posted by Ancient Observer
The IT spend in the financial businesses is huge. I don't have the Gartner numbers to hand, but c. 14% of turnover for merchant bank type businesses rings a bell. The timing-critical guys and girls see 25% of turnover.
At the quant hedge fund, IT spending was the 2nd biggest cash line item - after disbursements to partners.

We spent more on IT than on labor, and that's with some pretty highly paid folks on staff (not me - the folks doing the financial models).



I suspect the BA chief is parsing his words very carefully, in that the data center was in the UK and the on-site staff are British.

But the team actually supporting the software are likely NOT to be 100% local.


As I recall, there was a major outage at a British bank some years back because the India-based team made a procedural mistake that fouled up the system horribly: payments weren't scheduled, deposits weren't processed, etc.

A system which is precarious needs experienced hands, and outsourcing is often the worst thing you can do.

ETA:

Here we go - 2012 outage of RBS

http://www.telegraph.co.uk/finance/p...-in-India.html

Last edited by Plato90s; May 29, 2017 at 4:19 pm
Plato90s is offline  
Old May 30, 2017, 3:44 pm
  #162  
 
Join Date: Apr 2017
Location: BNA, ATL
Programs: AS MVPG, LH, Marriott Titanium, National Executive Elite
Posts: 118
Originally Posted by Egoldstein
I find this astonishing if true, but some chatter in another forum seems to confirm that airlines tend to test failover very rarely, unlike financial institutions that have to comply with regulatory minimums and audits on DR compliance.

Talk about penny wise, one hundred million pound foolish!
A lot of DR compliance is tested "in theory" without actually doing it in practice, even in highly regulated industries. It's both an infrastructure cost issue and a business disruption issue.

In order to be fully redundant you would need 2N infrastructure, whereas most companies run at somewhere between 1.2N and maybe 1.7N. So they can fail over in emergencies but will have to discontinue some services, which is why full failover is extremely rarely tested.
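As a rough back-of-the-envelope (assuming the total capacity is split evenly across two sites, which real estates rarely are), here's what those N-factors mean for what survives when one site goes dark:

```python
def surviving_headroom(n_factor, sites=2):
    """Capacity left (in units of the full production load, N) after losing one site,
    assuming the total capacity of n_factor * N is spread evenly across the sites."""
    per_site = n_factor / sites
    return n_factor - per_site

for n in (2.0, 1.7, 1.2):
    left = surviving_headroom(n)
    verdict = "can carry the full load" if left >= 1.0 else f"can only carry ~{left:.0%} of it"
    print(f"{n:.1f}N total -> {left:.2f}N survives the loss of one site: {verdict}")
```

At 2N the surviving site can carry everything; at 1.7N it can carry roughly 85% of the load, and at 1.2N only about 60%, which is exactly why services get shed in a real failover.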

For highly transactional workloads, failover creates potential data integrity issues during consolidation after the failover, which is again why most outfits don't actually do it for real.

Originally Posted by Ancient Observer
The IT spend in the financial businesses is huge. I don't have the Gartner numbers to hand, but c. 14% of turnover for merchant bank type businesses rings a bell. The timing-critical guys and girls see 25% of turnover.

BA are chiselling away at their IT. Anyone know what their spend is?

Maybe they were sold the outsourcing with the "good" folk fronting up, right down to the code writers and the spannerfolk. What most clients don't know is that these good folk move on after a couple of weeks and are replaced by, er, others. That is the culture of consultants and outsourcers worldwide.
I am not convinced.
In US healthcare the IT spend is about 3-5% of revenue; if memory serves, the IT spend of US financial institutions was about 7-9% of revenue. I don't have access to the report anymore since I left healthcare, but I was in charge of IT infrastructure with a 22M annual budget (at a healthcare company where we spent about 4.2% of revenue on IT in total, not just infrastructure).

I also have to object to "outsourcers constantly change staff" as a general statement. It's true if you go with the lowest bidder, but in the managed data center world especially, turnover is pretty low because the staff are well paid. TDS is one such example (https://www.transitionaldata.com).

It's simply not true that outsourcing is always less expensive and always of ...... quality. The problem is that consumers never see the magnitude of outsourcing; all they know is that sometimes they pick up the phone and are greeted by substandard outsourced support.

Lots of other operations in the airline industry, and most other industries, are outsourced. Why doesn't BA have its own food services, baggage handlers, aircraft cleaning, aircraft fueling, etc. etc. etc.?

Originally Posted by Plato90s
Then again... it did cost an arm and a leg.
At the quant hedge fund, IT spending was the 2nd biggest cash line item - after disbursements to partners.

A system which is precarious needs experienced hands, and outsourcing is often the worst thing you can do.
Ehm... of course IT is the 2nd biggest cash line item; in most companies it's actually the largest line item. That's because it's a corporate service, whereas other product areas or business lines are rarely, if ever, corp-wide. That's just accounting for you. IT is less expensive than, say, product development, because product dev is spread across a dozen line items; if you were to combine all of those, product dev would be much more expensive than IT.

The line-item size is completely irrelevant; what matters is IT operations spend as a percentage of revenue.

Again, the implication that outsourcing is always low quality and that if you want "experienced hands" you have to do it in-house is simply not true. I really wish that people would just think for a moment before making such a baseless claim.

Continuing with the "you know what else is outsourced" line: emergency physicians in the US are generally not employed by the hospital they staff; they are an "outsourced" resource precisely because experienced staff are needed at all times, and a staffing company can provide a level of redundancy and resiliency that in-house staff never could unless you vastly overstaff. It's simple math, really.
MeanwhileBackAtFAI is offline  
Old May 30, 2017, 5:15 pm
  #163  
 
Join Date: Jan 2017
Programs: BAEC Gold
Posts: 39
Originally Posted by Banana4321
It's actually very risky executing a DR plan. You may leave yourself with nothing if it doesn't work. So....why test it if you'll probably never need it anyway?

And if you do end up needing it, you can fix the problems at that time.
Disasters are inevitable; I've personally had to work on infrastructure hit by floods, lightning strikes, DDoS... Some environments were well maintained and failed over cleanly, and some were not. Regular failover testing may be a pain, but it saves a lot of finger-pointing and RFO paperwork in the long run.
onylon is offline  
Old May 30, 2017, 5:51 pm
  #164  
 
Join Date: Sep 2014
Location: Melbourne, Australia
Programs: AY Platinum, UA Premier Platinum, OneWorld Emerald, VA Platinum
Posts: 558
Originally Posted by Banana4321
It's actually very risky executing a DR plan. You may leave yourself with nothing if it doesn't work. So....why test it if you'll probably never need it anyway?

And if you do end up needing it, you can fix the problems at that time.
You do realise that a DR plan involves more than hitting the off switch and crossing your fingers, don't you?
Guvner067 is offline  
Old May 30, 2017, 5:58 pm
  #165  
 
Join Date: Jan 2000
Location: SoCal to the rest of the world...
Programs: AA EXP with lots of BA. UA 2MM Lifetime Plat - No longer chase hotel loyalty
Posts: 6,699
Originally Posted by Plato90s
Despite the mockery of some comments on TheRegister, reboots really do spike the power usage of servers, and there are often component failures that only show themselves upon initialization. Many's the time a working server has hung upon restart.
Having worked in the financial industry, I assure you that the failover testing of financial institutions is on the level of "good enough for the auditors".
Those reboot failures are really due to that platform having hit its MTBF - running the same servers for 10+ years is ASKING for trouble. Even IBM midrange, mainframe, DEC mainframes, etc. had a service regime to replace components before MTBF failures. This included logic boards, power supplies, I/O controllers, etc. Buying an off-the-shelf Dell server and crossing your fingers that you can run it 24x7 for 10 years should earn that decision maker a PhD in Stupidity. Five years is the industry max for servers. Your process for running the data center should be so virtualized that new hardware can come in and replace existing running platforms MID cycle with no or limited downtime.

Even AWS expects a 1-3 year lifecycle. You should be building ANY platform to have compute nodes that can come online and replace others for processing - e.g. an elastic cloud - so that if you have to take resources down, you can bring others up elsewhere to replace them. Basically, you plan for never going down.
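Not any particular cloud API or BA's setup, just a toy sketch of that "compute nodes are disposable" idea - the node names, the 3-year cut-off and the provision_replacement stub are all invented for illustration:

```python
from dataclasses import dataclass

MAX_AGE_YEARS = 3.0   # retire hardware well before it drifts into MTBF territory

@dataclass
class Node:
    name: str
    age_years: float
    healthy: bool = True

def provision_replacement(old: Node) -> Node:
    """Stand-in for whatever actually builds a box (golden image + config management)."""
    return Node(name=old.name + "-replacement", age_years=0.0)

def reconcile(fleet):
    """Drain and replace anything unhealthy or past its retirement age; the rest keep serving."""
    refreshed = []
    for node in fleet:
        if not node.healthy or node.age_years >= MAX_AGE_YEARS:
            print(f"draining {node.name} (age {node.age_years}y, healthy={node.healthy})")
            refreshed.append(provision_replacement(node))
        else:
            refreshed.append(node)
    return refreshed

fleet = [Node("web01", 1.2), Node("web02", 4.5), Node("db01", 2.0, healthy=False)]
fleet = reconcile(fleet)   # web02 (too old) and db01 (unhealthy) get replaced mid-cycle
```

The point is that replacement becomes a routine, automated operation rather than a once-a-decade crisis.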
NickP 1K is offline  

