July 14, 2021 Alex Woodie
Chances are, you’ve been through a lot in the past year, from a personal, a professional, and a public health perspective. Your company’s IT estate has probably gone through some changes, too. To keep your IBM i and open systems resilient and able to withstand misfortune, it’s a good idea to revisit your disaster recovery plan to make sure it reflects current reality.
During COMMON’s recent NAViGATE event, IBM i DR expert Richard Dolewski provided a great description of why it’s so important to check your DR plan now. With so much turmoil caused by COVID, it’s simply too easy for some of the small but important details of your DR plan to get out of date.
The best way to shake out the bugs in your DR plan is by doing a DR test, of course. But even before then, Dolewski recommended just spending some time thinking through the DR plan, because even that simple exercise can catch some of the simple shortcomings that could become major problems in the event of a real disaster.
“The first thing I want you to do — and I’m not joking here — is book a meeting with your DR plan,” said Dolewski. “Go into your calendar and schedule it, because if you don’t schedule it, it won’t happen.”
You might feel like nothing has changed over the past year. We’re all working from home, after all, and the office has stayed the same. Well, don’t fall for that ruse, Dolewski said. Plenty of things have changed.
The first thing you should do is dust off that DR plan. Oh, you see that it’s dated 2009? That would be great if it was a Cabernet from Napa, Dolewski said, but DR plans don’t age nearly as well as wine.
The real objective is to understand what has changed. “We know the answer: plenty,” Dolewski said. “This is your last line of defense to make sure that the plan is current.”
IT life revolves around servers. So have you added any new servers over the past year? The change control document should state that every new server added to the data center should have a backup policy and it should have a DR solution, Dolewski said.
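One way to make that check routine is to compare the list of servers actually running against the servers the DR plan documents. The Python sketch below is only an illustration of the idea, not something presented in Dolewski's session: the `ServerRecord` fields, the `find_dr_gaps` function, and the server names are all hypothetical, and a real shop would pull the inventory from its own change control or asset system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerRecord:
    """One server as documented in the DR plan (hypothetical fields)."""
    name: str
    has_backup_policy: bool
    has_dr_solution: bool


def find_dr_gaps(inventory, plan_records):
    """Return (undocumented, incomplete) lists of server names, each sorted.

    inventory: iterable of names of servers currently running.
    plan_records: iterable of ServerRecord entries taken from the DR plan.
    """
    running = set(inventory)
    documented = {rec.name: rec for rec in plan_records}

    # Servers that exist but never made it into the plan.
    undocumented = sorted(running - set(documented))

    # Running servers that are in the plan but lack a backup policy
    # or a DR solution.
    incomplete = sorted(
        name
        for name, rec in documented.items()
        if name in running
        and not (rec.has_backup_policy and rec.has_dr_solution)
    )
    return undocumented, incomplete


if __name__ == "__main__":
    # Illustrative data only; these server names are made up.
    running_servers = ["ibmi-prod", "aix-db1", "win-file1", "linux-web1"]
    plan = [
        ServerRecord("ibmi-prod", True, True),
        ServerRecord("aix-db1", True, False),
    ]
    missing, partial = find_dr_gaps(running_servers, plan)
    print("Not in DR plan:", missing)
    print("Missing backup policy or DR solution:", partial)
```

Running a comparison like this as part of the scheduled DR plan review would surface any server that was added during the year without a matching entry in the plan.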
Next, check the personnel listed. Have there been any new hires, fires, or downsizing in the past year? There has been plenty of change in staffing levels at many companies, and likely at yours, too.
Dolewski remembered one client from Colorado who had a DR plan that documented 60 servers that needed to be included in the DR plan. But there was only one other IT tech on staff to execute that plan. This didn’t sound right to the DR expert, who has seen his share of disasters.
“So there’s two of you, and there’s 60 servers to recover here?” Dolewski recalled telling the man. “He said ‘Try 600…We haven’t had time to document the others.’ So you have a document for 60, you’re [running 600], and there’s two of you? Well, what would you do if there was a disaster? He said ‘I’d simply resign.’
“Probably not the right way to go about DR planning,” Dolewski quipped.
DR plans are a lot like security and family vacations — it’s all about the process, not the destination. If you’re not periodically engaging with your DR plan and testing it from time to time, then you really have no idea if it will actually work when it’s really needed.
“A DR plan is an ongoing process,” Dolewski said. “Plan, implement, test. Plan, implement, test. That’s the process. By having this process, we are in a position to ultimately be able to deliver a repeatable solution. And that repeatable solution is about making sure that it works when we need it. If you don’t follow this process, it becomes a project all over again and you need a consultant or you need dedicated time. Who has time? None of us, obviously.”
Dolewski packed a ton of great tips into his allotted 20 minutes of time. Are you using offsite storage? If so, is it still available and likely to be accessible in a disaster? If you keep your tapes in a bank vault, how are you going to get them at night, or on a weekend? Nobody knows what day of the week your disaster will occur on, so it’s important to be ready for it to happen at any time.
“Murphy always shows up today, the only day you don’t . . . do something,” Dolewski said, citing the famous law.
The best preparation for a disaster that may or may not happen is meticulous planning and attention to detail. So it’s counterintuitive that, when testing your DR plan, you should not try to eliminate all sources of entropy, such as the state of your backup tape (you do have a backup, right?).
Want to prepare for your DR test with a fresh, clean Option 21 save? Not in Dolewski’s house. “I just absolutely hate that,” he said.
When your big day arrives, and you’re going over the DR plan, take special care to read it thoroughly, Dolewski advised. “Every page should answer who, what, why, where, when, and how,” he said. “Who’s going to do it? How are you going to do it? Where are you going to do it?…I’ve done this 100 times, but I still follow the book.”
Did you add any new applications during the pandemic? Maybe you migrated to JD Edwards or SAP ERP systems? It’s critical to match the specific IT systems that need to be recovered with the folks who have the necessary expertise to recover them. “You can’t really expect the IBM Power guy, the IBM i person, to restore and recover AIX,” he said.
It should go without saying, but you have to know how each and every application is designed to be recovered. That may be more difficult after 17 months of COVID, as IT systems change and memories become foggy (reminder: that’s why it’s so important to write it all down.)
“A lot of our people don’t even know where our recovery site is,” Dolewski said. “Is there HA or a failover involved? Do we have a third party [running] that for us? Do we fail over just the IBM i? What about the other systems?”
Communication is an especially important area to script out as much as possible in advance, because it is something that is likely to fail during an actual disaster. Dolewski advised customers to collect more contact details, including phone numbers, email addresses, and even social media handles, than they might think necessary, because you never know what you’re going to come up against.
“Look at the call tree,” he said. “If your name is in the plan, I need to know about you. I need the name, the title, and where you live. ‘Oh, I’m not telling you where I live. That’s confidential.’ Well, this is a confidential document of our company.”
Dolewski has actually driven to the homes of IT folks during disasters to recover items needed for a recovery. “They weren’t ignoring us,” he said. “We needed a password. We needed a key to get into a box. They were happy to help us.”
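A call tree is easy to audit mechanically before a test. The short Python sketch below flags entries that are missing any of the details Dolewski asks for. It is a minimal illustration under assumptions: the field names in `REQUIRED_FIELDS`, the dictionary layout, and the `missing_contact_fields` function are hypothetical and would need to match however your plan actually stores its contacts.

```python
# Hypothetical field names; adjust to match your own call tree.
REQUIRED_FIELDS = ("name", "title", "home_address", "mobile_phone", "email")


def missing_contact_fields(call_tree):
    """Map each incomplete entry to the required fields that are blank or absent.

    call_tree: list of dicts, one per person named in the DR plan.
    Entries with every required field filled in are left out of the result.
    """
    gaps = {}
    for index, entry in enumerate(call_tree):
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not str(entry.get(field) or "").strip()
        ]
        if missing:
            # Fall back to a positional label when the name itself is missing.
            label = str(entry.get("name") or "").strip() or f"entry #{index + 1}"
            gaps[label] = missing
    return gaps


if __name__ == "__main__":
    # Illustrative data only; these people and details are made up.
    tree = [
        {
            "name": "Pat Example",
            "title": "IBM i administrator",
            "home_address": "12 Sample St",
            "mobile_phone": "555-0100",
            "email": "pat@example.com",
        },
        {"name": "Sam Example", "title": "AIX administrator"},
    ]
    for person, fields in missing_contact_fields(tree).items():
        print(f"{person}: missing {', '.join(fields)}")
```

Because the call tree holds home addresses and personal numbers, any script or file like this should be handled with the same confidentiality as the plan itself.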
Texting becomes a lifeline during a disaster. “Will the phones even work?” Dolewski said. “If you ever have a major outage, the telephones go into a fast busy, so texting is very important.”
During an actual disaster, you’re likely to be distracted from your job as an IT professional. That’s not necessarily a bad thing, but it’s something that should be kept in mind, and possibly a reason why outsourcing to a third party could be a good insurance policy.
“When we had the fires in Santa Rosa, California, and I talked to the people on the ground, they were getting their families and weren’t really interested in talking to me,” Dolewski recalled. “I wasn’t their priority. I was focused on getting their business failed over to our private cloud…. They were focusing on the most important, prized possession of their life: their family.”
If a member of the press shows up, Dolewski recommends keeping mum. “Don’t talk to the media, ever. Have a spokesman do it,” he said. “The media will get through you. They’ll ask you questions. They will quote you, misquote you.”
When you first wrote your DR plan, it was likely a good one. But time is not kind to DR plans. People come and go, servers are upgraded or outsourced to the cloud, new applications are brought online, and entirely new forms of communication have been invented. The past 17 months have been a period of rapid change, so the bottom line is that it’s time to check out that DR plan.
“A lot has happened to a lot of us this past year. Let’s make sure we’re resilient,” Dolewski said. “We were residing with our families, with our children. Let’s make sure we’re resilient now with our business.”
(P.S. It’s OK to talk to media – well, with IT Jungle, anyway.)