Methodology:
Steps Used to Analyze the Problem
Our process began with a meeting with IEMA IT personnel. For our work to be successful, we needed to foster open communication with the key people who currently manage your IT infrastructure. Our success depended on fully understanding the systems under review and, more specifically, how each subsystem's functions affect your business processes and overall flow of information. Access to documentation of the existing infrastructure, and to the personnel who work with it, is imperative; without that access, the time required to analyze the systems thoroughly increases sharply. Fortunately, this project enjoyed complete access to both documentation and key personnel, so we were able to keep to a rapid work schedule. Once we had met with key personnel and discussed the overall issues, we moved into our process, which consisted of eight steps:
- First, we described and summarized the business problems or issues that plague (or have plagued) the Agency with respect to performance and survivability, and we speculated as to what the survivability and performance requirements should be.
- Second, we created a diagram of the information technology infrastructure showing all of the main servers and computer systems, networking components such as switches and routers, core WAN and LAN links and their capacities, key peripheral devices, and the locations where major groups of user hosts are situated.
- We next identified and listed potential single points of failure (SPOFs) within the IT infrastructure and operation, and for each SPOF, the business service(s) that would be affected. We also identified the applications (and/or databases) supporting each service that would be affected.
- Then, we defined an “envelope of performance” for each service and identified metrics to measure that performance. The performance, availability and survivability goals were identified in this step. We have established the use of traditional metrics such as % Availability, MTBF, MTTR, Latency/Delay, and Capacity/Throughput.
- We next ranked the services and/or their associated applications by their critical importance using the ranking criterion established by the Illinois Central Management Services Agency. Then we identified the SPOFs that would have the greatest degree of impact on the Agency’s operations.
- In step #6, we identified other potential risks to the applications and services in addition to the already uncovered SPOFs based on our assessment as to likelihood of risk and the scale of the resulting impact. We estimated expected losses associated with single occurrences of each risk, and the annual loss expectancy (ALE), the expected monetary loss on an annual basis. We ranked our findings by service/application importance and then by ALE.
- Step 7 addressed the costs to eliminate or reduce the risks we identified in the previous step. We looked at capital (fixed) expenses for procuring any additional hardware, software or network infrastructure, and at additional operational expenses for new recovery procedures (to reduce MTTR) and new backup procedures (to reduce damages). From these costs we performed a percentage return on investment (ROI) calculation.
- The final step was to compile our findings into a final report including findings and recommendations for improvements and remedial actions to improve the overall survivability and performance of the subject IT infrastructure.
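The metrics named in steps 4, 6 and 7 can be illustrated with a short Python sketch. The formulas are the conventional textbook definitions, which we assume (but the text above does not confirm) match the ones used in this analysis. Every input figure is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_loss_expectancy(single_loss: float, occurrences_per_year: float) -> float:
    """ALE = expected loss per occurrence x expected occurrences per year."""
    return single_loss * occurrences_per_year


def roi_percent(ale_before: float, ale_after: float, annual_cost: float) -> float:
    """Percentage ROI of a mitigation: net annual benefit over its annual cost."""
    benefit = ale_before - ale_after
    return (benefit - annual_cost) / annual_cost * 100.0


# Hypothetical inputs, not figures from this engagement.
a = availability(mtbf_hours=990.0, mttr_hours=10.0)
ale_before = annual_loss_expectancy(20_000.0, 0.5)    # one outage every two years
ale_after = annual_loss_expectancy(20_000.0, 0.25)    # one outage every four years
roi = roi_percent(ale_before, ale_after, annual_cost=4_000.0)

print(f"availability={a:.2%}  ALE before=${ale_before:,.0f}  "
      f"ALE after=${ale_after:,.0f}  ROI={roi:.0f}%")
```

With these assumed inputs, the mitigation halves the ALE from $10,000 to $5,000 for an annual cost of $4,000, which gives a 25% return. Ranking candidate mitigations by this figure is one way to read the prioritization described in steps 6 and 7.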
Assumptions
The main assumption we made is that the documentation provided to us is an accurate representation of the current systems. If confidence in that documentation were not close to 100%, we would have to adjust our process to include an inventory and assessment stage. The IT personnel were very confident in the accuracy of the documentation, with the exception of some of the older legacy systems at the smaller agency from the pre-merger days. Because most of the legacy systems from one of the merged agencies were replaced after the merger, this has not been much of an issue.
Findings:
Summary of Critical Applications
An assessment of applications that are run from each Agency facility was made. From this assessment, critical agency functions and supporting applications were identified.
The Illinois Department of Central Management Services (CMS) provided a method of ranking agency applications. This method requires that applications be assigned a category number from 1 to 5 using the following guidelines, which were approved by the Governor’s Office for use by all State agencies:
Category One – Human Safety
Any resources that directly impact the lives and safety of Illinois citizens, including state employees. Examples: Police, Fire, Medical, Corrections, Child Welfare, etc.
Category Two – Welfare Human Service
Any resources that directly impact the well-being of Illinois citizens. Examples: Assistance, Benefits, Vital Records, etc.
Category Three – Non-Welfare Human Service
A human service resource that indirectly impacts the welfare of Illinois citizens. Examples: Registries, Licensure, Tracking, Vendor related, etc.
Category Four – Administrative State Functions & Processes
Resources that support the administration of state processes. Examples: Payroll, Compensation, Procurement, Accounts Payable, etc.
Category Five – Support of Specific Agency Functions & Processes
Resources related to the maintenance of a specific agency function or process. Examples: Laboratory, Utilities, Diagnostics, Statistical systems, Application Code Tools, etc.
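As a minimal sketch of how the five CMS categories can drive a ranking, the Python below sorts a list of applications by category number. The enum member names and the application names are hypothetical, invented for illustration; they are not taken from the Agency's inventory or from the CMS guidelines.

```python
from enum import IntEnum


class CmsCategory(IntEnum):
    """CMS ranking categories; a lower number means higher priority."""
    HUMAN_SAFETY = 1
    WELFARE_HUMAN_SERVICE = 2
    NON_WELFARE_HUMAN_SERVICE = 3
    ADMINISTRATIVE = 4
    AGENCY_SUPPORT = 5


# Hypothetical applications for illustration only.
applications = [
    ("Payroll system", CmsCategory.ADMINISTRATIVE),
    ("Emergency dispatch", CmsCategory.HUMAN_SAFETY),
    ("Lab statistics", CmsCategory.AGENCY_SUPPORT),
    ("Licensure registry", CmsCategory.NON_WELFARE_HUMAN_SERVICE),
]

# Sort by category number so the most critical applications come first.
ranked = sorted(applications, key=lambda app: app[1])

for name, category in ranked:
    print(f"Category {category.value}: {name}")
```

Using an integer-valued enum keeps the category labels readable while letting the ordinary sort order express priority.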
Based on those categories, the following tables, taken directly from the IEMA Disaster Recovery Plan, provide a breakdown of the Agency’s critical and non-critical applications:
Application Restoration Priorities
Based on the critical application assessments, applications that exist at a given facility were noted and placed in either a Critical or Non-Critical category. The restoration process is designed to restore Critical applications first, followed by any remaining Non-Critical applications, if possible. The tables below outline the predominant applications that are active at the two main Agency facilities.
Current Fault Tolerant Systems in Place
1) VPN: Each of the 8 regional offices has a VPN/firewall box, in a hub-and-spoke setup with the hub at Outer Park Drive. Two VPN/firewall boxes are tied together by a heartbeat and run in High Availability (HA) mode: if one goes down, the other takes over all traffic.
2) IEMA connectivity with the nuclear power plants in Illinois: a Frame Relay connection through a frame relay cloud operated by SBC, which has fault tolerance built into its systems. That fault tolerance lets data find another path if the path it is on becomes impassable for some reason. The information shared over this data network is sent to two locations, Outer Park Drive and Rodgers Street. The same information goes to both places, and the systems in place at each location are independent of each other.
3) SERVERS:
The majority of our servers have RAID (redundant array of inexpensive disks) controllers. All servers have redundant power supplies and redundant fans. Disks are “hot swappable”. With mirrored sets, if one drive fails, the mirror set still has a copy. Some servers have a backup server. For disaster recovery, a file server storing some critical data has its own backup server.
4) SOFTWARE:
Doubletake: this software continuously replicates files from Outer Park Drive to a server at another location, Rodgers Street. Incidentally, Microsoft Windows Server 2003 has some of this capability also, but Doubletake is better for our purposes.
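The effect of the redundancy listed above can be approximated with the standard series and parallel availability formulas. The sketch below is illustrative only: it assumes component failures are independent, and the 99% figure for a single VPN/firewall box is hypothetical rather than a measured value.

```python
from math import prod
from typing import Iterable


def series_availability(components: Iterable[float]) -> float:
    """All components must be up: each one is a single point of failure."""
    return prod(components)


def parallel_availability(components: Iterable[float]) -> float:
    """The service is up if at least one redundant component is up."""
    return 1.0 - prod(1.0 - a for a in components)


# Hypothetical figure: each VPN/firewall box is assumed to be 99% available.
single_box = 0.99
ha_pair = parallel_availability([single_box, single_box])
chain = series_availability([single_box, single_box])

print(f"single box: {single_box:.2%}")
print(f"two boxes in an HA pair: {ha_pair:.2%}")
print(f"two boxes in series: {chain:.2%}")
```

Under these assumptions a redundant pair raises availability from 99% to 99.99%, while chaining the same two boxes in series lowers it to 98.01%. That contrast is why removing single points of failure is the focus of the analysis.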
Recommendations:
Overall, our analysis showed that much has already been done to ensure the survivability of systems and the delivery of the critical services provided by the client.
However, there are potential areas of improvement, many of which were identified by your own IT personnel and confirmed in our analysis of the survivability of your overall systems infrastructure. In cooperation with your very capable IT personnel, the following 5 recommendations have been developed and agreed upon as the best, most efficient, and least expensive methods for making the improvements necessary to fulfill your Agency’s mandates:
1) Getting a second Internet service provider. Currently our Internet access is through CMS. We have had several occasions where CMS has had outages. Those outages affect connectivity to the Regional offices, which are connected via VPN over the Internet. In addition, Internet browsing capability and Internet E-mail are lost during an outage. By adding a second T1 connection to an alternate ISP, the number of outages will be reduced, providing additional fault tolerance for access to the Internet.
Estimated Cost: $12,000.00
2) E-mail has become a mission critical application. Therefore, during the migration from Microsoft Exchange Server 5.5 to Microsoft Exchange Server 2003, we recommend the creation of a server cluster, basically two servers that share resources. If one server fails, the other takes over the load transparently to the users and application. Thus we increase the fault tolerance of the mail server.
Estimated Cost: $4,000.00
3) Creation of a new networking structure to provide additional fault tolerance. With implementation of this recommendation, each server will get a second network card. The second network card will be connected to a different network switch. If a network switch fails, or a server’s network card fails, connectivity will still exist via the second network card and second network switch. This is accomplished not only by hardware, but also using special software drivers from HP/Compaq that allow "Teaming" of network cards.
Estimated Cost: $15,000.00
4) Improve the performance of the Remote Monitoring Systems by upgrading the connections with the nuclear power plants. Currently there are 56K dedicated lines to each nuclear power plant in Illinois. Our recommendation is to replace those connections with T1 frame relay connections. The upgraded connections will continue to benefit from the built-in fault tolerance that the SBC frame relay cloud provides.
Estimated Cost: $10,000.00
5) Distributed File System (DFS) within Windows Server 2003. DFS would provide fault tolerance for everyone's file shares.
Estimated Cost: Free
COSTS SUMMARY – RECOMMENDED UPGRADES
Estimated Costs Total: $41,000.00 in SFY05
$22,000.00 Annually starting in SFY06
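As a quick check, the SFY05 total can be reproduced by summing the five estimated costs listed with the recommendations. The dictionary labels below are our own paraphrases of the recommendations. The annual SFY06 figure is not broken down by item in this report, so it is not reconstructed here.

```python
# Estimated costs in dollars, taken from recommendations 1 through 5.
estimated_costs = {
    "Second Internet service provider": 12_000,
    "Exchange Server 2003 cluster": 4_000,
    "Second network cards and switches": 15_000,
    "T1 frame relay links to nuclear power plants": 10_000,
    "DFS on Windows Server 2003": 0,
}

total = sum(estimated_costs.values())
print(f"Estimated total: ${total:,.2f}")  # prints "Estimated total: $41,000.00"
```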
Business Continuity Disclosure:
It is our hope that you will adopt our recommendations for enhancing your current infrastructure. We feel that these enhancements will immediately improve the survivability and performance characteristics of your systems. If the recommendations are accepted, our firm will immediately begin implementing them, working in concert with your information technology personnel, as we have been fortunate to do thus far in our mutually beneficial relationship. Following the implementation phase, your agency could effectively use language similar to the following example of a business continuity disclosure in your public information brochures and in public information generally:
Business Continuity Disclosure Statement:
The Illinois Emergency Management Agency (IEMA) supports several systems that are critical to the safety of the citizens of Illinois. To ensure the delivery of critical services supported by those systems, IEMA utilizes a comprehensive approach to continuity planning, an approach that increases the survivability and fault tolerance of those critical systems. In the event of a disaster that limits our current capabilities, we have plans in place to restore operations quickly and efficiently. These plans are “living documents” and thus are evaluated constantly and updated according to an established schedule.
REFERENCES & ENDNOTES
- Mission-Critical Network Planning, Matthew Liotine, 2003
- IDS 594 Business Continuity Planning, Instructor Dr. Matthew Liotine
- Illinois Emergency Management Agency, Systems Documentation
- IEMA Disaster Recovery Plan, William M. Waggoner, IEMA
- Special Thanks to Steve Ellis, IEMA Networks, IT Guru
From SFY05 State Budget Book:
The state spends over $640 million annually on technology, including personnel, infrastructure and contractual services. The state’s information technology function is highly decentralized, with most IT decisions made independently with little consistency or synergy among agencies. The consequence is a duplication of investments between agencies and higher technology costs. In addition to developing critical IT standards for the state, the IT initiative has achieved substantial savings through a series of renegotiations and reductions. For example, CMS achieved a 20 percent reduction in the cost of Centrex telephone lines, a 45 percent reduction in the cost of selected wireless services, and a 20 percent reduction in the number of IT contractors working with state agencies.
Information Technology Initiative Savings, FY2004-FY2006

                 FY2004         FY2005         FY2006
Gross Savings    $35,000,000    $70,600,000    $127,000,000
Investment       $14,400,000    $20,000,000    $10,000,000
Net Savings      $20,600,000    $50,600,000    $117,000,000
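The net savings row appears to be gross savings minus investment for each fiscal year. The short check below, using only the figures quoted in the table, bears that reading out. The variable names are ours.

```python
# Figures in dollars, copied from the budget table above.
gross_savings = {"FY2004": 35_000_000, "FY2005": 70_600_000, "FY2006": 127_000_000}
investment = {"FY2004": 14_400_000, "FY2005": 20_000_000, "FY2006": 10_000_000}

# Net savings for each year is gross savings less that year's investment.
net_savings = {year: gross_savings[year] - investment[year] for year in gross_savings}

for year, net in net_savings.items():
    print(f"{year}: ${net:,}")
```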