A summary of critical failures is given below:
- 29 January 2005: CPU Failure, Data Loss, Restore Failure; 17 hours lost
- 29 January 2005: CPU Failure, Failover Recovery Failure, Data Loss; 5-hour outage
- 06 November 2005: Single Disk Failure, Mirror Failure, Server Failure, Database Impact; 2.5-hour outage
- 09 May 2006: Power Module Failure, Media Failure, Database Loss; 11-hour outage
Statistic:
There have been 9 unplanned failover (service recovery) attempts as a result of some other failure. Of these, only 5 recoveries have been successful (a success rate of approximately 56%).
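The quoted success rate can be checked directly from the figures above (5 successful recoveries out of 9 attempts, i.e. roughly 56% when rounded to the nearest percent):

```python
# Sanity check on the failover statistic quoted above.
attempts = 9        # unplanned failover (service recovery) attempts
successes = 5       # recoveries that actually succeeded

rate = successes / attempts
print(f"{successes}/{attempts} = {rate:.0%}")  # 5/9 = 56%
```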
Current Hardware Diagram - Stevenage
Current Hardware Description
Current Situation Storage Overview
Local Domain Storage (Production Environment)
Each of the server domains has local SCSI storage defined for its root file system, swap storage and other server-specific file systems. A total of twelve 4.2 GByte disks, equally distributed across two D1000 SCSI disk trays, are provided for each domain.
Shared Storage Overview (Production Environment)
The following points summarise the aspects of the shared storage:
- Fourteen A5000 arrays available, fully populated.
- All disks mirrored (RAID 0+1): 891.8 GByte available storage.
- Mirrors to be built across arrays in different cabinets.
- One Disk used by Sun Cluster 2.1 (mandatory).
- Two Disks used by SAP Central Instance Logical Host.
- Remaining 95 disks used for Oracle 8.0.4 database.
- All storage is mirrored using the Veritas Volume Manager. This is a mandatory requirement for Sun Cluster 2.1.
- The storage is striped (Software RAID 0) for performance.
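The usable-capacity figure above can be reconstructed from the array counts. Note the per-disk size and the 14-disks-per-array figure are assumptions (neither is stated explicitly, but 9.1 GByte disks in fully populated 14-slot A5000 arrays are consistent with the quoted 891.8 GByte total and the 1 + 2 + 95 disk breakdown):

```python
# RAID 0+1 usable capacity for the production shared storage.
# Assumptions (not stated explicitly above): a fully populated A5000
# array holds 14 disks, and each disk is a 9.1 GB unit.
arrays = 14
disks_per_array = 14          # fully populated A5000 (assumed)
disk_size_gb = 9.1            # per-disk capacity (assumed)

total_disks = arrays * disks_per_array   # 196 spindles in total
usable_disks = total_disks // 2          # mirroring halves usable capacity
usable_gb = usable_disks * disk_size_gb

print(f"{usable_disks} usable disks, {usable_gb:.1f} GB")  # 98 usable disks, 891.8 GB
```

The 98 usable disks agree with the breakdown in the list above: 1 (Sun Cluster) + 2 (SAP Central Instance) + 95 (Oracle database).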
Local Domain Storage (QA/Test Environment)
Each of the server domains has local SCSI storage defined for its root file system, swap storage and other server-specific file systems. A total of twelve 4.2 GByte disks, equally distributed across two D1000 SCSI disk trays, are provided for each domain.
Shared Storage Overview (QA/Test Environment)
- Eight A5000 arrays available, fully populated.
- All disks mirrored (RAID 0+1): approx. 670 GByte available storage.
- Mirrors to be built across arrays in different cabinets.
- One Disk used by Sun Cluster 2.1 (mandatory).
- Two Disks used by SAP Central Instance Logical Host.
- Remaining 53 disks used for Oracle 8.0.4 database.
- All storage is mirrored using the Veritas Volume Manager. This is a mandatory requirement for Sun Cluster 2.1.
- The storage is striped (Software RAID 0) for performance.
Current Situation Software Description
- Solaris 2.x operating system
- Oracle 8.0.4 RDBMS
- Sun Cluster 2.1
- SAP R/3
- Veritas Volume Manager
- Tivoli Workload Scheduler
- Veritas NetBackup
High Level Requirements
High Level Requirements List
- Bring the Production landscape back in line with SLAs;
- Stabilise and add resiliency to the Production landscape at Stevenage;
- Minimise the risk of data loss resulting from hardware (CPU, power, memory, disk, etc.) failures;
- Reduce the impact of backup failures on the Production landscape;
- Improve reliability of the SAP service;
- Improve availability of the SAP service.
Alternative Solutions
Alternatives List
1. Replace Sun storage with EMC (£1.3M, 4-5 mths)
2. Add additional Sun storage; alter approach to Backup (£0.8M, 3-4 mths)
3. Replace Sun storage with EMC; alter approach to Backup (£1.7M, 5-6 mths)
4. Change approach to clustering (£0.4M, 3-4 mths)
5. Change Platform to HP/Oracle/EMC (£4.2M, 8-10 mths)
6. Change Platform to IBM/DB2/EMC (£4.2M, 8-10 mths)
7. Establish Premium Support Service (£2.7M, 6-7 mths)
8. Acquire Regular Extended Downtime (£0.3M, 3-4 mths)
9. Rework entire architecture (£1.9M, 7-8 mths)
Evaluated Alternatives
Replace Sun Storage with EMC (utilise EMC HW mirroring)
Implement additional Sun storage and establish an Oracle "standby" of the Production database; utilise Oracle log shipping
Replace Sun storage with EMC and establish an Oracle "standby" of the Production database; utilise Oracle log shipping
Change Approach to Clustering (replace Sun Cluster with Veritas Cluster)
Change Platform to HP V-class/Oracle/EMC
Change Platform to IBM AIX & OS390/DB2/EMC
Establish Premium Availability Support Service
Acquire Regular Maintenance Downtime
Rearchitect entire MRP Infrastructure (Establish new infrastructure based on GSK standards - no one-off technology/implementations)
Options not explored
Change Backup Library
A change to the library alone would not change the transports; without a definitive root cause, it is speculation that changing the transports would eliminate the media errors.
Change to manual failover
Automatic failover is necessary if the business requirements for availability and reliability are to be met.
Recommended Solution
Recommendation
- Replace Sun storage with EMC and establish an Oracle "standby" of the Production database; utilise Oracle log shipping
- Change Approach to Clustering (replace Sun Cluster with Veritas Cluster)
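The "standby" element of the recommendation can be sketched in outline: archived redo logs are periodically copied from the primary to the standby host, where they are applied with Oracle's standby recovery (RECOVER STANDBY DATABASE). The sketch below is illustrative only; the directory names and the copy loop are hypothetical, and a real deployment would ship logs between hosts (e.g. rcp/scp) and drive recovery through Oracle's own tooling:

```python
# Hypothetical sketch of the "ship archived redo logs" half of an
# Oracle standby arrangement. Paths are illustrative; in practice the
# copy would cross hosts and the standby would apply the logs via
# RECOVER STANDBY DATABASE as a separate step.
import shutil
from pathlib import Path

def ship_new_logs(primary_arch: Path, standby_arch: Path) -> list:
    """Copy archived redo logs not yet present on the standby side."""
    standby_arch.mkdir(parents=True, exist_ok=True)
    already_shipped = {p.name for p in standby_arch.iterdir()}
    shipped = []
    for log in sorted(primary_arch.glob("*.arc")):
        if log.name not in already_shipped:
            shutil.copy2(log, standby_arch / log.name)
            shipped.append(log.name)
    return shipped
```

Run periodically (for example from cron), this keeps the standby's archive directory in step with the primary; applying the shipped logs is then a separate recovery step on the standby host.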
Recommended Solution Diagram - Stevenage
Affected/Unaffected Services List
To be defined
Affected/Unaffected Applications List
To be defined
Criteria for Solution
Constraints List
Downtime window on production environment
Other criteria to be defined
Critical Success Factors
To be defined
References
As-Is Architecture Document060300.doc, Rachel Raymond, 20/03/00
SAP_Phase1_STPV021.ppt, Tom Revak, 07/08/01