Completion Report

Project Summary:

This project was initiated to build on the deployment of the new application and database tiers, and to find ways of providing more resilience than our current basic, manual disaster recovery failover provision. Most systems can be failed over to a secondary site in the event of a major problem, but this requires manual, attended intervention.

During the project we looked at technologies such as VMware Site Recovery Manager (SRM), but because the IT Infrastructure strategy changed to delivering a stretched metro cluster, we decided not to implement SRM. Instead, the project focused on 'quick wins', where gains can be made with little effort.

One of these 'quick wins' was to implement and test facilities that let applications keep working when their databases fail over, with no manual intervention required. This has been implemented for two top priority services, MyEd and Central Wiki, with no impact on service encountered.
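The report does not describe the mechanism used for MyEd or Central Wiki. As a minimal sketch of the general idea only, an application can be given an ordered list of database endpoints and use the first one that accepts a connection, so nobody has to re-point it after a failover. The function name, the host names and the injected `connect` callable below are all hypothetical and are not taken from the project.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Conn = TypeVar("Conn")


def connect_first_available(
    endpoints: Sequence[str],
    connect: Callable[[str], Conn],
) -> Tuple[str, Conn]:
    """Try each endpoint in order; return the first one that connects.

    `connect` is whatever driver call the application already uses
    (hypothetical here). It is expected to raise ConnectionError when
    the endpoint is unreachable.
    """
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as exc:
            # Remember why this site failed, then move on to the next one.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("no database endpoint reachable: " + "; ".join(errors))


def fake_connect(endpoint: str) -> str:
    """Stand-in driver for the demo: the first site is treated as down."""
    if endpoint == "db.site-a.example":
        raise ConnectionError("primary site is down")
    return f"connection to {endpoint}"


# Hypothetical host names for the two sites.
SITES = ["db.site-a.example", "db.site-b.example"]
used_endpoint, connection = connect_first_available(SITES, fake_connect)
```

In this demo the application ends up connected to the second site without any configuration change. A real deployment would more likely rely on the database client's own failover support than on hand-written retry logic.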

 

Scope

To look at a specific service and implement a failover facility to manage planned or unplanned events in the infrastructure or application layer, e.g. Oracle, SQL or MySQL.

 

Objectives, Deliverables, Benefits, Success Criteria

Objectives and Deliverables

  • To test the failover process on a candidate application service using Oracle Data Guard
  • To deliver a structure for each technology set used in our environment, highlighting which services fall into which category and producing guidance on how failover options could be achieved
  • Complete a review of Top and Medium Priority Services to understand their level of resilience (FULL, PARTIAL, NONE) - Complete
  • Utilise Data Guard to configure the application to initiate database failover for an Oracle application - Already in place
  • For a selected application with a simple infrastructure, Events Booking, progress through the Development and Test environments to prove the failover process - Removed from scope
  • Use the process proven for Events Booking on a further selected application with a more complex infrastructure, EBIS Online / Web Central, and prove failover through the Development, Test and Live environments - Central Wiki Service was selected as a replacement
  • For each candidate application, confirm a successful manual failover and complete a manual failback to the original state - Completed (Switchover)
  • If, through analysis, the project team can provide a proven automated failover process for candidate applications, this will be tested and manually failed back - Not completed
  • Guidance roadmap on how to achieve this for the remaining Oracle applications, and establish a route for SQL and MySQL applications (this will be dependent on availability of remaining budget) - Only completed for Central Wiki
    • As a follow-up to the MyEd failover activity completed by Production, the project also fixed the IDM MyEd connector in TEST as an additional task

Benefits

Within the University of Edinburgh, staff and students have been operating more and more on a 24x7x365 basis, and service availability is key to supporting this. While our infrastructure is robust, issues do arise, and as we do not have personnel on hand around the clock, a failover facility is required.

Benefits include:

  1. Better availability of services - while the service is reliable in general, we would expect this facility to reduce the number of unplanned outages, with at least 2 potential events being mitigated a year.
  2. Improved customer satisfaction for users across the service.
  3. Opportunity to use the function to reduce planned outages for individual services.
  4. Input into the planning round for implementation of a resilience facility for services.

Success criteria

  • Provide a proven manual failover process for candidate applications (Must Have) - Complete
  • Confirm the application can be database agnostic (Must Have) - Complete
  • A mechanism or command that can decide to complete a partial failover for planned or unplanned events (Should Have) - Not required for Switchover
  • An Observer process (Should Have) - This may be possible when Stretch Clusters are introduced

 

 

Out of Scope

  • Following discussions with the project team, it became clear that using Events Booking would mean failing over AppsDEV and AppsTest, which would impact a number of other applications, and that EBIS Online / Web Central had complexities with SAMBA drives. We therefore decided to look at other applications and agreed to use the Central Wiki Service as the candidate application.
  • The project was asked to incorporate the SRM solution following discussions with ITI. However, because of issues encountered in implementing the technology, the project agreed to remove it, as a preferred solution using a VM Stretch Cluster was being introduced; this is not expected to be available until Summer 2017.
  • It was also agreed not to progress the analysis for an automated failover process, because of the future introduction of the Stretch Cluster infrastructure.

 

Analysis of Resource Usage:

Staff Usage Estimate: 100 days

Staff Usage Actual: 28 days

Staff Usage Variance: (72%)

Other Resource Estimate: 0 days

Other Resource Actual: 0 days

Other Resource Variance: 0%

Explanation for variance:

  • 16 days removed as unused days in the 15/16 financial year
  • 23 days returned to the INF programme following removal of tasks aligned to SRM activity. Following discussions with the Programme Manager, the Project Sponsor and ITI, the SRM solution was de-scoped from this project. The main reasons are:
    • ITI encountered difficulties in implementing the SRM technology
    • The SRM technology does not deliver automated failover capabilities. A proposal between IS Applications and ITI is under way to deliver a VM Metro stretched cluster, which will provide an automated, seamless cross-site storage and server solution. This solution will be rolled out by Summer 2017 and is not ready for this project to use.
    • The focus of the project is now to deliver a solution where applications are database agnostic. This means that when databases move sites, the application does not require manual intervention to "point" to the new database, but will by itself "know" where the database is served from. This has already been implemented and tested for MyEd. This project will establish other services (Central Wiki) where this is being implemented and tested.

  • 25 days effort returned to INF programme following re-estimation of remaining activity.
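
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the report does not state its variance formula, so the code assumes that variance means the underspend as a percentage of the estimate, and that the parentheses around 72% mark an underspend. The function and variable names are illustrative.

```python
def variance_percent(estimate_days: float, actual_days: float) -> float:
    """Underspend as a percentage of the estimate (assumed formula).

    Positive means fewer days were used than estimated. A zero estimate
    is reported as 0% rather than dividing by zero.
    """
    if estimate_days == 0:
        return 0.0
    return (estimate_days - actual_days) * 100 / estimate_days


# Figures taken from the report.
staff_estimate, staff_actual = 100, 28
returned_days = [16, 23, 25]  # the three bullets in the explanation

staff_variance = variance_percent(staff_estimate, staff_actual)
other_variance = variance_percent(0, 0)

underspend = staff_estimate - staff_actual     # days not used
itemised = sum(returned_days)                  # days the bullets account for
not_itemised = underspend - itemised           # remainder not explained above
```

Under these assumptions the staff variance is 72% and the other resource variance is 0%, matching the report. The three bullets account for 64 of the 72 unused days, leaving 8 days that the explanation does not itemise.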

Key Learning Points:

  • Projects of this type can be difficult to resource: key stakeholders are required from Production Management, and their time is limited because production must take first priority to limit any potential impact on service
  • With the infrastructure landscape changing and evolving, the project team used its awareness of the situation to make the correct decision to descope SRM, allowing unused days to be transferred back to the INF programme for use on other projects
  • The INF Programme Manager thanked the project team for being pragmatic and scoping sensibly

Outstanding issues:

  • None

Project Info

Project: Service Resilience Capability

Code: INF119

Programme: ISG - IS Applications Infrastructure (INF)

Management Office: ISG PMO

Project Manager: Karen Stirling

Project Sponsor: Stefan Kaempf

Current Stage: Close

Status: Closed

Start Date: 06-Apr-2016

Planning Date: n/a

Delivery Date: n/a

Close Date: 18-May-2017

Programme Priority: 7

Overall Priority: Normal

Category: Discretionary
