Overview
This project follows on from the outcome of the Service monitoring requirements and options appraisal (2013/14) and the recommendations of the "INF107 - Environment Monitoring System - Assessment" project (2014/15).
Background
IS manages a large number of University High Priority services with 24x365 uptime and 99.9% availability expectations. Over the last ten years the number of infrastructure components involved in providing these priority services has grown very significantly, to a level where manual checks are no longer feasible. Additionally, the number of lower priority but high value services has grown at an even greater rate over the same period.
Automated monitoring, reporting and alerting for these technological and infrastructural fundamentals and services is key to delivering a successful portfolio of business systems to the University. Without such facilities in place, the ability to cohesively manage, grow, and deliver reliable, performant services is considerably diminished. Key resources are also distracted and put under considerable strain, because manual and imperfect monitoring is demanded of them. This in turn translates into increased levels of risk to services. Furthermore, this persistent and increasing workload prevents resources from progressing with other work that might be of more benefit to the organisation.
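To put the 99.9% availability expectation mentioned above into concrete terms, the short sketch below converts an availability target into a downtime budget. It is illustrative only: the function name `downtime_budget_hours` is hypothetical, and it assumes availability is measured over a full 24x365 year with no exclusions for planned maintenance, which this brief does not specify.

```python
# Illustrative only: converts an availability target into a downtime budget.
# Assumption (not stated in the brief): availability is measured over the
# whole period, with no exclusions for planned maintenance windows.

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum hours of downtime allowed within the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (100 - availability_percent) / 100


if __name__ == "__main__":
    yearly = downtime_budget_hours(99.9)
    print(f"99.9% over a year allows about {yearly:.2f} hours of downtime")
```

Under these assumptions a 99.9% target leaves under nine hours of downtime per service per year, which gives a sense of why manual checks across a growing number of components are hard to sustain.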
Scope
This project will pilot the use of Microsoft System Center Operations Manager (SCOM) against a maximum of 3 key IS managed services to evaluate its ability to deliver effective information to Service Owners, and also to explore the types of organisational changes needed to deliver and support the system and service monitoring required by Information Services.
The pilots will attempt to cover the breadth of the technology stacks in use, so that together they represent the majority of services across IS.
The Project Team propose that the initial candidate service should be one based on a Microsoft technology stack to reduce the potential technical obstacles, with the expectation that this should prove a useful baseline to start with (for example, UniDesk). This first stage would be reviewed before moving on to a Linux based service (for example, Learn), which would be reviewed before moving on to one of UoE's corporate services (for example, euclid / HR / efin).
The project will include a recommendations report outlining SCOM's effectiveness as an enterprise level monitoring tool and provide guidance on how it can be employed across key IS services, as well as the staffing and other supporting services required.
Objectives, Deliverables, Success Criteria, Benefits
Objective | Deliverables | Success Criteria | Benefits | Priority |
1. Evaluate SCOM as a candidate for an enterprise level system monitoring tool. | 1.1 Selection of 2-3 key IS services to pilot use of SCOM. To include at least one Windows-based, one Linux, and one Corporate service. | 1.1 Candidate services agreed and measurement criteria established. | 1.1 Scope is agreed, metrics are baselined, and can be measured. Baselines can be used to define "normal" and "unusual" behaviour. | 1.1 Must have |
 | 1.2 Configuration of SCOM for each of the selected services, including documentation of the settings for each service. | 1.2 Each service has a level of horizontal and vertical monitoring, with settings fully documented. Technical staff in Apps and ITI should be able to use the SCOM monitoring as part of their daily routine. | 1.2 Understanding of the SCOM tool's capabilities is widespread among IS ITI and IS Apps, making it easier and more efficient to configure as part of operational business and to target specific areas of concern. | 1.2 Must have |
 | 1.3 Dashboard capability for each service available to relevant Service Owner. | 1.3 Dashboards are made available to relevant service owners. | 1.3 Efficient and proactive visibility of service incidents and their impact. | 1.3 Must have |
 | 1.4 Report on SCOM capability at UoE based on pilots of selected services. | 1.4 ITI and Apps are able to determine scope and effectiveness of SCOM for a given IS service. | 1.4 ITI and Apps staff better equipped to make informed decisions on efficient and cost-effective management of key services in line with SLAs and OLAs. | 1.4 Must have |
 | 1.5 SCOM training for project team / virtual service monitoring team. | 1.5 ITI and Apps staff are able to configure SCOM and share knowledge to grow that capability within relevant teams. | 1.5 Flexibility across ITI and Apps to respond quickly to business demands. | 1.5 Should have |
2. Outline IS service monitoring role(s). | 2.1 Description for relevant new roles including "Availability Manager" and "Capacity Manager" role(s). | 2.1 Roles are defined and unambiguous. | 2.1 It is clear what is required to ensure adequate capacity exists to meet agreed levels of service, and how to determine availability requirements for new and existing services. | 2.1 Should have |