Closure Report
Project Summary
This project was broken into four key work streams:
Mellanox - Scope
The core Ethernet switches and network for the Eddie Cluster are going out of support and need to be replaced. The scope was:
- To design a replacement solution using Mellanox technology, with Alces Services installing the solution.
- To specify and order the kit, and have it on site and receipted by the end of June.
- To go live, although we had no clear go-live date for this.
Eddie Node Refresh - Scope
- Buy a number of Eddie compute nodes to replace some of those that have gone out of warranty.
- We have ongoing issues with power in the Data Centre (ACF CR2), in that the room has no available UPS capacity. Agree a way to progress this with ACF Management to the point where we can get these new nodes and other outstanding infrastructure into the room.
- Create and execute deployment plan.
- The scope of the project will run until the nodes are racked, stacked and have an operating system installed. Incorporating them into the New Scheduler is out of scope.
VMWare Servers - Scope
- The Research Services VMWare servers have reached end of life.
- These will be replaced and improved on.
- The scope of the project runs from definition and purchase through racking, stacking and power-up to bringing the servers into service.
TS3500 Backup Tape Drives - Scope
We had 14 tape drives in the TS3500 at the ACF, which were pretty much fully utilised, so we increased this number to 20, the current maximum the tape rack will allow. This will shorten the time it takes to restructure the TSM Library and, in the longer term, add to the general capacity of the Service.
The scope was to specify, purchase and implement the drives.
Delivery Summary
Mellanox
We had multiple meetings with Dell and Mellanox technologists to analyse our requirements and design a solution to fit our budget and requirements. It was never intended that this would replace the whole core network; a follow-on project would be required for that.
We originally intended to use Mellanox Services to deploy the solution, but after initial engagement it was clear they were going to be too expensive and too slow. It seemed that we were too small a project for them to be interested in.
At this point we engaged Alces Services, who know our environment and with whom we’ve worked successfully on numerous occasions.
We agreed the bill of materials and ordered the kit based on a six-week delivery window.
We then agreed a deployment plan with Alces.
The only thing we couldn’t agree was an outage to allow us to go live. This is outstanding.
Eddie Node Refresh
After discussion with the Systems Team we agreed the specification of a standard Eddie worker node. We then agreed the budget and which existing out-of-warranty nodes would be replaced.
We discussed this with Alces Services, who normally test and deploy our nodes, and came up with an installation plan, rack design, etc.
The key delivery issue we had with the nodes was the ongoing UPS issue.
To resolve this, we worked closely with the ACF and agreed a go-live design that included reducing our use of Phase 1 of the UPS and utilising Phases 2 and 3 more. This required an audit of which UPS phase every piece of our kit in the machine room was using, and the creation of a UPS Phase Utilisation Map of our kit in that room. By doing this, we reached agreement to put these nodes and several other pieces of outstanding kit into the machine room, including the TS3500 tape drives that are also covered by this project. This was probably the key achievement in this project.
Once the above was achieved, Alces were able to come in, remove the redundant nodes and install the replacements. The new nodes were passed on to RSS051 to be incorporated into the new scheduler.
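The UPS phase audit described above could be recorded and summarised along the following lines. This is only a minimal sketch: the device names, load figures and function names are hypothetical placeholders, not values or tooling from the actual audit, and it assumes each piece of kit draws from a single UPS phase.

```python
from collections import defaultdict

# Hypothetical audit records: (device name, UPS phase, estimated load in kW).
# All names and figures are illustrative placeholders, not real audit data.
AUDIT = [
    ("example-node-01", 1, 0.4),
    ("example-node-02", 1, 0.4),
    ("example-switch-01", 2, 0.2),
    ("example-tape-drive-01", 3, 0.1),
]


def phase_utilisation(records):
    """Return (total load per phase, device names per phase)."""
    totals = defaultdict(float)
    devices = defaultdict(list)
    for name, phase, load_kw in records:
        totals[phase] += load_kw
        devices[phase].append(name)
    return dict(totals), dict(devices)


def least_loaded_phase(totals, phases=(1, 2, 3)):
    """Return the phase with the lowest total load, a candidate for new kit."""
    return min(phases, key=lambda p: totals.get(p, 0.0))


if __name__ == "__main__":
    totals, devices = phase_utilisation(AUDIT)
    for phase in sorted(totals):
        count = len(devices[phase])
        print(f"Phase {phase}: {totals[phase]:.1f} kW across {count} device(s)")
    print("Least loaded phase:", least_loaded_phase(totals))
```

A map kept in this form would make it straightforward to see which phase has headroom before new kit is proposed to the ACF.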
VMWare Servers
We agreed a design and spec for the VMWare servers. These were purchased, delivered and receipted. They have been racked, stacked and tested.
There is however an outstanding task to migrate the service from the existing servers and close the old servers down. This task will be carried over to next year’s project.
TS3500 Backup Tape Drives
We ordered 6 additional tape drives. These couldn’t be commissioned initially due to the ACF UPS issue. Once that was resolved, the drives were installed and commissioned, and they are now in service.
Status of Project benefits.
Mellanox
- We’ve yet to realise any project benefits from this work stream as the switches aren’t installed or live yet. To achieve this there is a key management dependency of agreeing a go live outage.
Eddie Node Refresh
- 56 replacement nodes handed over for inclusion in the new scheduler.
- Delivered on time and on budget, with an overall increase in capacity.
- A UPS Utilisation Process that we should be able to carry into the next year with the ACF.
VMWare Server refresh
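- The servers are racked, stacked and powered up.
- We haven’t actually migrated the service from the existing servers yet, so the benefits of this work stream are still to be realised.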
TS3500 Backup Tape Drives
- The additional six drives are in place and being used by the service. This should significantly speed up splitting down the TSM Libraries and ultimately improve the backup service as a whole.
Explanation of Variance
The Mellanox switches did not go live for several reasons:
- Kevin Tomlinson leaving left the systems team short of resource to support Alces.
- Attacks on Datastore diverted Systems team resource into dealing with them.
- We were not able to agree a go-live outage. See the key learning point below.
- The delivery time of the switches was not as advised by Esteem and Dell. See the key learning point below.
The VMWare server service migration was impacted by other priorities within the systems team. This will be picked up and completed by next year’s Infrastructure Project.
Key Learning Points
- A key issue is that we have no agreed maintenance windows for the Eddie Cluster and no process for agreeing an outage to perform maintenance and upgrades. We should discuss half-yearly or quarterly maintenance windows, not just for Eddie but for all service platforms.
- Given the physical issues at the ACF (primarily UPS, but also cooling and space), it will pay dividends to continue to work very closely with the ACF going forward and to appreciate their requirements of us. We should endeavour to let them know what changes we intend to make from inception.
- The delivery service we get from Dell and Esteem is no better than you would get from eBay, and probably worse. We had several deliveries go to the wrong buildings, be signed for by unknown people or by the delivery drivers themselves, and kit simply dumped in non-secure areas. We should get Dell and Esteem to commit to a delivery process under which deliveries go where we want them, when we want them, and to whom we want them to go.
Follow on Tasks
Mellanox
- Agree a go-live window.
- Plan for Alces to come in, install the new switches and cutover.
- Discuss delivery issues with Mellanox, Esteem and Dell. We didn’t receive the switches and cables until well after we expected them.
Eddie Node Replacement
There will probably be more of the same in the coming year, so we should try to quantify this as soon as possible and continue to engage the ACF from a power and cooling perspective. By doing that, hopefully we won’t have the same issues we had this year.
VMWare Server Refresh
The outstanding Service Migration requirement will be picked up by next year’s infrastructure project.
