Famous Last Words
There have been many (alleged) Famous Last Words over the years.
- Kismet, Hardy.
- I told you I was ill.
- Either that Wallpaper goes, or I do.
But the most memorable famous last words that I heard in the past 12 months were: “We’ve designed it so that we don’t need Capacity Management”.
Regular readers will appreciate that I greeted this statement with a fair degree of scepticism. In a career lasting over 25 years in the Capacity Management world, this would be the first time that anyone had designed a system that didn’t need any form of Capacity Management. I’d go further. This would be the first time that anyone designing a system had even given more than a passing thought to Capacity Management!
So what had this genius designer done, and how had they achieved the previously impossible?
The service in question was an Oracle RAC Cluster solution that provided a virtualised Oracle environment to multiple clients. Each cluster was made up of 8 physical blades that had their own processor and memory but shared storage. The cluster solution ensured that the blades would be paired together to provide resiliency. All pretty normal so far.
The first step to eradicate Capacity Management was to define T-Shirt sizes for the client databases. Clients could request a Small, Medium, Large, or X-Large database. Each blade could accommodate a fixed number of these T-Shirts, and once the blade was full, then no further requests could be placed. Yes. I know. Anyone reading this who has even the slightest connection to Capacity Management is already seeing that we have a “finite limit” there, and that is bread-and-butter for Capacity Management.
But no. The designer had thought of this (he said), and each blade could only support one type of T-Shirt at a time. Therefore once a client had requested a “Medium” database and this had been placed onto blade “A”, then blade “A” would only ever accept “Medium” requests. This way, the designer said, you don’t need to consider the complicated combination of lots of different T-Shirt sizes on each blade, you just need to know the maximum number of “Medium” databases that can be supported and monitor for when you reach that value. Yes. I still know. The designer has said that Capacity Management isn’t going to be required… but there will definitely need to be some MONITORING done of the CAPACITY, and someone will need to MEASURE the UTILISATION to check for when we reach the LIMITS.
Now that sure sounds like Capacity Management to me.
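To make the point concrete, here is a minimal Python sketch of the placement rule as the designer described it: a blade locks to the first T-Shirt size placed on it and refuses anything beyond the limit for that size. The `Blade` class, its method names, and the `utilisation` helper are illustrative assumptions, not part of the actual service. The per-blade limits are the ones quoted later in this post.

```python
from dataclasses import dataclass
from typing import Optional

# Maximum databases of each T-Shirt size that one blade can host.
MAX_PER_BLADE = {"Small": 32, "Medium": 16, "Large": 8, "X-Large": 4}


@dataclass
class Blade:
    """Hypothetical model of one blade under the one-size-per-blade rule."""
    size: Optional[str] = None  # size the blade is locked to; None until first use
    builds: int = 0             # databases currently placed on the blade

    def can_accept(self, requested: str) -> bool:
        # A blade that already hosts one size never accepts another.
        if self.size is not None and self.size != requested:
            return False
        return self.builds < MAX_PER_BLADE[requested]

    def place(self, requested: str) -> bool:
        if not self.can_accept(requested):
            return False
        self.size = requested
        self.builds += 1
        return True

    def utilisation(self) -> float:
        # Fraction of the blade's limit in use: the number someone has to watch.
        if self.size is None:
            return 0.0
        return self.builds / MAX_PER_BLADE[self.size]


if __name__ == "__main__":
    blade = Blade()
    print(blade.place("Medium"))      # True: the blade is now locked to Medium
    print(blade.can_accept("Small"))  # False: wrong size for this blade
    print(blade.utilisation())        # 0.0625, i.e. 1 of 16 Medium slots used
```

Even in this toy form, `can_accept` is a capacity check and `utilisation` is a measurement against a limit, which is the point of the paragraph above.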
For the first 6 months of the service, the Product team would not let me have access to the monitoring data, and were adamant that they were not going to need more infrastructure for their service.
But then, 6 months ago, disaster. They couldn’t quite understand it. They had plenty of spare infrastructure, but clients were complaining that their requests for new databases were failing. How could this be?
I was asked to take a look.
I have simplified the situation below, but in essence this is where they had ended up.
The maximum capability of a blade to support each T-Shirt size was as follows:
T-Shirt Size | Maximum per blade |
Small | 32 |
Medium | 16 |
Large | 8 |
X-Large | 4 |
And these were the clusters and configurations in place:
Blade | Cluster A (T-shirt Size, current builds) | Cluster B (T-shirt Size, current builds) | Cluster C (T-shirt Size, current builds) |
1 | Small 15 | Medium 16 | Small 3 |
2 | Small 15 | Medium 16 | Small 3 |
3 | Medium 16 | Small 1 | Medium 16 |
4 | Medium 16 | Small 1 | Medium 16 |
5 | Large 2 | Large 1 | Medium 16 |
6 | Large 2 | Large 1 | Medium 16 |
7 | X-Large 1 | Large 1 | X-Large 1 |
8 | X-Large 1 | Large 1 | X-Large 1 |
So the total status across all clusters was as follows:
T-Shirt Size | Live Instances | Spare Capacity |
SMALL | 38 | 154 |
MEDIUM | 128 | 0 |
LARGE | 8 | 40 |
X-LARGE | 4 | 12 |
As you can see, there is plenty of spare capacity… enough for over 200 databases… but sadly… there is no spare capacity for any more MEDIUM databases and each of the three clusters is fully populated so there are no spare blades that could be assigned to MEDIUM databases.
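The tally can be reproduced from the per-blade figures with a short Python sketch. The blade data is copied from the cluster table above. The names `CLUSTERS`, `summarise`, and `can_place` are illustrative assumptions, and the placement check assumes (as in the situation described) that every blade is already locked to a size.

```python
# Maximum databases of each T-Shirt size that one blade can host.
MAX_PER_BLADE = {"Small": 32, "Medium": 16, "Large": 8, "X-Large": 4}

# (T-Shirt size, current builds) for blades 1 to 8 of each cluster.
CLUSTERS = {
    "A": [("Small", 15), ("Small", 15), ("Medium", 16), ("Medium", 16),
          ("Large", 2), ("Large", 2), ("X-Large", 1), ("X-Large", 1)],
    "B": [("Medium", 16), ("Medium", 16), ("Small", 1), ("Small", 1),
          ("Large", 1), ("Large", 1), ("Large", 1), ("Large", 1)],
    "C": [("Small", 3), ("Small", 3), ("Medium", 16), ("Medium", 16),
          ("Medium", 16), ("Medium", 16), ("X-Large", 1), ("X-Large", 1)],
}


def summarise(clusters):
    """Return live instances and spare slots per T-Shirt size."""
    totals = {size: {"live": 0, "spare": 0} for size in MAX_PER_BLADE}
    for blades in clusters.values():
        for size, builds in blades:
            totals[size]["live"] += builds
            totals[size]["spare"] += MAX_PER_BLADE[size] - builds
    return totals


def can_place(clusters, requested):
    """True if any blade locked to the requested size still has a free slot."""
    return any(
        size == requested and builds < MAX_PER_BLADE[size]
        for blades in clusters.values()
        for size, builds in blades
    )


if __name__ == "__main__":
    for size, figures in summarise(CLUSTERS).items():
        print(size, figures["live"], figures["spare"])
    print(can_place(CLUSTERS, "Medium"))  # False: every Medium blade is full
    print(can_place(CLUSTERS, "Small"))   # True
```

The spare slots are real, but they are stranded on blades locked to other sizes, so a MEDIUM request fails while a SMALL one succeeds.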
If the SMALL and X-LARGE databases on Cluster C could be moved onto Cluster A, then that would free up some blades for use as MEDIUM hosts. But guess what? The same designer that had allegedly achieved the impossible and made Capacity Management unnecessary had also made it impossible to migrate from one cluster to another.
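For a sense of what that migration would have been worth, here is a rough what-if calculation in Python. It uses the SMALL and X-LARGE figures for Clusters A and C from the tables above. It assumes that a blade emptied by the move could be re-designated as a MEDIUM host, and the function name and data layout are purely illustrative.

```python
# Maximum databases of each T-Shirt size that one blade can host.
MAX_PER_BLADE = {"Small": 32, "Medium": 16, "X-Large": 4}

# Blades locked to each size, and total builds on them, per cluster.
cluster_a = {"Small": {"blades": 2, "builds": 30},
             "X-Large": {"blades": 2, "builds": 2}}
cluster_c = {"Small": {"blades": 2, "builds": 6},
             "X-Large": {"blades": 2, "builds": 2}}


def medium_slots_freed(source, target):
    """Medium slots gained if the source's databases fit on the target."""
    freed_blades = 0
    for size, src in source.items():
        tgt = target[size]
        headroom = tgt["blades"] * MAX_PER_BLADE[size] - tgt["builds"]
        # Only count the source blades as freed if every database fits.
        if src["builds"] <= headroom:
            freed_blades += src["blades"]
    return freed_blades * MAX_PER_BLADE["Medium"]


if __name__ == "__main__":
    print(medium_slots_freed(cluster_c, cluster_a))  # 64
```

Under those assumptions, the move would empty four blades on Cluster C and open up room for 64 more MEDIUM databases.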
So relax everyone… Capacity Management is still as necessary as it has always been… and if this example is anything to go by… it is MORE necessary than ever!