What key skills do highly effective operations teams possess? What makes one operations team much more effective than another? Here is my take on a few of the key traits highly effective operations teams have.
Trait 1: React with speed and effectiveness
The quality of service depends on how quickly you can identify and resolve an issue. Mean Time to Identify (MTTI) and Mean Time to Resolution (MTTR) are two key metrics that tell you how you are doing on this dimension. Your team should track both numbers, broken down by the issues’ impact on customers. It’s okay if those numbers are higher for low-impact issues, but for high-impact issues they should be low. Good visibility into what is going on deep within the system is key to resolving issues within minutes rather than hours or days.
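As a rough illustration of tracking these two numbers by impact, here is a minimal sketch. The `Incident` record, its field names, and the severity labels are hypothetical. It also assumes both metrics are measured from the moment customer impact began; some teams measure resolution time from detection instead.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    severity: str          # hypothetical labels, e.g. "high" or "low"
    started: datetime      # when customer impact began
    identified: datetime   # when the team became aware of the issue
    resolved: datetime     # when service was restored


def mean_times_by_severity(incidents):
    """Return {severity: (mtti_minutes, mttr_minutes)}.

    Assumption: both metrics are measured from `started`.
    """
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.severity].append(inc)

    result = {}
    for severity, group in grouped.items():
        mtti = mean((i.identified - i.started).total_seconds() / 60 for i in group)
        mttr = mean((i.resolved - i.started).total_seconds() / 60 for i in group)
        result[severity] = (mtti, mttr)
    return result
```

Reviewing the high-severity pair on a regular cadence, separately from the low-severity one, keeps a pile of minor tickets from masking a slow response to a major outage.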
Trait 2: Proactively monitor – Actively look for failures
Teams at this level are hypersensitive about failures that went undetected. The goal is to lower the frequency of issues that are missed by the monitoring and alerting framework. Rather than tracking a single number, break this metric down by event severity. Over time, the number of high-severity issues missed by monitoring and alerting should go from “frequent” to “very rare”. If your team is small, you might want to focus on a particular set of KPIs and develop full-loop capability for detecting, alerting on, and remediating those issues.
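One way to put a number on “missed by monitoring” is a per-severity miss rate. The sketch below assumes each incident is recorded as a `(severity, detected_by_monitoring)` pair. That record shape and the severity names are made up for illustration.

```python
from collections import Counter


def missed_detection_rate(incidents):
    """Return {severity: fraction of incidents that monitoring did not catch}.

    `incidents` is an iterable of (severity, detected_by_monitoring) pairs,
    where detected_by_monitoring is False when a customer or a human
    noticed the problem before any alert fired.
    """
    total = Counter()
    missed = Counter()
    for severity, detected in incidents:
        total[severity] += 1
        if not detected:
            missed[severity] += 1
    return {severity: missed[severity] / total[severity] for severity in total}
```

Tracked month over month, the rate for the highest severity is the one to drive toward zero first.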
Trait 3: Build really good playbooks – Know how to respond when an alert happens
High levels of monitoring and alerting can quickly lead to “alert fatigue.” To reduce this, create playbooks that are easy to find and execute. Playbooks are simplified lists of steps that tell an operator how to respond when an alert happens. A playbook should be written so that an operator can execute it with zero thinking. And remember to make those playbooks really easy to discover. Or heck, put a link to the playbook in the alert itself.
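Putting the link in the alert itself can be as simple as a lookup at the point where the alert text is rendered. This is only a sketch: the `PLAYBOOKS` mapping, the alert name, and the wiki URL are placeholders, not a real alerting API.

```python
from dataclasses import dataclass

# Hypothetical mapping from alert name to playbook URL.
PLAYBOOKS = {
    "disk_usage_high": "https://wiki.example.com/playbooks/disk-usage-high",
}


@dataclass
class Alert:
    name: str
    summary: str


def format_alert(alert, playbooks=PLAYBOOKS):
    """Render the alert text with its playbook link attached."""
    lines = [f"[ALERT] {alert.name}: {alert.summary}"]
    link = playbooks.get(alert.name)
    if link:
        lines.append(f"Playbook: {link}")
    else:
        # Make the gap visible so someone writes the missing playbook.
        lines.append("Playbook: none on file")
    return "\n".join(lines)
```

Flagging alerts that have no playbook makes the gaps in coverage visible to the whole team.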
Trait 4: Do retrospectives – Learn from failures
Failures are inevitable. Something will always go wrong; your goal is to avoid repeating the same failure. To go one step beyond Trait 2, look at each issue and ask what it was about the process, architecture or people that led to the failure. Was the time it took to resolve the issue acceptable? If not, what can be done to reduce it? Can we automate some of the identification or resolution steps to reduce MTTI and MTTR? Teams can get really good at this by building a culture of blameless post mortems and focusing relentlessly on finding the root cause. If a team doesn’t truly understand the root cause, they can’t be sure that the issue is fixed. And if you aren’t sure that you have fixed the issue, you cannot be sure that it won’t happen again. Ask yourself the five whys until you get to the root cause. Sometimes five is not enough. You have to really get down to the core issue, an issue that you can fix. If you cannot fix it right away, at least make sure you can detect it and recover from it very quickly, hopefully without any impact to the service.
Trait 5: Build resiliency into the system – Make use of auto-healing systems
Having said all the above, many of the issues that turn into operational nightmares can be caught and taken care of at design time. Make the ability to run the service at high quality a key requirement from the get-go. Otherwise, you will pay for bad design and architectural choices several times over by the time the service is generally available and in use.
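To make the auto-healing idea concrete, here is one very small sketch of a self-healing loop: check health, restart on failure, and escalate to a human only when restarts do not help. The `check_health` and `restart` callables and the `max_restarts` limit are hypothetical stand-ins for whatever your platform provides.

```python
def heal(check_health, restart, max_restarts=3):
    """Try to bring a component back to health without paging anyone.

    Returns True if the component is healthy (possibly after restarts),
    or False if it is still unhealthy after `max_restarts` restarts and
    should be escalated to an operator.
    """
    for attempt in range(max_restarts + 1):
        if check_health():
            return True
        if attempt < max_restarts:
            restart()
    return False
```

A real system would also wait between attempts and record every restart, so that a component being healed repeatedly still shows up in a retrospective.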