Site Reliability Engineering (SRE) is as much a methodology as a culture, similar to the concept of DevOps.
Changing Culture of Teams to Incorporate SRE Practices
Psychological Profiles of Team Members
There are four basic psychological profiles into which you can generally categorize your team members:
- Navigators: Those that want to move forward and help you succeed
- Critics: Those that have valid fears about change and have passion and energy about their thoughts and positions
- Victims: Those that see change as an attack on them personally
- Bystanders: Those that are generally apathetic. For them, you need to work out what their feelings are regarding the change and see whether and how you can engage them in the process
The best ways to manage potentially negative emotional responses to change
- Involve people in the change
- Set realistic expectations
- Identify opportunities for co-creation, and coach instead of providing a complete solution. Ask questions that lead people toward the conclusions you are looking for, so that they discover the answer themselves and then own the ideas
- Simplify messaging and focus on key concepts on a group-by-group basis
- Ensure that communications are engaging and that training is interactive
- Allow people time to build new habits
Measure Everything
- Reliability: error budget, SLI, SLO, indicators of user happiness
- Toil: how much time is spent on toil
- Monitoring: monitor symptoms not causes
- Four Golden Signals: latency, traffic, errors, saturation
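The four golden signals can be summarized from raw request data. The following is a minimal sketch, not a production monitor: the `RequestSample` type, the `golden_signals` function, the choice of HTTP 5xx as "error", and the use of CPU utilization as the saturation measure are all assumptions made for illustration, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status: int  # HTTP status code


def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    rank = (p * len(sorted_values) + 99) // 100  # integer form of ceil(p * n / 100)
    return sorted_values[rank - 1]


def golden_signals(samples, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not samples:
        raise ValueError("need at least one request sample")
    latencies = sorted(s.latency_ms for s in samples)
    errors = sum(1 for s in samples if s.status >= 500)  # assumption: 5xx counts as an error
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "saturation": cpu_utilization,  # assumption: CPU is the constrained resource
    }


# Invented data: ten requests in a ten-second window, one of which failed.
samples = [RequestSample(latency_ms=10.0 * i, status=200) for i in range(1, 10)]
samples.append(RequestSample(latency_ms=100.0, status=500))
signals = golden_signals(samples, window_seconds=10, cpu_utilization=0.72)
for name, value in signals.items():
    print(f"{name}: {value}")
```

All four values describe symptoms a user could feel (slow, overloaded, or failing requests), which is in keeping with the "monitor symptoms not causes" note.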
SRE Skills
- Operations and Software Engineering
- Monitoring principles
- Production automation
- System architecture
- Troubleshooting and debugging
- Culture of trust
- Incident management and communication
Key Concepts
- SLI: Service Level Indicator. INDICATOR. A quantitative measurement, typically a metric, that expresses the health of a given component in the system
- SLO: Service Level Objective. GOAL. A target value for a service's availability or performance, as measured by the SLIs
- SLA: Service Level Agreement. PROMISE. A guarantee that defines the consequences of missing your SLOs
Stakeholders
- Product Managers
- Executives
- Customers
- Developers and SREs
The key is to focus on the user experience
Choosing a good SLI
- Define a quality target
- How do users interact with the product?
- SLI = good events / valid events, expressed as a percentage
SLO = an applied SLI, which must include a target and a time window
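As a rough sketch of how the SLI ratio, an SLO with a target and time window, and the resulting error budget fit together: the `SLO` class, the function names, the 99.9% target, the 30-day window, and the event counts below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An SLI plus a target and a time window (hypothetical structure)."""
    name: str
    target: float     # 0.999 means 99.9% of valid events must be good
    window_days: int  # rolling window over which the events are counted


def sli(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events, as a fraction between 0 and 1."""
    if valid_events == 0:
        return 1.0  # assumption: with no valid events, nothing has failed
    return good_events / valid_events


def error_budget_remaining(slo: SLO, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is missed."""
    allowed_bad = (1 - slo.target) * valid_events
    actual_bad = valid_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


# Invented numbers: one million valid requests in the window, 500 of them bad.
availability = SLO(name="checkout availability", target=0.999, window_days=30)
current = sli(good_events=999_500, valid_events=1_000_000)
remaining = error_budget_remaining(availability, 999_500, 1_000_000)

print(f"SLI over {availability.window_days} days: {current:.4%}")
print(f"Meets SLO target of {availability.target:.1%}: {current >= availability.target}")
print(f"Error budget remaining: {remaining:.0%}")
```

With a 99.9% target, one million valid events allow roughly 1,000 bad ones, so 500 bad events spend about half of the budget for the window.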
SLE = environments
SLU = updates – communications to “customers”
SLY = why?
Monitor | Metric |
Time-based | Counts |
OK time / total time = % | Good events / total events = % |
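The two columns above can be sketched as two small calculations. This is an illustrative sketch only: the function names, the one-probe-per-minute assumption, and the sample numbers are not from the notes.

```python
def time_based_availability(probe_results: list[bool]) -> float:
    """Monitor-style SLI: OK time / total time.

    Assumes each probe result stands for one equal-length interval,
    for example one synthetic check per minute.
    """
    if not probe_results:
        raise ValueError("need at least one probe result")
    return sum(probe_results) / len(probe_results)


def count_based_availability(good_events: int, total_events: int) -> float:
    """Metric-style SLI: good events / total events."""
    if total_events <= 0:
        raise ValueError("need at least one event")
    return good_events / total_events


# Invented hour of monitoring: 60 one-minute probes, 3 of which failed.
probes = [True] * 57 + [False] * 3
print(f"time-based:  {time_based_availability(probes):.2%}")

# Invented hour of traffic: 12,000 requests, 30 of which failed.
print(f"count-based: {count_based_availability(11_970, 12_000):.2%}")
```

The two measures can disagree for the same incident: a short outage during a quiet period costs the same amount of OK time as one at peak, but affects far fewer events.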
Game days and Chaos Monkey: purposefully breaking things while watching how the system responds