| In Reliability

This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with this week? That this problem is special, and is really, really bad? You know, that […]

| In Reliability

This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF – and mean time to recovery – MTTR. In our second post we went on to […]

| In Reliability

This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF – and mean time to recovery – MTTR.  Both increasing MTBF and reducing MTTR […]

| In Reliability

This post is meant as a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I’ll go over concepts like availability, SLA, mean time between failure, mean time to recovery, etc.