Ben Godfrey
When does redundancy become redundant?
Tl;dr
Redundancy is needed for a team to be productive, but there is such a thing as too much.
Instaface
Picture a software development team with three members - Alice, Bob, and Carol. All three are software developers, so they write code and build features for the team’s product, a hot new social media site called Instaface. As well as their usual software development work, each team member has some other responsibility. Alice is the team’s leader, the creator and owner of Instaface. Bob is a quality engineer, and he tests the output of the team’s work, making sure that Instaface remains a great product. Carol is the team’s infrastructure guru: she knows what is under the hood and what keeps Instaface ticking over. Note that we are assuming that these additional roles are genuinely above and beyond what is usually expected of a software engineer - Alice, Bob, and Carol are experts and owners of their domains.
These extra responsibilities are not just for show, or to pad CVs. These are vital parts of the software development journey. If Bob were not testing new Instaface features, the team would not catch mistakes in their code, so the team’s first sign of things going wrong would be users making complaints or deleting their Instaface accounts. If Carol did not maintain and monitor the team’s servers, then one out-of-date dependency could lead to a security issue that brings the app to its knees. If Alice were not providing some vision and guidance on what the product should be, then Instaface would scare away potential users. These are necessary roles, and the team simply could not go without them.
So, what happens if one of these roles becomes unavailable? What happens when Alice, Bob, or Carol needs to go to hospital, has a holiday booked, or (as the traditional framing of this question puts it) is hit by a bus? We know that we cannot just ignore these duties, but it is probably not a good idea to wait for the whole team to be back together before making progress, as they may end up waiting a while. To make sure that the team does not grind to a halt on every absence, they might double up on some of these responsibilities. Alice can take on some infrastructure responsibilities, Bob can make some decisions on product, Carol can test new features. Now the team is twice as resilient - it takes two absences, rather than one, to grind production to a halt. Our bus factor has gone from one to two.
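To make the bus factor concrete, here is a minimal Python sketch. It assumes a role is "covered" as long as at least one person who can do it is present, which is a simplification of the story above. The names `bus_factor` and `uncovered_roles`, and the dictionaries describing who covers what, are invented for illustration.

```python
def uncovered_roles(role_cover: dict[str, set[str]], absent: set[str]) -> list[str]:
    """Roles left with nobody to do them when the given people are absent."""
    return sorted(role for role, people in role_cover.items() if people <= absent)


def bus_factor(role_cover: dict[str, set[str]]) -> int:
    """Smallest number of absences that can leave some role uncovered.

    A role only becomes uncovered when everyone who can do it is away,
    so the answer is the size of the smallest group covering any role.
    """
    return min(len(people) for people in role_cover.values())


# One owner per role, as in the original Instaface team (assumed encoding).
single = {
    "product": {"Alice"},
    "testing": {"Bob"},
    "infrastructure": {"Carol"},
}

# After doubling up: Alice helps with infrastructure, Bob with product,
# Carol with testing.
doubled = {
    "product": {"Alice", "Bob"},
    "testing": {"Bob", "Carol"},
    "infrastructure": {"Carol", "Alice"},
}

print(bus_factor(single))   # 1
print(bus_factor(doubled))  # 2
print(uncovered_roles(doubled, {"Alice", "Bob"}))  # ['product']
```

Under these assumptions, any single absence leaves every role covered in the doubled-up team, but losing both Alice and Bob leaves nobody to make product decisions.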
This is not the end of the story though. The team are not completely safe. What if there are two absences? In this case, one of our critical roles cannot be filled. What do they do now? Well, they could triple up on these roles, so we end up with 3 product owners, 3 testers, and 3 infrastructure engineers, meaning that two members of the team could be absent and production does not need to stop. This sounds OK, but let’s stop and think. If two members of the team are absent, then only one in three (33%) of the team is available. Less than half. Is this a situation which we want to prepare for? Does the Instaface team want to accept a situation where only one member of staff is available as a genuine possibility? Perhaps they do, but would the same be true if the team grew to 10 people? 20? 100? I would argue that it is not reasonable for a large team to prepare for a situation where most of its members are going to be unavailable for a prolonged period of time.
As well as this situation not quite feeling right, there is a cost associated with this preparation. To properly train an engineer to be a QE, or a product owner, or whatever else, they need time away from their day-to-day work, a training course, and quarterly refreshers. Very quickly we are in a position where each person who is covering some responsibility requires thousands of pounds of investment. In a large team (or indeed, a large institution), this is not reasonable at all. If Instaface becomes a team of 100 people, then they would not prepare for a situation where only one of these people is keeping all of the lights on. We need to ask ourselves: where should the lines be drawn? What extent of preparation is reasonable, and what can we not justify?
The numbers
Coverage and impact
There are a few elements at play here. Let’s try to catch the essence of each of them and see what impact it has. Firstly, we have a few roles in our team which need to be done. It is not the case that each of these roles is either filled or not filled. In a development team, each of these roles will be covered by a group of people. With this in mind, we can say that the amount that a role $r$ is ‘covered’ at a given point in time can be measured as a proportion of ‘full’. We can denote this proportion as $c_r \in [0, 1]$.
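As a rough illustration of $c_r$, here is one possible way to compute it in Python. This is an assumption rather than a definition given above: it treats "full" as everyone who can cover the role being present, and measures coverage as the fraction of those people who are currently available. The function name `coverage` and the example sets are hypothetical.

```python
def coverage(can_cover: set[str], available: set[str]) -> float:
    """Proportion of a role's usual cover that is currently available.

    Assumed model: 1.0 means everyone who can do the role is present,
    0.0 means none of them are. A role that nobody can do counts as 0.0.
    """
    if not can_cover:
        return 0.0
    return len(can_cover & available) / len(can_cover)


# Product decisions can be made by Alice or Bob; Alice is away today.
product_cover = {"Alice", "Bob"}
in_today = {"Bob", "Carol"}

print(coverage(product_cover, in_today))  # 0.5
```

The result always falls in $[0, 1]$, which matches the range given for $c_r$, but other definitions (for example, weighting people by how much of the role they can actually perform) would fit that range equally well.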
As we have established, coverage is not the full picture....