Building a Soviet Nail Factory: how KPIs killed efficiency
In 2008, I landed my second job, in the network team at Orange Portails, the division behind the websites and search engine of the French telecom operator Orange. The place ran like clockwork: a comprehensive technical setup, a dedicated team for every part of the business, and room to focus on what I do best. A few years later, none of that mattered: thanks to an obsession with the numbers, we could no longer deliver new services on time.
Disclaimer
This is a story I like to tell to warn people about Goodhart’s law.[1] As these events happened almost 15 years ago, my recollection is a bit fuzzy. I left in 2012.
[1] Goodhart’s law often gets the credit, but Campbell’s law describes my experience even better: the more you lean on a number to make decisions, the faster people corrupt it. ❦
The first years
During my first years, the department operated like a startup. Its cradle was the French company Echo. They built a search engine. France Télécom bought it and renamed it Voila. It was the most visited search engine in France in the early 2000s. France Télécom consolidated the portal activities into the Wanadoo Portails division, later renamed Orange Portails.
The technical environment was excellent. We had many internal tools:[2] a ticket system, an RRD-based graphing tool, an IPAM, a reporting tool, and an SNMP-based alerting tool.[3] We deployed our Linux servers with CFEngine. We installed systems and applications from internal Debian repositories. We documented everything in a private MediaWiki instance. Supervision was performed with an ancestor of Xymon. The network architecture was clean and scalable with little legacy. We onboarded new people in a day.
[2] At the time, SaaS was not really a thing. I remember we considered, with a couple of colleagues, selling Wiremaps as a SaaS, with homomorphic encryption for the database. But who would outsource their observability stack? ❦
[3] Snalert was a metacircular alerting tool in Perl. It was able to poll a very large number of SNMP targets in a short timespan. All our monitoring was SNMP-based, including system monitoring. ❦
It was a nurturing environment for me. I developed several tools: lldpd, an 802.1AB implementation; Snimpy, a Pythonic binding for Net-SNMP; Wiremaps, a layer-2 discovery tool with a time machine to know which device is connected where; Kitérő, a tool to simulate network conditions; QCSS-3, a controller for load-balancers; and ipoo, a service available through a Jabber chatbot and a Greasemonkey script to expose IP-related information. I added SNMP support for Keepalived and Quagga. I also started this blog, with articles like “Anycast DNS,” TLS-related articles like “TLS computational DoS mitigation,” SNMP-related articles like “Integration of Net-SNMP into an event loop,” Linux-related articles like “Tuning Linux IPv4 route cache,” and an article about VXLAN long before it was cool.
The collapse
When we needed new servers, the on-site team would take a set from the inventory, install our base Linux distribution on them, put them in the datacenter, and cable them to the top-of-the-rack switches. We opened a ticket describing the servers we needed, and one week later, our servers were available. 💫
Orange wanted to know if this team was performing well, so they asked for KPIs. They decided to use the number of tickets completed in a year, and they asked the team to double this number. So instead of one ticket for a new service, we would open six tickets, one per server. By the end of the year, the KPIs had more than doubled.
Everybody saw it as a success for performance management. So they asked for the same the next year. Now, we needed to open a ticket per server and per step. Again, the KPIs doubled. Behind the scenes, the tickets went to different people and were no longer handled in order. So, for the following year, it was decided to have meta-tickets and meetings to follow the progress of these tickets. Of course, all these extra steps pushed the KPI even higher.
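To make the arithmetic concrete, here is a minimal Python sketch of how the ticket count for one unchanged service request grows under each ticketing policy. Only the six servers per service come from the story. The four steps are an assumption drawn from the provisioning workflow described earlier and may not match how the tickets were actually split. The names `STEPS`, `tickets_for_service`, and the policy labels are hypothetical.

```python
# Hypothetical illustration: count the tickets opened for ONE service
# request under three ticketing policies. Only the six servers per service
# come from the story; the step list is an assumption.

SERVERS_PER_SERVICE = 6
STEPS = ("take from inventory", "install base Linux", "rack", "cable")


def tickets_for_service(policy: str, servers: int = SERVERS_PER_SERVICE) -> int:
    """Return the number of tickets opened to deliver one service."""
    if policy == "per-service":
        return 1
    if policy == "per-server":
        return servers
    if policy == "per-server-and-step":
        return servers * len(STEPS)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("per-service", "per-server", "per-server-and-step"):
        count = tickets_for_service(policy)
        # The delivered work is identical in every case: six servers.
        print(f"{policy:>20}: {count:2d} tickets for {SERVERS_PER_SERVICE} servers")
```

Under these assumptions, the ticket count goes from 1 to 6 to 24 while the delivered work stays at six servers, and that is before counting any meta-tickets.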
This performance management method spread to the other teams.[4] Everything became slower. Instead of a couple of weeks, a new service now took six months. We built a Soviet nail factory. But the KPIs were good, and we stopped caring.
[4] My team also managed the rules of many Linux-based firewalls. To increase our KPIs, we used the same method: rather than accepting one ticket with a flow matrix, we requested one ticket per flow. ❦
Let me give you another example. We had to estimate the impact of each night operation. We weren’t half bad: we declared most operations “without any expected impact.” Most of the time, there was no impact. One time out of five, there was a 5-second impact. We were told to try harder to meet our expected impact. What did we do? We started declaring a 5-second expected impact. One day, we got a 30-second impact and were told we failed to match the expected impact. In the end, we declared most...