How We Shortened Development Feedback Loops From 30m to 30s - monday engineering
Michał Szeląg<br>May 25, 2026<br>9 min read
monday.com runs on 100k+ vCPUs; you cannot run that on a laptop. This physical reality created a 30-minute tax on every developer’s day – a substantial delay in our feedback loop.<br>That delay matters, because a fast feedback loop is fundamental to pushing features, gaining market traction, caring about your customers, staying ahead of your competition, and, in the end, earning money.<br>In this post, I’ll introduce you to monday.com’s approach to this problem, one that suits our unique needs and the workflows embedded directly into our core.<br>I’ll focus less on the technical details and more on the sociotechnical challenges you will face in big organizations.<br>monday.com’s scale<br>
Environment | Nodes | vCPUs | Memory | Usage context<br>
Production | 4,900 | 90,000 | 292 TB | Global user base<br>
Staging | 645 | 6,700 | 25 TB | Tests, automations, Monday Mirror (~200 users)<br>
Ephemeral | 140 | 617 | 2.32 TB | Isolated (~50 users)<br>
Compute requirements per environment<br>
I assume monday.com’s system complexity, and the development issues that arise when you use 100k+ vCPUs globally, aren’t a surprise to you. You cannot run the whole of monday.com on your laptop, which is the root of the problem. There is simply no laptop on the market that offers 40 CPUs and 140 GB+ of memory to run the system on a local Kubernetes cluster.<br>This is a digression, but who knows – maybe with further Moore’s law improvements, we will be able to in a few years.
Our Existing Approach<br>The industry-standard solution to this scale problem is leveraging the cloud. By tapping into public infrastructure, you gain access to near-infinite compute resources.<br>The typical pattern involves spinning up individual, isolated, and ephemeral environments per developer, allowing them to test changes in a realistic system.
It might be enough for your use case, but we discovered substantial drawbacks and challenges:<br>Driving cost<br>To support just ~50 active users in ephemeral environments, we had to provision 140 dedicated nodes on average. This added a direct infrastructure cost of approximately $450/month per developer, purely to keep these isolated environments running.<br>This comes on top of the staging environment resources we pay for regardless, which also support a larger user base of ~200 users.
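To make the linear scaling concrete, here is a minimal sketch of that arithmetic in Python. It uses only the figures quoted above (~50 users, ~140 nodes, ~$450/month per developer) and assumes the per-developer cost stays flat as headcount grows, which is a simplification. The function name and the larger headcounts are hypothetical, chosen purely for illustration.

```python
# Napkin math for the ephemeral-environment cost described above.
# Inputs come from the post; everything derived from them is an estimate.

ACTIVE_DEVELOPERS = 50        # ~50 active users (from the post)
DEDICATED_NODES = 140         # ~140 dedicated nodes on average (from the post)
COST_PER_DEVELOPER_USD = 450  # ~$450/month per developer (from the post)


def ephemeral_monthly_cost(developers: int,
                           cost_per_developer: float = COST_PER_DEVELOPER_USD) -> float:
    """Monthly cost if every developer gets an isolated environment.

    Assumes a flat per-developer cost, so the total grows linearly with headcount.
    """
    return developers * cost_per_developer


current_total = ephemeral_monthly_cost(ACTIVE_DEVELOPERS)
nodes_per_developer = DEDICATED_NODES / ACTIVE_DEVELOPERS

print(f"Total for {ACTIVE_DEVELOPERS} developers: ${current_total:,.0f}/month")
print(f"Nodes per developer: {nodes_per_developer:.1f}")

# Hypothetical headcounts, only to illustrate the linear growth.
for developers in (50, 100, 200):
    print(f"{developers} developers -> ${ephemeral_monthly_cost(developers):,.0f}/month")
```

Under these assumptions, the ~50 current users already account for roughly $22,500 per month, or about 2.8 nodes each, and every additional developer adds the same amount again.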
In summary, instead of adding a new cost that scales linearly with the number of developers, we can just tap into existing staging resources that we need anyway. In fact, the more developers use Monday Mirror with staging, the more efficient this tradeoff becomes.<br>Cold boot<br>When we talk about tight feedback loops, we want to eliminate all unnecessary time delays. I am old enough to remember the classic XKCD comic that showcases a problem analogous to a cold boot: long compile times. It demonstrates that, as an industry, we have simply learned to accept this as the status quo and live with it.<br>Source: https://xkcd.com/303/
Today, our engineers wait for an environment cold boot to complete at least once a day, and often several times a day, depending on how many different projects they are working on.
End-to-end, this flow can take up to 30 minutes when spinning up the core of our system: the Monolith and its dependencies.<br>We identified this as the main friction point for both the business and developers. Not only does it reduce agility, but the time investment is also substantial: with about 1,000 developers each waiting 30 minutes, you spend 500 hours daily just waiting for the ephemeral development environment! That is about 62 eight-hour workdays, the equivalent of roughly 3 months of full-time development work, lost every single day.<br>Satisfaction loss<br>Developers have neither the time nor the patience to wait this long to see their impact on the system as a whole. This drives satisfaction down, people get annoyed, features are delivered more slowly, and you lose the agility critical for keeping up with the market.<br>Summarizing the business costs<br>All of these factors – the hard and quantifiable resources, the wasted time, and the less quantifiable social problems – have consequences measured in real money.<br>Doing some napkin math, it costs us ~$450/month per developer just for ephemeral infrastructure; when you add the wasted time and frustration, the total cost for monday.com climbs well beyond that figure!<br>One can see that it is clearly not an optimal solution. So, how about we shift our perspective and instead inject the laptop into the cloud?<br>Monday Mirror – Our Solution!
Meet Monday Mirror, which is based on the mirrord tool.
Source: https://metalbear.com/mirrord/docs/overview/introduction/<br>We didn’t choose this approach simply because the tech is interesting; we chose it because it systematically dismantles the barriers I mentioned earlier.
Decreasing costs. By sharing staging resources and intelligently hijacking traffic, we can literally reduce costs by $20k+ per month by simply cutting out the ephemeral...