Semicolon & Sons
Scaling a Monolith to 1M LOC: 113 Pragmatic Lessons
by Jack Kinsella
For context, this is based on a recent freelance contract in a Django/React (TS)/React Native codebase (size: 20-person team). I have 17 years' experience with web development in other frameworks (Rails/Laravel/Express etc.) and know that similar ideas will be applicable there too.
Performance
Page Counts Are a Major (but Surprising) Source of DB Performance Issues at Scale. Giving an accurate "total pages" count (or total "items" count) is a hard problem at scale. This is because the DB needs to count all matching rows, and the queries involved often select a great many related records and apply many filters. This can potentially require millions of rows to be scanned. I don't have a full solution here. What we have done so far is automatically strip all "annotations" from paginators (extra data used to render items but not necessary for page counts). In some cases for very large tables, we use an estimated count paginator, which relies on DB statistics instead of actually counting all rows. Lastly, for endpoints where the page count really matters, we use some systems to cache it. For example, if a fetch returns fewer items than a full page (or, more technically, fewer than our "cache block" of 1,000 items), we instruct the paginator not to run the extra count query, since it can already infer the total number of records from the fact that the block is incomplete. I could go on...
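The incomplete-block inference can be sketched in plain Python. This is a minimal illustration, not the project's paginator: `fetch_block`, `Block`, and the `fetch(offset, limit)` callable (a stand-in for a LIMIT/OFFSET query) are hypothetical names, and only the 1,000-item block size comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

BLOCK_SIZE = 1000  # the "cache block" size mentioned in the text


@dataclass
class Block:
    items: Sequence
    total: Optional[int]  # None means "unknown, a COUNT query would be needed"


def fetch_block(
    fetch: Callable[[int, int], Sequence],
    offset: int,
    block_size: int = BLOCK_SIZE,
) -> Block:
    """Fetch one block and infer the total when the block is incomplete.

    `fetch(offset, limit)` is a hypothetical stand-in for a LIMIT/OFFSET query.
    """
    items = fetch(offset, block_size)
    # An incomplete block means we hit the end of the result set, so the
    # total is offset + len(items). An empty block at offset > 0 only tells
    # us the total is at most `offset`, so it is left as unknown.
    if len(items) < block_size and (len(items) > 0 or offset == 0):
        return Block(items, offset + len(items))
    return Block(items, None)
```

With 2,500 matching rows, the block at offset 0 is full, so its total stays unknown, while the block at offset 2,000 returns 500 rows and the total of 2,500 follows without a count query.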
Long Cron-Job Reads Can Cripple Your System. A single inefficient, long-running DB query (often in a cron job for reporting) can halve the performance of your entire system if they share a DB. Moreover, queries happening outside the web process are often not as effectively tracked in APMs like New Relic, so they can be invisible. The solution here is to treat the performance of these queries as something worth worrying about... or create a read-only follower of your DB for these purposes so as to insulate your "hot" DB serving web traffic.
More generally, long queries OR large payloads are the chief troublemakers for systemic performance issues in Redis/PostgreSQL (etc.), so if in doubt about what to optimize, start here.
RAM Pressure Is Underrated as a Cause for Performance Issues. This cause can be a little insidious because your system will not crash if under moderate RAM pressure; instead, it will swap memory to disk and get much slower.
Mark Each Deploy in Your APM (e.g., New Relic). This is in order to correlate performance data with code changes. Connecting cause and effect speeds up debugging perf regressions.
Offloading Work to Background Jobs Is Very Effective. Good candidates include sending emails, push notifications, and WebSocket messages, as well as processing webhooks from payment providers. You want to keep your web processes running fast.
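As a rough sketch of the pattern, the handler below enqueues the slow work and returns immediately, while a worker drains the queue. It uses only the standard library, and every name (`jobs`, `send_email`, `handle_signup`) is hypothetical. In production this role is usually played by a dedicated job queue with separate worker processes rather than an in-process thread.

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
sent = []  # stands in for the real side effect (e.g., a call to an email provider)


def send_email(to: str, subject: str) -> None:
    """Hypothetical slow operation we do not want inside the request cycle."""
    sent.append((to, subject))


def worker() -> None:
    """Drain the queue forever, running each job outside the web request."""
    while True:
        func, args = jobs.get()
        try:
            func(*args)
        except Exception as exc:  # keep the worker alive if one job fails
            print(f"job failed: {exc!r}")
        finally:
            jobs.task_done()


def handle_signup(email: str) -> str:
    """Hypothetical web handler: enqueue the email and respond right away."""
    jobs.put((send_email, (email, "Welcome!")))
    return "201 Created"


threading.Thread(target=worker, daemon=True).start()
```

The response time of `handle_signup` no longer depends on how long the email provider takes to answer.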
If Data Is a Tree Structure (e.g., Categories), Use Power Tools for Data Modeling and Querying. Our category hierarchy was highly inefficient when using standard ORM modeling. Theoretically, there are systems for doing this well in SQL. However, in our case, it was easier to move all operations (other than grabbing all categories on boot) off the database completely and cache them as a tree in local RAM. This saves hundreds of millions of SQL self-joins for operations such as "ancestors" or "descendants". It also goes to show: know your data structures.
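A minimal sketch of such an in-memory tree, assuming the categories arrive from a single boot-time query as `(id, parent_id)` pairs. The `CategoryTree` class and its method names are illustrative, not the project's actual code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


class CategoryTree:
    """In-memory category hierarchy built once from (id, parent_id) rows."""

    def __init__(self, rows: Iterable[Tuple[int, Optional[int]]]) -> None:
        self.parent: Dict[int, Optional[int]] = {}
        self.children: Dict[int, List[int]] = defaultdict(list)
        for cat_id, parent_id in rows:
            self.parent[cat_id] = parent_id
            if parent_id is not None:
                self.children[parent_id].append(cat_id)

    def ancestors(self, cat_id: int) -> List[int]:
        """Walk parent pointers upward: nearest ancestor first, root last."""
        result = []
        current = self.parent[cat_id]
        while current is not None:
            result.append(current)
            current = self.parent[current]
        return result

    def descendants(self, cat_id: int) -> List[int]:
        """Collect every node below cat_id with an iterative traversal."""
        result = []
        stack = list(self.children.get(cat_id, []))
        while stack:
            node = stack.pop()
            result.append(node)
            stack.extend(self.children.get(node, []))
        return result
```

Both lookups are plain dictionary walks, so they cost no database round trips once the tree is loaded.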
Set Timeouts at Various Levels to Prevent Resource Starvation. As a sort of general immune system, set system limits, such as requiring that any web request must take under 25s; otherwise, it will be terminated. While this can create complexity for genuinely long-running requests and require the use of background jobs, it's worth it because of the protection it provides by preventing situations where a performance issue in a single endpoint can cause your entire server to get starved of threads/processes and crash. Another example is when you rely on a third-party API that suddenly experiences performance issues and takes, say, 180s per request. Very quickly, all your parallelism will be used up doing this waiting, and you won't be able to serve "normal" requests and will experience downtime.
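One way to picture a hard limit is the sketch below, which aborts a block of work once it exceeds its time budget. It assumes a Unix system and code running in the main thread (it relies on `SIGALRM`), and `time_limit` and `RequestTimeout` are hypothetical names. In practice the limit is often enforced by the application server or load balancer instead.

```python
import signal
from contextlib import contextmanager


class RequestTimeout(Exception):
    """Raised when the wrapped work exceeds its time budget."""


@contextmanager
def time_limit(seconds: float):
    """Abort the enclosed block after `seconds` (Unix, main thread only)."""

    def _on_alarm(signum, frame):
        raise RequestTimeout(f"exceeded {seconds}s budget")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

Wrapping a request handler in `with time_limit(25):` would turn a hung call into a `RequestTimeout` that can be logged and returned as an error, instead of a worker that stays blocked.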
(Gracefully) Kill Worker Processes Regularly as Insurance Against Memory Leaks. Each of our servers has about 16 worker processes. We restart these worker processes once they reach a certain number of requests (e.g., 10,000 requests). Obviously, there should be jitter to ensure that the process is killed at a random time within some range of requests; you don't want all workers to suddenly go offline at the same time since this would effectively cause downtime.
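The jitter idea can be sketched as follows. Only the 10,000-request figure comes from the text; the jitter range of 1,000 and the `Worker` class are assumptions for illustration. Some application servers offer this as configuration (Gunicorn, for instance, has `max_requests` and `max_requests_jitter` settings), in which case no custom code is needed.

```python
import random

MAX_REQUESTS = 10_000  # base threshold from the text
JITTER = 1_000  # assumed range; each worker adds a random 0..JITTER on top


def restart_threshold(max_requests=MAX_REQUESTS, jitter=JITTER, rng=random) -> int:
    """Pick a per-worker request limit so workers do not all restart together."""
    return max_requests + rng.randint(0, jitter)


class Worker:
    """Hypothetical worker that counts requests and signals when to recycle."""

    def __init__(self, rng=random) -> None:
        self.handled = 0
        self.limit = restart_threshold(rng=rng)

    def handle_request(self) -> bool:
        """Return True once the worker should finish up and exit gracefully."""
        self.handled += 1
        return self.handled >= self.limit
```

Because each worker draws its own limit, 16 workers started at the same moment reach their restart points at different request counts.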
Common Sources of FE Performance Issues (React for Us, but...