A Case Study in the Limits of Takeoff Automation, Andrew Ngo
While consulting for a steel frames manufacturer, I found that one problem kept coming up: the expensive takeoff workflow. Takeoff is the process of measuring areas, lengths and counts of materials off a set of construction drawings so they can be priced. This paper documents my attempt to automate takeoff for my client as part of a broader quote automation project, targeting an accuracy rate of >80% against the human takeoff. This target was selected as the minimum proof that the extractor could potentially be commercialised.
The central finding turned out to be a "catch-22". The only input that reliably lets the model reproduce a full takeoff is a document called the takeoff breakdown, which contains the human's takeoff working and only exists once the work the model was supposed to automate has already been done. Even when handed everything but the final numbers, the highest accuracy rate recorded was 77.5% ignoring labels and 68% with exact-label scoring (an item whose label is incorrectly assigned scores zero). Both rates are well above the unaided baseline (extracting just from the drawings), but still fall short of the target. However, the accuracy rate is not the important finding. The important finding is that relevant information lives outside the drawings and therefore cannot be extracted from them, creating a structural ceiling that blocks the model from achieving a commercially viable accuracy rate.
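The difference between the two scoring modes can be illustrated with a minimal Python sketch. The paper does not give the scoring formula, so everything here is an assumption made for illustration: the `LineItem` type, the 5% relative tolerance, the made-up labels and values, and the rule that label-agnostic scoring matches each human value to any unused extracted value within tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    label: str    # hypothetical assembly label
    value: float  # measured quantity (length, area or count)


def within_tolerance(extracted: float, human: float, rel_tol: float = 0.05) -> bool:
    """Assumed matching rule: extracted value within rel_tol of the human value."""
    if human == 0:
        return extracted == 0
    return abs(extracted - human) / abs(human) <= rel_tol


def exact_label_accuracy(extracted: list[LineItem], human: list[LineItem]) -> float:
    """A human item scores only if the same label exists with a matching value."""
    if not human:
        return 0.0
    by_label = {item.label: item.value for item in extracted}
    hits = sum(
        1
        for item in human
        if item.label in by_label and within_tolerance(by_label[item.label], item.value)
    )
    return hits / len(human)


def label_agnostic_accuracy(extracted: list[LineItem], human: list[LineItem]) -> float:
    """A human item scores if any not-yet-used extracted value matches, whatever its label."""
    if not human:
        return 0.0
    unused = [item.value for item in extracted]
    hits = 0
    for item in human:
        match = next((v for v in unused if within_tolerance(v, item.value)), None)
        if match is not None:
            unused.remove(match)  # each extracted value can be used once
            hits += 1
    return hits / len(human)


# Made-up data: "Wall B" is measured correctly but labelled "Wall C".
human = [LineItem("Wall A", 10.0), LineItem("Wall B", 20.0),
         LineItem("Roof", 35.0), LineItem("Floor", 50.0)]
extracted = [LineItem("Wall A", 10.2), LineItem("Wall C", 20.0),
             LineItem("Roof", 40.0), LineItem("Floor", 49.0)]

exact = exact_label_accuracy(extracted, human)        # 0.5
agnostic = label_agnostic_accuracy(extracted, human)  # 0.75
```

In this toy data the mislabelled wall is a correct measurement that exact-label scoring counts as zero, which is why the exact-label rate can never exceed the label-agnostic rate under these assumed rules.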
What I built and tested
This project involved building a vision-language model (VLM) extractor structured as a multi-stage pipeline (Claude Opus 4.7 + Sonnet 4.6). A VLM is a model that takes drawings and text instructions as input and replies in text. The extractor was ultimately tested on the drawings of five real multi-townhouse builds, from five different customers of my client, to extract measurements and labels. The final test compared the accuracy rates of raw extraction against extraction that was also given a "convention spec". The convention spec is a reference document that compiles the recurring habits seen in each customer's past jobs, such as scoping decisions, customer-specific labelling conventions and decomposition patterns.
Across the four scoreable builds in the final convention spec test, accuracy ranged from ~40% to ~73%, with the convention spec giving at most a small and inconsistent lift in accuracy. The spread was driven by differences between builds, not run-to-run noise. The fifth build in the testing set returned no scoreable canonical assemblies (a canonical assembly is a standard group of related line items) and was excluded from this range as an anomaly. Because the convention spec was built partly from the customers' own past jobs, it had an unfair advantage on the very builds it was tested against. A properly built, unbiased spec would score lower in absolute terms and would lift accuracy even less, which is why I treated the spec's effect as an upper bound and not a clean measurement of how much it helps. These figures are also case study illustrations from a small sample and should not be assumed to represent the wider population.
The final accuracy rate is not the key takeaway. The key takeaway is that takeoffs similar to the ones my client does, where customer conventions are non-standardised and drawings do not contain all required information, cannot currently be automated as a commercial product. This finding does not claim that other takeoffs cannot be automated; the limitation documented in this paper is specific to takeoffs with non-standardised conventions and incomplete drawings.
The breakdown test
One of the most revealing tests, conducted separately from the final test, involved two builds from the final testing set. This test used a document called the takeoff breakdown, which is a typically disorganised record of the human's takeoff work. The testing batch for this breakdown test was limited to two builds due to funding limitations. The key features of a takeoff breakdown are:
1. Decomposition pattern: how the customer chooses to split line items per assembly
2. Labelling of each assembly
3. Multi-segment math: how each value of the assembly was calculated
4. In-scope identification: which regions of the architectural drawings are in the scope of the specific build
5. Final measurement numbers extracted by the human
These features of a takeoff breakdown are important to note because the extractor cannot fully recover them from the drawings alone.
Note the definitions for the three different test variants before continuing:
Baseline: extracts from the drawings alone with no breakdown
Full: extracts from the drawings and the breakdown (breakdown features 1-5)
Stripped: extracts from the drawings and the breakdown with final measurements redacted (breakdown features 1-4)
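The three variants differ only in which breakdown features the extractor is given alongside the drawings, which can be sketched in Python as follows. This is an illustrative sketch and not the pipeline used in the study: the `Breakdown` fields, the `Variant` enum, the `breakdown_for` helper and the example values are all hypothetical names and data introduced here.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Variant(Enum):
    BASELINE = "baseline"  # drawings only
    FULL = "full"          # drawings + breakdown features 1-5
    STRIPPED = "stripped"  # drawings + breakdown features 1-4


@dataclass(frozen=True)
class Breakdown:
    decomposition: str                               # feature 1
    labels: tuple[str, ...]                          # feature 2
    segment_math: str                                # feature 3
    in_scope_regions: str                            # feature 4
    final_measurements: Optional[dict[str, float]]   # feature 5


def breakdown_for(variant: Variant, breakdown: Breakdown) -> Optional[Breakdown]:
    """Return the breakdown content a given variant is allowed to see."""
    if variant is Variant.BASELINE:
        return None  # no breakdown at all
    if variant is Variant.STRIPPED:
        # Keep features 1-4, redact the human's final numbers.
        return replace(breakdown, final_measurements=None)
    return breakdown  # FULL: everything, including final numbers


# Made-up example breakdown for illustration only.
example = Breakdown(
    decomposition="one line item per wall type",
    labels=("External wall", "Internal wall"),
    segment_math="sum of wall segment lengths per level",
    in_scope_regions="units 1-4, ground and first floor",
    final_measurements={"External wall": 120.5, "Internal wall": 86.0},
)

stripped = breakdown_for(Variant.STRIPPED, example)
```

Modelling the stripped variant as a redaction of the full breakdown makes the comparison explicit: any accuracy gap between the full and stripped variants can only come from the final numbers, and any gap between stripped and baseline can only come from features 1-4.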
On build #1, the extractor recorded a 'baseline' accuracy rate of ~53%; an unaided...