How to scrape any website and get structured data with a single API call


Spidra Blog · May 21, 2026 · 14 min read

Joel Olawanle

Sometimes you do not just need the HTML. You need the actual data inside it, such as product names, prices, article text, and contact details, formatted cleanly, predictably, and ready to use without writing a custom parser for every site.

The gap between "I have a URL" and "I have clean structured JSON" is a lot wider than most developers expect the first time they try to cross it.

This guide walks through why that gap exists, what the DIY path actually looks like in practice, and how a single API call can take care of all of it.

Why "just pull the data" is harder than it sounds

Scraping a webpage sounds like a solved problem. It is not. Here is what you are actually dealing with.

JavaScript-rendered pages

Many modern sites do not deliver their actual content in the first HTML response. Client-rendered React, Vue, and Angular apps, and Next.js pages that fetch their data in the browser, typically send back a nearly empty shell and then populate all the real content using JavaScript after the page loads.
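To make the "nearly empty shell" idea concrete, here is a minimal sketch that guesses whether a raw HTML response is a client-rendered shell. It uses only the standard library. The function name `looks_like_empty_shell` and the 200-character threshold are assumptions made for illustration, not a rule from any framework, and a heuristic like this will misjudge some pages.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters that sit outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a reader could actually see.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_empty_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: the page ships scripts but almost no readable text.

    The threshold is an arbitrary assumption; tune it for your targets.
    """
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

If a check like this returns True for the body of a plain HTTP response, that is a hint the page needs a real browser to render before there is anything worth parsing.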

Run requests.get(url) on a page like this, and that shell is all you get. Not the product listings. Not the pricing table. Not the article text. To get the actual rendered content, you need a headless browser that runs the JavaScript, waits for all the async calls to finish, and hands you the DOM in its final state.

Anti-bot protections

A large share of websites now actively block automated requests. Cloudflare, DataDome, PerimeterX, and similar systems look at how your requests are structured and either serve you a CAPTCHA, return a 403, or silently give you a broken version of the page.

Getting past these reliably means rotating residential proxies, randomizing browser fingerprints, matching TLS fingerprints, and keeping all of it updated as detection methods improve. None of that is a one-time setup you do and forget.

Data buried in inconsistent HTML

Even when you do get the rendered HTML, pulling structured data out of it is its own problem. CSS selectors and XPath are brittle. A class name change, a layout update, or an A/B test on the target site silently breaks your scraper. Regex on HTML is even worse. You end up writing a lot of defensive code that still fails on edge cases.

Pages that need interaction before you can scrape them

Some data only shows up after the user does something first. Clicking a "Load More" button. Dismissing a cookie banner. Picking a filter. Scrolling down far enough to trigger lazy loading. A basic HTTP fetch does not touch any of this. You need a browser that can interact with the page first and then scrape it.

The DIY approach: headless browsers and custom parsers

The standard path most developers take is combining a headless browser with a parsing library.
Playwright or Puppeteer handles the rendering, and then BeautifulSoup, Cheerio, or custom code handles pulling out the fields you want.

A basic Playwright example for extracting product data looks something like this:

```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup


def scrape_products(url: str) -> list[dict]:
    # Render the page in headless Chromium so JavaScript-injected content exists.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()

    # Parse the rendered HTML and pull each field out with CSS selectors.
    soup = BeautifulSoup(html, "html.parser")
    products = []

    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one(".product-title").get_text(strip=True),
            "price": card.select_one(".product-price").get_text(strip=True),
            "available": "out-of-stock" not in card.get("class", []),
        })

    return products
```

This works fine for a demo or a one-off script. It falls apart in production. Here are the real problems you will run into.

Browser memory usage. Each Chromium instance chews through 200 to 400 MB of RAM. Running 20 concurrent scrapes means 4 to 8 GB just for the browser processes, plus memory leaks that will require you to build restart logic.
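The memory arithmetic above is easy to turn into a quick capacity check. This is a rough sketch that takes the 200 to 400 MB per-instance range from the text as its only input. Both function names are made up for this example, and real usage varies by page, so treat the output as an estimate.

```python
def browser_memory_range_gb(
    concurrent_browsers: int, mb_low: int = 200, mb_high: int = 400
) -> tuple[float, float]:
    """Estimated (low, high) RAM in GB for N concurrent Chromium instances."""
    return (
        concurrent_browsers * mb_low / 1000,
        concurrent_browsers * mb_high / 1000,
    )


def max_concurrent_browsers(ram_budget_mb: int, mb_high: int = 400) -> int:
    """How many instances fit in a RAM budget, planning for the high estimate."""
    return ram_budget_mb // mb_high


print(browser_memory_range_gb(20))    # (4.0, 8.0)
print(max_concurrent_browsers(8000))  # 20
```

Planning against the high end of the range leaves some room for the leaks mentioned above, although it does not replace restart logic.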

Selectors break constantly. That .product-card selector is one site redesign away from returning nothing. You will spend as much time maintaining your selectors as you spend on actual product work.
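One common way to soften this, though not to solve it, is to try several candidate selectors per field and return None instead of crashing when none match. The sketch below assumes a node object with BeautifulSoup-style `select_one()` and `get_text(strip=True)` methods. The helper name `first_text` and the alternative selectors in the usage note are hypothetical.

```python
from typing import Optional, Sequence


def first_text(node, selectors: Sequence[str]) -> Optional[str]:
    """Return the text of the first selector that matches and is non-empty.

    `node` is anything exposing select_one(selector), for example a
    BeautifulSoup Tag. Returns None when every candidate selector misses.
    """
    for selector in selectors:
        match = node.select_one(selector)
        if match is None:
            continue
        text = match.get_text(strip=True)
        if text:
            return text
    return None
```

Inside the loop from the earlier example, `first_text(card, [".product-title", "[itemprop=name]", "h2"])` would replace the direct `select_one(...).get_text(...)` call. You still have to maintain the list of selectors, but a miss becomes a None you can log rather than an AttributeError that kills the run.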

Headless Chrome is detectable. Sites check navigator.webdriver, the absence of GPU fingerprints, timing patterns, and TLS fingerprints. You will need a stealth layer such as playwright-extra with its stealth plugin (a Node.js tool; playwright-stealth is the usual Python counterpart), and those plugins need updates every time detection techniques improve.

You need proxy infrastructure. At any real scale you will hit IP bans. That means signing up with proxy providers, building rotation logic, detecting dead proxies, and dealing with the billing.
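The rotation and dead-proxy detection mentioned above can be sketched in a few lines. This is a minimal in-memory version, assuming round-robin rotation and a simple consecutive-failure counter. The `ProxyPool` class, its method names, and the `max_failures` default are all invented for this example, and a production pool would also need health checks, cooldowns, and per-site ban tracking.

```python
from typing import Iterable, Optional


class ProxyPool:
    """Round-robin proxy rotation that skips proxies with too many failures."""

    def __init__(self, proxies: Iterable[str], max_failures: int = 3) -> None:
        self._failures = {proxy: 0 for proxy in proxies}
        self._order = list(self._failures)
        self._max_failures = max_failures
        self._next_index = 0

    def get(self) -> Optional[str]:
        """Return the next healthy proxy, or None if every proxy is dead."""
        for _ in range(len(self._order)):
            proxy = self._order[self._next_index]
            self._next_index = (self._next_index + 1) % len(self._order)
            if self._failures[proxy] < self._max_failures:
                return proxy
        return None

    def report_failure(self, proxy: str) -> None:
        """Call after a ban, timeout, or connection error through this proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy: str) -> None:
        """A successful request resets the consecutive-failure count."""
        self._failures[proxy] = 0
```

The caller asks for a proxy with `get()`, makes the request, and reports the outcome. A None from `get()` means the whole pool is exhausted, which is the point where you would have to buy more proxies or back off.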

It is slow. Launching a browser, loading a page, waiting for JavaScript to finish, and then parsing the result takes 5 to 15 seconds per URL in normal conditions. That adds up fast when you...
