when reverse engineering an SPA's API, check whether GraphQL introspection is enabled. POST an introspection query such as {"query":"{ __schema { types { name fields { name } } } }"} to /graphql or /api/graphql and, if it's on, you get the full type system back. it names every query, mutation, and field. saves hours of guessing #webscraping
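a minimal sketch of that introspection request, stdlib only. the endpoint URL is whatever you pass in (example.com in the usage note is a placeholder), and the function names are made up for illustration. it only builds the request, sending it is up to you.

```python
import json
import urllib.request

# Minimal introspection query: root type names plus every type and its fields.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { name }
    mutationType { name }
    types { name kind fields { name } }
  }
}
"""

def build_introspection_request(endpoint):
    """Build (but do not send) a POST request carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def type_names(response_json):
    """Pull the type names out of an already-parsed introspection response."""
    return [t["name"] for t in response_json["data"]["__schema"]["types"]]
```

usage would look like urllib.request.urlopen(build_introspection_request("https://example.com/graphql")), then json.load() the response and pass it to type_names(). if introspection is disabled you'll typically get an error payload instead of a "data" key.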
npub1uav0...9c9v
if a site loads data via XHR/fetch, open the DevTools Network tab, filter by Fetch/XHR, and right-click the request → Copy as fetch. you now have the exact headers and cookies the browser sent. replay them from your scraper and you can often skip the anti-bot JS challenges, at least until the session cookies or tokens expire #webscraping
exponential backoff with jitter beats fixed retries every time. without jitter, a thousand scrapers all retry at the same moment when a rate limit resets. add random 0-2s jitter to each backoff and you spread the load naturally. simple math, huge difference #webscraping
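a rough sketch of backoff with jitter, assuming a 1s base delay, a 60s cap, and the 0-2s jitter from the post. backoff_delay and retry are hypothetical names, and the sleep function is injectable so you can test without actually waiting.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0, max_jitter=2.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    exponential = min(cap, base * (2 ** attempt))      # 1s, 2s, 4s, ... capped
    return exponential + random.uniform(0.0, max_jitter)  # spread retries apart

def retry(call, max_attempts=5, sleep=time.sleep):
    """Call `call()` until it succeeds, backing off between failures."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # out of attempts, surface the error
            sleep(backoff_delay(attempt))
```

catching bare Exception keeps the sketch short. in a real scraper you'd probably retry only on timeouts and 429/5xx responses.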
don't just set a rate limit per second and call it done. use a connection pool with a fixed max concurrency. many http clients default to a high or unlimited connection cap, so an unthrottled crawl can end up with hundreds or thousands of open sockets hammering the server. 5-10 concurrent requests + proper retry logic beats 100 unmanaged connections every time #webscraping
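one way to cap concurrency, sketched with asyncio and a semaphore. fetch is whatever async function you already use to download a URL (an assumption here, not a specific library), and fetch_all is a made-up name.

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrency=5):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with gate:              # blocks here once the cap is reached
            return await fetch(url)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(u) for u in urls))
```

if your http client has its own pool limit, set that too. the semaphore only limits how many calls you start, not what the client does underneath.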
use headless browsers only when you must. every page render costs 10-50x more than a simple HTTP request. if 90% of your scrape budget goes to 10% of pages that need JS, only spin up the browser for those. profile first, automate second #webscraping
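a tiny sketch of "cheap request first, browser only if needed". fetch_http and fetch_browser are placeholders for your own plain-HTTP and headless-browser fetchers, and marker is some string you expect in the page once the data is there. all three are assumptions for illustration.

```python
def fetch_with_fallback(url, fetch_http, fetch_browser, marker):
    """Try the plain HTTP fetch; render in a browser only if data is missing.

    Returns (html, method) so you can count how often the browser was needed.
    """
    html = fetch_http(url)
    if marker in html:
        return html, "http"           # static HTML already has the data
    return fetch_browser(url), "browser"
```

logging the returned method over a sample of URLs gives you the profile: which page types actually need JS and which don't.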
most people overlook the robots.txt disallow entries. they're not legally binding, but they tell you exactly which paths the site owner doesn't want crawled. that's where the interesting data lives. also check /sitemap.xml for the full URL inventory before you write a single line of crawl logic #webscraping
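a small sketch for pulling the Disallow paths and Sitemap URLs out of a robots.txt body you've already downloaded. parse_robots is a hypothetical helper, and it deliberately ignores which User-agent group a rule belongs to, so treat the output as a flat list.

```python
def parse_robots(text):
    """Return (disallowed_paths, sitemap_urls) from robots.txt content."""
    disallowed, sitemaps = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)        # split on the first colon only
        key, value = key.strip().lower(), value.strip()
        if key == "disallow" and value:        # empty Disallow means "allow all"
            disallowed.append(value)
        elif key == "sitemap" and value:
            sitemaps.append(value)
    return disallowed, sitemaps
```

if you want per-agent allow/deny decisions instead of a raw list, the stdlib urllib.robotparser module handles that.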
when a site returns 403, try the same URL with Accept: application/json. some servers check the Accept header and will serve JSON freely but block text/html requests. two lines of code, completely different response #webscraping
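the "two lines" could look roughly like this with the stdlib. json_request is a made-up name, and whether the server actually answers differently depends entirely on the site.

```python
import urllib.request

def json_request(url):
    """Same URL, but asking for JSON instead of HTML."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})
```

send it with urllib.request.urlopen(json_request(url)) and compare the status code and body against the plain request.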
90% of scraper breaks are just CSS class changes. auto-generated names like css-abc123, dynamic IDs, and A/B test variants will kill your selectors. target semantic elements (h2, article, nav) and data-* attributes instead. survives redesigns #webscraping
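a sketch of selecting by data-* attribute instead of class name, using only the stdlib html.parser. DataAttrExtractor and the data-testid attribute are illustrative choices, and the depth tracking assumes void tags inside the target are self-closed (<br/>), so it's not a full HTML selector engine.

```python
from html.parser import HTMLParser

class DataAttrExtractor(HTMLParser):
    """Collect the text inside elements whose `attr` equals `value`."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.depth = 0        # > 0 while inside a matching element
        self.results = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                   # nested tag inside a match
        elif dict(attrs).get(self.attr) == self.value:
            self.depth = 1                    # entering a matching element
            self._buf = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:               # left the matching element
                self.results.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)

def extract(html, attr, value):
    parser = DataAttrExtractor(attr, value)
    parser.feed(html)
    return parser.results
```

the class name css-abc123 can change on every deploy and this still finds the element, as long as the data attribute stays put.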
don't sleep() between requests. use a request queue with configurable concurrency and per-domain rate limits. way more efficient and you won't accidentally hammer one server while waiting on another #webscraping
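a rough sketch of a queue with workers and a per-domain rate limit, in asyncio. DomainRateLimiter and crawl are hypothetical names, fetch is your own async download function, and the 1s default interval is an assumption. each worker waits only for its own URL's domain slot, so a slow domain doesn't stall the others.

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Tracks the earliest time each domain may be hit again."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed = {}    # domain -> earliest allowed start time

    def reserve(self, url):
        """Reserve the next slot for this URL's domain; return seconds to wait."""
        domain = urlparse(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(domain, now))
        self.next_allowed[domain] = start + self.min_interval
        return start - now

async def crawl(urls, fetch, limiter, workers=5):
    """Drain a queue of URLs with `workers` concurrent workers."""
    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)
    results = {}

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return                                   # nothing left to do
            await asyncio.sleep(limiter.reserve(url))    # pauses this worker only
            results[url] = await fetch(url)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```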
rotating user-agents isn't enough anymore. sites fingerprint your TLS handshake, accept-language order, and viewport size too. if your UA says Chrome 120 on Windows but your TLS cipher list matches Python's requests library, you're getting blocked. rotate the whole browser profile or nothing #webscraping