My blog_title_here · Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth
May 15, 2026
Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth
Bot traffic on the internet was always comparatively high, with search engine<br>scrapers and whatever odd research tools doing their thing, but that changed<br>massively in the last few years with LLM-training-content-scraping bots.
Those don't care about robots.txt and are designed to be unblockable, running on<br>regular users' machines, faking browser user-agents and doing their thing from<br>giant massively-distributed IP pools by now (used to have somewhat blockable net<br>ranges, but not anymore).
This isn't a big deal for this blog for example, as there're only a few static<br>pages, which these bots grab and go away reasonably quickly, but for local git<br>repos this isn't the case - list of commits and various links there is effectively<br>infinite, and these idiot bots want EVERY-THING!
So what ends up happening is a never-ending stream of requests like this:
14.243.82.173 - - [15/May/2026:00:50:01 +0500]<br>"GET /code/git/.../pa-mixer.example.cfg?id=9537b82b... HTTP/1.1"<br>200 3516 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Safari/537.36" ...
With a specific bot pool typically doing this for hours or like a day at a time,<br>sometimes at rates well above 100 requests per second, which is a kinda nuts load,<br>normally only seen by high-traffic commercial sites, not a random self-hosted git repo.
Or it's more like a distributed denial of service (DDoS) attack, where no value<br>is gained from it (incl. by whatever crap "AI" idiocy behind it), but resources<br>are knocked off the internet one way or another, presumably with the side-benefit<br>that the only way to look something up now will be going to those big-tech scrapers,<br>for a crappy simulacrum of the stuff they destroyed in the process of extraction,<br>so dunno if the DDoS aspect of it will go away anytime soon.
Despite cgit that I use being super-efficient and handling the load,<br>by May 2026 these scrapers even rack up a couple gigabytes of access.log<br>files per day (!!!), so it got annoying enough for me at this point.
There are a few common ways of dealing with this crap:
Install the Anubis tool, which runs a cpu-intensive challenge on clients via JS.
Set/check a cookie for all clients, e.g. via testcookie-nginx-module on HTTP level.
Detect bots somehow and serve them some "poisoned" content.
CAPTCHAs from Google/CloudFlare, or generated locally in some fashion.
Shut down public access, requiring some auth, and rate-limit accounts/users (e.g. github).
Don't like any JS mechanisms, as I tend to disable/limit it myself, and Anubis<br>in particular looks a bit too high-maintenance for me, being like a whole complicated<br>anti-spam system (and those rarely work well).<br>It's a well-known and very common solution too, likely locked in an arms race<br>with countermeasures.
Training data "poisoning" is a variation on this, with some extra work towards<br>making such scraping less lucrative and "fighting back" in a way, with its own<br>spam-vs-ham arms race on top of bot-detection. Even more high-maintenance.
Simple cookies - if they still work - likely have their days numbered as these bots<br>seem to be well-coordinated and well-funded, so probably advanced enough to do<br>cookies or maybe some JS nowadays.
So my thinking is to either shut public git down - which is an easy option,<br>esp. given that I don't really consider myself part of "FOSS community" of any kind,<br>sharing random projects doesn't benefit me in any meaningful way, and don't think<br>there's anything of value lost by dropping all that off the internet anyhow.
Or, alternatively, do something trivial and low-maintenance that works, and an easy<br>idea I had is something between the remaining CAPTCHA and user authentication<br>options - to put up an HTTP-401 Basic Auth (which all browsers and tools like git<br>still support thankfully), but only during those bot-onslaught hours every few<br>days when it gets annoying, and just put login/pw right on the http-401 error page<br>where a human without the right credentials would see it.
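A minimal nginx sketch of that idea, under some assumptions - the hostname, htpasswd path, error-page file and its location are all made-up placeholders, and only the /code/git/ prefix comes from the log line above:

```nginx
server {
    listen 80;
    server_name git.example.org;  # hypothetical hostname

    location /code/git/ {
        # Comment these two lines out and reload nginx when the bots go away
        auth_basic "Not a bot? Login/password are on the error page";
        auth_basic_user_file /etc/nginx/git-bots.htpasswd;  # hypothetical path

        # Use a static page with the credentials as the 401 response body
        error_page 401 /bot-check-401.html;

        # ... existing cgit/proxy config goes here ...
    }

    location = /bot-check-401.html {
        internal;              # only reachable via error_page above
        auth_basic off;        # the hint page itself must not require auth
        root /srv/www/errors;  # hypothetical dir holding the html file
    }
}
```

The htpasswd file can be made with e.g. `htpasswd -c` from apache tools, and the html page only needs a line with the same login/password on it. As far as I know, browsers display that 401 body once the login prompt is dismissed, while bots only see yet another error response.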
First thing I did a while ago was to get rid of massive access/error logs that<br>needlessly burn CPU and SSD erase-cycles (and make these logs useless anyway<br>due to sheer size), starting with /etc/fstab:
tmpfs /run/nginx/temp-logs tmpfs size=50m,nodev,nosuid,uid=nginx,gid=nginx,mode=750
Then put nginx info-level error_log and unfiltered access_log there,<br>with separate logrotate-temp-nginx.service having size 5M + rotate 1 +<br>SIGUSR1 to nginx, and running often enough to handle high-spam-tides, to have more<br>than enough logs for any kind of observability/debug purposes.
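On the nginx side this can be as simple as the two lines below - a sketch only, with log file names under that tmpfs mount being my placeholders here:

```nginx
# In http {} context; file names under the tmpfs mount are hypothetical
error_log /run/nginx/temp-logs/error.log info;  # verbose, but RAM-only
access_log /run/nginx/temp-logs/access.log;     # unfiltered, default "combined" format
```

SIGUSR1 from the logrotate job makes nginx reopen these files after rotation, so with size 5M + rotate 1 they should never get anywhere close to filling up that 50m tmpfs.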
This actually doesn't mean that all logs have to go there, as e.g. error_log<br>... warn doesn't get spammy from bots, and neither do access_log ... if=$log_bot_filter<br>if-filtered ones, with $log_bot_filter set as e.g.:
map "$status $request" $log_bot_filter {<br>"~*301 GET...