Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth

May 15, 2026

Bot traffic on the internet has always been comparatively high, with search engine scrapers and whatever odd research tools doing their thing, but that changed massively in the last few years with LLM-training-content-scraping bots.

Those don't care about robots.txt and are designed to be unblockable, running on regular users' machines, faking browser user-agents and doing their thing from giant, massively-distributed IP pools by now (they used to have somewhat blockable net ranges, but not anymore).

This isn't a big deal for this blog for example, as there are only a few static pages, which these bots grab and go away reasonably quickly, but for local git repos this isn't the case - the list of commits and various links there is effectively infinite, and these idiot bots want EVERY-THING!

So what ends up happening is a never-ending stream of requests like this:

14.243.82.173 - - [15/May/2026:00:50:01 +0500] "GET /code/git/.../pa-mixer.example.cfg?id=9537b82b... HTTP/1.1" 200 3516 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Safari/537.36" ...

A specific bot pool typically does this for hours or like a day at a time, sometimes at rates well above 100 requests per second, which is a kinda nuts load, normally only seen by high-traffic commercial sites, not a random self-hosted git repo.

Or it's more like a distributed denial of service (DDoS) attack, where no value is gained from it (incl. by whatever crap "AI" idiocy is behind it), but resources are knocked off the internet one way or another. Presumably with the side-benefit that the only way to look something up will then be going to those big-tech scrapers, for a crappy simulacrum of the stuff they destroyed in the process of extraction, so dunno if the DDoS aspect of it will go away anytime soon.

Despite cgit that I use being super-efficient and handling the load, by May 2026 these scrapers even rack up a couple gigabytes of access.log files per day (!!!), so it got annoying enough for me at this point.

There are a few common ways of dealing with this crap:

Install the Anubis tool, which runs a CPU-intensive challenge on clients via JS.

Set/check a cookie for all clients, e.g. via testcookie-nginx-module on HTTP level.

Detect bots somehow and serve them some "poisoned" content.

CAPTCHAs from Google/CloudFlare, or generated locally in some fashion.

Shut down public access, requiring some auth, and rate-limit accounts/users (e.g. github).

I don't like any JS mechanisms, as I tend to disable/limit JS myself, and Anubis in particular looks a bit too high-maintenance for me, being like a whole complicated anti-spam system (and those rarely work well). It's a well-known and very common solution too, so likely locked in an arms race with countermeasures.

Training data "poisoning" is a variation on this, with some extra work towards making such scraping less lucrative and "fighting back" in a way, with its own spam-vs-ham arms race on top of bot-detection. Even more high-maintenance.

Simple cookies - if they still work - likely have their days numbered, as these bots seem to be well-coordinated and well-funded, so are probably advanced enough to do cookies or maybe some JS nowadays.

So my thinking is to either shut public git down, which is an easy option, esp. given that I don't really consider myself part of a "FOSS community" of any kind, sharing random projects doesn't benefit me in any meaningful way, and I don't think there's anything of value lost by dropping all that off the internet anyhow.

Or, alternatively, do something trivial and low-maintenance that works. An easy idea I had is something between the remaining CAPTCHA and user authentication options - to put up HTTP-401 Basic Auth (which all browsers and tools like git still support, thankfully), but only during those bot-onslaught hours every few days when it gets annoying, and just put the login/pw right on the HTTP-401 error page, where a human without the right credentials will see it.
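A minimal sketch of what that idea could look like in nginx, not necessarily the exact config used here. The location prefix, realm text, htpasswd path and error-page file name are all made-up placeholders, and the htpasswd file is assumed to hold the same login/pw that the error page spells out for humans:

```nginx
# Hypothetical example - paths, realm text and file names are assumptions.
location /code/git/ {
    # Ask for Basic Auth on everything under the git web UI.
    auth_basic "git access - login/pw are on the error page";
    auth_basic_user_file /etc/nginx/git-public.htpasswd;

    # Serve a custom page as the 401 response body (status code stays 401).
    error_page 401 /401-login-hint.html;

    # ... existing cgit handler settings stay here ...
}

location = /401-login-hint.html {
    # The hint page itself must be readable without credentials.
    auth_basic off;
    root /srv/www/errors;
}
```

Enabling this only during bot-onslaught hours can then be as simple as commenting the auth_basic lines in or out and reloading nginx.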

The first thing I did a while ago was to remove the massive access/error logs that needlessly trash CPU and SSD erase-cycles (and are useless anyway due to sheer size), starting with a tmpfs in /etc/fstab:

tmpfs /run/nginx/temp-logs tmpfs size=50m,nodev,nosuid,uid=nginx,gid=nginx,mode=750

Then put the nginx info-level error_log and unfiltered access_log there, with a separate logrotate-temp-nginx.service using size 5M + rotate 1 + SIGUSR1 to nginx, and running often enough to handle high-spam-tides, which leaves more than enough logs for any kind of observability/debug purposes.
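On the nginx side, pointing the noisy logs at that tmpfs can be sketched roughly as follows. The log file names are assumptions, only the directory comes from the fstab line above:

```nginx
# Main (top-level) context: verbose error log goes to the tmpfs mount.
# File names below are assumed for illustration.
error_log /run/nginx/temp-logs/error.log info;

http {
    # Unfiltered access log, also kept in RAM only.
    access_log /run/nginx/temp-logs/access.log;
}
```

After logrotate moves a file aside, sending SIGUSR1 to the nginx master process makes it reopen its log files, so new entries land in a fresh file.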

This doesn't actually mean that all logs have to go there, as e.g. error_log ... warn doesn't get spammy from bots, and neither do if-filtered access_log ... if=$log_bot_filter ones, with $log_bot_filter set as e.g.:

map "$status $request" $log_bot_filter { "~*301 GET...
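As a separate, hypothetical illustration of how such an if= filter fits together (this is not the actual map from above, and the variable name, pattern and log paths are made up): nginx skips writing an access_log entry when the if= condition evaluates to "0" or an empty string, so the map only needs to produce that for the spammy requests.

```nginx
http {
    # Hypothetical filter: don't persist the flood of 401 responses,
    # log everything else. Real patterns would differ.
    map "$status $request" $log_bot_filter_example {
        default 1;
        "~^401 " 0;
    }

    # Persistent on-disk logs that stay small even during bot tides.
    access_log /var/log/nginx/access.log combined if=$log_bot_filter_example;
    error_log /var/log/nginx/error.log warn;
}
```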
