Since migrating this blog to a Hetzner server, I’ve been keeping an eye on the “Graphs” tab of the server dashboard to see how the server’s capacity matches the traffic I expect it to handle.
One of the things I’ve noticed is that there are regular periods of very high CPU usage, periods where the 4 vCPUs are almost maxed out:

These periods are accompanied by corresponding jumps in network traffic:

I got curious about what might be causing this and, because I suspected a surge in web traffic, started by looking at the 20 most common user-agents in my Apache log files, with:
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20
The result:
577651 Scrapy/2.11.2 (+https://scrapy.org)
39018 Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
23216 Mozilla/5.0 (iPhone; CPU iPhone OS 18_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.5 Mobile/15E148 Safari/604.1
15561 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36
14793 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
14571 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36
12838 Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Mobile Safari/537.36
12306 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
10834 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.5 Safari/605.1.15
9714 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
7104 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
7005 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
6287 Wget/1.21.3
5885 meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
5093 Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:139.0) Gecko/20100101 Firefox/139.0
4798 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.7151.68 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
4593 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
4143 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36
3446 -
3228 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
I asked ChatGPT to normalize and summarize, which gave me:
Lines | Normalized User-Agent | Type | Notes |
---|---|---|---|
577651 | Scrapy | Bot | Likely automated scraping (Scrapy framework) |
39043 | Bytespider | Bot | From ByteDance (TikTok); known aggressive crawler |
23337 | Safari on iPhone | Browser | Human traffic, Apple mobile Safari |
15590 | Chrome on Linux | Browser | Human or automation (generic Linux desktop Chrome) |
14793 | GPTBot | Bot | OpenAI’s web crawler |
14571 | Chrome on Windows | Browser | Human or automation (Windows desktop Chrome) |
12838 | Chrome on Android | Browser | Human traffic, mobile Chrome |
12307 | SemrushBot | Bot | SEO bot from Semrush |
10840 | Safari on macOS | Browser | Human traffic |
9727 | ClaudeBot | Bot | From Anthropic (AI crawler) |
7105 | Amazonbot | Bot | Amazon’s crawler |
7018 | Bingbot | Bot | Microsoft’s search indexer |
6296 | Wget | Tool | Scripted fetch tool; likely automation or scraping |
5891 | Facebook External Agent | Bot | Facebook link preview/crawler bot |
5101 | Firefox on Windows | Browser | Human traffic |
4798 | Googlebot on Android | Bot | Google’s search bot, disguised as Android browser |
4593 | Chrome on Windows | Browser | Redundant with earlier Chrome/Windows |
4143 | Chrome on macOS | Browser | Human or automation, Mac desktop |
3447 | Unknown (“-”) | Unknown | Empty/missing user-agent |
3231 | Firefox on Linux (Ubuntu) | Browser | Human traffic |
It also gave me this summary:
- Total bot/tool traffic: ~695,940 (≈ 85% of top 20 traffic)
- Likely human browser traffic: ~102,296 (≈ 13%)
- Unknown/empty: ~3,447 (≈ 0.4%)
This log has a total of 964,802 lines in it, meaning that whatever “Scrapy” is doing is responsible for 60% of the traffic to my blog.
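Those numbers are easy to double-check from the shell; assuming the same combined log format as above, something like this does it:

# total requests in the log
wc -l < access.log
# requests whose user-agent field begins with "Scrapy"
awk -F'"' '$6 ~ /^Scrapy/' access.log | wc -l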
Ugh.
I followed up by asking ChatGPT to give me a robots.txt file that includes all of the bots, and I’ve added that to this site’s robots.txt (leaving out some friendly user-agents like NetNewsWire).
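I’m not reproducing ChatGPT’s full file here, but the general shape is simple; a sketch covering a few of the bots from the table above (multiple User-agent lines can share one record, and Disallow: / blocks the whole site):

# Block crawlers identified in the access log
User-agent: Scrapy
User-agent: Bytespider
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Amazonbot
User-agent: SemrushBot
Disallow: /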
Because “Scrapy” seems particularly evil, I also blocked it at the Apache level, with:
<IfModule mod_rewrite.c>
RewriteEngine On
# Match any user-agent string beginning with "Scrapy" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^Scrapy [NC]
# Return 403 Forbidden for matching requests
RewriteRule ^.* - [F,L]
</IfModule>
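Note that the ^Scrapy pattern only matches user-agents that begin with that string; the NC flag makes the match case-insensitive, and the F flag sends back a 403 Forbidden. The block can live in the site’s virtual host configuration or in an .htaccess file, as long as mod_rewrite is enabled.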
I tested that this was working with:
curl -I -A "Scrapy/2.11.2 (+https://scrapy.org)" https://ruk.ca
Which properly returned:
HTTP/1.1 403 Forbidden
Date: Tue, 10 Jun 2025 22:31:15 GMT
Server: Apache/2.4.63 (Fedora Linux) OpenSSL/3.2.4
Content-Type: text/html; charset=iso-8859-1
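As a further sanity check, the same request with an ordinary browser user-agent (one of the Safari strings from the log above) should still return a 200 rather than a 403:

curl -I -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.5 Safari/605.1.15" https://ruk.ca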
I’ll wait 24 hours to see what effect all this has on network traffic and CPU.
Comments
I keep hearing that the scrapers ignore robots.txt, so I am eager to see what 24 hours will bring.
Looking at my server logs for hits from the “Scrapy” user-agent, I see it hasn’t requested robots.txt in the last week. And since I started rejecting traffic from Scrapy, my CPU usage and network traffic have plunged back to expected normal levels.
Seems like a win. Well done.