If you are reading this, I let you in.
Don't be evil. (But, of course you will ...)
Some concepts in the "ban everyone" approach to website(s):
Everyone stay out
On my site, hacker/bot traffic was outweighing legitimate traffic by roughly 5000 to 1. So, the main part of my site is "Deny All" by default.
(But, I decided to put this info in the "public" part of my site because I went to a lot of work to collect this information, and the point is to share it. Whether it is useful, and how you use it, is up to you.)
The concept:
- block all traffic unless expressly permitted
- bot-unfriendly code
- sparse code - no CMS bloatware
- relatively simple access request
- daily log review
Blocking all traffic except for "permitted" IPs makes for a much simpler .htaccess than trying to block all the evil-doers. Here are some IPs you might want to grant access to:
#bingbot
Require ip 207.46.13
#googlebot
Require ip 64.21.98.41 66.249.65.1 66.249.65.5 66.249.66
Require ip 66.249.73.132 66.249.73.149 66.249.73.151
#mojeek
Require ip 5.102.173.71
#qwant
Require ip 194.187.168 91.242.162.18
#yandex
Require ip 77.88.5
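For reference, in Apache 2.4 a bare list of "Require ip" lines like the one above is implicitly OR'd together (a RequireAny container), so any visitor matching none of them receives a 403; no separate "deny" directive is needed. A minimal sketch of such a deny-all .htaccess (203.0.113.7 is a placeholder, substitute your own IP):

```apache
# "Deny All" by default: only the IPs below get in.
# Any visitor matching none of these lines receives 403 Forbidden.
#me (placeholder IP; substitute your own)
Require ip 203.0.113.7
#googlebot (partial)
Require ip 66.249.66
#bingbot
Require ip 207.46.13
```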
The search engine "giants" have many more IPs. I was having trouble with Bing and Yandex seeking non-existent gibberish files, so I haven't permitted many of their IPs. You can tell from
your logs whether there are additional IPs you would like to add to your "permit" list. (I do want *some* traffic, so blocking search engine IPs is a bit counter-productive!) Huawei's
"PetalBot" is just ridiculously misbehaved, so not listed at all. Nor are the ones that haven't bothered to show up at my site yet.
We do not need any SEO or "security" goombas, of which there are now many. They seem to have massive crawling power and show up constantly.
No site is "unhackable." There are sites that have been hacked and sites that will be hacked. But, there is no reason to provide an open invitation via bulky CMS code. There is no reason to allow
known pests (TOR, VPNs, cloud servers) access. If we limit bot and hacker access, it will take longer for the persistent hacker to find a doorway in.
You will note that my "semi-anonymous" approach would grant access to a hacker who uses a VPN and a throw-away email address. This would also mean other VPN users on the same system would
find a doorway via the now "open" IP. Hopefully, by monitoring the logs, I will catch the perps and block them before they do any damage. Probably I need to automate "unpermitting" an IP
as well as permitting it.
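A sketch of what that automation could look like, in Python for illustration (the function names and single-file layout are my own invention, not anything my host provides): read the .htaccess text as a string, then add or drop the relevant "Require ip" line.

```python
# Sketch: automated permit/unpermit by rewriting "Require ip" lines
# in .htaccess text. Assumes one IP per Require line, as in the
# allowlist above; adapt to your own file layout.

def permit(htaccess_text: str, ip: str) -> str:
    """Add a 'Require ip' line for ip if it is not already present."""
    lines = htaccess_text.splitlines()
    if any(line.strip() == f"Require ip {ip}" for line in lines):
        return htaccess_text
    lines.append(f"Require ip {ip}")
    return "\n".join(lines) + "\n"

def unpermit(htaccess_text: str, ip: str) -> str:
    """Drop any 'Require ip' line that grants access to ip."""
    kept = [line for line in htaccess_text.splitlines()
            if line.strip() != f"Require ip {ip}"]
    return "\n".join(kept) + "\n"
```

A cron job could then call unpermit() for any IP that shows up misbehaving in the daily log review.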
Why hostname blocks often don't work.
My understanding is that the Apache server will run a reverse DNS lookup, and if the IP and hostname do not match up, Apache will not block the hostname. Hostnames are routinely spoofed.
Another "trick" can be to make the "hostname" look like it is the IP number. So, if you are not logging both the IP number and the hostname, it will not be apparent this has been done. For instance,
my provider logs either the IP address or the hostname, but not both at the same time. (This is determined by the contents of the .htaccess file.) While it's possible to
create one's own log via PHP files, that is an additional project and additional load on the server.
Also, many VPNs will incorporate the IP address into the hostname in some manner. Google content will reverse the IP address,
as do some other servers. Some will separate the last set of digits from the other three sets of numbers and place it with some letters, such as "sub.267". Apparently, these hosts want to be
able to identify which of their customers are causing problems (based on logs) but don't want to make it too easy for webmasters to simply filter them out.
No More Favicon
Who needs it, right? Waste of bandwidth -- and also, hackers will check for the favicon to see if they are banned.
Add this to your <head> section: <link rel="icon" href="data:,">
Send 'em Back Where They Came From
I don't know if this is a good idea or not. Code added to the top of 404.php and/or 403.php will redirect the visitor to their own IP. A bot isn't a browser, so it will probably just add the redirect target to the links it is crawling. (But, a lot of bots went away.)
<?php
// Meta-refresh the visitor back to their own IP address.
echo '<meta http-equiv="refresh" content="0; URL=https://' . $_SERVER['REMOTE_ADDR'] . '" />';
?>
(Of course, you will have to add:
ErrorDocument 403 /403.php
ErrorDocument 404 /404.php
to your .htaccess file.)
Get rid of robots.txt
For the most part, bots do not honor robots.txt directives. It's a pointless file, so get rid of it. Hacker bots check to see if it exists to verify a site's existence.
The bots will then get a 404, and if you have the 404 page kicking them back to their own IP, so much the better. (Well, just another effort that failed.)
Even good bots are bad ...
- Since bots generally speaking don't honor robots.txt, they are also often misbehaved. I have a serious problem with bingbot (bing search) and petalbot (huawei search) constantly looking for non-existent
files with gibberish names. No amount of "403" or "404" will get these bots to stop doing this. I assume some malevolent person is seeding the search engines with the gibberish file names.
Sometimes yandex joins in, but not too much; and google has occasionally made a pass at looking for gibberish files.
Yandex will launch into checking robots.txt repeatedly, even though there is no robots.txt and they receive 403 and 404 responses.
So, that's the "good guys". Then there are innumerable SEO, marketing and "security" bots that refuse to depart, despite being banned. User-Agents has a
lot of data on "user agents" and the bots that use them. Interestingly, I frequently see hack attempts from IPs in the same block(s) as these purportedly "useful" bots.
Here are some of my frequent visitors who cannot be discouraged by any technique, and are essentially internet terrorists:
- Ahrefs. Usually from the "54.36.148" block. Claims to be an "SEO Tool Set." They are tools, alright. Typical log entry:
"51.222.253.8 /robots.txt 10/11/21, 9:41 AM 423 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
- ALittle Client. Listed as "Jesse Lugo, Jr." and "Amazon-groupon-com" (a non-existent website) in abuseipdb. Apparently a hacker client; I have been seeing a few of these lately.
"23.228.109.147 /wp-includes/css/buttons.css 10/12/21, 8:29 AM 425 error 403 GET HTTP/1.1 ALittle Client"
- Bingbot. "52.162.161.148 /MGVweG94NzcwYmVweG9nVjE3MDk5bg==
10/12/21, 7:03 AM 425 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" Like Petalbot, bingbot has been harassing me for months with thousands
of requests for non-existent gibberish files on a non-existent website. When I filled out their webmaster form asking them to stop, they told me my website was "too important" and could
not be removed from the bingbot search. Abuseipdb has thousands of claims that this ip "masquerades" as bingbot. I tried to verify on Bing's "verify bingbot" whether this is an actual
bingbot, but the verify tool is broken. Assuming it is not the "real" bingbot, it nevertheless runs on Microsoft servers, per whois.com. I filed an abuse complaint with Microsoft and
they ignored it. The official bingbot crawler, though, does seem to come from its published block (e.g., "207.46.13.109").
- BLEXBot (see Webmeup, below).
- CensysInspect. Like Global INternet Observatory (below), CensysInspect is pervasive. It scraped my doti subdomain minutes after I created it, despite the fact there was no link
(that I am aware of, I didn't create one) to the subdomain that could be followed to it. Like GINO, Abuseipdb has it whitelisted, despite daily abuse reports that as of this writing
exceeded a total of 1900 reports. It's a mystery to me why an "internet security company" is sending bots out to crawl sites it has absolutely nothing to do with. It cannot be any sort of
"perceived threat", since it was wandering around my new subdomain while I was creating it, and as it is entirely HTML, there are no active scripts of any sort. They seem to have
shown up right after the HTTPS validator visited my new subdomain, so perhaps they get info directly from Comodo / Sectigo DCV. Their "about" page indicates that "CERTs and security researchers use it to discover new threats and assess their global
impact." They claim "We regularly probe every public IP address and popular domain names, curate and enrich the resulting data, and make it intelligible through an interactive search engine and API." And,
"We make a small number of harmless connection attempts to every IPv4 address worldwide each day. When we discover that a computer or device is configured to accept connections, we follow up by completing protocol
handshakes to learn more about the running services." The way I look at it, one person's "good bot" is another person's "bad bot." "Censys scans the Internet from the 192.35.168.0/23, 162.142.125.0/24, 167.248.133.0/24,
167.94.138.0/24, 167.94.145.0/24, and 167.94.146.0/24 subnets, which you can allowlist or blocklist if you wish."
"162.142.125.194 10/15/21, 12:05 PM / success 200 GET 623 Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)"
- DataforSEO. "168.119.140.254 - - [10/Oct/2021:00:54:33 -0700] "GET /linux/cypress.html HTTP/1.1" 403 1371 "-"
"Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" Also used 144.76.107.244 on same visit. Runs on
Hetzner, Germany servers.
- Global INternet Observatory. A special case, as it runs numerous bots. It is "whitelisted" by Abuseipdb despite over 3000 reports,
and on my site its IP of 138.246.253.24 gets "200" responses with unidentifiable file size even though it is blocked and should be getting 403 responses. Sometimes the bots will
self-identify and other times not. They claim to be non-intrusive and to follow internet "best practices", but that does not seem to be the case in my experience. It is also at a
university in Munich, and perhaps some of the "bad bots" are run by IT students without proper supervision. Or, it's all a scam, just like everyone else's bots. Host name is often (but not
always) a variant of "planetlab24.gino-research.net.in.tum.de". "138.246.253.24 - - [14/Oct/2021:13:07:47 -0700] "GET /robots.txt HTTP/1.1" 200 26 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36"
- Expanse. Also, "Palo Alto Networks" and "Cortex Xpanse". This company thinks it's the webmaster's duty to contact them and ask them to go away. Would you give these dogs your IP address for their database? Typical log entry:
"8.162.77.34.bc.googleusercontent.com - - [16/Jul/2021:11:54:00 -0700] "GET / HTTP/1.1" 404 - "-" "Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple
times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com"
- hn.kd.ny.adsl. This bot is from China and the same hostname is used for a number of different IP addresses. "hn.kd.ny.adsl - - [30/Sep/2021:17:24:36 -0700] "GET / HTTP/1.1" 403 - "http://fastbk.com/" "-"
- hrankbot. "40.86.96.40 / 10/12/21, 9:53 AM 422 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; hrankbot/1.0; +https://www.hrank.com/bot)"
- Knowledge AI. As you can see, no link provided in log. "66.160.140.183 - - [08/Oct/2021:11:50:31 -0700] "GET /robots.txt HTTP/1.1" 403 - "-" "The Knowledge AI"
- Lightspeed and/or LightEdge Solutions / LightspeedSystemsCrawler. Claims to be an internet security company; what is it doing on my
website? "207.200.8.180 - onr.com [note: can't be reached ...] - - [27/Sep/2021:11:11:27 -0700] "GET /content/linux/cypress.html HTTP/1.1" 302 212 "https://daltrey.org" "LightspeedSystemsCrawler Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
- MJ12bot. Majestic, SEO, "see who links to your site." "167.114.211.237
/linux/cypress%E2%80%A6 10/12/21, 9:25 AM 1416 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
- NetSystemsResearch. Claims to be an internet security company. But, I have had cgi and autodiscover attacks from their block of IPs,
"92.118.160.0 - 92.118.161.255". No amount of 403 discouragement will make them go away. Typical log entry: "92.118.160.45 / 10/11/21, 9:50 AM 424 error 403 GET HTTP/1.1 NetSystemsResearch
studies the availability of various services across the internet. Our website is netsystemsresearch.com". NetSystemsResearch has been reported over 6,000 times to AbuseIPDB for malevolent actions.
- Orbot. "ec2-184-72-4-89.us-west-1.compute.amazonaws.com - - [30/Sep/2021:17:43:36 -0700] "GET /robots.txt HTTP/1.1" 403 - "-" "Mozilla/5.0 (compatible; Orbbot/1.1;)"
- Petalbot. "114.119.130.2 /M3p1MTAwMVV6OG0zNDMzcw== 10/12/21, 9:22 AM 424 error 403 GET HTTP/1.1 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko)
Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)". This is just one of thousands of gibberish attempts by petalbot. Uses a lot of the "114.119" block.
- Serpstatbot. Note how jerks like this make up their own rules: "Failures to retrieve a robots.txt file, for example, 403 Forbidden, are considered as the absence of any prohibitions. In this case, the bot will
crawl all physically accessible pages." Dontcha think if a webmaster bans you, it means go away??? Typical log entry: "5.9.110.227
/ 10/11/21, 4:33 PM 422 error 403 GET HTTP/1.1 serpstatbot/1.0 (advanced backlink tracking bot; curl/7.58.0; http://serpstatbot.com/; abuse@serpstatbot.com)." In this instance, the crawler made
its first request identified as "Go-http-client/1.1"; the log entry above was the second request; and then came a request identified as an iPhone, one identified as Mozilla 5.0 and one identified as a Linux / Android 10. They also
identify their browser (sometimes) as "Safari/604.1 Edg/85.0.4183.102". Although none of these entries made an identifiable "hack" request (e.g., "wp-login.php"), I have had numerous hacking attempts from
bots identifying as "Go-http-client/1.1" and "Safari/604.1 Edg/85.0". Such as this: "bucarest02.tor-exit.artikel10.org - - [24/Sep/2021:03:10:10 -0700] "GET /.git/config HTTP/1.1" 500 - "-" "Go-http-client/1.1"
" or this "tor4e1.digitale-gesellschaft.ch - - [24/Sep/2021:03:10:39 -0700] "GET /.git/config HTTP/1.1" 500 - "-" "Go-http-client/1.1".
- SemrushBot. "185.191.171.23
/b2evo1/blog2.php?disp=msgform&recipient_id=1&redirect_to=http%3A%2F%2Fdaltrey.org%2Fb2evo1%2Fblog2.php%2Fand-the-meltdown-continues%3Fdisp%3Dsingle%26title%3Dand-the-meltdown-continues%26more%3D1%26c%3D1%26tb%3D1%26pb%3D1
10/12/21, 2:04 PM 1408 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)" and on same visit also 185.191.171.14.
- Sitelock. Sitelock claims to provide security by checking for malware. It's presumably a service you pay for, so why are they visiting my
site(s) every day? Typical log entry: "184.154.139.20 / th1s_1s_a_4o4.html 10/11/21, 11:56 AM 425 error 403 GET HTTP/1.1 http://www.google.com/url?url=www.daltrey.net&yahoo.com
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/6.0)". Sitelock usually makes three separate requests each visit.
- Sogou. Refuses to go away. "49.7.20.28 / 10/12/21, 5:26 AM
1389 error 403 GET HTTP/1.1 Sogou web spider/4.0 (http://www.sogou.com/docs/help/webmasters.htm#07)"
On this occasion, Sogou made 5 attempts, using 3 different IP addresses. Having been rejected with an "identifying" log entry, it tried again with an anonymous log entry: "49.7.20.28 / 10/12/21, 5:26 AM 1389 error 403 GET HTTP/1.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36". Also tried from
111.202.100.82 and 118.184.177.15.
The Sogou site is in Chinese, so good luck to non-Chinese readers figuring out what "/docs/help/webmasters.htm#07" has to say. Which, of course, is a "turnabout is fair play" sort of thing.
But then again, I am not sending spiders to spy on and harass Sogou.
- Webmeup. Typical log: "157.90.181.149
/robots.txt 10/11/21, 9:30 AM 425 error 403 GET HTTP/1.1 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)" Apparently runs on Hetzner, Germany servers, a whole category
of evil unto itself, based on the number of hacking bots originating from Hetzner servers.
- ZoomInfoBot. "240.175.231.35.bc.googleusercontent.com
- - [30/Sep/2021:08:53:18 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "ZoominfoBot (zoominfobot at zoominfo dot com)" "
192.110.75.34.bc.googleusercontent.com - - [30/Sep/2021:10:52:58 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "ZoominfoBot (zoominfobot at zoominfo dot com)"
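For the outfits that do publish their scanning ranges (Censys's subnets are quoted above; NetSystemsResearch's "92.118.160.0 - 92.118.161.255" is a /23), those ranges can at least be turned into a quick lookup for use during log review. A Python sketch using the standard ipaddress module; the list is illustrative, not complete:

```python
import ipaddress

# Illustrative blocklist: the Censys subnets quoted above, plus the
# NetSystemsResearch block (92.118.160.0 - 92.118.161.255 == /23).
BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "192.35.168.0/23", "162.142.125.0/24", "167.248.133.0/24",
    "167.94.138.0/24", "167.94.145.0/24", "167.94.146.0/24",
    "92.118.160.0/23",
)]

def is_blocked(ip: str) -> bool:
    """True if ip falls inside any of the listed scanner subnets."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```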
VPNs and/or Cloud Services
I haven't figured out whether "my potential audience" will be visiting from virtual machines in the cloud. I don't really care if visitors wish to be anonymous,
I just don't want them messing with (or attempting to mess with) my website. If I authorize "your" VPN IP, it's not just you, though, is it ... ?
The point is, most of the hacking
traffic could be blocked at the source if these companies wanted to do so. It's not all that difficult to determine if your user is sending out "GET wp-login" requests or random "POST"
requests via bot.
So, let's list some services that we probably don't want to ever hear from because their customers are constantly hacking away. As far as I can tell, VPN companies frequently set
up their nodes on major cloud services, so the two problem sources are sort of inseparable. If you are running a legit website, you probably are not sending bots out to hack
other people's websites, right?
(At least I'm not ...) So, it won't matter if your website IP address is blocked from accessing
my website.
Many bots will change IP addresses each request, running through a series of colocrossing, ovh, datacamp or other IPs from across the globe to test whether the site is banning by location.
They will also change their user-agent string, just in case the webmaster is blocking some element of the bot's user string.
- Alibaba
- Amazon cloud
- China Unicom/Telecom (although this sort of bans a billion+ people all at once ...)
- Colocrossing
- contabo
- Datacamp
- DigitalOcean
- Dreamhost cloud
- Global INternet Observatory (see above)
- Google cloud
- Hetzner
- Internet Vikings
- Leaseweb
- Microsoft cloud
- Oracle cloud
- Opera (browser) VPN
- OVH
- Performiv
- Servermania
- TOR (by anyone)
Multi-VPN attacks
A common attack technique now makes requests from multiple VPNs simultaneously (or in rapid succession). Example:
Here we have someone who attempted to access my site at the exact same time from "Johannesburg S.A./Canada/Helsinki, Finland (Fibregrid x2), Amsterdam Netherlands/Israel (SC-RAPIDSEEDBOX),
Ontario CA / Los Angeles CA (B2Net Solutions/Servermania) and Buffalo NY/Estonia (Colocrossing)".
107.172.170.169 / 10/13/21, 5:34 AM 1426 error 403 GET HTTP/1.1 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36
144.168.225.96 / 10/13/21, 5:34 AM 1401 error 403 GET HTTP/1.1 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36
185.122.170.19 / 10/13/21, 5:34 AM 1401 error 403 GET HTTP/1.1 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36
196.242.47.79 / 10/13/21, 5:34 AM 1398 error 403 GET HTTP/1.1 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36
196.244.200.193 /403_2.php 10/13/21, 5:34 AM 1060 success 200 GET HTTP/1.1 http://daltrey.org Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36
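This pattern is easy to flag mechanically during a daily log review: group requests by timestamp and user-agent string, and report any group that arrives from several distinct IPs at once. A Python sketch; it assumes the (ip, timestamp, user_agent) fields have already been parsed out of the access log, since log formats vary by host:

```python
from collections import defaultdict

def multi_source_visits(records, min_ips=3):
    """Given (ip, timestamp, user_agent) tuples parsed from a log,
    return {(timestamp, user_agent): sorted IPs} for every
    timestamp/UA pair seen from at least min_ips distinct IPs."""
    groups = defaultdict(set)
    for ip, timestamp, user_agent in records:
        groups[(timestamp, user_agent)].add(ip)
    return {key: sorted(ips) for key, ips in groups.items()
            if len(ips) >= min_ips}
```

Anything this function returns is a good candidate for "unpermitting" (or at least for a closer look at the whois records).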