Abusive Bots/Scrapers and How to Block Them
There are several ways to block abusive bots and scrapers. If you run WordPress, there are many plugins you can use.
Robots.txt: First, try to block the bots using your robots.txt file. Some bots respect robots.txt directives and some do not, but it’s worth a try.
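For example, a minimal robots.txt that asks a couple of the crawlers named later in this post to stay away could look like the sketch below (the bot names are just taken from the list further down; compliant crawlers will honor it, abusive ones usually ignore it):
# Ask specific crawlers, identified by their advertised user agent, not to crawl the site
User-agent: SEOkicks
Disallow: /

User-agent: serpstatbot
Disallow: /

# All other crawlers may access everything
User-agent: *
Disallow: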
cPanel method: You can block a single IP or a whole range of IPs using cPanel’s IP blocking option. It automatically adds the corresponding IP block code to your .htaccess, so there is nothing else to do. This is not the most practical approach, though, since bots can operate from many IPs all over the world, which also change from time to time.
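The generated rules are roughly equivalent to adding old-style deny directives yourself, as in the sketch below (the IPs are documentation examples, and the exact comment markers cPanel writes may differ):
# Example of the kind of IP block written to .htaccess
deny from 203.0.113.45
deny from 198.51.100.0/24
On newer Apache 2.4 setups without mod_access_compat, the same effect is achieved with "Require not ip" directives inside a RequireAll block.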
.htaccess method: Filter the bots using mod_rewrite and the HTTP_USER_AGENT variable. This is not always reliable, since bots can change or fake their user agent string, but it works in most cases.
Here is an example of what to add (and modify as needed) at the top of your .htaccess:
# Abuse User Agents Blocking
# Blocking user agents stops traffic from the named bots below
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(zgrab|evc-batch|webceo|SEOkicks|python|Crawler|dataprovider|admantx|WebDataStats|Apache-HttpClient|netcraft|spider|Clickagy|ZoominfoBot|Barkrowler|babbar).*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(Yisou|package|censys|projectdiscovery|scrapy|Java|GuzzleHttp|colly|seostar|TTD-Content|Hatena|GumGum-Bot|serpstatbot|aiHitBot|SurdotlyBot|Adsbot|linkfluence|netsystemsresearch).*$ [NC]
RewriteRule .* - [F,L]
</IfModule>
# Abuse bot blocking rule end
The above is just an example, not a full list of bot names by any means, but it is a good start :)
There is also a much better way to block all those bots (and more than just bots), but you have to be technically inclined. If so, check out the CIDRAM script.
What is CIDRAM?
CIDRAM (Classless Inter-Domain Routing Access Manager) is a PHP script designed to protect websites by blocking requests originating from IP addresses regarded as being sources of undesirable traffic, including (but not limited to) traffic from non-human access endpoints, cloud services, spambots, scrapers, etc.
With CIDRAM you can create any kind of custom rule: block by user agent, by hostname, by words in the URL with wildcards (for example “*.env”), or even by country.
CIDRAM also offers several modules to check visitors against AbuseIPDB and other popular blacklists.
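The exact syntax for CIDRAM’s custom rules is best taken from its own documentation, so no CIDRAM snippet is shown here. For comparison, the “*.env” wildcard idea mentioned above can be approximated in plain .htaccess with a rule like this (a rough sketch that refuses any request whose URL contains “.env”):
# Block probes for environment files such as ".env"
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} \.env [NC]
RewriteRule .* - [F,L]
</IfModule>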