Posts Tagged ‘Warning’

Block Web server overloading Bad Crawler Bots and Search Engine Spiders with .htaccess rules

Monday, September 18th, 2017


In the last post, I talked about the problem of search index crawler robots aggressively crawling websites and how to stop them (the article is here), explaining how to raise delays between bot URL requests to the website and how to completely prohibit some bots from crawling via robots.txt.

As explained in that article, the consequence of too many badly written or aggressively behaving spiders is "server stoning" and hence degraded Web Server performance, or even a short-lived Denial of Service, depending on how well the initial server scaling was done.

The bots we want to filter should not be confused with the legitimate bots that drive real traffic to your website.

Just for information, the 10 most popular web crawler bots as of the time of writing are:
 

1. GoogleBot (The Google crawler bots; funnily, the bots become less active on Saturdays and Sundays :))

2. BingBot (Bing.com Crawler bots)

3. SlurpBot (also famous as Yahoo! Slurp)

4. DuckDuckBot (The crawler bots of the duckduckgo.com search engine)

5. Baiduspider (The crawler of Baidu, the most famous Chinese search engine, used as a substitute for Google in China)

6. YandexBot (The crawler bots of the Russian Yandex search engine, used in Russia as a substitute for Google)

7. Sogou Spider (Crawler of Sogou, a leading Chinese search engine launched in 2004)

8. Exabot (Crawler for the ExaLead search engine, a French search engine launched in 2000)

9. FaceBot (Facebook External hit; this crawler fetches a certain web page only once a user shares or pastes a link with video, music, blog or whatever in chat to another user)

10. Alexa Crawler (ia_archiver is the web crawler for Amazon's Alexa Internet Rankings; Alexa is a great site to evaluate the approximate popularity of a page on the internet, and the Alexa SiteInfo page has historically been the Swiss Army knife for anyone wanting to quickly evaluate a webpage's approximate ranking compared to other pages)

The above legitimate bots are known to follow most, if not all, W3C (World Wide Web Consortium, w3.org) standards and therefore respect the allowance or restriction commands for a site as given in robots.txt. Unfortunately, many of the so-called bad bots or mirroring scripts that are burning your Web Server CPU and Memory, mentioned in the previous article, follow the /robots.txt prescriptions only partially or not at all.

Hence, for bots that do not respect robots.txt, often the only way to get rid of most of the web spiders that are just eating your bandwidth and server hardware is to filter / block them with Apache rules (mod_setenvif / mod_rewrite) placed in the .htaccess file.

If a .htaccess file does not already exist in the DocumentRoot of your website, create it with whatever text editor, or create it on your Windows / Mac OS desktop and transfer it via FTP / SecureFTP to the server.

I prefer to do it directly on the server with vim (text editor):

 

 

vim /var/www/sites/your-domain.com/.htaccess

 

RewriteEngine On

IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*

SetEnvIfNoCase User-Agent "^Black Hole" bad_bot
SetEnvIfNoCase User-Agent "^Titan" bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck" bad_bot
SetEnvIfNoCase User-Agent "^Crescent" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^TeleportPro" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft" bad_bot
SetEnvIfNoCase User-Agent "^Website Quester" bad_bot
SetEnvIfNoCase User-Agent "^WebZip" bad_bot
SetEnvIfNoCase User-Agent "^moget/2.1" bad_bot
SetEnvIfNoCase User-Agent "^WebZip/4.0" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot
SetEnvIfNoCase User-Agent "^RMA" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP" bad_bot
SetEnvIfNoCase User-Agent "^asterias" bad_bot
SetEnvIfNoCase User-Agent "^httplib" bad_bot
SetEnvIfNoCase User-Agent "^turingos" bad_bot
SetEnvIfNoCase User-Agent "^spanner" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot" bad_bot
SetEnvIfNoCase User-Agent "^Harvest/1.5" bad_bot
SetEnvIfNoCase User-Agent "Bullseye/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4.0 (compatible; BullsEye; Windows 95)" bad_bot
SetEnvIfNoCase User-Agent "^Crescent Internet ToolPak HTTP OLE Control v.1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPickerSE/1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker /1.0" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit/3.50" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control - 5.01.4511" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder" bad_bot
SetEnvIfNoCase User-Agent "^Foobot" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot" bad_bot
SetEnvIfNoCase User-Agent "^BotALot" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial/1.34" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.6" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control - 6.00.8169" bad_bot
SetEnvIfNoCase User-Agent "^URLy Warning" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.5.3" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^cosmos" bad_bot
SetEnvIfNoCase User-Agent "^moget" bad_bot
SetEnvIfNoCase User-Agent "^hloader" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro" bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^Mata Hari" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot" bad_bot
SetEnvIfNoCase User-Agent "^Web Image Collector" bad_bot
SetEnvIfNoCase User-Agent "^The Intraformant" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish/1.0" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc/4.2" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot/2.14" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot" bad_bot
SetEnvIfNoCase User-Agent "^suzuran" bad_bot
SetEnvIfNoCase User-Agent "^VCI WebViewer VCI WebViewer Win32" bad_bot
SetEnvIfNoCase User-Agent "^VCI" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz/1.4" bad_bot
SetEnvIfNoCase User-Agent "^QueryN Metasearch" bad_bot
SetEnvIfNoCase User-Agent "^Openfind data gathere" bad_bot
SetEnvIfNoCase User-Agent "^Openfind" bad_bot
SetEnvIfNoCase User-Agent "^Xenu's Link Sleuth 1.1c" bad_bot
SetEnvIfNoCase User-Agent "^Xenu's" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey Bait & Tackle/v1.01" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 32297 Webster Pro V2.9 Win32" bad_bot
SetEnvIfNoCase User-Agent "^Webster Pro" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan/8.1a Unix" bad_bot
SetEnvIfNoCase User-Agent "^Keyword Density/0.9" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin Spider" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh" bad_bot

 

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
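
Note that the Order / Allow / Deny syntax in the block above is the classic Apache 2.2 one; on Apache 2.4 it keeps working only if mod_access_compat is loaded. A minimal sketch of an Apache 2.4 native equivalent, re-using the same bad_bot environment variable, would be something like:

<RequireAll>
Require all granted
Require not env=bad_bot
</RequireAll>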

 


The above bad-bot prohibition rules include the RewriteEngine On directive; however, on many websites this directive is already enabled directly in the VirtualHost section of the domain(s). If that is your case, you can also remove RewriteEngine On from .htaccess and the bad-bot prohibition rules will still continue to work.
The above rules are also perfectly suitable for WordPress-based websites / blogs in case you need to filter out obstructive spiders, though they will work on any website domain where mod_setenvif / mod_rewrite is enabled (a pure mod_rewrite variant is sketched right below).
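
Strictly speaking, the SetEnvIfNoCase lines rely on mod_setenvif rather than mod_rewrite. If you would rather do the blocking purely with mod_rewrite, a rough sketch of an equivalent rule set looks like the following (the bot names in the condition are only a small illustrative subset of the full list above):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(WebStripper|WebZip|EmailCollector|CheeseBot|SiteSnagger) [NC]
RewriteRule .* - [F,L]

The [F] flag makes Apache answer such requests with 403 Forbidden, i.e. the same result as the Deny from env=bad_bot block.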

Once you have implemented the above rules you will not need to restart Apache, as .htaccess is read dynamically on each client request to the Web server (provided overrides are allowed, see the note below).
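
Keep in mind that Apache consults .htaccess only if overrides are allowed for the directory in question. If the rules seem to be silently ignored, check that the vhost / server configuration contains something along these lines (the path is only an example matching the vim command earlier):

<Directory /var/www/sites/your-domain.com>
AllowOverride All
</Directory>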

2. Testing that the .htaccess Bad Bot Filtering Works as Expected


In order to test that the new bad-bot filtering configuration is working properly, there is a manual and somewhat more complicated way using lynx (a text browser), assuming you have shell access to a Linux / BSD / *NIX computer, or have your own *NIX server / desktop computer running.
 

Here is how:
 

 

lynx -useragent="Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)" -head -dump http://www.your-website-filtering-bad-bots.com/

 

 

Note that lynx will provide a warning such as:

Warning: User-Agent string does not contain "Lynx" or "L_y_n_x"!

Just ignore it and press enter to continue.

Two other use cases with lynx that I have historically used heavily are to pretend that you are GoogleBot or GoogleBot-Mobile, in order to see how Google actually sees your website:
 

  • Pretend with Lynx You're GoogleBot

 

lynx -useragent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 

 

  • How to Pretend with Lynx Browser You are GoogleBot-Mobile

 

lynx -useragent="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 


Or, for the lazy ones who don't have Linux / *NIX at their disposal, you can use the WannaBrowser website.

WannaBrowser is a web-based browser emulator which gives you the ability to change the User-Agent on each website request, so just set your User-Agent to any bot we just filtered, for example set the User-Agent to CheeseBot.

Once the .htaccess rules added earlier detect that your browser client is coming in with a prohibited browser agent, the request will immediately be filtered out and you will be unable to access the website, getting a message like:
 

HTTP/1.1 403 Forbidden
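
If you have curl at hand, the same check can also be done from the command line; a quick example (CheeseBot is just one of the user-agents filtered above, the domain is a placeholder):

curl -I -A "CheeseBot" http://www.your-website-filtering-bad-bots.com/

With the .htaccess rules in effect, the returned headers should start with the same HTTP/1.1 403 Forbidden status.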

 

Since I have talked a lot about index bots, I think it is worth also mentioning three great websites that can give you a lot of up-to-date information on the exact user-agent returned by spiders, commonly known bot traits, as well as a currently updated list of the bad bots, etc.

Bot and Browser Resources with information on user-agents, bad bots and odd crawler and bot specifics:

1. botreports.com
2. user-agents.org
3. useragentapi.com

 

An updated list of robot user-agents (crawler-user-agents) is also available on GitHub here, regularly updated by Caia Almeido.

There are also third-party plugins (modules) available for website platforms like WordPress / Joomla / TYPO3 etc.

Besides those listed on these websites, as well as the known bad and good bots, there are perhaps hundreds of others that might end up crawling your website and that might or might not need to be filtered. Therefore, before proceeding with any filtering steps, it is generally a good idea to monitor your HTTPD access.log / error.log, as mistakenly filtering the wrong bot can be a reason for website indexing problems.
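
A simple way to keep an eye on which user-agents hit the server most is to summarize the access log; for a standard combined-format log the following one-liner should do (adjust the log path to your setup):

awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20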

Hope this article gives you some valuable information. Enjoy! 🙂

 

Fix QMAIL mail server – “warning: dropping connection, unable to SSL accept:protocol error” and why the error occurs

Friday, August 24th, 2012

Every time I have to modify something in a productive QMAIL server install, I end up with some kind of unexplainable problem which creates huge issues for clients. One more time I ended up with errors after minor “innocent” modifications of a QMAIL that had been working for more than a year…

The last changes I made to this combined Thibs / QmailRocks qmail install were in the daemontools qmail start-up files:

/var/qmail/supervise/qmail-smtpd/run
/var/qmail/supervise/qmail-smtpdssl/run

In both files I changed the variable SSL=0 to SSL=1.
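
For example, the change can be applied non-interactively with sed (same run file paths as above, make a backup of the files first just in case):

sed -i 's/^SSL=0/SSL=1/' /var/qmail/supervise/qmail-smtpd/run /var/qmail/supervise/qmail-smtpdssl/run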

After making the change I restarted qmail, tested sending e-mails and all looked well. Therefore I thought everything worked as usual, i.e. e-mails were properly sent and respectively received by the mail server…

So far so good, until just today, when I received an urgent phone call in which my employer reported severe problems with receiving e-mails.
Attempts to send from Gmail or Yahoo to our mail server failed with delivery failure errors…
At first I was a bit sceptical, hoping that maybe the reported errors were not caused by the mail server, but after giving it a try and sending an e-mail to the mail server in question, the reported problem proved true.
As always when there are errors with QMail, I checked what the logs say about the problem, only to find the following error in the /var/log/qmail-smtpd/current log:

@4000000050374a0e1a5dc374 sslserver: warning: dropping connection, unable to SSL accept:protocol error
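
By the way, the @4000000050374a0e1a5dc374 prefix is a TAI64N timestamp written by multilog; assuming the daemontools utilities are installed (as on this setup), the log can be followed with human-readable times like so:

tail -f /var/log/qmail-smtpd/current | tai64nlocal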

I dug through almost all the major forums, threads and blog posts online, but nowhere could I find anything meaningful as an explanation of the error, so I was forced to look for a solution myself.
Obviously there was an error with SSL, so my first thought was to check whether all was fine with the permissions of servercert.pem and clientcert.pem. The permissions of the two files were as follows:

ls -al /var/qmail/control/clientcert.pem
-rw-r----- 1 root qmail 2136 2011-10-10 13:23 /var/qmail/control/clientcert.pem
ls -al /var/qmail/control/servercert.pem
-rw-r--r-- 1 qmaild qmail 2311 2011-10-12 13:21 /var/qmail/control/servercert.pem

At first glimpse I was suspicious of the permissions of /var/qmail/control/clientcert.pem, but after checking other Qmail servers which worked just fine, I was sure the problem was not rooted in clientcert.pem permissions.

As you can guess, another failure point I suspected was the previous day's change of SSL=0 to SSL=1 in /var/qmail/supervise/qmail-smtpd/run and /var/qmail/supervise/qmail-smtpdssl/run. On that account, I immediately reverted yesterday's setting back to SSL=0 and then restarted QMAIL.

The usual way to restart qmail is via qmailctl, but since qmailctl so often does not reload qmail's current settings, I also had to refresh the currently running qmail binaries, both by stopping qmail with qmailctl stop and, through /etc/inittab, by commenting out the line dealing with the daemontools svscanboot:

Hence I first stopped all running qmail processes via init script:

# /usr/bin/qmailctl stop

Then commented line:

SV:123456:respawn:/usr/bin/svscanboot

to:

#SV:123456:respawn:/usr/bin/svscanboot

Then I reloaded inittab with the command:

# /sbin/init q

Right afterwards I uncommented the commented line:

#SV:123456:respawn:/usr/bin/svscanboot

to:

SV:123456:respawn:/usr/bin/svscanboot

And loaded daemontools (svscanboot) back up via inittab by issuing:

# /sbin/init q

Finally I had to start QMAIL processes:

# qmailctl restart
...

The change of SSL=0 to SSL=1 in the daemontools service run script, however, created other problems for clients, because any existing clients which used encrypted connections to the SMTP server via SSL were rendered unable to send mails anymore, with error messages like:

Cannot establish SSL with SMTP server xx.xxx.xxx.xxx:465, SSL_connect error 336031996.

To work around this issue I had to once again enable SSL (set SSL=1) in /var/qmail/supervise/qmail-smtpdssl/run and leave SSL switched off in /var/qmail/supervise/qmail-smtpd/run.
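
To verify from the outside that the SMTPS service completes an SSL handshake again, a quick check with openssl can be used (the host name is a placeholder; 465 is the SMTPS port from the client error above):

openssl s_client -connect mail.your-domain.com:465

If the certificate chain is printed followed by the usual 220 SMTP greeting banner, the SSL side of qmail-smtpdssl is fine; if the handshake aborts, you are back to the protocol error situation.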

Even after these changes, for about 20 minutes, though I restarted QMAIL multiple times, qmail continued having issues with received mails, still throwing the same annoying:

@4000000050374a0e1a5dc374 sslserver: warning: dropping connection, unable to SSL accept:protocol error

After multiple restarts, “magically” the stubborn server figured out it should load my changed setting in qmail-smtpdssl/run (before it finally worked I probably had to restart it 20 times using qmailctl stop; qmailctl start)…

I've found it a good practice to put a delay between the qmailctl stop and qmailctl start commands, so during restarts I used a little 3-second sleep in between, like so:

# qmailctl stop; killall -9 multilog; sleep 3; qmailctl start

Also, killing multilog (killall -9 multilog) is a good practice, because often, despite the restarts, the server logging is not refreshed…

Something else that might be important are the AUTH settings in qmail-smtpd/run and qmail-smtpdssl/run; in this finally working qmail they are:

AUTH=1
REQUIRE_AUTH=0
ALLOW_INSECURE_AUTH=0

Hope this post helps someone solve the same crazy error…
Cheers

Fix of “Unable to allocate memory for pool.” PHP error messages

Saturday, October 15th, 2011

For some time now, I don't know exactly since when, after some updates of my WordPress running on a small server with FreeBSD 7.2, I've started getting a lot of Apache crashes. Often the WordPress scripts stopped working completely and I got only empty pages when trying to open the WordPress blog in a browser.

After a bunch of reading online, I figured out that the cause might be PHP APC (APC stands for Alternative PHP Cache).

I was not sure whether the PHP running on the server had APC configured at all, so I used a phpinfo(); script to figure out whether it was loaded. I saw APC show up in the list of loaded PHP modules, which further led me to the idea that APC could really be causing the unexpected troubles.

Thus I first decided to disable APC on a VirtualHost level for the domain where the crashing WordPress was hosted; to do so I placed the following config directive in the VirtualHost section of the Apache configuration /usr/local/etc/apache2/httpd.conf:

php_flag apc.cache_by_default Off
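
For clarity, a minimal sketch of how the directive sits inside the VirtualHost block looks like this (ServerName and DocumentRoot are placeholders, not the actual config of this server):

<VirtualHost *:80>
ServerName your-domain.com
DocumentRoot /usr/local/www/data-dist/blog
php_flag apc.cache_by_default Off
</VirtualHost>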

This got rid of the multiple errors:

PHP Warning: require_once() [function.require-once]: Unable to allocate memory for pool. in /usr/local/www/data-dist/blog/wp-content/plugins/tweet-old-post/top-admin.php on line 6

which were constantly recurring in php_error.log.

Further, after evaluating all the websites hosted on the server and making sure none of them really depended on APC, I disabled APC completely for PHP. To do so I issued:

echo 'apc.enabled = 0' >> /usr/local/etc/php.ini

Similarly, on GNU/Linux, to globally disable APC in PHP, only the correct location of php.ini needs to be used; on Debian this is /etc/php5/apache2/php.ini.
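
To double check that the new value was really picked up, a quick look from the shell helps (note the CLI may read a different php.ini than mod_php, so phpinfo(); through the web server remains the authoritative check):

php -i | grep -i apc.enabled

After an Apache restart the directive should show up as disabled.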