Posts Tagged ‘text’

Block Web server overloading by Bad Crawler Bots and Search Engine Spiders with .htaccess rules

Monday, September 18th, 2017

howto-block-webserver-overloading-bad-crawler-bots-spiders-with-htaccess-modrewrite-rules-file

In my last post, I talked about the problem of Search Index Crawler Robots aggressively crawling websites and how to stop them (the article is here), explaining how to raise the delay between Bot URL requests to a website and how to completely prohibit some bots from crawling with robots.txt (see the example just below).
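Just for reference, a minimal robots.txt applying both techniques might look something like the below (the delay value and the bot name are only placeholders; note also that Crawl-delay is a non-standard directive which Bing, Yahoo and Yandex honour, but Google does not):

# slow down well-behaved crawlers to 1 request per 10 seconds
User-agent: *
Crawl-delay: 10

# completely prohibit a particular spider (placeholder name)
User-agent: SomeAggressiveBot
Disallow: /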

As explained in that article, the consequence of too many badly written or aggressively behaving Spiders is the "server stoning" effect: degraded Web Server performance or even a short-time Denial of Service, depending on how well the initial Server Scaling was done.

The bots we want to filter are not to be confused with the legitimate bots that drive real traffic to your website. Just for information:

The 10 Most Popular WebCrawler Bots as of the time of writing are:
 

1. GoogleBot (The Google Crawler bots; funnily, the bots become less active on Saturdays and Sundays :))

2. BingBot (Bing.com Crawler bots)

3. SlurpBot (also famous as Yahoo! Slurp)

4. DuckDuckBot (The crawler bots of the search engine duckduckgo.com)

5. Baiduspider (The most famous Chinese search engine, used as a substitute for Google in China)

6. YandexBot (The Russian Yandex Search engine crawler bots, used in Russia as a substitute for Google)

7. Sogou Spider (a leading Chinese Search Engine launched in 2004)

8. Exabot (the crawler for the ExaLead Search Engine, a French Search Engine launched in 2000)

9. FaceBot (Facebook External Hit; this crawler fetches a webpage only once a user shares or pastes a link to it – video, music, blog, whatever – in chat to another user)

10. Alexa Crawler (ia_archiver is the web crawler for Amazon's Alexa Internet Rankings; Alexa is a great site to evaluate the approximate popularity of a page on the internet, and the Alexa SiteInfo page has historically been the Swiss Army knife for anyone wanting to quickly evaluate a webpage's approximate ranking compared to other pages)

The above legitimate bots are known to follow most if not all of the W3C – World Wide Web Consortium (W3.org) – standards, and therefore respect the allowance and restriction commands given for a site in robots.txt. Unfortunately, many of the so-called Bad-Bots and Mirroring scripts that are burning your Web Server CPU and Memory, mentioned in the previous article, follow the /robots.txt prescriptions only partially or not at all.

Hence, for the bots that don't respect robots.txt, the only way to get rid of most of the webspiders that are just loading your bandwidth and server hardware is to filter / block them with rules in the website's .htaccess file. (Strictly speaking, the blocking rules below rely on mod_setenvif's SetEnvIfNoCase directive plus Apache's access-control directives rather than on mod_rewrite, so they keep working even where mod_rewrite is not available.)
If it doesn't exist yet, create the .htaccess file in the DocumentRoot of your website with whatever text editor, or create it on your Windows / Mac OS desktop and transfer it to the server via FTP / SecureFTP.

I prefer to do it directly on the server with vim (text editor):

 

 

vim /var/www/sites/your-domain.com/.htaccess

 

RewriteEngine On

IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*

SetEnvIfNoCase User-Agent "^Black Hole” bad_bot
SetEnvIfNoCase User-Agent "^Titan bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck" bad_bot
SetEnvIfNoCase User-Agent "^Crescent" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^TeleportPro" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft" bad_bot
SetEnvIfNoCase User-Agent "^Website Quester" bad_bot
SetEnvIfNoCase User-Agent "^WebZip" bad_bot
SetEnvIfNoCase User-Agent "^moget/2.1" bad_bot
SetEnvIfNoCase User-Agent "^WebZip/4.0" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot
SetEnvIfNoCase User-Agent "^RMA" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP" bad_bot
SetEnvIfNoCase User-Agent "^asterias" bad_bot
SetEnvIfNoCase User-Agent "^httplib" bad_bot
SetEnvIfNoCase User-Agent "^turingos" bad_bot
SetEnvIfNoCase User-Agent "^spanner" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot" bad_bot
SetEnvIfNoCase User-Agent "^Harvest/1.5" bad_bot
SetEnvIfNoCase User-Agent "Bullseye/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4.0 (compatible; BullsEye; Windows 95)" bad_bot
SetEnvIfNoCase User-Agent "^Crescent Internet ToolPak HTTP OLE Control v.1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPickerSE/1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker /1.0" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit/3.50" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 5.01.4511" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder" bad_bot
SetEnvIfNoCase User-Agent "^Foobot" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot" bad_bot
SetEnvIfNoCase User-Agent "^BotALot" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial/1.34" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.6" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 6.00.8169" bad_bot
SetEnvIfNoCase User-Agent "^URLy Warning" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.5.3" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^cosmos" bad_bot
SetEnvIfNoCase User-Agent "^moget" bad_bot
SetEnvIfNoCase User-Agent "^hloader" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro" bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^Mata Hari" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot" bad_bot
SetEnvIfNoCase User-Agent "^Web Image Collector" bad_bot
SetEnvIfNoCase User-Agent "^The Intraformant" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish/1.0" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc/4.2" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot/2.14" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot" bad_bot
SetEnvIfNoCase User-Agent "^suzuran" bad_bot
SetEnvIfNoCase User-Agent "^VCI WebViewer VCI WebViewer Win32" bad_bot
SetEnvIfNoCase User-Agent "^VCI" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz/1.4" bad_bot
SetEnvIfNoCase User-Agent "^QueryN Metasearch" bad_bot
SetEnvIfNoCase User-Agent "^Openfind data gathere" bad_bot
SetEnvIfNoCase User-Agent "^Openfind" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s Link Sleuth 1.1c" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey Bait & Tackle/v1.01" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 32297 Webster Pro V2.9 Win32" bad_bot
SetEnvIfNoCase User-Agent "^Webster Pro" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan/8.1a Unix" bad_bot
SetEnvIfNoCase User-Agent "^Keyword Density/0.9" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin Spider" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh" bad_bot

 

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

 


The above Bad Bot prohibition rules have the RewriteEngine On directive included; however, on many websites this directive is already enabled directly in the VirtualHost section for the domain(s). If that is your case, you can remove RewriteEngine On from .htaccess and the bad bot prohibition rules will continue to work.
The above rules are also perfectly suitable for WordPress-based websites / blogs in case you need to filter out obstructive spiders, though they will work on any website domain served by Apache with .htaccess overrides enabled.
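Note that the Order / Allow / Deny syntax in the <Limit> block above is the classic Apache 2.2 one; on Apache 2.4 it keeps working only while the compatibility module mod_access_compat is loaded. A rough equivalent using the native Apache 2.4 mod_authz_core directives would be something along these lines:

<RequireAll>
# let everybody in ...
Require all granted
# ... except clients whose User-Agent matched one of the bad_bot rules
Require not env bad_bot
</RequireAll>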

Once you have implemented the above rules, you will not need to restart Apache, as .htaccess is re-read dynamically on each client request to the Webserver.

2. Testing that the .htaccess Bad Bot Filtering Works as Expected


In order to test that the new Bad Bot filtering configuration is working properly, there is a manual and somewhat complicated way with lynx (a text browser), assuming you have shell access to a Linux / BSD / *NIX computer, or run your own *NIX server / desktop.
 

Here is how:
 

 

lynx -useragent="Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)" -head -dump http://www.your-website-filtering-bad-bots.com/

 

 

Note that lynx will provide a warning such as:

Warning: User-Agent string does not contain "Lynx" or "L_y_n_x"!

Just ignore it and press enter to continue.

Two other use cases with lynx that I have historically used heavily are pretending you're GoogleBot or GoogleBot-Mobile, in order to see how Google actually sees your website:
 

  • Pretend with Lynx You're GoogleBot

 

lynx -useragent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 

 

  • How to Pretend with Lynx Browser You are GoogleBot-Mobile

 

lynx -useragent="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 


Or, for the lazy ones that don't have Linux / *NIX at their disposal, you can use the WannaBrowser website.

WannaBrowser is a web-based browser emulator which gives you the ability to change the User-Agent on each website request, so just set your User-Agent to any bot we just filtered, for example set the User-Agent to CheeseBot.

Once the .htaccess rules added earlier detect that your browser client is coming in with a prohibited user agent, you will immediately be filtered out and will be unable to access the website, getting a message like:
 

HTTP/1.1 403 Forbidden
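By the way, if you have curl on some machine, a quicker alternative to lynx / WannaBrowser for this test is to send a HEAD request with a forged User-Agent (using the same placeholder domain as above):

# -I sends a HEAD request, -A sets the User-Agent string
curl -I -A "CheeseBot" http://www.your-website-filtering-bad-bots.com/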

 

Just as I've talked a lot about Index Bots, I think it is worth also mentioning three great websites that can give you a lot of up-to-date information on the exact user-agent strings Spiders return, commonly known Bot traits, as well as a currently updated list of the Bad Bots, etc.

Bot and Browser information resources on user-agents, bad-bots and odd Crawler / Bot specifics:

1. botreports.com
2. user-agents.org
3. useragentapi.com

 

An updated list of robot user-agents (crawler-user-agents) is also available on GitHub here, regularly updated by Caia Almeido.

There are also third-party plugins (modules) available for Website Platforms like WordPress / Joomla / Typo3 etc.

Besides the Bad and Good Bots listed on these websites, there are perhaps hundreds of others that might end up crawling your website and that might or might not need to be filtered; therefore, before proceeding with any filtering steps, it is generally a good idea to monitor your HTTPD access.log / error.log, as mistakenly filtering the wrong bot might be a cause of Website Indexing Problems.

Hope this article gives you some valuable information. Enjoy! 🙂

 



Windows 7 fix for messed-up Cyrillic menus – How to fix Cyrillic text in Windows

Friday, July 22nd, 2016

faststone-viewer-messed-up-menu-cyrillic-windows-7-screenshot

How to fix Cyrillic text on Windows 7

I've reinstalled my HP-provided company work notebook with Windows 7 Enterprise x86 and had trouble seeing Cyrillic text, letters and fonts.
The result, after installing some programs and selecting Bulgarian as the default language during the installation setup prompt, was that in some programs, and in some of my old file names and Cyrillic WIN CP1251 content, I was seeing cryptic letters like in the above screenshot.

If you're curious what is causing the broken-encoding Cyrillic text, it is the fact that in the past a lot of Cyrillic was written in the KOI-8R and WIN-CP1251 encodings, which are not Unicode, i.e. not compatible with UTF-8, the newer standard encoding for Cyrillic. Of course, the authors of some old programs and documents are not really responsible for the messed-up Cyrillic, as no one expected back then that every Cyrillic text would one day be in UTF-8.

Thankfully there is a way to fix the unreadable / broken-encoding Cyrillic text:

Go to the menus:

Start menu -> Control Panel -> Clock, Language, and Region -> Change display language

Once there, click the Administrative tab

and choose

Change system locale.

windows-7_administrative_tab_change-system-locale

Here, if you're not logged in with an administrator user, you will be prompted for administrative privileges.

select-system-locale-choose-bulgarian-and-hit-ok-windows-7

Once there, choose your language (country) to be:

Bulgarian (Bulgaria) – if you're, like me, a Bulgarian – or Russian if you're Russian / Belarusian / Ukrainian or someone from the countries of the ex-USSR.
Click OK

And reboot (restart) your computer in order to make the new settings active.
 

That should be it; from now on all Cyrillic letters in all programs / documents and file names on your PC should visualize fine, just as intended more or less by Saint Clement of Ohrid, the assumed creator of Cyrillic, who derived it from the Glagolitic alphabet.



Word 2011 Check spelling for Mac OS X 2011 – Word check text in Mac OS X Office

Monday, February 29th, 2016

office-for-mac-2011-logo
 

If you happen to be running a Mac OS X powered notebook and have recently installed Microsoft Office 2011 for Mac OS because you migrated from a Windows PC, you will probably be surprised that the native-language dictionary check you used heavily on Windows might not be working on the Mac.
This was exactly the case with my wife Svetlana, and as she is not a computer expert and I'm the IT support at home, I had to solve it somehow.

Luckily, Office 2011 for Mac OS X, which I had installed earlier, comes with proofing tools for plenty of foreign languages such as Russian, Bulgarian, Czech, French, German, UK English, US English etc.
The proofing tools are very handy, especially for people like my wife, who is natively Belarusian and is in the process of learning Bulgarian, and thus often needs to check the spelling of Bulgarian words.

By default the Check Spelling language in the Office package was set to English. There is a quick way to change this for a certain text without changing the check-spelling default from English; the key shortcut to use is:

I. Press the Mac Command key + A – to select all text in the opened document (in our case the text was in Bulgarian)

command-a-mac-os-x

Click Word window menu and:
 

Choose Tools→Language


and select Bulgarian (or whatever language you need to check spelling for).

If you need to change the language default permanently, again you can do it from Tools:

 

 

 Tools→Language

 

language-menu-on-mac-os-x-microsoft-word-office-2011-package-screenshot

II. The Language dialog will appear and you'll see a list of languages to choose from.

select-language-menu-screenshot-ms-word-2011-on-mac-os-x

III. Next, a pop-up dialog will ask you whether you're sure you want to change the default language to the language of choice; in my case this was Russian.

language-default-dialog-mac-osx-word-2011-office-on-mac-change-default-spelling-language

That's it – the check-spelling default will now be applied to the Word Normal template, so next time you open a document your default spelling choice will be set.


Cookie Policy

Thursday, October 1st, 2015

This site uses cookies – small text files that are placed on your machine to help the site provide a better user experience. In general, cookies are used to retain user preferences, store information for things like shopping carts, and provide anonymised tracking data to third-party applications like Google Analytics. As a rule, cookies will make your browsing experience better. However, you may prefer to disable cookies on this site and on others. The most effective way to do this is to disable cookies in your browser. We suggest consulting the Help section of your browser or taking a look at the About Cookies website, which offers guidance for all modern browsers.


Joomla 1.5 fix news css problem partial text (article text not completely showing in Joomla – Category Blog Layout problem)

Monday, October 20th, 2014

joomla-fix-weird-news-blog-article-text-incompletely-shown-category-blog-website-layout-problem

I'm still administrating an old archaic Joomla website built on top of Joomla 1.5. Recently there were some security issues with the website, so I first tried using the jupgrade (Upgrade Joomla 1.5 to Joomla 2.5) plugin to try to resolve them. As there were issues with the upgrade, because the template used was not available for Joomla 2.5, I decided to continue using Joomla 1.5 and applied the Joomla 1.5 Security Patch. I also had to disable a couple of unused Joomla components and the contact form in order to prevent spammers from randomly spamming through the Joomla … the Joomla Security Scanner was mostly useful in order to fix the Joomla security holes ..

So far so good – this solved the Joomla security, but just recently I was asked to add a new article to the Joomla News section (the news section is configured to serve as a mini site blog, as only a few articles are added every few months). To my surprise, all of a sudden the newly added Joomla article started displaying its text and pictures only partially. The weird-looking newly added news looked very much like some kind of template or CSS problem. I tried debugging the HTML code, but unfortunately my knowledge of CSS is not that deep, so as a next step I tried to tweak some settings from Joomla Administrator in hope that this would fix the cut-off article text, even though the text I'd placed in the article seemed correctly formatted. Finally, pissed off with trying to solve the news section layout problem, I looked online to see whether anyone else had bumped into the same error, and I stumbled on Joomla's forum explaining the Category Blog Layout Problem.

The solution to Joomla showing incomplete text in articles is to go to the Joomla administrator menus:

1. Menus -> Main Menu -> (Click on Menu Item(s) – Edit Menu Item(s)) button
2. Click on News (section)
In the Parameters section (on the bottom right of the screen) you will see #Leading set to some low number, for example something like 8 or 9. The whole issue in my case was that I was trying to add more than 8 articles while I had Leading set to 8, and in order to add more articles and keep proper leading I had to raise it. To prevent recurring leading errors, I've raised the Leading to 100, like shown in the below screenshot:
joomla-blog-layout-basic-parameters-screenshot-fix-joomla-news-cut-text-problem-screenshot

After raising it to some high number, click Apply and you're done – your problem is solved 🙂
For those curious what the above settings from the screenshot mean:

# Leading Articles -> The number of articles that are shown at full width
# Intro Articles -> The number of articles that are not shown at full width
# Columns -> The number of columns in which the articles identified as #Intro will be shown. If #Intro is zero this setting has no effect
# Links -> The number of articles to be shown as links. The number of articles should exceed #Leading + #Intro

N.B. Solving this issue took me quite a long time and a lot of attempts. I tried creating the article from scratch, making a copy of an old article, etc. At one point I even messed up a few of the news articles so badly that I had to recreate them from scratch. Before doing any changes to obsolete Joomlas, always make a database and file content backup; otherwise you may end up like me, losing 10 hours of your time ..

The bitter experience with Joomla convinced me once again that, when I have time, I have to migrate this Joomla CMS to WordPress. My experience with Joomla so far has proven to me that the time and nerves spent learning Joomla, building a multi-lingual website with it, and administering it through Joomla's obscure and cryptic interfaces – together with its multiple security issues, its hardness to upgrade from release to release, its slowness, and its fewer plugins compared to WordPress – make this CMS unworthy to study or use; WordPress is a much better (and easier to build and use) platform than Joomla.
So if you happen to be in doubt whether to use Joomla or WordPress to build a new company / community website or a blog, my humble advice is – choose WordPress!


How to check who is flooding your Apache / Nginx Webserver – Real-time monitoring statistics on the IPs doing the most URL requests, and Stopping DoS attacks with Fail2Ban

Wednesday, August 20th, 2014

check-who-is-flooding-your-apache-nginx-webserver-real-time-monitoring-ips-doing-most-url-requests-to-webserver-and-protecting-your-webserver-with-fail2ban

If you're a Linux system administrator in a Webhosting company providing WordPress / Joomla / Drupal web-site hosting, and your UNIX servers suffer from periodic denial of service attacks – because some site customer's business is the target of a competitor company trying to ruin your client's business sites through DoS or DDoS attacks – then the best thing you can do is to identify who is hammering the Linux server and how. If you find out the DoS is not on a network level, but Apache keeps crashing because of memory leaks and the connections to Apache are so many that the CPU is being stoned, the best thing to do is to check which IP addresses are causing the excessive GET / POST / HEAD requests in the logs.
 

There is the Apachetop tool that can give you the most accessed webserver URLs in a refreshing screen like the UNIX top command; however, Apachetop does not show which IP does the most URL hits on the Apache / Nginx webserver.

 

1. Get basic information on which IPs access Apache / Nginx the most using shell cmds


Before examining the Webserver logs, it is useful to get a general picture of who is flooding you on a TCP / IP network level, with netstat like so:
 

# here is how to check the count of clients connected to your server
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n


If you get an extensive number of connected IPs / hosts (like 10000 or some similarly huge number), then depending on the type of hardware the server is running on and the scaling previously planned for the system, you can determine whether a count this huge can be handled normally. If, as in most cases, the server was planned to serve a couple of hundred or thousand clients and you get over 10000 connections hanging, then your server is either under attack or – if it is an Internet-facing server – your website may suddenly have become famous, e.g. someone posted an article on some major website and you suddenly received tons of hits (see the one-liner below for a quick total count).
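If you only need the raw total of current connections (to compare against what the server was scaled for), a quick one-liner is:

# count all current TCP connections (tail skips the two netstat header lines)
netstat -ntu | tail -n +3 | wc -l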


There is also a way, using standard shell tools, to get some basic information on which IP accesses the webserver the most:

tail -n 500 /var/log/apache2/access.log | cut -d' ' -f1 | sort | uniq -c | sort -gr

Or if you want to keep it refreshing periodically every few seconds run it through watch command:

watch "tail -n 500 /var/log/apache2/access.log | cut -d' ' -f1 | sort | uniq -c | sort -gr"

monioring-access-hits-to-webserver-by-ip-show-most-visiting-apache-nginx-ip-with-shell-tools-tail-cut-uniq-sort-tools-refreshed-with-watch-cmd


Another useful combination of shell commands is to Monitor POST / GET / HEAD requests number in access.log :
 

 awk '{print $6}' access.log | sort | uniq -c | sort -n

     1 "alihack<%eval
      1 "CONNECT
      1 "fhxeaxb0xeex97x0fxe2-x19Fx87xd1xa0x9axf5x^xd0x125x0fx88x19"x84xc1xb3^v2xe9xpx98`X'dxcd.7ix8fx8fxd6_xcdx834x0c"
      1 "x16x03x01"
      1 "xe2
      2 "mgmanager&file=imgmanager&version=1576&cid=20
      6 "4–"
      7 "PUT
     22 "–"
     22 "OPTIONS
     38 "PROPFIND
   1476 "HEAD
   1539 "-"
  65113 "POST
 537122 "GET


However, such shell command combinations require plenty of typing and are hard to remember; plus, the above tools do not show you approximately how frequently each IP hits the webserver.

 

2. Real-time monitoring IP addresses with highest URL requests with logtop

 


Real-time monitoring of the IP addresses with the highest URL requests is possible, with no need for "console ninja skills", through logtop.

 

2.1 Install logtop on Debian / Ubuntu and deb derivatives Linux

 


a) Installing Logtop the debian way

LogTop is easily installable on Debian and Ubuntu; in newer releases – Debian 7.0 and Ubuntu 13/14 Linux – it is part of the default package repositories and can be straightly apt-get-ed with:

apt-get install --yes logtop

b) Installing Logtop from source code (install on older deb based Linuxes)

On older Debian – Debian 6 – and Ubuntu 7-12 servers, to install logtop compile it from source code – read the README installation instructions, or if lazy copy / paste below:

cd /usr/local/src
wget https://github.com/JulienPalard/logtop/tarball/master
mv master JulienPalard-logtop.tar.gz
tar -zxf JulienPalard-logtop.tar.gz
cd JulienPalard-logtop-*/

# build dependencies
aptitude install libncurses5-dev uthash-dev
# needed only for the optional python module
aptitude install python-dev swig

make python-module
python setup.py install
make
make install

mkdir -p /usr/bin/
cp logtop /usr/bin/


2.2 Install Logtop on CentOS 6.5 / 7.0 / Fedora / RHEL and rest of RPM based Linux-es

b) Install logtop on CentOS 6.5 and CentOS 7 Linux

– For CentOS 6.5 you need to rpm install epel-release-6-8.noarch.rpm
 

wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
rpm -ivh epel-release-6-8.noarch.rpm
# fetch the uthash source rpm (e.g. with the links text browser)
links http://dl.fedoraproject.org/pub/epel/6/SRPMS/uthash-1.9.9-6.el6.src.rpm
rpmbuild --rebuild uthash-1.9.9-6.el6.src.rpm
cd /root/rpmbuild/RPMS/noarch
rpm -ivh uthash-devel-1.9.9-6.el6.noarch.rpm


– For CentOS 7 you need to rpm install epel-release-7-0.2.noarch.rpm

 

links http://download.fedoraproject.org/pub/epel/beta/7/x86_64/repoview/epel-release.html
 

Click on and download epel-release-7-0.2.noarch.rpm

rpm -ivh epel-release-7-0.2.noarch.rpm
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
yum -y install git ncurses-devel uthash-devel
git clone https://github.com/JulienPalard/logtop.git
cd logtop
make
make install

 

2.3 Some Logtop use examples and short explanation

 

logtop shows 4 columns as follows – Line number, Count, Frequency, and Actual line

 

The quickest way to visualize which IP is stoning your Apache / Nginx webserver on Debian:

 

tail -f access.log | awk '{print $1; fflush();}' | logtop

 

 

logtop-check-which-ip-is-making-most-requests-to-your-apache-nginx-webserver-linux-screenshot

On CentOS / RHEL

tail -f /var/log/httpd/access_log | awk '{print $1; fflush();}' | logtop

 

Using LogTop, even the Squid Proxy caching server access.log can be monitored.
To get Squid's top users listed by IP:

 

tail -f /var/log/squid/access.log | awk '{print $1; fflush();}' | logtop


logtop-visualizing-top-users-using-squid-proxy-cache
 

Or you might visualize in real time the Squid cache's top requested URLs:
 

tail -f /var/log/squid/access.log | awk '{print $7; fflush();}' | logtop


visualizing-top-requested-urls-in-squid-proxy-cache-howto-screenshot

 

3. Automatically Filter IP addresses causing Apache / Nginx Webserver Denial of Service with fail2ban
 

Once you identify the problem, if the sites hosted on the server are the target of a Distributed DoS, probably the best thing to do is to use fail2ban to automatically filter (ban) IP addresses doing excessive queries to system services. Assuming that you have already installed fail2ban as explained in the above link (on Debian / Ubuntu Linux) with:
 

apt-get install --yes fail2ban


To make fail2ban start filtering DoS attack IP addresses, you will have to set the following configurations:
 

vim /etc/fail2ban/jail.conf


Paste in file:
 

[http-get-dos]
 
enabled = true
port = http,https
filter = http-get-dos
logpath = /var/log/apache2/WEB_SERVER-access.log
# maxretry is how many GETs we can have in the findtime period before getting narky
maxretry = 300
# findtime is the time period in seconds in which we're counting "retries" (300 seconds = 5 mins)
findtime = 300
# bantime is how long we should drop incoming GET requests for a given IP for, in this case it's 5 minutes
bantime = 300
action = iptables[name=HTTP, port=http, protocol=tcp]


Before you paste, make sure you put the proper logpath = location of the webserver log (the default one is /var/log/apache2/access.log). If you're using a separate log for each hosted website, you will probably want to write a script that automatically loops through the logs directory, gets the log file names and adds an auto-modified version of the above [http-get-dos] configuration for each (see the sketch below). Also tune maxretry per IP, findtime and bantime; in the above example the values are a bit low, and for heavily loaded websites that have to serve thousands of simultaneous connections originating from office networks behind Network Address Translation (NAT), these might be too low and need tuning, to prevent situations where even your own customers can't access their websites 🙂
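A minimal sketch of such a generator script – assuming your per-site logs match /var/log/apache2/*-access.log (adjust the glob to your layout) and that you append the jails to /etc/fail2ban/jail.local – could be:

#!/bin/sh
# append one [http-get-dos-N] jail per per-site Apache access log
i=0
for log in /var/log/apache2/*-access.log; do
    i=$((i + 1))
    cat >> /etc/fail2ban/jail.local <<EOF

[http-get-dos-$i]
enabled = true
port = http,https
filter = http-get-dos
logpath = $log
maxretry = 300
findtime = 300
bantime = 300
action = iptables[name=HTTP-$i, port=http, protocol=tcp]
EOF
done

Newer fail2ban versions also accept wildcards directly in logpath, so check whether a single logpath = /var/log/apache2/*-access.log line does the job before scripting around it.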

To finalize fail2ban configuration, you have to create fail2ban filter file:

vim /etc/fail2ban/filter.d/http-get-dos.conf


Paste:
 

# Fail2Ban configuration file
#
# Author: http://www.go2linux.org
#
[Definition]
 
# Option: failregex
# Note: This regex will match any GET entry in your logs, so basically all valid and not valid entries are a match.
# You should set up in the jail.conf file, the maxretry and findtime carefully in order to avoid false positives.
 
failregex = ^<HOST> -.*"(GET|POST).*
 
# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
ignoreregex =


To make fail2ban load new created configs restart it:
 

/etc/init.d/fail2ban restart


If you want to test whether it is working you can use Apache webserver benchmark tools such as ab or siege.
The quickest way to test whether excessive IP requests get filtered – and get your IP temporarily banned:
 

ab -n 1000 -c 20 http://your-web-site-dot-com/

This will make 1000 page loads over 20 concurrent connections and will get your IP temporarily banned for (300 seconds) = 5 minutes. The ban will be logged in /var/log/fail2ban.log, where you will get something like:

2014-08-20 10:40:11,943 fail2ban.actions: WARNING [http-get-dos] Ban 192.168.100.5
2014-08-20 10:44:12,341 fail2ban.actions: WARNING [http-get-dos] Unban 192.168.100.5
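You can also inspect the jail and the list of currently banned IPs straight from fail2ban, or peek at the iptables chain which the action above created:

# show status and banned IP list of the http-get-dos jail
fail2ban-client status http-get-dos
# the iptables action with name=HTTP creates a chain called fail2ban-HTTP
iptables -L fail2ban-HTTP -n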



find text strings recursively in Linux and UNIX – find grep in sub-directories command examples

Tuesday, May 13th, 2014

unix_Linux_recursive_file_search_string_grep
GNU Grep is equipped with a special option "-r" to grep recursively. Looking for a string in files across a sub-directory tree with the -r option is a piece of cake. You just do:

grep -r 'string' /directory/

or if you want to search recursively and case-insensitively for text:

grep -ri 'string' .
 

Another classic GNU grep use (I use it almost daily) is listing all files containing a given (case-insensitive) string:

grep -rli 'string' directory-name
 

Now, if you want to check whether a string is contained in a file or group of files in a directory recursively on some other UNIX like HP-UX or Sun OS / Solaris, where no GNU grep is installed by default, here is how to do it:

find /directory -exec grep 'searched string' {} /dev/null \;

Note that this approach to looking for files containing a string on UNIX is very slow. Thus, on not-too-archaic UNIX systems, for better search performance it is better to use xargs:

find . | xargs grep searched-string


A small note to make here: when using xargs there might be weird results on filesystems with filenames starting with "-".

Thus comes the classical (ultimate) way to grep for files containing a string with find + grep, e.g.

find / -exec grep grepped-string {} /dev/null \;

Another way to search a string recursively in files is by using the UNIX shell '*' (star) globbing expression:

grep pattern * */* */*/* 2>/dev/null
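If you're on GNU grep anyway, it is also worth knowing the --include option, which limits the recursive search to file names matching a shell pattern – handy (and faster) when digging through big source trees:

# search for 'pattern' recursively, but only inside *.c files
grep -r --include='*.c' 'pattern' /directory/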

Talking about recursive directory text search in UNIX, I should mention another good GNU grep alternative – ack; check it out at betterthangrep.com 🙂 . Ack is perfect for programmers who have to dig through large directory trees of code for certain variables, functions, objects etc.

 



30th anniversary of the first mass-produced portable computer – the GRiD Compass 1101

Thursday, July 19th, 2012

Grid Notebook Big screen logo

Today the modern laptop (portable computer) is considered to be turning 30 years old. The notebook's grandparent is the GRiD Compass 1101 (often mislabelled a COMPAQ, though it was actually built by GRiD Systems Corporation) – a "mobile computer" with an electroluminescent display (ELD) screen supporting a resolution of 320×240 pixels. The screen allowed the user to use the computer console in a text resolution of 80×24 characters. This portable high-tech gadget was equipped with a magnesium alloy case, an Intel 8086 CPU (XT processor) at 8 MHz (like my old Pravetz desktop PC 😉 ), 340 kilobytes of internal non-removable magnetic bubble memory and even a 1,200 bit/s modem!

GRiD Compass, considered the first laptop / notebook on Earth – 30th anniversary of the portable computer

The machine was uniquely expandable for its time, as one could easily attach devices such as 5.25-inch floppy drives and an external (10 MB) hard disk via the IEEE-488 I/O protocol, also called GPIB (General Purpose Interface Bus).

First mass-produced portable computer laptop – GRiD Compass 1101, back side input peripherals

The laptop also had a uniquely small weight of only 5 kg and rechargeable batteries with a power unit (like modern laptops) connectable to a normal (110/220 V) wall plug.

First notebook in the world ever – the GRiD Compass 1101
The machine was bundled with its own specifically written OS, GRiD-OS. GRiD-OS could only run specialized software, so the applications available were a bit limited.
Shortly after the market introduction, because of the incompatibility of GRiD-OS, the GRiD was shipped with MS-DOS v2.0.
This primitive laptop computer was developed to serve mainly the needs of business users and military purposes (NASA, the U.S. military) etc.

GRiD machines were even used on Space Shuttles during the 1980s – 1990s.
The price of the machine in April 1982, when the GRiD Compass was introduced, was shockingly high – $8,150.

The machine's hardware design is quite elegant, as you can see in the pic below:

 COMPAQ grid laptop 1101 bubbles internal memory

As a computer history geek, I've researched the GRiD Compass further and found a nice 1:30-hour video with a detailed presentation retelling the history.

Shortly after the GRiD Compass 1101's introduction, many other companies started producing similarly sized computers; one example of this was the Epson HX-20 notebook. 30 years later, probably around 70% of the citizens of the globe own a laptop or some kind of portable computing device (smartphone, tablet, ultra-book etc.).

Most computer users owning a desktop nowadays own a laptop too, for mobility reasons. Interestingly, even 30 years later, the laptop as we know it is still in a shape (form) very similar to its original predecessor. Today notebook sales are starting to be overshadowed by tablets and ultra-books (for the second quarter laptop sales rose 5%, but compared with 2011 the rise is a lesser 1.8% – according to data provided by the Digital Research agency). There are estimations (by Forrester Research) that by the end of 2015, sales of notebook-substitute portable devices will exceed the overall sales of notebooks. It is manifest that today the market dynamics are changing in favour of tablets and the so-called next generation laptops – ULTRA-BOOKS. It is a mass hype and a marketing lie that Ultra-Books are somehow different from laptops. The difference between a classical laptop and an Ultra-Book is the thinner size, less weight and often longer battery life. Actually, Ultra-Books copy the design concept of the Apple MacBook Air, trying to resell it under a loud name.
Even if in the future iPads, Android tablets, Ultra-Books or whatever kind of mumbo-jumbo portable devices flood the market, laptops will still be heavily used by programmers, office workers, company employees and any person who needs to do a lot of regular text editing, email use and work with corporate apps. Hence we will likely see the descendants of the GRiD Compass 1101 notebook stay dominant until the end of the decade.


How to add support for DJVU file format on M$ Windows, Mac, GNU / Linux and FreeBSD

Thursday, June 14th, 2012

Windjview Format paper Clipper logo / support DjView on Windows and Linux

By default there is no way to see what is inside a DJVU-formatted document on either the Windows or the Linux OS platform. It was just a few months ago that I first came across the DJVU format, on one computer I had to fix up. The DJVU format was developed primarily for storing scanned documents rich in text and drawings.
Many old and ancient documents, for example church books in Latin and some other older stuff, are only to be found online in DJVU format.
The main advantage of DJVU over, let's say, PDF – which is also good for storing text and visual data – is that DJVU's data encoding makes the files much smaller in size, while the quality of the scanned document remains well readable to the human eye.

DJVU is a file format alternative to PDF, which, as we all know, has established itself as one of the major standard formats for distributing electronic documents.

Besides old books, there are plenty of old magazines, rare reports, tech reports and newspapers from the 1st and 2nd World War etc. in DJVU.
A typical DJVU document takes a size of only, let's say, 50 to 100 KBytes; just for comparison, a typical PDF-encoded document is approximately 500 KiloBytes.

1.% Reading DJVU's on M$ Windoze and Mac-s (WinDjView)

The program reader for DJVU files on Windows is WinDjView; the WinDjView official download site is here.

WinDjView is free software, licensed under GPLv2.

WinDjView works fine on all popular Windows versions (7, Vista, 2003, XP, 2000, ME, 98, NT4).

WinDjView with an opened old document (Sol manual)

I've made a mirror copy of WinDjView for download here (just in case something happens to the present release and someone needs it in the future).

For Mac users there is also a port of WinDjView called MacDjView.

2.% Reading DJVU files on GNU / Linux

The library capable of rendering DJVUs on both Linux and Windows is djvulibre, again free software (a small note to make here is that WinDjView also uses djvulibre to render DJVU file content).

The program capable of viewing DJVU files on Linux is called djview4; I have so far tested it only with Debian GNU / Linux.

To add support to a desktop Debian GNU / Linux rel. (6.0.2) Squeeze, I had to install the following debs:

debian:~# apt-get install --yes djview4 djvulibre-bin djvulibre-desktop libdjvulibre-text pdf2djvu
........
...

pdf2djvu is not strictly necessary to install, but I installed it since I think it is a good idea to have a PDF to DJVU converter on the system in case I someday need it.
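For example, converting in both directions is a one-liner each way (the file names below are just placeholders; ddjvu comes with the djvulibre-bin package installed above):

# PDF -> DJVU
pdf2djvu -o document.djvu document.pdf

# DJVU -> PDF
ddjvu -format=pdf document.djvu document.pdf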

djview4 is based on the Qt library (used by KDE), so unfortunately users like me who use GNOME as a desktop environment will get the Qt library installed as a requirement of the above apt-get.

Here is a djview4 screenshot with an opened old-times Bulgarian magazine called Computer – for you:

DJVU Pravetz Computer for you old school Bulgarian Pravetz magazine

Though the magazine opens fine, every now and then I got some odd errors while scrolling the pages, but it could be due to an improperly encoded DJVU file and not due to the reader. Pitifully, when I tried to maximize the DJVU and read it in fullscreen, I got a (segfault) error and the program failed. Anyways, at least I can read the magazine in non-fullscreen mode.

3.% Reading DJVU's on FreeBSD and (other BSDs)

Desktop FreeBSD users and other BSD OS enthusiasts could also use djview4 to view DJVUs as there is a BSD port in the ports tree.
To use it on BSD I had to install port /usr/ports/graphics/djview4:

freebsd# cd /usr/ports/graphics/djview4
freebsd# make install clean
...

For GNU / Linux users who have to do stuff with DJVU files, there are two other programs which might be useful:
 

  • a) djvusmooth – graphical editor for DjVu
  • b) gscan2pdf – A GUI to produce PDFs or DjVus from scanned documents

DJVUSmooth Debian GNU / Linux opened prog

I tried djvusmooth to edit the same DJVU magazine which I had opened earlier, but I got an Unhandled exception: IOError, as you can see in the below shot:

DJVUSmooth Unhandled Exception IOError

This is probably normal, since djvusmooth is in a very early stage of development – the current version is 0.2.7-1.

Unfortunately I don't have a scanner at home, so I can't test whether gscan2pdf produces proper DJVUs from scans; anyways, I installed it to at least check the program interface, which at first glimpse looks simplistic:

gscan2pdf 0.9.31 Debian Linux Squeeze screenshot
To sum it up, DJVU obviously seems like a great alternative to PDF; however, its support on Free Software OSes still lags behind.
The current Windows DJVU software works way better, though hopefully this will change soon.

