Posts Tagged ‘webpage’

Block Web server over loading Bad Crawler Bots and Search Engine Spiders with .htaccess rules

Monday, September 18th, 2017

howto-block-webserver-overloading-bad-crawler-bots-spiders-with-htaccess-modrewrite-rules-file

In last post, I've talked about the problem of Search Index Crawler Robots aggressively crawling websites and how to stop them (the article is here) explaning how to raise delays between Bot URL requests to website and how to completely probhit some bots from crawling with robots.txt.

As explained in article the consequence of too many badly written or agressive behaviour Spider is the "server stoning" and therefore degraded Web Server performance as a cause or even a short time Denial of Service Attack, depending on how well was the initial Server Scaling done.

The bots we want to filter are not to be confused with the legitimate bots, that drives real traffic to your website, just for information

 The 10 Most Popular WebCrawlers Bots as of time of writting are:
 

1. GoogleBot (The Google Crawler bots, funnily bots become less active on Saturday and Sundays :))

2. BingBot (Bing.com Crawler bots)

3. SlurpBot (also famous as Yahoo! Slurp)

4. DuckDuckBot (The dutch search engine duckduckgo.com crawler bots)

5. Baiduspider (The Chineese most famous search engine used as a substitute of Google in China)

6. YandexBot (Russian Yandex Search engine crawler bots used in Russia as a substitute for Google )

7. Sogou Spider (leading Chineese Search Engine launched in 2004)

8. Exabot (A French Search Engine, launched in 2000, crawler for ExaLead Search Engine)

9. FaceBot (Facebook External hit, this crawler is crawling a certain webpage only once the user shares or paste link with video, music, blog whatever  in chat to another user)

10. Alexa Crawler (la_archiver is a web crawler for Amazon's Alexa Internet Rankings, Alexa is a great site to evaluate the approximate page popularity on the internet, Alexa SiteInfo page has historically been the Swift Army knife for anyone wanting to quickly evaluate a webpage approx. ranking while compared to other pages)

Above legitimate bots are known to follow most if not all of W3C – World Wide Web Consorium (W3.Org) standards and therefore, they respect the content commands for allowance or restrictions on a single site as given from robots.txt but unfortunately many of the so called Bad-Bots or Mirroring scripts that are burning your Web Server CPU and Memory mentioned in previous article are either not following /robots.txt prescriptions completely or partially.

Hence with the robots.txt unrespective bots, the case the only way to get rid of most of the webspiders that are just loading your bandwidth and server hardware is to filter / block them is by using Apache's mod_rewrite through

 

.htaccess


file

Create if not existing in the DocumentRoot of your website .htaccess file with whatever text editor, or create it your windows / mac os desktop and transfer via FTP / SecureFTP to server.

I prefer to do it directly on server with vim (text editor)

 

 

vim /var/www/sites/your-domain.com/.htaccess

 

RewriteEngine On

IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*

SetEnvIfNoCase User-Agent "^Black Hole” bad_bot
SetEnvIfNoCase User-Agent "^Titan bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck" bad_bot
SetEnvIfNoCase User-Agent "^Crescent" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^TeleportPro" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft" bad_bot
SetEnvIfNoCase User-Agent "^Website Quester" bad_bot
SetEnvIfNoCase User-Agent "^WebZip" bad_bot
SetEnvIfNoCase User-Agent "^moget/2.1" bad_bot
SetEnvIfNoCase User-Agent "^WebZip/4.0" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot
SetEnvIfNoCase User-Agent "^RMA" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP" bad_bot
SetEnvIfNoCase User-Agent "^asterias" bad_bot
SetEnvIfNoCase User-Agent "^httplib" bad_bot
SetEnvIfNoCase User-Agent "^turingos" bad_bot
SetEnvIfNoCase User-Agent "^spanner" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot" bad_bot
SetEnvIfNoCase User-Agent "^Harvest/1.5" bad_bot
SetEnvIfNoCase User-Agent "Bullseye/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4.0 (compatible; BullsEye; Windows 95)" bad_bot
SetEnvIfNoCase User-Agent "^Crescent Internet ToolPak HTTP OLE Control v.1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPickerSE/1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker /1.0" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit/3.50" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 5.01.4511" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder" bad_bot
SetEnvIfNoCase User-Agent "^Foobot" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot" bad_bot
SetEnvIfNoCase User-Agent "^BotALot" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial/1.34" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.6" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 6.00.8169" bad_bot
SetEnvIfNoCase User-Agent "^URLy Warning" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.5.3" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^cosmos" bad_bot
SetEnvIfNoCase User-Agent "^moget" bad_bot
SetEnvIfNoCase User-Agent "^hloader" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro" bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^Mata Hari" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot" bad_bot
SetEnvIfNoCase User-Agent "^Web Image Collector" bad_bot
SetEnvIfNoCase User-Agent "^The Intraformant" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish/1.0" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc/4.2" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot/2.14" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot" bad_bot
SetEnvIfNoCase User-Agent "^suzuran" bad_bot
SetEnvIfNoCase User-Agent "^VCI WebViewer VCI WebViewer Win32" bad_bot
SetEnvIfNoCase User-Agent "^VCI" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz/1.4" bad_bot
SetEnvIfNoCase User-Agent "^QueryN Metasearch" bad_bot
SetEnvIfNoCase User-Agent "^Openfind data gathere" bad_bot
SetEnvIfNoCase User-Agent "^Openfind" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s Link Sleuth 1.1c" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey Bait & Tackle/v1.01" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 32297 Webster Pro V2.9 Win32" bad_bot
SetEnvIfNoCase User-Agent "^Webster Pro" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan/8.1a Unix" bad_bot
SetEnvIfNoCase User-Agent "^Keyword Density/0.9" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin Spider" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh" bad_bot

 

<Limit GET POST>
order allow,deny
allow from all
Deny from env=bad_bot
</Limit>

 


Above rules are Bad bots prohibition rules have RewriteEngine On directive included however for many websites this directive is enabled directly into VirtualHost section for domain/s, if that is your case you might also remove RewriteEngine on from .htaccess and still the prohibition rules of bad bots should continue to work
Above rules are also perfectly suitable wordpress based websites / blogs in case you need to filter out obstructive spiders even though the rules would work on any website domain with mod_rewrite enabled.

Once you have implemented above rules, you will not need to restart Apache, as .htaccess will be read dynamically by each client request to Webserver

2. Testing .htaccess Bad Bots Filtering Works as Expected


In order to test the new Bad Bot filtering configuration is working properly, you have a manual and more complicated way with lynx (text browser), assuming you have shell access to a Linux / BSD / *Nix computer, or you have your own *NIX server / desktop computer running
 

Here is how:
 

 

lynx -useragent="Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)" -head -dump http://www.your-website-filtering-bad-bots.com/

 

 

Note that lynx will provide a warning such as:

Warning: User-Agent string does not contain "Lynx" or "L_y_n_x"!

Just ignore it and press enter to continue.

Two other use cases with lynx, that I historically used heavily is to pretent with Lynx, you're GoogleBot in order to see how does Google actually see your website?
 

  • Pretend with Lynx You're GoogleBot

 

lynx -useragent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 

 

  • How to Pretend with Lynx Browser You are GoogleBot-Mobile

 

lynx -useragent="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 


Or for the lazy ones that doesn't have Linux / *Nix at disposal you can use WannaBrowser website

Wannabrowseris a web based browser emulator which gives you the ability to change the User-Agent on each website req1uest, so just set your UserAgent to any bot browser that we just filtered for example set User-Agent to CheeseBot

The .htaccess rule earier added once detecting your browser client is coming in with the prohibit browser agent will immediately filter out and you'll be unable to access the website with a message like:
 

HTTP/1.1 403 Forbidden

 

Just as I've talked a lot about Index Bots, I think it is worthy to also mention three great websites that can give you a lot of Up to Date information on exact Spiders returned user-agent, common known Bot traits as well as a a current updated list with the Bad Bots etc.

Bot and Browser Resources information user-agents, bad-bots and odd Crawlers and Bots specifics

1. botreports.com
2. user-agents.org
3. useragentapi.com

 

An updated list with robots user-agents (crawler-user-agents) is also available in github here regularly updated by Caia Almeido

There are also a third party plugin (modules) available for Website Platforms like WordPress / Joomla / Typo3 etc.

Besides the listed on these websites as well as the known Bad and Good Bots, there are perhaps a hundred of others that might end up crawling your webdsite that might or might not need  to be filtered, therefore before proceeding with any filtering steps, it is generally a good idea to monitor your  HTTPD access.log / error.log, as if you happen to somehow mistakenly filter the wrong bot this might be a reason for Website Indexing Problems.

Hope this article give you some valueable information. Enjoy ! 🙂

 

Linux convert and read .mht (Microsoft html) file format. MHT format explained

Thursday, June 5th, 2014

linux-open-and-convert-mht-file-format-to-html-howto
If you're using Linux as a Desktop system sooner or later you will receive an email with instructions or an html page stored in .mht file format.
So what is mht? MHT is an webpage archive format (short for MIME HTML document). MHTML saves the Web page content and incorporates external resources, such as images, applets, Flash animations and so on, into HTML documents. Usually those .mht files were produced with Microsoft Internet Explorer – saving pages through:

File -> Save As (Save WebPage) dialog saves pages in .MHT.

To open those .mht files on Linux, where Firefox is available add the UNMHT FF Extension to browser. Besides allowing you to view MHT on Linux, whether some customer is requiring a copy of an HTML page in MHT, UNMHT allows you to also save complete web pages, including text and graphics, into a MHT file.
There is also support for Google Chrome browser for MHT opening and saving via a plugin called IETAB. But unfortunately IETAB is not supported in Linux.
Anyways IETAB is worthy to mention here as if your'e a Windows users and you want to browse pages compatible only with Internet Explorer, IETAB will emulates exactly IE by using IE rendering engine in Chrome  and supports Active X Controls. IETAB is a great extension for QA (web testers) using Windows for desktop who prefer to not use IE for security reasons. IETab supports IE6, IE7, IE8 and IE9.

Another way to convert .MHT content file into HTML is to use Linux KDE's mhttohtml tool.

linux-kde-converter-mhttohtml

Another approach to open .MHT files in Linux is to use Opera browser for Linux which has support for .MHT

Note that because MHT files could be storing potentially malicious content (like embedded Malware) it is always wise when opening MHT on Windows to assure you have scanned the file with Antivirus program. Often mails containing .MHT from unknown recipients are containing viruses or malware. Also links embedded into MHT file could easily expose you to spoof attacks. MHT files are encoded in combination of plain text MIMEs and BASE64 encoding scheme, MHT's mimetype is:

MIME type: message/rfc822
 

How to Turn Off, Suppress PHP Notices and Warnings – PHP error handling levels via php.ini and PHP source code

Friday, April 25th, 2014

php-logo-disable-warnings-and-notices-in-php-through-htaccess-php-ini-and-php-code

PHP Notices are common to occur after PHP version upgrades or where an obsolete PHP code is moved from Old version PHP to new version. This is common error in web software using Frameworks which have been abandoned by developers.

Having PHP Notices to appear on a webpage is pretty ugly and give a lot of information which might be used by malicious crackers to try to break your site thus it is always a good idea to disable PHP Notices. There are plenty of ways to disable PHP Notices

The easiest way to disable it is globally in all Webserver PHP library via php.ini (/etc/php.ini) open it and make sure display_errors is disabled:

display_errors = 0

or

display_errors = Off

Note that that some claim in PHP 5.3 setting display_errors to Off will not work as expected. Anyways to make sure where your loaded PHP Version display_errors is ON or OFF use phpinfo();

It is also possible to disable PHP Notices and error reporting straight from PHP code you need code like:

 

<?php
// Turn off all error reporting
error_reporting(0);
?>

 

or through code:

 

ini_set('display_errors',0);


PHP has different levels of error reporting, here is complete list of possible error handling variables:

 

 

 

<?php
// Report simple running errors

error_reporting(E_ERROR | E_WARNING | E_PARSE);

// Reporting E_NOTICE can be good too (to report uninitialized
// variables or catch variable name misspellings …)

error_reporting(E_ERROR | E_WARNING | E_PARSE | E_NOTICE);

// Report all errors except E_NOTICE
// This is the default value set in php.ini

error_reporting(E_ALL ^ E_NOTICE);
// Report all PHP errors (see changelog)

error_reporting(E_ALL);
// Report all PHP errors error_reporting(-1);
// Same as error_reporting(E_ALL);

ini_set('error_reporting', E_ALL); ?>

The level of logging could be tuned on Debian Linux via /etc/php5/apache2/php.ini or if necessary to set PHP log level in PHP CLI through /etc/php5/cli/php.ini with:

error_reporting = E_ALL & ~E_NOTICE

 

If you need to remove to remove exact warning or notices from PHP without changing the way  PHPLib behaves is to set @ infront of variable or function that is causing NOTICES or WARNING:
For example:
 

@yourFunctionHere();
@var = …;


Its also possible to Disable PHP Notices and Warnings using .htaccess file (useful in shared hosting where you don't have access to global php.ini), here is how:

# PHP error handling for development servers
php_flag display_startup_errors off
php_flag display_errors off
php_flag html_errors off
php_flag log_errors on
php_flag ignore_repeated_errors off
php_flag ignore_repeated_source off
php_flag report_memleaks on
php_flag track_errors on
php_value docref_root 0
php_value docref_ext 0
php_value error_log /home/path/public_html/domain/php_errors.log
php_value error_reporting -1
php_value log_errors_max_len 0

This way though PHP Notices and Warnings will be suppressed errors will get logged into php_error.log

Change Internet Explorer to open in tabs / Force IE Open Links in new tab

Thursday, November 28th, 2013

Internet-explorer multiple opened windows annoying behaviour fix howto picture

If you have to work on MS Windows 7 / 8 with Internet Explorer for the reason websites you're forced to work are only properly working under IE. This is common in big companies like my employer Hewlett Packard or IBM for instance. You certainly have been annoyed by default Internet Explorer 7 / Internet Explorer 8 or EI 9 behaviour to open each new link in separate Windows. By default normal browsers like Opera, Firefox and Google Chrome does not behave in such irritating ways but open each new link in separate tab. If you're like me used to work most of your life with Firefox, this IE behavior can quickly drive you "crazy" so you will look for fastly to change that abnormal browser actions. What makes things with default IE behavior even more messy is the fact that there are sites which automatically open in Separate tab (for they were javascripted) to do so and ones that open in new Window making the whole browsing experience a "pure windows hell".

Thanks God IE new window page popups can be easily changed

1. Open Internet Explorer and Click on
Tools
-> Internet Options

Internet explorer Internet options menu screenshot
(Note: if your version of Internet Explorer is hiding menus press Alt key to make it visualize menus)

2.
In General (tab) select on (Change how webpage is displayed in Tabs) Settings

Internet Explorer internet options change webpage displayed in tabs settings screenshot

3.  Under field When a popup is encountered: Choose radio button of ( Always Open Pop-ups in new tab )

Internet explorer tabbed browser settings Microsoft Windows 8 screenshot

After Apply and OK press finally Pages will start opening in a "human readable" way 🙂 in new Tabs. Hope this hint helps someone. Enjoy 🙂

Check how webpage looks with Internet Explorer on Linux and FreeBSD with Mozilla Firefox (Netrenderer Firefox plugin)

Thursday, November 1st, 2012

Simulate Internet Explorer in screenshots on GNU / Linux and FreeBSD using Netrenderer in Firefox - Internet Explorer testing tool for web developers on Linux and FreeBSD

I'm not full time web developer. But sometimes, I develop websites too or just had to do some website testing.
I'm using GNU / Linux and BSD as main server and desktop platforms for many years already and hence I don't have regular access to Windows OS and respectively Internet Explorer. In that manner of thoughts it is very useful to have a way to check if a certain website I create displays fine on Internet Explorer 6,7,8 too.

Usually whether I need to test if website displays properly its elements in Internet Explorer I do use the infamous  http://ipinfo.info/netrenderer/index.php – I guess it is almost impossible anyone is developing websites on Linux and don't know it :). Fortunately while I was googling to remind myself about the exact link location to netrenderer, I've stumbled upon Mozilla Firefox add-on extension which does precisely what ipinfo.info/netrenderer/ website does – i.e. renders a website with HTML Web Engine compatible   to most Internet Explorer versions and creating screenshots on how a website would look under Internet Explorer. Of course the plugin is not a panace and since it only makes screenshots whether there are problems with interactivity (Javascript AJAX) of a website on IE will the plugin will be of zero use. However in general it is good to know if at least the website elements are ordered fine.
After the plugin is added in the usual way as any other plugin in FF, you can start using it with keyboard shortcuts:

Ctrl+Shift+F5/F6/F7/F8 – respectively renders the page in IE5.5, IE 6, IE 7 / IE 8 Beta 2

Pressing CTRL + Shift + FX, makes the IE screenshot of site using http://ipinfo.info/netrenderer/

I'm currently running latest Firefox version 16.0.2 and here plugin works, fine I guess on most FF releases not older than few years it should work fine too.

Below is description of the plugin, as taken from plugin website:

IE NetRendered Add-on Description

Adds buttons, tools menu and contextual menu entries to get a screenshot of the current page with IE NetRenderer.

Keyboard shortcuts are also available: Ctrl+Shift+F5/F6/F7/F8 to render the page in IE5.5/6/7/8 Beta 2 (Cmd+Shift+F* on Mac).

Really useful for webmasters which are not using Windows!

You can also access the IE NetRenderer service here: http://ipinfo.info/netrenderer/index.php

Please note that the extension developper is not affiliated with GEOTEK, providing the IE NetRenderer service. You can visit his website here: http://nicopensource.free.fr/

 

 

 

 

 

 

 

 

 

How to convert any internet Webpage to PDF from command line on GNU/Linux

Friday, September 30th, 2011

Linux webpage html to pdf command line convertor wkhtmltopdf

If you're looking for a command line utility to generate PDF file out of any webpage located online you are looking for Wkhtmltopdf
The conversion of webpages to PDF by the tool is done using Apple's Webkit open source render.
wkhtmltopdf is something very useful for web developers, as some webpages has a requirement to produce dynamically pdfs from a remote website locations.
wkhtmltopdf is shipped with Debian Squeeze 6 and latest Ubuntu Linux versions and still not entered in Fedora and CentOS repositories.

To use wkhtmltopdf on Debian / Ubuntu distros install it via apt;

linux:~# apt-get install wkhtmltodpf
...

Next to convert a webpage of choice use cmd:

linux:~$ wkhtmltopdf www.pc-freak.net pc-freak.net_website.pdf
Loading page (1/2)
Printing pages (2/2)
Done

If the web page to be snapshotted in long few pages a few pages PDF will be generated by wkhtmltopdf
wkhtmltopdf also supports to create the website snapshot with a specified orientation Landscape / Portrait

-O Portrait options to it, like so:

linux:~$ wkhtmltopdf -O Portrait www.pc-freak.net pc-freak.net_website.pdf

wkhtmltopdf has many useful options, here are some of them:
 

  • Javascript disabling – Disable support for javascript for a website
  • Grayscale pdf generation – Generates PDf in Grayscale
  • Low quality pdf generation – Useful to shrink the output size of generated pdf size
  • Set PDF page size – (A4, Letter etc.)
  • Add zoom to the generated pdf content
  • Support for password HTTP authentication
  • Support to use the tool over a proxy
  • Generation of Table of Content based on titles (only in static version)
  • Adding of Header and Footers (only in static version)

To generate an A4 page with wkhtmltopdf:

wkhtmltopdf -s A4 www.pc-freak.net/blog/ pc-freak.net_blog.pdf

wkhtmltopdf looks promising but seems a bit buggy still, here is what happened when I tried to create a pdf without setting an A4 page formatting:

linux:$ wkhtmltopdf www.pc-freak.net/blog/ pc-freak.net_blog.pdf
Loading page (1/2)
OpenOffice path before fixup is '/usr/lib/openoffice' ] 71%
OpenOffice path is '/usr/lib/openoffice'
OpenOffice path before fixup is '/usr/lib/openoffice'
OpenOffice path is '/usr/lib/openoffice'
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
Printing pages (2/2)
Done
Printing pages (2/2)
Segmentation fault

Debian and Ubuntu version of wkhtmltopdf does not support TOC generation and Adding headers and footers, to support it one has to download and install the static version of wkhtmltopdf
Using the static version of the tool is also the only option for anyone on Fedora or any other RPM based Linux distro.

Secure Apache webserver against basic Denial of Service attacks with mod_evasive on Debian Linux

Wednesday, September 7th, 2011

Secure Apache against basic Denial of Service attacks with mod evasive, how webserver DDoS works

One good module that helps in mitigating, very basic Denial of Service attacks against Apache 1.3.x 2.0.x and 2.2.x webserver is mod_evasive

I’ve noticed however many Apache administrators out there does forget to install it on new Apache installations or even some of them haven’t heard about of it.
Therefore I wrote this small article to create some more awareness of the existence of the anti DoS module and hopefully thorugh it help some of my readers to strengthen their server security.

Here is a description on what exactly mod-evasive module does:

debian:~# apt-cache show libapache2-mod-evasive | grep -i description -A 7

Description: evasive module to minimize HTTP DoS or brute force attacks
mod_evasive is an evasive maneuvers module for Apache to provide some
protection in the event of an HTTP DoS or DDoS attack or brute force attack.
.
It is also designed to be a detection tool, and can be easily configured to
talk to ipchains, firewalls, routers, and etcetera.
.
This module only works on Apache 2.x servers

How does mod-evasive anti DoS module works?

Detection is performed by creating an internal dynamic hash table of IP Addresses and URIs, and denying any single IP address which matches the criterias:

  • Requesting the same page more than number of times per second
  • Making more than N (number) of concurrent requests on the same child per second
  • Making requests to Apache during the IP is temporarily blacklisted (in a blocking list – IP blacklist is removed after a time period))

These anti DDoS and DoS attack protection decreases the possibility that Apache gets DoSed by ana amateur DoS attack, however it still opens doors for attacks who has a large bot-nets of zoombie hosts (let’s say 10000) which will simultaneously request a page from the Apache server. The result in a scenario with a infected botnet running a DoS tool in most of the cases will be a quick exhaustion of system resources available (bandwidth, server memory and processor consumption).
Thus mod-evasive just grants a DoS and DDoS security only on a basic, level where someone tries to DoS a webserver with only possessing access to few hosts.
mod-evasive however in many cases mesaure to protect against DoS and does a great job if combined with Apache mod-security module discussed in one of my previous blog posts – Tightening PHP Security on Debian with Apache 2.2 with ModSecurity2
1. Install mod-evasive

Installing mod-evasive on Debian Lenny, Squeeze and even Wheezy is done in identical way straight using apt-get:

deiban:~# apt-get install libapache2-mod-evasive
...

2. Enable mod-evasive in Apache

debian:~# ln -sf /etc/apache2/mods-available/mod-evasive.load /etc/apache2/mods-enabled/mod-evasive.load

3. Configure the way mod-evasive deals with potential DoS attacks

Open /etc/apache2/apache2.conf, go down to the end of the file and paste inside, below three mod-evasive configuration directives:

<IfModule mod_evasive20.c>
DOSHashTableSize 3097DOS
PageCount 30
DOSSiteCount 40
DOSPageInterval 2
DOSSiteInterval 1
DOSBlockingPeriod 120
#DOSEmailNotify hipo@mymailserver.com
</IfModule>

In case of the above configuration criterias are matched, mod-evasive instructs Apache to return a 403 (Forbidden by default) error page which will conserve bandwidth and system resources in case of DoS attack attempt, especially if the DoS attack targets multiple requests to let’s say a large downloadable file or a PHP,Perl,Python script which does a lot of computation and thus consumes large portion of server CPU time.

The meaning of the above three mod-evasive config vars are as follows:

DOSHashTableSize 3097 – Increasing the DoSHashTableSize will increase performance of mod-evasive but will consume more server memory, on a busy webserver this value however should be increased
DOSPageCount 30 – Add IP in evasive temporary blacklist if a request for any IP that hits the same page 30 consequential times.
DOSSiteCount 40 – Add IP to be be blacklisted if 40 requests are made to a one and the same URL location in 1 second time
DOSBlockingPeriod 120 – Instructs the time in seconds for which an IP will get blacklisted (e.g. will get returned the 403 foribden page), this settings instructs mod-evasive to block every intruder which matches DOSPageCount 30 or DOSSiteCount 40 for 2 minutes time.
DOSPageInterval 2 – Interval of 2 seconds for which DOSPageCount can be reached.
DOSSiteInterval 1 – Interval of 1 second in which if DOSSiteCount of 40 is matched the matched IP will be blacklisted for configured period of time.

mod-evasive also supports IP whitelisting with its option DOSWhitelist , handy in cases if for example, you should allow access to a single webpage from office env consisting of hundred computers behind a NAT.
Another handy configuration option is the module capability to notify, if a DoS is originating from a number of IP addresses using the option DOSEmailNotify
Using the DOSSystemCommand in relation with iptables, could be configured to filter out any IP addresses which are found to be matching the configured mod-evasive rules.
The module also supports custom logging, if you want to keep track on IPs which are found to be trying a DoS attack against the server place in above shown configuration DOSLogDir “/var/log/apache2/evasive” and create the /var/log/apache2/evasive directory, with:
debian:~# mkdir /var/log/apache2/evasive

I decided not to log mod-evasive DoS IP matches as this will just add some extra load on the server, however in debugging some mistakenly blacklisted IPs logging is sure a must.

4. Restart Apache to load up mod-evasive debian:~# /etc/init.d/apache2 restart
...

Finally a very good reading which sheds more light on how exactly mod-evasive works and some extra module configuration options are located in the documentation bundled with the deb package to read it, issue:

debian:~# zless /usr/share/doc/libapache2-mod-evasive/README.gz

How to easy add Joomla 1.5 donate Paypal capabilities with Joomla PAYPAL DONATION MODULE

Wednesday, June 15th, 2011

PayPal donation Module Joomla Screenshot

Many joomla CMS installations are for Non-profit organizations or Non Government organizations. These are organizations which are not officially making profit and therefore this instituations are interested into donations to support their activities.

In this occasions adding Joomla paypal capabilities is very essential. There are plenty of modules which enables Joomla to support paypal monetary payments, however many of them are either paid or requires registration and thus it’s quite time consuming to set up a decent PayPal supporting module for Joomla.
After a bit of investigation thanks God, I’ve come across a module that is free of charge, easily downloadable (wihtout registration) and is also relatively easy to configure, these module is called PAYPAL DONATION MODULE
I’ve mirored the module to my server, just in case if the module disappears in the future.

Here are a very brief explanation on how the module can be downloaded installed and configured:

First Download (mod_ojdonation_pp) Paypal Donation Module here

Install it as joomla module via:

Extensions -> Install/Uninstall
menu

Afterwards, go to:

Extensions -> Module Manager

In the list of modules you will notice the Donate module which will be disabled. Use the Enable button to enable it.

Next by clicking on the Donate Module Name, one can configure the module, where the most essential configuration values that needs to be filled in are:

1. Title: – The title of the donation form:
2. Donation Title: – Title of donation picture to show in the webpage
3. Donation Amount: – Default donation amount user will donate with paypal by clicking on Donate button
4. Currency – Default currency the donators will use to donate to configured paypal account
5. Paypal ID: – The email address of paypal account your donators will donate to (This was a bit hard to understand since Paypal ID is not a number ID but the email address configured as an username in PayPal).
6. Donation Description: – Description text to appear before the Donate button
7. Donation Footer: – Text to appear after the Donate button

There are two ways one could add the donation module to show the donation form, on the joomla website:
a. One is to enable the donation button on every joomla webpage (I don’t like this kind of behaviour).

To use this kind of donate button display approach, you will have to select from the Donation module, conf options:
– Show on FrontPage: and Show Title:

Also make sure the Enabled: option is set to Yes

b. Second approach is to set the PayPal Donation form only to appear on a single menu, to do so:

While in Paypal Donation Module configuration in Menu Assignment section, select:

Select Menu Item(s) from the List
instead of the default All value set for Menus.

The last setting to be choosen is the paypal donation form page location (where exactly on the selected pages the form will appear).

The form location is set from the Position: dropdown menu, the option which I found to be the best one for me was the bottom option. However just play with the Position setting and choose the one that will be best for you.

Then scroll on in the Menu Selection: and choose only the menus where you want a paypal donation form to appear.

Finally to save all the recent made settings, click on Apply and refreshing in a new page should show you paypal’s money donation form in joomla

If all is configured fine with Joomla’s – Paypal Donation Module you should get on your webpage:

PayPal donation Module in Joomla Screenshot
 

Cloud Computing a possible threat to users privacy and system administrator employment

Monday, March 28th, 2011

Cloud Computing screenshot

If you’re employed into an IT branch an IT hobbyist or a tech, geek you should have certainly heard about the latest trend in Internet and Networking technologies the so called Cloud Computing

Most of the articles available in newspapers and online have seriously praised and put the hopes for a better future through cloud computing.
But is really the cloud computing as good as promised? I seriously doubt that.
Let’s think about it what is a cloud? It’s a cluster of computers which are connected to work as one.
No person can precisely say where exactly on the cluster cloud a stored information is located (even the administrator!)

The data stored on the cluster is a property of a few single organizations let’s say microsoft, amazon etc., so we as users no longer have a physical possession of our data (in case if we use the cloud).

On the other hand the number of system administrators that are needed for an administration of a huge cluster is dramatically decreased, the every day system administrator, who needs to check a few webservers and a mail server on daily basis, cache web data with a squid proxy cache or just restart a server will be no longer necessary.

Therefore about few million of peoples would have to loose their jobs, the people necessary to administrate a cluster will be probably no more than few thousands as the clouds are so high that no more than few clouds will exist on the net.

The idea behind the cluster is that we the users store retrieve our desktops and boot our operating system from the cluster.
Even loading a simple webpage will have to retrieve it’s data from the cluster.

Therefore it looks like in the future the cloud computing and the internet are about to become one and the same thing. The internet might become a single super cluster where all users would connect with their user ids and do have full access to the information inside.

Technologies like OpenID are trying to make the user identification uniform, I assume a similar uniform user identication will be used in the future in a super cloud where everybody, where entering inside will have access to his/her data and will have the option to access any other data online.

The desire of humans and business for transperancy would probably end up in one day, where people will want to share every single bit of information.
Even though it looks very cool for a sci-fi movie, it’s seriously scary!

Cloud computing expenses as they’re really high would be affordable only for a multi-national corporations like Google and Microsoft

Therefore small and middle IT business (network building, expanding, network and server system integration etc.) would gradually collapse and die.

This are only a few small tiny bit of concerns but in reality the problems that cloud computing might create are a way more severe.
We the people should think seriously and try to oppose cloud computing, while we still can! It might be even a good idea if a special legislation that is aming at limiting cloud computing can be integrated and used only inside the boundary of a prescribed limitations.

Institutions like the European Parliament should be more concerned about the issues which the use of cloud computing will bring, EU legislation should very soon be voted and bounding contracts stop clouds from expanding and taking over the middle size IT business.

Few little SEO tips on indexing a website / webpage in a better fashion (Through usage of robots.txt and sitemap.xml)

Monday, December 28th, 2009

As I’ve mentioned in my previous post it’s of vital value to have your website W3C compliant.
I’m trying to learn something more about SEO this days and therefore I found a website dedicated to
Search Engine Optimization discussing the importance of having the robots.txt file in each and every hosted
domain as well as the huge importance of generating sitemap for every web page out there in the Net.
More about robots.txt can be learned fromabout robots.txt’s website
Here is a straigh forward robots.txt I choose to use to enable indexing of website without any restrictions
# Allow allUser-agent: *Disallow:It was kinda of handy for me to learn more about sitemaps on sitemaps.org
In few words sitemaps are used to better map content on a website as every SEO out there knows.
Here is a sample sitemap from the sitemap.xml I’ve generated for pc-freak.net’s entry page
http://pc-freak.net/ 2009-12-27T15:28:49+00:00 1.00There are plenty of both online and standalone sitemap.xml generators on the net.
I’ve choosed to use an online service to generate my sitemap.xml after which I modified
the entries in the generated file manually.
I’ve also given a try to google-sitemapgen from the bsd port tree, howeverI wasn’t able to use that soft to generate any sitemaps.
One of the most used and probably honoured project on sitemap generating is Google’s sitemap Generator, checkin Google for further info.
Another handy tip that helps indexing in search engines is including:
Sitemap: http://pc-freak.net/sitemap.xmlin the robots.txt file, so after inclusion robots.txtwould look alike.# Allow allUser-agent: *Disallow:Sitemap: http://pc-freak.net/sitemap.xml