Posts Tagged ‘content’

Block Web Server overloading by Bad Crawler Bots and Search Engine Spiders with .htaccess rules

Monday, September 18th, 2017


In my last post, I talked about the problem of Search Index Crawler Robots aggressively crawling websites and how to stop them (the article is here), explaining how to raise delays between Bot URL requests to a website and how to completely prohibit some bots from crawling with robots.txt.

As explained in that article, the consequence of too many badly written or aggressively behaving Spiders is "server stoning" and therefore degraded Web Server performance, or even a short-lived Denial of Service, depending on how well the initial Server Scaling was done.

The bots we want to filter are not to be confused with the legitimate bots that drive real traffic to your website. Just for information:

The 10 Most Popular Web Crawler Bots at the time of writing are:

1. GoogleBot (the Google Crawler bots; funnily enough, the bots become less active on Saturdays and Sundays :))

2. BingBot (Bing.com Crawler bots)

3. SlurpBot (also famous as Yahoo! Slurp)

4. DuckDuckBot (the crawler bot of the DuckDuckGo.com search engine)

5. Baiduspider (the crawler of Baidu, China's most famous search engine, used as a substitute for Google in China)

6. YandexBot (crawler bots of the Russian Yandex search engine, used in Russia as a substitute for Google)

7. Sogou Spider (a leading Chinese Search Engine launched in 2004)

8. Exabot (crawler for the ExaLead Search Engine, a French Search Engine launched in 2000)

9. FaceBot (Facebook External Hit; this crawler fetches a webpage only once a user shares or pastes a link to video, music, a blog or whatever in chat to another user)

10. Alexa Crawler (ia_archiver is the web crawler for Amazon's Alexa Internet Rankings; Alexa is a great site to evaluate the approximate popularity of a page on the internet, and the Alexa SiteInfo page has historically been the Swiss Army knife for anyone wanting to quickly evaluate a webpage's approximate ranking compared to other pages)

The above legitimate bots are known to follow most if not all of the W3C – World Wide Web Consortium (w3.org) – standards and therefore respect the allowance or restriction commands given for a site in robots.txt. Unfortunately, many of the so-called Bad Bots or mirroring scripts that are burning your Web Server CPU and Memory, mentioned in the previous article, follow the /robots.txt prescriptions only partially or not at all.

Hence, with bots that do not respect robots.txt, the only way to get rid of most of the web spiders that are just loading your bandwidth and server hardware is to filter / block them using Apache's mod_rewrite through an .htaccess file.

If it does not exist yet, create an .htaccess file in the DocumentRoot of your website with whatever text editor, or create it on your Windows / Mac OS desktop and transfer it via FTP / SecureFTP to the server.

I prefer to do it directly on the server with vim (text editor):

 

 

vim /var/www/sites/your-domain.com/.htaccess

 

RewriteEngine On

IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*

SetEnvIfNoCase User-Agent "^Black Hole” bad_bot
SetEnvIfNoCase User-Agent "^Titan bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck" bad_bot
SetEnvIfNoCase User-Agent "^Crescent" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^TeleportPro" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft" bad_bot
SetEnvIfNoCase User-Agent "^Website Quester" bad_bot
SetEnvIfNoCase User-Agent "^WebZip" bad_bot
SetEnvIfNoCase User-Agent "^moget/2.1" bad_bot
SetEnvIfNoCase User-Agent "^WebZip/4.0" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot
SetEnvIfNoCase User-Agent "^RMA" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP" bad_bot
SetEnvIfNoCase User-Agent "^asterias" bad_bot
SetEnvIfNoCase User-Agent "^httplib" bad_bot
SetEnvIfNoCase User-Agent "^turingos" bad_bot
SetEnvIfNoCase User-Agent "^spanner" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot" bad_bot
SetEnvIfNoCase User-Agent "^Harvest/1.5" bad_bot
SetEnvIfNoCase User-Agent "Bullseye/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4.0 (compatible; BullsEye; Windows 95)" bad_bot
SetEnvIfNoCase User-Agent "^Crescent Internet ToolPak HTTP OLE Control v.1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPickerSE/1.0" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker /1.0" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit/3.50" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 5.01.4511" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder" bad_bot
SetEnvIfNoCase User-Agent "^Foobot" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot" bad_bot
SetEnvIfNoCase User-Agent "^BotALot" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial/1.34" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.6" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control – 6.00.8169" bad_bot
SetEnvIfNoCase User-Agent "^URLy Warning" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.5.3" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^cosmos" bad_bot
SetEnvIfNoCase User-Agent "^moget" bad_bot
SetEnvIfNoCase User-Agent "^hloader" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro" bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^Mata Hari" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot" bad_bot
SetEnvIfNoCase User-Agent "^Web Image Collector" bad_bot
SetEnvIfNoCase User-Agent "^The Intraformant" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish/1.0" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc/4.2" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot/2.14" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot/1.0" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot" bad_bot
SetEnvIfNoCase User-Agent "^suzuran" bad_bot
SetEnvIfNoCase User-Agent "^VCI WebViewer VCI WebViewer Win32" bad_bot
SetEnvIfNoCase User-Agent "^VCI" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz/1.4" bad_bot
SetEnvIfNoCase User-Agent "^QueryN Metasearch" bad_bot
SetEnvIfNoCase User-Agent "^Openfind data gathere" bad_bot
SetEnvIfNoCase User-Agent "^Openfind" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s Link Sleuth 1.1c" bad_bot
SetEnvIfNoCase User-Agent "^Xenu’s" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey Bait & Tackle/v1.01" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 32297 Webster Pro V2.9 Win32" bad_bot
SetEnvIfNoCase User-Agent "^Webster Pro" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan/8.1a Unix" bad_bot
SetEnvIfNoCase User-Agent "^Keyword Density/0.9" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin Spider" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh" bad_bot

 

<Limit GET POST>
order allow,deny
allow from all
Deny from env=bad_bot
</Limit>

 


The above rules are Bad Bot prohibition rules and include the RewriteEngine On directive; however, for many websites this directive is enabled directly in the VirtualHost section for the domain(s). If that is your case, you can also remove RewriteEngine On from .htaccess and the bad bot prohibition rules will still continue to work.
The above rules are also perfectly suitable for WordPress-based websites / blogs in case you need to filter out obstructive spiders, even though the rules will work on any website domain with mod_rewrite enabled.

Once you have implemented the above rules, you will not need to restart Apache, as .htaccess is read dynamically on each client request to the Webserver.
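Note that .htaccess directives only take effect if the Apache configuration allows overrides for the DocumentRoot directory (AllowOverride must not be set to None). A quick sanity check could look like the below; the config paths are the Debian defaults, so adjust them to your setup:

grep -R "AllowOverride" /etc/apache2/apache2.conf /etc/apache2/sites-enabled/

apachectl configtest

apachectl configtest only validates the main configuration (the .htaccess file itself is parsed per request), but it is still a cheap way to make sure you haven't broken anything while editing the VirtualHost / AllowOverride settings.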

2. Testing .htaccess Bad Bots Filtering Works as Expected


In order to test that the new Bad Bot filtering configuration is working properly, there is a manual and more complicated way with lynx (a text browser), assuming you have shell access to a Linux / BSD / *nix computer, or you run your own *nix server / desktop.
 

Here is how:
 

 

lynx -useragent="Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)" -head -dump http://www.your-website-filtering-bad-bots.com/

 

 

Note that lynx will provide a warning such as:

Warning: User-Agent string does not contain "Lynx" or "L_y_n_x"!

Just ignore it and press enter to continue.

Two other use cases with lynx that I have historically used heavily are pretending to be GoogleBot, in order to see how Google actually sees your website:
 

  • Pretend with Lynx You're GoogleBot

 

lynx -useragent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 

 

  • How to Pretend with Lynx Browser You are GoogleBot-Mobile

 

lynx -useragent="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" -head -dump http://www.your-domain.com/

 


Or, for the lazy ones who don't have Linux / *nix at their disposal, you can use the WannaBrowser website.

Wannabrowser is a web-based browser emulator which gives you the ability to change the User-Agent on each website request, so just set your User-Agent to any bot we just filtered, for example set the User-Agent to CheeseBot.

Once the .htaccess rules added earlier detect that your browser client is coming in with a prohibited User-Agent, they will immediately filter it out and you'll be unable to access the website, getting a message like:
 

HTTP/1.1 403 Forbidden
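If you have curl installed on some machine, an even quicker check than WannaBrowser is possible from the shell (the hostname below is the example one used earlier). A filtered User-Agent such as CheeseBot should get the 403, while a normal browser agent should still receive a 200:

curl -I -A "CheeseBot" http://www.your-website-filtering-bad-bots.com/

curl -I -A "Mozilla/5.0 (X11; Linux x86_64)" http://www.your-website-filtering-bad-bots.com/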

 

As I've talked a lot about Index Bots, I think it is worth also mentioning three great websites that can give you a lot of up-to-date information on the exact user-agents returned by Spiders, commonly known Bot traits, as well as a currently updated list of the Bad Bots, etc.

Bot and Browser information resources on user-agents, bad bots and odd Crawler / Bot specifics:

1. botreports.com
2. user-agents.org
3. useragentapi.com

 

An updated list of robot user-agents (crawler-user-agents) is also available on GitHub here, regularly updated by Caia Almeido.

There are also third-party plugins (modules) available for website platforms like WordPress / Joomla / Typo3, etc.

Besides the Bad and Good Bots listed on these websites, there are perhaps hundreds of others that might end up crawling your website and that might or might not need to be filtered. Therefore, before proceeding with any filtering steps, it is generally a good idea to monitor your HTTPD access.log / error.log, as mistakenly filtering the wrong bot might be a reason for Website Indexing Problems.
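A quick way to see which user-agents are hammering the server the most is to summarize the access log from the shell; the one-liner below assumes the default Debian log location and the combined log format (where the User-Agent is the 6th double-quoted field), so adjust it to your setup:

awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20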

Hope this article gives you some valuable information. Enjoy! 🙂

 


rc.local missing in Debian 8 Jessie and Debian 9 Stretch and newer Ubuntu 16, Fedora, CentOS Linux – Why is /etc/rc.local not working and how to make it work again

Monday, September 11th, 2017


If you have installed a newer version of Debian GNU / Linux such as Debian 8 Jessie or Debian 9 Stretch, or Ubuntu 16 Xenial Xerus, either on a server or on a personal desktop laptop, and you want to execute a number of extra commands at the end of system boot, just like we GNU / Linux users have been doing for the past 25+ years, you will be surprised that /etc/rc.local is no longer available (the file is completely missing!!!).

This kind of behaviour (avoiding the use of /etc/rc.local and not shipping the file by default right after a Linux OS install) has been evident across many RedHack (Redhat) distributions such as Fedora and CentOS Linux for the last several releases, and the tendency, as often happens, spread to Debian-based distros too. However, on these RPM-based distros, as well as on the rest of the Linux distros, it was still possible to create /etc/rc.local manually to work around the missing file.

But NOoooo, the smart new generation GNU / Linux architects with large brains decided to completely wipe out the execution on Linux boot of /etc/rc.local from finalization stage, SMART isn't it??

For instance, if you had been eating a certain food for the last 25+ years and they suddenly prohibited you from eating it because they say it is not necessary anymore, how would you feel?? Crazy, isn't it??

Yes, I understand that the idea to wipe out /etc/rc.local did have a reason, as the developers are striving to constantly improve the boot speed (and the introduction of systemd (system and service manager) in Debian 8 Jessie over the past years did change significantly how Linux boots – earlier it used SysV init and LSB – Linux Standard Base – init scripts), but come on guys, /etc/rc.local
doesn't stall the boot process by minutes; including it adds just 2–3 seconds of extra boot runtime, so why on earth did you decide to remove it??

What I really loved about Linux through the years was the high level of consistency and interoperability: most things worked just the same way across distributions and upgrades followed some logic. Lately this kind of behaviour is changing, and many of the new ways to interact with a GNU / Linux PC, both in GUI and in text mode (console), are sadly becoming a mess …

So the smart guys who develop GNU / Linux distros said it is time to deprecate /etc/rc.local, preventing the user from being able to execute his set of finalization commands at the end of each booted multi-user runlevel.

The good news is you can bring back (resurrect) /etc/rc.local really easy:

To do so, just execute the following either in a physical /dev/tty console or in Gnome-Terminal (for GNOME users), or, for KDE GUI environment users, in KDE's terminal emulator konsole:

 

cat <<EOF >/etc/rc.local
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.
exit 0
EOF
chmod +x /etc/rc.local
systemctl start rc-local
systemctl status rc-local


I think the above is self-explanatory: the /etc/rc.local file is created, then to enable it we run systemctl start rc-local, and then we check the status of the just-started rc-local service with systemctl status rc-local.

You will get an output similar to below:
 

 

root@jericho:/home/hipo# systemctl start rc-local
root@jericho:/home/hipo# systemctl status rc-local
● rc-local.service - /etc/rc.local Compatibility
   Loaded: loaded (/lib/systemd/system/rc-local.service; static; vendor preset:
  Drop-In: /lib/systemd/system/rc-local.service.d
           └─debian.conf
   Active: active (exited) since Mon 2017-09-11 13:15:35 EEST; 6s ago
  Process: 5008 ExecStart=/etc/rc.local start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/rc-local.service
Sep 11 13:15:35 jericho systemd[1]: Starting /etc/rc.local Compatibility…
Sep 11 13:15:35 jericho systemd[1]: Started /etc/rc.local Compatibility.

To test that /etc/rc.local is working as expected, you can add a command that prints some string on boot, right before the exit 0 line in /etc/rc.local.

For example, you can add:
 

echo "YES, /etc/rc.local IS NOW AGAIN WORKING JUST LIKE IN EARLIER LINUX DISTRIBUTIONS!!! HOORAY !!!!";


On CentOS 7, Fedora 18 (codename Spherical Cow) or other RPM-based Linux distros, if /etc/rc.local is missing you can follow a very similar procedure to have it enabled: make sure /etc/rc.d/rc.local exists and that /etc/rc.local is properly symlinked to /etc/rc.d/rc.local.

Also don't forget to check whether /etc/rc.d/rc.local is set to be an executable file, with ls -al /etc/rc.d/rc.local.
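For example, the check and, if the symlink is missing, its creation would look something like this:

ls -al /etc/rc.d/rc.local
ln -sf /etc/rc.d/rc.local /etc/rc.local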

If it is not executable, make it so by running the cmd:
 

chmod a+x /etc/rc.d/rc.local


If the file /etc/rc.d/rc.local happens to be missing, just create it with the following content:

 

#!/bin/sh

# Your boot-time rc commands go below, before the exit 0 line

exit 0


That's all folks, the rc.local not working problem is solved –
enjoy /etc/rc.local working again 🙂

 


Linux convert and read .mht (Microsoft html) file format. MHT format explained

Thursday, June 5th, 2014

If you're using Linux as a desktop system, sooner or later you will receive an email with instructions or an HTML page stored in the .mht file format.
So what is MHT? MHT is a webpage archive format (short for MIME HTML document). MHTML saves the web page content and incorporates external resources, such as images, applets, Flash animations and so on, into the HTML document. Usually those .mht files were produced with Microsoft Internet Explorer, saving pages through:

File -> Save As (Save WebPage) dialog saves pages in .MHT.

To open those .mht files on Linux, where Firefox is available, add the UNMHT Firefox extension to the browser. Besides allowing you to view MHT files on Linux, in case some customer requires a copy of an HTML page in MHT, UNMHT also allows you to save complete web pages, including text and graphics, into a single MHT file.
There is also support for MHT opening and saving in the Google Chrome browser via a plugin called IE Tab, but unfortunately IE Tab is not supported on Linux.
Anyway, IE Tab is worth mentioning here: if you're a Windows user and you want to browse pages compatible only with Internet Explorer, IE Tab emulates IE exactly by using the IE rendering engine inside Chrome and supports ActiveX Controls. IE Tab is a great extension for QA (web testers) using Windows on the desktop who prefer not to use IE for security reasons. IE Tab supports IE6, IE7, IE8 and IE9.

Another way to convert the content of a .MHT file into HTML is to use the KDE mhttohtml tool on Linux.


Another approach to opening .MHT files on Linux is to use the Opera browser for Linux, which has support for .MHT.

Note that because MHT files could store potentially malicious content (like embedded malware), it is always wise when opening an MHT file on Windows to make sure you have scanned it with an antivirus program. Mails containing .MHT attachments from unknown senders often contain viruses or malware. Also, links embedded into an MHT file could easily expose you to spoofing attacks. MHT files are encoded as a combination of plain text MIME parts and BASE64 encoding; MHT's mimetype is:

MIME type: message/rfc822
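Since an MHT file is really just a MIME message, yet another option on Linux is to unpack its parts from the command line with munpack (from the mpack package, assuming it is installed); the HTML body and any embedded resources are written out as separate files:

mkdir mht-extract && cd mht-extract
munpack ../page.mht

Here page.mht is just an example filename; after unpacking, the extracted .html part can be opened in any browser or fed to a converter.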
 


Check and Restart Apache if it is malfunctioning (not returning HTML content) shell script

Monday, March 19th, 2012


One of the company's Debian Lenny 5.0 Webservers, where I'm working as sys admin, sometimes stops properly serving HTTP requests.
Whenever this oddity happens, the Apache server seems to be running okay but it fails to return the requested content.

I can see the webserver listening on port 80 and establishing connections to remote hosts – the apache processes show up normally, as I can see in netstat …:

apache:~# netstat -enp 80
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 xxx.xxx.xxx.xx:80 46.253.9.36:5665 SYN_RECV 0 0 -
tcp 0 0 xxx.xxx.xxx.xx:80 78.157.26.24:5933 SYN_RECV 0 0 -
...

Also the apache forked child processes show normally in process list:

apache:~# ps axuwwf|grep -i apache
root 46748 0.0 0.0 112300 872 pts/1 S+ 18:07 0:00 \_ grep -i apache
root 42530 0.0 0.1 217392 6636 ? Ss Mar14 0:39 /usr/sbin/apache2 -k start
www-data 42535 0.0 0.0 147876 1488 ? S Mar14 0:01 \_ /usr/sbin/apache2 -k start
root 28747 0.0 0.1 218180 4792 ? Sl Mar14 0:00 \_ /usr/sbin/apache2 -k start
www-data 31787 0.0 0.1 219156 5832 ? S Mar14 0:00 | \_ /usr/sbin/apache2 -k start

In spite of that, in any client browser none of the Apache (Virtual Host) websites return any HTML content…
This weird problem continues until the Apache webserver is restarted.
Once the webserver is restarted, everything is back to normal.
I use a Check Apache shell script set up on a few remote hosts to regularly check with nmap whether port 80 (www) of my server is open and responding; however, this script only checks whether the port is open and reachable, and thus it was unable to detect that Apache wasn't returning HTML content.
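For reference, that kind of remote port check boils down to something like the sketch below (the hostname and notification address are examples, and the mail command is assumed to be configured on the monitoring host):

#!/bin/sh
# crude remote check: is port 80 on the webserver open at all?
if ! nmap -p 80 www.your-domain.com | grep -q '80/tcp open'; then
    echo "Port 80 on www.your-domain.com appears closed or filtered" | mail -s "Apache port check failed" admin@your-domain.com
fi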
To work around the malfunctions I wrote a tiny script – restart_apache_if_empty_content.sh.

The script's idea is very simple:
A request is made to a defined remote host with the lynx text browser, then the number of output lines is counted; if the output returned by lynx -dump http://someurl.com contains fewer lines than it normally returns, the script triggers an Apache init script restart.
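A minimal sketch of that logic might look like the following (the URL, line threshold, log file and init script path are assumptions to adapt to your host; it is not the exact script downloadable below):

#!/bin/sh
# restart Apache if the page comes back (nearly) empty
URL='http://www.your-domain.com/'
MIN_LINES=10   # a healthy lynx -dump of this page normally returns more lines than this

LINES=$(lynx -dump "$URL" 2>/dev/null | wc -l)

if [ "$LINES" -lt "$MIN_LINES" ]; then
    echo "$(date): only $LINES lines returned, restarting Apache" >> /var/log/restart_apache_if_empty_content.log
    /etc/init.d/apache2 restart
fi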

I've set the script to run periodically from a cron job, every 5 minutes:
# check if apache returns empty content with lynx and if yes restart and log it
*/5 * * * * /usr/sbin/restart_apache_if_empty_content.sh >/dev/null 2>&1

This is not perfect, as sometimes there will still be a few minutes of downtime, but at least the downtime will not be a few hours until I am informed, ssh to the server and restart Apache manually.

A quick way to download the script and set it up for cron execution every 5 minutes:

apache:~# cd /usr/sbin
apache:/usr/sbin# wget -q http://www.pc-freak.net/bscscr/restart_apache_if_empty_content.sh
apache:/usr/sbin# chmod +x restart_apache_if_empty_content.sh
apache:/usr/sbin# crontab -l > /tmp/file; echo '*/5 * * * * /usr/sbin/restart_apache_if_empty_content.sh >/dev/null 2>&1' >> /tmp/file; crontab /tmp/file

 


How to disable the tidy HTML corrector and validator from outputting error and warning messages

Sunday, March 18th, 2012

I've noticed in /var/log/apache2/error.log on one of the Debian servers I manage a lot of warnings and errors produced by tidy, the HTML syntax checker and reformatter program.

There were actually quite a lot of frequently appearing messages in the log, like:

...
To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 1 column 1 - Info: <head> previously mentioned
line 1 column 1 - Warning: inserting implicit <body>
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML 3.2
4 warnings, 0 errors were found!
...

I did a quick investigation of where these messages are logged from, and discovered a few .php scripts in one of the websites containing the tidy string.
I used the Linux find + grep commands to find the "tidy" string in all php files, like so:

server:~# find . -iname '*.php' -exec grep -rli 'tidy' '{}' \;
./new_design/modules/index.mod.php
./modules/index.mod.php
./modules/index_1.mod.php
./modules/index1.mod.php

Opening the files with vim to check how tidy is invoked revealed calls like:

exec('/usr/bin/tidy -e -ashtml -utf8 '.$tmp_name,$rett);

As you see, the PHP programmers who wrote this website made a big tidy mess. Instead of using PHP5's tidy module, they hard-coded the external tidy command to be invoked via PHP's exec().
This is extremely bad practice, since it spawns the command via a limited pseudo-shell under apache.
I've notified them about the issue, but I don't know when the external tidy calls will be rewritten.

Until the external tidy invocations are rewritten to use the php tidy module, I decided to at least remove the tidy warnings and errors output.

To remove the warning and error messages I've changed:

exec('/usr/bin/tidy -e -ashtml -utf8 '.$tmp_name,$rett);

to:

exec('/usr/bin/tidy --show-warnings no --show-errors no -q -e -ashtml -utf8 '.$tmp_name,$rett);

The meaning of the extra switches is as follows:

-q – instructs tidy to produce quiet output
-e – show only errors and warnings
--show-warnings no and --show-errors no – completely disable warning and error output
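To see the effect of the extra switches before touching the PHP code, you can compare tidy's output on a deliberately broken snippet straight from the shell (flags exactly as used above); the second invocation should stay silent:

echo '<b>unclosed tag' | /usr/bin/tidy -e -ashtml -utf8

echo '<b>unclosed tag' | /usr/bin/tidy --show-warnings no --show-errors no -q -e -ashtml -utf8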

From then on, tidy no longer logs junk messages in error.log. Not logging all these useless warnings and errors has a positive effect on overall server performance, especially when the scripts running /usr/bin/tidy are called as frequently as 1000 times per second or more.


Why and How facebook profits because of you? – People, The real facebook investors

Tuesday, March 6th, 2012


Facebook is usually praised and very seldom criticized. I've already seen, on a couple of occasions on TV news during earthquakes or other calamities, facebook being credited with helping the rescue teams, etc. We constantly hear how facebook has helped people share their location in disastrous situations, or simply helped people organize a protest against a harmful company activity. While this might be true, the harm it does is quite big as well. A primary harm is to the economy as we know it. As people are busy filling Mark Zuckerberg's and the facebook investors' pockets, they rarely think about how facebook actually gets its money.

Let me explain:

Basically, facebook makes money out of its constantly growing social network content. This would not have been possible without the 800 000 000+ people who constantly post updates on facebook, create groups, post pictures, add likes, comment and post links to other facebook pages. If facebook did not have all these volunteers (facebook users) posting this bunch of mostly junky information, facebook inc. would not have a penny. Therefore what makes facebook grow is the people themselves, who willingly choose to be part of this money-making machine. One would think that, as with a regular company, the investors are the owners of the company shares. This classical business model is not facebook's model; there it is rather different, as the real investors in facebook are not the capital shareholders but the regular social network user base – this means you and me!

For all those who still don't get what I'm talking about, I will shortly explain.
Everyone who has a basic idea of how internet advertising works is aware that the primary origin of facebook's profit today is the left-pane skyscraper field with ever-changing advertisements.

Various advertisers pay facebook big money all the time for displaying those advertisements. As many people view and click the advertisements, facebook makes billions out of its advertisers.

So far so good: facebook generates its profits out of people's free time and deliberate information sharing, you would say, and you might argue with me that facebook steals people's time and money. This would have been true if you don't put into the picture, for contrast, a regular blogger who makes his daily living out of blogging.
What a regular blogger does is blog frequently on various topics of his interest. Various bloggers blog under various titles, but most of them have a few major topics which they follow.

The more articles a blogger collects, and the higher the uniqueness of this information, the bigger the probability that this blog will have a good user base, and the more interesting its content will be for search engine robots like the Google Bot crawlers or the Yahoo Bot, etc.
With all of the above said, the higher the probability that this blog will draw more traffic from web searches. As the blogger's content grows with time, once it reaches 10000 or more unique articles (pages) it can consequently be used as an advertising place. A 10000-page blog could earn a person a few hundred euros (200–300 EUR) per month.

Well, the business scheme behind facebook is exactly the same, except that they store and physically own the data of the registered facebook users. A user posts content on his facebook wall, makes pages or does various activities which generate pages; the content gets indexed in Google, and with time the overall facebook website content grows. New users join facebook with the increased popularity of the website, and the website grows exponentially, like in an atomic chain reaction.

Because of this steady content growth, it becomes an interesting place not only for advertisers but for all kinds of people who use the internet.
And there you have the monetization: facebook makes billions of dollars every second because of you. This is the shocking truth: they get their money because people click or view advertisements on each other's profiles. So there you are – YOU make the few people who develop facebook and the original investors richer and richer every day, while you make yourself poorer and poorer by investing your personal time in facebook instead of using it to work on something that could generate you some dividends in the short or long run.

Actually, a social network is nothing more than a multi-user blogging platform, but some marketing person came up with this marketing hype phrase "social network".
The social network buzzword is, in my view, just another big marketing "white lie"! Correct me if I'm wrong, but what in fact is a "social network"? I don't see facebook as either social or a network. I don't know about you, but I have never made a long-lasting friend or relationship using facebook so far. I think the poor facebook creator Zuckerberg made facebook with a viral mindset. He intended it to be like a social virus, and so far he has succeeded pretty much. I just wait eagerly to see who will start the anti-virus for Zuckerberg's (facebook) people-time-eating virus.
 


How to set up Qmail auto reply (Out of the Office), vacation message manually using .qmail message processing file

Tuesday, February 14th, 2012


I had to set up a QMAIL auto reply (Out of the Office) message on 5 email addresses, and since I hadn't done it for a long time it took me about 20 minutes of consulting the Qmail documentation (Life With Qmail, http://lifewithqmail.org – great website!) and reading a couple of online forum threads until I finally remembered how I used to set up a vacation message manually via qmail's .qmail file.

Of course, setting a qmail auto reply can always be done via the QmailAdmin or VQadmin Qmail / Vpopmail web frontends; however, on many Qmail mail servers QmailAdmin and/or VQadmin are absent for some reason, or on big mail servers the server doesn't run Apache at all. Hence it is good to know how to set a qmail vacation message directly via a plain SSH terminal connection, and this is how this article got born.

So here is how I enabled the qmail auto reply "manually", through .qmail, for my email address info@my-email-domain.com:

1. Set up a /home/vpopmail/domains/my-email-domain.com/info/.qmail file with the following content:

| /usr/bin/autorespond 86400 3 /home/vpopmail/domains/my-email-domain.com/info/vacation/message /home/vpopmail/domains/my-email-domain.com/info/vacation

2. Create /home/vpopmail/domains/my-email-domain.com/info/vacation directory

linux:~# mkdir -p /home/vpopmail/domains/my-email-domain.com/info/vacation/

3. Create /home/vpopmail/domains/my-email-domain.com/info/vacation/message file with auto reply message

First create the message file with touch command:

linux:~# touch /home/vpopmail/domains/my-email-domain.com/info/vacation/message

Then, with vim or mcedit etc., put in an auto-reply vacation message similar to the sample below:

From: info@my-email-domain.com
Subject: We have received your message. Thank you!

Dear Customer, we thank you for the interest in our services.
A member of our team will reply promptly to your enquiry shortly.

4. Set proper permissions for vacation/message and .qmail files

The /home/vpopmail/domains/my-email-domain.com/info/vacation/message and /home/vpopmail/domains/my-email-domain.com/info/.qmail files have to be owned by user/group vpopmail:vchkpw, e.g.:

linux:~# chown -R vpopmail:vchkpw /home/vpopmail/domains/my-email-domain.com/info/vacation
linux:~# chown vpopmail:vchkpw /home/vpopmail/domains/my-email-domain.com/info/.qmail

If you are a qmail administrator with the requirement to create auto reply messages for employees going on holiday often (in a middle-sized company office), setting up the out-of-the-office auto reply manually one by one is a time-consuming, annoying and "crazy" task. Therefore some time ago, while still employed at a Bulgarian mid-sized company called Design.BG, I wrote a tiny shell script which creates vacation messages for qmail email users by passing a few arguments.

Here is my create_vpopmail_vacation.sh shell script
Note that this script might have a lot of bugs and is not much tested, so read it carefully and test it before you put it to daily use 😉
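For the curious, a minimal sketch of what such a helper could look like, following the manual steps above (this is a hypothetical illustration and not the original create_vpopmail_vacation.sh):

#!/bin/sh
# usage: ./set_vacation.sh my-email-domain.com info "Subject line" "Vacation body text"
DOMAIN=$1; MBOX=$2; SUBJECT=$3; BODY=$4
BASE="/home/vpopmail/domains/$DOMAIN/$MBOX"

mkdir -p "$BASE/vacation"

# step 3: the auto-reply message
cat > "$BASE/vacation/message" <<EOF
From: $MBOX@$DOMAIN
Subject: $SUBJECT

$BODY
EOF

# step 1: the .qmail file piping mail through autorespond
cat > "$BASE/.qmail" <<EOF
| /usr/bin/autorespond 86400 3 $BASE/vacation/message $BASE/vacation
EOF

# step 4: ownership
chown -R vpopmail:vchkpw "$BASE/vacation" "$BASE/.qmail"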
Happy Hacking! 😉


How to make wordpress Update Plugins prompt to permanently store password / Get rid of annoying updates wordpress prompt

Thursday, February 2nd, 2012

I'm managing a few wordpress installations which require me to type in:
Hostname, FTP Username and FTP Password every single time a plugin update is issued and I want to upgrade to the new version.
Below is a screenshot of this annoying behaviour:

(Screenshot: the WordPress Update Plugins prompt asking for Hostname, FTP Username and FTP Password)

As you can see in the above screenshot, there is no way through the Update Plugins web interface to store the password permanently. Hence the only option to store it permanently is to manually edit wp-config.php (the file located in the wordpress docroot, e.g. /path/to/wordpress/wp-config.php). Inside the file, find the line:

define ('WPLANG', '');

Right after it put a code similar to:

define('FS_METHOD', 'ftpsockets');
define('FTP_BASE', '/path/to/wordpress/');
define('FTP_CONTENT_DIR', '/path/to/wordpress/wp-content/');
define('FTP_PLUGIN_DIR', '/path/to/wordpress/wp-content/plugins/');
define('FTP_USER', 'Username');
define('FTP_PASS', 'Password');
define('FTP_HOST', 'localhost');

Change the above defines as follows:
/path/to/wordpress/ – with your wordpress location directory;
Username and Password – with your respective FTP username and password; leave FTP_HOST as localhost if the FTP server runs on the same machine as the wordpress install.

That's all; from now on the User/Password prompt will not appear anymore. Consider that there is a security downside to storing the FTP User/Pass in wp-config.php: if someone manages to intrude into the wordpress install and access the documentroot of the wordpress install, he will be able to obtain the FTP user/pass and log in to the server directly via the FTP protocol.
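One way to reduce that risk is to tighten the permissions on wp-config.php so that only the web server user can read it (the www-data user/group below is the Debian default, adjust to your distro and path):

chown www-data:www-data /path/to/wordpress/wp-config.php
chmod 600 /path/to/wordpress/wp-config.php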


How to convert html pages to text in console / terminal on GNU / Linux and FreeBSD

Thursday, December 8th, 2011


I'm realizing that the more I'm converting into a fully functional GUI user, the less coding or other interesting stuff I'm doing…
I remembered the old glorious times when I was a full-time console user, and recalled a nifty trick I was so used to back in the day.
Back then I was quite often writing shell scripts which were fetching (html) webpages and converting the html content into plain TEXT (TXT) files.

In order to fetch a page back in the days, I used lynx (a very simple UNIX text browser, which by the way lacks support for any CSS or Javascript) in combination with html2text (an advanced HTML-to-text converter).

Let's say I wanted to fetch my personal home page http://www.pc-freak.net/; I did that via the command:

$ lynx -source http://www.pc-freak.net/ | html2text > pcfreak_page.txt

The content from www.pc-freak.net gets spit out by lynx as html source and passed to html2text, which saves it in the plain text file pcfreak_page.txt.
The somewhat more advanced elinks (a lynx-like alternative character-mode WWW browser) provides better support for HTML and even some CSS and Javascript, so to properly save the content of many pages to a plain text file it is better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source http://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the days I was more used to links than to the superior elinks; nowadays I have both text browsers installed, and testing a fetch of an html page like in the example above, piped to html2text, produced garbled output with links.

Here is the place to say it is not even necessary to have a text browser installed in order to fetch a webpage and convert it to plain text (TXT)! The wget file download tool supports dumping the source as well; for all those who have not (yet) tried it and want to test it:

$ wget -qO- http://www.pc-freak.net | html2text

Anyway, of course, for some pages the text inside HTML tags will not get saved properly with either lynx or elinks, since some text might be embedded in CSS or JavaScript that elinks or lynx do not support. In those cases a GUI browser is useful. You can use the File -> Save As (Text Files) functionality embedded in any browser like Firefox, Epiphany or Opera; below is a screenshot showing an html page which I'm about to save as a plain Text File in Mozilla Firefox:

(Screenshot: saving an html webpage as plain text via File -> Save As in Mozilla Firefox on GNU / Linux / FreeBSD)

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages already existing on the computer's hard drive to plain text (.TXT) format.
One might wonder why one would ever want to do that. Well, I personally prefer reading plain text documents instead of htmls 😉
Converting html files already existing on the hard drive with html2text is done with the cmd:

$ html2text index.html >index.txt

To convert a whole directory full of .html (documentation or whatever) files to plain text .TXT, cd into the directory with the HTMLs and issue the one-liner bash loop command:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done

Now lay back and enjoy reading the dox like in the good old hacker days when .TXT files were fashionable 😉
