Posts Tagged ‘page’

How to convert html pages to text in console / terminal on GNU / Linux and FreeBSD

Thursday, December 8th, 2011

HTML to Plain Text Conversion on GNU / Linux and FreeBSD

I’m realizing that the more I turn into a full-time GUI user, the less coding or other interesting stuff I do…
I remembered the old glorious times when I was a full-time console user, and recalled a nifty trick I was so used to back in the day.
Back then I was quite often writing shell scripts which fetched (HTML) webpages and converted the HTML content into plain text (TXT) files.

In order to fetch a page back in the days I used lynx (a very simple UNIX text browser, which by the way lacks support for any CSS or JavaScript) in combination with html2text (an advanced HTML-to-text converter).

Let’s say I wanted to fetch my personal home page https://www.pc-freak.net/; I did that via the command:

$ lynx -source https://www.pc-freak.net/ | html2text > pcfreak_page.txt

The content of www.pc-freak.net gets spit out by lynx as HTML source and passed to html2text, which saves it in the plain text file pcfreak_page.txt.
The somewhat more advanced elinks (a lynx-like alternative character mode WWW browser) provides better support for HTML and even some CSS and JavaScript, so to properly save the content of many pages in plain text it is better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source https://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the days I used links more than the superior elinks. Nowadays I have both text browsers installed, and testing links to fetch an HTML page like in the example above and pipe it to html2text produced garbled output.

Here is the place to say that it is not even necessary to have a text browser installed in order to fetch a webpage and convert it to plain text (TXT)! The wget file download tool supports source dumps as well; for all those who have not (yet) tried it and want to test it:

$ wget -qO- https://www.pc-freak.net | html2text

Of course, for some pages the text inside HTML tags would not get saved properly with either lynx or elinks, because some of the text might be embedded in CSS or JavaScript that elinks or lynx do not support. In those cases a GUI browser is useful. You can use the File -> Save As (Text Files) functionality embedded in any browser like Firefox, Epiphany or Opera; below is a screenshot showing an HTML page which I’m about to save as a plain text file in Mozilla Firefox:

Firefox / IceWeasel / Opera etc. saving an HTML webpage as plain text on GNU / Linux and FreeBSD

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages already existing on the computer’s hard drive to plain text (.TXT) format.
One might wonder why anyone would ever want to do that. Well, I personally prefer reading plain text documents instead of HTML 😉
Converting an HTML file already existing on the hard drive with html2text is done with the command:

$ html2text index.html > index.txt

To convert a whole directory full of .html (documentation) or whatever files to plain text (.TXT), cd to the directory with the HTMLs and issue the one-liner bash loop:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done
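
If the HTML files are scattered across subdirectories, a find-based variant of the same loop can handle the whole tree; a minimal sketch (the bash parameter expansion ${f%.html} takes the place of a sed call):

$ find html/ -type f -name '*.html' | while read f; do html2text "$f" > "${f%.html}.txt"; done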

Now lie back and enjoy reading the docs like in the good old hacker days when .TXT files were fashionable 😉

How to find out all programs’ bandwidth use with the nethogs top-like utility on Linux

Friday, September 30th, 2011

Just ran across a super nice top-like program for system administrators; it’s called nethogs and is definitely entering my “l337” admin outfit next to tools like iftop, nettop, ettercap, darkstat, htop, iotop etc.

nethogs is ultra easy to use; to immediately get console statistics about the UPLOAD and DOWNLOAD bandwidth consumption of running processes, just run it:

linux:~# nethogs

Nethogs screenshot on Linux Server with Nginx
Nethogs running on Debian GNU/Linux serving static web content with Nginx
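
nethogs can also be told which network interface(s) to watch and how often to refresh its statistics; a small sketch, where eth0 is just an assumed interface name (substitute your own device):

linux:~# nethogs -d 5 eth0

Here -d 5 sets a 5 second refresh delay; multiple interfaces can be listed after the options.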

If you need to check which program is using what amount of network bandwidth, you will definitely love this tool. Bandwidth consumption information is also partially viewable with iftop; however, iftop is unable to attribute the bandwidth consumption to each process using the network, so nethogs seems unique at what it does.

Nethogs supports IPv4 and IPv6 as well as network traffic over PPP. The tool is available via the package repositories of Debian GNU/Linux 5 (Lenny) and Debian 6 (Squeeze).

To install Nethogs on CentOS and Fedora distributions, you will have to install it from source. On CentOS 5.7, the latest nethogs, which as of the time of writing this article is 0.8.0, compiles and installs fine with the make && make install commands.
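
For completeness, here is roughly how the source install goes on CentOS; note that the download URL, tarball version and extracted directory name are assumptions (check the nethogs homepage for the current release), and the libpcap and ncurses development headers need to be present to compile:

[root@centos ~]# yum install gcc-c++ libpcap-devel ncurses-devel
[root@centos ~]# wget http://downloads.sourceforge.net/nethogs/nethogs-0.8.0.tar.gz
[root@centos ~]# tar zxf nethogs-0.8.0.tar.gz && cd nethogs
[root@centos nethogs]# make && make install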

On the topic of network bandwidth monitoring, another very handy tool that adds extra understanding of what kind of traffic is crossing over a Linux server is jnettop.
jnettop shows which hosts/ports are taking up the most network traffic.
It is available for install via apt in Debian 5/6.
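
On Debian that boils down to the usual:

debian:~# apt-get install jnettop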

Here is a screenshot on jnettop in action:

Jnettop check network traffic in console

To install jnettop on recent Fedora / CentOS / Slackware Linux it has to be downloaded and compiled from source via jnettop’s official wiki page.
I’ve tested the jnettop install from source on CentOS release 5.7 and it seems to compile just fine using the usual compile commands:

[root@prizebg jnettop-0.13.0]# ./configure
...
[root@prizebg jnettop-0.13.0]# make
...
[root@prizebg jnettop-0.13.0]# make install

If you need to get an idea of the network traffic passing through your Linux server, distinguished by tcp/udp/icmp network protocols and services like ssh / ftp / apache, then you will definitely want to take a look at nettop (if of course you’re not familiar with it yet).
Nettop is not provided as a deb package in Debian and Ubuntu, whereas it is included as an rpm for CentOS and presumably Fedora.
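
Assuming nettop is carried by one of your configured yum repositories (this is an assumption, e.g. a third-party repository such as RPMForge; check with yum search nettop first), installing it should be as simple as:

linux:~# yum install nettop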
Here is a screenshot on nettop network utility in action:

Nettop server traffic division by protocol screenshot
FreeBSD users should be happy to find out that jnettop and nettop are part of the ports tree, and the two can be installed straight away; however, nethogs would not work on FreeBSD. I searched for a utility capable of what nethogs does, but couldn’t find one.
It seems the only way on FreeBSD to track bandwidth back to the originating process is using a combination of the iftop and sockstat utilities. Probably there are other tools which people use to track network traffic back to the processes running on a host and to do general network monitoring; if anyone knows some good tools, please share them with me.
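
For the FreeBSD folks, a minimal sketch of the ports install; the exact port categories below are assumptions, so locate the ports first with whereis or make search:

freebsd# cd /usr/ports/net/jnettop && make install clean
freebsd# cd /usr/ports/net/nettop && make install clean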

How to convert any internet webpage to PDF from the command line on GNU/Linux

Friday, September 30th, 2011

Linux webpage HTML to PDF command line converter wkhtmltopdf

If you're looking for a command line utility to generate a PDF file out of any webpage located online, wkhtmltopdf is what you want.
The conversion of webpages to PDF by the tool is done using Apple's WebKit open source rendering engine.
wkhtmltopdf is something very useful for web developers, as some websites have a requirement to dynamically produce PDFs from remote website locations.
wkhtmltopdf is shipped with Debian Squeeze 6 and the latest Ubuntu Linux versions, but has still not entered the Fedora and CentOS repositories.

To use wkhtmltopdf on Debian / Ubuntu distros, install it via apt:

linux:~# apt-get install wkhtmltopdf
...

Next, to convert a webpage of choice, use the command:

linux:~$ wkhtmltopdf www.pc-freak.net www.pc-freak.net_website.pdf
Loading page (1/2)
Printing pages (2/2)
Done

If the webpage to be snapshotted is a few pages long, a PDF of a few pages will be generated by wkhtmltopdf.
wkhtmltopdf also supports creating the website snapshot with a specified orientation, Landscape or Portrait, by passing the -O option to it, like so:

linux:~$ wkhtmltopdf -O Portrait www.pc-freak.net www.pc-freak.net_website.pdf

wkhtmltopdf has many useful options, here are some of them:

  • JavaScript disabling – disables JavaScript support for a website
  • Grayscale PDF generation – generates the PDF in grayscale
  • Low quality PDF generation – useful to shrink the size of the generated PDF
  • Setting the PDF page size (A4, Letter etc.)
  • Adding zoom to the generated PDF content
  • Support for HTTP password authentication
  • Support for using the tool over a proxy
  • Generation of a Table of Contents based on titles (only in the static version)
  • Adding of headers and footers (only in the static version)
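
Several of these can be combined in one run; for instance, to produce a grayscale, low quality PDF with JavaScript turned off (the output file name is just an example):

linux:~$ wkhtmltopdf --grayscale --lowquality --disable-javascript www.pc-freak.net www.pc-freak.net_gray.pdf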

To generate an A4 page with wkhtmltopdf:

wkhtmltopdf -s A4 www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf

wkhtmltopdf looks promising but still seems a bit buggy; here is what happened when I tried to create a PDF without setting A4 page formatting:

linux:$ wkhtmltopdf www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf
Loading page (1/2)
OpenOffice path before fixup is '/usr/lib/openoffice' ] 71%
OpenOffice path is '/usr/lib/openoffice'
OpenOffice path before fixup is '/usr/lib/openoffice'
OpenOffice path is '/usr/lib/openoffice'
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
Printing pages (2/2)
Done
Printing pages (2/2)
Segmentation fault

The Debian and Ubuntu version of wkhtmltopdf does not support TOC generation or adding headers and footers; to get these features one has to download and install the static version of wkhtmltopdf.
Using the static version of the tool is also the only option for anyone on Fedora or any other RPM-based Linux distro.
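
A rough sketch of fetching and using the static build follows; the version number, download URL and binary name here are assumptions, so check wkhtmltopdf’s download page for the current static tarball:

linux:~$ wget http://wkhtmltopdf.googlecode.com/files/wkhtmltopdf-0.10.0_rc2-static-amd64.tar.bz2
linux:~$ tar xjf wkhtmltopdf-0.10.0_rc2-static-amd64.tar.bz2
linux:~$ ./wkhtmltopdf-amd64 --toc www.pc-freak.net www.pc-freak.net_toc.pdf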

Getting around “Secure Connection Failed: Peer’s Certificate has been revoked. (Error code: sec_error_revoked_certificate)”

Friday, April 8th, 2011

Certificate has been revoked, sec_error_revoked_certificate screenshot

One of the SSL secured websites (https://) which I recently accessed couldn’t be opened, and an error message showed up:

Secure Connection Failed

An error occurred during a connection to www.domain.com.

Peer’s Certificate has been revoked.

(Error code: sec_error_revoked_certificate)

* The page you are trying to view can not be shown because the authenticity of the received data could not be verified.
* Please contact the web site owners to inform them of this problem. Alternatively, use the command found in the help menu to report this broken site.

That error caught my attention, so I dug further into what the message means. Here is what I found as an explanation of certificate revocation online.

What is SSL certificate revocation

Revocation of a certificate means that the Certificate Authority (CA) that issued the certificate for a website has decided that the certificate is no longer valid, even if it has not expired.

The information about revocation can be distributed in two ways: Certificate Revocation Lists (CRLs), or by using the Online Certificate Status Protocol (OCSP).

CRLs are (usually) large files that contain a list with information about all the currently active (unexpired) certificates that are no longer valid. This file has to be downloaded from the CA by the client at regular intervals (usually at least a week apart), and may be quite large.

OCSP, on the other hand, means that the client asks the CA “Is this particular certificate still valid?”, and the server responds “Yes” or “No”. This method can usually be fairly well up to date, meaning the information is at most a few days old, as opposed to at least a week for CRLs.

All the major browsers support OCSP, but some (like Opera) do not currently support CRLs.

By now most modern browsers (Firefox, Chrome, Opera and Internet Explorer) do support revocation lists, and all of the aforementioned have at least OCSP enabled by default.
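
For the curious, the OCSP dialogue can also be performed by hand with the openssl command line tool; a minimal sketch, where www.domain.com stands in for the site you are checking, and issuer.pem is the CA’s issuing certificate, which has to be obtained separately (e.g. from the CA’s website):

# grab the server certificate
$ openssl s_client -connect www.domain.com:443 < /dev/null 2>/dev/null | openssl x509 > cert.pem
# print the OCSP responder URL advertised in the certificate
$ openssl x509 -in cert.pem -noout -ocsp_uri
# ask the responder whether the certificate is still valid
$ openssl ocsp -issuer issuer.pem -cert cert.pem -url http://ocsp.responder.url/

The second command prints the responder URL, which is what you pass to -url in the third one.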

Why an SSL revocation error might occur:

A CA can revoke a certificate due to a number of reasons:

– A new certificate has been issued to the website, meaning the old one is not going to be used anymore.
– The website with the certificate is being used for purposes that are not accepted by the CA.
– The certificate was issued based on incorrect information.
– The owner is no longer able to use the private key associated with the certificate, for example the password is lost, the key storage was destroyed somehow, etc.
– The private key has been compromised or stolen, which means traffic to the site is no longer secure.
– The certificate and key have been stolen and are actually being used for fraud while posing as a legitimate website …

Now after all above being said the error:

Secure Connection Failed: Peer's Certificate has been revoked. (Error code: sec_error_revoked_certificate)

is a sure indicator that the website with the certificate problem is one you should not trust for money transactions or any operation that directly relates to your personal private data.

However, as there are still websites which use SSL encryption and are entertainment or just news websites, sometimes getting around the SSL revocation issue to check such a website is a necessity.

Therefore, to make your Firefox 3.5 / Iceweasel browser open a website which has an SSL certificate revocation issue, you need to do the following:

Edit -> Preferences -> Advanced -> Encryption -> Validation

After you see the Certificate Validation screen remove the tick set on:

Use the Online Certificate Status Protocol (OCSP) to confirm the current validity of certificates

Now refresh the website and you will skip the certificate revocation error, and the webpage will open up.
Note that even though this will work, it’s not recommended to use this workaround!

Automatic blog post tagging in WordPress 3.1 with auto-tags / WP plugins to increase search engine ranking

Monday, April 4th, 2011

There are plenty of articles on how to increase search engine ranking in WordPress, and I’m sure this article might not be that interesting, but still I thought it might be nice to mention these 3 WordPress plugins: Auto-Tags, SEO Slugs and Platinum SEO Pack, which will help you increase your traffic.

Let me say a few words about each of the 3 plugins:

1. Auto-tags
Below is the description of the plugin directly taken from the plugin website http://wordpress.org/extend/plugins/auto-tag/

This plugin uses the Yahoo.com and tagthe.net APIs to find the most relevant keywords
from the content of your post, and then adds them as tags.
New for version 0.2: an options page allows to choose how many tags are
retrieved from each service The tag adding is fully automatic,
so if you're using a plugin like feedwordpress to display RSS feeds
on your blog as posts, everything will get done as the feed
posts are published. No user intervention necessary!

Here are the installation instructions for auto-tags:

debian:~# cd /var/www/blog/wp-content/plugins
debian:/var/www/wp-content/plugins:# wget https://www.pc-freak.net/files/auto-tag.0.4.6.zip
100%[================================>] 14,325 45.3K/s in 0.3s

2011-04-04 12:30:17 (45.3 KB/s) – `auto-tag.0.4.6.zip’ saved [14325/14325]
debian:/var/www/wp-content/plugins:# unzip auto-tag.0.4.6.zip

In the above example my WordPress installation is in /var/www/blog/; if your WordPress is installed in another directory location, change to the respective directory.

To activate the Plugin go to:

Plugins -> Auto Tags
Press Activate to activate the plugin.

To configure the Auto-tags plugin navigate to:

Settings -> Auto tags plugin

Auto tags Screen options

Therein you can configure the number of post tags to be retrieved from Yahoo and tagthe.net. The settings also allow you to disable certain tags you don’t want to appear in your post tags via the Remove those tags (comma separated) field.

The plugin also has an option called Append tags to the ones that already exist, which on my WordPress 3.1 installation doesn’t work.

After ending up with your desired configuration, simply press the Update Options button.

Now each time you write a new post in your WordPress blog, tags related to the post will automatically be included.
Based on these tags, search engines will easily find content that relates to your blog tags, and thus your page indexing will get better.