Checking your website for broken Links on Linux
with linkchecker and htcheck / How to find broken links on your
website
Have you wondered how could you check your website for broken
links? Cause I did!
You might wonder why should I care so hard about the broken
links?
Well it's simple broken links on your webpage will have an
influence on how Google indexes your website, and how often does it
bother to crawl on your website in other words, having a eagly eye
on your broken links interface is a vital for every self respecting
web developer, as well as for system administrators.
From a web development perspective, it's important that your
website have as less 404 error pages possible since that is
important for Webpages W3C Compliancy.
On other hand if you're a SEO Specialist, having as less broken
links on your domains is vital for Google Pageranking, Yahoo,
Live.com, Altavista, Yandex etc, as well as for general Good Search
Engine Indexing.
Having said all that you should already feel the topic is really
interesting. I believe not many people has wrote stuff about it
online.
That's why I decided to share with you a possible way on how to
track your broken web domain for broken pages on the Linux and
possibly other Unix compatible architectures.
There are plenty of tools available that could be used for finding
out the broken links on your website using Linux operating
system.
I used
apt-get in order to look for a
link
checker software for Linux
noah:/home/hipo/Desktop# apt-cache search 'link check'
htcheck-php - Simple php interface to database generated by
ht://Check
htcheck - Utility for checking web site for dead/external
links
linkchecker - check websites and HTML documents for broken
links
linklint - A fast link checker and web site maintenance
tool
The 2 error link reporting tools I used were:
1.
linkchecker
and
2.
Htcheck
I'll evaluate both of the tools and will share with you my
impressions of the two really valuable, broken link checking tools
for Linux.
Let me begin with a short introduction on what you could expect
from the
linkchecker broken links (error 404)
links.
Here are the linkchecker
Features recursive and multithreaded checking
output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
graph in different formats
HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file
links support
restriction of link checking with regular expression filters for
URLs
proxy support
username/password authorization for HTTP and FTP and Telnet
honors robots.txt exclusion protocol
Cookie support
HTML and CSS syntax check
Antivirus check
a command line interface
a GUI client interface
a (Fast)CGI web interface (requires HTTP server)
Luckily
linchecker has a Debian package port so installing
it comes as easily as executing:
root@noah:~# apt-get install linkchecker
linkchecker-gui
However at the present moment on Debian Sid (Testing/Unstable)
linkchecker-gui is missing some dependencies with
libqt and python
so I was not able to install and test the
Graphic User Interface
for Linkchecker . Anyways here is a screenshot of the
linkchecker GUI interface in order to give you a glipmse on what to
expect if you succeed in installing it on Mac OS X or some other
operating system.
Using linkchecker's command line interface is really straight
forward you just have too invoke the linkchecker command and pipe
it too
the tee shell command
Here is how:
root@noah:~# linkcheker http://pc-freak.net/ | tee -a
pc-freak.net-broken-links-linkchecker.log
Though it's simplicity to use from a first look checking the manual
of linkchecker reveals quite many interesting usage parameters, so
be sure also to take a look at the manual.
Of course it might be wise to combine linkchecker with some bash
scripting in order to pereodically review your website or websites
for broken links.
I intend to do that in the coming days so if I write some script
that uses linkchecker and facilitates the search for a broken links
I'll post it on the blog.
Having said all that linkchecker goody, let me proceed further to
Htcheck
HtCheck is really wondeful and it in a certain sense better than
linkchecker, because it offers some extra possibilities like for
instance generation of reports which could be stored in MySQL and
could be visualized any time via a web browser.
Here is a descrition extracted from HtCheck's website:
ht://Check is more than a link checker. It is a
console application written for Linux systems in C++ and derived
from ht://Dig.
It can retrieve information through HTTP/1.1 and store the
information in a MySQL database, and it is particularly suitable
for
small Internet domains or Intranet.
Its purpose is to help a webmaster manage one or more related
sites: after a "crawl", ht://Check gives back very useful
summaries
and reports, including broken links, anchors not found,
content-types and HTTP status codes summaries, etc.
From version 1.2.3, ht://Check also performs accessibility checks
in accordance with the principles of the University of
Toronto's
Open Accessibility Checks (OAC) project, allowing users to discover
site-wide barriers like images without proper alternatives,
missing titles, etc.
ht://Check can also be used for Web structure analysis, as it
stores information regarding links between HTML
documents.
I have to admit this htcheck-php is really handy! To use the extra
php web interface to htcheck you'll need
the htcheck-php
package installed.
To install both htcheck and it's web interface on Debian you'll
need to issue the command:
root@noah:~# apt-get install htcheck htcheck-php
Now there are few more things to do before you could start using
htcheck.
You'll need to edit
/etc/htcheck/htcheck.conf
There you will need to change at least the
start_url
variable.
Another necessery thing will be to use phpmyadmin or the console
mysql client in order to create the required htcheck username and
password and grant some relevant permissions to the htcheck user in
MySQL.
Yet if you try to execute the htcheck binary (which by the way is
written in C) to generate you will experience a problem with
connecting to mysql's database and you will most likely get the
error message.
noah:/home/hipo# htcheck
Error (1045): Access denied for user 'root'@'localhost' (using
password: NO)
! htcheck: Database error
That really pissed me off but anyways you're lucky that I got it
for you.
This whole issue is well documented in
htcheck's
installation notes which you can read here
If you're lazy reading the whole document just skip and read
The
Htcheck MySQL Connection Settings part
The solutions to the above pointed htcheck problem, where htcheck
could not connect to the database is easily solvable, by creating a
.my.cnf file in your home directory e.g.
~/ .
Let's say you're running with a root user the htcheck, all you need
to do is edit
/root/.my.cnf and place in it:
[client]
host=127.0.0.1
user=htcheck
password=yoursqlpassword
That's it now issue again the
htcheck command again, so
that it could create the proper "htcheck" database (created by
default) and store crawl your website for broken links and
generate and store the reports in your MySQL server.
root@noah:~# htcheck -i
In the above example the "-i" option passed to htcheck will take
care for "htcheck"'s database to be rebuilt, that's necessery
especially if you made any changes in
/etc/htcheck/htcheck.conf after the first time you have
invoked on htcheck and you'd like the new configuration changes to
reflected in the generated reports in MySQL.
However if you run the htcheck tool for a first time, you can start
it without the "-i" flag.
In order to configure htcheck's web reporting interface to be
properly show website crawling statistics you'll have also to edit
/etc/htcheck/global.inc.php and set the username and
password variables according to the ones you have previously choose
while creating the htcheck's MySQL username and password.
As a last step before you could use the htcheck's Web gui interface
through your browser is to either configure a virtualhost for
htcheck in your Apache configuration or simply make an Apache Alias
from your Apache configuration, on Debian, you'll have to edit
/etc/apache2/apache2.conf
Place the following
Apache Alias in order to be able to
access your htcheck's statistics from your default configured
Apache domain name.
Alias /usr/share/htcheck/php/ /htcheck/
In order to load the new Apache configurations as usual you'll need
an Apache WebServer restart
root@noah:~# /etc/init.d/apache2 restart
Here you can take a quick look what to expect from Htcheck's PHP
Web Error 404 reporting interface on a Debian GNU/Linux
System:
Now Enjoy htcheck neat Error page discover tool and it's web
statistics interface!