Have you wondered how could you check your website for broken links? Cause I did!
You might wonder why should I care so hard about the broken links?
Well it’s simple broken links on your webpage will have an influence on how Google indexesyour website, and how often does it bother to crawl on your website in other words, havinga eagly eye on your broken links interface is a vital for every self respecting web developer,as well as for system administrators.
From a web development perspective, it’s important that your website have as less 404 error pages possible since that is important for Webpages W3C Compliancy.
On other hand if you’re a SEO Specialist, having as less broken links on your domains is vital for Google Pageranking,Yahoo, Live.com, Altavista, Yandex etc, as well as for general Good Search Engine Indexing.
Having said all that you should already feel the topic is really interesting. I believe not many people has wrote stuff aboutit online.
That’s why I decided to share with you a possible way on how to track your broken web domain for broken pages on the Linux and possibly other Unix compatible architectures.
There are plenty of tools available that could be used for finding out the broken linkson your website using Linux operating system.
I used apt-get in order to look for a link checker software for Linux
noah:/home/hipo/Desktop# apt-cache search 'link check'
htcheck-php - Simple php interface to database generated by ht://Check
htcheck - Utility for checking web site for dead/external links
linkchecker - check websites and HTML documents for broken links
linklint - A fast link checker and web site maintenance tool
The 2 error link reporting tools I used were:
1. linkchecker
and
2. Htcheck
I’ll evaluate both of the tools and will share with you my impressions of the two really valuable,broken link checking tools for Linux.
Let me begin with a short introduction on what you could expect from the linkchecker broken links (error 404) links.
Here are the linkchecker Featuresrecursive and multithreaded checking
output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats
HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links support
restriction of link checking with regular expression filters for URLs
proxy support
username/password authorization for HTTP and FTP and Telnet
honors robots.txt exclusion protocol
Cookie support
HTML and CSS syntax check
Antivirus check
a command line interface
a GUI client interface
a (Fast)CGI web interface (requires HTTP server)
Luckily linchecker has a Debian package port so installing it comes as easily as executing:
root@noah:~# apt-get install linkchecker linkchecker-gui
However at the present moment on Debian Sid (Testing/Unstable) linkchecker-gui is missing some dependencies with libqt and python
so I was not able to install and test the Graphic User Interface for Linkchecker .Anyways here is a screenshot of the linkchecker GUI interface in order to give you a glipmse on what to expect if you succeed in installing it on Mac OS X or some other operating system.
Using linkchecker’s command line interface is really straight forward you just have too invoke the linkchecker command and pipe it too the tee shell command
Here is how:
root@noah:~# linkcheker https://www.pc-freak.net/ | tee -a www.pc-freak.net-broken-links-linkchecker.log
Though it’s simplicity to use from a first look checking the manual of linkchecker reveals quite many interesting usage parameters, so be sure also to take a look at the manual.
Of course it might be wise to combine linkchecker with some bash scripting in order to pereodically review your website or websites for broken links.
I intend to do that in the coming days so if I write some script that uses linkchecker and facilitates the search for a broken links I’ll post it on the blog.
Having said all that linkchecker goody, let me proceed further to Htcheck
HtCheck is really wondeful and it in a certain sense better than linkchecker, because it offers some extra possibilities like for instance generation of reports which could be stored in MySQL and could be visualized any time via a web browser.
Here is a descrition extracted from HtCheck’s website:
ht://Check is more than a link checker. It is a console application written for Linux systems in C++ and derived from ht://Dig.
It can retrieve information through HTTP/1.1 and store the information in a MySQL database, and it is particularly suitable for
small Internet domains or Intranet.
Its purpose is to help a webmaster manage one or more related sites: after a “crawl”, ht://Check gives back very useful summaries
and reports, including broken links, anchors not found, content-types and HTTP status codes summaries, etc.
From version 1.2.3, ht://Check also performs accessibility checks in accordance with the principles of the University of Toronto’s
Open Accessibility Checks (OAC) project, allowing users to discover site-wide barriers like images without proper alternatives,
missing titles, etc.
ht://Check can also be used for Web structure analysis, as it stores information regarding links between HTML documents.
I have to admit this htcheck-php is really handy! To use the extra php web interface to htcheck you’ll need the htcheck-php package installed.
To install both htcheck and it’s web interface on Debian you’ll need to issue the command:
root@noah:~# apt-get install htcheck htcheck-php
Now there are few more things to do before you could start using htcheck.
You’ll need to edit /etc/htcheck/htcheck.conf
There you will need to change at least the start_url variable.
Another necessery thing will be to use phpmyadmin or the console mysql client in order to create the required htcheck username and password and grant some relevant permissions to the htcheck user in MySQL.
Yet if you try to execute the htcheck binary (which by the way is written in C) to generate you will experience a problem with connecting to mysql’s database and you will most likely get the error message.
noah:/home/hipo# htcheck
Error (1045): Access denied for user 'root'@'localhost' (using password: NO)
! htcheck: Database error
That really pissed me off but anyways you’re lucky that I got it for you.
This whole issue is well documented in htcheck’s installation notes which you can read here
If you’re lazy reading the whole document just skip and read The Htcheck MySQL Connection Settings part
The solutions to the above pointed htcheck problem, where htcheck could not connect to the database is easily solvable, by creating a .my.cnf file in your home directory e.g. ~/ .
Let’s say you’re running with a root user the htcheck, all you need to do is edit /root/.my.cnf and place in it:
[client]
host=127.0.0.1
user=htcheck
password=yoursqlpassword
That’s it now issue again the htcheck command again, so that it could create the proper “htcheck” database (created by default) and store crawl your website for broken links and generate and store the reports in your MySQL server.
root@noah:~# htcheck -i
In the above example the “-i” option passed to htcheck will take care for “htcheck”‘s database to be rebuilt, that’s necessery especially if you made any changes in /etc/htcheck/htcheck.conf after the first time you have invoked on htcheck and you’d like the new configuration changes to reflected in the generated reports in MySQL.
However if you run the htcheck tool for a first time, you can start it without the “-i” flag.
In order to configure htcheck’s web reporting interface to be properly show website crawling statistics you’ll have also to edit /etc/htcheck/global.inc.php and set the username and password variables according to the ones you have previously choose while creating the htcheck’s MySQL username and password.
As a last step before you could use the htcheck’s Web gui interface through your browser is to either configure a virtualhost for htcheck in your Apache configuration or simply make an Apache Alias from your Apache configuration, on Debian, you’ll have to edit /etc/apache2/apache2.conf
Place the following Apache Alias in order to be able to access your htcheck’s statistics from your default configured Apache domain name.
Alias /usr/share/htcheck/php/ /htcheck/
In order to load the new Apache configurations as usual you’ll need an Apache WebServer restart
root@noah:~# /etc/init.d/apache2 restart
Here you can take a quick look what to expect from Htcheck’s PHP Web Error 404 reporting interface on a Debian GNU/Linux System:
Now Enjoy htcheck neat Error page discover tool and it’s web statistics interface!
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
sorry for the odd post, but I new at this and wanted to say I really like your site.
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Safari/531.2+ Debian/squeeze/sid () Epiphany/2.29.92
There is only one thing I find as a problem with htcheck. It doesn’t include a native support for SSL (the https protocol)! I’ll have to look further online for a web crawler that generates nice web logs and supports this. The good news here is that if you need a quick way to crawl your website, you can still use linkchecker which I’ve just tested and it plays nicely with both http and the https (SSL) protocols.
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Safari/531.2+ Debian/squeeze/sid () Epiphany/2.29.92
Hey man, I’m glad you like it. It’s nothing special though I’m trying to share a valuable stuff and that way probably be helpful to somebody out there.
View CommentView CommentMozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)
Very Good Blog! I also like your clean and smooth design.
View CommentView CommentOpera/9.64(Windows NT 5.1; U; en) Presto/2.1.1
Guter Web Log, aber der Feed funktioniert nicht mit Opera.
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)
Thanks for the good post but do you optimize for Firefox. I was having a little trouble with my browser.
View CommentView CommentMozilla/5.0 (X11; U; Linux x86_64; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Safari/531.2+ Debian/squeeze/sid () Epiphany/2.29.92
Hey I’m happy you like my post.
I personally review my blog with Iceweasel (The Debian Linux variant of Firefox).
What kind of troubles did you experienced with Firefox?
Best
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Thanks I found just the info I already searched everywhere and just couldn’t find. What a perfect site.
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10
wuhuu.. Cool danke.
View CommentView CommentMozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
this article is exactly what i have been looking for! found your page bookmarked by a friend of mine. I’ll also share it. Thanks again!
View CommentView CommentOpera/9.64(Windows NT 5.1; U; en) Presto/2.1.1
You made some good quality points there. I did a look foron the issueand found most people will will be in agreementwith your blog.
View CommentView CommentOpera/9.64(Windows NT 5.1; U; en) Presto/2.1.1
Purple monkey dishwasher
View CommentView CommentMozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
I hate giving away my secrets but I have been using a service called Directory Maximizer for all my submissions for years.
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2
I was doing a search and came across this blog site. I must admit that this article is on point! Keep writing more. I will be reading your posts
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Wow, brilliant post, I was thinking how to do that. and found across your web page from yahoo, tons of brilliant stuff here, now that I’ve got some idea. I’ve bookmarked your blog and also added rss. Please keep us updated….
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2
Pretty insightfull post. Never thought that it was this easy after all. I have spent a a lot of my time looking for someone to explain this matter clearly and you’re the only person that ever did that. I really appreciate it! Keep it up!
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
this was a very entertaining read. i enjoyed it very much!
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Wow that is an very crazy article for me. I like your style of writing. Maybe you should write more articles of these type. By the way, sorry for my bad english 😉
View CommentView CommentMozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)
What I don’t realize is how you’re not even more well-liked than that you are now. You are just so intelligent. You know so significantly about this subject, produced me consider about it from numerous distinct angles. Its like people today arent interested unless it has a thing to complete with Lady Gaga! Your stuffs good. Preserve it up!
View CommentView CommentMozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Really good post.. thanks.. 🙂
View CommentView Comment