How to install FuzzyOcr on Debian 5.0 (Lenny) /
FuzzyOcr install tutorial on Debian Linux
Recently I had to install FuzzyOcr on Debian Lenny in order to
reduce the amount of "image spam" delivered to end users. Since
there is no official install tutorial for Debian users, I decided to
write this one in the hope it might be useful to others.
Here are a few lines that explain what FuzzyOcr is:
FuzzyOcr is a plugin for SpamAssassin which is aimed at
unsolicited bulk mail (also known as "Spam") containing images as
the main content carrier. Using different methods, it analyzes the
content and properties of images to distinguish between normal
mails (Ham) and spam mails. The methods mainly are:
I won't go into more detail here. Instead, here are the concrete
packages and configuration I used to get the software up and
running.
1. Install the required Debian packages
debian-server# apt-get install netpbm gocr giftext giflib-tools libungif-bin \
libpng3 libungif4g gifsicle ocrad \
libstring-approx-perl libmldbm-perl libmldbm-sync-perl \
liblog-agent-perl libpng12-dev libtiff4-dev libsvga1-dev libx11-dev
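After the packages are in, a quick way to see whether the helper programs are really on the PATH is a small loop around "command -v". This is only a sketch: the function name missing_tools is mine, and the tool names in the example call are an assumption based on the package names above, not an official list of what FuzzyOcr needs.

```shell
#!/bin/sh
# Print the name of every given program that is NOT found on the PATH.
# missing_tools is a hypothetical helper name, not part of FuzzyOcr.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (tool names are assumptions taken from the package list):
# missing_tools gocr ocrad gifsicle giftext
```

If the function prints nothing, every program you asked about was found.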
2. Download and unpack the latest version of FuzzyOcr
debian-server# wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.6.0.tar.gz
debian-server# tar -zxvf fuzzyocr-3.6.0.tar.gz
debian-server# cd FuzzyOcr-3.6.0/
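The tarball name and the directory it unpacks to do not have to match (later on this tutorial changes into FuzzyOcr-3.6.0/, while the archive is named in lower case). A small sketch like the one below can tell you the top-level directory before you unpack. The function name archive_topdir is my own, and it assumes a gzip-compressed tar whose first entry starts with the directory name.

```shell
#!/bin/sh
# Print the top-level directory of a gzip-compressed tar archive.
# archive_topdir is a hypothetical helper; it only looks at the first entry.
archive_topdir() {
    # List the archive and cut the first entry at its first slash.
    tar -tzf "$1" | sed -n '1s|/.*||p'
}

# Example call:
# cd "$(archive_topdir fuzzyocr-3.6.0.tar.gz)"
```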
3. Copy the FuzzyOcr configuration and plugin files from the
unpacked source directory to /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.scansets /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.preps /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.pm /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr/ /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.cf /etc/mail/spamassassin/
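If you prefer one command over five, the copy step can be wrapped in a loop that also complains when a file is missing (for example because you are in the wrong directory). This is a minimal sketch under my own naming: install_focr_files is not something FuzzyOcr ships, and the file list is simply the one from this step.

```shell
#!/bin/sh
# Copy the FuzzyOcr files named in this step from src to dest.
# install_focr_files is a hypothetical helper; it stops at the first missing file.
install_focr_files() {
    src=$1
    dest=$2
    for item in FuzzyOcr.scansets FuzzyOcr.preps FuzzyOcr.pm FuzzyOcr FuzzyOcr.cf; do
        if [ ! -e "$src/$item" ]; then
            echo "missing: $src/$item" >&2
            return 1
        fi
        # Same cp flags as the manual commands above.
        cp -rpf "$src/$item" "$dest/"
    done
}

# Example call, run from the unpacked source directory:
# install_focr_files . /etc/mail/spamassassin
```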
4. Create the log file and the database files FuzzyOcr needs for
image hashing
debian-server# touch /var/log/qmail/FuzzyOcr.log
debian-server# chown vpopmail:vchkpw /var/log/qmail/FuzzyOcr.log
debian-server# touch /etc/mail/spamassassin/FuzzyOcr.db
debian-server# chown vpopmail:vchkpw /etc/mail/spamassassin/FuzzyOcr.db
debian-server# touch /etc/mail/spamassassin/FuzzyOcr.safe.db
debian-server# chown vpopmail:vchkpw /etc/mail/spamassassin/FuzzyOcr.safe.db
These paths and the vpopmail:vchkpw owner match a qmail + vpopmail
setup. On a different mail setup, use the user that SpamAssassin
runs as.
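The touch/chown pairs above repeat the same pattern three times, so they can be folded into one helper. Again this is just a sketch: make_owned_file is a name I made up, and the owner argument is optional so you can try it out as a non-root user first.

```shell
#!/bin/sh
# Create an empty file if needed and optionally hand it to an owner.
# make_owned_file is a hypothetical helper; owner is "user:group" or empty.
make_owned_file() {
    path=$1
    owner=${2:-}
    touch "$path" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$path"
    fi
}

# Example use with the paths and owner from this step:
# for f in /var/log/qmail/FuzzyOcr.log \
#          /etc/mail/spamassassin/FuzzyOcr.db \
#          /etc/mail/spamassassin/FuzzyOcr.safe.db; do
#     make_owned_file "$f" vpopmail:vchkpw
# done
```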
5. Edit the FuzzyOcr configuration file
debian-server# vim /etc/mail/spamassassin/FuzzyOcr.cf
Put the following directives there:
focr_enable_image_hashing 2
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
focr_db_max_days 15
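If you would rather script this edit than do it in vim, the sketch below sets a directive without creating duplicates when it is run twice. set_directive is a hypothetical helper of mine, and it assumes the simple "name value" line format shown above and values that contain no "|" character.

```shell
#!/bin/sh
# Set "key value" in a config file: replace an existing line or append a new one.
# set_directive is a hypothetical helper, not part of FuzzyOcr or SpamAssassin.
set_directive() {
    file=$1
    key=$2
    value=$3
    if grep -q "^[[:space:]]*$key[[:space:]]" "$file" 2>/dev/null; then
        # Rewrite the existing line; "|" is the sed delimiter because values contain "/".
        sed "s|^[[:space:]]*$key[[:space:]].*|$key $value|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Example calls with the directives from this step:
# set_directive /etc/mail/spamassassin/FuzzyOcr.cf focr_enable_image_hashing 2
# set_directive /etc/mail/spamassassin/FuzzyOcr.cf focr_db_max_days 15
```

Note that this only touches lines that are not commented out, so a commented default in the shipped FuzzyOcr.cf stays as it is.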
A few more things need to be done before the install is complete.
First, netpbm has to be compiled from source, because three of the
executables FuzzyOcr requires are for some reason not bundled with
the Debian Lenny netpbm package.
Download and untar the latest version of netpbm:
debian-server# links "http://downloads.sourceforge.net/project/netpbm/super_stable/10.35.73/netpbm-10.35.73.tgz?use_mirror=sunet"
debian-server# tar -zxvvf netpbm-10.35.73.tgz
The following "hack" is needed for the source to compile properly:
debian-server# mkdir -p /usr/X11R6/lib
debian-server# ln -sf /usr/lib/libX11.so /usr/X11R6/lib/libX11.so
Next, compile the netpbm source and install it:
debian-server# cd netpbm-10.35.73
debian-server# make && make install
If the build fails during "make", use the apt-file program to find
out which Debian package contains the missing header files that
caused the failure.
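To make the apt-file hunt less tedious, you can save the build output and pull the missing header names out of it. The sketch below assumes gcc-style messages ending in "No such file or directory". Both the function name missing_headers and the log file name make.log are hypothetical choices of mine.

```shell
#!/bin/sh
# Print the header files that a saved build log reports as missing.
# Assumes gcc-style lines: "...error: <header>: No such file or directory".
# missing_headers is a hypothetical helper name.
missing_headers() {
    sed -n 's/.*error: \(.*\): No such file or directory.*/\1/p' "$1" |
        LC_ALL=C sort -u
}

# Example use (make.log is a hypothetical file name):
# make 2>&1 | tee make.log
# apt-file update
# for h in $(missing_headers make.log); do apt-file search "$h"; done
```

Install the -dev package that apt-file reports for each header, then run "make" again.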
Next comes the installation of tesseract. Tesseract is one of the
best open source OCR engines available nowadays, so we download and
install it:
debian-server# wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
debian-server# tar -zxvvf tesseract-2.04.tar.gz
debian-server# cd tesseract-2.04
debian-server# ./configure && make && make install
To load FuzzyOcr in SpamAssassin, SpamAssassin has to be restarted:
debian-server# /etc/init.d/spamassassin restart
Note: If you run SpamAssassin under djb daemontools, restart it
with the svc command instead.
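Before restarting, it can be worth letting SpamAssassin check its own configuration, so that a typo in FuzzyOcr.cf does not take the spam filter down. The sketch below relies on "spamassassin --lint", which checks the rules and configuration files. I am assuming it exits non-zero when it finds problems, and the LINT_CMD and RESTART_CMD variables are hypothetical overrides I added so you can plug in your own commands (for example an svc call under daemontools).

```shell
#!/bin/sh
# Restart SpamAssassin only if its configuration check passes.
# LINT_CMD and RESTART_CMD are hypothetical override variables for this sketch.
restart_if_lint_ok() {
    lint_cmd=${LINT_CMD:-spamassassin --lint}
    restart_cmd=${RESTART_CMD:-/etc/init.d/spamassassin restart}
    # Intentionally unquoted so that "command argument" splits into words.
    if $lint_cmd; then
        $restart_cmd
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}

# Example call with the defaults:
# restart_if_lint_ok
```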
The last thing to do is check that FuzzyOcr is loaded correctly and
scans new messages for image spam. Here is how:
Change back to your FuzzyOcr-3.6.0/ directory:
debian-server# cd FuzzyOcr-3.6.0/
debian-server# cd samples
debian-server# spamassassin --debug FuzzyOcr < ocr-animated.eml >/dev/null
Look at the lines related to FuzzyOcr. The output should include
some lines reporting that FuzzyOcr has found spam in the
ocr-animated.eml file.
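The debug output can be long, so it helps to keep only the FuzzyOcr lines. The sketch below is a plain case-insensitive grep wrapped in a function. The name focr_debug_lines is mine, and it assumes the debug messages arrive on standard error, which is why the example call redirects stderr into the pipe.

```shell
#!/bin/sh
# Keep only the lines that mention FuzzyOcr (case-insensitive) from standard input.
# focr_debug_lines is a hypothetical helper name.
focr_debug_lines() {
    grep -i 'fuzzyocr'
}

# Example call: send the debug messages (stderr) into the filter and
# throw away the rewritten message (stdout):
# spamassassin --debug FuzzyOcr < ocr-animated.eml 2>&1 >/dev/null | focr_debug_lines
```

If nothing is printed at all, the plugin was most likely not loaded.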
Another way to see what SpamAssassin is doing is to run it with
full debugging:
debian-server# spamassassin -D
This prints debug information from all of SpamAssassin, not only
FuzzyOcr, while it processes a message read from standard input.
This article is pretty much in a beta stage, so I'll be glad to get
any feedback that helps me improve it!
Thanks for reading!