If you need to convert Adobe PDF to plain text on Linux or FreeBSD, take a look at poppler-utils (PDF utilities).
For those who wonder why you would need a .PDF as .TXT at all, I can think of at least 4 good reasons:
- First, plain text takes much less disk space
- Second, it is convenient to read .txt from the text console
- Third, having the .PDF as .TXT lets you easily pass the text to festival and listen to your computer reading it for you. Many people find they learn and remember more easily when they listen
- Compatibility: plain text files do not need a special reading program and are supported on virtually all old and modern OSes
PDF to text conversion on Linux and other UNIX-es is possible through a set of tools called poppler-utils.
poppler-utils is installable on most Linux distributions. On Debian and Ubuntu based Linux-es it is installable with the usual:
noah:~# apt-get install --yes poppler-utils
On Fedora it is available and installable from the default repositories with yum:
[root@fedora~]# yum -y install poppler-utils
On Mandriva Linux:
[root@mandriva~] # urpmi poppler
On FreeBSD (and possibly other BSDs) you can install it via ports or from a binary package with pkg_add (newer FreeBSD releases that use pkg need pkg install poppler-utils instead):
freebsd# pkg_add -vr poppler-utils
The following command lists the poppler-utils tools shipped in the .deb Debian package; on other distros and BSD the tools installed in /usr/bin are the same.
noah:~ # dpkg -L poppler-utils|grep -i /usr/bin/
1. Converting .pdf to .txt
Converting a whole PDF document to TXT is done with:
$ pdftotext PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt
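If you have a whole directory of PDFs, the same command can be wrapped in a loop. Below is a minimal sketch; the function name pdf_dir_to_txt and the PDFTOTEXT variable are my own assumptions for illustration and are not part of poppler-utils.

```shell
#!/bin/sh
# PDFTOTEXT is an assumption of this sketch: it lets you point at another
# binary (or a stub) instead of the pdftotext found in PATH.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

# Convert every .pdf in a directory to a .txt file with the same base name.
pdf_dir_to_txt() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        txt="${pdf%.pdf}.txt"
        "$PDFTOTEXT" "$pdf" "$txt" && echo "converted: $pdf -> $txt"
    done
}
```

Calling pdf_dir_to_txt ~/books would then leave a .txt next to each .pdf in that directory.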
2. Extracting only selected pages from PDF to a text file
Dumping only specific pages from a PDF file to .TXT is done through the -f and -l arguments (first and last page numbers):
$ pdftotext -f 3 -l 10 PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt
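The -f and -l values can also be computed. As a rough sketch, here is how the last N pages could be dumped, using pdfinfo (also shipped with poppler-utils) to get the page count. The helper names pdf_pages and pdf_tail_to_txt are hypothetical, and the sketch assumes pdfinfo prints a "Pages:" line.

```shell
#!/bin/sh
# Print the page count by parsing the "Pages:" line of pdfinfo output.
pdf_pages() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Dump only the last N pages of a PDF to a text file (hypothetical helper).
pdf_tail_to_txt() {
    pdf="$1"; n="$2"; txt="$3"
    total=$(pdf_pages "$pdf")
    [ -n "$total" ] || return 1          # no page count, give up
    first=$(( total - n + 1 ))
    [ "$first" -ge 1 ] || first=1        # asked for more pages than exist
    pdftotext -f "$first" -l "$total" "$pdf" "$txt"
}
```

For example, pdf_tail_to_txt PeopleWare-Productive_Projects.pdf 3 last-pages.txt would extract the final three pages.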
3. Converting a password-protected PDF to TXT
$ pdftotext -opw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt
The -opw argument stands for 'Owner Password'. As the man page suggests, supplying the owner password bypasses all PDF security restrictions. PDFs can carry two passwords: an owner password, which guards the file permissions, and a user password, which is needed to open the file.
If the file is protected with a user password, pass it with -upw instead:
$ pdftotext -upw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt
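If you are not sure which of the two passwords you hold, one option is to try it as owner password first and fall back to user password. This is only a sketch: the function name pdf_unlock_to_txt is hypothetical, and it assumes pdftotext exits with a non-zero status when the password is rejected, as its documented exit codes suggest.

```shell
#!/bin/sh
# Try the given password as owner password first, then as user password.
# Assumption: pdftotext returns non-zero when it cannot open the file.
pdf_unlock_to_txt() {
    pw="$1"; pdf="$2"; txt="$3"
    pdftotext -opw "$pw" "$pdf" "$txt" 2>/dev/null ||
        pdftotext -upw "$pw" "$pdf" "$txt"
}
```

Keep in mind that a password given on the command line ends up in your shell history and is visible in the process list while the command runs.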
4. Converting .pdf to .txt and setting the type of end of line
Depending on the Operating System the text file will be read on later, you can set the type of end of lines. For those who don't know them, here are the end of line codes of the 3 major OSes (UNIX, Windows and Mac):
DOS & Windows: \r\n, 0D0A (hex), 13,10 (decimal)
Unix & Mac OS X: \n, 0A, 10
Macintosh (OS 9): \r, 0D, 13
$ pdftotext -eol unix PeopleWare-Productive_Projects.pdf
The -eol option accepts mac, unix or dos as values.
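To check which end of line type a converted file actually ended up with, you can count the CR and LF bytes in it. The function below is a small sketch of that idea; detect_eol is a name I made up, and it only tells apart the three cases from the table above.

```shell
#!/bin/sh
# Report the end of line convention of a text file: dos, unix or mac.
detect_eol() {
    # Count LF (\n) and CR (\r) bytes; $(( )) strips any padding from wc.
    lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))
    cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))
    if [ "$cr" -eq 0 ]; then
        echo unix        # no CR at all
    elif [ "$lf" -eq 0 ]; then
        echo mac         # CR only, classic Mac OS
    else
        echo dos         # both CR and LF present
    fi
}
```

Running detect_eol PeopleWare-Productive_Projects.txt after the conversion prints one of the three words.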
A bit off topic, but a very useful thing is to then listen to the converted .txt files using festival.
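As a rough sketch of how that could look, the text can be piped straight into festival without creating a .txt file first. The function name pdf_speak is hypothetical; it relies on pdftotext writing to stdout when the output file is "-" and on festival's --tts mode reading text from standard input.

```shell
#!/bin/sh
# Read a PDF aloud: extract the text to stdout and feed it to festival.
pdf_speak() {
    pdftotext "$1" - | festival --tts
}
```

With an already converted file, festival --tts PeopleWare-Productive_Projects.txt does the same job.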
5. Reading .PDF in Linux Text Console and Terminals
$ pdftotext PDF_file_to_Read.pdf -
Btw it is interesting to mention that the Midnight Commander viewer component (mcview), which supports opening .pdf files in the console, uses pdftotext to extract the PDF and show it as plain text in exactly the same way.
Well, that's it, happy conversion.
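Since the text goes to standard output, it can be piped into any other console tool, for example less for paging or grep for searching. Here is a minimal sketch of a search helper; pdf_grep is a name I picked for illustration.

```shell
#!/bin/sh
# Search a PDF for a pattern (case-insensitive), printing matching lines
# prefixed with their line number in the extracted text.
pdf_grep() {
    pattern="$1"; pdf="$2"
    pdftotext "$pdf" - | grep -n -i -- "$pattern"
}
```

For plain reading, pdftotext PDF_file_to_Read.pdf - | less works the same way.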