Convert PDF .pdf to Plain Text .txt files on GNU / Linux and FreeBSD / pdftotext

Friday, 16th November 2012

Convert PDF .pdf to .txt Plain Text on GNU / Linux Redhat, Debian, CentOS, Fedora and FreeBSD with pdftotext poppler-utils

If you need to convert Adobe PDF to Plain Text on Linux or FreeBSD, you will have to take a look at a poppler-utils – (PDF Utilities).

For those who wonder why you need at all a .PDF in .TXT, I can think of at least 4 good reasons. 
 

PDF to text convertion on Linux and other UNIX-es is possible through a set of tools called poppler-utils

poppler-utils is installable on most Linux distributions on Debian Ubuntu based Linux-es it is installable with the usual:

noah:~# apt-get install --yes poppler-utils
....

On Fedora it is available and installable from default repositories with yum

[root@fedora~]# yum -y install poppler-utils 

On Mandriva Linux:
[root@mandriva~] # urpmi poppler
....

On FreeBSD (and possibly other BSDs) you can install via ports or install it from binary with:

freebsd# pkg_add -vr poppler-utils
....

Here is a list of poppler-utils contents from the .deb Debian package, on other distros and BSD the /bin content tools are same.
noah:~ # dpkg -L poppler-utils|grep -i /usr/bin/
/usr/bin/pdftohtml
/usr/bin/pdfinfo
/usr/bin/pdfimages
/usr/bin/pdftops
/usr/bin/pdftoabw
/usr/bin/pdftoppm
/usr/bin/pdffonts
/usr/bin/pdftotext

1. Converting  .pdf to .txt 

Converting whole PDF document to TXT is done with:

$ pdftotext PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt
 
2. Extracting from PDF to Text file only selected pages

 Dumping to .TXT only specific pages from a PDF file: is done through -f and -l arguments (First and Last) pages number.

$ pdftotext -f 3 -l 10 PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt

3. Converting PDF to TXT  protected with password

  $ pdftotext -opw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt

the -opw arguments stand for 'Owner Password'. As suggested by man page -opw will bypass all PDF security restrictions. In PDFs there are file permission password protection as well as user password. 

To remove permissions password protection of file

$ pdftotext -upw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt

 
4. Converting .pdf to .txt and setting type of end of file

Depending on the type of Operating System the TEXT file will be red further, you can set the type of end of lines (for those who don't know it here is the 3 major OSes UNIX, Windows, and MAC end of line codes:

DOS & Windows: rn 0D0A (hex), 13,10 (decimal)
Unix & Mac OS X: n, 0A, 10
Macintosh (OS 9): r, 0D, 13

$ pdftotext -eol unix PeopleWare-Productive_Projects.pdf
PeopleWare-Productive_Projects.txt

The -eol accepts (mac, unix or dos) as options

A bit off topic but very useful thing is to then listen to converted .txt files using festival.

5. Reading .PDF in Linux Text Console and Terminals

$ pdftotext PDF_file_to_Read.pdf -

Btw it is interesting to mention Midnight Commander ( mcview ), component which supports opening .pdf files in console uses pdftotext for extracting PDFs and visualizing in plain text in exactly same way

Well that's it happy convertion.

Share this on:

Download PDFDownload PDF

Tags: , , , , ,

Leave a Reply

CommentLuv badge