Posts Tagged ‘plain’

Convert PDF .pdf to Plain Text .txt files on GNU / Linux and FreeBSD / pdftotext

Friday, November 16th, 2012

Convert PDF .pdf to .txt Plain Text on GNU / Linux Redhat, Debian, CentOS, Fedora and FreeBSD with pdftotext poppler-utils

If you need to convert Adobe PDF to Plain Text on Linux or FreeBSD, you will have to take a look at a poppler-utils – (PDF Utilities).

For those who wonder why you need at all a .PDF in .TXT, I can think of at least 4 good reasons. 
 

PDF to text convertion on Linux and other UNIX-es is possible through a set of tools called poppler-utils

poppler-utils is installable on most Linux distributions on Debian Ubuntu based Linux-es it is installable with the usual:

noah:~# apt-get install --yes poppler-utils
....

On Fedora it is available and installable from default repositories with yum

[root@fedora~]# yum -y install poppler-utils 

On Mandriva Linux:
[root@mandriva~] # urpmi poppler
....

On FreeBSD (and possibly other BSDs) you can install via ports or install it from binary with:

freebsd# pkg_add -vr poppler-utils
....

Here is a list of poppler-utils contents from the .deb Debian package, on other distros and BSD the /bin content tools are same.
noah:~ # dpkg -L poppler-utils|grep -i /usr/bin/
/usr/bin/pdftohtml
/usr/bin/pdfinfo
/usr/bin/pdfimages
/usr/bin/pdftops
/usr/bin/pdftoabw
/usr/bin/pdftoppm
/usr/bin/pdffonts
/usr/bin/pdftotext

1. Converting  .pdf to .txt 

Converting whole PDF document to TXT is done with:

$ pdftotext PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt
 
2. Extracting from PDF to Text file only selected pages

 Dumping to .TXT only specific pages from a PDF file: is done through -f and -l arguments (First and Last) pages number.

$ pdftotext -f 3 -l 10 PeopleWare-Productive_Projects.pdf PeopleWare-Productive_Projects.txt

3. Converting PDF to TXT  protected with password

  $ pdftotext -opw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt

the -opw arguments stand for 'Owner Password'. As suggested by man page -opw will bypass all PDF security restrictions. In PDFs there are file permission password protection as well as user password. 

To remove permissions password protection of file

$ pdftotext -upw 'Password' Password-protected-file.pdf Unprotected-file-dump.txt

 
4. Converting .pdf to .txt and setting type of end of file

Depending on the type of Operating System the TEXT file will be red further, you can set the type of end of lines (for those who don't know it here is the 3 major OSes UNIX, Windows, and MAC end of line codes:

DOS & Windows: \r\n 0D0A (hex), 13,10 (decimal)
Unix & Mac OS X: \n, 0A, 10
Macintosh (OS 9): \r, 0D, 13

$ pdftotext -eol unix PeopleWare-Productive_Projects.pdf
PeopleWare-Productive_Projects.txt

The -eol accepts (mac, unix or dos) as options

A bit off topic but very useful thing is to then listen to converted .txt files using festival.

5. Reading .PDF in Linux Text Console and Terminals

$ pdftotext PDF_file_to_Read.pdf -

Btw it is interesting to mention Midnight Commander ( mcview ), component which supports opening .pdf files in console uses pdftotext for extracting PDFs and visualizing in plain text in exactly same way

Well that's it happy convertion.

Convert doc files to plain text (txt) in terminal / console (tty) on GNU / Linux

Sunday, September 20th, 2009

I was looking for a way to convert Microsoft .doc files to plain text (txt) in Linux directly through terminal.
After some lookup in Google Groups I found ANTIWORD! .
Luckily Debian comes even with a package containing the nice nifty program.
Here is the description of antiword – Converts MS Word files to text, PS and PDF
Fun, desciprtion Eh? 🙂 Ain’t it?
There are some other ways to Convert doc files to plain text, for instance you could use the command catdoc , for example to convert simple .doc to .txt file usecatdoc -a whitepaper.doc.
Another way to convert .doc files to .txt mostly used by developers is via the wvware (nothing to do with vmware!:)) utility.
wvware could directly convert it to html. For example:
wvWare file.doc >file.htmlor
wvText file.doc file.txt
.A lot of things I’ll skip here are well explained in the article Viewing Word files at the command line .
END—–