PDF Archives - ☩ Walking in Light with Christ - Faith, Computing, Diary ☩ Walking in Light with Christ

Posts Tagged ‘PDF’

How to extract / export Images from .docx Word Document on Linux / BSD

Monday, April 11th, 2016

If you're a Linux user and have to script the extraction of all images / pictures from a Microsoft Word .docx document (for a website, for example), you'll probably wonder whether there is a Linux command-line tool that can do it.

Being able to unpack a .docx on Linux / UNIX is really handy, both for websites and for scripting in general.

The good news is that the MS .docx format is simply a ZIP archive, so you can unzip it straight away and pick up all the bundled graphic files (.JPG / .PNG / .GIF or whatever).

To list the content of a sample .docx on Linux you will need the unzip command-line tool. If you haven't installed it yet, install it with yum (on RHEL / CentOS / Fedora):

[root@centos ~]:# yum -y install unzip
…

Or with apt-get (on Debian / Ubuntu / Mint):

debian-linux:~# apt-get install --yes unzip

Once it is installed, list the contents of the .docx MS Word file:

debian-linux:~# unzip -l your-file-name-of-choice.docx

Archive: your-file-name-of-choice.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     4333 1980-01-01 00:00   [Content_Types].xml
      737 1980-01-01 00:00   _rels/.rels
     4117 1980-01-01 00:00   word/_rels/document.xml.rels
   462177 1980-01-01 00:00   word/document.xml
     2984 1980-01-01 00:00   word/footer1.xml
     1487 1980-01-01 00:00   word/header2.xml
     1351 1980-01-01 00:00   word/header3.xml
     1556 1980-01-01 00:00   word/header4.xml
     1756 1980-01-01 00:00   word/header1.xml
     2390 1980-01-01 00:00   word/footer2.xml
     1432 1980-01-01 00:00   word/footer3.xml
     1629 1980-01-01 00:00   word/footnotes.xml
     1623 1980-01-01 00:00   word/endnotes.xml
     1449 1980-01-01 00:00   word/header5.xml
   306540 1980-01-01 00:00   word/media/image5.jpeg
     5564 1980-01-01 00:00   word/media/image2.png
     5593 1980-01-01 00:00   word/media/image4.png
     8050 1980-01-01 00:00   word/media/image3.png
     6992 1980-01-01 00:00   word/theme/theme1.xml
     5537 1980-01-01 00:00   word/media/image1.png
      685 1980-01-01 00:00   word/glossary/_rels/document.xml.rels
    10300 1980-01-01 00:00   word/glossary/document.xml
    12341 1980-01-01 00:00   word/settings.xml
     3390 1980-01-01 00:00   word/glossary/settings.xml
      677 1980-01-01 00:00   docProps/core.xml
     1380 1980-01-01 00:00   docProps/custom.xml
     1000 1980-01-01 00:00   docProps/app.xml
      335 1980-01-01 00:00   customXml/itemProps2.xml
      296 1980-01-01 00:00   customXml/_rels/item4.xml.rels
      296 1980-01-01 00:00   customXml/_rels/item3.xml.rels
….

As you can see from the output above, the media files of a .docx are stored under the "word/media/" folder.
Therefore, to pick up only the picture files instead of extracting all the .xml and .rels files from the .docx:

debian-linux:~# unzip your-file-name-of-choice.docx "word/media/*"

Archive: your-file-name-of-choice.docx
extracting: word/media/image5.jpeg
extracting: word/media/image2.png
extracting: word/media/image4.png
extracting: word/media/image3.png
extracting: word/media/image1.png
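
The extraction above recreates the word/media/ directory tree. If you would rather have the pictures land flat in a single directory, something like the following sketch might do. The name extract_docx_images is made up for this example, and it assumes the Info-ZIP unzip installed earlier (its -j, -o and -d options are used):

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): copy every file stored
# under word/media/ inside a .docx into one flat directory.
extract_docx_images() {
    docx="$1"                 # path to the .docx file
    outdir="${2:-images}"     # destination directory, defaults to ./images

    if [ ! -f "$docx" ]; then
        echo "no such file: $docx" >&2
        return 1
    fi
    mkdir -p "$outdir" || return 1

    # -j drops the word/media/ path prefix, -o overwrites without asking,
    # -d selects the directory to extract into.
    unzip -j -o "$docx" 'word/media/*' -d "$outdir"
}
```

Called as extract_docx_images your-file-name-of-choice.docx pics, it should leave image1.png, image2.png and so on directly inside pics/.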

In case you need only files of a specific format from the Word .docx document, issue:

[root@centos ~:]# unzip your-file-name-of-choice.docx "*.jpeg"
…

Or, if you just need the .xml files:

[root@centos ~:]# unzip your-file-name-of-choice.docx "*.xml"

If you need to extract pictures from the older .doc (2003) MS file format, you will first need to convert the .doc file to .docx; then you can use unzip to extract the files you need.

Unfortunately I'm not aware of a tool to convert .doc to .docx; if somebody knows one, please share it in a comment.
Perhaps it is possible with unoconv or abiword.

The closest thing I know is how to convert .doc Word document to PDF:

abiword --to=pdf filename.doc
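
For the .doc to .docx step itself, one route that may work is LibreOffice's headless converter. This is only a sketch under assumptions: LibreOffice has to be installed, its launcher is assumed to be on the PATH as soffice (some distributions call it libreoffice), and doc_to_docx is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch, assuming LibreOffice is installed and reachable as "soffice".
# doc_to_docx is a made-up helper name, not part of any package.
doc_to_docx() {
    doc="$1"            # input .doc file
    outdir="${2:-.}"    # directory where the .docx should be written

    if [ ! -f "$doc" ]; then
        echo "no such file: $doc" >&2
        return 1
    fi

    # --headless runs without a GUI, --convert-to picks the target format,
    # --outdir sets where the converted file is saved.
    soffice --headless --convert-to docx --outdir "$outdir" "$doc"
}
```

After doc_to_docx filename.doc converted, the resulting converted/filename.docx can be fed to unzip as shown earlier.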


How to password encrypt / decrypt files on Linux to keep and pass your data private

Thursday, August 7th, 2014

If you have sensitive data, like a scanned copy of your ID card, driving license, birth certificate or marriage certificate, or some revolutionary business idea or technology, and you want to transfer it over some kind of network, let's say the Internet via a public unencrypted e-mail service (Gmail.com / Yahoo Mail / Mail.com / the Bulgarian Abv.bg etc.), you will certainly want to send the file in encrypted form. That protects it from someone sniffing your network and from anyone with administrative access to the servers of the free mail provider where your mail data is stored.

Transferring files in encrypted form has become very important these days, especially after the recent Edward Snowden disclosures about the American mass surveillance program PRISM. For those who haven't yet heard of PRISM: it is a program of America's NSA (National Security Agency) aiming to sniff and log everyone's information transferred in digital form via the Internet, and even mobile phone conversations…

The first step to mitigate surveillance is to use a fully free software (100% free software) OS distribution like Trisquel GNU / Linux.
The second is to use encryption: the process of transforming information (referred to as plaintext) using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key.
There are many ways to encrypt your data on Linux and later decrypt it. I've earlier blogged about encrypting files with GPG and OpenSSL on Linux; however, encryption with GPG and OpenSSL is newer as a concept than the old-school way to encrypt files on UNIX with the crypt command, which on Linux is replaced by the mcrypt command.

mcrypt is provided by the mcrypt package on most if not all Linux distributions, but it is not installed by default, so to start using it you have to install it first.

1. Install mcrypt on Debian / Ubuntu / Mint (deb based) Linux

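apt-get install --yes mcrypt

2. Install mcrypt on Fedora / CentOS and other RPM based Linux

yum -y install mcrypt

(libmcrypt is only the library; the mcrypt command itself is in the mcrypt package, which on CentOS / RHEL is usually found in the EPEL repository.)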

3. Encrypting file with mcrypt

To list all the algorithms supported by mcrypt:

mcrypt --list
cast-128 (16): cbc cfb ctr ecb ncfb nofb ofb
gost (32): cbc cfb ctr ecb ncfb nofb ofb
rijndael-128 (32): cbc cfb ctr ecb ncfb nofb ofb
twofish (32): cbc cfb ctr ecb ncfb nofb ofb
arcfour (256): stream
cast-256 (32): cbc cfb ctr ecb ncfb nofb ofb
loki97 (32): cbc cfb ctr ecb ncfb nofb ofb
rijndael-192 (32): cbc cfb ctr ecb ncfb nofb ofb
saferplus (32): cbc cfb ctr ecb ncfb nofb ofb
wake (32): stream
blowfish-compat (56): cbc cfb ctr ecb ncfb nofb ofb
des (8): cbc cfb ctr ecb ncfb nofb ofb
rijndael-256 (32): cbc cfb ctr ecb ncfb nofb ofb
serpent (32): cbc cfb ctr ecb ncfb nofb ofb
xtea (16): cbc cfb ctr ecb ncfb nofb ofb
blowfish (56): cbc cfb ctr ecb ncfb nofb ofb
enigma (13): stream
rc2 (128): cbc cfb ctr ecb ncfb nofb ofb
tripledes (24): cbc cfb ctr ecb ncfb nofb ofb

To encrypt a file through shell redirection (reading from stdin and writing to stdout):

mcrypt < File-To-Crypt.PDF > File-To-Crypt.PDF.cpy

Enter the passphrase (maximum of 512 characters)
Please use a combination of upper and lower case letters and numbers.
Enter passphrase:
Enter passphrase:

If mcrypt is invoked to create the encrypted file without shell redirects (< >), i.e.:

mcrypt -a blowfish File-To-Crypt.PDF

Please use a combination of upper and lower case letters and numbers.
Enter passphrase:
Enter passphrase:

File File-To-Crypt was encrypted.

mcrypt writes the encrypted file with a .nc extension and a default mode of 0600 (read and write for the file owner only), while the new file keeps the modification date of the original.

4. Decrypting file with mcrypt

Decryption of files is done with mdecrypt:

mdecrypt File-To-Crypt.PDF.cpy

Enter passphrase:
File File-To-Crypt.PDF.cpy was decrypted.
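
The encrypt and decrypt steps can be wrapped in two small shell functions, roughly as sketched here. The function names are invented for this example; the -a and -d options and the rijndael-256 algorithm name are taken from mcrypt's own option and algorithm lists, and the default algorithm is just an arbitrary pick:

```shell
#!/bin/sh
# Hypothetical helper names; assumes the mcrypt binary installed above.

encrypt_file() {
    # $1: file to encrypt
    # $2: algorithm name as printed by "mcrypt --list" (assumed default below)
    # mcrypt prompts for a passphrase and writes the result to "$1.nc".
    mcrypt -a "${2:-rijndael-256}" "$1"
}

decrypt_file() {
    # $1: encrypted .nc file
    # -d switches mcrypt into decryption mode, equivalent to calling mdecrypt.
    mcrypt -d "$1"
}
```

For example, encrypt_file File-To-Crypt.PDF blowfish followed by decrypt_file File-To-Crypt.PDF.nc should take the file through a full round trip, asking for the passphrase each time.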

To make mcrypt behave in a certain way whenever it is invoked, modify ~/.mcryptrc

mcrypt is also available as a module for php5 (php5-mcrypt).


How to convert html pages to text in console / terminal on GNU / Linux and FreeBSD

Thursday, December 8th, 2011

HTML to Plain Text Conversion on GNU / Linux and FreeBSD

I’m realizing that the more I turn into a full-time GUI user, the less coding or other interesting stuff I do…
I remembered the old glorious times when I was a full-time console user, and recalled a nifty trick I was so used to back in the day.
Back then I quite often wrote shell scripts which fetched (html) webpages and converted the html content into plain text (TXT) files.

To fetch a page back in those days I used lynx (a very simple UNIX text browser, which by the way lacks support for any CSS or JavaScript) in combination with html2text (an advanced HTML-to-text converter).

Let’s say I wanted to fetch my personal home page https://www.pc-freak.net/; I did that via the command:

$ lynx -source https://www.pc-freak.net/ | html2text > pcfreak_page.txt

The content of www.pc-freak.net is spat out by lynx as html source and passed to html2text, which saves it in the plain text file pcfreak_page.txt
The slightly more advanced elinks (a lynx-like alternative character mode WWW browser) provides better support for HTML and even some CSS and JavaScript, so to properly save the content of many pages in a plain text file it's better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source https://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the day I was more used to links than to the superior elinks. Nowadays I have both text browsers installed, but a test fetch of an html page like in the example above, piped to html2text, produced garbled output.

Here is the time to tell that it's not even necessary to have a text browser installed in order to fetch a webpage and convert it to plain text (TXT)! The wget file downloading tool supports dumping the page source as well; for all those who haven't tried it yet and want to test it:

$ wget -qO- https://www.pc-freak.net | html2text

Of course, for some pages the text inside the HTML tags will not get saved properly with either lynx or elinks, because some of the text might be embedded in CSS or JavaScript that elinks or lynx don't support. In those cases a GUI browser is useful. You can use the File -> Save As (Text Files) functionality built into any browser like Firefox, Epiphany or Opera; below is a screenshot showing an html page which I’m about to save as a plain text file in Mozilla Firefox:

Firefox iceWeasel Opera etc. save html webpage as plain text on GNU / Linux, FreeBSD

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages that already exist on the computer’s hard drive to plain text (.TXT) format.
One might wonder why anyone would ever want to do that. Well, I personally prefer reading plain text documents instead of htmls 😉
Converting an html file that already exists on the hard drive with html2text is done with the command:

$ html2text index.html >index.txt

To convert a whole directory full of .html (documentation) or whatever files to plain text .TXT, cd into the directory with the HTMLs and issue the one-liner bash loop:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done
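
The loop only covers a single directory. When the documentation is spread over subdirectories, a find based variant along these lines could be used. It is a sketch, html_tree_to_txt is a made-up name, and file names containing newlines are not handled:

```shell
#!/bin/sh
# Hypothetical helper: convert every .html file below a directory to .txt,
# writing each text file next to its source. Assumes html2text is installed.
html_tree_to_txt() {
    dir="${1:-.}"    # directory to walk, defaults to the current one

    find "$dir" -type f -name '*.html' | while IFS= read -r page; do
        # ${page%.html} strips only the trailing .html from the path
        html2text "$page" > "${page%.html}.txt"
    done
}
```

Running html_tree_to_txt html/ would then leave an index.txt beside every index.html in the tree.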

Now lie back and enjoy reading the docs like in the good old hacker days when .TXT files were fashionable 😉


How to convert any internet Webpage to PDF from command line on GNU/Linux

Friday, September 30th, 2011

If you're looking for a command line utility to generate a PDF file out of any webpage located online, you are looking for wkhtmltopdf.
The conversion of webpages to PDF by the tool is done using Apple's WebKit open source rendering engine.
wkhtmltopdf is very useful for web developers, as some websites are required to produce PDFs dynamically from remote website locations.
At the time of writing, wkhtmltopdf is shipped with Debian Squeeze 6 and the latest Ubuntu Linux versions, but it has not yet entered the Fedora and CentOS repositories.

To use wkhtmltopdf on Debian / Ubuntu distros install it via apt:

linux:~# apt-get install wkhtmltopdf
...

Next, to convert a webpage of choice use the command:

linux:~$ wkhtmltopdf www.pc-freak.net www.pc-freak.net_website.pdf
Loading page (1/2)
Printing pages (2/2)
Done

If the web page being snapshotted is several pages long, a PDF of several pages will be generated by wkhtmltopdf.
wkhtmltopdf also supports creating the website snapshot with a specified orientation, Landscape or Portrait.

Just pass the -O Landscape or -O Portrait option to it, like so:

linux:~$ wkhtmltopdf -O Portrait www.pc-freak.net www.pc-freak.net_website.pdf

wkhtmltopdf has many useful options, here are some of them:

JavaScript disabling – Disable JavaScript support for a website
Grayscale PDF generation – Generates the PDF in grayscale
Low quality PDF generation – Useful to shrink the size of the generated PDF
Set PDF page size – (A4, Letter etc.)
Add zoom to the generated PDF content
Support for password HTTP authentication
Support for using the tool over a proxy
Generation of a Table of Contents based on titles (only in static version)
Adding of headers and footers (only in static version)
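
Several of the options listed above can be combined in one call. The wrapper below is only a sketch: page_to_pdf is a hypothetical name, and the long option names are the ones printed by wkhtmltopdf --help, which may differ between versions, so check them against your installed build:

```shell
#!/bin/sh
# Hypothetical wrapper around wkhtmltopdf; option names should be verified
# against "wkhtmltopdf --help" for the installed version.
page_to_pdf() {
    url="$1"    # page to snapshot
    out="$2"    # PDF file to write

    # A4 landscape pages, grayscale and low quality to keep the PDF small,
    # JavaScript switched off while rendering.
    wkhtmltopdf --page-size A4 --orientation Landscape \
        --grayscale --lowquality --disable-javascript \
        "$url" "$out"
}
```

It would be called as page_to_pdf www.pc-freak.net www.pc-freak.net_website.pdf.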

To generate an A4 page with wkhtmltopdf:

wkhtmltopdf -s A4 www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf

wkhtmltopdf looks promising but still seems a bit buggy; here is what happened when I tried to create a PDF without setting the A4 page format:

linux:$ wkhtmltopdf www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf
Loading page (1/2)
OpenOffice path before fixup is '/usr/lib/openoffice' ] 71%
OpenOffice path is '/usr/lib/openoffice'
OpenOffice path before fixup is '/usr/lib/openoffice'
OpenOffice path is '/usr/lib/openoffice'
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
Printing pages (2/2)
Done
Printing pages (2/2)
Segmentation fault

The Debian and Ubuntu versions of wkhtmltopdf do not support TOC generation or adding headers and footers; to get those, one has to download and install the static version of wkhtmltopdf.
Using the static version of the tool is also the only option for anyone on Fedora or any other RPM based Linux distro.

