If you are a Linux user and need to script the extraction of all images / pictures from a Microsoft Word .docx document, for example for a website, you will probably wonder whether there is a Linux command-line tool that can do it.
For websites, or for scripting in general, being able to unpack .docx files on a Linux / UNIX OS is very handy.
The good news is that the MS .docx format is simply a ZIP archive, so you can unzip it straight away and pick up all the bundled image files (.JPG / .PNG / .GIF or any other graphic format).
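If you want to confirm this from a script before calling unzip, you can look at the first bytes of the file: a non-empty ZIP archive starts with the signature "PK\003\004". Below is a minimal sketch; is_zip_container is a hypothetical helper name of my own, not something shipped with unzip.

```shell
# is_zip_container FILE
# Succeeds if FILE starts with the ZIP local-file signature "PK\003\004".
# Hypothetical helper, shown only to illustrate that .docx is a ZIP archive.
is_zip_container() {
    local magic
    [ -r "$1" ] || return 1
    # Read the first four bytes and render them as hex for a safe comparison.
    magic=$(head -c 4 < "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "504b0304" ]
}
```

Example use: is_zip_container your-file-name-of-choice.docx && echo "looks like a ZIP archive". An old binary .doc file fails this check, because it uses a different container format.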
To list the content of a sample .docx on Linux you need the unzip command-line tool installed. If you don't have it yet, install it either with yum (on RHEL / CentOS / Fedora):
[root@centos ~]:# yum -y install unzip
…
Or with apt-get (on Debian / Ubuntu / Mint):
debian-linux:~# apt-get install --yes unzip
Once it is installed, to list the contents of a .docx MS Word file:
debian-linux:~# unzip -l your-file-name-of-choice.docx
Archive: your-file-name-of-choice.docx
Length Date Time Name
--------- ---------- ----- ----
4333 1980-01-01 00:00 [Content_Types].xml
737 1980-01-01 00:00 _rels/.rels
4117 1980-01-01 00:00 word/_rels/document.xml.rels
462177 1980-01-01 00:00 word/document.xml
2984 1980-01-01 00:00 word/footer1.xml
1487 1980-01-01 00:00 word/header2.xml
1351 1980-01-01 00:00 word/header3.xml
1556 1980-01-01 00:00 word/header4.xml
1756 1980-01-01 00:00 word/header1.xml
2390 1980-01-01 00:00 word/footer2.xml
1432 1980-01-01 00:00 word/footer3.xml
1629 1980-01-01 00:00 word/footnotes.xml
1623 1980-01-01 00:00 word/endnotes.xml
1449 1980-01-01 00:00 word/header5.xml
306540 1980-01-01 00:00 word/media/image5.jpeg
5564 1980-01-01 00:00 word/media/image2.png
5593 1980-01-01 00:00 word/media/image4.png
8050 1980-01-01 00:00 word/media/image3.png
6992 1980-01-01 00:00 word/theme/theme1.xml
5537 1980-01-01 00:00 word/media/image1.png
685 1980-01-01 00:00 word/glossary/_rels/document.xml.rels
10300 1980-01-01 00:00 word/glossary/document.xml
12341 1980-01-01 00:00 word/settings.xml
3390 1980-01-01 00:00 word/glossary/settings.xml
677 1980-01-01 00:00 docProps/core.xml
1380 1980-01-01 00:00 docProps/custom.xml
1000 1980-01-01 00:00 docProps/app.xml
335 1980-01-01 00:00 customXml/itemProps2.xml
296 1980-01-01 00:00 customXml/_rels/item4.xml.rels
296 1980-01-01 00:00 customXml/_rels/item3.xml.rels
….
As you can see from the output above, the media files of a .docx written by Word are stored under the "word/media/" folder inside the archive.
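For scripting it is often handier to get just the media file names, without the size and date columns. A minimal sketch, assuming your unzip build includes the zipinfo mode (-Z); list_docx_media is a hypothetical helper name:

```shell
# list_docx_media FILE.docx
# Prints the names of all archive entries under word/media/, one per line.
# Hypothetical helper; it relies on unzip's zipinfo mode.
list_docx_media() {
    # -Z1 switches unzip to zipinfo mode and prints bare entry names only.
    # The pattern is quoted so the shell passes it to unzip unexpanded.
    unzip -Z1 "$1" 'word/media/*'
}
```

Example use: list_docx_media your-file-name-of-choice.docx | wc -l gives you the number of embedded media files.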
So, to skip all the .xml and .rels files in the .docx and pick up only the picture files:
debian-linux:~# unzip your-file-name-of-choice.docx "word/media/*"
Archive: your-file-name-of-choice.docx
extracting: word/media/image5.jpeg
extracting: word/media/image2.png
extracting: word/media/image4.png
extracting: word/media/image3.png
extracting: word/media/image1.png
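Note that the command above recreates the word/media/ directory tree in the current directory. If you would rather have the pictures directly in a folder of your choice, something like the following sketch should do; extract_docx_images and the default folder name "images" are my own hypothetical choices:

```shell
# extract_docx_images FILE.docx [OUTDIR]
# Extracts everything under word/media/ straight into OUTDIR
# (default: ./images), without the word/media/ path prefix.
# Hypothetical helper name and default directory.
extract_docx_images() {
    local docx="$1" outdir="${2:-images}"
    mkdir -p "$outdir" || return 1
    # -j  junk paths: do not recreate word/media/ inside OUTDIR
    # -o  overwrite existing files without prompting
    # -d  directory to extract into
    unzip -j -o "$docx" 'word/media/*' -d "$outdir"
}
```

Example use: extract_docx_images your-file-name-of-choice.docx /var/www/uploads/pictures. Because of -o, running it twice silently overwrites the earlier copies, which is usually what you want in a script.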
In case you need only files of one specific format from the Word .docx document, issue:
[root@centos ~]:# unzip your-file-name-of-choice.docx "*.jpeg"
…
Or if you just need the files with the .xml extension:
[root@centos ~]:# unzip your-file-name-of-choice.docx "*.xml"
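unzip accepts several patterns in one call, so you can pull a few file types at once. Here is a minimal sketch that builds the patterns from a list of extensions; extract_docx_by_ext is a hypothetical helper name, and I add -C so that the match is case-insensitive (image.PNG is picked up as well as image.png):

```shell
# extract_docx_by_ext FILE.docx EXT [EXT...]
# Extracts every archive entry whose name ends in one of the given
# extensions. Hypothetical helper name.
extract_docx_by_ext() {
    local docx="$1" ext
    [ "$#" -ge 2 ] || { echo "usage: extract_docx_by_ext FILE EXT..." >&2; return 2; }
    shift
    # Replace each extension in the argument list with a "*.ext" pattern:
    # append the pattern at the end, then drop the extension from the front.
    for ext in "$@"; do
        set -- "$@" "*.$ext"
        shift
    done
    # -o overwrites without prompting, -C matches patterns case-insensitively.
    unzip -o -C "$docx" "$@"
}
```

Example use: extract_docx_by_ext your-file-name-of-choice.docx png jpeg jpg gif. For an extension that matches nothing in the archive, unzip prints a "filename not matched" caution, so check the output if you expect every type to be present.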
If you need to extract pictures from the older .doc (2003) MS file format, you first need to convert the .doc file to .docx; after that you can use unzip to extract the files you need.
Unfortunately I'm not aware of a tool that converts .doc to .docx; if somebody knows one, please share it in a comment.
Perhaps it is possible with unoconv or abiword.
The closest thing I know is how to convert a .doc Word document to PDF:
abiword --to=pdf filename.doc
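One candidate worth trying for the .doc to .docx step is LibreOffice in headless mode. This is only a sketch under assumptions: LibreOffice has to be installed and its soffice binary must be on your PATH (on some systems the binary is called libreoffice instead), and doc_to_docx is a hypothetical helper name of my own.

```shell
# doc_to_docx FILE.doc [OUTDIR]
# Converts an old .doc file to .docx with LibreOffice in headless mode.
# The result is written to OUTDIR (default: current directory) under the
# same base name with a .docx extension.
# Hypothetical helper; assumes the soffice binary is on the PATH.
doc_to_docx() {
    local doc="$1" outdir="${2:-.}"
    command -v soffice >/dev/null 2>&1 || {
        echo "soffice not found, is LibreOffice installed?" >&2
        return 127
    }
    mkdir -p "$outdir" || return 1
    soffice --headless --convert-to docx --outdir "$outdir" "$doc"
}
```

Example use: doc_to_docx filename.doc /tmp/converted, after which unzip /tmp/converted/filename.docx "word/media/*" should give you the pictures. How well the embedded images survive the conversion depends on the document, so check the result on a sample file first.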