How to detect a file's character encoding and convert it
from one encoding to another on GNU/Linux
and FreeBSD
I wanted to convert the character encoding of an HTML document to UTF-8.
To do that, of course, I first had to determine which character
encoding the file was created with.
The first thing I tried was:
hipo@noah:~/Desktop/test$ file File-Whole.htm
File-Whole.htm: HTML document text
As you can see, that's useless, because by default the file command
does not print the MIME encoding.
The next thing I tried was:
hipo@noah:~/Desktop/test$ file --mime File-Whole.htm
File-Whole.htm: text/html; charset=unknown-8bit
Here the character encoding is reported as charset=unknown-8bit,
which ain't cool at all: it is of no use, because iconv knows no
encoding by that name and fails with an error if I pass it in.
That is why I needed to determine exactly which character set my
file uses, so that I could later convert it with iconv.
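A quick way to confirm that a charset name is usable before relying on it is to ask iconv to convert empty input from that name. This is only a minimal sketch: the helper name is_iconv_charset is made up for illustration, and whether CP1251 is accepted depends on the iconv implementation installed on your system.

```shell
# is_iconv_charset NAME: succeed if this system's iconv accepts NAME
# as a source encoding (hypothetical helper, not part of iconv).
is_iconv_charset() {
    # Convert empty input; iconv exits non-zero for names it does not know.
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for name in unknown-8bit CP1251; do
    if is_iconv_charset "$name"; then
        echo "$name: accepted by iconv"
    else
        echo "$name: rejected by iconv"
    fi
done
```

The short options -f and -t used here are the portable spellings of --from-code and --to-code.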
After consulting Mr. Google, I found out about enca, a tool to
detect and convert the encoding of text files.
It was obviously my lucky day, because the good guys from Debian have
packaged enca, so it all came down to
apt-getting it:
# apt-get install enca
On FreeBSD an enca port is available, so installing it simply comes
down to building it from the ports tree.
Here is how:
pcfreak# cd /usr/ports/converters/enca;
pcfreak# make install clean
Next I tried launching enca directly, without any
parameters, but I was out of luck:
hipo@noah:~/Desktop/test$ enca File-Whole.htm
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
I gave it another try, following the suggested usage, though
I first checked which languages I could pass to enca's
-L parameter.
Since I already knew that my text was in Bulgarian,
determining the required language was no big deal:
hipo@noah:~/Desktop/test$ enca -L bulgarian File-Whole.htm
transformation format 8 bits; CP1251
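enca can also print the detected charset under the name iconv uses, through its -i (--iconv-name) option, which saves translating names by hand. The following is a minimal sketch, assuming enca is installed; detect_charset is a hypothetical helper name, not something enca ships.

```shell
# detect_charset LANGUAGE FILE: print FILE's charset under the name
# iconv uses for it (hypothetical wrapper around enca).
detect_charset() {
    # Bail out early with a distinct status if enca is not installed.
    command -v enca >/dev/null 2>&1 || {
        echo "enca is not installed" >&2
        return 127
    }
    # -L selects the language, -i asks for the iconv name of the charset.
    enca -L "$1" -i "$2"
}

# Example call (commented out because it needs the real file):
# detect_charset bulgarian File-Whole.htm
```

For very short files enca may not have enough text to decide, so treat the result as a hint and check the converted output by eye.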
Knowing my character set, all that was left was to convert the file
to UTF-8 to make the text
much more accessible.
hipo@noah:~/Desktop/test$ iconv --from-code=CP1251 --to-code=UTF-8 File-Whole.htm > File-Whole.htm.new
hipo@noah:~/Desktop/test$ mv File-Whole.htm.new File-Whole.htm
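The two commands above overwrite the original even if iconv stopped halfway on an invalid byte. A slightly more careful variant could replace the file only when iconv succeeds, and keep a backup. This is a sketch under assumptions: to_utf8 and the .orig backup suffix are names chosen for illustration.

```shell
# to_utf8 CHARSET FILE: convert FILE from CHARSET to UTF-8 in place,
# keeping the untouched original as FILE.orig (hypothetical helper).
to_utf8() {
    src_charset=$1
    file=$2
    tmp="$file.new"
    # Replace the file only if iconv converted all of it without error.
    if iconv -f "$src_charset" -t UTF-8 "$file" > "$tmp"; then
        cp -p "$file" "$file.orig" && mv "$tmp" "$file"
    else
        # Conversion failed: drop the partial output, leave FILE as it was.
        rm -f "$tmp"
        return 1
    fi
}

# Example call (commented out because it needs the real file):
# to_utf8 CP1251 File-Whole.htm
```

Remember that the HTML file may still carry a charset declaration in a meta tag naming the old encoding, which has to be updated by hand.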
Well, here we are: conversion mission accomplished.