Howto to detect file encoding and convert default encoding of given files from one encoding to another on GNU/Linux and FreeBSD

Sunday, 7th February 2010

I wanted to convert an html document character encoding to UTF-8, to achieve that of
course it was first needed to determine what kind of character encoding was used in
creation time of the file.
First thing I tried was:

hipo@noah:~/Desktop/test$ file File-Whole.htm
File-Whole.htm: HTML document text

as you can see that’s shit cause for some reason mime encoding is not printed by the file
command.
Next what I tried was:
hipo@noah:~/Desktop/test$ file --mime File-Whole.htm1File-Whole.htm1: text/html; charset=unknown-8bit

Here you see that character encoding is reported as charset=unknown-8bit which
ain’t cool at all and is of no use and prompts an error if I try it in iconv
Here is why I needed concretely to determine what kind of character set my file uses to later
be able to convert it using iconv .
To achieve my goal after consulting with Mr. Google , I found
out about enca — detect and convert encoding of text files
It’s obviously my lucky day because good guys from Debian has packaged enca so, everything came to the point of
apt-getting it.
# apt-get install enca

On FreeBSD enca port is available, so installing it cames simply to installing it from port tree.
Here is how:
pcfreak# cd /usr/ports/converters/enca;pcfreak# make install clean

Now I tried launching enca directly without any program parameters, but I was unlucky:

hipo@noah:~/Desktop/test$ enca file-Whole.htm
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.

I gave it another try, following prescribed usage parameters though I first checked my possibility
as a languages I can pass by to enca’s -L parameter.
Preliminary knowing that my text contains text in Bulgarian language, it wasn’t such a big deal
for me to determine the required language:

hipo@noah:~/Desktop/test$ enca -L bulgarian File-Whole.htm
transformation format 8 bits; CP1251

Knowing my character set all left for me was to do do the convert to UTF-8 to make text,
much more accessible.

hipo@noah:~/Desktop/test$ iconv --from-code=unknown-8bit --to=UTF-8 File-Whole.htm > File-Whole.htm.new
hipo@noah:~/Desktop/test$ mv File-Whole.htm.new File-Whole.htm

Well here we are conversion mission accomplished 🙂

Share this on:

More helpful Articles

Download PDFDownload PDF

Tags: , , , ,

6 Responses to “Howto to detect file encoding and convert default encoding of given files from one encoding to another on GNU/Linux and FreeBSD”

  1. SayusiAndo says:
    Google Chrome 6.0.472.51 Google Chrome 6.0.472.51 GNU/Linux GNU/Linux
    Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.51 Safari/534.3

    Hi!

    Thanks for this article! I had an little problem with subtitle files from Windows and about file -bi these files are unknown-8bit… But, Mr. Google have led to here and I’ could solve this problem!

    Thanks again!

    András

    View CommentView Comment
    • admin says:
      Epiphany 2.29.92 Epiphany 2.29.92 Debian GNU/Linux x64 Debian GNU/Linux x64
      Mozilla/5.0 (X11; U; Linux x86_64; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Safari/531.2+ Debian/squeeze/sid () Epiphany/2.29.92

      It’s my pleasure to help to somebody, see you around

      View CommentView Comment
  2. tom says:
    IceWeasel 3.5.15 IceWeasel 3.5.15 GNU/Linux x64 GNU/Linux x64
    Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101028 Iceweasel/3.5.15 (like Firefox/3.5.15)

    iconv: conversion from `unknown-8bit’ is not supported

    View CommentView Comment
    • admin says:
      Epiphany 2.30.6 Epiphany 2.30.6 Debian GNU/Linux x64 Debian GNU/Linux x64
      Mozilla/5.0 (X11; U; Linux x86_64; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Version/5.0 Safari/531.2+ Debian/squeeze (2.30.6-1) Epiphany/2.30.6

      did you followed all the steps from the article including the part with the enca program?
      In my example I put bulgarian but you might need to change that to adjust to your language if it’s not bulgarian.

      View CommentView Comment
  3. windowsdvdmaker says:
    Google Chrome 4.0.221.7 Google Chrome 4.0.221.7 Windows 7 Windows 7
    Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2

    Useful information for me. I found this from Google.

    View CommentView Comment

Leave a Reply

CommentLuv badge