Linux: Convert recursively files content from WINDOWS-CP1251 to Unicode UTF-8 with recode and iconv

Wednesday, 9th January 2013

 

Linux How to make mass file convert of charset windows CP1251 toutf8 and to other encodings

Some time ago I've written a tiny article, explaining how converting of HTML or TEXT file content inside file can be converted with iconv.

Just recently, I've made mirror of a whole website with its directory structure with wget cmd. The website to be mirrored was encoded with charset Windows-1251 (which is now a bit obsolete and not very recommended to use), where my Apache Webserver to which I mirrored is configured by default to deliver file content (.html, txt, js, css …) in newer and more standard (universal cyrillic) compliant UTF-8 encoding. Thus opening in browser from my website, the website was delivered in UTF-8, whether the file content itself was with encoding Windows CP-1251; Thus I ended up seeing a lot of monkey unreadable characters instead of Slavonic letters. To deal with the inconvenience, I've used one liner script that converts all Windows-1251 charset files to UTF-8. This triggered me writting this little post, hoping the info might be useful to others in a similar situation to mine:

1. Make Mass file charset / encoding convertion with recode

On most Linux hosts, recode is probably not installed. If you're on Debian / Ubuntu Linux install it with apt;

apt-get install --yes recode

It is also installable from default repositories on Fedora, RHEL, CentOS with:

 

yum -y install recode

Here is recode description taken from man page:

NAME
       recode – converts files between character sets

find . -name "*.html" -exec recode WINDOWS-1251..UTF-8 {} ;

If you have few file extensions whose chracter encoding needs to be converted lets say .html, .htm and .php use cmd:

find . -name "*.html" -o -name '*.htm' -o -name '*.php' -exec recode WINDOWS-1251..UTF-8 {} ;

Btw I just recently learned how one can look for few, file extensions with find under one liner the argument to pass is -o -name '*.file-extension', as you can see from  example, you can look for as  many different file extensions as you like with one find search command.

After completing the convertion, I've remembered that earlier I've also used iconv on a couple of occasions to convert from Cyrillic CP-1251 to Cyrillic UTF-8, thus for those who prefer to complete convertion with iconv here is an alternative a bit longer method using for cycle + mv and iconv.

2. Mass file convertion with iconv

for i in $(find . -name "*.html" -print); do
iconv -f WINDOWS-1251 -t UTF-8 $i > $i.utf-8;
mv $i $i.bak;
mv $i.utf-8 $i;
done

As you see in above line of code, there are two occurances of move command as one is backupping all .html files and second mv overwrites with files with converted encoding. For any other files different from .html, just change in cmd find . -iname '*.html' to whatever file extension.

Share this on:

Download PDFDownload PDF

Tags: , , , , , , , , , , , , ,

3 Responses to “Linux: Convert recursively files content from WINDOWS-CP1251 to Unicode UTF-8 with recode and iconv”

  1. yestema says:
    Safari 6.0.4 Safari 6.0.4 Mac OS X 10.8.3 Mac OS X 10.8.3
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.29.13 (KHTML, like Gecko) Version/6.0.4 Safari/536.29.13

    I get an error :  "missing argument to `-exec'"

    View CommentView Comment
    • s4hjh33 says:
      Google Chrome 37.0.2062.120 Google Chrome 37.0.2062.120 GNU/Linux x64 GNU/Linux x64
      Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36

      It should have a backslash before the closing semicolon, like that:
      find . -name “*.html” -exec recode WINDOWS-1251..UTF-8 {} \;

      View CommentView Comment

Leave a Reply

CommentLuv badge