Some time ago I wrote a tiny article explaining how to convert the content of an HTML or text file between character sets with iconv.
Just recently, I mirrored a whole website, preserving its directory structure, with the wget command. The website to be mirrored was encoded in the Windows-1251 charset (which is by now a bit obsolete and not recommended), whereas my Apache webserver, to which I mirrored it, is configured by default to deliver file content (.html, .txt, .js, .css …) in the newer and more universal Cyrillic-compatible UTF-8 encoding. Thus, when opened in a browser from my server, the pages were delivered as UTF-8 while the file content itself was still Windows CP-1251, and I ended up seeing lots of unreadable garbage characters instead of Slavonic letters. To deal with the inconvenience, I used a one-liner that converts all Windows-1251 charset files to UTF-8. That prompted me to write this little post, hoping the info might be useful to others in a similar situation:
1. Mass file charset / encoding conversion with recode
On most Linux hosts, recode is probably not installed by default. If you're on Debian / Ubuntu Linux, install it with apt:
apt-get install --yes recode
It is also installable from the default repositories on Fedora, RHEL and CentOS with:
yum -y install recode
Here is recode description taken from man page:
NAME
recode – converts files between character sets
find . -name "*.html" -exec recode WINDOWS-1251..UTF-8 {} ;
If you have a few file extensions whose character encoding needs to be converted, let's say .html, .htm and .php, group the -name tests with escaped parentheses so that -exec applies to all of them:
find . \( -name '*.html' -o -name '*.htm' -o -name '*.php' \) -exec recode WINDOWS-1251..UTF-8 {} \;
By the way, I only recently learned how to look for several file extensions in a single find one-liner: the argument to pass is -o -name '*.file-extension' for each extra extension. As you can see from the example, you can search for as many different file extensions as you like in one find command.
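One caveat worth knowing: find's implicit AND binds tighter than -o, so without grouping, an action such as -exec or -print attaches only to the last -name test. A small demonstration of this, using throwaway sample files in a scratch directory:

```shell
#!/bin/sh
# Create three sample files in a scratch directory.
cd "$(mktemp -d)"
touch a.html b.htm c.php

# Without grouping, -print binds only to the last -name test,
# so this lists only ./c.php:
find . -name '*.html' -o -name '*.htm' -o -name '*.php' -print

# With escaped parentheses, -print applies to all three files:
find . \( -name '*.html' -o -name '*.htm' -o -name '*.php' \) -print
```

This is why the multi-extension recode command needs the \( … \) around the -name tests; otherwise only the *.php files would actually get converted.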
After completing the conversion, I remembered that earlier I had also used iconv on a couple of occasions to convert from Cyrillic CP-1251 to Cyrillic UTF-8. So, for those who prefer to do the conversion with iconv, here is an alternative, slightly longer method using a for loop together with mv and iconv.
2. Mass file conversion with iconv
for i in $(find . -name "*.html" -print); do
    iconv -f WINDOWS-1251 -t UTF-8 "$i" > "$i.utf-8";
    mv "$i" "$i.bak";
    mv "$i.utf-8" "$i";
done
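Note that the for loop word-splits the output of find, so it will break on filenames containing spaces. A more robust sketch, assuming a GNU/Linux find and iconv, pushes the loop into find -exec so that each path arrives as a whole argument:

```shell
#!/bin/sh
# Convert every .html file under the current directory from
# Windows-1251 to UTF-8, keeping a .bak copy of each original.
# Safe for filenames with spaces, unlike the unquoted for loop.
find . -name '*.html' -exec sh -c '
  for f do
    iconv -f WINDOWS-1251 -t UTF-8 "$f" > "$f.utf-8" &&
      mv "$f" "$f.bak" &&
      mv "$f.utf-8" "$f"
  done
' sh {} +
```

The && chaining also means that if iconv fails on a file, the original is left untouched instead of being clobbered.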
As you can see in the code above, the mv command occurs twice: the first mv backs up each .html file, and the second overwrites it with the file in the converted encoding. For file types other than .html, just change find . -name '*.html' in the command to whatever file extension you need.
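To double-check which files still need converting, the file utility can report a file's charset. Keep in mind it is a heuristic guess: Windows-1251 content is typically reported as iso-8859-1 or unknown-8bit rather than by name, but a successful conversion should show up as utf-8:

```shell
#!/bin/sh
# Print each .html file's detected MIME type and charset,
# e.g. "text/html; charset=utf-8" after a successful conversion.
find . -name '*.html' -exec file -bi {} \;
```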
I get an error : "missing argument to `-exec'"
It should have a backslash before the closing semicolon, like this:
find . -name "*.html" -exec recode WINDOWS-1251..UTF-8 {} \;
Thanks