Posts Tagged ‘utf 8’

How to convert file content encoded in windows-cp1251 charset to UTF-8 (with iconv) to be delivered properly encoded to browsing end clients

Wednesday, May 16th, 2012

windows-cp1251 bulgarian to UTF-8 / Encoding Communication Decoding Communication Funny Picture

I have a bunch of old html files all encoded in the historically obsolete Windows-cp1251. Windows-CP1251 used to be common used 7 years ago and therefore still big portions of the web content in Bulgarian / Russian Cyrillic is still transferred to the end users in this encoding.

This was just before the "UTF-8 revolution", where massively people started using UTF-8,
Well it was clear the specific national country text encoding standards will quickly be moved by to UTF-8 – Universal Encoding format which abbreviation stands for (Unicode Transformation Format).

Though UTF-8 was clear to be "the future", many web developers mostly because of their incompetency or using an old sources of learning how to writen in HTML continued to use windows-cp1251 in HTMLs. I'm even convinced, there are still developers out there who are writting websites for Bulgarian / Russian / Macedonian customers using obsolete encodings …

The smarter developers of those accustomed to windows-cp1251, KOI-8R etc. etc., were using the meta tag to specify the type of charset of the web page content with:

<meta http-equiv="content-type" content="text/html;charset=windows-cp1251">

or

<meta http-equiv="content-type" content="text/html;charset=koi-8r">

Anyhow, still many devs even didn't placed the windows-cp1251 in the head of the HTML …

The result for the system administrator is always a mess – a lot of webpages that are showing like unreadable signs and tons of unhappy customers.
As always the system administrator is considered responsible, for the programmer mistakes :). So instead of programmers fix their bad cooking, the admin has to fix it all!

One quick work around me as admin has applied to failing to display pages in Cyrillic using the Windows-cp1251 character encoding was to force windows-cp1251 as a default encoding for the whole virtualhost or Apache directory with Apache directives like:

<VirtualHost *:80>
ServerAdmin some_user@some_host.com
DocumentRoot /var/www/html
AddDefaultCharset windows-cp1251
ServerName the_host_name.com
ServerAlias www.the_host_name.com
....
....
<Directory>
AddDefaultCharset windows-cp1251
>/Directory>
</VirtualHost>

Though this mostly would, work there are some occasions, where only a particular html files from all the content served by Apache is encoded in windows-cp1251, if most of the content is already written in UTF-8, this could be a big issues as you cannot just change the UTF-8 globally to windows-cp1251, just because few pages are written in archaic encoding….
Since most of the content is displayed to the client by Apache (as prior explained) just fine, only particular htmls lets's ay single.html, single2.html etc. etc. are displayed with some question marks or some non-human readable "hieroglyphs".

Below is a screenshot from two pages returned to my browser in wrongly set htmls charset:

Improper Windows CP1251 encoding with Apache set to serve UTF-8 encoding questiomarks

Improper Windows CP1251 delivered page in UTF-8 browser view

Apache returns cp1251 in some non-UTF8 wrong encoding (webserver improperly served cyrillic encoding)

Improperly served encoding CP1251 delivered by Apache in non-utf-8 encoding

When this kind of issues occur, the only solution is to simply login to the server and use iconv command to convert all files returning unreadable content from whatever the non UTF-8 encoding is lets say in my case Bulgarian typeset of cp1251 to UTF-8

Here is how the iconv command to convert between windows-cp1251 to utf-8 the two sample files named single1.html and single2.html

server:/web# /usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single1.html > single1.html.utf8
server:/web# mv single1.html single1.html.bak;
server:/web# mv single1.html.utf8 single1.html
server:/web# /usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single2.html > single2.html.utf8
server:/web# mv single2.html single2.html.bak;
server:/web# mv single2.html.utf8 single2.html

I always, make copies of the original cp1251 encoded files (as you see mv single1.html single1.html.bak), because if something goes wrong with convertion I can easily revert back.

If there are 10 files with consequential numbers naming they can be converted using a short for loop, like so:

server:/web# for i $(seq 1 10); do
/usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single$i.html > single$i.html.utf8;mv single$i.html single$i.html.bak
mv single$i.html.utf8 single$i.html
done

Just as earlier mentioned if single1.html, single2.html … has in the html <head>:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

You should open, each of the files in question and wipe out the line either by hand or use sed to wipe it in one loop if it has to be done for lets say 10 files named (single{1..10})

server:/web# for i in $(seq 1 10); do
sed '/<meta http-equiv="Content-Type" content="text\/html; charset=windows-1251>/d' single$i.txt > single$i.txt.new;
mv single$i.txt single$i.txt.bak;
mv single$i.txt.new single$i.txt

Well now,

How to change Debian GNU / Linux console (tty) language to Bulgarian or Russian Language

Wednesday, April 25th, 2012

Debian has a package language-env. I haven't used my Linux console for a long time. So I couldn't exactly remember how I used to be making the Linux console to support cyrillic language (CP1251, bg_BG.UTF-8) etc.

I've figured out for the language-env existence in Debian Book on hosted on OpenFMIBulgarian Faculty of Mathematics and Informatics website.
The package info with apt-cache show displays like that:

hipo@noah:~/Desktop$ apt-cache show language-env|grep -i -A 3 description
Description: simple configuration tool for native language environment
This tool adds basic settings for natural language environment such as
LANG variable, font specifications, input methods, and so on into
user's several dot-files such as .bashrc and .emacs.

What is really strange, is the package maintainer is not Bulgarian, Russian or Ukrainian but Japanese.
As you see the developer is weirdly not Bulgarian but Japanese Kenshi Muto. What is even more interesting is that it is another japanese that has actually written the script set-language-env contained within the package. Checking the script in the header one can see him, Tomohiro KUBOTA

Before I've found about the language-env existence, I knew I needed to have the respective locales installed on the system with:

# dpkg-reconfigure locales

So I run dpkg-reconfigure to check I have existing the locales for adding the Bulgarian language support.
Checking if the bulgarian locale is installed is also possible with /bin/ls:

# ls -al /usr/share/i18n/locales/*|grep -i bg
-rw-r--r-- 1 root root 8614 Feb 12 21:10 /usr/share/i18n/locales/bg_BG

The language-env contains a perl script called set-language-env which is doing the actual Debian Bulgarization / cyrillization. The set-language-env author is another Japanese and again not Slavonic person.

Actually set-language-env script is not doing the Bulgariazation but is a wrapper script that uses a number of "hacks" to make the console support cyrillic.

Further on to make the console support cyrillic, execute:

hipo@noah:~$ set-language-env
Setting up users' native language environment
by modifying their dot-files.
Type "set-language-env -h" for help.
1 : be (Bielaruskaja,Belarusian)
2 : bg (Bulgarian)
3 : ca (Catala,Catalan)
4 : da (Dansk,Danish)
5 : de (Deutsch,German)
6 : es (Espanol,Spanish)
7 : fr (Francais,French)
8 : ja (Nihongo,Japanese)
9 : ko (Hangul,Korean)
10 : lt (Lietuviu,Lithuanian)
11 : mk (Makedonski,Macedonian)
12 : pl (Polski,Polish)
13 : ru (Russkii,Russian)
14 : sr (Srpski,Serbian)
15 : th (Thai)
16 : tr (Turkce,Turkish)
17 : uk (Ukrajins'ka,Ukrainian)
Input number > 2

There are many questions in cyrillic list necessery to be answered to exactly define if you need cyrillic language support for GNOME, pine, mutt, console etcetera.
The script will create or append commands to a number of files on the system like ~/.bash_profile
The script uses the cyr command part of the Debian console-cyrillic package for the actual Bulgarian Linux localization.

As said it was supposed to also do a localization in the past of many Graphical environment programs, as well as include Bulgarian support for GNOME desktop environment. Since GNOME nowdays is already almost completely translated through its native language files, its preferrable that localization to be done on Linux install time by selecting a country language instead of later doing it with set-language-env. If you failed to set the GNOME language during Linux install, then using set-language-env will still work. I've tested it and even though a lot of time passed since set-language-env was heavily used for bulgarization still the GUI env bulgarization works.

If set-language-env is run in gnome-terminal the result, the whole set of question dialogs will pop-up in new xterm and due to a bug, questions imposed will be unreadable as you can see in below screenshot:

set-language-env command screenshot in Debian GNU / Linux gnome-terminal

If you want to remove the bulgarization, later at certain point, lets you don't want to have the cyrillic console or programs support use:

# set-language-env -r
Setting up users native language environment' 

For anyone who wish to know more in depth, how set-language-env works check the README files in /usr/share/doc/language-env/ one readme written by the author of the Bulgarian localization part of the package Anton Zinoviev is /usr/share/doc/language-env/README.be-bg-mk-sr-uk

How to add multi language support to wordpress with qTranslate

Monday, October 3rd, 2011

QTRanslate WordPress Language Translate Screenshot 1

Lately, I have to deal with some wordpress based installs in big part of my working time. One of the wordpress sites needed to have added a multi language support.

My first research in Google pointed me to WPML Multilingual CMS The WordPress Multilingual Plugin
WPML Multilingual CMS looks nice and easy to use but unfortunately its paid, the company couldn’t afford to pay for the plugin so I looked forward online for a free alternative and stumbled upon QTranslate

QTranslate is free and very easy to install. Its installed the wordpress classic way and the installation went smoothly, e.g.:

1. Download and unzip QTranslate

# cd /var/www/blog/wp-content/plugins
/var/www/blog/wp-content/plugins# wget http://downloads.wordpress.org/plugin/qtranslate.2.5.24.zip
...
/var/www/blog/wp-content/plugins# unzip qtranslate.2.5.24.zip
...

Just for fun and in case the plugin disappears in future, a mirror of Qtranslate 2.5.24 is found here

2. Enable QTranslate from wordpress admin

Plugins -> Inactive -> qTranslate (Activate)

After activating the plugin, there is a Settings button from which qTranslate‘s various plugin parameteres can be tuned.

qTranslate WordPress translate screenshot 2

In my case my site had to support both English and Arabic, so from the settings I added support for Arabic translation to the wordpress install.

Adding Arabic is done in the following way:

a. From the Language Management (qTranslate Configuration) from the Languages menu and the Languages (Add Languages) I had to choose a language code (in my case a language code of ar – for Arabic). Next I had to choose the Arabic flag from the follow up flag list.

In next text box Name , again I had to fill Arabic, for Locale en_US.UTF-8
The following Date Format and Time Format text boxes are optional so I left them blank.
To complete the process of adding the Arabic as a new language wordpress should support I pressed the Add Language button and the Arabic got added as a second language.

Afterwards the Arabic was added as second language, on the bottom of the left wordpress menu pane a button allowing a switch between English, Arabic appeared (see below screenshot):

MultiLingual WordPress with qTranslate

Finally to make Arabic appear as a second language of choice on the website I added it as a Widget in the Widgets menu from the AWidgets menu:

Appearance -> Widgets

In widgets I added qTranslate Language Chooser to the Sidebar without putting any kind of Title for qtranslate widget .
I found it most helpful to choose the Text and Image as an option on how to display the Language switching in the wp.

Test your web browser compatability with Acid3 test

Wednesday, January 25th, 2012

Acid3 Test is a group of browser compitability tests. Acid3 test is a good indicator on how Web ready is your browser.

Acidtest is part of the web standards project. Latest Firefox 9.0.1 passes the test on 100% (100/100).
I've tried it with Epiphany and it scored only 67/100, still I'm using Epiphany on daily basis and I'm quite happy with it.
Acid3 browser compitability Test Firefox 9.0.1
The tests involved are testing browser for:
 

  • DOM
  • DOM2
  • Checks on HTML tables and forms browser rendering
  • SVG compitability testing
  • DOM1 and DOM2 compitability
  • Various ECMA Script Javascript compitability tests
  • Unicode (UTF-16 and UTF-8) browser compitability
  • XHML, SMIL, CSS, HTML compitability
  • Content-type image/png, text plain etc.

Acid3 browser test fail
The Acid3 test is written itself in Javascript. It consists of 6 testing "stages" (buckets) upon which the browser tested is evaluated.
Each of the test is represented visually by a rectangle. If the a test stage is passed you see a new rectangle appearing in the tested browser.
In wikipedia, there is a thorough list with web browsers by type and engine and the level of support for the Acid3 test.
The test is of great use if you're web developer.