Posts Tagged ‘UTF’

How to convert file content encoded in windows-cp1251 charset to UTF-8 (with iconv) to be delivered properly encoded to browsing end clients

Wednesday, May 16th, 2012

windows-cp1251 bulgarian to UTF-8 / Encoding Communication Decoding Communication Funny Picture

I have a bunch of old html files all encoded in the historically obsolete Windows-cp1251. Windows-CP1251 used to be common used 7 years ago and therefore still big portions of the web content in Bulgarian / Russian Cyrillic is still transferred to the end users in this encoding.

This was just before the "UTF-8 revolution", where massively people started using UTF-8,
Well it was clear the specific national country text encoding standards will quickly be moved by to UTF-8 – Universal Encoding format which abbreviation stands for (Unicode Transformation Format).

Though UTF-8 was clear to be "the future", many web developers mostly because of their incompetency or using an old sources of learning how to writen in HTML continued to use windows-cp1251 in HTMLs. I'm even convinced, there are still developers out there who are writting websites for Bulgarian / Russian / Macedonian customers using obsolete encodings …

The smarter developers of those accustomed to windows-cp1251, KOI-8R etc. etc., were using the meta tag to specify the type of charset of the web page content with:

<meta http-equiv="content-type" content="text/html;charset=windows-cp1251">

or

<meta http-equiv="content-type" content="text/html;charset=koi-8r">

Anyhow, still many devs even didn't placed the windows-cp1251 in the head of the HTML …

The result for the system administrator is always a mess – a lot of webpages that are showing like unreadable signs and tons of unhappy customers.
As always the system administrator is considered responsible, for the programmer mistakes :). So instead of programmers fix their bad cooking, the admin has to fix it all!

One quick work around me as admin has applied to failing to display pages in Cyrillic using the Windows-cp1251 character encoding was to force windows-cp1251 as a default encoding for the whole virtualhost or Apache directory with Apache directives like:

<VirtualHost *:80>
ServerAdmin some_user@some_host.com
DocumentRoot /var/www/html
AddDefaultCharset windows-cp1251
ServerName the_host_name.com
ServerAlias www.the_host_name.com
....
....
<Directory>
AddDefaultCharset windows-cp1251
>/Directory>
</VirtualHost>

Though this mostly would, work there are some occasions, where only a particular html files from all the content served by Apache is encoded in windows-cp1251, if most of the content is already written in UTF-8, this could be a big issues as you cannot just change the UTF-8 globally to windows-cp1251, just because few pages are written in archaic encoding….
Since most of the content is displayed to the client by Apache (as prior explained) just fine, only particular htmls lets's ay single.html, single2.html etc. etc. are displayed with some question marks or some non-human readable "hieroglyphs".

Below is a screenshot from two pages returned to my browser in wrongly set htmls charset:

Improper Windows CP1251 encoding with Apache set to serve UTF-8 encoding questiomarks

Improper Windows CP1251 delivered page in UTF-8 browser view

Apache returns cp1251 in some non-UTF8 wrong encoding (webserver improperly served cyrillic encoding)

Improperly served encoding CP1251 delivered by Apache in non-utf-8 encoding

When this kind of issues occur, the only solution is to simply login to the server and use iconv command to convert all files returning unreadable content from whatever the non UTF-8 encoding is lets say in my case Bulgarian typeset of cp1251 to UTF-8

Here is how the iconv command to convert between windows-cp1251 to utf-8 the two sample files named single1.html and single2.html

server:/web# /usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single1.html > single1.html.utf8
server:/web# mv single1.html single1.html.bak;
server:/web# mv single1.html.utf8 single1.html
server:/web# /usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single2.html > single2.html.utf8
server:/web# mv single2.html single2.html.bak;
server:/web# mv single2.html.utf8 single2.html

I always, make copies of the original cp1251 encoded files (as you see mv single1.html single1.html.bak), because if something goes wrong with convertion I can easily revert back.

If there are 10 files with consequential numbers naming they can be converted using a short for loop, like so:

server:/web# for i $(seq 1 10); do
/usr/bin/iconv -f WINDOWS-1251 -t UTF-8 single$i.html > single$i.html.utf8;mv single$i.html single$i.html.bak
mv single$i.html.utf8 single$i.html
done

Just as earlier mentioned if single1.html, single2.html … has in the html <head>:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

You should open, each of the files in question and wipe out the line either by hand or use sed to wipe it in one loop if it has to be done for lets say 10 files named (single{1..10})

server:/web# for i in $(seq 1 10); do
sed '/<meta http-equiv="Content-Type" content="text\/html; charset=windows-1251>/d' single$i.txt > single$i.txt.new;
mv single$i.txt single$i.txt.bak;
mv single$i.txt.new single$i.txt

Well now,

How to clean QMAIL mail server filled Queue with spam mail messages

Friday, June 8th, 2012

Sys Admins managing QMAIL mail servers know, often it happens QMAIL queue gets filled with unwanted unsolicated SPAM e-mails due to a buggy WEB PHP / Perl mail sendingform or some other odd reason, like too many bouncing messages ,,,,

For one more time I’ve experienced the huge SPAM destined mails queue on one QMAIL running on Debian GNU / Linux.

= … Hence I needed to clean up the qmail queue. For this there is a little tool written in PERL called qmhandle

A+) Download qmhandle;

Qmhandle’s official download site is in SourceForge

I’ve made also a mirror copy of QMHandle here

qmail-server:~# cd /usr/local/src
qmail-server:/usr/local/src# wget -q https://www.pc-freak.net/files/qmhandle-1.3.2.tar.gz

B=) STOP QMAIL
As it is written in the program documentation one has to be very careful when cleaning the mail queue. Be sure to stop qmail with qmailctl or whatever script is used to shutdown any mail sever in progress operations, otherwise there is big chance the queue to mess up badly .

C#) Check extended info about the mail queue:

qmail-server:/usr/local/src/qmhandle-1.3.2# ./qmhandle -l -c
102 (10, 10/102) Return-path: anonymous@qmail-hostname.com
From: QMAIL-HOSTNAME
To: as1riscl1.spun@gmail.com
Subject: =?UTF8?B?QWNjb3VudCBpbmZvcm1hdGlvbiBmb3IgU09DQ0VSRkFNRQ==?=
Date: 1 Sep 2011 21:02:16 -0000
Size: 581 bytes
,,,,
1136 (9, 9/1136)
Return-path: werwer@qmail-hostname.com.tw
From: martin.georgiev@qmail-hostname.com
To: costador4312@ukr.net
Subject: Link Exchange Proposal / Qmail-Hostname.com
Date: Fri, 2 Sep 2011 07:58:52 +0100 (BST)
Size: 1764 bytes
,,,,....
1103 (22, 22/1103)
Return-path: anonymous@qmail-hostname.com
From: SOCCERFAME
To: alex.masdf.e.kler.1@gmail.com
Subject: =?UTF8?B?QWNjb3VudCBpbmZvcm1hdGlvbiBmb3IgU09DQ0VSRkFNRQ==?=
Date: 2 Sep 2011 00:36:11 -0000
Size: 578 bytes
,,,,,,
,,,,....
Total messages: 1500 Messages with local recipients: 0
Messages with remote recipients: 1500
Messages with bounces: 500
Messages in preprocess: 300

D-) Delete the Queue
qmail-server:/usr/local/src/qmhandle-1.3.2# ./qmHandle -D
......
......

Fina1ly launch the qmail to continue normal oper.

qmail-server:~# qmailctl start
,,,,,
..,,,

How to change Debian GNU / Linux console (tty) language to Bulgarian or Russian Language

Wednesday, April 25th, 2012

Debian has a package language-env. I haven't used my Linux console for a long time. So I couldn't exactly remember how I used to be making the Linux console to support cyrillic language (CP1251, bg_BG.UTF-8) etc.

I've figured out for the language-env existence in Debian Book on hosted on OpenFMIBulgarian Faculty of Mathematics and Informatics website.
The package info with apt-cache show displays like that:

hipo@noah:~/Desktop$ apt-cache show language-env|grep -i -A 3 description
Description: simple configuration tool for native language environment
This tool adds basic settings for natural language environment such as
LANG variable, font specifications, input methods, and so on into
user's several dot-files such as .bashrc and .emacs.

What is really strange, is the package maintainer is not Bulgarian, Russian or Ukrainian but Japanese.
As you see the developer is weirdly not Bulgarian but Japanese Kenshi Muto. What is even more interesting is that it is another japanese that has actually written the script set-language-env contained within the package. Checking the script in the header one can see him, Tomohiro KUBOTA

Before I've found about the language-env existence, I knew I needed to have the respective locales installed on the system with:

# dpkg-reconfigure locales

So I run dpkg-reconfigure to check I have existing the locales for adding the Bulgarian language support.
Checking if the bulgarian locale is installed is also possible with /bin/ls:

# ls -al /usr/share/i18n/locales/*|grep -i bg
-rw-r--r-- 1 root root 8614 Feb 12 21:10 /usr/share/i18n/locales/bg_BG

The language-env contains a perl script called set-language-env which is doing the actual Debian Bulgarization / cyrillization. The set-language-env author is another Japanese and again not Slavonic person.

Actually set-language-env script is not doing the Bulgariazation but is a wrapper script that uses a number of "hacks" to make the console support cyrillic.

Further on to make the console support cyrillic, execute:

hipo@noah:~$ set-language-env
Setting up users' native language environment
by modifying their dot-files.
Type "set-language-env -h" for help.
1 : be (Bielaruskaja,Belarusian)
2 : bg (Bulgarian)
3 : ca (Catala,Catalan)
4 : da (Dansk,Danish)
5 : de (Deutsch,German)
6 : es (Espanol,Spanish)
7 : fr (Francais,French)
8 : ja (Nihongo,Japanese)
9 : ko (Hangul,Korean)
10 : lt (Lietuviu,Lithuanian)
11 : mk (Makedonski,Macedonian)
12 : pl (Polski,Polish)
13 : ru (Russkii,Russian)
14 : sr (Srpski,Serbian)
15 : th (Thai)
16 : tr (Turkce,Turkish)
17 : uk (Ukrajins'ka,Ukrainian)
Input number > 2

There are many questions in cyrillic list necessery to be answered to exactly define if you need cyrillic language support for GNOME, pine, mutt, console etcetera.
The script will create or append commands to a number of files on the system like ~/.bash_profile
The script uses the cyr command part of the Debian console-cyrillic package for the actual Bulgarian Linux localization.

As said it was supposed to also do a localization in the past of many Graphical environment programs, as well as include Bulgarian support for GNOME desktop environment. Since GNOME nowdays is already almost completely translated through its native language files, its preferrable that localization to be done on Linux install time by selecting a country language instead of later doing it with set-language-env. If you failed to set the GNOME language during Linux install, then using set-language-env will still work. I've tested it and even though a lot of time passed since set-language-env was heavily used for bulgarization still the GUI env bulgarization works.

If set-language-env is run in gnome-terminal the result, the whole set of question dialogs will pop-up in new xterm and due to a bug, questions imposed will be unreadable as you can see in below screenshot:

set-language-env command screenshot in Debian GNU / Linux gnome-terminal

If you want to remove the bulgarization, later at certain point, lets you don't want to have the cyrillic console or programs support use:

# set-language-env -r
Setting up users native language environment' 

For anyone who wish to know more in depth, how set-language-env works check the README files in /usr/share/doc/language-env/ one readme written by the author of the Bulgarian localization part of the package Anton Zinoviev is /usr/share/doc/language-env/README.be-bg-mk-sr-uk

For the School-examination

Thursday, January 31st, 2008

Tell me which ideotic government would create a site based on php and would make the serverunder Windows?

Just Guess ours the Bulgarian ministry of Science and Knowledge has started a new site dedicated to helping graduating school pupils with the Future School-examinationthey have to make.

It’s pretty easy to see that just observe:

jericho% telnet zamaturite.bg 80

Trying 212.122.183.208…

Connected to zamaturite.bg.
Escape character is ‘^]’.
HEAD / HTTP/1.0HTTP/1.1 200 OK
Connection: close
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Date: Wed, 30 Jan 2008 19:10:18 GMT
Content-Type: text/html; charset=UTF-8Server: Apache/2.2.6 (Win32) PHP/5.2.5X-Powered-By: PHP/5.2.5Set-Cookie: PHPSESSID=fn5jtjbet7clrapi0a5e5kgvt7; path=/
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Keep-Alive: timeout=5, max=100
Connection closed by foreign host.
jericho%

Just great our Bulgarian government spend money on buying proprietary software OS to run a Free Software based solution.

This example is pretty examplary of what our country looks like. Sad …

END—–

How to add multi language support to wordpress with qTranslate

Monday, October 3rd, 2011

QTRanslate WordPress Language Translate Screenshot 1

Lately, I have to deal with some wordpress based installs in big part of my working time. One of the wordpress sites needed to have added a multi language support.

My first research in Google pointed me to WPML Multilingual CMS The WordPress Multilingual Plugin
WPML Multilingual CMS looks nice and easy to use but unfortunately its paid, the company couldn’t afford to pay for the plugin so I looked forward online for a free alternative and stumbled upon QTranslate

QTranslate is free and very easy to install. Its installed the wordpress classic way and the installation went smoothly, e.g.:

1. Download and unzip QTranslate

# cd /var/www/blog/wp-content/plugins
/var/www/blog/wp-content/plugins# wget http://downloads.wordpress.org/plugin/qtranslate.2.5.24.zip
...
/var/www/blog/wp-content/plugins# unzip qtranslate.2.5.24.zip
...

Just for fun and in case the plugin disappears in future, a mirror of Qtranslate 2.5.24 is found here

2. Enable QTranslate from wordpress admin

Plugins -> Inactive -> qTranslate (Activate)

After activating the plugin, there is a Settings button from which qTranslate‘s various plugin parameteres can be tuned.

qTranslate WordPress translate screenshot 2

In my case my site had to support both English and Arabic, so from the settings I added support for Arabic translation to the wordpress install.

Adding Arabic is done in the following way:

a. From the Language Management (qTranslate Configuration) from the Languages menu and the Languages (Add Languages) I had to choose a language code (in my case a language code of ar – for Arabic). Next I had to choose the Arabic flag from the follow up flag list.

In next text box Name , again I had to fill Arabic, for Locale en_US.UTF-8
The following Date Format and Time Format text boxes are optional so I left them blank.
To complete the process of adding the Arabic as a new language wordpress should support I pressed the Add Language button and the Arabic got added as a second language.

Afterwards the Arabic was added as second language, on the bottom of the left wordpress menu pane a button allowing a switch between English, Arabic appeared (see below screenshot):

MultiLingual WordPress with qTranslate

Finally to make Arabic appear as a second language of choice on the website I added it as a Widget in the Widgets menu from the AWidgets menu:

Appearance -> Widgets

In widgets I added qTranslate Language Chooser to the Sidebar without putting any kind of Title for qtranslate widget .
I found it most helpful to choose the Text and Image as an option on how to display the Language switching in the wp.

Test your web browser compatability with Acid3 test

Wednesday, January 25th, 2012

Acid3 Test is a group of browser compitability tests. Acid3 test is a good indicator on how Web ready is your browser.

Acidtest is part of the web standards project. Latest Firefox 9.0.1 passes the test on 100% (100/100).
I've tried it with Epiphany and it scored only 67/100, still I'm using Epiphany on daily basis and I'm quite happy with it.
Acid3 browser compitability Test Firefox 9.0.1
The tests involved are testing browser for:
 

  • DOM
  • DOM2
  • Checks on HTML tables and forms browser rendering
  • SVG compitability testing
  • DOM1 and DOM2 compitability
  • Various ECMA Script Javascript compitability tests
  • Unicode (UTF-16 and UTF-8) browser compitability
  • XHML, SMIL, CSS, HTML compitability
  • Content-type image/png, text plain etc.

Acid3 browser test fail
The Acid3 test is written itself in Javascript. It consists of 6 testing "stages" (buckets) upon which the browser tested is evaluated.
Each of the test is represented visually by a rectangle. If the a test stage is passed you see a new rectangle appearing in the tested browser.
In wikipedia, there is a thorough list with web browsers by type and engine and the level of support for the Acid3 test.
The test is of great use if you're web developer.

Fix audioCD play problems with VLC on GNU Linux

Monday, January 23rd, 2012

I've not played audio CD for ages. Anyways I had to set up one computer with Linux just recently and one of the requirements was to be able to play audiocds.
I was surprised that actually a was having issue with such as simple tasks.
Here is how i come with this article.

If you encounter errors playing Audio CDs on any Linux distro in VLC or other players, you might need to apply the following fix.

root@xubuntu-desktop:~# apt-get install xubuntu-restricted-extras
...
root@xubuntu-desktop:~# apt-get install ubuntu-restricted-extras
...

I'm not sure if this packages are required, anyways having them installed is a good idea especially on computers which will have to support as much multimedia as possible.

Trying to play a CD with VLC the result was not nice, you see in the picture above the error that poped up while trying it with VLC:

Due to wrong configuration of the play device VLC will be looking to read the audio cd from.

To succesfully play the audiocd invoke VLC command with a cdda///dev/sr0 argument like so:

hipo@xubuntu-desktop:~$ vlc cdda:///dev/sr0
...

To permanently fix the error you will have to edit ~/.config/vlc/vlcrc :

Inside ~/.config/vlc/vlcrc find the lines:

dvd=/dev/cdrom

Substitute the above line with:

dvd=/dev/sr0

Next find the line:

vcd=/dev/cdrom

Change the above line with:

vcd=/dev/sr0
Due to a bug in generating vlcrc , the dvd= might be set also to other messy unreadable characters (different from /dev/cdrom). This can also be the reason why it fails to properly read the disc.

If dvd= and vcd is set to a different unreadable characters delete them and substitute with /dev/sr0 .I've experienced this on Xubuntu Linux with a Bulgarian localization (probably the bug can be seen in other Linuxes when GNOME is installed in Russian, Chineese and other UTF-8 languages.

The strange error can be observed also in other players when the localization is set to someone's native language …
Alternative solution is to install and use rhythmbox instead of VLC.
Other program to play audio CDs called workman , you will have to get used to the interface which uses gtk1 and therefore obsolete. Putting aside the ugly interface it works 😉

How to convert UTF-8 encoding files to Windows CP1251 on GNU / Linux

Friday, October 21st, 2011

I needed to convert a file which had a Bulgarian text written in UTF-8 encoding to Windows CP1251 in order to fix a website encoding problems after a move of the website from one physical server to another.

I tried first with enca( detects and convert encoding of text files from one encoding to another).

The exact way I tried to convert was:

linux:~# enca -L bg /home/site/www/includes/utf8_encoded_file.php
...
Unfortunately this attempt to conver was unsucesfully, and the second logical guess was to use iconvConvert encoding of given files from one encoding to another to do the utf8 to cp1251 conversion.
I reached for some help in irc.freenode.net, #varnalab channel and Alex Kuklin helped me, giving me an example command line to do the conversion.
iconv winedows to cp1251 conversion line, he pointed to me was:

linux:~# iconv -f utf8 -t cp1251 < in > out

Further on I adapted Alex’s example to convert my utf8_encoded_file.php encoded Bulgarian characted to CP1251 and used the following commands to convert and create backups of my original UTF8 file:

linux:~# cd /home/site/www/includes
linux:/home/site/www/includes# iconv -f utf8 -t cp1251 < utf8_encoded_file.php in > utf8_encoded_file.php.cp1251
linux:/home/site/www/includes# mv utf8_encoded_file.php utf8_encoded_file.php.bak
linux:/home/site/www/includes# mv utf8_encoded_file.php.cp1251 utf8_encoded_file.php