Tuesday, October 7, 2008

Using Iconv To Convert Character Sets On Linux And Unix

Hey There,

Today's topic was obliquely referenced in yesterday's post on using the online multilingual dictionary from the Bash CLI. Yes, after I woke up around 1 am and thought about it, the results did seem odd. It turns out, when I went to the page, that the multilingual dictionary on reference.com is actually housed within a frameset under the online Thesaurus. My belated apologies for getting the name incorrect, although, if you think about it, it's not even a multilingual dictionary (not really). If you enter a word in English, get the definition in English and then get the word spelled for you in 30 different languages, that's more like an online dictionary with a keyword translator. If it were truly a multilingual dictionary, I would expect to get definitions in 30 different languages. Perhaps I want for too much at some times and not enough at others. Probably a good mix of the two ;)

Today, we're going to look at character conversion, since, if you looked at the picture in yesterday's bash script post, you probably noticed that a lot of the foreign characters came up looking like garbage. Ok; maybe not garbage, but certainly not the way they were intended to be shown. All manner of inflective symbols and more picturesque scripts (Cyrillic, etc.) just don't show up well on a standard U.S. computer.

There are, as I see it, two basic ways to get around this.

1. Install every font set and allow for every single type of encoding available. If your OS supports it, you should now be able to read Japanese Kanji and Russian, German, French, etc with the proper characters displayed. Still, I'm not sure there's any guarantee that this would always work, or that the benefits of keeping a huge cache of fonts and displays would end up outweighing the burden.

2. Use a program called "iconv" (present on most Linux and Unix boxes nowadays), and convert when necessary.

I'll assume we want to proceed with option 2 and check out a simple example. But, before we do that, I'd like to point out that, from distro to distro, the implementations of the iconv program can vary greatly. For a striking difference, consider Sun's native version vs. Gnu's. In Sun's version from 2.6 (Perhaps not a fair comparison, but their native client hasn't come up all that much either), you only have two flags to choose from (actually, you have to use them both): -f (the "convert from" character set) and -t (the "convert to" character set). Don't know what character sets you have installed? Your issue ;)
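
Since those two flags are the common ground, here's a minimal sketch that sticks to just them. The charset names "ASCII" and "UTF-8" are the way Gnu's iconv spells them and are an assumption here; other implementations may use different names for the same character sets.

```shell
#!/bin/sh
# Convert a short ASCII string to UTF-8 using only the two basic flags:
# -f names the charset to convert from, -t the charset to convert to.
out=$(printf 'hello\n' | iconv -f ASCII -t UTF-8)
echo "$out"
```

Because ASCII text is already valid UTF-8, the output should look exactly like the input. It's only meant to show the shape of the command.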

In the Gnu version, you've got those options (plus the long versions of both -- guess what those are ;), a few other options to suppress errors and gloss over characters that can't be decoded and, best of all, a "--list" option that will list all the character sets installed on your system. This list can be very useful if you want to convert to or from a character set but aren't sure if you can. With Gnu's iconv, you'll know in a few seconds without having to look anywhere else (For Sun 2.6, and later, you can find the definitions under /usr/lib/locale. If you do a "man" on "charmap" it's referring to the charmap source files in there - There may or may not be a standard directory like /usr/share/i18n/charmaps on your Sun system). For useful information you can actually work with (on a less-than-abstract level) do a "man" on "localedef" and/or "locale." Didn't want to leave you with nothing on that side of the coin ;)
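
To give a rough idea of what "glossing over" looks like, here's a small sketch using Gnu iconv's -c flag, which drops characters that can't be represented in the target character set. The sample string ("café" in UTF-8) and the variable names are made up for illustration.

```shell
#!/bin/sh
# "café" encoded as UTF-8; the é (bytes 0xC3 0xA9) has no ASCII equivalent.
sample='caf\303\251\n'

# Without -c, the conversion should stop with an error at the é.
if printf "$sample" | iconv -f UTF-8 -t ASCII >/dev/null 2>&1; then
    echo "plain conversion: succeeded"
else
    echo "plain conversion: failed"
fi

# With -c, characters that cannot be converted are left out of the output.
# Some versions still exit non-zero when they drop something, hence "|| true".
cleaned=$(printf "$sample" | iconv -c -f UTF-8 -t ASCII 2>/dev/null || true)
echo "with -c: $cleaned"
```

Keep in mind that -c throws data away without telling you, so it's best saved for cases where losing a character here and there really doesn't matter.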

Here's a simple, but visual enough, way to show you what iconv can do. This example isn't really "practical," except insofar as it shows you proof from a few different angles that "iconv" actually can do what it says.

First, we'll check out our test file and make sure it's good to go:

host # ls
regular.ascii
host # file regular.ascii
regular.ascii: ASCII text
host # cat regular.ascii
This is a file created using our standard charset
host # cat -v regular.ascii
This is a file created using our standard charset


Then, we'll check the version of iconv (We want Gnu) to make sure we can work as efficiently as possible with as little charmap/locale/charset knowledge as possible:

host # iconv --version
iconv (GNU libc) 2.3.2... <-- Good :)
host # iconv --list
The following list contain all the coded character sets known. This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters. One coded character set can be
listed with several different names (aliases).

437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5...


There are 50 or so more lines of available character sets we can convert from and to. Just run that command anytime you need to check out the entire list. This is great because now we don't have to do any extra work to figure out that part.
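
If you'd rather not eyeball the whole list, something like the following sketch can do the looking for you. Both function names are hypothetical helpers (not part of iconv). Since the header above warns that not every listed name works in every FROM/TO combination, the second helper simply tries the pair on empty input.

```shell
#!/bin/sh
# Both function names below are hypothetical helpers, not part of iconv.

# listed NAME: loose, case-insensitive substring search of the charset list.
# The layout of the list differs between iconv versions, so nothing fancier
# than a fixed-string match is attempted.
listed() {
    iconv --list | grep -iF -- "$1" >/dev/null
}

# can_convert FROM TO: try the pair on empty input; iconv exits non-zero
# if it does not support the combination.
can_convert() {
    printf '' | iconv -f "$1" -t "$2" >/dev/null 2>&1
}

if listed UTF-16; then
    echo "UTF-16 is in the list"
fi

if can_convert ASCII UTF-16; then
    echo "ASCII -> UTF-16 is supported"
else
    echo "ASCII -> UTF-16 is not supported here"
fi
```

The trial conversion only needs -f and -t, so it should also work on versions of iconv that don't have "--list" at all.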

Then we'll move on to converting the regular ASCII file to a UTF-16 file. We could convert it to UTF-8, but plain ASCII text is already valid UTF-8, so the "file" command wouldn't consider the two "different" enough (and neither would your eyes) and we want this to stand out. Also, UTF-16 has some fun side effects: "file" no longer recognizes the line endings, and the text is unsearchable by standard "grep." Notice the output from the "file," "grep," "cat" and "cat -v" commands once we've finished with the conversion and do our comparisons:

host # iconv --from-code ASCII --to-code UTF-16 --output regular.utf16 regular.ascii
host # ls
regular.ascii regular.utf16
host # file regular.ascii regular.utf16
regular.ascii: ASCII text
regular.utf16: Little-endian UTF-16 Unicode character data, with no line terminators
host # grep file regular.ascii regular.utf16
regular.ascii:This is a file created using our standard charset
host # cat regular.ascii
This is a file created using our standard charset
host # cat regular.utf16
ÿþThis is a file created using our standard charset
host # cat -v regular.ascii
This is a file created using our standard charset
host # cat -v regular.utf16
M-^?M-~T^@h^@i^@s^@ ^@i^@s^@ ^@a^@ ^@f^@i^@l^@e^@ ^@c^@r^@e^@a^@t^@e^@d^@ ^@u^@s^@i^@n^@g^@ ^@o^@u^@r^@ ^@s^@t^@a^@n^@d^@a^@r^@d^@ ^@c^@h^@a^@r^@s^@e^@t^@
^@
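
If you're curious why the output looks like that, a hex dump makes it plainer: the "M-^?M-~" at the front is a two-byte byte-order mark, and every "^@" is a zero byte paired with an ASCII character, which is why "grep" can't find the word "file" in there. Here's a rough sketch that rebuilds the test file in a scratch directory (so it won't touch your real files), dumps the first few bytes with "od" and then searches the text by letting iconv decode it first.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)
printf 'This is a file created using our standard charset\n' > "$tmpdir/regular.ascii"
iconv -f ASCII -t UTF-16 "$tmpdir/regular.ascii" > "$tmpdir/regular.utf16"

# Hex dump of the first 16 bytes: the byte-order mark, then each ASCII
# character paired with a zero byte.
od -A d -t x1 -N 16 "$tmpdir/regular.utf16"

# grep cannot match "file" in the raw UTF-16 bytes, but it can once iconv
# has decoded the file on the way in.
match=$(iconv -f UTF-16 -t ASCII "$tmpdir/regular.utf16" | grep file)
echo "$match"

rm -rf "$tmpdir"
```

The exact bytes "od" shows (and their order) depend on your iconv and your machine, so don't be surprised if yours doesn't line up with the "cat -v" output above.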


Man, did we make a mess of things ;) Now, we'll put everything back the way it was (hopefully) and run the same tests:

host # iconv --from-code UTF-16 --to-code ASCII --output regular.new regular.utf16
host # ls
regular.ascii regular.new regular.utf16
host # file regular.ascii regular.new
regular.ascii: ASCII text
regular.new: ASCII text
host # grep file regular.ascii regular.new
regular.ascii:This is a file created using our standard charset
regular.new:This is a file created using our standard charset
host # cat regular.ascii
This is a file created using our standard charset
host # cat regular.new
This is a file created using our standard charset
host # cat -v regular.ascii
This is a file created using our standard charset
host # cat -v regular.new
This is a file created using our standard charset


And, voila, we've got everything back. "file" returns the same information, both files can be catted again and "grep" works on the file that we converted back from UTF-16 to ASCII.
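
If you don't want to trust your eyes on that, "cmp" will compare the original and the round-tripped copy byte for byte. Here's a minimal sketch that repeats the whole round trip in a scratch directory; the file names just mirror the ones used above.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)
printf 'This is a file created using our standard charset\n' > "$tmpdir/regular.ascii"

# Round trip: ASCII -> UTF-16 -> ASCII.
iconv -f ASCII -t UTF-16 "$tmpdir/regular.ascii" > "$tmpdir/regular.utf16"
iconv -f UTF-16 -t ASCII "$tmpdir/regular.utf16" > "$tmpdir/regular.new"

# cmp -s prints nothing and exits 0 only if the two files are byte-identical.
if cmp -s "$tmpdir/regular.ascii" "$tmpdir/regular.new"; then
    result="identical"
else
    result="different"
fi
echo "round trip: $result"

rm -rf "$tmpdir"
```

This only holds because every ASCII character exists in UTF-16. A round trip through a character set that can't represent everything in the original won't come back clean.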

That, in a nutshell, is what "iconv" can do for you. The possibilities are endless, as (of course) is the number of problems you can create for yourself ;)

Cheers,

, Mike



