NebuPookins.net - NP-Complete - Microsoft AppLocale, and an introduction of Unicode

Microsoft AppLocale, and an introduction of Unicode

Fri March 18th, 2005, 12:25 AM EST | 0 comments

I recently discovered this great tool for running code-page based applications on Windows XP. But before I get into that, I have to explain what code-page based applications are (cue groaning).

So computers store everything as numbers, right? That means that text is stored as numbers too. One way to do this is to let the number 1 represent the character 'a', the number 2 represent the character 'b', and so on, with 26 representing 'z'. What if you want to represent uppercase characters too? Well, we could make 'A' be 27, 'B' be 28, and so on. Or we could make positive numbers be uppercase and negative numbers be lowercase (e.g. 'A' = 1, 'a' = -1, 'B' = 2, 'b' = -2, and so on). And what about the punctuation symbols? As you can see, there are many ways to do this.

Now when you save your text files to disk, they're stored on the disk as numbers as well. That means that if you want to save a text file with the contents "aBc", it might be stored as the sequence of numbers 1,28,3 using the first of the schemes proposed above. Then you bring that disk to your friend's computer and try loading it up there, and your friend's computer might interpret that as being "A?C", because it uses the encoding system that says positive numbers are uppercase, and in that system 28 doesn't stand for any letter at all. As you can see, this leads to a big mess, which is why ASCII was invented.
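Here's a rough Python sketch of that mix-up. The function names (encode_scheme1, decode_scheme2) are made up for illustration, and showing '?' for a number with no meaning is just my assumption about what the friend's computer would do.

```python
import string

def encode_scheme1(text):
    """Scheme 1 from the text: 'a'..'z' are 1..26, 'A'..'Z' are 27..52."""
    numbers = []
    for ch in text:
        if ch in string.ascii_lowercase:
            numbers.append(string.ascii_lowercase.index(ch) + 1)
        elif ch in string.ascii_uppercase:
            numbers.append(string.ascii_uppercase.index(ch) + 27)
        else:
            raise ValueError(f"scheme 1 has no number for {ch!r}")
    return numbers

def decode_scheme2(numbers):
    """Scheme 2 from the text: positive is uppercase, negative is lowercase."""
    chars = []
    for n in numbers:
        if 1 <= n <= 26:
            chars.append(string.ascii_uppercase[n - 1])
        elif -26 <= n <= -1:
            chars.append(string.ascii_lowercase[-n - 1])
        else:
            chars.append("?")  # assumption: unknown numbers show up as '?'
    return "".join(chars)

stored = encode_scheme1("aBc")    # what gets written to disk
misread = decode_scheme2(stored)  # what the friend's computer shows

print(stored)   # [1, 28, 3]
print(misread)  # A?C
```

The numbers on the disk never change. Only the rule used to read them does.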

ASCII stands for American Standard Code for Information Interchange, and you can see the full ASCII table at http://www.lookuptables.com/. In ASCII, 'a' is 97 and 'A' is 65. The first few characters are various control characters for computers (8 is backspace, 9 is tab, for example). This was great for Americans, whose alphabet has only 26 letters, each with a distinct uppercase and lowercase version, plus 10 numerals and a handful of punctuation marks.

This is not so great for other languages. There are lots of characters that cannot be represented with ASCII. The Greek alphabet cannot be represented, absolutely zero Chinese or Japanese characters can be represented, and there are probably lots of other languages (French, Russian, Arabic, etc.) I don't know very much about which don't use the same alphabet as English. So what do computers in Japan do? Why, they use a different code page of course!
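You can poke at this yourself in Python, which exposes the numbers through its built-in ord() and chr(). This is only a small sketch: fits_in_ascii is a helper name I made up, and the sample words are my own picks.

```python
# The ASCII values quoted above.
print(ord("a"), ord("A"))  # 97 65
print(repr(chr(9)))        # '\t' (tab)

def fits_in_ascii(text):
    """Return True if every character in text has an ASCII number (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Sample words chosen for illustration only.
for sample in ["hello", "αβγ", "日本語", "français", "привет"]:
    print(sample, fits_in_ascii(sample))
```

Only "hello" passes. Even the French word fails, because of the single 'ç'.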

Now I don't really know much about code pages, and there isn't much information on them online as far as I can find, because code pages are pretty much obsolete with the introduction of Unicode, but from what I understand, code pages are simply other standards like ASCII. There's a code page for Japan, for example, which says how to translate the numbers into the appropriate kana and kanji, and which is incompatible with the code pages used elsewhere. Essentially we get into the mess I outlined above with sharing files with your friend who uses a different standard. To fix this mess, Unicode was invented.
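To make the code page mess concrete, here is a small Python sketch. It assumes Python's "cp932" codec (the Windows Japanese code page) and "cp1252" codec (the Windows Western European one) are reasonable stand-ins for the two computers. The variable names are mine.

```python
JAPANESE = "cp932"   # Windows code page for Japanese
WESTERN = "cp1252"   # Windows code page for Western European languages

text = "日本語"
stored = text.encode(JAPANESE)  # the numbers that end up on disk

# The same numbers, read back under each code page.
read_in_japan = stored.decode(JAPANESE)
read_in_west = stored.decode(WESTERN, errors="replace")

print(list(stored))    # the raw numbers
print(read_in_japan)   # 日本語
print(read_in_west)    # gibberish made of accented letters and punctuation
```

Nothing on the disk is corrupted here. The second computer just applies a different table to the same six numbers.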

Unicode is supposed to be a global standard that has a number for every distinct character in every language possible. So, just for example, the English character 'a' is the number 97, the Japanese hiragana 'あ' is the number 12,354, and so on. Building the Unicode standard was difficult, not only for technical reasons (finding out what all the possible characters in all the possible languages are), but also for political and linguistic reasons. There were many disagreements on what constituted a character, and which characters were equivalent. For example, in English, the characters 'a' and 'A' are generally agreed to be distinct concepts, one being the lowercase form and the other being the uppercase form (note that the Japanese writing system, for example, doesn't really have a concept of uppercase or lowercase, so to a Japanese person unfamiliar with English, this might be a very foreign concept that needed to be explained for the Japanese to agree on the design of Unicode). However, a capital A written in cursive and a capital A written in print are considered, conceptually, to be equivalent characters. So far, not too many difficulties.
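Python strings are Unicode, so the numbers are easy to look up. A quick sketch using the standard unicodedata module, which also reports the official name Unicode gives each character:

```python
import unicodedata

for ch in ["a", "A", "あ"]:
    number = ord(ch)
    # Print the character, its number, the usual U+XXXX notation, and its name.
    print(ch, number, f"U+{number:04X}", unicodedata.name(ch))

# a 97 U+0061 LATIN SMALL LETTER A
# A 65 U+0041 LATIN CAPITAL LETTER A
# あ 12354 U+3042 HIRAGANA LETTER A

print(chr(12354))  # あ
```

Note that 'a' and 'A' keep the numbers they had in ASCII, so plain ASCII text keeps the same numbers under Unicode.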

English is special in that the number of characters is preserved when converting from uppercase to lowercase. If a word has 5 characters when written in lowercase, then when you write it in uppercase, it still has 5 characters. Furthermore, there is a one-to-one correspondence between uppercase and lowercase characters, meaning there's only one way to convert from uppercase to lowercase and vice versa. You could imagine an alphabet where this was not so. Perhaps lowercase 'c' can be converted to either uppercase 'C' or uppercase 'D', so that "cat", converted to uppercase, would become "DAT", which converted back to lowercase becomes "dat".

This actually happens in German: from what I understand, the German language mostly uses the Latin alphabet (that's the alphabet English uses), but adds a character ß which has two uppercase versions, SS and SZ, depending on the word it is in. Does that mean SS should be given its own number, or should SS be represented by the number for S listed twice? The general consensus is for the latter, but now, when a computer sees an SS and wants to make it lowercase, how does it know when to turn it into ss and when to turn it into ß? For this reason, some people on the Unicode standards committee argued that the German character 'S' should be distinct from the English character 'S' (though in the end, they were made the same, as far as I know), because the two 'S' characters behave differently.
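Python's built-in case conversion shows the problem nicely. This sketch only shows what Python does by default, which is to uppercase ß as SS. It is not the only possible answer, and the example word is my own.

```python
word = "straße"             # example word, chosen for illustration
upper = word.upper()        # ß has no single-character uppercase here
round_trip = upper.lower()  # the computer cannot know SS used to be ß

print(word, len(word))                 # straße 6
print(upper, len(upper))               # STRASSE 7
print(round_trip, round_trip == word)  # strasse False
```

The word gains a character on the way up and never gets its ß back on the way down.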

This "equivalence" of characters is a tricky one, because the Russian character 'В' looks just like the English character 'B', but the Russian character is pronounced the way English people would pronounce the English character 'V', whereas the English character 'B' is pronounced the way a Russian person would pronounce the Russian character 'Б'. The general consensus is that the English character 'B' and the Russian character 'В' are two distinct characters, and should be given different numbers.

This, by the way, is the vulnerability used to attack SSL mentioned last month. I could make a fake site, and call it "pаypаl.com" instead of "paypal.com". It's pretty much imperceptible to humans, but each 'а' in the first domain name is the Russian 'а', while each 'a' in the second domain name is the English 'a'. The two characters have different numbers, and so the two domains point to different sites.
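A computer has no trouble telling the two names apart, and a few lines of Python can too. In this sketch the fake name is spelled out with \u0430 escapes so there's no doubt which 'а' is which. describe_non_ascii is a helper name I made up.

```python
import unicodedata

real = "paypal.com"
fake = "p\u0430yp\u0430l.com"  # \u0430 is the Russian (Cyrillic) 'а'

def describe_non_ascii(text):
    """List (position, Unicode name) for every character outside ASCII."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text) if ord(ch) > 127]

print(real == fake)              # False
print(describe_non_ascii(real))  # []
print(describe_non_ascii(fake))
# [(1, 'CYRILLIC SMALL LETTER A'), (4, 'CYRILLIC SMALL LETTER A')]
```

A check like this is one way a program could flag a suspicious name, though I'm not claiming it's what any particular browser does.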

But this is all a digression, just to illustrate how complex Unicode actually is, and the difficulty with creating the Unicode standard (by the way, Unicode's goal of encompassing all known languages is taken seriously enough that there was even a formal proposal to include the alphabet of the fictional language Klingon, though it was rejected). The real purpose of this post was to talk about Microsoft's AppLocale.

I have some old Japanese games that still use code pages. I cannot run them unless I tell Windows to set the default code page to Japanese, and to do that, I have to reboot the computer. And then, once I've done that, I won't be able to play old English games which still use code pages, so I'd have to reboot again! New games that use Unicode work no matter what Windows has set the default code page to. What I had resorted to was to build an entirely new computer, run Japanese Windows XP on it, and simply use that computer whenever I wanted to play (old) Japanese games, and use my English computer whenever I wanted to play (old) English games. (Actually, I used Microsoft VirtualPC to emulate a whole new computer, and installed Japanese Windows XP on it.) This was inconvenient, but not as inconvenient as having to reboot my computer every time I wanted to play a certain game.

Enter Microsoft AppLocale. What it does is create a wrapper around a certain program so that the program thinks it's running in one code page (or "locale", as Microsoft calls it), while the rest of your computer might be running in a different code page. The way it works is I run AppLocale, and it asks me what locale I want to simulate. I choose 日本語 (which means "Japanese language"), and then it asks me what program I want to run in the 日本語 locale. I specify my game. Presto-Magicko, the game runs as if it were on a Japanese install of Windows XP, but without me having to reboot. In fact, I could play a Japanese, an English and a Russian game all at the same time (if I had a Russian game, which I don't).

Microsoft AppLocale is free, but is given without warranty, meaning if it fucks up your system, Microsoft won't help you. But that's just a standard disclaimer on pretty much all free software (and on a lot of non-free software too) just to protect them from getting sued. AppLocale has worked wonderfully for me, and I'm sure it'll work wonderfully for you too (disclaimer: I take no responsibility if AppLocale fucks up your system).
