Shift-JIS & UTF-8

Brief comparison of two encodings for Japanese language: Shift-JIS and UTF-8, by Hiroshi

First of all, the purpose of this article is to explain the difference between the encodings for the Japanese language, especially Shift-JIS and UTF-8, to non-technical users who already understand some technical issues about encoding, for example, why such encodings are necessary and why they exist.

Second, you do not need to know Japanese to understand the article, although some knowledge of it would help a lot. Knowledge of Chinese might help as well, but I focus on Japanese, since it is my native language and I have much experience, and many trials, using it on computers.

What are the encodings for Japanese?

There are several encodings which can handle the Japanese language. Here is a list of them.

Encoding    OS                                      Used for
JIS         -                                       Email
Shift-JIS   Mac OS 9.x, Windows 98                  Mac/Windows applications
EUC         *nix systems, such as Linux or *BSD     Perl, PHP
UTF-8       Mac OS X                                XML, Java, C#
UTF-16      Windows XP                              -

As you can see, several encodings are available for Japanese. Until recently, we only needed to discuss three of them: JIS, Shift-JIS, and EUC. Then a new encoding arrived: Unicode. In recent years, the popularity of Unicode, especially UTF-8, has been rising.

At this moment, I summarize the difference as follows:

JIS
  For emails. Actually, this is the de-facto standard encoding for emails.
Shift-JIS, EUC
  For localization. Both of them encode most Japanese letters in 2 bytes.
Unicode, UTF-8
  To describe all languages using the same encoding.

Next, I will explain the details of the two popular encodings: Shift-JIS and UTF-8.

Why Shift-JIS?

Computers first supported the letters used in English, which number no more than 100 including signs such as ? or !. This set is ASCII. However, ASCII cannot even handle German letters with umlauts or accented French letters. Some of you must have experience writing your Web pages using escape sequences like &auml; (ä) or &eacute; (é), because if the encoding is not correctly set, these letters cannot be shown correctly on screen. From this example, you can see that as long as you use the letters used in English, your text can be read anywhere, but other letters cannot.

With 1 byte, you can describe 256 letters, which is more than enough for English speakers and still leaves some room for accented French letters or umlauts. So, the Latin-1 encoding was created.
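To make the 1-byte limit concrete, here is a minimal sketch in Python (my choice for illustration only). It assumes Python 3 with its built-in `ascii` and `latin-1` codecs and the standard `html` module; the helper name `fits_in` is made up for this example.

```python
import html


def fits_in(text, encoding):
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ASCII has no slot for an umlaut; Latin-1 stores it in a single byte.
ascii_ok = fits_in("ä", "ascii")
latin1_ok = fits_in("ä", "latin-1")
latin1_bytes = "ä".encode("latin-1")

# HTML escape sequences spell the same letters with ASCII characters only.
decoded_entities = html.unescape("&auml; &eacute;")

# The other direction: replace anything outside ASCII by a numeric reference.
ascii_safe = "é".encode("ascii", errors="xmlcharrefreplace")

print(ascii_ok, latin1_ok, latin1_bytes, decoded_entities, ascii_safe)
```

The escape sequences work everywhere precisely because they only use letters that ASCII already has.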

Yet, this is not enough for the Japanese language. Japanese has Hiragana (about 50 letters), Katakana (about 50 letters as well), and Kanji (which means "Chinese letters"). In total, at least 1900 letters must be supported. With 1 byte this is impossible, so 2 bytes are used, which allows approximately 65000 letters. Shift-JIS and EUC use 2 bytes to handle most Japanese letters.
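The arithmetic and the 2-byte codes can be checked with a short Python sketch. It assumes Python 3's built-in `shift_jis` and `euc_jp` codecs; the variable names are only illustrative.

```python
# One byte can take 2**8 different values, two bytes 2**16.
one_byte_values = 2 ** 8
two_byte_values = 2 ** 16

hiragana_a = "あ"  # the Hiragana letter "a"

# Both Japanese encodings spend two bytes on this letter,
# but they assign it different byte values.
sjis = hiragana_a.encode("shift_jis")
euc = hiragana_a.encode("euc_jp")

# Letters used in English still take a single byte in Shift-JIS.
latin_letter = "A".encode("shift_jis")

print(one_byte_values, two_byte_values)
print(sjis.hex(), euc.hex(), latin_letter.hex())
```

Because the two encodings give the same letter different bytes, a file written in one cannot simply be read as the other.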

So, with Shift-JIS, Japanese can finally be used on computers. But problems still remain. For example, some letters overlap. Imagine this situation: you use the DOS prompt, and it shows C:\ (backslash), but on the Japanese edition of Windows, which is based on the Shift-JIS encoding, it looks like C:¥ (yen sign). The same happens with the trademark (™) or copyright (©) signs defined in Western encodings such as Latin-1 and its Windows variant, and this also means that the backslash (\), for instance, cannot be written using Shift-JIS. Needless to say, Japanese and accented French cannot co-exist in one document at all if you use Latin-1 or Shift-JIS.

There are some other problems, but I am not going to explain each of them. The fact is that even though Shift-JIS has some problems, it used to be the de-facto standard encoding for personal computers such as Windows 98/ME or Macintosh (up to OS 9.x). Even today, there is plenty of software which uses Shift-JIS as its internal encoding. Many Web pages are written in Shift-JIS, because those Web pages are written on Macintosh or Windows.

In summary, Shift-JIS is a solution built purely for localization.
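The co-existence problem can be sketched in Python as follows. This assumes Python 3's built-in `latin-1` and `shift_jis` codecs; `try_encode` is a hypothetical helper written for this example.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if some character has no code."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


mixed = "café と 日本語"  # accented French and Japanese in one string

# Each localized encoding handles its own language...
french_only = try_encode("café", "latin-1")
japanese_only = try_encode("日本語", "shift_jis")

# ...but neither of them can hold the mixed document.
as_latin1 = try_encode(mixed, "latin-1")
as_sjis = try_encode(mixed, "shift_jis")

print(french_only, japanese_only, as_latin1, as_sjis)
```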

Why UTF-8/Unicode?

As CPUs got faster and hard disks got cheaper, computers came to have enough speed and resources for handling multi-byte letters. At the same time, a need arose for mixing different languages in a single document. So, Macintosh used its own multilingual encoding, and so did Windows.

However, a situation in which each OS uses a different encoding for multilingual text is not desirable. Therefore, Unicode was created. The primary purpose of Unicode is to describe all the languages in the world using a single encoding.

UTF-8 is one of the encoding forms of Unicode. It uses the same codes as ASCII for 1-byte letters, and multiple bytes for Japanese, Korean, or Chinese letters, for example. It is the default encoding for Java, XML, and so on. Mac OS X uses UTF-8 as its internal encoding as well, while Windows 2000 and XP use UTF-16 (as far as I know; please let me know if I am wrong).
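A small Python sketch shows how UTF-8 keeps the ASCII codes and spends more bytes on other letters. It assumes Python 3's built-in `utf-8` and `utf-16-le` codecs; the variable names are only illustrative.

```python
# A letter used in English has the same single byte in ASCII and UTF-8.
ascii_bytes = "A".encode("ascii")
utf8_ascii = "A".encode("utf-8")

# A Japanese letter takes several bytes in UTF-8...
utf8_kana = "あ".encode("utf-8")

# ...and two bytes in UTF-16 (little-endian, without a byte order mark).
utf16_kana = "あ".encode("utf-16-le")

# English, French, Japanese, and Korean letters can share one UTF-8 string.
mixed = "Aéあ한".encode("utf-8")

print(utf8_ascii, utf8_kana.hex(), utf16_kana.hex(), len(mixed))
```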

Here are the advantages of Unicode:

  • You can write any letter defined by Unicode in your document, and other people have no trouble opening it.
  • Programs do not need to determine the encoding when opening files.
  • You can use multi-byte letters in filenames, and others have no trouble using them. Under a localized system, this can cause trouble.
  • Normally, if a program is written for single-byte letters, somebody must localize it and add multi-byte capability. This can include rewriting the code, which is tough if the programmer has no idea how other languages work on computers.

In short, the purpose of Unicode is to offer a single framework for encoding and to clean up the current mess of encodings. In addition, Unicode improves the portability of documents as well, because it is platform independent.
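The point about determining the encoding can be illustrated with a rough Python sketch of the guesswork a program needs when files may arrive in several encodings. The function name `decode_with_guess` and the candidate order are assumptions for this example, not a recommended detection method.

```python
def decode_with_guess(data, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Try each candidate encoding in order; return (text, encoding)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# The same word saved by a localized application and by a Unicode one.
legacy_file = "日本語".encode("shift_jis")
unicode_file = "日本語".encode("utf-8")

print(decode_with_guess(legacy_file))
print(decode_with_guess(unicode_file))
```

Such guessing is fragile, because some byte sequences are valid in more than one encoding. If every file is UTF-8, the loop disappears and a single decode call is enough.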

Problems still remain…

You might think that UTF-8 is the ultimate answer to the current situation; yet, I have to underline some problems.

First, each application must also be Unicode-savvy. In fact, there are plenty of applications which use Shift-JIS internally. In this case, UTF-8 can be used only for the letters defined in Shift-JIS.

Second, fonts must be determined dynamically by the application or OS. Because no single font contains all the letters defined by UTF-8, applications must switch fonts dynamically. For example, if you use Eudora (an email client), you can specify only one font for the preview window. In this case, you cannot read all the letters, even if the encoding is UTF-8.
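What happens inside such an application can be simulated with a short Python sketch. `through_shift_jis_app` is a hypothetical stand-in for an application that stores its text as Shift-JIS internally; it assumes Python 3's built-in `shift_jis` codec, and real applications may fail in other ways.

```python
def through_shift_jis_app(text):
    """Simulate an application whose internal encoding is Shift-JIS."""
    # Letters without a Shift-JIS code are replaced by "?" on the way in.
    internal = text.encode("shift_jis", errors="replace")
    return internal.decode("shift_jis")


# Japanese text survives the trip unchanged.
kept = through_shift_jis_app("日本語")

# Accented French and Korean letters are lost, even though the input
# string itself could be saved as UTF-8 without any problem.
damaged = through_shift_jis_app("café 한국어 日本語")

print(kept)
print(damaged)
```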

The biggest problem of UTF-8 is that some letters are defined as the same letter although they are different. For example, some Kanji (which means "Chinese letters" in Japanese, as I mentioned before) have an older variant and a current variant. Unicode cannot distinguish them, because both are defined as the same letter although we consider them different. When you create a list of classmates in a school in Japan, some people might use old variants in their names, which cannot be handled by UTF-8. In this case, those people must give up using the old variants, or sometimes they must give up writing their names in Kanji and substitute Hiragana or Katakana. A nation-wide database of all Japanese people using pure UTF-8 is out of the question.

Some existing examples are the Japanese pop star Tsuyoshi Kusanagi and the department store Takashimaya. The letter nagi cannot be written using UTF-8. Takashimaya often uses the new variant of the Kanji for the letter taka, although their actual letter is the old variant. Those names cannot be supported by UTF-8.

To solve such a problem, Mac OS X, for example, has an additional function which enables it to manage those variants without any problem. This solution resembles a locale system. As long as such a function is necessary, UTF-8 is not the final solution for encodings.

Yes to UTF-8

Although UTF-8 still has some problems, it is definitely a neat remedy at the moment. As I have already mentioned, programmers do not need to consider encodings if UTF-8 is used. In other words, a program that uses UTF-8 is already multilingual-ready without any extra effort. Newly created programming languages such as Java and C#, as well as XML, declare UTF-8 as their default encoding, which should be respected.

These days UTF-8 is getting more attention and popularity among programmers, which is welcome, because those who create applications are those who can solve the problem. End users will then follow the trend toward UTF-8 without even noticing. I encourage them to use UTF-8, although I would not give it unreserved applause.

  1. I wrote this article mostly using gedit on Linux for PowerPC; however, it did not accept direct input of accented French letters and umlauts, so I had to copy and paste them from another editor, Kate. In addition, gedit did not show the backslash, so I was obliged to use Mac OS X.
  2. Although I do not directly quote books or Web resources, I note here that I got much information and insight from a variety of sources, especially from countless Web pages.

All Rights Reserved

  • akira

    Linguistic? Rhetoric? Typography? Really strange title. There are many English errors, and many letters that my browser (Safari) cannot display. From the title, everybody will presume that you want to make a "point". With so many errors and the inability to perceive or correct them, that's not the way. Better change the title to something like: "What I know and say is not what I do".

  • Konichiwa Akira,

    Linguistique, rhétorique et typographie is the title of the category. This can't be internationalised. I could have categorised this article under Codage Web (Web coding), but this category has way too much unclassified stuff already.

    For the rest, I will forward the information to the author.