How to blog in Chinese?

I’m suddenly confused. I know that, in Chinese characters, Oliver is spelled ???. And when I just pasted those characters into Safari, they looked right. But I’m almost certain that when I save this item and it gets rendered as a web page, it’s not going to appear as Chinese to my readers. I know how to handle this for a specific web page by changing the text encoding for the entire page. But this blog post will appear in many different places (here, in aggregators, in blog search engines, etc.), so how do I universally indicate “these characters here in this chunk of text are in Chinese”?

Screen Shot of this post being edited, showing proper rendering of Chinese characters

Postscript: As expected, my Chinese characters ended up as “???”. This might be because of limitations in the MySQL database where I’m storing the post, or because of text encoding problems in the browser. Or maybe I’m just in the dark. Please advise.


Nathan on February 10, 2006 - 00:46 Permalink

You need to publish in Unicode! Unicode is the modern character set, with enough codepoints to represent the characters of all scripts in use today. However, for better backwards compatibility with existing systems and tools, Unicode codepoints are usually not stored directly. Instead, encodings such as UTF-8 are used, which represent Unicode codepoints in a way that is backward compatible with the common-denominator ASCII character set.
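To make the ASCII compatibility concrete, here is a minimal sketch in Python (used purely for illustration). The sample string 中文 is an assumption chosen for the example, not the characters from the post.

```python
ascii_text = "Oliver"
chinese_text = "中文"  # hypothetical sample text, not the characters from the post

# ASCII-only text produces exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters outside ASCII take more than one byte each in UTF-8.
utf8_bytes = chinese_text.encode("utf-8")

print(same_bytes)                          # True
print(len(chinese_text), len(utf8_bytes))  # 2 characters, 6 bytes
```

Existing ASCII content therefore needs no conversion at all; only the non-ASCII characters change representation.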

So if you specify the document charset of this page as UTF-8, in either a meta content-type tag or the HTTP headers, then you can use any Unicode codepoints you want. Since there are over a million possible codepoints, most fonts cover only the subset used by a particular script. To view Chinese characters you’d need Chinese fonts, which in your screenshot you appear to have.

If you are thinking of switching to UTF-8, you may need to convert some content. You’re currently using the ISO-8859-1 charset, which only partially overlaps with UTF-8: the lowest 128 codepoints are the ASCII set in both, encoded as the same single bytes. Any existing content may contain ISO-8859-1 codepoints between 128 and 255, which UTF-8 encodes as two bytes each. To convert to UTF-8, run your content through iconv(). Be warned, though, that in UTF-8 one byte does not necessarily equal one character. This might break assumptions in your string-handling code. The PHP Multibyte String module can help deal with that.
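As a rough illustration of the conversion and of the byte-versus-character caveat, here is a short Python sketch. It is not the PHP iconv() call itself, only the same idea in another language; the helper name latin1_to_utf8 and the sample word are assumptions made for the example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper)."""
    # Every byte value is valid ISO-8859-1, so this decode cannot fail.
    return raw.decode("iso-8859-1").encode("utf-8")


legacy = "café".encode("iso-8859-1")  # "é" is the single byte 0xE9 here
converted = latin1_to_utf8(legacy)    # "é" becomes the two bytes 0xC3 0xA9
text = converted.decode("utf-8")

# The byte count grows after conversion, but the character count does not.
print(len(legacy), len(converted), len(text))  # 4 5 4
```

Code that measures or truncates strings by byte count will disagree with the character count after the switch, which is the kind of assumption a multibyte-aware string library exists to handle.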

Clark on February 10, 2006 - 16:31 Permalink

??????????????????? utf-8 ??????