December 06, 2004

UNIHAN.

Through an interesting Language Log post ("Semen, green rice and the rate of internet decay") by Mark Liberman, I learned about the Unihan site (it was actually mentioned in the comments to this LH post from last year, but there was so much else being discussed I didn't even notice it). The search page allows you to search for characters by meaning or transcription, the latter in "three varieties of Chinese (Cantonese, Mandarin, and Tang), the two basic Japanese pronunciations (Japanese On, or Sino-Japanese, and Japanese Kun, or native Japanese), and Sino-Korean," and the radical-stroke index allows you to look them up as you would in a traditional dictionary. And the results page, e.g., for ren2 'man(kind), people,' gives you not only its number in the most important dictionaries, readings in the six varieties mentioned above, and definitions, but also a long series of phrases using the character in both Chinese (Mandarin and Cantonese readings) and Japanese (kanji and kana).

One problem is that if you search by meaning, what you enter is treated as a string of characters rather than a word, so that entering "man" gets you 355 matches, including characters with "manifest," "manner," "womanly," "command," and so on in the definition. There's probably a way around this, but adding spaces before and after doesn't work.
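To make the difference concrete, here is a minimal sketch of substring matching versus whole-word matching. It is not how the Unihan search is actually implemented (I have no idea what runs behind it); the sample definitions, the keys, and both function names are made up for illustration.

```python
import re

# Hypothetical sample definitions, invented for this sketch (not Unihan data).
DEFINITIONS = {
    "A": "man; people; mankind",
    "B": "to manifest, display",
    "C": "manner, style",
    "D": "womanly, feminine",
    "E": "command, order",
}

def substring_search(term, definitions):
    """Return keys whose definition contains term anywhere, even inside a word."""
    return [key for key, text in definitions.items() if term in text]

def whole_word_search(term, definitions):
    """Return keys whose definition contains term as a separate word."""
    # \b matches at a word boundary, so "man" does not match inside "command".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [key for key, text in definitions.items() if pattern.search(text)]

print(substring_search("man", DEFINITIONS))
print(whole_word_search("man", DEFINITIONS))
```

The substring version behaves like the search described above and returns every entry; the whole-word version returns only the entry whose definition has "man" as a word of its own.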

Posted by languagehat at December 6, 2004 02:23 PM
Comments

Another peculiarity is that Mandarin pronunciations are also treated as strings, so a "han" search will get you shan, zhan, and chan. An "an" search gives you 5520 matches, including ang, zhang, han, chan, etc. I've fished for a workaround, but none seems to work. You can cut off the final "g" by entering the tone, but you still have 1518 matches for an4.

On the other hand, li4, one of the most common character-tone combinations, gives you 226 matches, which is about right, though about half of them don't display in my browser.
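A rough sketch of what a syllable-aware search might look like, as opposed to the string matching described above. This is only an illustration: the list of readings is invented, the function names are hypothetical, and it assumes readings are written as a syllable plus a tone digit from 1 to 5.

```python
import re

# Hypothetical readings in "syllable + tone digit" form, invented for this sketch.
READINGS = ["an4", "han4", "zhang1", "chan2", "an1", "ang2", "shan4"]

def substring_match(query, readings):
    """String-style matching: the query may appear anywhere in the reading."""
    return [r for r in readings if query in r]

def syllable_match(query, readings):
    """Match the whole syllable; a query without a tone digit matches any tone."""
    if query[-1:].isdigit():
        # Tone given: the reading must equal the query exactly.
        pattern = re.compile(re.escape(query))
    else:
        # No tone given: allow an optional tone digit after the syllable.
        pattern = re.compile(re.escape(query) + r"[1-5]?")
    return [r for r in readings if pattern.fullmatch(r)]

print(substring_match("an", READINGS))
print(syllable_match("an", READINGS))
print(syllable_match("an4", READINGS))
```

With this sample list, the substring search for "an" returns all seven readings, while the syllable search returns only an4 and an1, and a search for an4 returns just that one reading.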

Posted by: John Emerson at December 6, 2004 07:04 PM

I just checked out the Unihan page. From the link you listed above, nearly all the characters on the results page are very rare and archaic. If characters like those showed up in tattoos, I'd most probably dismiss them as yet another case of Hanzi Smatter (http://www.hanzismatter.com).

Posted by: Duncan Mak at December 6, 2004 09:51 PM

That's a good example of Unicode.org's detailed CJK files being transformed into usable information by a web frontend. I don't know why I had never thought of that before.

Posted by: Christopher Culver at December 7, 2004 02:26 AM

This brings up something I've wondered about for a while: is there a Chinese equivalent to kakasi, which would take a URL for, e.g., a Big5-encoded page and spit out a pinyin version?

Posted by: james at December 7, 2004 12:21 PM

For pinyin conversion (from GB- or Unicode-encoded pages), try David Lancashire's Adsotrans. Use the "advanced" page and select "convert to pinyin." (Give it time. The site is acting slow today.) Capitalization at the beginnings of sentences and other such niceties are coming soon. If prompted, use "guest" for both the login name and password.

I hope to have something similar on my own site one of these days.

Posted by: Mark S. at December 7, 2004 09:21 PM