International characters, encodings question

Zachary Turner

I am trying to search some Japanese text for an occurrence of any kanji, or
Japanese characters. But I don't know how to differentiate in code between
kanji and non-kanji. Do all kanji fall within a certain range of unicode
values? Or is there a framework function somewhere that can help with this?

Thanks
 
The kanji aren't all in one contiguous range. Also, many of the characters in
those ranges are used in other Asian writing systems but not in Japanese, so
I'd say it's impossible to tell from the code point alone whether a kanji is
Japanese or not. Hiragana and Katakana each fall in a single range, though.

Hiragana U+3040 - U+309F
Katakana U+30A0 - U+30FF
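
Because the two kana blocks listed above sit next to each other, a single regular-expression character class can cover both. The following is only a minimal Python sketch of that idea; the names KANA, contains_kana and find_kana are made up for illustration and are not part of any framework.

```python
import re

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) are adjacent blocks,
# so one character class spans both of them.
KANA = re.compile(r"[\u3040-\u30ff]")


def contains_kana(text: str) -> bool:
    """Return True if text has at least one character in either kana block."""
    return KANA.search(text) is not None


def find_kana(text: str) -> list:
    """Return every character of text that falls in either kana block."""
    return KANA.findall(text)
```

Note that the check works on whole blocks, so it also accepts the few non-letter marks that live there, such as the prolonged sound mark U+30FC.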

There are also punctuation marks which are perhaps unique to Japanese, I'm
not sure. Those are not included above.

Check out http://www.unicode.org/book/ch10.pdf

HTH
++++
 
Well, I'm actually not concerned if the Kanji is Japanese or not. I
basically want to be able to uniquely categorize an arbitrary character into
one of the following three categories:

Hiragana or Katakana
Kanji
neither of the above.

The first one seems easy enough. For the second one, I actually know that
the text is either going to be Japanese or English. In other words, I'm not
going to have any Korean or Chinese or anything in there. I basically want
to check if it's an "ideogram". If it is, I think I'll be satisfied. Is
there a simple way to check for this?
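
One way to sketch the three-way split described here is a plain range check on the code point. The Python below is an illustration under stated assumptions, not a definitive answer: it treats "kanji" as anything in the CJK Unified Ideographs block, Extension A, or the CJK Compatibility Ideographs block, and it leaves out the supplementary-plane extensions (Extension B and later, starting at U+20000). The names classify, KANA_BLOCKS and KANJI_BLOCKS are hypothetical.

```python
# Each block is an inclusive (first, last) pair of code points.
KANA_BLOCKS = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
)

# Assumption: "kanji" means any code point in these ideograph blocks.
KANJI_BLOCKS = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def _in_blocks(code_point: int, blocks) -> bool:
    """Return True if code_point lies inside any of the inclusive ranges."""
    return any(first <= code_point <= last for first, last in blocks)


def classify(ch: str) -> str:
    """Classify a single character as 'kana', 'kanji' or 'other'."""
    code_point = ord(ch)  # raises TypeError unless ch is exactly one character
    if _in_blocks(code_point, KANA_BLOCKS):
        return "kana"
    if _in_blocks(code_point, KANJI_BLOCKS):
        return "kanji"
    return "other"
```

Since the text is known to be only Japanese or English, the fact that these ideograph blocks are shared with Chinese and Korean should not matter for this check.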

I'll read up on that document a little later, but for now I want to get back
to the code.

Thanks
Zach
 
OK, that document describes the code points and all the blocks where kanji are
to be found. That will tell you if it's a kanji.
One thing you may still need to look out for is that for English, there are
also wide (fullwidth) characters. So A-Z, a-z, and 0-9 actually exist in two
places. The wide versions come from double-byte character sets, where each one
takes two bytes.
A U+FF21 Z U+FF3A
a U+FF41 z U+FF5A
0 U+FF10 9 U+FF19

You may want to handle these as well.
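
A possible way to handle them is to fold the wide forms back to plain ASCII before doing any other checks. The fullwidth variants of the printable ASCII characters run from U+FF01 to U+FF5E and sit at a fixed offset of 0xFEE0 from their ASCII counterparts (U+FF21 minus 0xFEE0 is U+0041, "A"). The Python below is a rough sketch of that; fold_fullwidth is a made-up name, and anything outside that range is left untouched.

```python
FULLWIDTH_FIRST = 0xFF01  # fullwidth exclamation mark
FULLWIDTH_LAST = 0xFF5E   # fullwidth tilde
ASCII_OFFSET = 0xFEE0     # distance between a fullwidth form and its ASCII twin


def fold_fullwidth(text: str) -> str:
    """Replace fullwidth ASCII variants with their ordinary ASCII characters."""
    folded = []
    for ch in text:
        code_point = ord(ch)
        if FULLWIDTH_FIRST <= code_point <= FULLWIDTH_LAST:
            folded.append(chr(code_point - ASCII_OFFSET))
        else:
            folded.append(ch)  # kana, kanji, plain ASCII etc. pass through
    return "".join(folded)
```

Running the text through this first means a later "is it English?" test only has to look at the ordinary A-Z, a-z and 0-9 ranges.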

HTH
Eric
 