HighTechTalks DotNet Forums  

word breaking for cjk languages

Dotnet Internationalization microsoft.public.dotnet.internationalization


#1  AT
word breaking for cjk languages - 07-10-2006, 12:46 PM

Hello,

I have a .NET 1.1 application which processes Unicode text and needs
to split it up into "words". Currently I use a very simple algorithm of
just splitting on whitespace, which works for Western languages but
fails completely for CJKV (specifically, I am interested in Japanese).

I understand that a dictionary-based approach is necessary to do word
breaking in Japanese. From what I gather, MS Word, IE, and Index Server
each include their own (different) word-breaking algorithm, but none of
these is available through any API. Is there any library function
available or built in that I can use for this? I can pay a few hundred
bucks for a commercial product, but not a huge amount. The results
don't have to be 100% accurate; they just have to be better than my
current "look for spaces" algorithm.

So far the closest I have found is that there seems to be a
BreakIterator class in Java, but I don't see that anyone has ported it
to .NET. So my next best solution is simply to identify any Japanese
character (or possibly just any kanji) as a word in itself.
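
The fallback described above (whitespace splitting, plus every Japanese character treated as a one-character word) can be sketched roughly as follows. This is only an illustration in Python rather than .NET; the function name naive_words and the Unicode ranges are assumptions made for the sketch, not part of any library.

```python
import re

# Unicode ranges (assumed sufficient for this sketch): CJK Unified
# Ideographs for kanji, plus the hiragana and katakana blocks.
KANJI = r"\u4e00-\u9fff"
KANA = r"\u3040-\u309f\u30a0-\u30ff"


def naive_words(text, kanji_only=False):
    """Split on whitespace, but emit each Japanese character as its own word."""
    single = KANJI if kanji_only else KANJI + KANA
    # Match either one "single" character or a run of other non-space characters.
    pattern = re.compile(f"[{single}]|[^\\s{single}]+")
    return pattern.findall(text)


print(naive_words("I like 日本語 a lot"))
# ['I', 'like', '日', '本', '語', 'a', 'lot']
print(naive_words("東京タワー", kanji_only=True))
# ['東', '京', 'タワー']
```

With kanji_only=True, runs of kana stay together, which is the "possibly just any kanji" variant mentioned in the post.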

Any more clues or ideas gratefully accepted.

Andy


#2  Bart Mathias
Re: word breaking for cjk languages - 07-10-2006, 04:26 PM

This is a problem I had to work out from scratch in 1981 when I was
starting a Japanese-English machine translation project for Weidner
Communications in Provo, Utah.

Twenty-five years later I don't remember all the details, but among the
rules of thumb were: try a break between kana and kanji (in that
order), break before any change to katakana, break before a hiragana
"o" or after a "wo," and so on. I don't think my practice
materials had any long stretches of hiragana to worry about, and as a
translation program it had to have access to a dictionary (which I was
building as I went along, based on the words in the practice materials)
to verify things against.
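
The rules of thumb listed above might look something like this in code. It is a rough reconstruction from the description only, not the original 1981 program: the names script and heuristic_breaks are made up, "kana" is assumed to cover both hiragana and katakana, and the exception that keeps a leading "o" attached to the kanji after it is an added assumption.

```python
def script(ch):
    """Classify a character by script, using basic Unicode block ranges."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def heuristic_breaks(text):
    """Split text using script-transition rules of thumb (no dictionary)."""
    words, start = [], 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        p, c = script(prev), script(cur)
        # Assumption: a prefix "o" that opens a chunk stays with what follows.
        after_prefix_o = prev == "お" and start == i - 1
        if (
            (p in ("hiragana", "katakana") and c == "kanji" and not after_prefix_o)
            or (c == "katakana" and p != "katakana")  # change to katakana
            or cur == "お"                            # before hiragana "o"
            or prev == "を"                           # after "wo"
        ):
            words.append(text[start:i])
            start = i
    if text:
        words.append(text[start:])
    return words


print(heuristic_breaks("私はコーヒーを飲む"))
# ['私は', 'コーヒーを', '飲む']
```

The output is phrase-like chunks rather than dictionary words, which is why these rules were paired with a dictionary check.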

More recently I started a Classical Japanese analysis program, which
simply grabbed a string of about seven characters and looked to see if
it was in the dictionary. If not found, it dropped the last character
and tried the dictionary again. (When there were no characters left, it
would display the original seven-character string and ask the user to
identify the first word in it, which then got added to the dictionary.)
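
The grab-and-shrink lookup described above is a greedy longest-match segmenter. A minimal sketch, assuming the dictionary is a plain set of strings; the name longest_match_segment is invented here, and an unknown character is simply emitted on its own where the original program would have asked the user:

```python
def longest_match_segment(text, dictionary, window=7):
    """Greedy longest-match segmentation against a set of known words."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, dropping the last character each time.
        for size in range(min(window, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # Nothing matched: emit one character as an unknown word.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


known = {"東京", "東京都", "に", "住む"}
print(longest_match_segment("東京都に住む", known))
# ['東京都', 'に', '住む']
```

Because "東京都" is tried before "東京", the longer entry wins. Handling inflected forms would need more than a flat word list, as the post notes.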

Either way, the program has to know how things inflect.

Bart


#3  Mihai N.
Re: word breaking for cjk languages - 07-11-2006, 02:43 AM

Quote:
I have a .Net 1.1 application which processes text in unicode and needs
to split it up into "words". currently I use a very simple algorithm of
just splitting on whitespace which works for western languages but
fails completely for CJKV (specifically I am interested in japanese).
....
I understand that a dictionary-based approach is necessary to do word
breaking in japanese. from what I gather, MS word, IE, and index server
each include their own (different) word breaking algorithm but none of
these is available through any API. is there any library function
available or built-in that I can use for this?

The main question is: why do you need to split the text into words?

If it is for line wrapping, that does not happen at word boundaries but
follows a set of rules that are easy to implement.
The Japanese name for these rules is "kinsoku shori"; Chinese has
a similar set of rules (although with no special name).

See:
http://ja.wikipedia.org/wiki/%E7%A6%...87%A6%E7%90%86
OK, it is in Japanese, but BabelFish (http://babelfish.altavista.com/) does a good
enough job if you already have a basic understanding.

See also
http://www.microsoft.com/globaldev/g...g_linebrk.mspx
http://www.w3.org/International/tutorials/css3-text/
http://xml.ascc.net/en/utf-8/faq/zhl10n-faq-xsl.html#lb (for Chinese)

Best reference is probably Ken Lunde's "CJKV Information Processing"
http://www.amazon.com/gp/product/1565922247
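
To give an idea of how simple the kinsoku shori rules mentioned above are, here is a minimal sketch. It assumes fixed-width characters and uses only a small illustrative subset of the prohibited characters; the name wrap_kinsoku and both character sets are choices made for this example, and a real implementation would take its tables from the references above.

```python
# Small illustrative subsets, not the complete rule tables.
NO_LINE_START = set("、。，．）」』】〕ー々ゃゅょっァィゥェォャュョッ！？")
NO_LINE_END = set("（「『【〔")


def wrap_kinsoku(text, width):
    """Wrap text into lines of at most `width` characters, moving a break
    earlier when it would start or end a line with a prohibited character."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines, start = [], 0
    while len(text) - start > width:
        end = start + width
        # Pull the break point back while it violates either rule.
        while end > start + 1 and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


print(wrap_kinsoku("これは「禁則処理」です。", 4))
# ['これは', '「禁則処', '理」で', 'す。']
```

Note that no dictionary is involved: the break points depend only on the characters on either side.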



If you need it for some processing other than line-breaking, then you
need something better, something that can "understand" a bit of Japanese
grammar, syntax, etc.
In this case you might take a look at this:
http://www.basistech.com/base-linguistics/asian/

And you also need the help of a native Japanese speaker :-)


--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email


#4  AT
Re: word breaking for cjk languages - 07-11-2006, 06:04 AM

Mihai N. wrote:
Quote:
Main question is why do you need to split up into words?

Best reference is probably Ken Lunde's "CJKV Information Processing"
http://www.amazon.com/gp/product/1565922247

Unfortunately, it is not for the purposes of line breaking. My
application requires the user to be able to select a particular word
within a line of text, so it is similar to the algorithm that Word or
IE would use when you double-click.
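
Once some segmenter has produced a list of words, the double-click part is straightforward: find the segment that contains the clicked character. A small sketch, assuming the segments concatenate back to the original line and that the click has already been mapped to a character offset; word_at is a hypothetical helper name.

```python
from bisect import bisect_right
from itertools import accumulate


def word_at(words, offset):
    """Return (start, end) of the segment containing character `offset`."""
    ends = list(accumulate(len(w) for w in words))  # cumulative end offsets
    if not ends or not 0 <= offset < ends[-1]:
        raise IndexError("offset is outside the text")
    k = bisect_right(ends, offset)  # first segment ending after the offset
    start = ends[k - 1] if k else 0
    return start, ends[k]


segments = ["東京都", "に", "住む"]
print(word_at(segments, 3))
# (3, 4)
```

The returned pair can be used directly as a selection range in the original string.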




#5  AT
Re: word breaking for cjk languages - 07-11-2006, 08:33 AM

From ajfish (AT) blueyonder (DOT) co.uk (sci.lang.japan) we read:
Quote:
I understand that a dictionary-based approach is necessary to do word
breaking in japanese. from what I gather, MS word, IE, and index server
each include their own (different) word breaking algorithm but none of
these is available through any API. is there any library function
available or built-in that I can use for this? I can pay a few hundred
bucks for a commercial product, but not a huge amount. the results
don't have to be 100% but it just has to do a better job than my
current "look for spaces" algorithm.
There are several freely available open-source packages which break up
Japanese text into "words". One very commonly used is ChaSen
(http://chasen.naist.jp/hiki/ChaSen/). Another is MeCab
(http://mecab.sourceforge.jp/).

Both do morphological analysis and part-of-speech tagging, but the
segmentation is a useful by-product. Both have documentation mostly
in Japanese, although ChaSen has an oldish English manual as a PDF file.
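
To show what "segmentation as a by-product" means in practice, the sketch below pulls just the word boundaries out of analyzer output. It assumes MeCab's default output format (surface form, a tab, comma-separated features, one token per line, ending with EOS); the sample is the well-known sentence from the MeCab documentation, and the exact features depend on the dictionary installed. The function name surfaces is made up, and ChaSen's output format differs.

```python
# Sample output in MeCab's default format (features vary by dictionary).
SAMPLE = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n"
    "EOS\n"
)


def surfaces(mecab_output):
    """Keep only the surface form of each token, discarding the tagging."""
    words = []
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, _features = line.partition("\t")
        words.append(surface)
    return words


print(surfaces(SAMPLE))
# ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']
```

Joining the surfaces reproduces the input sentence, so the list maps directly back to character offsets in the original text.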

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学


#6  chance
Re: word breaking for cjk languages - 07-11-2006, 11:31 AM

I believe that, ultimately, word breaking can only be done manually,
in any language. Wasn't it originally an arbitrary device of typesetters,
or a matter of style for writers?

CK



#7  Mihai N.
Re: word breaking for cjk languages - 07-12-2006, 02:40 AM

Quote:
unfortunately It is not for the purposes of line breaking. my
application requires the user to be able to select a particular word
within a line of text, so it is similar to the algorithm that word or
IE would use when you double-click.
Then the Basistech library might help:
http://www.basistech.com/base-linguistics/asian/

On the other hand, I am not sure how important it is to do that.
Word 2003 (with Japanese support and all) doesn't do it
(and the i18n level in Word 2003 is really good).
Maybe some quick market/user research will show that nobody cares
and save you some time :-)



