#2
Hello,

I have a .NET 1.1 application that processes Unicode text and needs to split it into "words". Currently I use a very simple algorithm that just splits on whitespace. This works for Western languages but fails completely for CJKV text (I am specifically interested in Japanese).

I understand that word breaking in Japanese requires a dictionary-based approach. From what I gather, MS Word, IE, and Index Server each include their own (different) word-breaking algorithm, but none of these is available through any API. Is there a library function, built in or otherwise, that I can use for this? I can pay a few hundred bucks for a commercial product, but not a huge amount. The results don't have to be 100% accurate; they just have to be better than my current "look for spaces" algorithm.

So far the closest thing I have found is a BreakIterator class in Java, but I don't see that anyone has ported it to .NET. So my next-best solution is simply to identify any Japanese character (or possibly just any kanji) as a word in itself.

Any more clues or ideas gratefully accepted.

Andy
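For reference, the fallback described in this post (split on whitespace, but treat every Japanese character, or only every kanji, as a word of its own) could look roughly like the sketch below. It is written in Java only because that is where BreakIterator lives; the class name `NaiveCjkSplitter` and the `kanjiOnly` flag are made up for illustration, and the script test relies on `Character.UnicodeScript` (Java 7 and later), so a .NET 1.1 port would need its own range checks. String literals use `\u` escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: whitespace splitting plus "one Japanese character = one word".
class NaiveCjkSplitter {

    // Decides whether a code point should stand alone as a word.
    // kanjiOnly = true  -> only Han characters (kanji) stand alone
    // kanjiOnly = false -> kanji, hiragana and katakana all stand alone
    static boolean standsAlone(int codePoint, boolean kanjiOnly) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        if (script == Character.UnicodeScript.HAN) {
            return true;
        }
        return !kanjiOnly
                && (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA);
    }

    static List<String> split(String text, boolean kanjiOnly) {
        List<String> words = new ArrayList<>();
        StringBuilder run = new StringBuilder(); // current non-Japanese (or kana) run
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                flush(run, words);                 // whitespace ends the current run
            } else if (standsAlone(cp, kanjiOnly)) {
                flush(run, words);                 // close any pending run first
                words.add(new String(Character.toChars(cp)));
            } else {
                run.appendCodePoint(cp);           // Latin letters, digits, punctuation, ...
            }
        }
        flush(run, words);
        return words;
    }

    private static void flush(StringBuilder run, List<String> words) {
        if (run.length() > 0) {
            words.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "I like <nihongo> text", where <nihongo> is three kanji
        System.out.println(split("I like \u65E5\u672C\u8A9E text", false));
        // "taberu": one kanji followed by two hiragana
        System.out.println(split("\u98DF\u3079\u308B", true));
    }
}
```

With `kanjiOnly` set, a word like 食べる comes out as 食 plus べる, which shows the limit of this approach: it never finds real word boundaries, it only guarantees that no "word" spans a whole unspaced sentence.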
#4
The main question is: why do you need to split the text into words? The best reference is probably Ken Lunde's "CJKV Information Processing": http://www.amazon.com/gp/product/1565922247

If you need it for some processing other than line breaking, then you need something better, something that can "understand" a bit of Japanese grammar, syntax, and so on. In that case you might take a look at this: http://www.basistech.com/base-linguistics/asian/

You will also need the help of a native Japanese speaker :-)
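To give a feel for what "something better" involves at its very simplest, here is a toy greedy longest-match segmenter in Java. It is only a sketch of the classic dictionary lookup idea, not how the product linked above or any Microsoft word breaker works (the thread does not say), and the class name `LongestMatchSegmenter` and the tiny word list are invented for the example. Real analyzers add grammar and statistics on top, because greedy matching picks the wrong split whenever a longer dictionary entry happens to overlap the intended words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy segmenter: at each position take the longest dictionary entry.
class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength; // longest entry, in UTF-16 units

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Fallback when nothing matches: emit one code point as its own "word".
            int end = pos + Character.charCount(text.codePointAt(pos));
            int limit = Math.min(text.length(), pos + maxWordLength);
            // Try the longest candidate first and shrink until the dictionary has it.
            for (int candidate = limit; candidate > pos; candidate--) {
                if (dictionary.contains(text.substring(pos, candidate))) {
                    end = candidate;
                    break;
                }
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up mini dictionary: Tokyo, Tokyo-to, to, ni, sumu
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u6771\u4EAC", "\u6771\u4EAC\u90FD", "\u90FD", "\u306B", "\u4F4F\u3080"));
        LongestMatchSegmenter segmenter = new LongestMatchSegmenter(dict);
        // "Tokyo-to ni sumu" written without spaces
        System.out.println(segmenter.segment("\u6771\u4EAC\u90FD\u306B\u4F4F\u3080"));
    }
}
```

The quality of the output depends entirely on the word list, which is where the native speaker (or a licensed dictionary) comes in.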
#7
Unfortunately, it is not for the purposes of line breaking. My application requires the user to be able to select a particular word within a line of text, so it is similar to the algorithm that Word or IE uses when you double-click.
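For what it's worth, this kind of "word under the cursor" lookup is what Java's `java.text.BreakIterator` (the class mentioned in the first post) is typically used for. The sketch below is Java, not .NET, and the class name `WordAtOffset` and its methods are hypothetical; only `getWordInstance`, `setText`, `following` and `previous` are the real API. How finely it cuts Japanese text depends on the break rules shipped with the runtime, so no particular Japanese segmentation is promised here.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: find the segment a double-click at 'offset' would select.
class WordAtOffset {

    // Returns {start, end} with start <= offset < end (end is exclusive).
    static int[] boundsAt(String text, int offset, Locale locale) {
        if (offset < 0 || offset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int end = words.following(offset); // first boundary strictly after offset
        int start = words.previous();      // the boundary just before that one
        return new int[] { start, end };
    }

    static String wordAt(String text, int offset, Locale locale) {
        int[] bounds = boundsAt(text, offset, locale);
        return text.substring(bounds[0], bounds[1]);
    }

    public static void main(String[] args) {
        // Clicking on the 'o' of "world"
        System.out.println(wordAt("hello world", 7, Locale.ENGLISH));
        // Japanese sample ("nihongo wo yomu"); the selected range depends on the runtime's rules
        String japanese = "\u65E5\u672C\u8A9E\u3092\u8AAD\u3080";
        int[] b = boundsAt(japanese, 3, Locale.JAPANESE);
        System.out.println(b[0] + ".." + b[1]);
    }
}
```

Note that whitespace and punctuation come back as segments too, so a click on a space selects just that space; an application would have to decide whether to skip such segments.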