#1
I want to build an index for a list of words, based on the first letter of each word, for four languages (French, English, Czech, Spanish). In English there is no problem: there are no accents and every letter is a single character. In French there are accents but, as far as I know, every letter is still a single character, and capitalizing a word doesn't change its meaning. In other languages there may be accents, a letter may be composed of more than one character, and capitalization can change the meaning of a word.

-----
Here is an example in French of what I am looking for. For the words:

Abandon
École
ennui
fuite

I would like to obtain the following entries:

A ( -> Abandon )
E ( -> École, ennui )
F ( -> fuite )

-----
I can build a lookup table for French, so I have a solution there, French being my native language. But I cannot do the same for Czech, etc.

-----
My questions are:

1) Even if my aim is meaningful for French and English, is it meaningful for Czech and Spanish?
2) If the answer to 1) is yes, can the .NET Framework help me? I had a look at the CompareInfo and SortKey classes.

Thank you.
#2
Thank you. Here is another example, this time in Czech (a language I know nothing about). The following words are sorted (ignoring case) with the locale cs-CZ:

Cyklopentan Cástice Dusík Ethanol Chlor Fluor Glutaraldehyd Hydroxid Chlor

Using the first two bytes of KeyData [ ...CompareInfo.GetSortKey(s, CompareOptions.IgnoreCase).KeyData ] as an indicator that a break occurs on the first letter of the words, I obtain the following index:

C C D E F G H C

Using the first two bytes of KeyData gives satisfactory results for English and almost satisfactory results for French (I can't, for example, map É to E). But "C C D E F G H C" looks weird; it should be "C C D E F G H Ch".

The KeyData for Chlor with (cs-CZ, CompareOptions.IgnoreCase) reads "14 46 14 72 14 124 14 138 1 1 1 1 0". There are only 4 byte pairs instead of 5, even though "Chlor" has 5 characters. So the base API knows that "Ch" is a single letter. It would be interesting to have a reverse map from (14,46) to Ch. Is there a way to get one?

I had a look at the Unicode web site on "NFD". That seems interesting: it would let me map É to E in French. Will it help me find out that "Ch" is a letter in Czech? What about the framework and NFD?

Thanks again.
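
To make the NFD idea concrete, here is a minimal sketch in Python rather than in the .NET API discussed above (.NET exposes the same decomposition through `String.Normalize` with `NormalizationForm.FormD`). The names `base_initial` and `build_index` are made up for illustration. The sketch decomposes the word and keeps the first non-combining code point, so É maps to E. It does nothing for the Czech "Ch", because NFD only splits single characters into base letter plus marks and knows nothing about letters made of several characters.

```python
import unicodedata


def base_initial(word: str) -> str:
    """Return the uppercase base letter of the first character of `word`.

    Hypothetical helper: NFD splits a precomposed character such as "É"
    into "E" + U+0301 (combining acute accent); the first code point
    that is not a combining mark is taken as the index letter.
    """
    decomposed = unicodedata.normalize("NFD", word)
    for ch in decomposed:
        if not unicodedata.combining(ch):
            return ch.upper()
    return ""  # empty input, or nothing but combining marks


def build_index(words):
    """Group words under their base initial, keeping input order."""
    index = {}
    for w in words:
        index.setdefault(base_initial(w), []).append(w)
    return index


index = build_index(["Abandon", "École", "ennui", "fuite"])
for letter, entries in index.items():
    print(letter, "->", ", ".join(entries))
```

Note that stripping every accent is a French-style assumption; in a language where an accented letter is a letter of its own, this would merge entries that should stay separate.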
#3
Thanks. I had a look at UnicodeData.txt and the ICU website. I will build a simple lookup table for each of the languages I intend to support, using the ICU collation charts and the ICU collation customization rules. These tables will help me determine the index entry for a word (e.g. École -> E in French; Chlor -> CH in Czech). I am a novice in Unicode. Your help was very valuable.
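
A per-language lookup of this kind might be sketched as follows, in Python for brevity. Everything here is illustrative: the `SPECIAL_LETTERS` table and the `index_entry` function are hypothetical names, and the Czech letter list is an assumption that should be checked against the ICU collation chart before use. The idea is to try the language's own letters first, longest first, and only then fall back to the accent-stripped first character.

```python
import unicodedata

# Hypothetical, incomplete tables: letters that get their own index entry.
# The Czech list is an assumption, to be verified against the ICU charts.
SPECIAL_LETTERS = {
    "fr": [],                          # accents fold to the base letter
    "cs": ["CH", "Č", "Ř", "Š", "Ž"],
}


def index_entry(word: str, lang: str) -> str:
    """Return the index letter for `word` in language `lang`."""
    # Compare in composed (NFC), uppercase form so "ch", "Ch", "CH" all match.
    upper = unicodedata.normalize("NFC", word).upper()
    letters = [unicodedata.normalize("NFC", s) for s in SPECIAL_LETTERS.get(lang, [])]
    # Longest candidates first, so "CH" is tried before any single letter.
    for letter in sorted(letters, key=len, reverse=True):
        if upper.startswith(letter):
            return letter
    # Fallback: strip accents via NFD and keep the first base character.
    for ch in unicodedata.normalize("NFD", upper):
        if not unicodedata.combining(ch):
            return ch
    return ""


print(index_entry("École", "fr"))   # French entry
print(index_entry("Chlor", "cs"))   # Czech entry
```

The same word can land under different entries depending on the language passed in, which is why the table is keyed by language rather than shared.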