HighTechTalks DotNet Forums  

Determining whether the text is RTL

Dotnet Internationalization microsoft.public.dotnet.internationalization


Jan Kucera
 
Determining whether the text is RTL - 09-11-2007, 05:29 AM






Hello,
I ran into a small problem with automatic text alignment in WPF,
mentioned at http://forums.microsoft.com/MSDN/Sho...PostID=2123352
and it seems I will have to write the workaround myself, but this group seems
a more appropriate place to look for the answer.

The application gets some text (from XML) and is supposed to display it.
However, this XML contains data from several cultures, and some of it comes
from RTL ones (e.g. the text is Hebrew). Now I need to find out whether I should
align the text to the left or to the right. Is there any function, either
in .NET or in Win32, that would determine this for me? I could take the first
character and test whether it is Arabic, Hebrew and so on, but I would likely
miss some case (or a future one), so I'm looking for a more general way of doing
that.

Thank you for any hints,
Jan


Mihai N.
 
Re: Determining whether the text is RTL - 09-12-2007, 12:25 AM






Quote:
The application gets some text (from XML) and is supposed to display it.
However, this XML contains data from several cultures, and some of it comes
from RTL ones (e.g. the text is Hebrew). Now I need to find out whether I should
align the text to the left or to the right. Is there any function, either
in .NET or in Win32, that would determine this for me? I could take the first
character and test whether it is Arabic, Hebrew and so on, but I would likely
miss some case (or a future one), so I'm looking for a more general way of
doing that.

This is how you determine if some culture needs RTL rendering:
http://blogs.msdn.com/michkap/archiv...12/663013.aspx

But you need to have a way in the XML itself to tag data with a culture.

There is no 100% safe way to determine if the text is RTL based on the text
content only. Imagine you have a mixture like this: "XXXXX YYYYY"
with XXXXX some English text, and YYYYY some Arabic text.
Is that English with an Arabic inset, or Arabic with an English inset?



--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email


Jan Kucera
 
Re: Determining whether the text is RTL - 09-12-2007, 12:36 AM



Hi Mihai,
thank you for the answer. However, Michael's post expects you to have a
CultureInfo. In that case, since I am targeting a newer .NET Framework, I could
simply use CultureInfo.TextInfo.IsRightToLeft.

Okay, I know your sample would be a problem. So, how do I check this for a
single character? Is there any way to test for all RTL cases?

Actually, I think I do have an ISO-639-2 tag for the text, but I'm not sure
whether it is worth creating separate text-flow information from those tags.
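
For the tag-based route, here is a minimal sketch in Python (chosen only for brevity, it is not .NET): a hand-maintained set of ISO 639-2 codes for languages that are usually written right to left. The table contents and the helper name is_rtl_language are assumptions made for illustration, and the list is deliberately incomplete. A language tag also does not strictly determine the script a given text was written in.

```python
# Hypothetical, non-exhaustive table of ISO 639-2 codes for languages that
# are usually written in a right-to-left script. Extend it for your own data.
RTL_ISO_639_2 = frozenset({
    "ara",         # Arabic
    "heb",         # Hebrew
    "fas", "per",  # Persian (terminological and bibliographic codes)
    "urd",         # Urdu
    "syr",         # Syriac
    "div",         # Divehi
    "yid",         # Yiddish
})


def is_rtl_language(code: str) -> bool:
    """Return True if the ISO 639-2 code is listed in the table above."""
    return code.strip().lower() in RTL_ISO_639_2


if __name__ == "__main__":
    for tag in ("heb", "enm", "lat", "div"):
        print(tag, is_rtl_language(tag))
```

Codes that are not in the table (enm, lat and so on) simply fall back to left-to-right, which is the cheap part of keeping such a table by hand.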


Jan

Mihai N.
 
Re: Determining whether the text is RTL - 09-12-2007, 10:45 PM



Quote:
thank you for the answer. However, Michael's post expects you to have a
CultureInfo. In that case, since I am targeting a newer .NET Framework, I could
simply use CultureInfo.TextInfo.IsRightToLeft.

Okay, I know your sample would be a problem. So, how do I check this for a
single character? Is there any way to test for all RTL cases?

Without a CultureInfo you can try calling the native GetStringTypeEx.
It takes a locale ID, but you can pass whichever one you want:
the strong attributes in CT_CTYPE2 (C2_RIGHTTOLEFT/C2_LEFTTORIGHT) are
not affected by locale.

But there is still no reliable way to test for all RTL cases.
Sometimes not even a human can do it.
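
Roughly the same per-character information is available outside Win32 as the Unicode bidirectional class. The sketch below uses Python's standard unicodedata.bidirectional as a stand-in for GetStringTypeEx with CT_CTYPE2. It is an analogue for illustration, not the Win32 call itself, and the helper name char_direction is made up.

```python
import unicodedata

# Strong right-to-left classes in the Unicode Bidirectional Algorithm:
# "R" (e.g. Hebrew) and "AL" (Arabic letters, also used by Syriac and Thaana).
STRONG_RTL = {"R", "AL"}
STRONG_LTR = {"L"}


def char_direction(ch: str) -> str:
    """Classify a single character as 'rtl', 'ltr' or 'neutral'."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in STRONG_RTL:
        return "rtl"
    if bidi in STRONG_LTR:
        return "ltr"
    # Digits, punctuation, whitespace and unassigned code points end up here.
    return "neutral"


if __name__ == "__main__":
    # Hebrew alef, Arabic alef, Thaana haa, Latin a, digit one
    for ch in ("\u05d0", "\u0627", "\u0780", "a", "1"):
        print(f"U+{ord(ch):04X}", char_direction(ch))
```

Because the class comes from the Unicode character database, scripts are covered without listing them one by one; how complete the coverage is depends on the Unicode version shipped with the runtime.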


Quote:
Actually, I think I do have an ISO-639-2 tag for the text, but I'm not sure
whether it is worth creating separate text-flow information from those tags.

I think most of the time text content is in a single language.
A document is mostly in language A, with small chunks of other languages.
But those areas have to be tagged.
Designing a document where all the languages are mixed, without properly
tagging them, is not very useful.
Think of MS Word, where you can mark text sections with a different language
for spell-checking.

If possible it would be a good idea to tag the documents
(if not paragraphs, or records, or whatever) with a full locale ID,
RFC 4646 style.

There are quite a few things that cannot be done properly without
locale info. For example, sorting and case conversion are culture-sensitive.
So is font selection (you cannot use a Chinese Traditional font for
Chinese Simplified text, even when the text is identical).
In fact, unless all you do is move text around (no processing, no display),
it is best to know the locale of that text.


Jan Kucera
 
Re: Determining whether the text is RTL - 09-13-2007, 05:21 AM



Quote:
Without a CultureInfo you can try calling the native GetStringTypeEx.
It takes a locale ID, but you can pass whichever one you want:
the strong attributes in CT_CTYPE2 (C2_RIGHTTOLEFT/C2_LEFTTORIGHT) are
not affected by locale.

But there is still no reliable way to test for all RTL cases.
Sometimes not even a human can do it.

I will give it a try. I just want to avoid writing (not to mention that I did
not find any way of doing such a check in .NET)
if (char is Arabic || char is Hebrew || char is Urdu || char is Persian ||
char is Syriac)
and then forgetting the Divehi case, or any new culture that comes along. I
thought that the approach 'if whichever version of Windows (or .NET) I am
running thinks it is RTL, I should think so as well' would do the trick.



Quote:
I think most of the time text content is in a single language.
A document is mostly in language A, with small chunks of other languages.
But those areas have to be tagged.
Designing a document where all the languages are mixed, without properly
tagging them, is not very useful.
Think of MS Word, where you can mark text sections with a different language
for spell-checking.

Yes, I agree; I wanted to mention that in connection with your example too. I
know the text I'm displaying will always be entirely (or, rarely, all but a
word or two) in the same language. So I can afford to just check the first
character of a title, for example.
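
A minimal sketch of that first-character idea, in Python and using the Unicode bidirectional class instead of GetStringTypeEx (an assumption made for illustration). It skips leading digits, punctuation and spaces and lets the first strong character decide, which is similar in spirit to how the Unicode Bidirectional Algorithm picks a default paragraph direction. The function name and the "ltr" fallback are hypothetical choices.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character.

    Characters with no strong direction (digits, punctuation, whitespace)
    are skipped. If the text has no strong character at all, `default`
    is returned.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return default


if __name__ == "__main__":
    print(first_strong_direction("123 \u05e9\u05dc\u05d5\u05dd"))  # digits, then Hebrew
    print(first_strong_direction("Hello \u05e9\u05dc\u05d5\u05dd"))  # Latin first
```

The weakness is the one from the "XXXXX YYYYY" example: a title that happens to start with a foreign word gets the wrong direction.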



Quote:
If possible it would be a good idea to tag the documents
(if not paragraphs, or records, or whatever) with a full locale ID,
RFC 4646 style.

There are quite a few things that cannot be done properly without
locale info. For example, sorting and case conversion are culture-sensitive.
So is font selection (you cannot use a Chinese Traditional font for
Chinese Simplified text, even when the text is identical).
In fact, unless all you do is move text around (no processing, no display),
it is best to know the locale of that text.

Well, fortunately I define the schema here, so I could make some
changes or improvements. I have a set of data coming from different cultures
and, as Michael has written in his blog and also suggested to me, the user
most likely expects behaviour based on his own culture. So I do the sorting of
this data and the case-insensitive searching in the context of the user's
culture. All I do with the data themselves is display them. For that reason,
and because of WPF, I need some idea of whether I should mark the document as
RTL. The only other reason for knowing the CultureInfo that I could come up
with is the ToTitleCase method, but I expect the titles of documents to be
properly cased already.

The problem here is that I have data in languages which do not match
any existing culture, like Latin, Old or Middle English and so on; artificial
languages are not excluded either. Filtering the data to show only those in
Middle English (enm) is far more important to my application than having a
CultureInfo for the language, since I only need to display it. This is the
reason I chose the ISO-639-2 table instead of the .NET supported cultures.

If there were a table mapping ISO-639-2 or -3 languages to appropriate
CultureInfo classes, even if not accurate, my problems would be
solved. The document could keep the ISO tags and the application
would get the corresponding CultureInfo for displaying it properly. Until then,
GetStringTypeEx will do the job, I think.


Thank you for your hints and thoughts.
Jan



Mihai N.
 
Re: Determining whether the text is RTL - 09-13-2007, 11:24 PM



Quote:
... will always be entirely (or, rarely, all but a word or two) in
the same language. So I can afford to just check the first character of a
title, for example.

If you don't notice any performance hit, try going beyond the first
character, precisely because of the rare "word or two," or digits, or other
characters.
Maybe calculate a percentage (72% rtl, 12% ltr, 6% others), establish
a threshold, and go from there.
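
One way to sketch the percentage idea in Python, again using the Unicode bidirectional class as the per-character test (an assumption, GetStringTypeEx would play the same role on Windows). The function names and the 0.5 default are hypothetical, and the threshold is applied to strong characters only, so digits and punctuation do not dilute the result. That is a design choice rather than the only reading of the suggestion.

```python
import unicodedata


def direction_counts(text: str) -> dict:
    """Count strong RTL, strong LTR and all other characters in `text`."""
    counts = {"rtl": 0, "ltr": 0, "other": 0}
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            counts["rtl"] += 1
        elif bidi == "L":
            counts["ltr"] += 1
        else:
            counts["other"] += 1
    return counts


def is_mostly_rtl(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the strong characters are RTL."""
    counts = direction_counts(text)
    strong = counts["rtl"] + counts["ltr"]
    if strong == 0:
        return False  # nothing to go on; fall back to left-to-right
    return counts["rtl"] / strong >= threshold


if __name__ == "__main__":
    title = "\u05e9\u05dc\u05d5\u05dd abc 123"
    print(direction_counts(title), is_mostly_rtl(title))
```

For a short title the extra pass is cheap, and the threshold can be tuned once real problematic data shows up.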

Quote:
The problem here is that I have data in languages which do not match
any existing culture, like Latin, Old or Middle English and so on;
artificial languages are not excluded either.

Yes, I understand how this can be a problem :-)

If you can control the environment (and it is Vista) you can create your
own custom locales.

See:
http://blogs.msdn.com/shawnste/archi...23/496440.aspx
http://msdn.microsoft.com/msdnmag/is...12/LocaleHero/
http://msdn.microsoft.com/msdnmag/is.../CLRInsideOut/
http://windowsvistablog.com/blogs/wi...19/442572.aspx

And the tools:
- Microsoft Locale Builder (Beta 2)
http://www.microsoft.com/downloads/d...e4588c5e-8f21-45cc-b862-38df8d9bd528&DisplayLang=en
- Microsoft Keyboard Layout Creator
http://www.microsoft.com/globaldev/tools/msklc.mspx



Michael S. Kaplan [MSFT]
 
Re: Determining whether the text is RTL - 09-14-2007, 02:13 AM



Jan,

You can use code like in this post:

http://blogs.msdn.com/michkap/archiv...6/1421178.aspx

or use GetStringTypeW to get the info back.


--

MichKa [Microsoft]
Fundamentals Technical Lead
Windows International
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Jan Kucera
 
Re: Determining whether the text is RTL - 09-14-2007, 02:28 AM



"Mihai N." <nmihai_year_2000 (AT) yahoo (DOT) com> wrote

Quote:
Maybe calculate a percentage (72% rtl, 12% ltr, 6% others), establish
a threshold, and go from there.

Yes, I have thought about this already. It should not cost much performance,
since I am checking only the title of the document. But I think I'll try to
keep it simple for the moment (GetStringTypeEx works as expected, thanks!)
until I find any problematic data, or solve the problem some other way.

Quote:
If you can control the environment (and it is Vista) you can create your
own custom locales.

Thanks for the links. Regardless of whether I could afford to support only
Vista... well... there are 500 items in ISO-639-2 and 7500 in ISO-639-3...
Uh.. :-)) I have never even heard of most of them, let alone know the
culture/language deeply enough to create a corresponding CultureInfo.

Jan



Jan Kucera
 
Re: Determining whether the text is RTL - 09-14-2007, 02:39 AM



"Michael S. Kaplan [MSFT]" <michka (AT) online (DOT) microsoft.com> wrote

Quote:
Jan,

You can use code like in this post:
http://blogs.msdn.com/michkap/archiv...6/1421178.aspx
or use GetStringTypeW to get the info back.


Hmmm... thanks for the managed way, Michael!
Although I'd have to find a very good reason to leave PInvoke and move to
Reflection... ;-)

Any improvements in .NET 3.0 or 3.5?
Jan



Michael S. Kaplan [MSFT]
 
Re: Determining whether the text is RTL - 09-14-2007, 08:20 AM



Unfortunately, no -- red bits/green bits rules, you see. :-(

