HighTechTalks DotNet Forums  

Conversion problem between UTF-8 and Unicode characters.

Dotnet Internationalization microsoft.public.dotnet.internationalization


Sunray
 
Conversion problem between UTF-8 and Unicode characters. - 07-02-2009 , 11:34 AM

Do I misunderstand something?

This is in C#.

When reading some UTF-8 text from an Access notes field using ADO.NET, I get
the UTF-8 bytes held as characters in a UTF-16 string. In this example I'm
going to use the Hebrew letter het, 'ח'. This is U+05D7 in Unicode and D7 97
in UTF-8.

I can easily convert the U+05D7 character held in a C# string to its
corresponding UTF-8 bytes:

UnicodeEncoding unicode = new UnicodeEncoding(); // UTF-16, little-endian
UTF8Encoding utf8 = new UTF8Encoding();

string het = "ח";
byte[] UnicodeHet = unicode.GetBytes(het); // D7 05
byte[] UTF8Bytes = Encoding.Convert(unicode, utf8, UnicodeHet); // D7 97

UTF8Bytes is then written to the database.
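
For reference, the same bytes can be obtained in one step. This is a minimal sketch rather than the code in use above; the names `HetEncodeSketch` and `ToUtf8` are made up for illustration. It relies only on `Encoding.UTF8.GetBytes`, so the intermediate UTF-16 byte array and `Encoding.Convert` are not needed.

```csharp
using System;
using System.Text;

static class HetEncodeSketch
{
    // Hypothetical helper: encodes a C# (UTF-16) string straight to UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        string het = "\u05D7"; // Hebrew letter het, written as an escape
        byte[] utf8Bytes = ToUtf8(het);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // D7-97
    }
}
```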

When I read this back from the database, I get two characters that represent
the UTF-8 bytes, held in a UTF-16 C# string. I can convert these back to the
het character using the following code:

UnicodeEncoding unicode = new UnicodeEncoding();
UTF8Encoding utf8 = new UTF8Encoding();
Encoding local = Encoding.GetEncoding(1252);

string utf8het = "׳—"; // Normally read from the database but hardcoded here
byte[] utf8hetbytes = local.GetBytes(utf8het);
byte[] utf8result = Encoding.Convert(utf8, unicode, utf8hetbytes);
string result = unicode.GetString(utf8result);

If the code page for the machine is set to 1252, this works correctly. That
is, if the value in the database was the Hebrew het character 'ח',
local.GetBytes returns the UTF-8 bytes D7 97, which are then correctly
decoded to U+05D7.
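
The working round trip can be sketched as follows, with the two characters written as escapes so that the encoding of the source file does not matter. It assumes the stored bytes D7 97 reached the application decoded as Windows-1252, which gives U+00D7 and U+2014. The names `Cp1252RoundTripSketch` and `Recover` are illustrative only. On .NET Core and .NET 5 or later, code page 1252 is only available after registering `CodePagesEncodingProvider`; on .NET Framework that line should be removed.

```csharp
using System;
using System.Text;

static class Cp1252RoundTripSketch
{
    // Hypothetical helper: re-encode the mis-decoded text as Windows-1252 to get
    // the original bytes back, then decode those bytes as UTF-8.
    public static string Recover(string misdecoded)
    {
        // Needed on .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);
        byte[] raw = cp1252.GetBytes(misdecoded);
        return Encoding.UTF8.GetString(raw);
    }

    static void Main()
    {
        string fromDb = "\u00D7\u2014"; // bytes D7 97 as seen through code page 1252
        string result = Recover(fromDb);
        Console.WriteLine(result == "\u05D7"); // True
    }
}
```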

Problem: If I subsequently change the code page of the machine to Hebrew,

byte[] utf8hetbytes = local.GetBytes(utf8het);

will start returning 3F 97. 3F is '?', which generally means that the
character could not be translated.

Why?
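
One likely explanation, offered as an assumption rather than something confirmed in this thread: the encoding returned by `Encoding.GetEncoding(1252)` does not change with the machine's code page, but the string handed back by the database layer does. If the stored bytes D7 97 are decoded with the Hebrew code page (Windows-1255) instead of 1252, D7 becomes U+05F3 (geresh), which has no Windows-1252 equivalent and is replaced by '?' (3F), while 97 becomes U+2014 in both code pages. The sketch below simulates that decode step in code; the class and method names are illustrative, and the `RegisterProvider` line applies to .NET Core and .NET 5 or later only.

```csharp
using System;
using System.Text;

static class CodePageMismatchSketch
{
    // Decode raw bytes with one code page, then re-encode with Windows-1252,
    // mimicking what happens when the decoding code page differs from 1252.
    public static byte[] DecodeThenEncode1252(byte[] stored, int decodeCodePage)
    {
        // Needed on .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        string seenByApp = Encoding.GetEncoding(decodeCodePage).GetString(stored);
        return Encoding.GetEncoding(1252).GetBytes(seenByApp);
    }

    static void Main()
    {
        byte[] stored = { 0xD7, 0x97 }; // UTF-8 bytes of het
        Console.WriteLine(BitConverter.ToString(DecodeThenEncode1252(stored, 1252))); // D7-97
        Console.WriteLine(BitConverter.ToString(DecodeThenEncode1252(stored, 1255))); // 3F-97
    }
}
```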

If I switch to getting the default code page, it always works. Unfortunately
it appears the rest of the code (poor) requires 1252. Am I wrong in assuming
that if I get the 1252 encoding it should not be affected by the code page of
the machine? It appears that I am faced with a bit of a major rework because
of this.

Is there another way to get the two UTF-8 bytes held in a C# string into a
byte array without going through a code page?
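
A C# string holds UTF-16 code units rather than the original bytes, so some encoding is always involved in getting the bytes back; the one that works is the encoding the text was decoded with in the first place. The sketch below makes that explicit. It is an illustration under assumptions: `StrictRecoverSketch` and `Recover` are hypothetical names, the caller has to know the source code page, and the `RegisterProvider` line is for .NET Core and .NET 5 or later only. It uses `EncoderFallback.ExceptionFallback` so that a mismatch raises `EncoderFallbackException` instead of silently producing 3F.

```csharp
using System;
using System.Text;

static class StrictRecoverSketch
{
    // Hypothetical helper. sourceCodePage must be the code page the text was
    // originally decoded with (an assumption the caller has to supply).
    public static string Recover(string misdecoded, int sourceCodePage)
    {
        // Needed on .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding strict = Encoding.GetEncoding(
            sourceCodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        byte[] raw = strict.GetBytes(misdecoded); // throws if a char has no mapping
        return Encoding.UTF8.GetString(raw);
    }

    static void Main()
    {
        // Bytes D7 97 as seen through the Hebrew code page 1255.
        Console.WriteLine(Recover("\u05F3\u2014", 1255) == "\u05D7"); // True
        try
        {
            Recover("\u05F3\u2014", 1252); // wrong code page for this text
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("U+05F3 cannot be encoded in 1252");
        }
    }
}
```

Failing loudly is a design choice here: a '?' written back to the database would corrupt the data without any error.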

Thanks in advance

Alex

Nag
 
Re: Conversion problem between UTF-8 and Unicode characters. - 09-01-2009 , 09:49 PM

Hi,
Is there any sample application or starter kit for building an
internationalized .NET web application?

-Nagendra
