HighTechTalks DotNet Forums  

Encoding (charset) with HTMLDocumentClass

Dotnet Framework (Interop) microsoft.public.dotnet.framework.interop


Harlan Messinger
 
12-14-2007, 02:57 PM






I am building a C# Windows application (ASP.NET 1.1) that requests
Web-based content, extracts some of the content and some other
information using the HTML DOM, and then creates a new copy of the site
with the same material wrapped in a different HTML structure. As far as
I can tell, I need the HTMLDocumentClass, based on the MSHTML COM
object, for this.

In particular, the content being read is encoded as ISO-8859-1, and
entities such as the non-breaking space and the en and em dashes are
encoded as &#160;, &#8211;, &#8212;, etc. I need to preserve these as
such.

I have

HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
htmlDoc.open(url, "blah", null, null);
htmlDoc.charset = "iso-8859-1";
htmlDoc.close();

string title = htmlDoc.IHTMLDocument2_title;
HTMLTable table = (HTMLTable) htmlDoc.getElementById("Table1");
HTMLTableCell cell =
    (HTMLTableCell) table.firstChild.firstChild.firstChild;
string mainContent = cell.innerHTML;

This doesn't work. If I set a breakpoint right after the close() call,

Debug.Print htmlDoc.documentElement.innerHTML

gives me

<HEAD></HEAD>\r\n<BODY></BODY>

So, instead, I decided to use HttpWebRequest to retrieve the content and
then write to an empty HTMLDocumentClass object:

System.Text.Encoding encoding =
    System.Text.Encoding.GetEncoding("iso-8859-1");
StreamReader reader =
    new StreamReader(response.GetResponseStream(), encoding);

HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
htmlDoc.open("", "", null, null);
htmlDoc.charset = "iso-8859-1";
string contents = reader.ReadToEnd();

((IHTMLDocument2) htmlDoc).write(contents, "");
htmlDoc.close();

string title = htmlDoc.IHTMLDocument2_title;
HTMLTable table = (HTMLTable) htmlDoc.getElementById("Table1");
HTMLTableCell cell =
    (HTMLTableCell) table.firstChild.firstChild.firstChild;
string mainContent = cell.innerHTML;

When I look at mainContent in a MessageBox, or when I later look at the
file I write mainContent to (using encoding iso-8859-1 with a
StreamWriter), every instance of &#160; has been changed to &nbsp;, and
the dashes have been converted to hyphens.

How can I use the DOM while also maintaining the HTML code as originally
entered? And in particular, how can I prevent dash characters from being
replaced with hyphens?
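
One thing that might be worth trying, offered only as a rough sketch and
not as a confirmed fix: ISO-8859-1 cannot represent the en and em dash
characters (U+2013, U+2014), so if the DOM hands them back as real
characters, they have to be turned into numeric references again before
the string is written with that encoding. The helper below does that for
every non-ASCII character and also rewrites &nbsp; as &#160;. The names
EntityEscaper, EscapeNonAscii and NormalizeNbsp are made up for this
illustration, and char.IsHighSurrogate and char.ConvertToUtf32 assume
.NET 2.0 or later. It normalizes everything to numeric references; it
does not recover the exact form used in the original markup.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of MSHTML or the .NET Framework).
public static class EntityEscaper
{
    // Replaces every character above U+007F with a decimal numeric
    // character reference, so the result is pure ASCII and survives
    // being written as ISO-8859-1.
    public static string EscapeNonAscii(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 128)
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < html.Length
                     && char.IsLowSurrogate(html[i + 1]))
            {
                // Characters outside the BMP: emit one reference for the pair.
                sb.Append("&#").Append(char.ConvertToUtf32(c, html[i + 1])).Append(';');
                i++;
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }
        return sb.ToString();
    }

    // Rewrites the named entity that innerHTML returns as a numeric one.
    public static string NormalizeNbsp(string html)
    {
        return html.Replace("&nbsp;", "&#160;");
    }
}
```

It would be applied to the string taken from the DOM before writing it
out, for example:

string safe = EntityEscaper.EscapeNonAscii(
    EntityEscaper.NormalizeNbsp(mainContent));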
