Encoding (charset) with HTMLDocumentClass
12-14-2007, 02:57 PM
In a C# Windows application (ASP.NET 1.1), I am trying to request
Web-based content, extract some of the content and some other
information using the HTML DOM, and then create a new copy of the site
with the same material wrapped in a different HTML structure. As far as
I can tell, I need the HTMLDocumentClass, based on the MSHTML COM
object, for this.
In particular, the content being read is encoded as ISO-8859-1, and
entities such as the non-breaking space and the en and em dashes are
encoded as &nbsp;, &ndash;, &mdash;, etc. I need to preserve these as
such.
I have
HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
htmlDoc.open(url, "blah", null, null);
htmlDoc.charset = "iso-8859-1";
htmlDoc.close();
string title = htmlDoc.IHTMLDocument2_title;
HTMLTable table = (HTMLTable) htmlDoc.getElementById("Table1");
HTMLTableCell cell = (HTMLTableCell)
table.firstChild.firstChild.firstChild;
string mainContent = cell.innerHTML;
This doesn't work. If I set a breakpoint right after the close() call,
Debug.Print htmlDoc.documentElement.innerHTML
gives me
<HEAD></HEAD>\r\n<BODY></BODY>
So, instead, I decided to use HttpWebRequest to retrieve the content and
then write to an empty HTMLDocumentClass object:
System.Text.Encoding encoding =
System.Text.Encoding.GetEncoding("iso-8859-1");
StreamReader reader =
new StreamReader(response.GetResponseStream(), encoding);
HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
htmlDoc.open("", "", null, null);
htmlDoc.charset = "iso-8859-1";
string contents = reader.ReadToEnd();
((IHTMLDocument2) htmlDoc).write(contents, "");
htmlDoc.close();
string title = htmlDoc.IHTMLDocument2_title;
HTMLTable table = (HTMLTable) htmlDoc.getElementById("Table1");
HTMLTableCell cell = (HTMLTableCell)
table.firstChild.firstChild.firstChild;
string mainContent = cell.innerHTML;
When I look at mainContent in a MessageBox, or when I later look at the
file I write mainContent to (using encoding iso-8859-1 with a
StreamWriter), every instance of &nbsp; has been changed to what looks
like a plain space, and the dashes have been converted to hyphens.
How can I use the DOM while also maintaining the HTML code as originally
entered? And in particular, how can I prevent dash characters from being
replaced with hyphens?
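
One direction that might be worth trying, sketched here under assumptions that nothing in this post confirms: innerHTML appears to hand back the parsed text with entities already decoded to Unicode characters, and the en and em dashes (U+2013, U+2014) have no representation in ISO-8859-1, so the encoder has to substitute something when the file is written. Re-escaping every non-ASCII character as a numeric character reference before writing would sidestep that. EntityEscaper and EscapeNonAscii are hypothetical names, the output uses numeric references such as &#8211; and not the original named entities, and the surrogate handling needs .NET 2.0 or later.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of MSHTML or the .NET Framework.
static class EntityEscaper
{
    // Replaces every character above U+007F with a numeric character
    // reference, so the result is pure ASCII and is written unchanged
    // by any ASCII-compatible encoding, including iso-8859-1.
    public static string EscapeNonAscii(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c <= 127)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // Combine a surrogate pair into one code point so characters
            // outside the Basic Multilingual Plane get a single reference.
            if (char.IsHighSurrogate(c) && i + 1 < html.Length
                && char.IsLowSurrogate(html[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, html[i + 1]);
                i++;
            }
            sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}

static class Program
{
    static void Main()
    {
        // Non-breaking space, en dash, em dash as decoded characters.
        string decoded = "a\u00A0b \u2013 c \u2014 d";
        Console.WriteLine(EntityEscaper.EscapeNonAscii(decoded));
        // prints: a&#160;b &#8211; c &#8212; d
    }
}
```

In the second listing above, this would mean passing cell.innerHTML through EscapeNonAscii before handing the string to the StreamWriter. It does not explain why the MessageBox already shows hyphens, so it may only address part of the problem.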