HighTechTalks DotNet Forums  

StreamReader->regex->Win32 api marshalling

Dotnet Framework (Performance) microsoft.public.dotnet.framework.performance


Discuss StreamReader->regex->Win32 api marshalling in the Dotnet Framework (Performance) forum.



Reply
 
Thread Tools Search this Thread Display Modes
  #1  
Old   
Tim
 
Posts: n/a

Default StreamReader->regex->Win32 api marshalling - 08-31-2005 , 10:35 AM






I've written a C# .Net console app that reads massive (upto 2Gb) text files
of csv data using a StreamReader with UTF7 in configurable-size char[]
blocks, parses the read data using the .Net regex facilities and re-formats
it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp
api and so squirts the data into a SQLServer db. It all works fine, but the
performance is just not good enough - I need to at least halve the time its
taking to process the data. I've profiled the code and it looks like the
overhead is more or less evenly split between the reading/parsing on the one
hand and the data loading/bcp/Win32 api marshalling on the other hand - so
I'm looking for improvements across the whole functionality. The problem is
that I need the Regex sophistication to parse the data because its very rough
- so I need to turn the text read from the files into .Net Strings so I can
Regex them. But then I need to marshall the output Strings from the Regex
down to the BCP api because bcp is the only SQLServer i/f that provides
anything like the performance we need. However, I gather that two big .Net
performance overheads are String splitting/manipulation and .Net to Win32
ansi string marshalling :-(. Any suggestions for optimisation tweaks for
these parts of the FCL or ways to circumvent the problem areas much
appreciated.

Reply With Quote
  #2  
Old   
David Browne
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 08-31-2005 , 11:02 AM







"Tim" <Tim (AT) discussions (DOT) microsoft.com> wrote

Quote:
I've written a C# .Net console app that reads massive (upto 2Gb) text
files
of csv data using a StreamReader with UTF7 in configurable-size char[]
blocks, parses the read data using the .Net regex facilities and
re-formats
it before sending it to a Managed C++ .Net dll that wraps the SQLServer
bcp
api and so squirts the data into a SQLServer db. It all works fine, but
the
performance is just not good enough - I need to at least halve the time
its
taking to process the data. I've profiled the code and it looks like the
overhead is more or less evenly split between the reading/parsing on the
one
hand and the data loading/bcp/Win32 api marshalling on the other hand - so
I'm looking for improvements across the whole functionality. The problem
is
that I need the Regex sophistication to parse the data because its very
rough
- so I need to turn the text read from the files into .Net Strings so I
can
Regex them. But then I need to marshall the output Strings from the Regex
down to the BCP api because bcp is the only SQLServer i/f that provides
anything like the performance we need. However, I gather that two big .Net
performance overheads are String splitting/manipulation and .Net to Win32
ansi string marshalling :-(. Any suggestions for optimisation tweaks for
these parts of the FCL or ways to circumvent the problem areas much
appreciated.
Two ideas. First is to use buffering so your parsing code doesn't wait on
your BCP loads, and vice versa. Introducing a buffer between the two can
help keep the processor busy while the BCP component is waiting on network
and SQL Server work.

An excelent candidate for buffering is a file. Just run you .NET code and
output a clean file. Then take that file and BCP it into SQL Server. You
really aren't gaining anything by pipelining this stuff into SQL.

David




Reply With Quote
  #3  
Old   
Scott
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 08-31-2005 , 11:37 AM



Tim wrote:

Quote:
db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking
Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.


Reply With Quote
  #4  
Old   
David Browne
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 08-31-2005 , 11:44 AM




"Scott" <me (AT) me (DOT) com> wrote

Quote:
Tim wrote:

db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

BCP is significantly faster than using "batched" inserts.

David




Reply With Quote
  #5  
Old   
shriop
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 08-31-2005 , 11:51 AM



Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

Scott wrote:
Quote:
Tim wrote:

db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.


Reply With Quote
  #6  
Old   
Tim
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 08-31-2005 , 12:56 PM



I did initially try writing an intermediate re-formatted text file and then
processing that using the bcp utility as a separate process (great minds
....), but the problem with that is the size of the files - they are huge and
we can't afford to replicate them as intermediate files. The performance
wasn't any better either - the overhead of writing to the intermediate text
file was just as great.
Regarding buffering - the profiling seemed to indicate that there wasn't
much slack in the processor, so I don't think there's a lot to be made up
there (its a local server so there's no network overhead).
I didn't want to complicate things by mentioning this, but a further
complication is that there is a separate thread to the file parsing component
that uses a .Net FileSystemWatcher object to look for the csv files arriving
and fires off the reading/parsing/bcping process as separate threads - so in
practice there will be many of these file processing threads running in
parallel, so I'd expect any slack in the processor to be taken up by that
anyway. So for the minute I'm just trying to optimise the file processing
code itself.

Tim

"David Browne" wrote:

Quote:
"Tim" <Tim (AT) discussions (DOT) microsoft.com> wrote in message
news:433786FE-430F-4807-A097-5032E28B262A (AT) microsoft (DOT) com...
I've written a C# .Net console app that reads massive (upto 2Gb) text
files
of csv data using a StreamReader with UTF7 in configurable-size char[]
blocks, parses the read data using the .Net regex facilities and
re-formats
it before sending it to a Managed C++ .Net dll that wraps the SQLServer
bcp
api and so squirts the data into a SQLServer db. It all works fine, but
the
performance is just not good enough - I need to at least halve the time
its
taking to process the data. I've profiled the code and it looks like the
overhead is more or less evenly split between the reading/parsing on the
one
hand and the data loading/bcp/Win32 api marshalling on the other hand - so
I'm looking for improvements across the whole functionality. The problem
is
that I need the Regex sophistication to parse the data because its very
rough
- so I need to turn the text read from the files into .Net Strings so I
can
Regex them. But then I need to marshall the output Strings from the Regex
down to the BCP api because bcp is the only SQLServer i/f that provides
anything like the performance we need. However, I gather that two big .Net
performance overheads are String splitting/manipulation and .Net to Win32
ansi string marshalling :-(. Any suggestions for optimisation tweaks for
these parts of the FCL or ways to circumvent the problem areas much
appreciated.

Two ideas. First is to use buffering so your parsing code doesn't wait on
your BCP loads, and vice versa. Introducing a buffer between the two can
help keep the processor busy while the BCP component is waiting on network
and SQL Server work.

An excelent candidate for buffering is a file. Just run you .NET code and
output a clean file. Then take that file and BCP it into SQL Server. You
really aren't gaining anything by pipelining this stuff into SQL.

David




Reply With Quote
  #7  
Old   
Tim
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 09-01-2005 , 05:21 AM



When I said the csv data was rough, I meant it! As an example: it uses a
specific configurable field delimiter for all the fields in a record except
the last field in the record which uses a \r\n. BUT ... the last field (and
in fact any other field) may also contain \r\n in its data! My solution to
this was to use the regex definition of the format of the first field in each
record to find the end of the last field. Would I be able to handle this with
your csv parser?
Thanks
Tim

"shriop" wrote:

Quote:
Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

Scott wrote:
Tim wrote:

db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.



Reply With Quote
  #8  
Old   
shriop
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 09-01-2005 , 11:15 AM



I'm pretty sure this is right in line with what my parser is made to
handle. If you find it can't handle it, then I'd like to see an example
of the data so I can add the functionality to handle it. You can set
the delimiter to anything you want. And end of row is commonly ended
with a /r/n, so it handles that. the /r/n in the data is no problem as
long as there are double quotes around the field itself, so the parser
has something to base the logic on and know that this is data, and not
the end of a row. I've got a free demo version that you can try it out
with. You can also feel free to email me directly with any questions.
This kind of data is why I generally recommend people not try to use
regular expressions to parse the data, the expression just gets too
hairy, leading to poor performance.

Here's what I'm picturing you're data looking like, and what it can
handle, although I don't know what specific delimiter you're using.

1,Bruce,Dunwiddie,"in here, I can have commas and
\r\n"\r\n2,Bob,Jones,some other quote here


Tim wrote:
Quote:
When I said the csv data was rough, I meant it! As an example: it uses a
specific configurable field delimiter for all the fields in a record except
the last field in the record which uses a \r\n. BUT ... the last field (and
in fact any other field) may also contain \r\n in its data! My solution to
this was to use the regex definition of the format of the first field in each
record to find the end of the last field. Would I be able to handle this with
your csv parser?
Thanks
Tim

"shriop" wrote:

Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

Scott wrote:
Tim wrote:

db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.




Reply With Quote
  #9  
Old   
Tim
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 09-01-2005 , 12:39 PM



Thats the problem - the text isn't quoted. So I'm using meta-data definitions
of the SQLServer table columns that the csv data is destined for to decide
what sort of char patterns to expect in each field. But where the field is
text all bets are off - you just keep going 'til you reach whatever's defined
as the delimiter char; which is fine unless the last field in the row is text
and potentially includes the row delimiter (\r\n) in its data. As far as I
can see the only way to semi-reliably determine the end of the row,
therefore, is to find the start of the next row. Originally I was using the
entire regex definition of a record to find the start of the next row, but
I've optimised it to just look for the first column of the next row to save
time. I can't think of anything else to do - apart from drowning the person
responsible for the data format. Won't solve the problem but I'll feel a lot
better.

"shriop" wrote:

Quote:
I'm pretty sure this is right in line with what my parser is made to
handle. If you find it can't handle it, then I'd like to see an example
of the data so I can add the functionality to handle it. You can set
the delimiter to anything you want. And end of row is commonly ended
with a /r/n, so it handles that. the /r/n in the data is no problem as
long as there are double quotes around the field itself, so the parser
has something to base the logic on and know that this is data, and not
the end of a row. I've got a free demo version that you can try it out
with. You can also feel free to email me directly with any questions.
This kind of data is why I generally recommend people not try to use
regular expressions to parse the data, the expression just gets too
hairy, leading to poor performance.

Here's what I'm picturing you're data looking like, and what it can
handle, although I don't know what specific delimiter you're using.

1,Bruce,Dunwiddie,"in here, I can have commas and
\r\n"\r\n2,Bob,Jones,some other quote here


Tim wrote:
When I said the csv data was rough, I meant it! As an example: it uses a
specific configurable field delimiter for all the fields in a record except
the last field in the record which uses a \r\n. BUT ... the last field (and
in fact any other field) may also contain \r\n in its data! My solution to
this was to use the regex definition of the format of the first field in each
record to find the end of the last field. Would I be able to handle this with
your csv parser?
Thanks
Tim

"shriop" wrote:

Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

Scott wrote:
Tim wrote:

db. It all works fine, but the performance is just not good enough -
I need to at least halve the time its taking to process the data.
I've profiled the code and it looks like the overhead is more or less
evenly split between the reading/parsing on the one hand and the data
loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.





Reply With Quote
  #10  
Old   
shriop
 
Posts: n/a

Default Re: StreamReader->regex->Win32 api marshalling - 09-01-2005 , 01:34 PM



Can I see an example of one of these problematic rows? You're pretty
much out of luck from what you say. I saw a post a while back about
exporting events from the event viewer when the description contains
newline characters and the description is not quoted. I tried to find
something, anything, to base the logic off of. But if you can't be
certain of anything to deal with this file, then it's awful hard to
parse. This is one of the reasons why I try to push using my CsvWriter
class to create csv files, so people who don't know any better don't
try to come up with what they think a csv format is and end up leaving
out handling of newlines for example.


Reply With Quote
Reply




Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.4
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.