![]() | |
![]() |
| | Thread Tools | Search this Thread | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I've written a C# .Net console app that reads massive (upto 2Gb) text files of csv data using a StreamReader with UTF7 in configurable-size char[] blocks, parses the read data using the .Net regex facilities and re-formats it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp api and so squirts the data into a SQLServer db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking for improvements across the whole functionality. The problem is that I need the Regex sophistication to parse the data because its very rough - so I need to turn the text read from the files into .Net Strings so I can Regex them. But then I need to marshall the output Strings from the Regex down to the BCP api because bcp is the only SQLServer i/f that provides anything like the performance we need. However, I gather that two big .Net performance overheads are String splitting/manipulation and .Net to Win32 ansi string marshalling :-(. Any suggestions for optimisation tweaks for these parts of the FCL or ways to circumvent the problem areas much appreciated. |
#3
| |||
| |||
|
|
db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking |
#4
| |||
| |||
|
|
Tim wrote: db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex? I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there. BCP is significantly faster than using "batched" inserts. |
#5
| |||
| |||
|
|
Tim wrote: db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex? I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there. I've had very good performance success with these two strategies. Scott C. |
#6
| |||
| |||
|
|
"Tim" <Tim (AT) discussions (DOT) microsoft.com> wrote in message news:433786FE-430F-4807-A097-5032E28B262A (AT) microsoft (DOT) com... I've written a C# .Net console app that reads massive (upto 2Gb) text files of csv data using a StreamReader with UTF7 in configurable-size char[] blocks, parses the read data using the .Net regex facilities and re-formats it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp api and so squirts the data into a SQLServer db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking for improvements across the whole functionality. The problem is that I need the Regex sophistication to parse the data because its very rough - so I need to turn the text read from the files into .Net Strings so I can Regex them. But then I need to marshall the output Strings from the Regex down to the BCP api because bcp is the only SQLServer i/f that provides anything like the performance we need. However, I gather that two big .Net performance overheads are String splitting/manipulation and .Net to Win32 ansi string marshalling :-(. Any suggestions for optimisation tweaks for these parts of the FCL or ways to circumvent the problem areas much appreciated. Two ideas. First is to use buffering so your parsing code doesn't wait on your BCP loads, and vice versa. Introducing a buffer between the two can help keep the processor busy while the BCP component is waiting on network and SQL Server work. An excelent candidate for buffering is a file. Just run you .NET code and output a clean file. Then take that file and BCP it into SQL Server. You really aren't gaining anything by pipelining this stuff into SQL. David |
#7
| |||
| |||
|
|
Lots of info here, but I'll try to split it out. Regex's are a notoriously bad performance wise way to accomplish what you're trying to do. http://www.codeproject.com/cs/database/CsvReader.asp http://www.csvreader.com/csv_benchmarks.html I would suggest you either implement the code yourself using raw char array handling to find the commas, etc, etc, or purchase a commercial csv parser like the one I sell, http://www.csvreader.com . bcp is not what I would recommend for this because it does not properly handle escape sequences, and does not necessarily have a performance benefit that I've seen over dts. I would recommend calling a dts job instead to import this data. If you don't need to do any data manipulation to the original csv file, then there's no reason to even do any csv parsing, just let dts do it for you. If you do want to try the batching of insert statements like one of the previous comments suggested, make sure you start a transaction before doing the chunks of inserts, then commit at the end of a chunk. Otherwise, you'll get a performance hit as the db tries to commit after every single insert. Scott wrote: Tim wrote: db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex? I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there. I've had very good performance success with these two strategies. Scott C. |
#8
| |||
| |||
|
|
When I said the csv data was rough, I meant it! As an example: it uses a specific configurable field delimiter for all the fields in a record except the last field in the record which uses a \r\n. BUT ... the last field (and in fact any other field) may also contain \r\n in its data! My solution to this was to use the regex definition of the format of the first field in each record to find the end of the last field. Would I be able to handle this with your csv parser? Thanks Tim "shriop" wrote: Lots of info here, but I'll try to split it out. Regex's are a notoriously bad performance wise way to accomplish what you're trying to do. http://www.codeproject.com/cs/database/CsvReader.asp http://www.csvreader.com/csv_benchmarks.html I would suggest you either implement the code yourself using raw char array handling to find the commas, etc, etc, or purchase a commercial csv parser like the one I sell, http://www.csvreader.com . bcp is not what I would recommend for this because it does not properly handle escape sequences, and does not necessarily have a performance benefit that I've seen over dts. I would recommend calling a dts job instead to import this data. If you don't need to do any data manipulation to the original csv file, then there's no reason to even do any csv parsing, just let dts do it for you. If you do want to try the batching of insert statements like one of the previous comments suggested, make sure you start a transaction before doing the chunks of inserts, then commit at the end of a chunk. Otherwise, you'll get a performance hit as the db tries to commit after every single insert. Scott wrote: Tim wrote: db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex? I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there. I've had very good performance success with these two strategies. Scott C. |
#9
| |||
| |||
|
|
I'm pretty sure this is right in line with what my parser is made to handle. If you find it can't handle it, then I'd like to see an example of the data so I can add the functionality to handle it. You can set the delimiter to anything you want. And end of row is commonly ended with a /r/n, so it handles that. the /r/n in the data is no problem as long as there are double quotes around the field itself, so the parser has something to base the logic on and know that this is data, and not the end of a row. I've got a free demo version that you can try it out with. You can also feel free to email me directly with any questions. This kind of data is why I generally recommend people not try to use regular expressions to parse the data, the expression just gets too hairy, leading to poor performance. Here's what I'm picturing you're data looking like, and what it can handle, although I don't know what specific delimiter you're using. 1,Bruce,Dunwiddie,"in here, I can have commas and \r\n"\r\n2,Bob,Jones,some other quote here Tim wrote: When I said the csv data was rough, I meant it! As an example: it uses a specific configurable field delimiter for all the fields in a record except the last field in the record which uses a \r\n. BUT ... the last field (and in fact any other field) may also contain \r\n in its data! My solution to this was to use the regex definition of the format of the first field in each record to find the end of the last field. Would I be able to handle this with your csv parser? Thanks Tim "shriop" wrote: Lots of info here, but I'll try to split it out. Regex's are a notoriously bad performance wise way to accomplish what you're trying to do. http://www.codeproject.com/cs/database/CsvReader.asp http://www.csvreader.com/csv_benchmarks.html I would suggest you either implement the code yourself using raw char array handling to find the commas, etc, etc, or purchase a commercial csv parser like the one I sell, http://www.csvreader.com . bcp is not what I would recommend for this because it does not properly handle escape sequences, and does not necessarily have a performance benefit that I've seen over dts. I would recommend calling a dts job instead to import this data. If you don't need to do any data manipulation to the original csv file, then there's no reason to even do any csv parsing, just let dts do it for you. If you do want to try the batching of insert statements like one of the previous comments suggested, make sure you start a transaction before doing the chunks of inserts, then commit at the end of a chunk. Otherwise, you'll get a performance hit as the db tries to commit after every single insert. Scott wrote: Tim wrote: db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex? I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there. I've had very good performance success with these two strategies. Scott C. |
#10
| |||
| |||
|
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
| |