I am using PowerShell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.
If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I need to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET StreamReader to read the compressed input files, and I'm wondering if I need to use a StreamWriter to write the output files as well.
The naive version looks something like this:
while (!$reader.EndOfStream) {
    $line = $reader.ReadLine()
    switch ($line.Substring(0,3)) {
        "001" {Add-Content "output001.txt" $line}
        "002" {Add-Content "output002.txt" $line}
        "003" {Add-Content "output003.txt" $line}
    }
}
That just looks like bad news: finding, opening, writing, and closing a file once per row. The input files are huge, 500MB+ monsters.
Is there an idiomatic way to handle this efficiently with PowerShell constructs, or should I turn to the .NET StreamWriter?
Are there methods of a (New-Item “path” -type “file”) object I could use for this?
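If a StreamWriter is the answer, what I have in mind is something like this — untested, keeping one writer open per prefix instead of reopening a file per row (the MemoryStream here just stands in for the real zip-entry stream shown below):

```powershell
# Stand-in input; in the real script $reader wraps the zip entry's stream
$bytes = [System.Text.Encoding]::UTF8.GetBytes("001 a`n002 b`n001 c")
$reader = New-Object System.IO.StreamReader (New-Object System.IO.MemoryStream (,$bytes))

# One StreamWriter per prefix, opened on first use and kept open
$writers = @{}
while (!$reader.EndOfStream) {
    $line = $reader.ReadLine()
    $key = $line.Substring(0,3)
    if (-not $writers.ContainsKey($key)) {
        # Join-Path $pwd avoids the .NET-current-directory gotcha with relative paths
        $writers[$key] = New-Object System.IO.StreamWriter (Join-Path $pwd "output$key.txt")
    }
    $writers[$key].WriteLine($line)
}

# Close everything so buffers are flushed
foreach ($w in $writers.Values) { $w.Close() }
$reader.Close()
```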
EDIT for context:
I'm using the DotNetZip library to read ZIP files as streams; hence the StreamReader rather than Get-Content/gc. Sample code:
[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll")
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")
foreach ($entry in $zipfile) {
    $reader = New-Object System.IO.StreamReader $entry.OpenReader()
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        #do something here
    }
}
I should probably Dispose() of both the $zipfile and $reader, but that is for another question!
Reading

As for reading the file and parsing, I would go with the switch statement: I think it is a better approach because you don't have to make a substring (which might be expensive) and -file is quite handy 😉

Writing

As for writing the output, I'll test using a StreamWriter; however, if the performance of Add-Content is decent for you, I would stick with it.

Added:
Keith proposed to use the >> operator; however, it seems to be very slow. Besides that, it writes output in Unicode, which doubles the file size. In my test the difference was huge.
Added: I was curious about the writing performance, and I was a little bit surprised: the StreamWriter approach was 80 times faster than Add-Content.
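For reference, the StreamWriter variant of the writing loop is roughly this (a sketch; $lines and output.txt are placeholders for your real data and destination):

```powershell
# Stand-in data; in the real script these are the lines read from the input
$lines = 1..5 | ForEach-Object { "00$_ sample line" }

# One StreamWriter, opened once and reused for every line
$writer = New-Object System.IO.StreamWriter (Join-Path $pwd 'output.txt')
try {
    foreach ($line in $lines) {
        $writer.WriteLine($line)
    }
} finally {
    $writer.Close()   # flushes buffers and releases the file handle
}
```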
Now you have to decide: if speed is important, use StreamWriter. If code clarity is important, use Add-Content.

Substring vs. Regex
According to Keith, Substring is 20% faster. It depends, as always; in my case, though, the difference was not significant, and for me regexes are more readable.
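Just to illustrate the switch approach from the Reading section, a sketch (input.txt and the output names are placeholders; note that -file needs a file on disk, so it won't work directly on the zip stream):

```powershell
# Sample input; in practice this is the decompressed text file
Set-Content 'input.txt' @('001 first', '002 second', '001 third')

# Dispatch each line on its three-character prefix via regex
switch -regex -file 'input.txt' {
    '^001' { Add-Content 'output001.txt' $_ }
    '^002' { Add-Content 'output002.txt' $_ }
    '^003' { Add-Content 'output003.txt' $_ }
}
```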