Hello - I’m very new to PS. I have written a small script which reads a huge CSV (up to about 350MB and up to 8 million lines), looks at the 3rd field of each line, and compares that field with every entry in a reference CSV (which currently has only about 20 lines). If a reference entry matches the 3rd field, the field is replaced with whatever is in that reference entry - a way of truncating the 3rd field based on entries in a ref file. It then stops searching the ref file and moves on to the next line of the original CSV. Finally it writes a new CSV with a different extension.
Running this with small test files, it works great. However, I’ve just tried it on a 280MB CSV and my PC is currently going crazy. Memory usage is up to 10.6GB, CPU nice and low at 4.2%, but it is taking hours to process this file. After running for about 40 minutes, it has processed about 2MB of the 280MB file. I have another 27 similarly large files to get through, so I might be done by Christmas.
Is there something very wrong with the script causing it to run so slowly, and why does it take 10.6GB of memory to read in a 280MB file?
Any help most appreciated.
$files = Get-ChildItem "~\Desktop\Test\*.dat"
foreach ($f in $files) {
    $outfile = $f.FullName + "new"
    $file = Get-Content $f.FullName
    $ref = Get-Content "~\Desktop\Test\Reference\ref.dat"
    $file.split("`n") | foreach { # line in file
        $c = 0
        $outline = $null
        $_ = $_.TrimEnd() # removes any spurious spaces at end of lines
        $_ = $_ + ' '     # adds a single space back on the end
        $_.split('|') | foreach { # field in the line
            $c++
            #$c # for console view only to compare lines
            if ($c -eq 3) { # will perform comparison against every entry in ref until a match found
                $field = $_
                $refhit = $false
                $ref | foreach { # item in the ref array
                    if ($refhit -eq $true) { return }
                    if ($field -like $_) { # replace field with ref field
                        $refhit = $true
                        $outline += $_.TrimEnd('*') + '|'
                    } # end match
                } # end ref
                if (!$refhit) { # pass on the field as is
                    $outline += "$field|"
                }
            } else {
                # if not field 3 then do not perform compare; pass on the field as is
                $outline += "$_|"
            } # end field
        }
        # Output filtered line
        Add-Content $outfile -Value $outline.TrimEnd('|') # output truncated file to newname
        # Write-Host $outline.TrimEnd('|') # output to console for testing purposes
    } # end line
}
Are these actually csv files? If so, why not use Import-Csv and avoid manipulating strings? I’m not sure, but I bet the string manipulation is taking up some time.
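Something like this, for instance - a rough sketch only, since I haven’t seen the data: the '|' delimiter and the made-up H1..H4 header names are assumptions on my part.

```powershell
# Import-Csv parses each line into an object, so the 3rd column becomes a
# named property. Header names here are invented for illustration.
$ref  = Get-Content "~\Desktop\Test\Reference\ref.dat"
$data = Import-Csv $f.FullName -Delimiter '|' -Header H1,H2,H3,H4
foreach ($row in $data) {
    foreach ($r in $ref) {
        if ($row.H3 -like $r) { $row.H3 = $r.TrimEnd('*'); break }
    }
}
$data | ConvertTo-Csv -Delimiter '|' -NoTypeInformation | Set-Content $outfile
```

Note that ConvertTo-Csv quotes the fields on output, so you may need to strip the quotes if whatever consumes the files can’t handle them.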
This is somewhat of a common question. Here is a breakdown of the methods to open files, but you’re still loading a lot into memory with files that large:
There are also three nested loops. At first glance, you could use a Where-Object clause or a hashtable lookup rather than nested loops with if statements to find a match.
$files = Get-ChildItem "~\Desktop\Test\*.dat"
$ref = Get-Content "~\Desktop\Test\Reference\ref.dat"
foreach ($f in $files) {
    $outfile = $f.FullName + "new"
    $file = Get-Content $f.FullName
    $fileSplit = $file.split("`n")
    $file = $null
    [System.Collections.ArrayList]$outArray = @()
    $countOutFile = 1
    foreach ($line in $fileSplit) { # line in file
        $c = 0
        $outline = $null
        $line = $line.TrimEnd() # removes any spurious spaces at end of lines
        $line = $line + ' '     # adds a single space back on the end
        $splitLine = $line.split('|')
        foreach ($sl in $splitLine) { # field in the line
            $c++
            #$c # for console view only to compare lines
            $field = $sl
            $refhit = $false
            if ($c -eq 3) {
                # will perform comparison against every entry in ref until a match found
                $ref | foreach { # item in the ref array
                    if ($refhit -eq $true) { return }
                    if ($field -like $_) {
                        # replace field with ref field
                        $refhit = $true
                        $outline += $_.TrimEnd('*') + '|'
                    } # end match
                } # end ref
            }
            if (!$refhit) {
                # pass on the field as is
                $outline += "$field|"
            }
        }
        [void]($outArray.Add($outline.TrimEnd('|')))
        if ($countOutFile % 200 -eq 0) { # flush the buffered lines every 200 records
            Add-Content $outfile -Value $outArray # output truncated file to newname
            $outArray = @()
        }
        $countOutFile++
        # Output filtered line
        # Write-Host $outline.TrimEnd('|') # output to console for testing purposes
    } # end line
    # dump last content left behind
    Add-Content $outfile -Value $outArray # output truncated file to newname
    $outArray = @()
}
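To illustrate the hashtable idea - a rough sketch only, since a hashtable key lookup is exact and cannot emulate -like wildcard matching. It assumes a hypothetical ref format where each line is an exact "fullvalue=truncatedvalue" pair:

```powershell
# Build the lookup once: key = the raw field value, value = its replacement.
# The "fullvalue=truncatedvalue" ref format is an assumption for illustration.
$lookup = @{}
foreach ($r in (Get-Content "~\Desktop\Test\Reference\ref.dat")) {
    $full, $short = $r.Split('=')
    $lookup[$full] = $short
}

# Then, per field, a single O(1) lookup replaces the inner loop over $ref:
if ($lookup.ContainsKey($field)) { $outline += $lookup[$field] + '|' }
else                             { $outline += "$field|" }
```

With 8 million lines, replacing a 20-iteration wildcard scan per line with one hash lookup adds up quickly.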
Wow! I just added the -Raw to the Get-Content line and it is ploughing through at an incredible rate!! Memory consumption down to 12% and it has already processed 10MB in about 4 minutes. Looks like that might be all that is needed.
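For reference, the change was just on the Get-Content line - something like:

```powershell
# -Raw returns the whole file as one string instead of an array of
# per-line string objects, which is where the memory overhead was going.
$file = Get-Content $f.FullName -Raw
$fileSplit = $file.split("`n")
```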
Hi, many thanks for looking at this, but it doesn’t seem to have made anything quicker. Adding -Raw has made the biggest difference so far, but it is still slower than I can really live with.
Personally, I’d scale things back. Just concentrate on loading the file and processing each line to see how fast you can get that to work. Then you can determine the impact that additional parsing is taking. This seems expensive:
$c = 0
$outline = $null
$line = $line.TrimEnd() # removes any spurious spaces at end of lines
$line = $line + ' '     # adds a single space back on the end
$splitLine = $line.split('|')
foreach ($sl in $splitLine) { # field in the line
    $c++
    #$c # for console view only to compare lines
    $field = $sl
    $refhit = $false
    if ($c -eq 3) {
        # will perform comparison against every entry in ref until a match found
        $ref | foreach { # item in the ref array
            if ($refhit -eq $true) { return }
            if ($field -like $_) {
                # replace field with ref field
                $refhit = $true
                $outline += $_.TrimEnd('*') + '|'
            } # end match
        } # end ref
    }
    if (!$refhit) {
        # pass on the field as is
        $outline += "$field|"
    }
}
[void]($outArray.Add($outline.TrimEnd('|')))
A faster approach might be to use a regex to match what you are looking for, rather than splitting the line and then processing the columns in another loop.
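Roughly like this - a sketch that keeps the -like matching from the original and only uses the regex to isolate the 3rd field, instead of splitting every line into an array of fields:

```powershell
# Captures: (1) fields 1-2 with their delimiters, (2) field 3, (3) the rest.
$pattern = '^((?:[^|]*\|){2})([^|]*)(.*)$'
foreach ($line in $fileSplit) {
    if ($line -match $pattern) {
        $field3 = $Matches[2]
        foreach ($r in $ref) {
            if ($field3 -like $r) { $field3 = $r.TrimEnd('*'); break }
        }
        $outline = $Matches[1] + $field3 + $Matches[3]
    } else {
        $outline = $line # fewer than 3 fields; pass the line through untouched
    }
}
```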
Agree on scaling things back. I had a similar scenario two or three years ago, with around 70-80GB of data spread across files ranging from 450 to 600MB (around 1M lines each).
I remember on the first trial, after more than 8 hours, processing was still nowhere near done. With the final version I was able to process everything in 20-30 minutes.
The issue was not loading the files but the memory overhead of juggling the arrays in memory (no matter what, even a VM in Azure with 56GB wasn’t enough).
That’s why, as a first step, I suggested modifying the approach to dump the array to file every 2000 records.
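Along the same lines, Add-Content opens and closes the file on every call; keeping a single System.IO.StreamWriter open for the whole file avoids that overhead. A minimal sketch (PowerShell 5+):

```powershell
# One writer per output file, held open for the whole run.
$writer = [System.IO.StreamWriter]::new($outfile)
try {
    foreach ($line in $fileSplit) {
        # ... build $outline exactly as before ...
        $writer.WriteLine($outline.TrimEnd('|'))
    }
}
finally {
    $writer.Close() # flush the buffer and release the file handle
}
```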
It has been my experience that Get-Content automatically splits the file into lines, so you should not have to use Split at all.
Hi, I must be doing something wrong - I can’t see these exact lines in my script. I think I know what you are driving at, but when I try it, it seems to break the script.
As I’m only interested in modifying the 3rd column in these files, is there a way I can read in only this column, do what I need to do, and then merge it back into the original?
I looked into regex, but for a novice it is beyond my comprehension. It’s taken me days to get this match/replace working.
[quote quote=270577]Sorry for not being clear. My comments were in reference to the script prazoninstrum posted.
To summarize, I don’t see why you need to split on the newline char:
$file.split("`n")
Not sure if it will save you any time removing this. When I parse CSV files I simply use:
$file = Get-Content -Path '\…….whatever'
foreach($line in $file) ……
Again, sorry for not being clear.
[/quote]
Hi, no prob. I took out the $file.split("`n") and then used foreach ($line in $file) { …, referring to $line instead of the old $_. The script runs but does nothing at all - no errors, but nothing output to the console.