Small Script - Massive Memory consumption, low CPU but takes hours to complete??

Hello - I’m very new to PS. I have written a small script which looks at a huge CSV (Anything up to about 350MB with anything up to 8 million lines) then looks at the 3rd field in each line, compares this field with every entry in a reference CSV (which currently only has about 20 lines in it) and if the reference line matches this 3rd field, it replaces the field with whatever is in the reference file entry. A way of truncating the 3rd field based on entries in a ref file. It then stops looking through the ref file and continues through each line of the original CSV. It then makes a new CSV with a different extension.

Running this with small test files, it works great. However, I’ve just tried it on a 280MB csv and my PC is currently going crazy. Memory usage is up to 10.6GB, CPU nice and low on 4.2%, but it is taking hours to process this file. Running for about 40 minutes, it has process about 2MB of the 280MB file. I have another 27 similarly large files to get through. So might be done by Christmas.

Is there something very wrong with the script causing it to run slow slowly and why does it take 10.6GB of memory to read in a 280MB file???

Any help most appreciated.

$files = Get-ChildItem "~\Desktop\Test\*.dat"
foreach ($f in $files){
$outfile = $f.FullName + "new"

$file = Get-Content $f.FullName
$ref = Get-Content "~\Desktop\Test\Reference\ref.dat"

$file.split("`n") | foreach {# line in file
$c = 0
$outline = $nul
$_ = $_.TrimEnd() #removes any spurious spaces at end of lines
$_ = $_ + ' ' # adds a single space back on the end
$_.split('|') | foreach {# field in the line
$c++
#$c # for console view only to compare lines
If ($c -eq '3') { # will perform comparison against every entry in ref until a match found

$field = $_
$refhit = $false
$ref | foreach {# item in the ref array
if ($refhit -eq $true) {return}
if ($field -like $_) {# replace field with ref field
$refhit = $true
$outline += $_.TrimEnd('*') + '|'

}# end match

}# end ref

if (!$refhit){#pass on the field as is
$outline += "$field|"
}

}else {#end of field - start of pass through



# if not field 3 then do not perform compare
$field = $_
$refhit = $false


if (!$refhit){#pass on the field as is
$outline += "$field|"

}

}#end field
}

# Output filtered line
Add-Content $outfile -Value $outline.TrimEnd('|') #output truncated file to newname
# write-host $outline.TrimEnd('|') #output to console for testing purposes


} #end line

}

Are these actually csv files? If so, why not use Import-Csv and avoid manipulating strings? I’m not sure, but I bet the string manipulation is taking up some time.

This is somewhat of a common question. Here is a breakdown of the methods to open files, but you’re still loading a lot into memory with files that large:

http://www.happysysadm.com/2014/10/reading-large-text-files-with-powershell.html

There are also 3 for loops. At first glance, you could be doing a Where clause or HashTable lookup rather than adding for loops with if statements to get a match.

Hi, yes they are CSV files although pipe delimited. I can easily convert to , del with F.A.R.T

Thanks for the reply. I will look at that article. As I say, I’m just new to all of this so just feeling my way.

I have plenty of CPU power and memory available but that doesn’t help if it still takes weeks to process the data

See if this is quicker

$files = Get-ChildItem “~\Desktop\Test\*.dat”
$ref = Get-Content “~\Desktop\Test\Reference\ref.dat”
foreach ($f in $files) {
    $outfile = $f.FullName + “new”
    $file = Get-Content $f.FullName
    $fileSplit = $file.split(“`n”)
    $file = $null
    [System.Collections.ArrayList]$outArray = @()
    $countOutFile = 1
    foreach ($line in $fileSplit){
        # line in file
        $c = 0
        $outline = $null
        $line = $line.TrimEnd() #removes any spurious spaces at end of lines
        $line = $line + ‘ ‘ # adds a single space back on the end
        $splitLine = $line.split(‘|’)
        foreach ($sl in $splitLine) { # field in the line
            $c++
            #$c # for console view only to compare lines
            $field = $sl
            $refhit = $false
            If ($c -eq ‘3’) {
                # will perform comparison against every entry in ref until a match found
                $ref | foreach { # item in the ref array
                    if ($refhit -eq $true) { return }
                    if ($field -like $_) {
                        # replace field with ref field
                        $refhit = $true
                        $outline += $_.TrimEnd(‘*’) + ‘|’
                    }# end match
                }# end ref
            }
            if (!$refhit) {
                #pass on the field as is
                $outline += “$field|”
            }
        }
        [void]($outArray.Add($outline.TrimEnd(‘|’)))

        if ($countOutFile % 200){
            Add-Content $outfile -Value $outArray #output truncated file to newname
            $outArray=@()
        }
       $countOutFile++
        # Output filtered line
        # write-host $outline.TrimEnd(‘|’) #output to console for testing purposes
    } #end line

    #dump last content left behind
    Add-Content $outfile -Value $outArray #output truncated file to newname
    $outArray=@()
}

Wow! I just added the -Raw to the Get-Content line and it is ploughing through at an incredible rate!! Memory consumption down to 12% and it has already processed 10MB in about 4 minutes. Looks like that might be all that is needed.

Hi, many thanks for looking at this, but doesn;t seem to have made anything any quicker. Adding -Raw so far has made the biggest difference, but it is still slower than I can really do with

Try removing these lines:

$fileSplit = $file.split(`n)
$file = $null

And changing this line:

foreach ($line in $fileSplit)

 

To this:

foreach ($line in $file)

It has been my experience that PS automatically parses each line using Get-Content so you should not have to use Split at all.

is there any change you can share, anonymized possibly, samples of the files?

a onedrive link with either the ref and real file.

Maybe on the real file, just strip all out and keep the first 10 lines.

just to have an idea of the formats.

Please make sure there are no sensitive or private data

Personally, I’d scale things back. Just concentrate on loading the file and processing each line to see how fast you can get that to work. Then you can determine the impact that additional parsing is taking. This seems expensive:

   $c = 0
        $outline = $null
        $line = $line.TrimEnd() #removes any spurious spaces at end of lines
        $line = $line + ‘ ‘ # adds a single space back on the end
        $splitLine = $line.split(‘|’)
        foreach ($sl in $splitLine) { # field in the line
            $c++
            #$c # for console view only to compare lines
            $field = $sl
            $refhit = $false
            If ($c -eq ‘3’) {
                # will perform comparison against every entry in ref until a match found
                $ref | foreach { # item in the ref array
                    if ($refhit -eq $true) { return }
                    if ($field -like $_) {
                        # replace field with ref field
                        $refhit = $true
                        $outline += $_.TrimEnd(‘*’) + ‘|’
                    }# end match
                }# end ref
            }
            if (!$refhit) {
                #pass on the field as is
                $outline += “$field|”
            }
        }
        [void]($outArray.Add($outline.TrimEnd(‘|’)))

A faster approach might be using regex to match what you are looking for than splitting the line and then processing the columns in another loop.

Agree on the scale down, i had a similar scenario a couple or 3 years ago, with kinda 70-80Gb of data in file spawning from 450 to 600Mb (around 1M lines each).

I remember on the first trial, after more than 8h, processing was still in deep sea. With the final version i was able to process everything in 20-30 minutes.

issue was not loading the files, but the memory tradeoff playing with the arrays in memory (and no matter the matter, even a vm in azure with 56gb wasn`t enough).

That`s as first step i suggested to mod approach in dumping the array to file every 2000 records.

 

 

[quote quote=270487]Try removing these lines:

$fileSplit = $file.split(`n)

$file = $null

And changing this line:

foreach ($line in $fileSplit)

To this:

foreach ($line in $file)

It has been my experience that PS automatically parses each line using Get-Content so you should not have to use Split at all.

[/quote]

Hi, I must be doing something wrong - I can’t see theses exact lines in my script. I think I know what you are driving at but when it seems to break the script

 

Thanks for looking though

Hi, yes - I’ll see if I can organise a sample file to send across. Many thanks

As I’m only interested in modifying the 3rd column in these files, is there a way I can only read in this column, do what I need to do and then merge it back into the original?

I looked into REGEX but for a novice, it is beyond my comprehension. It’s taken me days to get this match/replace to work.

 

Many thanks all for looking. I’ll stick at it!

 

Kev

[quote quote=270445]See if this is quicker

PowerShell
[/quote] Hi. just did some more testing on a large file and this is considerably slower than my 1st script.

But simply using the -Raw has enabled me to process a 280MB file in around 80 minutes.

seems still too much anyway.

if you don`t want to post the sample publicly, you can send them to me here

$mail = “cAByAGEAegBvAG4AaQBuAHMAdAByAHUAbQBAAGMAbABvAHUAZABtAGUAdABhAHYAZQByAHMAZQAuAGMAbwBtAA==”
[System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($mail))

need the ref, and sample of file to process

Sorry for not being clear. My comments were in reference to the script prazoninstrum posted.

To summarize, I dont see why you need to split on the new line char:

$file.split("`n")

Not sure if it will save you any time removing this. When I parse CSV files I simply use:

$file = Get-Content -Path ‘\…whatever’

foreach($line in $file) …

Again, sorry for not being clear.

[quote quote=270577]Sorry for not being clear. My comments were in reference to the script prazoninstrum posted.

To summarize, I dont see why you need to split on the new line char:

$file.split(“`n”)

Not sure if it will save you any time removing this. When I parse CSV files I simply use:

$file = Get-Content -Path ‘\…….whatever’

foreach($line in $file) ……

Again, sorry for not being clear.

[/quote]
Hi, no prob. I took out the $file.split(“`n”) and then used foreach($line in $file){ … referring to $line instead of the old $_ and the script runs but does nothing at all. No errors but nothing output to console

Kev

Hmm … I believe you have gone back to your original script which would be:

$file | foreach { …

I was commenting based on the code provided by prazoninstrum

Again, not sure removing the split will help performance.