Remove duplicates from csv taking too long

Hi

I am trying to remove duplicates from a CSV file. The input file is about 5 MB and contains 1.5 lakh records (rows). I am using:

$data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
$data | Group-Object SourceIPAddress | ForEach-Object { $_.Group | Select-Object -First 1 } |
    Export-Csv "C:\Users\biswajeet.kumar\Documents\test\report2.csv" -NoTypeInformation -Append

But this is taking approx. 5 min to complete, and I have to do this for 100s of CSVs. I can not sort the data. Is there any other way to speed up this process?

Biswajeet, welcome to Powershell.org. Please take a moment and read the very first post at the top of the list of this forum: Read Me Before Posting! You’ll be Glad You Did!.

When you post code, error messages, sample data, or console output, please format it as code using the code tags (“PRE”). Thanks in advance.

When you cross-post the same question to different forums at the same time, you should at least post links to the other forums along with your question, so that people willing to help you do not end up doing the same work twice or more.

https://social.technet.microsoft.com/Forums/en-US/e731fbe1-0005-4f44-aece-d411387e2f56/powershell-script-taking-too-much-time?forum=ITCG

The code you posted here looks different from the code you posted there!?

As file system operations are usually among the most “expensive” things you can do, you should avoid writing to a file again and again. It’s better to collect the results in a variable and write them in one big chunk … like this:

$data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
$Result = $data | Group-Object SourceIPAddress | ForEach-Object {
    $_.Group | Select-Object -First 1
}
Export-Csv -InputObject $Result -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Append

Thanks Olaf,
that’s great advice. I will use it in my later scripts as well.
Coming to the script: it does not reduce the time at all, and reducing the time is what I am looking for.
Also it gave me output like Count, Length, LongLength, something like that…

OK. Could you please post a few sanitized but still representative lines of your CSV file? Format it as code as well, please.

SourceIPAddress  DestinationProtocol  DestinationPort  TimeStamp
172.28.2.2       tcp                  88               Sun Aug 11 22:39:08 2019
10.28.25.23      tcp                  88               Sun Aug 11 22:40:08 2019
172.28.2.2       tcp                  88               Sun Mar 11 22:44:08 2020
10.28.25.18      tcp                  88               Sun Mar 11 22:46:08 2020
10.28.25.23      tcp                  88               Sun Mar 11 22:49:08 2020

This is what the input CSV looks like.
I am sorry if I am making mistakes while posting; I just joined the forum.

There are no duplicate source ip addresses in your sample data.

Just edited it. I cannot post the original data (it is confidential). The original CSV contains more than 1 lakh records, though.

I did not say to post the original data, and I did not say to post all of it. BTW: what is “1 lakh”???

Never mind … by the way, -InputObject hands the whole array to Export-Csv as one single object, which is why you got the Count/Length/LongLength output. Try this:

$Result = $Data  | 
    Group-Object SourceIPAddress | 
        Foreach-Object {
            $_.Group | 
                Select-Object -First 1
            } 
$Result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation

I can say it takes about 30 sec less now. Earlier it was 5 min; now it is taking 4 min 30 sec.

OK, I ask again: what is “1 lakh”???

OK. Let’s try to measure where you lose the time …

$Measure = Measure-Command -Expression {
    $data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
}
"import data '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $groupedData = $data | Group-Object SourceIPAddress
}
"group Data '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $result = $groupedData | ForEach-Object {
        $_.Group | Select-Object -First 1
    }
}
"pick first / create result '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation -Force
}
"write result '$($Measure.TotalSeconds)'"

Please post the output of this code. Thanks

There are 5 rows in the sample I pasted above. The actual CSV file has more than 1 lakh rows like this, almost 1.3 lakh. And I need to run this script for 100 CSV files like this.
So I need to reduce the execution time.

import data '3.4784568'
group Data  '240.3222332'
pick first / create result  '13.3179946'
write result  '0.7016121'

Here is the output

I will ask this only once more: What is “1 lakh”?

As you can see, the Group-Object command obviously consumes most of the time. Either you find another/faster solution for this, or you have to live with it.

You could try to speed up your general task by running some steps in parallel for your 100 files. :wink:
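If it helps, here is a rough sketch of that idea (untested against your data; it assumes PowerShell 7+ for ForEach-Object -Parallel, and the in/out folder paths are only placeholders):

# Sketch only: dedupe many CSV files in parallel (requires PowerShell 7+).
# $inFolder and $outFolder are hypothetical placeholder paths.
$inFolder  = 'C:\Users\biswajeet.kumar\Documents\test\in'
$outFolder = 'C:\Users\biswajeet.kumar\Documents\test\out'
Get-ChildItem -Path $inFolder -Filter *.csv |
    ForEach-Object -Parallel {
        $name = $_.Name
        # same per-file logic as before, one file per runspace
        Import-Csv -Path $_.FullName |
            Group-Object SourceIPAddress |
            ForEach-Object { $_.Group | Select-Object -First 1 } |
            Export-Csv -Path (Join-Path $using:outFolder $name) -NoTypeInformation
    } -ThrottleLimit 4

On Windows PowerShell 5.1 you would need Start-Job or runspace pools instead, since ForEach-Object -Parallel is not available there.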

Yes, I will keep searching; if I find nothing, then this is it.
Thanks for your support, I really appreciate it.

I created some test data myself and tested a little bit. The fastest I got without any tricks was with this code:

$Measure = Measure-Command -Expression {
    $data = Import-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\a.csv' |
            Sort-Object -Property SourceIPAddress -Unique
    $data | Export-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Force
}
"total seconds '$($Measure.TotalSeconds)'"

This is a very interesting post, and I currently have the same but MUCH larger issue, which is why I came searching. I found three ways to get unique values:

Get-Unique
Select-Object -Unique
Sort-Object -Unique

I have been using Select-Object -Unique, and it is taking me 55 minutes on that single command:

3321833.7962 milliseconds.

I’m going to try Sort-Object -Unique, but I was wondering, Olaf, if you tested Get-Unique. I know there are issues with the top two ways of doing this with case sensitivity and such, but this should not be an issue in my data because it is all collected by getting AD objects, which would be 100% the same. Basically, I’m collecting a bunch of members of groups, adding them together, and running the -Unique command. I will report back on the sort, but I was curious whether you tested Get-Unique.

Also, when using Sort-Object, is it best to define a property to sort on, as you are doing? I really do not know how the command works if one is not defined. I currently do not define a property when using Select-Object either.

EDIT START:
Switching to

Sort-Object -Unique

from

Select-Object -Unique

took
21552.1296 milliseconds! (less than half a minute, down from over 55 minutes)
I will now test to see if it is even faster with a property defined.
EDIT STOP

Jamie,

I think that depends pretty much on your source data. For the data we have here, an array of objects with more than one property, “Select-Object -Unique” and “Get-Unique” do not work.
As I wrote above, I created test data with about 1.5 million elements. When I run the code I posted above, it takes about 55 seconds to read the CSV file, sort the data, remove the duplicates, and write everything to a new CSV file. The number of elements returned is 260 less than in the source data.

When I change “Sort-Object” to “Select-Object” or “Get-Unique”, the number of elements returned is 1; more exactly, I get only the first element back.
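A side note on Get-Unique, since you asked: it compares each item only to the item right next to it, so it removes adjacent duplicates only and expects its input to be sorted already. A quick illustration with plain strings:

# Get-Unique only collapses neighbouring duplicates:
'b','a','b' | Get-Unique                 # b, a, b  (no two neighbours match)
'b','a','b' | Sort-Object | Get-Unique   # a, b

That is part of why Sort-Object -Unique tends to be the more convenient option.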

You could simply keep track of it yourself.

Given this source CSV

SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"

Keep track of the property you want; it seems like that is the source IP.

$tracker = @()
$data = Import-Csv C:\temp\source.csv
$data | ForEach-Object {
    if($_.sourceipaddress -notin $tracker.sourceIP){
        $tracker += @{SourceIP=$_.SourceIPAddress}
        $_
    }
} | export-csv c:\temp\noduplicates.csv -NoTypeInformation

But for speed on large datasets, replace the array @() with an ArrayList.

$tracker = New-Object System.Collections.ArrayList
$data = Import-Csv C:\temp\source.csv
$data | ForEach-Object {
    if($_.sourceipaddress -notin $tracker.sourceIP){
        [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
        $_
    }
} | export-csv c:\temp\noduplicates.csv -NoTypeInformation

Both create this output

"SourceIPAddress","DestinationProtocol","DestinationPort","TimeStamp"
"10.28.25.18","tcp","88","Sun Mar 11 22:46:08 2020"
"172.28.2.2","tcp","88","Sun Aug 11 22:39:08 2019"
"10.28.25.23","tcp","88","Sun Aug 11 22:40:08 2019"

I hope this helps.

I had some fun testing. Given this 10 line source CSV

SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
10.28.25.13,tcp,88,"Sun Mar 11 21:41:03 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.23,tcp,88,"Sun Mar 11 22:49:08 2020"

I created a 150,000 line CSV

1..15000 | ForEach-Object { $data } | Export-Csv c:\temp\largecsv.csv -NoTypeInformation

Confirmed it (and of course I opened it too).

Import-Csv C:\Temp\largecsv.csv | Measure-Object | Select-Object @{n='Lines';e={$_.Count}}

Lines
-----
150000

Then ran it through the arraylist tracker

$Measure = Measure-Command -Expression {
    $tracker = New-Object System.Collections.ArrayList
    Import-Csv C:\temp\largecsv.csv | ForEach-Object {
        if($_.sourceipaddress -notin $tracker.sourceIP){
            [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
            $_
        }
    } | export-csv c:\temp\largenoduplicates.csv -NoTypeInformation
}
"total seconds '$($Measure.TotalSeconds)'"

total seconds '8.6225727'

8.6 seconds, nice. And the output

Import-Csv C:\Temp\largenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019

Just to see, I ran it up to 1 million. That’s a 50MB CSV!

total seconds '57.1582381'

Import-Csv C:\Temp\hugenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019

57 seconds for a million, and the output is still the same.