Remove duplicates from csv taking too long

Hi

I am trying to remove duplicates from a CSV file. The input file is about 5 MB and contains 1.5 lakh records (rows). I am using:

$data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
$data | Group-Object SourceIPAddress | ForEach-Object { $_.Group | Select-Object -First 1 } |
    Export-Csv "C:\Users\biswajeet.kumar\Documents\test\report2.csv" -NoTypeInformation -Append

But this is taking approx. 5 min to complete, and I have to do this for 100s of CSVs. I can not sort the data. Is there any other way to speed up this process?

Biswajeet, welcome to Powershell.org. Please take a moment and read the very first post at the top of the list of this forum: Read Me Before Posting! You’ll be Glad You Did!.

When you post code, error messages, sample data, or console output, please format it as code using the code tags (“PRE”). Thanks in advance.

When you cross-post the same question to different forums at the same time, you should at least post links to the other forums along with your question, so that people willing to help you do not end up doing the same work twice or more.

https://social.technet.microsoft.com/Forums/en-US/e731fbe1-0005-4f44-aece-d411387e2f56/powershell-script-taking-too-much-time?forum=ITCG

The code you posted here looks different from the code you posted there!?

As file system operations are usually among the most “expensive” things you can do, you should avoid writing to a file again and again. It’s better to collect the results in a variable and write them in one big chunk … like this:

$data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
$Result = $data | Group-Object SourceIPAddress | ForEach-Object {
    $_.Group | Select-Object -First 1
}
Export-Csv -InputObject $Result -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Append

Thanks Olaf,
that’s great advice. I will use it in my later scripts as well.
Coming to the script: it does not reduce the time at all, and reducing the time is what I am looking for.
Also it gave me output like Count, Length, LongLength, something like that…

OK. Could you please post a few sanitized but still representative lines of your CSV file? Format it as code as well, please.

SourceIPAddress  DestinationProtocol  DestinationPort  TimeStamp
172.28.2.2       tcp                  88               Sun Aug 11 22:39:08 2019
10.28.25.23      tcp                  88               Sun Aug 11 22:40:08 2019
172.28.2.2       tcp                  88               Sun Mar 11 22:44:08 2020
10.28.25.18      tcp                  88               Sun Mar 11 22:46:08 2020
10.28.25.23      tcp                  88               Sun Mar 11 22:49:08 2020

This is what the input CSV looks like.
I am sorry if I am making mistakes while posting; I just joined the forum.

There are no duplicate source ip addresses in your sample data.

Just edited it. I cannot post the original data (it is confidential). The original CSV contains more than 1 lakh records, though.

I did not say to post the original data, and I did not say to post all of it. BTW: what is “1 lakh”???

Never mind … by the way, -InputObject hands the whole array to Export-Csv as one single object, which is why you got the Count/Length/LongLength output. Try this:

$Result = $Data  | 
    Group-Object SourceIPAddress | 
        Foreach-Object {
            $_.Group | 
                Select-Object -First 1
            } 
$Result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation

I can say it takes about 30 sec less now. Earlier it was 5 min; now it is taking 4 min 30 sec.

OK, I ask again: what is “1 lakh”???

OK. Let’s try to measure where you lose the time …

$Measure = Measure-Command -Expression {
    $data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
}
"import data '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $groupedData = $data | Group-Object SourceIPAddress
}
"group Data '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $result = $groupedData | ForEach-Object {
        $_.Group | Select-Object -First 1
    }
}
"pick first / create result '$($Measure.TotalSeconds)'"

$Measure = Measure-Command -Expression {
    $result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation -Force
}
"write result '$($Measure.TotalSeconds)'"

Please post the output of this code. Thanks

There are 5 rows in the sample I pasted above. The actual CSV file has more than 1 lakh rows like this, almost 1.3 lakh. And I need to run this script for 100 CSV files like this.
So I need to reduce the execution time.

import data '3.4784568'
group Data  '240.3222332'
pick first / create result  '13.3179946'
write result  '0.7016121'

Here is the output

I will ask this only once more: What is “1 lakh”?

As you can see, the Group-Object command obviously consumes most of the time. Either you find another/faster solution for this, or you have to live with it.

You could try to speed up your general task by running some steps in parallel for your 100 files. :wink:
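If it helps, here is a rough sketch of that idea (untested against your data; it assumes PowerShell 7+ for ForEach-Object -Parallel, and the in/out folder paths are only placeholders):

# Sketch only: dedupe many CSV files in parallel (requires PowerShell 7+).
# $inFolder and $outFolder are hypothetical placeholder paths.
$inFolder  = 'C:\Users\biswajeet.kumar\Documents\test\in'
$outFolder = 'C:\Users\biswajeet.kumar\Documents\test\out'
Get-ChildItem -Path $inFolder -Filter *.csv |
    ForEach-Object -Parallel {
        $name = $_.Name
        # same per-file logic as before, one file per runspace
        Import-Csv -Path $_.FullName |
            Group-Object SourceIPAddress |
            ForEach-Object { $_.Group | Select-Object -First 1 } |
            Export-Csv -Path (Join-Path $using:outFolder $name) -NoTypeInformation
    } -ThrottleLimit 4

On Windows PowerShell 5.1 you would need Start-Job or runspace pools instead, since ForEach-Object -Parallel is not available there.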

Yes, I will keep searching; if I find nothing, then this is it.
Thanks for your support, I really appreciate it.

I created some test data myself and tested a little bit. The fastest I got without any tricks was with this code:

$Measure = Measure-Command -Expression {
    $data = Import-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\a.csv' |
            Sort-Object -Property SourceIPAddress -Unique
    $data | Export-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Force
}
"total seconds '$($Measure.TotalSeconds)'"

This is a very interesting post, and I currently have the same but MUCH larger issue, which is why I came searching. I found three ways to get unique values:

Get-Unique
Select-Object -Unique
Sort-Object -Unique

I have been using Select-Object -Unique, and it is taking me 55 minutes on that single command:

3321833.7962 milliseconds.

I’m going to try Sort-Object -Unique, but I was wondering, Olaf, if you tested Get-Unique. I know there are issues with the top two ways of doing this with case sensitivity and such, but this should not be an issue in my data because it is all collected by getting AD objects, which would be 100% the same. Basically, I’m collecting a bunch of members of groups, adding them together, and running the -Unique command. I will report back on the sort, but I was curious whether you tested Get-Unique.

Also, when using Sort-Object, is it best to define a property to sort on, as you are doing? I really do not know how the command works if one is not defined. I currently do not define a property when using Select-Object either.

EDIT START:
Switching to

Sort-Object -Unique

from

Select-Object -Unique

took
21552.1296 milliseconds! (less than half a minute, down from over 55 minutes)
I will now test to see if it is even faster with a property defined.
EDIT STOP

Jamie,

I think that depends pretty much on your source data. For the data we have here, an array of objects with more than one property, “Select-Object -Unique” and “Get-Unique” do not work.
As I wrote above, I created test data with about 1.5 million elements. When I run the code I posted above, it takes about 55 seconds to read the CSV file, sort the data, remove the duplicates, and write everything to a new CSV file. The number of elements returned is 260 less than in the source data.

When I change “Sort-Object” to “Select-Object” or “Get-Unique”, the number of elements returned is 1; more exactly, I get only the first element back.
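A side note on Get-Unique, since you asked: it compares each item only to the item right next to it, so it removes adjacent duplicates only and expects its input to be sorted already. A quick illustration with plain strings:

# Get-Unique only collapses neighbouring duplicates:
'b','a','b' | Get-Unique                 # b, a, b  (no two neighbours match)
'b','a','b' | Sort-Object | Get-Unique   # a, b

That is part of why Sort-Object -Unique tends to be the more convenient option.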

You could simply keep track of it yourself.

Given this source CSV

SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"

Keep track of the property you want; it seems like that is the source IP.

$tracker = @()
$data = Import-Csv C:\temp\source.csv
$data | ForEach-Object {
    if($_.sourceipaddress -notin $tracker.sourceIP){
        $tracker += @{SourceIP=$_.SourceIPAddress}
        $_
    }
} | export-csv c:\temp\noduplicates.csv -NoTypeInformation

But for speed on large datasets, replace the array @() with an ArrayList.

$tracker = New-Object System.Collections.ArrayList
$data = Import-Csv C:\temp\source.csv
$data | ForEach-Object {
    if($_.sourceipaddress -notin $tracker.sourceIP){
        [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
        $_
    }
} | export-csv c:\temp\noduplicates.csv -NoTypeInformation

Both create this output

"SourceIPAddress","DestinationProtocol","DestinationPort","TimeStamp"
"10.28.25.18","tcp","88","Sun Mar 11 22:46:08 2020"
"172.28.2.2","tcp","88","Sun Aug 11 22:39:08 2019"
"10.28.25.23","tcp","88","Sun Aug 11 22:40:08 2019"

I hope this helps.

I had some fun testing. Given this 10 line source CSV

SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
10.28.25.13,tcp,88,"Sun Mar 11 21:41:03 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.23,tcp,88,"Sun Mar 11 22:49:08 2020"

I created a 150,000 line CSV

1..15000 | ForEach-Object { $data } | Export-Csv c:\temp\largecsv.csv -NoTypeInformation

Confirmed it (and of course I opened it too).

Import-Csv C:\Temp\largecsv.csv | Measure-Object | Select-Object @{n='Lines';e={$_.Count}}

Lines
-----
150000

Then ran it through the arraylist tracker

$Measure = Measure-Command -Expression {
    $tracker = New-Object System.Collections.ArrayList
    Import-Csv C:\temp\largecsv.csv | ForEach-Object {
        if($_.sourceipaddress -notin $tracker.sourceIP){
            [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
            $_
        }
    } | export-csv c:\temp\largenoduplicates.csv -NoTypeInformation
}
"total seconds '$($Measure.TotalSeconds)'"

total seconds '8.6225727'

8.6 seconds, nice. And the output

Import-Csv C:\Temp\largenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019

Just to see, I ran it up to 1 million. That’s a 50MB CSV!

total seconds '57.1582381'

Import-Csv C:\Temp\hugenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019

57 seconds for a million, and the output is still the same.