I had some fun testing. Given this 10-line source CSV:

SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
10.28.25.13,tcp,88,"Sun Mar 11 21:41:03 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.23,tcp,88,"Sun Mar 11 22:49:08 2020"
I created a 150,000 line CSV
1..15000 | foreach-object {$data | foreach{$_}}| export-csv c:\temp\largecsv.csv -NoTypeInformation
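The one-liner above relies on `$data` already holding the ten source rows; how it was loaded is not shown here. A minimal, scaled-down sketch of the same step, assuming `$data` comes from CSV text (the here-string, the `samplecsv.csv` file name and the temp-folder path are illustrative, not the ones used for the timings):

```powershell
# Assumption: $data is the parsed source CSV. Two sample rows are inlined here
# so the sketch runs on its own; Import-Csv on the real file would work the same way.
$data = @'
SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
10.28.25.13,tcp,88,"Sun Mar 11 21:41:03 2020"
'@ | ConvertFrom-Csv

# Hypothetical output path in the current user's temp folder
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'samplecsv.csv'

# Emit the rows 100 times and write them out as one CSV
1..100 | ForEach-Object { $data } | Export-Csv $path -NoTypeInformation

# Count the data rows that were written (the header is not counted)
$lineCount = (Import-Csv $path | Measure-Object).Count
```

Raising the repeat count to 15000 with the full ten rows gives the 150,000-line file.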
Confirmed it. (of course I opened it too)
import-csv C:\Temp\largecsv.csv | Measure-Object | select @{n='Lines';e={$_.count}}

 Lines
 -----
150000
Then ran it through the arraylist tracker
$Measure = Measure-Command -Expression {
    $tracker = New-Object System.Collections.ArrayList
    Import-Csv C:\temp\largecsv.csv | ForEach-Object {
        if($_.sourceipaddress -notin $tracker.sourceIP){
            [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
            $_
        }
    } | export-csv c:\temp\largenoduplicates.csv -NoTypeInformation
}
"total seconds '$($Measure.TotalSeconds)'"

total seconds '8.6225727'
8.6 seconds, nice. And the output
Import-Csv C:\Temp\largenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019
Just to see, ran it up to 1 million. That’s a 50MB CSV
total seconds '57.1582381'

Import-Csv C:\Temp\hugenoduplicates.csv

SourceIPAddress DestinationProtocol DestinationPort TimeStamp
--------------- ------------------- --------------- ---------
10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019
57 seconds for a million, output is still the same.
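
The ArrayList tracker keeps the first row seen for each source IP and checks membership with `-notin`. For anyone who wants to experiment, here is a minimal sketch of the same first-occurrence filter using a `HashSet[string]` instead. This is a swapped-in variant, not what I timed above, so I can't say how it compares on those files; with only five unique IPs the difference may well be small. The variable names `$rows`, `$seen` and `$unique` are just for illustration, and `::new()` needs PowerShell 5 or later.

```powershell
# Sample rows inlined so the sketch runs without the CSV files used above
$rows = @'
SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
'@ | ConvertFrom-Csv

# Case-insensitive set, to mirror how -notin compares strings
$seen = [System.Collections.Generic.HashSet[string]]::new(
    [System.StringComparer]::OrdinalIgnoreCase)

# HashSet.Add returns $true only the first time a value is added,
# so each source IP passes through exactly once
$unique = $rows | ForEach-Object {
    if ($seen.Add($_.SourceIPAddress)) { $_ }
}
```

To run it against the real file, replace the here-string with `Import-Csv` and pipe `$unique` to `Export-Csv`.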