Very slow Sort-Object -Unique performance

I have a large volume of data to sort. I’m piping a variable holding about 6.5 million text items to Sort-Object -Unique to eliminate the duplicate values, and it is taking a very long time. What can I do to improve the performance?

Is there a .NET method that I can use? I’m using .NET to stream the data files in a loop and that has dramatically improved the data gathering. It’s just the final sort to make the data unique that is taking forever.
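
For reference, the streaming part of my script looks roughly like this (a simplified sketch; the real paths and parsing are omitted, and the file name below is just a placeholder):

# Simplified sketch of the .NET streaming loop; the path is a placeholder
$reader = [System.IO.StreamReader]::new('C:\data\file1.txt')
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        # ...collect $line into the working set...
    }
}
finally {
    $reader.Dispose()
}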

Any ideas or links would be appreciated.

Thanks.

Not sure how your “variable with about 6.5 million” items is typed, but this section from the linked article might help:

Sorting Performance

The Sort-Object cmdlet is great for sorting complex objects based on properties but might not always be suitable for sorting large data sets of primitive types. The [Array]::Sort() method is useful when sorting numbers, characters or strings.

For this example, I’m using the complete genome of the E. coli bacterium. It’s about 4.4 MB of text. We are sorting each character in the array.

$EColi = Get-Content .\E.coli -Raw

Measure-Command {
    # Pipe the characters to Sort-Object; passing the array via -InputObject
    # would treat it as a single item and return it unsorted
    $Array = $EColi.ToCharArray() | Sort-Object
}

# PowerShell 7.1.4 
# TotalSeconds      : 56.6806371

Measure-Command {
    $Array = $EColi.ToCharArray()
    [Array]::Sort($Array)
}

# PowerShell 7.1.4 
# TotalSeconds      : 0.1773409
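
If it’s uniqueness rather than ordering you’re after, a generic HashSet might also be worth testing. This is just a sketch of mine, not from the article, and it assumes your items are strings ($items is a stand-in for your collection):

# Sketch: deduplicate with a .NET HashSet; $items stands in for the real data
$unique = [System.Collections.Generic.HashSet[string]]::new()
foreach ($item in $items) {
    [void]$unique.Add($item)   # Add() returns $false for a duplicate and skips it
}
# $unique now contains each distinct string exactly once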

Thanks, I hadn’t seen that link. I’ve been using this one: PowerShell scripting performance considerations

I think I can improve performance and make the values unique by iterating through the array and adding each item to a hash table, with both the key and the value set to the item from the array. Hash table keys are unique, so that gives me the distinct values in the end, which is what I’m after.
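
Something along these lines (an untested sketch; $array stands in for my data):

# Untested sketch: hash table keys are unique by definition
$seen = @{}
foreach ($item in $array) {
    $seen[$item] = $item   # key and value both set to the array item
}
# $seen.Keys now holds the distinct values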

Thanks for your input. I’ll update after I get somewhere with this.

Chrissy Lemaire wrote some articles a while ago about working with large CSV files and one of the articles addressed finding duplicates in a large data set. Might be useful:

The article is based on using SQL/ODBC and CSV files. I’m not using CSV, SQL, or ODBC connections.
Thanks, though.

On a side note, the hash table idea is working much better/faster. I found a new bottleneck and have to rewrite the code now. It turns out that when you have 6.5 million values in a variable, getting $variable.Count takes a very long time too!

Instead of a large array that has to be sorted and whose values need to be made unique, I’ve reworked the script to use hash tables. This keeps the values I want to track unique, and by using the value field I kept a count of each one’s occurrences. Two birds, one stone.
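
In sketch form, the core of it is now something like this (simplified; $items stands in for the real data):

# Simplified sketch of the reworked approach: unique keys plus occurrence counts
$tally = @{}
foreach ($item in $items) {
    $tally[$item] = [int]$tally[$item] + 1   # a missing key reads as $null, which casts to 0
}
# $tally.Keys      -> the unique values
# $tally[$value]   -> how many times that value occurred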

I now have more accurate tallies, and the overall script run time went from the original 2 hours to around 20 minutes.