Very slow Sort-Object -Unique performance

I have a large volume of data to sort. I’m piping a variable holding about 6.5 million text items to Sort-Object -Unique to eliminate the duplicate values, and it is taking a very long time. What can I do to improve the performance?

Is there a .NET method that I can use? I’m using .NET to stream the data files in a loop and that has dramatically improved the data gathering. It’s just the final sort to make the data unique that is taking forever.
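
For reference, the streaming part of my script looks roughly like this (a simplified sketch; the real paths and parsing are omitted, and the file name below is just a placeholder):

# Simplified sketch of the .NET streaming loop; the path is a placeholder
$reader = [System.IO.StreamReader]::new('C:\data\file1.txt')
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        # ...collect $line into the working set...
    }
}
finally {
    $reader.Dispose()
}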

Any ideas or links would be appreciated.

Thanks.

Not sure how your “variable with about 6.5 million” items is typed, but this section from the linked article might help:

Sorting Performance

The Sort-Object cmdlet is great for sorting complex objects based on properties but might not always be suitable for sorting large data sets of primitive types. The [Array]::Sort() method is useful when sorting numbers, characters or strings.

For this example, I’m using the complete genome of the E. coli bacterium. It’s about 4.4 MB of text. We are sorting each character in the array.

$EColi = Get-Content .\E.coli -Raw

Measure-Command {
    # Pipe the characters to Sort-Object; passing the array via -InputObject
    # would treat it as a single item and return it unsorted
    $Array = $EColi.ToCharArray() | Sort-Object
}

# PowerShell 7.1.4 
# TotalSeconds      : 56.6806371

Measure-Command {
    $Array = $EColi.ToCharArray()
    [Array]::Sort($Array)
}

# PowerShell 7.1.4 
# TotalSeconds      : 0.1773409
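
If it’s uniqueness rather than ordering you’re after, a generic HashSet might also be worth testing. This is just a sketch of mine, not from the article, and it assumes your items are strings ($items is a stand-in for your collection):

# Sketch: deduplicate with a .NET HashSet; $items stands in for the real data
$unique = [System.Collections.Generic.HashSet[string]]::new()
foreach ($item in $items) {
    [void]$unique.Add($item)   # Add() returns $false for a duplicate and skips it
}
# $unique now contains each distinct string exactly once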

Thanks, I hadn’t seen that link. I’ve been using this one: PowerShell scripting performance considerations

I think I can improve performance and make the values unique by iterating through the array and adding each item to a hash table, with both the key and the value set to the item from the array. Hash table keys are unique, so that gives me the distinct values in the end, which is what I’m after.
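
Something along these lines (an untested sketch; $array stands in for my data):

# Untested sketch: hash table keys are unique by definition
$seen = @{}
foreach ($item in $array) {
    $seen[$item] = $item   # key and value both set to the array item
}
# $seen.Keys now holds the distinct values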

Thanks for your input. I’ll update after I get somewhere with this.

Chrissy Lemaire wrote some articles a while ago about working with large CSV files and one of the articles addressed finding duplicates in a large data set. Might be useful:

The article is based on using SQL/ODBC and CSV files. I’m not using CSV, SQL, or ODBC connections.
Thanks, though.

On a side note, the hash table idea is working much better/faster. I found a new bottleneck and have to rewrite the code now. It turns out that when you have 6.5 million values in a variable, getting $variable.Count takes a very long time too!

Instead of a large array that has to be sorted and whose values need to be made unique, I’ve reworked the script to use hash tables. This keeps the values I want to track unique, and by using the value field I kept a count of each one’s occurrences. Two birds, one stone.
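
In sketch form, the core of it is now something like this (simplified; $items stands in for the real data):

# Simplified sketch of the reworked approach: unique keys plus occurrence counts
$tally = @{}
foreach ($item in $items) {
    $tally[$item] = [int]$tally[$item] + 1   # a missing key reads as $null, which casts to 0
}
# $tally.Keys      -> the unique values
# $tally[$value]   -> how many times that value occurred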

I now have more accurate tallies, and the overall script run time went from the original 2 hours to around 20 minutes.