I’ve got a script that frequently creates hashtables from collections. I used to do this “by hand” until I realized that Group-Object already provides this functionality through its -AsHashTable parameter. I’ve replaced some of the “by hand” code with Group-Object calls, and I’ve realized that the script is no longer as fast.
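To be concrete, here’s roughly what the two methods look like (a sketch, not the exact code from my script; $collection stands in for whatever collection is being grouped):

# "By hand": bucket each item under its key, keeping a list per key
$hash = @{}
foreach ($item in $collection) {
    if (-not $hash.ContainsKey($item.Num)) {
        $hash[$item.Num] = New-Object 'System.Collections.Generic.List[object]'
    }
    $hash[$item.Num].Add($item)
}

# The same result via Group-Object: each key maps to the collection
# of items that share it
$hash = $collection | Group-Object -Property Num -AsHashTable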
My question is: am I using the cmdlet wrong and causing this performance hit? And if I’m not, how could a call to Group-Object (a cmdlet presumably written in C#) be slower than the “by hand” PowerShell code?
I’ve already done a bit of investigating myself by writing a script that creates hashtables using both methods (“by hand” and Group-Object) and timing each. I’ve found that Group-Object is only slightly slower when the number of keys in the hashtable is around 5,000 or lower. However, once you get to something like 10,000 keys, the difference in performance is staggering and Group-Object takes far longer.
The lowdown on the Gist script:
Just dot source it and run Compare-HashCreation with the required params
This will create a list of 50,000 tuples of the form (Num, “foobar”), where Num is a random number from 0-999. It will then create two hashtables via both methods, using the tuple’s Num property for the hashtable keys.
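In other words, something along these lines (a rough sketch; the Gist has the real code and parameters):

# Build the test data described above: 50,000 (Num, "foobar") tuples,
# with Num drawn at random from 0-999
$data = foreach ($i in 1..50000) {
    [pscustomobject]@{ Num = Get-Random -Minimum 0 -Maximum 1000; Str = 'foobar' }
}

# Time the "by hand" method
$byHand = Measure-Command {
    $hash = @{}
    foreach ($t in $data) {
        if (-not $hash.ContainsKey($t.Num)) {
            $hash[$t.Num] = New-Object 'System.Collections.Generic.List[object]'
        }
        $hash[$t.Num].Add($t)
    }
}

# Time the Group-Object method
$grouped = Measure-Command {
    $hash2 = $data | Group-Object -Property Num -AsHashTable
}

'By hand:      {0:n0} ms' -f $byHand.TotalMilliseconds
'Group-Object: {0:n0} ms' -f $grouped.TotalMilliseconds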
I apologize; your account is flagged in the global WordPress system as a spam originator, and so your many posts on this topic have all been held. I’ve released this one.
At a guess, I’d attribute this to the way .NET itself handles hash tables and arrays generally, meaning that when you add an element to one, it more or less has to re-create the entire underlying array. As the array grows progressively larger, that process obviously takes longer and longer.
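As a quick illustration of that kind of cost in PowerShell itself (not necessarily what Group-Object does internally):

# += on a plain array allocates a new array and copies every element,
# so the total cost grows quadratically with size
$slow = Measure-Command {
    $a = @()
    foreach ($i in 1..20000) { $a += $i }
}

# A resizable list amortizes its growth and stays fast
$fast = Measure-Command {
    $list = New-Object 'System.Collections.Generic.List[int]'
    foreach ($i in 1..20000) { $list.Add($i) }
}

'Array +=: {0:n0} ms' -f $slow.TotalMilliseconds
'List.Add: {0:n0} ms' -f $fast.TotalMilliseconds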
Interesting… I’ve never looked at the Group-Object cmdlet’s code before, and it behaves a bit oddly in this method, which I decompiled with ILSpy and pasted below.
The first bit of code (based on the result of TryGetValue) is what you’d expect of code that adds to a dictionary. What’s interesting is the “for” loop in the else block, which iterates over all of the existing groups instead of using a dictionary-based lookup. I’m not sure why that code needs to be there, but it runs every time the dictionary lookup misses, i.e., every time a new key shows up, so building N distinct groups costs on the order of N² comparisons. That is exactly the sort of thing that could take a long time on a large data set, and it lines up with your observation that things slow down dramatically somewhere past 10,000 keys.
// Microsoft.PowerShell.Commands.GroupObjectCommand
internal static void DoGrouping(OrderByPropertyEntry currentObjectEntry, bool noElement, List<GroupInfo> groups, Dictionary<object, GroupInfo> groupInfoDictionary, OrderByPropertyComparer orderByPropertyComparer)
{
    if (currentObjectEntry != null && currentObjectEntry.orderValues != null && currentObjectEntry.orderValues.Count > 0)
    {
        object key = PSTuple.ArrayToTuple(currentObjectEntry.orderValues.ToArray());
        GroupInfo groupInfo = null;
        // Fast path: the key has been seen before, so add to the existing group
        if (groupInfoDictionary.TryGetValue(key, out groupInfo))
        {
            if (groupInfo != null)
            {
                groupInfo.Add(currentObjectEntry.inputObject);
                return;
            }
        }
        else
        {
            // Slow path: on a dictionary miss, linearly scan every existing
            // group for a match before creating a new one
            bool flag = false;
            for (int i = 0; i < groups.Count; i++)
            {
                if (orderByPropertyComparer.Compare(groups[i].GroupValue, currentObjectEntry) == 0)
                {
                    groups[i].Add(currentObjectEntry.inputObject);
                    flag = true;
                    break;
                }
            }
            if (!flag)
            {
                GroupObjectCommand.tracer.WriteLine(string.Format(CultureInfo.InvariantCulture, "Create a new group: {0}", new object[]
                {
                    currentObjectEntry.orderValues
                }), new object[0]);
                GroupInfo groupInfo2 = noElement ? new GroupInfoNoElement(currentObjectEntry) : new GroupInfo(currentObjectEntry);
                groups.Add(groupInfo2);
                groupInfoDictionary.Add(key, groupInfo2);
            }
        }
    }
}
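If that linear scan is the bottleneck, the cost should track the number of distinct keys rather than the total element count. A quick way to check (a sketch; I haven’t run this against your Gist):

# Same total element count each run; only the number of distinct keys varies
foreach ($distinct in 1000, 5000, 10000, 20000) {
    $data = foreach ($i in 1..50000) {
        [pscustomobject]@{ Num = Get-Random -Maximum $distinct; Str = 'foobar' }
    }
    $t = Measure-Command { $null = $data | Group-Object -Property Num -AsHashTable }
    '{0,6} distinct keys: {1,8:n0} ms' -f $distinct, $t.TotalMilliseconds
}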
Hey Dave, thanks for your reply. I kinda see what you’re talking about, but I’ll need a bit more time to digest exactly what’s going on in the code you pasted.
Somewhat unrelated: I’ve never heard of ILSpy, but I just downloaded it because it seems pretty useful. However, I’m not sure how you navigated to the Group-Object code. Could you explain how you did that?