Merging very large collections

I have two very large collections, $a and $b. They are both imported data files.

$a has 25k records, and $b has 60k records.

I need to merge these into one list doing a Left Join.

I tried Join-Object (http://blogs.msdn.com/b/powershell/archive/2012/07/13/join-object.aspx) but the performance is not working very well.

Is there another tool that could merge collections?

There is definitely some optimization that can be done on that Join-Object function, but it’s not going to reduce your time by orders of magnitude or anything. At best, you might cut the time in half.

With 25k and 60k records, you’re well into the territory where my recommendation would be to use a database platform with indexes on the important fields. If you’re stuck writing it yourself, you’d get better performance by doing it with compiled C# code (even if you perform that compilation with Add-Type in a PowerShell script); there’s quite a bit of overhead every time PowerShell access a property or method of an object that doesn’t apply to most C# code. (If you’re familiar with Reflection, thinking of having to write a C# program that can only use Reflection to access object data; that’s basically what PowerShell has to do.)

For the heck of it (sticking with PowerShell only), here’s how I would rewrite their Join-Object function. The only change in how you use it is that I’ve replaced the -Where scriptblock parameter with -LeftJoinProperty and -RightJoinProperty:

Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} –LeftProperties "Name" –RightProperties "When" -Type OnlyIfInBoth

# becomes:

Join-Object -Left $employee -Right $entrance -LeftJoinProperty Id -RightJoinProperty EmployeeId –LeftProperties "Name" –RightProperties "When" -Type OnlyIfInBoth

It’s assumed that you’d be using the -eq operator in that script block anyway, and separating those property names into two separate parameters allowed me to take a different approach within the Join-Object function. In terms of what I actually did:

  • I built a pair of hashtables on the join properties instead, keeping the number of iterations over the arrays to a minimum. The original code's nested for / foreach loops over the entire collections were very expensive; the hashtable approach only has to perform that sort of nested loop in cases where it already knows the key fields match.
  • I got rid of the calls to Add-Member, and instead built a property hashtable to be passed to New-Object. Add-Member is very expensive, and when you're doing it hundreds of thousands of times, it adds up.
  • I got rid of the script block parameter; having to Invoke a script block many times also adds up in execution time.
  • I changed the $Type parameter from a string to an enumerated type, to squeak out a little more speed by doing integer comparisons on the enum values instead of string comparisons.

Here’s the code. In my tests, it performs a little more than twice as fast as the original. I haven’t really spent much time polishing or commenting it, but hopefully it helps:

Add-Type -TypeDefinition @'
    public enum JoinType {
        AllInLeft,
        OnlyIfInBoth,
        AllInBoth,
        AllInRight
    };
'@

function AddItemProperties($item, $properties, $hash)
{
    if ($null -eq $item)
    {
        return
    }

    foreach($property in $properties)
    {
        $propertyHash = $property -as [hashtable]
        if($null -ne $propertyHash)
        {
            $hashName = $propertyHash["name"] -as [string]
            if($null -eq $hashName)
            {
                throw "there should be a string Name"  
            }
         
            $expression = $propertyHash["expression"] -as [scriptblock]
            if($null -eq $expression)
            {
                throw "there should be a ScriptBlock Expression"  
            }
         
            $expressionValue = $expression.Invoke($item)
            
            $hash[$hashName] = $expressionValue
        }
        else
        {
            foreach($itemProperty in $item.psobject.Properties)
            {
                if ($itemProperty.Name -like $property)
                {
                    $hash[$itemProperty.Name] = $itemProperty.Value
                }
            }
        }
    }
}

    
function WriteJoinObjectOutput($leftItem, $rightItem, $leftProperties, $rightProperties)
{
    $properties = @{}

    AddItemProperties $leftItem $leftProperties $properties
    AddItemProperties $rightItem $rightProperties $properties

    New-Object psobject -Property $properties
}

function Join-Object
{
    [CmdletBinding()]
    Param
    (
        # List to join with $Right
        [Parameter(Mandatory=$true)]
        [object[]]
        $Left,

        # List to join with $Left
        [Parameter(Mandatory=$true)]
        [object[]]
        $Right,

        [Parameter(Mandatory = $true)]
        [string]
        $LeftJoinProperty,

        [Parameter(Mandatory = $true)]
        [string]
        $RightJoinProperty,

        # Properties from $Left we want in the output.
        # Each property can:
        # - Be a plain property name like "Name"
        # - Contain wildcards like "*"
        # - Be a hashtable like @{Name="Product Name";Expression={$_.Name}}. Name is the output property name
        #   and Expression is the property value. The same syntax is available in select-object and it is 
        #   important for join-object because joined lists could have a property with the same name
        [Parameter(Mandatory=$true)]
        [object[]]
        $LeftProperties,

        # Properties from $Right we want in the output.
        # Like LeftProperties, each can be a plain name, wildcard or hashtable. See the LeftProperties comments.
        [Parameter(Mandatory=$true)]
        [object[]]
        $RightProperties,

        # Type of join. 
        #   AllInLeft will have all elements from Left at least once in the output, and might appear more than once
        # if the where clause is true for more than one element in right, Left elements with matches in Right are 
        # preceded by elements with no matches. This is equivalent to an outer left join (or simply left join) 
        # SQL statement.
        #  AllInRight is similar to AllInLeft.
        #  OnlyIfInBoth will cause all elements from Left to be placed in the output, only if there is at least one
        # match in Right. This is equivalent to a SQL inner join (or simply join) statement.
        #  AllInBoth will have all entries in right and left in the output. Specifically, it will have all entries
        # in right with at least one match in left, followed by all entries in Right with no matches in left, 
        # followed by all entries in Left with no matches in Right.This is equivallent to a SQL full join.
        [Parameter(Mandatory=$false)]
        [JoinType]
        $Type = [JoinType]::OnlyIfInBoth
    )

    $leftHash = New-Object System.Collections.Specialized.OrderedDictionary
    $rightHash = New-Object System.Collections.Specialized.OrderedDictionary

    # Hashtable keys can't be null; we'll use any old object reference as a placeholder if needed.
    $nullKey = New-Object psobject

    foreach ($item in $Left)
    {
        $key = $item.$LeftJoinProperty

        if ($null -eq $key)
        {
            $key = $nullKey
        }

        $bucket = $leftHash[$key]

        if ($null -eq $bucket)
        {
            $bucket = New-Object System.Collections.ArrayList
            $leftHash.Add($key, $bucket)
        }

        $null = $bucket.Add($item)
    }

    foreach ($item in $Right)
    {
        $key = $item.$RightJoinProperty

        if ($null -eq $key)
        {
            $key = $nullKey
        }

        $bucket = $rightHash[$key]

        if ($null -eq $bucket)
        {
            $bucket = New-Object System.Collections.ArrayList
            $rightHash.Add($key, $bucket)
        }

        $null = $bucket.Add($item)
    }

    foreach ($entry in $leftHash.GetEnumerator())
    {
        $key = $entry.Key
        $leftBucket = $entry.Value

        $rightBucket = $rightHash[$key]

        if ($null -eq $rightBucket)
        {
            if ($Type -eq [JoinType]::AllInLeft -or $Type -eq [JoinType]::AllInBoth)
            {
                foreach ($leftItem in $leftBucket)
                {
                    WriteJoinObjectOutput $leftItem $null $LeftProperties $RightProperties
                }
            }
        }
        else
        {
            foreach ($leftItem in $leftBucket)
            {
                foreach ($rightItem in $rightBucket)
                {
                    WriteJoinObjectOutput $leftItem $rightItem $LeftProperties $RightProperties
                }
            }
        }
    }

    if ($Type -eq [JoinType]::AllInRight -or $Type -eq [JoinType]::AllInBoth)
    {
        foreach ($entry in $rightHash.GetEnumerator())
        {
            $key = $entry.Key
            $rightBucket = $entry.Value

            $leftBucket = $leftHash[$key]

            if ($null -eq $leftBucket)
            {
                foreach ($rightItem in $rightBucket)
                {
                    WriteJoinObjectOutput $null $rightItem $LeftProperties $RightProperties
                }
            }
        }
    }
}