How would you make this faster?

I have a big list CSV file formatted as follows:

Firstname,MiddleInitial,LastName

I want to find each AD user object that matches the each field, if possible. For example, a best case scenario would be an AD user object match that matches Firstname,Lastname and MiddleInitial exactly. The next best scenario would be the first name and last name match exactly. Based on these rules I’m still seeing hundreds of user accounts that don’t have an exact match but might have a character or two that’s off.

For example, in the CSV file the first name could be “Joe” but in AD his first name is “Joseph”.

I’ve added in a fuzzy search using the Levenshtein distance (http://www.codeproject.com/Tips/102192/Levenshtein-Distance-in-Windows-PowerShell). This function finds the number of edits that must happen on string1 before it looks like string2. It’s the best way I’ve found to perform these kind of matches.

Here’s how I’m currently doing this:

$AllAdUsers = Get-ADUser -Filter { [givenname -like '*'] -and [surname -like '*'] }
$UserCount = $AllAdUsers.Count
foreach ($Row in $Csv)
    for [$i=0; $i -lt $UserCount; $i++] {
			$FnameDistance = Get-LevenshteinDistance -first $Row.FirstName -second $AllAdusers[$i].Givenname -ignoreCase
			$LnameDistance = Get-LevenshteinDistance -first $Row.LastName -second $AllAdusers[$i].surname -ignoreCase
			$TotalDistance = $FnameDistance + $LnameDistance
			if [$i -eq 0] {
				$LowestDistance = $TotalDistance
			} elseif [$TotalDistance -lt $LowestDistance] {
				$LowestDistance = $TotalDistance
			}
		}
               $LowestDistance
}

I’m just getting the distance between my CSV field and the AD field, adding them up for each AD object and getting the one with the lowest number. It’s terribly slow but I can’t think of a better way to do it.

Any input?

The first thing I would do is move most of this code into C# (at least, if not C++). PowerShell code takes quite a bit longer to execute than its compiled equivalent, and you’ve got a very slow algorithm here to begin with ( O(n^2) just in the code you’ve shown, and another O(n^2) inside the Get-LevenshteinDistance function on top of that.)

Beyond that, the best performance improvements would come from either allowing faster comparisons (calculate levenshtein distance in a faster way), or cutting down on the number of iterations in your loops somehow. You could look into spell checker algorithms, as they have a very similar job to what you’re doing here.

I’m no developer, Mr. C++. :slight_smile: A weeeee bit of C# is what I’m comfortable with. I hadn’t looked at the function code yet probably because I just didn’t want to look under the covers and get started down that rabbit hole.

Spell checker algorithms…that’s an idea. I’ll look into that or, if I’m brave enough crack open the LevenshteinDistance function and see if there’s a better way.

Have you timed how long each of the Levenshtein operations takes? There isn’t a whole lot outside of that going on in your script that would take all that much time. If you have 10,000 users in AD and your CSV file has 100 users, then the way you are doing it will loop through all 10,000 AD users 100 times for a total of 1 million comparisons. Maybe you could try reversing the loop logic like below but I think the total number of comparisons will be the same:

$AllAdUsers = Get-ADUser -Filter *

foreach ($User in $AllAdusers)
{
    for ($i=0; $i -lt $Csv.Count; $i++)
    {
            $FnameDistance = Get-LevenshteinDistance -First $User.Givenname -Second $Csv[$i].Firstname -ignoreCase
            $LnameDistance = Get-LevenshteinDistance -First $User.Surname -Second $Csv[$i].Lastname -ignoreCase
            $TotalDistance = $FnameDistance + $LnameDistance
            
            if ($i -eq 0)
            {
                $LowestDistance = $TotalDistance
            }
            elseif ($TotalDistance -lt $LowestDistance)
            {
                $LowestDistance = $TotalDistance
            }
        }
        
        $LowestDistance
}

Also, you are doing two Levs in each for loop iteration, maybe you could compare the surnames first, check if that is within a certain distance, and then only run the givenname comparison if it meets that criterion. This could cut down drastically on the total number of Lev comparisons.

I have not timed the Levenshtein functions. However, I might try that if I decide to try to improve upon that function.

I am already comparing the given name and surname in another instance of this script. The ones that it can’t find exact matches on get passed to this fuzzy search method. I do believe if I reversed them it’d still be the same.

I’m able to get output after 20 minutes or so. It’s not the end of the world but more of an academic exercise in Powershell. :slight_smile:

How about concatenating the FirstName and LastName to FirstNameLastName? That would cut the amount on Levs in half wouldn’t it?

$Name = '{0}{1}' -f $Row.FirstName, $Row.LastName
$ADname = '{0}{1}' -f $AllAdusers[$i].Givenname, $AllAdusers[$i].surname
$CurrDistance = Get-LevenshteinDistance -first $Name -second $ADname -ignoreCase
$LowestDistance = [Math]::Min($CurrDistance,$LowestDistance)

Hmm…that’s a great idea! I don’t see why that wouldn’t work.

Actually I think it was a really bad idea…

Since the lev-algorithm is a nested loop, the amount of iterations is $first.length * $second.length

Two comparisons with a length of 5 each will make (55)+(55) which is 50 iterations.

Concatenating the strings will get us one compare with 10*10 which is 100 iterations.

If you instead could implement a limit, say if the cost is greater than X, stop comparing that would speed up the process.

Honestly I haven’t even looked at the code of the lev function. I was doing something bad and assuming it was efficient. :slight_smile: I tried to run that and it did take FOREVER.

If you look up Levenshtein Distance on Wikipedia, you’ll find the algorithm that was used in your original sample, but there’s also a different one which only uses two rows of the array. In my tests, this seemed to be around 5x faster (though this may depend on string length and other factors.) Here’s that other algorithm, converted into PowerShell:

function Get-LevenshteinDistance
{
    param([string] $first, [string] $second, [switch] $ignoreCase)
 
    $len1 = $first.length
    $len2 = $second.length
 
    if($len1 -eq 0)
    { return $len2 }
 
    if($len2 -eq 0)
    { return $len1 }
 
    if($ignoreCase -eq $true)
    {
        $first = $first.tolowerinvariant()
        $second = $second.tolowerinvariant()
    }
    
    if ($first -ceq $second) { return 0 }
    
    $v0 = 0..$len1
    $v1 = 0..$len1

    for ($i = 0; $i -lt $len2; $i++)
    {
        $v1[0] = $i + 1

        for ($j = 0; $j -lt $len1; $j++)
        {
            $cost = if ($first[$j] -ceq $second[$i]) { 0 } else { 1 }
            $v1[$j + 1] = [math]::Min($v1[$j] + 1, [math]::Min($v0[$j + 1] + 1, $v0[$j] + $cost))
        }

        $v0,$v1 = $v1,$v0
    }
    
    return $v1[-1]
}

Thanks, Dave! Talk about service! :slight_smile: I’ll run it and let you know what the time difference is.

I just ran it again and got a 60% time reduction. Thanks!

Just for giggles, here it is in C#. See how the speed compares; I’m curious. :slight_smile:

https://gist.github.com/dlwyatt/4625483f924118d77437

Here are the results I got (total milliseconds) of one thousand iterations comparing ‘Kittens’ to ‘Sitting’ with the three implementations:

325.8029
144.292
52.5131

The forum is doing something weird to the closing quotation mark of my here-string; it’s supposed to be a single quote, not double.

Bah, it also removed some code from inside the method. Hang on, I’ll repost this as a gist.

Are you bored or something today? :slight_smile:

Not really, it’s just an interesting problem. :slight_smile:

I hear you and now I know who to call on for help if I ever come up with this kind of problem in the future. This kind of problem wasn’t interesting to me. That’s why I admittingly threw it to you. However, I can definitely relate to getting so zoned in to a particular problem I lose track of time. :slight_smile: