How would you make this faster?

adbertram · October 22, 2014, 5:19am

I have a big list CSV file formatted as follows:

Firstname,MiddleInitial,LastName

I want to find each AD user object that matches the each field, if possible. For example, a best case scenario would be an AD user object match that matches Firstname,Lastname and MiddleInitial exactly. The next best scenario would be the first name and last name match exactly. Based on these rules I’m still seeing hundreds of user accounts that don’t have an exact match but might have a character or two that’s off.

For example, in the CSV file the first name could be “Joe” but in AD his first name is “Joseph”.

I’ve added in a fuzzy search using the Levenshtein distance (Levenshtein Distance in Windows PowerShell - CodeProject). This function finds the number of edits that must happen on string1 before it looks like string2. It’s the best way I’ve found to perform these kind of matches.

Here’s how I’m currently doing this:

$AllAdUsers = Get-ADUser -Filter { [givenname -like '*'] -and [surname -like '*'] }
$UserCount = $AllAdUsers.Count
foreach ($Row in $Csv)
    for [$i=0; $i -lt $UserCount; $i++] {
			$FnameDistance = Get-LevenshteinDistance -first $Row.FirstName -second $AllAdusers[$i].Givenname -ignoreCase
			$LnameDistance = Get-LevenshteinDistance -first $Row.LastName -second $AllAdusers[$i].surname -ignoreCase
			$TotalDistance = $FnameDistance + $LnameDistance
			if [$i -eq 0] {
				$LowestDistance = $TotalDistance
			} elseif [$TotalDistance -lt $LowestDistance] {
				$LowestDistance = $TotalDistance
			}
		}
               $LowestDistance
}

I’m just getting the distance between my CSV field and the AD field, adding them up for each AD object and getting the one with the lowest number. It’s terribly slow but I can’t think of a better way to do it.

Any input?

system · October 22, 2014, 5:59am

The first thing I would do is move most of this code into C# (at least, if not C++). PowerShell code takes quite a bit longer to execute than its compiled equivalent, and you’ve got a very slow algorithm here to begin with ( O(n^2) just in the code you’ve shown, and another O(n^2) inside the Get-LevenshteinDistance function on top of that.)

Beyond that, the best performance improvements would come from either allowing faster comparisons (calculate levenshtein distance in a faster way), or cutting down on the number of iterations in your loops somehow. You could look into spell checker algorithms, as they have a very similar job to what you’re doing here.

adbertram · October 22, 2014, 6:06am

I’m no developer, Mr. C++. A weeeee bit of C# is what I’m comfortable with. I hadn’t looked at the function code yet probably because I just didn’t want to look under the covers and get started down that rabbit hole.

Spell checker algorithms…that’s an idea. I’ll look into that or, if I’m brave enough crack open the LevenshteinDistance function and see if there’s a better way.

matt-mcnabb · October 22, 2014, 6:07am

Have you timed how long each of the Levenshtein operations takes? There isn’t a whole lot outside of that going on in your script that would take all that much time. If you have 10,000 users in AD and your CSV file has 100 users, then the way you are doing it will loop through all 10,000 AD users 100 times for a total of 1 million comparisons. Maybe you could try reversing the loop logic like below but I think the total number of comparisons will be the same:

$AllAdUsers = Get-ADUser -Filter *

foreach ($User in $AllAdusers)
{
    for ($i=0; $i -lt $Csv.Count; $i++)
    {
            $FnameDistance = Get-LevenshteinDistance -First $User.Givenname -Second $Csv[$i].Firstname -ignoreCase
            $LnameDistance = Get-LevenshteinDistance -First $User.Surname -Second $Csv[$i].Lastname -ignoreCase
            $TotalDistance = $FnameDistance + $LnameDistance
            
            if ($i -eq 0)
            {
                $LowestDistance = $TotalDistance
            }
            elseif ($TotalDistance -lt $LowestDistance)
            {
                $LowestDistance = $TotalDistance
            }
        }
        
        $LowestDistance
}

Also, you are doing two Levs in each for loop iteration, maybe you could compare the surnames first, check if that is within a certain distance, and then only run the givenname comparison if it meets that criterion. This could cut down drastically on the total number of Lev comparisons.

adbertram · October 22, 2014, 6:22am

I have not timed the Levenshtein functions. However, I might try that if I decide to try to improve upon that function.

I am already comparing the given name and surname in another instance of this script. The ones that it can’t find exact matches on get passed to this fuzzy search method. I do believe if I reversed them it’d still be the same.

I’m able to get output after 20 minutes or so. It’s not the end of the world but more of an academic exercise in Powershell.

simon-wahlin · October 22, 2014, 8:11am

How about concatenating the FirstName and LastName to FirstNameLastName? That would cut the amount on Levs in half wouldn’t it?

$Name = '{0}{1}' -f $Row.FirstName, $Row.LastName
$ADname = '{0}{1}' -f $AllAdusers[$i].Givenname, $AllAdusers[$i].surname
$CurrDistance = Get-LevenshteinDistance -first $Name -second $ADname -ignoreCase
$LowestDistance = [Math]::Min($CurrDistance,$LowestDistance)

adbertram · October 22, 2014, 8:32am

Hmm…that’s a great idea! I don’t see why that wouldn’t work.

simon-wahlin · October 22, 2014, 9:08am

Actually I think it was a really bad idea…

Since the lev-algorithm is a nested loop, the amount of iterations is $first.length * $second.length

Two comparisons with a length of 5 each will make (55)+(55) which is 50 iterations.

Concatenating the strings will get us one compare with 10*10 which is 100 iterations.

If you instead could implement a limit, say if the cost is greater than X, stop comparing that would speed up the process.

adbertram · October 22, 2014, 12:31pm

Honestly I haven’t even looked at the code of the lev function. I was doing something bad and assuming it was efficient. I tried to run that and it did take FOREVER.

system · October 22, 2014, 2:38pm

If you look up Levenshtein Distance on Wikipedia, you’ll find the algorithm that was used in your original sample, but there’s also a different one which only uses two rows of the array. In my tests, this seemed to be around 5x faster (though this may depend on string length and other factors.) Here’s that other algorithm, converted into PowerShell:

function Get-LevenshteinDistance
{
    param([string] $first, [string] $second, [switch] $ignoreCase)
 
    $len1 = $first.length
    $len2 = $second.length
 
    if($len1 -eq 0)
    { return $len2 }
 
    if($len2 -eq 0)
    { return $len1 }
 
    if($ignoreCase -eq $true)
    {
        $first = $first.tolowerinvariant()
        $second = $second.tolowerinvariant()
    }
    
    if ($first -ceq $second) { return 0 }
    
    $v0 = 0..$len1
    $v1 = 0..$len1

    for ($i = 0; $i -lt $len2; $i++)
    {
        $v1[0] = $i + 1

        for ($j = 0; $j -lt $len1; $j++)
        {
            $cost = if ($first[$j] -ceq $second[$i]) { 0 } else { 1 }
            $v1[$j + 1] = [math]::Min($v1[$j] + 1, [math]::Min($v0[$j + 1] + 1, $v0[$j] + $cost))
        }

        $v0,$v1 = $v1,$v0
    }
    
    return $v1[-1]
}

adbertram · October 22, 2014, 11:53pm

Thanks, Dave! Talk about service! I’ll run it and let you know what the time difference is.

adbertram · October 23, 2014, 12:01am

I just ran it again and got a 60% time reduction. Thanks!

system · October 23, 2014, 12:29am

Just for giggles, here it is in C#. See how the speed compares; I’m curious.

system · October 23, 2014, 12:34am

Here are the results I got (total milliseconds) of one thousand iterations comparing ‘Kittens’ to ‘Sitting’ with the three implementations:

325.8029
144.292
52.5131

system · October 23, 2014, 12:36am

The forum is doing something weird to the closing quotation mark of my here-string; it’s supposed to be a single quote, not double.

Bah, it also removed some code from inside the method. Hang on, I’ll repost this as a gist.

adbertram · October 23, 2014, 1:28am

Are you bored or something today?

system · October 23, 2014, 2:22am

Not really, it’s just an interesting problem.

adbertram · October 23, 2014, 2:25am

I hear you and now I know who to call on for help if I ever come up with this kind of problem in the future. This kind of problem wasn’t interesting to me. That’s why I admittingly threw it to you. However, I can definitely relate to getting so zoned in to a particular problem I lose track of time.

Topic		Replies	Views
Comparing users in an AD group to a CSV file PowerShell Help	7	196	June 3, 2022
simple error checking between AD and CSV PowerShell Help	20	189	December 29, 2016
Find if user is in AD by First and last Name PowerShell Help	13	174	June 25, 2020
Help With Arrays and MatchInfo PowerShell Help	2	153	November 23, 2014
Using CSV to Find User Without AD Account PowerShell Help	3	170	February 6, 2023

How would you make this faster?

Related Topics