Finding fuzzy duplicate items in rows of a CSV

by notarat at 2013-02-21 16:17:11

Bear with me, please. I’m still very new to PowerShell (I use PS 2.0 at work), and I’ve been using it to pull CSV lists of all users in my OU and the security/distribution groups they belong to.

I see instances where users are assigned to what are essentially the same distribution groups, with names that are about 99% identical.

Example:

tim.jones, Marketing_East, Marketing_West, Marketing_South, Marketing_North, Marketing_East2, Marketing_US

Marketing_East & Marketing_East2 are identical in all but name.

I’d like to find a script that would automate the search through each row of my CSV file to identify instances where there are "fuzzy" duplicate groups like the example.

I’m still getting the hang of PowerShell scripting, to be honest, and my efforts so far have been directed more towards "getting" the information than "processing" it.

I searched, but I’m having a tough time finding examples of searching through a CSV row by row for what I would call "fuzzy duplicates" (groups or items that are spelled nearly alike and differ only a little at either the beginning or end of the field).

Are there any example scripts out there or does anyone have an example they can share?
by DonJ at 2013-02-22 04:34:28
Nothing I’ve seen. This is a pretty tough task, because the shell doesn’t have any native functionality to do this. You’ll essentially have to make a collection of every group name, and then enumerate that and perform some kind of wildcard comparison. It might be easier to load them into a SQL Server database, since you could then take advantage of SQL-side comparisons like SOUNDEX(), which is explicitly a phonetic (sounds-alike) fuzzy comparison. PowerShell doesn’t have anything native that’s quite like it.
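
For what it’s worth, the "collect the names, then enumerate and compare" idea can be roughed out in plain PowerShell. The following is only a minimal sketch, not a tested production script. It assumes each line is a user name followed by a comma-separated list of groups (as in the example row above), and it treats two groups as "fuzzy duplicates" when the shorter name sits at the very start or end of the longer one and the longer one adds at most a couple of characters. The function name Find-FuzzyDuplicate and the $MaxExtra tolerance are made up for illustration, and the code sticks to features available in PS 2.0.

```powershell
# Sketch only: Find-FuzzyDuplicate and $MaxExtra are hypothetical names.
# Two names match when the shorter one is a prefix or suffix of the longer
# one and the longer one has at most $MaxExtra additional characters.
function Find-FuzzyDuplicate {
    param(
        [string[]]$Names,
        [int]$MaxExtra = 2   # assumed tolerance; tune to your naming scheme
    )
    $ignoreCase = [System.StringComparison]::OrdinalIgnoreCase
    for ($i = 0; $i -lt $Names.Count; $i++) {
        for ($j = $i + 1; $j -lt $Names.Count; $j++) {
            # Order the pair so $short is never longer than $long
            $short = $Names[$i]
            $long  = $Names[$j]
            if ($short.Length -gt $long.Length) { $short, $long = $long, $short }
            if ($short.Length -eq 0) { continue }
            if (($long.Length - $short.Length) -gt $MaxExtra) { continue }
            if ($long.StartsWith($short, $ignoreCase) -or
                $long.EndsWith($short, $ignoreCase)) {
                "$short ~ $long"
            }
        }
    }
}

# Sample row taken from the question; real data would come from the CSV file.
$lines = @(
    'tim.jones, Marketing_East, Marketing_West, Marketing_South, Marketing_North, Marketing_East2, Marketing_US'
)

$results = @(
    foreach ($line in $lines) {
        # Split the row on commas, trim whitespace, drop empty fields
        $fields = @($line -split ',' | ForEach-Object { $_.Trim() } | Where-Object { $_ })
        if ($fields.Count -lt 3) { continue }   # need a user plus at least two groups
        $user   = $fields[0]
        $groups = $fields[1..($fields.Count - 1)]
        foreach ($pair in @(Find-FuzzyDuplicate -Names $groups)) {
            New-Object PSObject -Property @{ User = $user; Pair = $pair } |
                Select-Object User, Pair
        }
    }
)

$results
```

To run this against a real export, $lines could be filled with Get-Content pointed at your own file instead of the hard-coded sample. Note that exact duplicates within a row are flagged too, and that names which differ in the middle (Marketing_East vs. Marketing_West) are deliberately not flagged. A true edit-distance or SOUNDEX-style comparison would behave differently.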
by notarat at 2013-02-22 05:08:54
Don,

Thanks for the response(!) even though it was a confirmation of my fears…

I’m even less adept at SQL than PowerShell, lol. I guess I’ll be buying some SQL books this weekend, haha.

Have a good one.