Speeding up comparision of two CSV file imports

Hello, can someone please help me with the following question?

I have two CSV files on contains about 62,000 records and the other about 210,000 records

I import the data into two separate variables e.g.

$A = import-csv c:\data\file1.csv # this file has 211,000 records
$B = import-csv c:\data\file2.csv # this file has 62,000 records

I then need to compare e.g.

foreach ($item in $A) {

$x = $B | where {$.Name -eq $item.Name}

if ($x.DOB -eq $item.DOB) { write-host ‘match found’}

}

The above is a basic example of what I am doing but it is extremely slow

I think the problem may lie here $x = $B | where {$.Name -eq $item.Name}

Any ideas on how to speed this up, please?

Thanks very much

Have you looked into Compare-Object ?

Could you please format your code as code here in the forum? Thanks in advance.

If you want to compare objects in PowerShell can use the cmdlet

Please read the help for the cmdlet completely including the examples to learn how to use it.

Hello All

Thanks very much for the replies so far.

The main issue is the two CSV files do not contain the same field names, so I cannot do a direct comparison between them. Therefore I import each file when use a foreach loop to compare the items of interest

here is the code

# version 0.1

[cmdletbinding()]
param (
	$InputXFile = "c:\Scripts\X\Copy\File1.csv",
	$InputADFile = "c:\Scripts\X\Copy\ADFile.csv",
	$LogFilePath = "C:\Scripts\X\LogFile.txt"
)

# logic

trap
{
	Write-Error $_
	
}

$StartTime = get-date
Write-Verbose "Importing $InputXFile"
try { $XData = Import-Csv $InputXFile -ErrorAction Stop }
catch { throw $_ }

Write-Verbose "Importing $InputADFile"
try { $ADData = Import-Csv $InputADFile -ErrorAction Stop }
catch { throw $_ }

#region compare X to AD

$I = 1

:MainXloop foreach ($XRecord in $XData)
{
	
	Write-Verbose "Stage 1 percent complete $(($I / $XData.count) * 100)"
	
	write-verbose "check the user record exists in X and AD data files"
	
	$ADRecord = $ADData | Where-Object  { $_.sAMAccountName -eq $XRecord.UniqueID }
	
	if (!$ADRecord -and $XRecord.IdentityStatus -eq 'Active')
	{
		
		$WarningMessage = "X user record with UniqueID $($XRecord.UniqueID) does not exist in the AD import file $InputADFile`r`n"
		Write-Warning $WarningMessage
		Add-Content -Value $WarningMessage -Path $LogFilePath -force
		$I++
		continue MainXloop
		
	}
	
	
	if ($XRecord.IdentityStatus -eq 'Offline' -and $ADRecord.sAMAccountName)
	{
		
		$WarningMessage = "AD User $($ADRecord.sAMAccountName) exits in the AD however the X status is Offline meaning this user should 'not' exist in AD`r`n"
		Write-Warning $WarningMessage
		Add-Content -Value $WarningMessage -Path $LogFilePath -force
		
		
	}
	elseif ($XRecord.IdentityStatus -eq 'Offline' -and -not $ADUser.sAMAccountName)
	{
		$I++
		continue MainXloop
	}
	
	else
	{ 
	
		
		write-verbose "checking deleted status"
				
		if ($XRecord.IdentityStatus -eq 'DELETED' -and $ADRecord.Enabled)
		{
			
			try { $TerminationDate = [datetime]([datetime]"$($XRecord.TerminationDate)").ToString("MM/dd/yyyy hh:mm:ss tt") } catch {}
			
			$WarningMessage = "AD User $($ADRecord.sAMAccountName) exists in AD and their account is Enabled. However X status is DELETED meaning this user should be Disabled in AD (pending AD clean-up). Their termination date was $($TerminationDate.toString('dd/MMM/yyyy hh:mm:ss tt'))`r`n"
			Write-Warning $WarningMessage
			Add-Content -Value $WarningMessage -Path $LogFilePath -force
			
		}
		
		
		Write-Verbose "checking active status"
		
		if ($XRecord.IdentityStatus -eq 'Active')
		{
			
			$Date = get-date
			
			#region check action start date
			
			if ($XRecord.ActionStartDate)
			{
				# convering date from US format (as X input file dates in US format)
				
				$ActionStartDate = [datetime]([datetime]"$($XRecord.ActionStartDate)").ToString("MM/dd/yyyy hh:mm:ss tt")
				
				Write-Verbose "checking action start date to ensure AD user is not enabled if start date is in the future"
				
								
				if ( ($ActionStartDate -gt $Date) -and $ADRecord.Enabled)
				{
					$WarningMessage = "AD User $($ADRecord.sAMAccountName) account is set to Enabled, however their X ActionStartDate is in the future e.g. $($ActionStartDate.toString('dd/MMM/yyyy hh:mm:ss tt')) meaning the AD account should not be enaled at this time`r`n"
					Write-Warning $WarningMessage
					Add-Content -Value $WarningMessage -Path $LogFilePath -force
				}
			}
			
			#endregion
			
			#region check expiry date
			write-verbose "checking expiry date"
			$ExpiryDate = [datetime]([datetime]"$($XRecord.ExpiryDate)").ToString("MM/dd/yyyy hh:mm:ss tt")
			if (($ExpiryDate -lt (get-date)) -and $ADRecord.Enabled)
			{
				$WarningMessage = "AD User $($ADRecord.sAMAccountName) account is set to Enabled, however their  X ExpiryDate is in the past e.g. $($ExpiryDate.toString('dd/MMM/yyyy hh:mm:ss tt')) meaning the AD account should be disabled pending clean-up`r`n"
				Write-Warning $WarningMessage
				Add-Content -Value $WarningMessage -Path $LogFilePath -force
			}
			#endregion
						
			#region check termination date
			write-verbose "checking TerminationDate date"
			$TerminationDate = [datetime]([datetime]"$($XRecord.TerminationDate)").ToString("MM/dd/yyyy hh:mm:ss tt")
			if (($TerminationDate -lt (get-date)) -and $ADRecord.Enabled)
			{
				$WarningMessage = "AD User $($ADRecord.sAMAccountName) account is set to Enabled, however their  X TerminationDate is in the past e.g. $($TerminationDate.toString('dd/MMM/yyyy hh:mm:ss tt')) meaning the AD account should be disabled pending clean-up`r`n"
				Write-Warning $WarningMessage
				Add-Content -Value $WarningMessage -Path $LogFilePath -force
			}
			
			
			#endregion
			
			$I++
		}
		
		else { }
		
		
	} # end else statement
		
	
	
}


#end region

#region compare AD to X 

Write-Verbose "Checking AD Data"

$I = 1
:MainADloop  foreach ($ADRecord in $ADData)

{
	Write-Verbose "Stage 2 percent complete $( ($I / $ADData.count) * 100 )"
	$XRecord = $XData | Where-Object  { $_.UniqueID -eq $ADRecord.sAMAccountName}
	
	if (!$XRecord)
	{
		
		$WarningMessage = "AD User with sAMAccountName (UID) $($ADRecord.sAMAccountName) does not exist in the X import file $InputXFile`r`n"
		Write-Warning $WarningMessage
		Add-Content -Value $WarningMessage -Path $LogFilePath -force
		$I++
		continue MainADloop
		
	}
	
	if ($ADRecord.Enabled -eq 'False' -and $XRecord.IdentityStatus -eq 'Active')
	{
		
		$WarningMessage = "AD User with sAMAccountName (UID) $($ADRecord.sAMAccountName) is Disabled in AD, however their X  IdentityStatus is Active this would 'normally' indicate an Enabled account in AD. Their X EmployeeStatus filed is set to $($XRecord.EmployeeStatus) and their EmployeeStatusSpecific is set to $($XRecord.EmployeeStatusSpecific)`r`n"

		Write-Warning $WarningMessage
		Add-Content -Value $WarningMessage -Path $LogFilePath -force
				
	}
	
	$I++
	
	
}

$EndTime = Get-Date

"Script complete, time taken $($EndTime - $StartTime)"
#endregion




You can create an additional header during the import process with a calculated property. For example: if the imported data have a header hostname instead of computername you could do something like this:

Import-Csv -Path 'c:\Scripts\X\Copy\File1.csv' -Delimiter ',' | 
    Select-Object -Property *,@{Name = 'ComputerName'; Expression = {$_.hostname}}

Now you could compare two data sets on the property “ComputerName”

1 Like

Thanks all I will take a look into that option too, I think the fundermental problem is the amount of data I need to compare. I will get back for the forum if I come up with a solution using the suggestions above or an alternative, thanks

@cxmelga

Create two [DataTable] objects to hold sets A and B and put a primary key on both objects. Then, use this index in search methods later (e.g. Find() ). There may be more scripting overall, but you will leverage an index value for your join and comparison operations rather than string matches.

$myDataTable.PrimaryKey = $myDataTable.Columns["Name","DOB"]

I cannot say for sure, but as you have multiple actions executing for each row item in larger A against all of smaller B; this should add to the time cost. If you have a SQL Server instance handy, that may be a better environment in which to perform these types operations over mid-to-large data sets.

See Also

DataRowCollection.Find Method

1 Like

Thanks Dicey

I will take a look at that option too :slight_smile:

Hi again,
Dicey would you be kind enough to give me an example of code to import two CSV files and put their data into tables for comparison, as I get the idea but I am not experienced enough, but if I can get started on an example that would be great :slight_smile:

Thanks very much

Why don’t you start with something easier to understand? :wink: :smiley:

$A = import-csv c:\data\file1.csv 
$B = import-csv c:\data\file2.csv | Select-Object -Property *, @{Name = 'Name'; Expression = { $_.'here you should provide the property you want to compare' }}

Compare-Object -ReferenceObject $A -DifferenceObject $B -Property Name -IncludeEqual -PassThru

Assumed you have the data set with the property to compare with the name “Name” in the CSV file “file1.csv” and the data set with the property to compare with a different name in the CSV file “file2.csv” you should provide the name of the property you want to use in the calculated property.

Depending on the result you expect you may play a little bit with the Compare-Object command. Switch the valuers for -ReferenceObejct and -DifferenceObejct for example.

1 Like

Dicey would you be kind enough to give me an example of code…

@cxmelga - certainly! I have a couple of long-distance runs scheduled this weekend, so it might be Sunday evening before I log back into the forum.

I started my IT career with relational databases long ago, and so [DataTable] and keys are objects I like to consider as an alternative strategy. The syntax is pretty quick to grasp, and—although I like to script each column—you can get clever with loops to reduce the lines of code required for wide tables.

1 Like

…import two CSV files and put their data into tables for comparison -@cxmelga

…if I can get started on an example that would be great -@cxmelga

Remarks

This demonstration features functions, classes, a datatable object with a primary key, and the use of the Find() method. The goal is to use functional logic in the original post and provide examples of different programming techniques while maintaining short execution times.

Performance

Good performance is difficult to judge when there are no standards with which to compare. Other approaches may be more suitable for the circumstances of a program and its operational environment. For this reason, I chose to demonstrate multiple action and item types rather than following one course of action.

Notes

  • Set A has 485268 rows; it was more efficient not to store this set as a [datatable] on my personal laptop.
  • 25 rows are expected to be found in A that have two elements that match rows in B.
    • These results are stored in C.
  • B is treated as a dimensional object.
    • B is not a subset of A, and the elements in B’s primary key must be non-null and unique in the set.
  • (Special) Elements in all objects are alphabetically ordered.
  • (Special) CSV files do not have a header row.

Demo Code

using namespace System.Collections;
using namespace System.Data;

$LoadSetA = $true;

if($LoadSetA)
{
    $A = Import-Csv A:\resultsA.csv -Header CarrierTrackingNumber,ModifiedDate,Name,ProductID,ProductNumber,SalesOrderDetailID,SalesOrderID,SetNbr,UnitPrice;
}

Class Bi
{
    [string] $Color;
    [string] $DaysToManufacture;
    [float] $ListPrice;
    [string] $Name;
    [int] $ProductID;
    [string] $ProductNumber;
    [int] $ReorderPoint;
    [datetime] $SellEndDate;
    [datetime] $SellStartDate;
    [string] $Style;

    Bi($q1,$q2,$q3,$q4,$q5,$q6,$q7,$q8,$q9,$q10)
    {
        $this.Color = $q1;
        $this.DaysToManufacture = $q2;
        $this.ListPrice = $q3;
        $this.Name = $q4;
        $this.ProductID = $q5;
        $this.ProductNumber = $q6;
        $this.ReorderPoint = $q7;
        $this.SellEndDate = $this.SetDateTime($q8);
        $this.SellStartDate = $q9;
        $this.Style = $q10;
    }

    [datetime]SetDateTime($d1)
    {
        [datetime]$result = switch ($d1)
        {
            {$_ -eq 'NULL'}{'1900-01-01'; break}
            {[string]::IsNullOrEmpty($_)}{'1900-01-01'; break}
            default {$d1}
        }
        return $result;
    }
}

$B = [Datatable]::new('Product','Production')
    [void]$B.Columns.Add('Color',[string]);
    [void]$B.Columns.Add('DaysToManufacture',[string]);
    [void]$B.Columns.Add('ListPrice',[decimal]);
    [void]$B.Columns.Add('Name',[string]);
    [void]$B.Columns.Add('ProductID',[int]);
    [void]$B.Columns.Add('ProductNumber',[string]);
    [void]$B.Columns.Add('ReorderPoint',[int]);
    [void]$B.Columns.Add('SellEndDate',[datetime]);
    [void]$B.Columns.Add('SellStartDate',[datetime]);
    [void]$B.Columns.Add('Style',[string]);

    $B.PrimaryKey = $B.Columns['ProductID','SellStartDate'];

    function AddItemB([Bi]$item)
    {
        $row = $B.NewRow();
        foreach($i in $item)
        {
            $row.Color = $i.Color;
            $row.DaysToManufacture = $i.DaysToManufacture;
            $row.ListPrice = $i.ListPrice;
            $row.Name = $i.Name;
            $row.ProductID = $i.ProductID;
            $row.ProductNumber = $i.ProductNumber;
            $row.ReorderPoint = $i.ReorderPoint;
            $row.SellEndDate = $i.SellEndDate;
            $row.SellStartDate = $i.SellStartDate;
            $row.Style = $i.Style;
        }
        $B.Rows.Add($row);
    }

foreach($i in (Import-Csv A:\resultsB.csv -Header Color,DaysToManufacture,ListPrice,Name,ProductID,ProductNumber,ReorderPoint,SellEndDate,SellStartDate,Style))
{
    $Bii = [Bi]::new($i.Color,$i.DaysToManufacture,$i.ListPrice,$i.Name,$i.ProductID,$i.ProductNumber,$i.ReorderPoint,$i.SellEndDate,$i.SellStartDate,$i.Style);
    AddItemB($Bii);
}

$C = [ArrayList]::new();

foreach($r in $A)
{
    $key = $r.ProductID,$r.ModifiedDate;
    if($B.Rows.Find($key))
    {
        [void]$C.Add($r);
    }
}

return $A.Count, $B.Rows.Count, $C.Count;

Results

Measure of Total Operation

measureTotalOp

Measure Operation with preloaded A ($LoadSetA = $false)

measureBCOnly

Measure Load A

measureLoadA

Measure Find and Load C

measureMatchAndAdd

1 Like

Thanks very much Dicey, I will take a look at your reply, and create some input
files on my system and learn from your examples, much appreciated :slight_smile:

Charlie

1 Like