Compare-Object Optimations

Hi everyone, if possible could someone please point me in the right direction with optimising this script.
Essentially it is importing a directory search, a csv, performing a compare then outputting matching results.

If I only have a few items in the test directory it works without issue and is very quick
However when on the real directory with a few thousand items, with real data (a few thousand items)
I have just worked out that this single line is going to take approximately 33 days to complete.

I wouldn’t say I’m impatient but… I might need it to run slightly quicker than that! :slight_smile:

Any help greatly appreciated

$DataFilter is the problem causer :slight_smile:

# $SearchPath is searched for PDF invoices including subdirectories #
$SearchPath = "C:\Somepath\Live"
# $ScanDirectories is the RESULT of a directory search including only .PDF files - This variable is passed into the array called $ScanArray #
$ScanDirectories =@(get-childitem -Path $SearchPath -Include *.pdf -Recurse | Select-Object Basename, Fullname)
# $SQLExport is the file exported from  (SQL) with the Order Details. The column we need is "InternalRef" #
$SQLExport = "C:\Somepath\Live\SQLExport.csv"
# This file is used for outputting the results for the Missing invoices, i.e files that cannot be find based on the Dimensions reference #
$PDFInvoiceReport = "C:\InvoiceScript\PDFInvoiceReport.csv"
# $ExportInvoices Empty Array to hold final script results
$ExportInvoices =@()

#################################################
#################### ARRAYS #####################
#################################################

# $SQLArray array is a list of all order numbers found in SQL Export "InternalRef" Column)
$SQLArray = @((import-csv $SQLExport).InternalRef) | sort-object
# $ScanArray array is populated with a trimmed file basename from the $ScanDirectories search, basically a list of all PDF files with 0's removed from the basename
$ScanArray = @(($ScanDirectories).Basename.Trim('0')) | Sort-Object
# $DataFile is the filtered comparison between the arrays


#### THE LINE THAT IS GOING TO TAKE 33 DAYS TO FINISH #################################################################################################################################################
## FOREACH was added just to check it was actually doing something - each item takes approximately 6 seconds to be compared
$DataFilter = $ScanDirectories | where {Compare-Object -ReferenceObject $ScanArray -DifferenceObject $SQLArray -IncludeEqual -ExcludeDifferent | Where-Object sideindicator -EQ '==' | Select-Object inputobject } | foreach($obj) { write-host "Found a matching file - This takes a long time to complete "}


# $HyperLink is just a string to add to $DataFilter.$line so that it can create a Hyperlink in excel for the export
$Hyperlink ='=HYPERLINK("'
# $Quote is just a quote mark to append to $Hyperlink+$Line.FullName+$Quote i.e. =HYPERLINK("$Line.FullName")
$Quote = '")'


ForEach($Line in $DataFilter)  {
write-host "Found invoice for OrderRef" $Line.BaseName.Trim('0') "getting ready to export"
$LineExport = New-Object PSObject
$LineExport | Add-Member -MemberType NoteProperty -Name "InternalRef" -Value $Line.BaseName
$LineExport | Add-Member -MemberType NoteProperty -Name "Location - Click the path to open Invoice" -Value ("{0}{1}{2}" -f $HyperLink, $Line.FullName, $Quote)
$ExportInvoices += $LineExport }

$export = $ExportInvoices | export-csv -path $PDFInvoiceReport -NoTypeInformation


Hmmm …

You are over commenting your code. Anyone knowing just a little bit of PowerShell knows that this is a variable definition. Please do not add a comment for every single line of code. That makes the code even harder to read.

You are limitting yourself with this Select-Object. Why not keeping the rest of the properties? You might use them later on. :wink:

That’s unnecessary in the vast majority of the cases I know of. PowerShell implicitly creates an array for you if a query returns more than one result. If you want to cast a query to an array use this @() around your actual query.

That’s what I meant in the other thread. Instead of limitting the objects to a single property you should enrich them with a property with the same name to compare agains each other.
In this case you should either add the property BaseName to your array $SQLArray if there is none. Or add the property InternalRef containing actually the BaseName.

Now … your enemy … :wink: With this line of code you’re basically running the comparison for each single element in your array $ScanDirectories again and again and again.

On top of that

When you use -IncludeEqual and -ExcludeDifferent you don’t need this particular Select-Object because there are only results with the SideIndicator ==.
And

That does not make sense because you actually don’t have an input object without using the parameter -PassThru for Compare-Object and the objects you used for comparison are stripped off of all useful properties by your earlier commands. :wink:

So without actually really understanding what you want to do you should start with something like this:

$SearchPath = "C:\Somepath\Live"
$SQLExport = "C:\Somepath\Live\SQLExport.csv"
$PDFInvoiceReport = "C:\InvoiceScript\PDFInvoiceReport.csv"

$SQLArray = Import-Csv $SQLExport | Select-Object -Property *,@{Name = 'BaseName'; Expression = {$_.InternalRef}}
$ScanDirectories = Get-ChildItem -Path $SearchPath -Include *.pdf -Recurse 

$DataFilter = 
    Compare-Object -ReferenceObject $ScanDirectories -DifferenceObject $SQLArray -Property 'BaseName' -IncludeEqual -ExcludeDifferent 

Hi again Olaf thanks for the speedy response,
Reading your comments I had to chuckle at seeing how bad I am at this hahaha
Will deffo make a note of the points :slight_smile:

Yes sorry the commenting is more for me than anyone else.
I have just tested your script and it completed in about 20 seconds :slight_smile:
One thing I do need to do however is pass the “fullname” of the file into the csv export.

So it will read like;
12345 c:\somepath\somefile.pdf

This is the reason why the $datafilter variable was set to reference the $Scandirectory
That was the only way I could get it to work with my limited knowledge, worked fine with 10 test items. Nearly 8000 items however caused a “slight” performance issue :rofl:

Thanks

Don’t worry. We all started small once. :wink:

I’m unsure if I got this right. But if that’s the property you want to compare you have to have it in both arrays.

If you want to compare two objects or two arrays of objects you don’t have to limit the objects to one single property each. Instead you have to make sure that both objects or arrays of objects have one property in common you can provide for Compare-Object for comparison.

Hi Olaf sorry for the late reply.

I would like to compare the SQLexport-internal ref to file basenames (with the 00’s stripped from the basename so they match up correctly)

If there is a match between the basenames(0’s removed and the Internal ref), from there I need to export the basename and the filepath which is found under the “fullname” property of the get-childitem command.

Does that make sense?

Thanks

Hi Olaf here is a very high level diagram which “may” help

It does actually. :wink:

Try this:

$SearchPath         = 'C:\Somepath\Live'
$SQLExport          = 'C:\Somepath\Live\SQLExport.csv'
$PDFInvoiceReport   = 'C:\InvoiceScript\PDFInvoiceReport.csv'

$SQLArray = 
    Import-Csv -Path $SQLExport | 
        Select-Object -Property *,@{Name = 'StrippedBaseName'; Expression = {$_.InternalRef}}
$ScanDirectories = 
    Get-ChildItem -Path $SearchPath -Include *.pdf -Recurse | 
        Select-Object -Property *, @{Name = 'StrippedBaseName'; Expression = { $_.BaseName -as [int] } } 

$Comparison = 
    Compare-Object -ReferenceObject $ScanDirectories -DifferenceObject $SQLArray -Property 'StrippedBaseName' -IncludeEqual -ExcludeDifferent -PassThru
$Comparison |
    Export-Csv -Path $PDFInvoiceReport -NoTypeInformation
$Comparison |
    Select-Object -Property BaseName,FullName

If it does not show the expected results you may try to switch the values of -ReferenceObject with the -DifferenceObject.

Hi Olaf, just tried this, does appear to take a very long time to run
I will try and add a “foreach” into the line just to check how long it takes to find each item. It however appears to stop on the $Scandirectories variable now, presumably the expressions to declare as integer

image

Hi Olaf, apologies, it would appear that this is actually working on the test directories with a few items.

I think potentially it may be the sheer volume of items that is causing the problem.
I don’t know why it would say it can’t find an item however when running on the real directory :slight_smile:

If performance matters you should measure it. Read the help for

I assume “L:” is a network drive, isn’t it? I’d recommend to query the needed list locally on the server where the folder physically is. That would speed up.

That is a very good idea indeed,
I will move everything over to the file server and give it a try as we are also co-located, so even though I am in the office, the data is technically elsewhere, I have a few meeting shortly but will come back as soon as I can.

Thanks

Hi Olaf, I have ran this on the actually server and it now takes about 5 minute to complete.

I’m going to do a few ammendments for formatting purposes on the export but the core comparison part of the script works wonderfully now.

Once again, you are a gentleman, thanks for your help :slight_smile:

Great. I’m glad it helped. :+1:t4: :smiley:

Just one last question on this if you don’t mind.
Don’t worry it still works :slight_smile:

If I remove the select-object from the end of comparison array, I get all the additional info from the $Scandirectory array.
However if I reverse this, i.e. try to pull it from the $SQLArray, it still pulls the basename info but nothing else?

Any ideas, this is additional to original however it would actually be really useful to have.

image

Why should I start to mind now? :wink:

I don’t know how to explain that correctly. Usually only the reference object will be part of the output when you use -PassThru.
You might play a little bit for yourself:

$Data1 = 
@'
First,Last,Compare
John,Lennon,1
George,Harrison,2
Paul,McCartney,3
Ringo,Starr,4
'@ |
    ConvertFrom-Csv

$Data2 = 
@'
Name,Compare
Mick,1
Ron,5
Keith,3
Charlie,8
'@ |
    ConvertFrom-Csv

Compare-Object -ReferenceObject $Data1 -DifferenceObject $Data2 -Property Compare -PassThru -IncludeEqual

Compare-Object -ReferenceObject $Data2 -DifferenceObject $Data1 -Property Compare -PassThru -IncludeEqual

Please run the two Compare-Object command lines separately!!