Find duplicate files in multiple folders

Hello, I’m putting together a script that takes a file from a user’s workstation and places it in three network folders. Two of those folders are checked to make sure the source file isn’t already in there. The script was working great with just one network folder to check, but now I need it to look in two network folders and I’m not sure how to handle that. The file name has this signature: someCode_datetimestamp.pdf. The seconds part of the timestamp is incremented until the new file name is unique, in other words until it isn’t found in either of those folders. I could use some suggestions on how this should be handled. Any help is greatly appreciated.

[pre]
[CmdletBinding()]
Param(
[Parameter(Mandatory=$True,Position=1)]
[string]$someid,

[Parameter(Mandatory=$True,Position=2)]
[string]$OffCode
)

<#
input parameters from batch file: someid, OffCode
someid example : ‘bmx15321’
offcode example : ‘ABC’
#>

cls

<# #######################################################

this copies and removes files from users local C:\temp\UserDocs\ folder and puts them in three network locations.
this script will rename files if a similar name is found in the destination folders AND the file contents are different.
the file name is incremented by 1 second until there is a unique file name available in dest and dest3. (until there is no match in dest and dest3)
if the file is the same name and same content, it will not be processed. ( will be skipped )
if the file is the same name and different content, it will process, increment time and rename the file.
if the file is a different name but same content, it will process, increment time and rename the file.
if the file is a different name and different content, it will copy & delete original. (will not increment timestamp)

####################################### #>

# use office to prepend officeCode to file names

if ( $OffCode -eq “ABC” ) {
$officeCode = “01”
$officeAbbr = “ABC”
}
if ( $OffCode -eq “XYZ” ) {
$officeCode = “02”
$officeAbbr = “XYZ”
}

$src = “C:\temp\UserDocs”

$dest = "\\CompanyServer\Destination" # check for dupes
$dest2 = "\\CompanyServer\Goober"+$officeAbbr+"Destination2" # we are NOT checking dupes on this one.
$dest3 = "\\CompanyServer\Destination3" # check for dupes

$errNoOffice = “There was no office code found when attempting to copy files from users Temp folder.”
$errDupeFound = "A duplicate file was found in your temp directory. This file already exists in the destination folder for " + $someid
$PSEmailServer = "mailrelay.CoolCompany.com"
$APPLIMail = "techSupport@CoolCompany.com"

# similar to Get-FileHash. use this to get hashes for all files in a folder

Function Get-FileHashTSO([String] $FileName,$HashName = "SHA1")
{
# open read-only so read-only files can still be hashed
$FileStream = New-Object System.IO.FileStream($FileName,[System.IO.FileMode]::Open,[System.IO.FileAccess]::Read)
$StringBuilder = New-Object System.Text.StringBuilder
try {
[System.Security.Cryptography.HashAlgorithm]::Create($HashName).ComputeHash($FileStream)|%{[Void]$StringBuilder.Append($_.ToString("x2"))}
}
finally {
# always release the file handle, even if hashing fails
$FileStream.Close()
}
$StringBuilder.ToString()
}

# listOfHashesFromDest represents the files in the destination folder

$listOfHashesFromDest = New-Object System.Collections.ArrayList;
foreach($destFile in Get-ChildItem $dest)
{
$destHash = Get-FileHashTSO $destFile.FullName “SHA1”
$listOfHashesFromDest.add($destHash.ToString())
}

# listOfHashesFromDest3 represents the files in the destination 3 folder

$listOfHashesFromDest3 = New-Object System.Collections.ArrayList;
foreach($dest3File in Get-ChildItem $dest3)
{
$dest3Hash = Get-FileHashTSO $dest3File.FullName “SHA1”
$listOfHashesFromDest3.add($dest3Hash.ToString())
}

# only continue if a valid office code is found

if ( $officeAbbr -eq “ABC” -OR $officeAbbr -eq “XYZ” ) {

# the files from the users local temp folder 
$srcFiles  = Get-ChildItem -Path $src -Filter *.pdf 


# for each file in the source folder we want to see if there is a duplicate file in dest and dest3.
# we'll check the hashes to make sure it's a real duplicate. 
# if the file has the same name but a different hash, it's not really a duplicate and we try to move it to the dest and dest3 network locations.
# when it has a duplicate we attempt to rename the file by incrementing the "Seconds" timestamp by 1 second. 
# we increment the timestamp until there is no file in dest and dest3 with the same name.

foreach ($file in $srcFiles) {

    # only process files with date signature. does not process files like 'yaddayadda.pdf'
    if ( $file.Name -match "\d{14}.pdf" -OR $file.Name -match "\d{8}_\d{6}.pdf" ) {


               
                $fileName= $file.BaseName
                $fileDate= $fileName.Substring(0,8)
                $fileTimeStamp=  $fileName.Substring($fileName.Length -6,6)    

                $fileFullPathDest = "$dest\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf"
				$fileFullPathDest3 = "$dest3\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf"

				# test if the source file is in any of the destination folders.
                # if file NOT found in destination folder. it's not a duplicate. copy to dest,dest2,dest3 and delete the source file
                if(
					!(Test-path ("$dest\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf")) -AND
					!(Test-path ("$dest3\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf"))
				)  {
                    
					# there were no dupes do regular stuff
                    Copy-Item -Path $file.FullName -Destination ("$dest\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf")  
                    Copy-Item -Path $file.FullName -Destination ("$dest2\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf") 
					Copy-Item -Path $file.FullName -Destination ("$dest3\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf") 
                    Remove-Item $file.FullName -Force
        
                 } else {
                 
                         # check if the file is already in the destination folders. this checks the content of the file, not just the file name.
                         # ----------------------------------------------------------------------------------------------------------------
                         $hasADupe = "No" 
                         $testFile = $file.FullName
                         $srcFileHash = Get-FileHashTSO $testFile "SHA1"
                         
                         if (($listOfHashesFromDest -contains $srcFileHash) -OR ($listOfHashesFromDest3 -contains $srcFileHash))
						 {
                            $hasADupe = "Yes"
                         }
                         # ---------------------------------------------------------------------------------------------------------
                         


                     # file found in destination folder AND content is different (same file name but it's actually a different file) 
                     if(((Test-Path $fileFullPathDest) -or (Test-Path $fileFullPathDest3)) -and $hasADupe -eq "No") {  

							# file name exists in one of the destination folders now lets see which one.
							$foundInDest = "No"
							$foundInDest3 = "No"
							
							if(Test-path $fileFullPathDest){
								$foundInDest="Yes"
							}
							if(Test-path $fileFullPathDest3){
								$foundInDest3="Yes"
							}
                            
							# this next block is the trouble spot. I'm not sure how to check dest and dest3 as a unit and increment the timestamp
							# until there is NOT a file, by the same name, in both dest AND dest3.
							# this was working great with only dest but now I have to incorporate dest3 in the logic.
							# /////////////////////////// start problem spot //////////////////////////
							
                            # do this for each time the filename is found in the destination folder
                            while (Test-Path($fileFullPathDest) ) {   
      
                                # increment time 1 second, pad with leading zero if necessary
                                # -------------------------------------------------------------
                                $myTime = $fileTimeStamp
                                $myTime = [datetime]::ParseExact($myTime,'HHmmss',$null)
                                $myTime = $myTime.AddSeconds(1)
                                $myTime = "{0:HHmmss}" -f [datetime]$myTime   # HH keeps the 24-hour clock; hh would turn 13:00:00 into 010000
                                $ts = $myTime
                                # -------------------------------------------------------------
                                        
                                # set the file name with the new incremented time value                                                                          
                                $fileFullPathDest="$dest\"+$officeCode+"_"+$fileDate+"_"+$ts+".pdf"
                                $fileTimeStamp=$ts
                        
                            } # end while 

                            # write-host "finished increment"
							# ////////////////////////// end problem spot ////////////////////////////

							
							# final step 
                            # use new timestamp to rename file in all destination folders. delete source file. 
                            Copy-Item -Path $file.FullName -Destination ("$dest\"+$officeCode+"_"+$fileDate+"_"+$ts+".pdf") 
                            Copy-Item -Path $file.FullName -Destination ("$dest2\"+$officeCode+"_"+$fileDate+"_"+$ts+".pdf") 
							Copy-Item -Path $file.FullName -Destination ("$dest3\"+$officeCode+"_"+$fileDate+"_"+$ts+".pdf") 
                            Remove-Item $file.FullName -Force
							
							
                      
                      } else {
                      
                        
                            # a duplicate file was found. send a notification email to the user
                            $PSEmailServer = "mailrelay.CoolCompany.com"
                            Send-MailMessage -From $APPLIMail -To $APPLIMail `
                            -Subject "Duplicate found" -Body "Attention: $errDupeFound The file is : $file "   
                      
                      } # if(Test-path $fileFullPathDest -or Test-path $fileFullPathDest3 -and $hasADupe -eq "No")
                       
                              
                 } # end if(!(Test-path ("$dest\"+$officeCode+"_"+$fileDate+"_"+$fileTimeStamp+".pdf")) )             

                

    } # end if ( $file.Name -match "\d{14}.pdf" -OR $file.Name -match "\d{8}_\d{6}.pdf" )


} # end foreach($file in $srcFiles)

} else {

# no office codes match. skip everything and send a notification email.
$PSEmailServer = "mailrelay.CoolCompany.com"
Send-MailMessage -From $APPLIMail -To $APPLIMail -Subject "No office code found" -Body "Error: $errNoOffice. Contact $someid for details"                 

} # end if $officeAbbr -eq “HO” …

[/pre]

Something like loading the file names into an object and then using Compare-Object might work:

$Reference = Get-ChildItem -Path C:\BaseDir
$Difference = Get-ChildItem -Path C:\DifferenceDir
Compare-Object -ReferenceObject $Reference -DifferenceObject $Difference -Property Name, Length -PassThru -ExcludeDifferent -IncludeEqual | ForEach-Object {
    $Item= $_
    $Item
    $Difference | Where-Object { $_.Name -eq $Item.Name -and $_.Length -eq $Item.Length }
} | Select-Object -ExpandProperty FullName

Thanks aj2019. That does return the file that is common to two folders, but I don’t think that helps me because it only compares against one folder. I need to check against two folders and keep incrementing the source file name until it is not found in either of them. I think the functionality of your code is similar to what the hash operations currently do. Would you suggest replacing the current hash compare with the Item.Length comparison?

If file name and length are enough to guarantee uniqueness, then it would be faster to skip hashing the content and just go with the Name and Length properties.

If the hashes were computed in the ForEach-Object after the Compare-Object above, I think they would only run when name and length already match, so perhaps that’s a non-issue.
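One way to read the point above is to hash only the candidates whose name and length already match. The following is a minimal sketch, assuming PowerShell 4.0 or later (for `Get-FileHash`); `Test-SameFile` is a hypothetical helper name and not part of the original script.

```powershell
# Hypothetical helper: returns $true when two files share the same name,
# the same length and, checked last because it is the expensive step,
# the same SHA1 hash.
function Test-SameFile {
    param(
        [Parameter(Mandatory = $true)][System.IO.FileInfo]$Source,
        [Parameter(Mandatory = $true)][System.IO.FileInfo]$Candidate
    )

    # Cheap checks first: files that differ in name or length return
    # early and are never hashed.
    if ($Source.Name -ne $Candidate.Name) { return $false }
    if ($Source.Length -ne $Candidate.Length) { return $false }

    $sourceHash    = (Get-FileHash -LiteralPath $Source.FullName -Algorithm SHA1).Hash
    $candidateHash = (Get-FileHash -LiteralPath $Candidate.FullName -Algorithm SHA1).Hash
    return ($sourceHash -eq $candidateHash)
}
```

Called as `Test-SameFile (Get-Item $localPath) (Get-Item $networkPath)`, this should avoid reading large network files whenever the directory listing already shows they differ.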

I guess the snippet above could be made into a function; then it could be called multiple times, once for each directory to compare against.

Or alternatively, the difference object passed into it could be set to contain the file names from all the targets.
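The "all the targets in one object" idea might look roughly like this. It is only a sketch: `Get-TakenNames` is a hypothetical helper, and the `::new()` constructor syntax assumes PowerShell 5.0 or later.

```powershell
# Hypothetical helper: collect the names of all PDFs from every folder
# that has to be checked into a single case-insensitive set.
function Get-TakenNames {
    param([Parameter(Mandatory = $true)][string[]]$Folders)

    $taken = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase)

    foreach ($folder in $Folders) {
        Get-ChildItem -LiteralPath $folder -Filter *.pdf -File | ForEach-Object {
            [void]$taken.Add($_.Name)
        }
    }

    # The leading comma stops PowerShell from unrolling the set into
    # its individual elements on return.
    return ,$taken
}

# Usage (folder variables as in the script above):
#   $taken = Get-TakenNames -Folders $dest, $dest3
#   $taken.Contains('01_20190102_030405.pdf')
```

The set is a snapshot taken at one moment, so any name the script itself writes afterwards would have to be added with `$taken.Add(...)` to keep it accurate.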

Thanks for your input aj2019. I think you’re right about the hash consideration being a non-issue. And you are spot on about the comparison function. That’s really the crux of the whole thing. How would something like that be implemented within the script I have? You see, the increment part is important. I want to increment the seconds part of a file’s timestamp but also check the other folder for that newly created file name to make sure it’s not in that folder already. How do I need to modify the script to do that? There must be some standard way of handling this kind of thing.
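One possible shape for that increment step is to treat the folders as a unit: a name counts as taken if it exists in any of them, and the timestamp keeps advancing until it exists in none. The sketch below assumes the `<officeCode>_<yyyyMMdd>_<HHmmss>.pdf` naming used in the script; `Get-FreeStampedName` and its parameters are hypothetical names.

```powershell
# Hypothetical helper: starting from a date (yyyyMMdd) and time (HHmmss),
# add one second at a time until "<prefix>_<date>_<time>.pdf" exists in
# none of the given folders, then return that free file name.
function Get-FreeStampedName {
    param(
        [Parameter(Mandatory = $true)][string]$Prefix,
        [Parameter(Mandatory = $true)][string]$Date,
        [Parameter(Mandatory = $true)][string]$Time,
        [Parameter(Mandatory = $true)][string[]]$Folders
    )

    $culture = [System.Globalization.CultureInfo]::InvariantCulture
    # Parse date and time together so 23:59:59 rolls over to the next day.
    $stamp = [datetime]::ParseExact("${Date}_$Time", 'yyyyMMdd_HHmmss', $culture)

    while ($true) {
        # HH is the 24-hour clock; hh would turn 13:00:00 into 010000.
        $name = '{0}_{1}.pdf' -f $Prefix, $stamp.ToString('yyyyMMdd_HHmmss', $culture)

        # The name is taken if it exists in at least one folder.
        $isTaken = $false
        foreach ($folder in $Folders) {
            if (Test-Path -LiteralPath (Join-Path $folder $name)) {
                $isTaken = $true
                break
            }
        }

        if (-not $isTaken) { return $name }
        $stamp = $stamp.AddSeconds(1)
    }
}
```

In the script this could be called as `$name = Get-FreeStampedName -Prefix $officeCode -Date $fileDate -Time $fileTimeStamp -Folders $dest, $dest3`, after which the same `$name` would be used for the copies to `$dest`, `$dest2` and `$dest3`. If the original name is already free, it is returned unchanged.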

Glad if it’s any help. Maybe it’s possible to avoid bringing that logic to the top of the script, and instead attempt the write in every case, creating the file only when that name does not exist yet:

if ( -Not (Test-Path $file) )
{
New-Item -Path $file -ItemType File
}
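Taking "attempt the write in every case" one step further, the copy itself can report the collision, which avoids a gap between the check and the write. This is a sketch only, assuming .NET's `[System.IO.File]::Copy` with its overwrite argument set to `$false` (it throws when the destination already exists) and full paths for both arguments; `Copy-WithoutOverwrite` is a hypothetical name.

```powershell
# Hypothetical helper: try to copy a file without overwriting anything.
# Returns $true if the copy happened, $false if the destination name
# was already taken. Any other failure is rethrown.
function Copy-WithoutOverwrite {
    param(
        [Parameter(Mandatory = $true)][string]$SourcePath,
        [Parameter(Mandatory = $true)][string]$DestinationPath
    )

    try {
        # Third argument $false = do not overwrite an existing file.
        [System.IO.File]::Copy($SourcePath, $DestinationPath, $false)
        return $true
    }
    catch {
        # Treat the failure as a name collision only when the
        # destination really exists; otherwise surface the error.
        if (Test-Path -LiteralPath $DestinationPath) { return $false }
        throw
    }
}
```

A caller could loop on this, incrementing the timestamp each time it returns `$false`, instead of testing for the file first.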