System.IO.FileInfo streaming files

Use case
There are more than 10 million files in the source location. At some point, these were archived to a separate location. I need to verify that the files in the source location are identical to the archived files, since there is no record of them ever being validated, i.e. were the files copied across successfully, and if not, output a list of the mismatches.

Plan
Return a unique hash for each file, e.g. MD5 or SHA-256.

Evaluating options
As I understand it, System.IO.FileInfo streams files. In comparison, Get-ChildItem reads everything into memory first. Since each file needs to be hashed, my thought is to loop through each file, e.g. foreach { #perform hash | out-file }.
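
Roughly the kind of loop I have in mind, as a sketch only (the paths, the SHA-256 choice and the report file name are placeholders rather than a final design):

# Sketch: enumerate the files, hash each one, append the result to a report file
Get-ChildItem -Path 'D:\source' -File -Recurse | ForEach-Object {
    Get-FileHash -Path $_.FullName -Algorithm SHA256 |
        Select-Object Hash, Path |
        Out-File -FilePath 'D:\hash-report.txt' -Append
}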

Given the sheer number of files, performance is key. Assuming System.IO.FileInfo streams files, I’m guessing that as each file is enumerated it is passed straight into the foreach {} loop. Or does it first build a list of all the files, which are subsequently processed?

If the latter, how does this differ from Get-ChildItem?
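
To illustrate the two behaviours I’m asking about, here is a rough sketch using the static System.IO.Directory methods (my assumption for illustration; I’m not certain which behaviour FileInfo or Get-ChildItem actually matches):

# Eager: GetFiles() returns a string[] containing every path,
# so the entire list is built before the loop body runs once
$paths = [System.IO.Directory]::GetFiles('D:\source', '*', 'AllDirectories')
foreach ($path in $paths) { <# hash $path #> }

# Lazy: EnumerateFiles() returns an IEnumerable that yields paths
# one at a time, so hashing can start before enumeration has finished
foreach ($path in [System.IO.Directory]::EnumerateFiles('D:\source', '*', 'AllDirectories')) {
    <# hash $path #>
}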

You can use the ViceVersa software; download the trial for 30 days.

Starting with ViceVersa PRO version 3, there is a new option called ‘Use also SHA-256 hash for comparison’.

You can use this option with CRC comparison.

When this option is enabled, ViceVersa calculates both the SHA-256 and CRC32 values during the comparison and uses them to compare file contents.
Using SHA-256 is more reliable than using CRC32 alone to compare file contents; however, it is slower.

Thanks for the suggestion @maxjamakovic. Using third-party software on controlled devices isn’t really an option at this stage, hence the question about native solutions such as PowerShell.

You don’t have to install the software on the device where the data is; just use a sandbox device. ViceVersa supports UNC paths.

Not sure if this is optimal enough for you, but it took me two minutes to compare the hashes of 200k files (100k in each folder).

# Collect the full paths of all files in each folder
$file = (Get-ChildItem .\Desktop\source).FullName
$file2 = (Get-ChildItem .\Desktop\archive).FullName

# Hash both sets of files (Get-FileHash defaults to SHA-256)
$hash1 = Get-FileHash -Path $file
$hash2 = Get-FileHash -Path $file2

$splat = @{
    ReferenceObject  = $hash1
    DifferenceObject = $hash2
    Property         = 'Hash'
    IncludeEqual     = $true
    PassThru         = $true
}

# Compare the two hash sets and export the result, including which side each hash was found on
Compare-Object @splat | Select-Object Hash, SideIndicator, Path |
    Export-Csv -Path .\Desktop\filehash.csv -NoTypeInformation

@random-commandline, thanks. Given that there are files larger than 4 TB, Get-ChildItem and Get-FileHash may not be optimal. As far as I understand, System.IO.FileInfo and System.IO.DirectoryInfo are likely to be faster, although I’m unsure whether System.IO.FileInfo streams files and hashes them concurrently while looping through a large number of files.

Would you know if it streams, or does it pre-build a list of files first?
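
For the very large files, the kind of per-file streaming I’m hoping for would look something like this, as a sketch (I’m assuming SHA-256 via .NET here for illustration, not stating that this is how Get-FileHash works internally):

# Sketch: hash a single file by streaming it through SHA-256,
# so the whole file never has to fit in memory (relevant for the >4TB files)
$sha256 = [System.Security.Cryptography.SHA256]::Create()
$stream = [System.IO.File]::OpenRead($path)   # $path is a placeholder for one file's full path
try {
    $bytes = $sha256.ComputeHash($stream)     # reads the stream in blocks internally
    $hash  = [System.BitConverter]::ToString($bytes) -replace '-', ''
}
finally {
    $stream.Dispose()
    $sha256.Dispose()
}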

@maxjamakovic - Setting up a sandbox may be an option. As with many large organizations, provisioning it will take time (change management, approvals, etc.). Is there a particular reason you are suggesting ViceVersa instead of PowerShell?

I have used it many times for migrations, once with 50 million files. It just works.