System.IO.FileInfo streaming files

Use case
There are more than 10 million files in the source location. At some point, these were archived to a separate location. I need to verify that the files in the source location are identical to the archived files, since there is no record of them ever being validated, i.e. were the files copied across successfully, and if not, output a list of the mismatches.

Plan
Return a unique hash for each file, e.g. MD5 or SHA-256.

Evaluating options
As I understand it, System.IO.FileInfo streams files. In comparison, Get-ChildItem reads everything into memory first. Since each file needs to be hashed, my thought is to loop through each file, e.g. foreach { #perform hash | out-file }.
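
Roughly the kind of loop I have in mind, as a sketch only (the paths, the SHA-256 choice and the report file name are placeholders rather than a final design):

# Sketch: enumerate the files, hash each one, append the result to a report file
Get-ChildItem -Path 'D:\source' -File -Recurse | ForEach-Object {
    Get-FileHash -Path $_.FullName -Algorithm SHA256 |
        Select-Object Hash, Path |
        Out-File -FilePath 'D:\hash-report.txt' -Append
}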

Given the sheer number of files, performance is key. Assuming System.IO.FileInfo streams files, I’m guessing that as each file is enumerated it is passed straight into the foreach {} loop. Or does it first build a list of all the files, which are subsequently processed?

If the latter, how does this differ from Get-ChildItem?
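
To illustrate the two behaviours I’m asking about, here is a rough sketch using the static System.IO.Directory methods (my assumption for illustration; I’m not certain which behaviour FileInfo or Get-ChildItem actually matches):

# Eager: GetFiles() returns a string[] containing every path,
# so the entire list is built before the loop body runs once
$paths = [System.IO.Directory]::GetFiles('D:\source', '*', 'AllDirectories')
foreach ($path in $paths) { <# hash $path #> }

# Lazy: EnumerateFiles() returns an IEnumerable that yields paths
# one at a time, so hashing can start before enumeration has finished
foreach ($path in [System.IO.Directory]::EnumerateFiles('D:\source', '*', 'AllDirectories')) {
    <# hash $path #>
}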

You can use the ViceVersa software; download the trial for 30 days.

Starting with ViceVersa PRO version 3, there is a new option called ‘Use also SHA-256 hash for comparison’.

You can use this option with CRC comparison.

When this option is enabled, ViceVersa calculates both the SHA-256 and CRC32 values during the comparison and uses them to compare file contents.
Using SHA-256 is more reliable than using CRC32 alone to compare file contents; however, it is slower.

Thanks for the suggestion @maxjamakovic. Using third-party software on controlled devices isn’t really an option at this stage, hence the question about native solutions such as PowerShell.

You don’t have to install the software on the device where the data is; just use a sandbox device. ViceVersa supports UNC paths.

Not sure if this is optimal enough for you, but it took me two minutes to compare the hashes of 200k files (100k in each folder).

# Collect the full paths of all files in each folder
$file = (Get-ChildItem .\Desktop\source).FullName
$file2 = (Get-ChildItem .\Desktop\archive).FullName

# Hash both sets of files (Get-FileHash defaults to SHA-256)
$hash1 = Get-FileHash -Path $file
$hash2 = Get-FileHash -Path $file2

$splat = @{
    ReferenceObject  = $hash1
    DifferenceObject = $hash2
    Property         = 'Hash'
    IncludeEqual     = $true
    PassThru         = $true
}

# Compare the two hash sets and export the result, including which side each hash was found on
Compare-Object @splat | Select-Object Hash, SideIndicator, Path |
    Export-Csv -Path .\Desktop\filehash.csv -NoTypeInformation

@random-commandline, thanks. Given that there are files larger than 4 TB, Get-ChildItem and Get-FileHash may not be optimal. As far as I understand, System.IO.FileInfo and System.IO.DirectoryInfo are likely to be faster, although I’m unsure whether System.IO.FileInfo streams files and hashes them concurrently while looping through a large number of files.

Would you know if it streams, or does it pre-build a list of files first?
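
For the very large files, the kind of per-file streaming I’m hoping for would look something like this, as a sketch (I’m assuming SHA-256 via .NET here for illustration, not stating that this is how Get-FileHash works internally):

# Sketch: hash a single file by streaming it through SHA-256,
# so the whole file never has to fit in memory (relevant for the >4TB files)
$sha256 = [System.Security.Cryptography.SHA256]::Create()
$stream = [System.IO.File]::OpenRead($path)   # $path is a placeholder for one file's full path
try {
    $bytes = $sha256.ComputeHash($stream)     # reads the stream in blocks internally
    $hash  = [System.BitConverter]::ToString($bytes) -replace '-', ''
}
finally {
    $stream.Dispose()
    $sha256.Dispose()
}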

@maxjamakovic - Setting up a sandbox may be an option. As with many large organizations, provisioning it will take time (change management, approvals, etc.). Is there a particular reason you are suggesting ViceVersa instead of PowerShell?

I have used it many times for migrations, once with 50 million files. It just works.