Newbie advice please

The following script works to obtain a list of duplicate files but I’m sure there is most likely a briefer solution. The final columns I require are the full path+filename, hash, file length and count. I found similar examples on the web (and the MS online help and SS64 are useful) but I wanted to have a bash at my own solution. Any assistance/advice much appreciated.

$hash = @{label="HashX";expression={(get-filehash -literalpath $_.fullname).Hash}}
Get-ChildItem -file -Recurse |
where-object {$_.length -gt 0} |
select-object FullName,Length |
group -property length |
Where-Object Count -gt 1 |
select-object -expandproperty group |
select-object -property fullname,$hash,length |
group -property HashX |
where { $_.count -gt 1 } |
Select-Object -property count -expandproperty group |
Sort-Object -Property @{Expression = {[long]$_.length}; Descending = $True}, @{Expression = "HashX"; Descending = $False}, @{Expression = "FullName"; Descending = $False} |
Export-Csv -Path "E:\Duplicates.csv" -Encoding ascii -NoTypeInformation

If I understood your requirement properly,

Doesn’t the below give what you need?

$hash = @{label="HashX";expression={(get-filehash -literalpath $_.fullname).Hash}}
Get-ChildItem -file -Recurse | where-object {$_.length -gt 0} | select-object -property fullname,$hash,length |
Sort-Object -Property @{Expression = {[long]$_.length}; Descending = $True}, @{Expression = "HashX"; Descending = $False}, @{Expression = "FullName"; Descending = $False} | Export-Csv -Path "E:\Duplicates.csv" -Encoding ascii -NoTypeInformation

[quote quote=155336]If I understood your requirement properly,

Doesn’t the below give what you need?[/quote]

Hi kvprasoon, thank you for your reply. It will work but I don’t want to calculate hash values for files that can’t possibly have a duplicate i.e. no other file exists having the same length. My solution just looks a bit “clunky” to me. If it can’t be done much “cleaner” I’ll be very happy as it’s my first go at using PS.

Cheers

Jon
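For what it's worth, the two-stage idea described above (group by length first, hash only the files whose length is shared) can be wrapped up a little more compactly. The following is only a sketch under a few assumptions: the function name Find-DuplicateFile and the -Path parameter are made up for illustration, the default SHA256 algorithm of Get-FileHash is used, and files that cannot be read (locked, access denied) are not handled specially.

```powershell
# Hypothetical helper name; a sketch of the size-first approach, not a tuned implementation.
function Find-DuplicateFile {
    [CmdletBinding()]
    param(
        # Root folder to scan (assumed to exist and be readable).
        [string]$Path = '.'
    )

    Get-ChildItem -LiteralPath $Path -File -Recurse |
        Where-Object Length -gt 0 |
        Group-Object -Property Length |
        Where-Object Count -gt 1 |      # only lengths shared by 2+ files can hold duplicates
        ForEach-Object { $_.Group } |
        Select-Object FullName, Length,
            @{ Name = 'HashX'; Expression = { (Get-FileHash -LiteralPath $_.FullName).Hash } } |
        Group-Object -Property HashX |
        Where-Object Count -gt 1 |      # same length AND same hash
        ForEach-Object {
            # Copy the group size onto every row so it ends up as a CSV column.
            $count = $_.Count
            $_.Group | Select-Object FullName, HashX, Length,
                @{ Name = 'Count'; Expression = { $count } }
        } |
        Sort-Object -Property @{ Expression = 'Length'; Descending = $true }, HashX, FullName
}

# Example call (the folder E:\Data is a placeholder):
# Find-DuplicateFile -Path 'E:\Data' |
#     Export-Csv -Path 'E:\Duplicates.csv' -Encoding ascii -NoTypeInformation
```

The pipeline keeps the same stages as the original script; the main difference is that Length stays a 64-bit number throughout, so no cast is needed when sorting.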

As to two things you state here:

--- but I wanted to have a bash at my own solution --- it's my first go at using PS.

If you are new, it is really vital that you get ramped up on all of PS first, to avoid misconceptions, errors, bad habits, confusion, etc.

As for …

I found similar examples on the web (and the MS online help and SS64 are useful) but I wanted to have a bash at my own solution.

… It’s OK to challenge oneself, we all do it, but don’t reinvent the wheel, unless you are absolutely sure it’s a better wheel. The fact that you made the two points above, and your response to kvprasoon …

It will work but I don't want to calculate hash values for files that can't possibly have a duplicate i.e. no other file exists having the same length. My solution just looks a bit "clunky" to me. If it can't be done much "cleaner"

… directly indicates your trepidation about your approach. Again, challenging and questioning your decisions / approaches is part of the game. Yet, when you lean on experienced folks for guidance, and those folks give you explicit guidance, it is of course your decision to accept it or not, but being new means you should give it more weight than the approach you are currently considering.

Now, all that being said, this statement alone…

I don't want to calculate hash values for files that can't possibly have a duplicate i.e. no other file exists having the same length.
... calls into question your thought process here. Meaning, how would any code already know this? Only you would, and that would have to be in advance, so you would then set exclusions for such files.

Hashes are not about length; they are about content. You could have files that look like dups because they are exactly the same length, yet their actual content differs. This suggests a potential misunderstanding of how file hashes work.

For example:
Two files, exactly the same length, two different hashes and only one difference. See if you can spot it.

# Show two known duplicate files
'D:\Test\MyContent0.txt','D:\Test\MyContent0 - Copy.txt' | 
ForEach {
Get-ChildItem -Path $PSItem

"`nShowing file contents for $PSItem"
Get-Content -Path $PSItem

"`nShowing file hash for $PSItem"
Get-FileHash -Path $PSItem
}


Directory: D:\Test


Mode                LastWriteTime         Length Name                                                                                                      
----                -------------         ------ ----                                                                                                      
-a----         5/5/2019   1:56 PM            107 MyContent0.txt                                                                                            

Showing file contents for D:\Test\MyContent0.txt
MirrorView Name: DUB-C2_SRMTEST01_L252
Synchronizing Progress(%): 100
MirrorView Name: DUB-C2_SYS02_L1

Showing file hash for D:\Test\MyContent0.txt

Algorithm : SHA256
Hash      : 3C6B3B04D2509EFE845D90DFFCF38797FC05123B69A54862CAF6FCFFBB8DC27B
Path      : D:\Test\MyContent0.txt

-a----         5/5/2019   1:57 PM            107 MyContent0 - Copy.txt                                                                                     

Showing file contents for D:\Test\MyContent0 - Copy.txt
MirrorView Name: DUB-C2_SRMTEST01_L253
Synchronizing Progress(%): 100
MirrorView Name: DUB-C2_SYS02_L1

Showing file hash for D:\Test\MyContent0 - Copy.txt

Algorithm : SHA256
Hash      : 9932E36915DD4E711B9BABB640AA6FB923C6823E5E620FC7BD5E720A53BE93E7
Path      : D:\Test\MyContent0 - Copy.txt

Duplicate files are duplicate based on the content, not file length. Names must be unique in Windows as we all know.
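The comparison above depends on two files under D:\Test. A version that anyone can run could look like the sketch below. It assumes nothing beyond a writable temp folder; the file names a.txt and b.txt and the shortened sample text are made up for illustration.

```powershell
# Hypothetical temp-file names; any two files of equal size with different bytes would do.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$a = Join-Path $dir 'a.txt'
$b = Join-Path $dir 'b.txt'

# Same number of bytes, one character differs (L252 vs L253).
[System.IO.File]::WriteAllText($a, 'MirrorView Name: DUB-C2_SRMTEST01_L252')
[System.IO.File]::WriteAllText($b, 'MirrorView Name: DUB-C2_SRMTEST01_L253')

$sameLength = (Get-Item -LiteralPath $a).Length -eq (Get-Item -LiteralPath $b).Length
$sameHash   = (Get-FileHash -LiteralPath $a).Hash -eq (Get-FileHash -LiteralPath $b).Hash

"Same length: $sameLength"
"Same hash:   $sameHash"

# Clean up the temporary folder.
Remove-Item -LiteralPath $dir -Recurse -Force
```

Run as written, this should report that the lengths match while the hashes do not.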

So, your thinking here …

It will work but I don't want to calculate hash values for files that can't possibly have a duplicate i.e. no other file exists having the same length.

… is not valid.

To find and deal with dups, you must check all files for content sameness; there is no way around that.

Also, though scouring the web for examples is a good thing, you should always look to the help files first,
then to the MS powershellgallery.com, to see if a module or script already exists to leverage as is or tweak or learn from.

Example:

DuplicateFinder 1.1 https://www.powershellgallery.com/packages/DuplicateFinder/1.1 This module give tools to find and clean file duplications

Resources to hopefully assist you in your journey.

# Using the built-in help files and samples and scripts already on your system
<#
Search local and remote PowerShell repositories to locate a module, command, script
or function that meets the target filter
#>

Get-Command -Module PowerShellGet
<#
CommandType     Name                                               Version    Source
-----------     ----                                               -------    ---
Function        Find-Command                                       1.6.0      Pow
Function        Find-DscResource                                   1.6.0      Pow
Function        Find-Module                                        1.6.0      Pow
Function        Find-RoleCapability                                1.6.0      Pow
Function        Find-Script                                        1.6.0      Pow
... 
#>

Get-Module -Name '*duplicate*' | 
Format-Table -AutoSize
 
Find-Module -Name '*duplicate*' | 
Format-Table -AutoSize

<#
Version Name            Repository Description                                               
------- ----            ---------- -----------                                               
1.1     DuplicateFinder PSGallery  This module give tools to find and clean file duplications
1.0.1   Get-Duplicate   PSGallery  A module to find and list duplicate files  
#>


# All Help topics and locations
Get-Help about_*
Get-Help about_Functions

Get-Help about* | Select Name, Synopsis

Get-Help about* | 
Select-Object -Property Name, Synopsis |
Out-GridView -Title 'Select Topic' -OutputMode Multiple |
ForEach-Object { Get-Help -Name $_.Name -ShowWindow }

explorer "$pshome\$($Host.CurrentCulture.Name)"


# Get parameters, examples, full and Online help for a cmdlet or function

# Get a list of all functions
Get-Command -CommandType Function | 
Out-GridView -PassThru -Title 'Available functions'

# Get a list of all commandlets
Get-Command -CommandType Cmdlet | 
Out-GridView -PassThru -Title 'Available cmdlets'

# Get a list of all functions for the specified name
Get-Command -Name '*ADGroup*' -CommandType Function | 
Out-GridView -PassThru -Title 'Available named functions'

# Get a list of all commandlets for the specified name
Get-Command -Name '*ADGroup*' -CommandType Cmdlet | 
Out-GridView -PassThru -Title 'Available named cmdlet'

# get function / cmdlet details
(Get-Command -Name Get-ADUser).Parameters
Get-help -Name Get-ADUser -Full
Get-help -Name Get-ADUser -Online
Get-help -Name Get-ADUser -Examples

Function Get-HelpExamples
{
    [CmdletBinding()]
    [Alias('ghe')]

    Param
    (
        [string]$CmdletName = (
            Get-Command -Name '*' | 
            Out-GridView -PassThru -Title 'Select a cmdlet to see examples'
        )
    )

    If ((Get-Help -Name $CmdletName).Examples)
    {
        (((Get-Help -Name $CmdletName).Examples | 
        Out-String -Stream) -match '.*\\>|C:\\PS>') -replace '.*\\>|C:\\PS>' | 
        Out-GridView -Title 'Select a sample to use' -PassThru
    }
    Else {Write-Warning -Message "There were no help examples discovered"}
}

# Get parameter that accepts pipeline input
Get-Help Get-ADUser -Parameter * | 
Where-Object {$_.pipelineInput -match 'true'} | 
Select * 

# List of all parameters that a given cmdlet supports along with a short description:
Get-Help dir -para * | 
Format-Table Name, { $_.Description[0].Text } -wrap


# Find all cmdlets / functions with a target parameter
Get-Command -CommandType Function | 
Where-Object { $_.parameters.keys -match 'credential'} | 
Out-GridView -PassThru -Title 'Available functions which has a specific parameter'

Get-Command -CommandType Cmdlet | 
Where-Object { $_.parameters.keys -match 'credential'} | 
Out-GridView -PassThru -Title 'Results for cmdlets which has a specific parameter'

# Get named aliases 
Get-Alias | 
Out-GridView -PassThru -Title 'Available aliases'

# Get cmdlet / function parameter aliases
(Get-Command Get-ADUser).Parameters.Values | 
where aliases | 
select Name, Aliases | 
Out-GridView -PassThru -Title 'Alias results for a given cmdlet or function.'

For getting ramped up, there are tons of free / low-cost resources to use. For example:

'https://docs.microsoft.com/en-us/powershell
'https://blogs.msmvps.com/richardsiddaway/2019/02/21/the-source-of-powershell-cmdlets

See these discussions
'https://social.technet.microsoft.com/wiki/contents/articles/183.windows-powershell-survival-guide.aspx
'https://www.reddit.com/r/PowerShell/comments/98dw5v/need_beginner_level_script_ideas_to_learn
'https://www.reddit.com/r/PowerShell/comments/7oir35/help_with_teaching_others_powershell
'https://www.reddit.com/r/PowerShell/comments/99dc5d/powershell_for_a_noob
'https://www.reddit.com/r/PowerShell/comments/ax83qg/how_to_learn_powershell
'https://blogs.technet.microsoft.com/heyscriptingguy
'https://adventofcode.com
'https://community.idera.com/database-tools/powershell/using_powershell/f/general-12/68263/getting-started-with-powershell

YouTube: just search for ‘Beginning PowerShell’, ‘intermediate PowerShell’, ‘advanced PowerShell’, etc.

MS Channel9 and TechNet Virtual Labs: there are no separate PowerShell-specific ones, but anything on Exchange, AD, Azure, etc., all have PowerShell requirements

— Microsoft Virtual Academy —
'https://mva.microsoft.com/liveevents/powershell-jumpstart
'https://mva.microsoft.com/search/SearchResults.aspx#!q=PowerShell&lang=1033
'https://mva.microsoft.com/en-us/training-courses/getting-started-with-microsoft-powershell-8276
'https://mva.microsoft.com/en-us/training-courses/getting-started-with-microsoft-powershell-8276?l=r54IrOWy_2304984382

— Microsoft Channel9 —
'https://channel9.msdn.com/Series/GetStartedPowerShell3
'https://channel9.msdn.com/Search?term=powershell#ch9Search&lang-en=en&pubDate=year

— Youtube —
'https://www.youtube.com/watch?v=wrSlfAfZ49E
'https://www.youtube.com/results?search_query=beginning+powershell
'https://www.youtube.com/results?search_query=powershell+ise+scripting+for+beginners
'https://www.youtube.com/playlist?list=PL6D474E721138865A - Learn Windows PowerShell in a Month of Lunches - YouTube

MOC on-demand, if you cannot go in person.
'https://www.microsoftondemand.com/courses/microsoft-course-10961
'https://www.microsoftondemand.com/courses/microsoft-course-10962

— eBooks and sites —
'https://powertheshell.com/cookbooks
'https://powershell.org/ebooks
'https://leanpub.com/u/devopscollective
'https://powershell.org/free-resources
'https://rkeithhill.wordpress.com/2009/03/08/effective-windows-powershell-the-free-ebook
'https://veeam.com/wp-powershell-newbies-start-powershell.html
'https://reddit.com/r/PowerShell/comments/3cki73/free_powershell_reference_ebooks_for_download
'https://blogs.technet.microsoft.com/pstips/2014/05/26/free-powershell-ebooks
'https://www.idera.com/resourcecentral/whitepapers/powershell-ebook
'http://mikefrobbins.com/2015/04/17/free-ebook-on-powershell-advanced-functions
'https://books.goalkicker.com/PowerShellBook
'https://github.com/vexx32/PSKoans
'http://www.powertheshell.com/topic/learnpowershell

— Book(s) to leverage —

Beginning —

Learn Windows PowerShell in a Month of Lunches 3rd Edition
Donald W. Jones (Author),‎ Jeffrey Hicks (Author)
ISBN-13: 978-1617294167
ISBN-10: 1617294160

Intermediate —

Windows PowerShell Cookbook: The Complete Guide to Scripting Microsoft’s Command Shell 3rd Edition
Lee Holmes (Author)
ISBN-13: 978-1449320683
ISBN-10: 1449320686

Advanced —

Windows PowerShell in Action 3rd Edition
by Bruce Payette (Author),‎ Richard Siddaway (Author)
ISBN-13: 978-1633430297
ISBN-10: 1633430294

Wouldn’t be a precondition for files to have content sameness, to have length or size sameness as well?

Olaf,

Yeppers, as noted by the two simple files that are in my response. They are exactly the same size/character count/line count/length ‘107’, but not the same content. There is only a one-character difference, thus making it a different file, thus a different hash.

[quote quote=155354]Olaf,

Yeppers, as noted by the two simple files that are in my response. They are exactly the same size/character count/line count/length ‘107’, but not the same content. There is only a one-character difference, thus making it a different file, thus a different hash.[/quote]

I know that … that’s why the script from the OP is looking for files with the same size. And only if they have the same size it’s calculating hashes to check if the files are actually the same. What’s wrong with that? Or did I miss something again?

Hi postanote

Thanks for all the resources but I don’t understand how the following could possibly be wrong…

So, your thinking here …

It will work but I don't want to calculate hash values for files that can't possibly have a duplicate i.e. no other file exists having the same length.
… is not valid.

If there’s only 1 file with a length for example of 500 bytes then there’s no point hashing it as it can’t possibly have a duplicate.

I’ve looked at DuplicateFinder which you kindly provided and it also groups by length prior to doing any hashing.
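To put a number on that, a rough sketch like the one below counts how many files have a length nobody else shares (so hashing them can be skipped) versus how many still need a hash. The function name Measure-HashCandidate and its -Path parameter are made up for illustration, and zero-length files are ignored, as in my original script.

```powershell
# Hypothetical helper name; reports how much hashing the length check avoids.
function Measure-HashCandidate {
    param(
        # Root folder to scan (assumed to exist and be readable).
        [string]$Path = '.'
    )

    $bySize = Get-ChildItem -LiteralPath $Path -File -Recurse |
        Where-Object Length -gt 0 |
        Group-Object -Property Length

    # A length seen only once cannot belong to a duplicate.
    $unique = @($bySize | Where-Object Count -eq 1)
    # A length seen two or more times still needs a content check.
    $shared = @($bySize | Where-Object Count -gt 1)

    $toHash = 0
    foreach ($group in $shared) { $toHash += $group.Count }

    [pscustomobject]@{
        SkippedFiles = $unique.Count
        FilesToHash  = $toHash
    }
}

# Example call against the current folder:
# Measure-HashCandidate -Path .
```

Equal length is only a necessary condition, so the files counted under FilesToHash are candidates rather than confirmed duplicates.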

Cheers

Jon

Hi Olaf

I’m glad you also spotted that as I didn’t want to appear rude by questioning postanote’s reply especially as he provided so many helpful links.

Cheers

Jon