'String in Files Search' - More Efficient?

I'm fairly new to PowerShell, and as practice with what I've learned, I wrote the following script/utility. It searches all files in a given path whose names match one or more partial file names, and reports which of those files contain one or more given strings.

Here is the script.


Clear-Host
Clear-Variable -name [a..z]

$inpath = read-host "Enter Search Path"
$filepart = @{}
$searchtext = @{}

$another = "Y"
$i = 0
while ($another -eq "Y") {
    $filepart[$i] = read-host "Enter Partial File Name"
    if ($filepart[$i] -eq 'XXXX') {$another = 'N'}
    $filepart[$i] = "*" + $filepart[$i] + "*"
    $i++
}
$i--

$another = "Y"
$j = 0
while ($another -eq "Y") {
    $searchtext[$j] = read-host "Enter Search String"
    if ($searchtext[$j] -eq 'XXXX') {$another = 'N'}
    $j++
}
$j--


$outlist = .{ 
for ($x = 0; $x -le $i; $x++) {
    for ($y = 0; $y -le $j; $y++) {
        Get-ChildItem -path $inpath | Where-Object { $_.name -like $filepart[$x]} | Select-String -pattern $searchtext[$y] | group path | select name
    }    
}
}


$outlist

The script works fine as is, but the question I have has to do with efficiency.

The directory I am searching has 7,720 items in it, about 34 MB of data. With only two $filepart entries and two $searchtext entries, it takes about 6-7 minutes to complete. Is there a more efficient way to accomplish the same goal? I was thinking of using two loops: one to build a list of applicable files in the path, then searching only the objects in that list for the search text. But I could not figure out how to pipe the output of one into the search of the other and get a meaningful list.
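Roughly, the two-pass idea I had in mind looked something like this (just a sketch, reusing the variables from the script above):

```powershell
# Sketch of the two-pass idea: collect the matching files once,
# then search only that list for each string.
# Assumes $inpath, $filepart (with $i) and $searchtext (with $j)
# are already populated as in the script above.
$allfiles = for ($x = 0; $x -le $i; $x++) {
    Get-ChildItem -Path $inpath | Where-Object { $_.Name -like $filepart[$x] }
}

# Search only the collected file objects for each string
for ($y = 0; $y -le $j; $y++) {
    $allfiles | Select-String -Pattern $searchtext[$y] | Group-Object Path | Select-Object Name
}
```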

Or maybe 6-7 minutes to search that many files with that much data is already pretty efficient?

Any suggestions would be appreciated.

First, don’t do a Clear-Variable up-front. There’s no need; at the start of the script, you’ve got nothing to clear, so it’s just wasted time.

I'm not sure there's a hugely more efficient way to go after this. You're searching ~34 MB of data, and .NET's file I/O isn't the fastest thing in the world. Enumerating files yourself is always slower than letting a cmdlet enumerate; here, you're letting Get-ChildItem do the enumeration, which is probably your fastest option.

Things that I can see:

  1. You are actually running Get-ChildItem once for each combination of file filter and search term. With 2 file filters and 2 search terms, you go out to the file system 4 times; with 100 file filters and 20 search terms, you'd go out 2,000 times. Everything else aside, you only have to query the file system once per filter. To do so, place Get-ChildItem before the y loop and store the results in a variable.

  2. You are searching the whole of each file for the search term, but since you are grouping the results, I assume you don't need to know how many matches there are; one match is sufficient to select the file. Therefore you only need to search until the first match, which you can do with Select-Object -First 1. In PowerShell version 2 this wouldn't buy you much, but in PowerShell 3+ the pipeline stops searching once the first match is returned. The improvement varies with the size of the files and how early the match occurs: larger files mean bigger savings, and a match on line 1 saves far more than a match on line 1000. At any rate, you should see some performance gain, the amount depending on the data at hand.

  3. You can use Select-String to search for more than one term at once; its -Pattern parameter accepts an array, and searching for all the terms in a single pass helps performance. It also fixes a correctness issue: your $outlist may contain duplicate files when a file matches more than one term, and searching for all the terms at once avoids that duplication.

  4. You are using Where-Object to apply the file name filter; however, you can use the -Filter parameter of Get-ChildItem instead, which applies the filter during enumeration rather than afterward in the pipeline. This will improve performance.

Here is an illustration of the techniques I used to test and demonstrate the methods mentioned above. In the last illustration I changed the looping method to a more PowerShell-ish way to loop. I've commented the typical response times I got over my test data. Keep in mind that a lot of this performance depends on your data, and you may get different results. In particular, if you have a whole lot of very short files, the extra ForEach-Object loop I introduced may push performance the other way. Measure-Command is your friend.

$path = 'C:\Users\cduff\Downloads\test\29\src'

$VerbosePreference = 'Continue'


Write-Verbose "Search File, Group Method"
Measure-Command {gi "$path\1.txt" | select-string Lorem | group Path | select name}
#~100ms
Write-Verbose "Search File, Find First Method"
Measure-Command {gi "$path\1.txt" | select-string Lorem | select -First 1 | select path}
#~2ms

Write-Verbose "Filter Right Method"
Measure-Command {Get-ChildItem -Path $path | Where-Object {$_.name -like '*search1*'} }
#~1900ms
Write-Verbose "Filter Left Method"
Measure-Command {Get-ChildItem -Path $path -Filter "*search1*" }
#~19ms

$files = @(
 "*search1*"
 "*search2*"
)

$terms = @(
 "John"
 "Bob"
)

Write-Verbose "Loop original"
Measure-Command {
    for ($x = 0; $x -lt 2; $x++) {
        for ($y = 0; $y -lt 2; $y++) {
            Get-ChildItem -path $path -Filter $files[$x] | 
            Select-String -pattern $terms[$y] | 
            Group path | 
            Select Name
        }    
    }
}
#~6900ms

Write-Verbose "Loop modified to only search for files once"
Measure-Command {
    for ($x = 0; $x -lt 2; $x++) {
        $children = Get-ChildItem -path $path -Filter $files[$x]
        for ($y = 0; $y -lt 2; $y++) {
            $children | 
            Select-String -pattern $terms[$y] | 
            Group path | 
            Select Name
        }    
    }
}
#~6500ms

Write-Verbose "Loop modified to only search for terms once"
Measure-Command {
    for ($x = 0; $x -lt 2; $x++) {
        Get-ChildItem -path $path -Filter $files[$x] |
        Select-String -pattern $terms | 
        Group path | 
        Select Name 
    }
}
#~4200ms

Write-Verbose "Loop modified with select first"
Measure-Command {
    for ($x = 0; $x -lt 2; $x++) {
        Get-ChildItem -path $path -Filter $files[$x] |
        ForEach-Object {
            $_ |
            Select-String -pattern $terms | 
            Select-Object -First 1 |
            Select-Object path
        }
    }
}
#~285ms

Write-Verbose "Loop modified change loop construct"
Measure-Command {
    ForEach ($filter in $files) {
        Get-ChildItem -path $path -Filter $filter |
        ForEach-Object {
            $_ |
            Select-String -pattern $terms | 
            Select-Object -First 1 |
            Select-Object path
        }
    }
}
#~285ms

Thank you both for your help. I’ll let you know how it goes.

So I made the changes as suggested… and it made a whole lot of difference. Running just two fileparts with two search strings took 6 minutes 25 seconds the old way, and 54 seconds the new way. Thanks again for all the help, especially with the "more powershellish" way of coding. Getting into that mindset consistently is, I think, my biggest challenge at this time.

Here is the completed code, so far.

I hope to add date ranges, and then maybe a GUI interface as well.

$inpath = read-host "Enter Search Path"
$filepart = @{}
$searchtext = @{}

$i = 0
for (;;) {
    $filepart[$i] = read-host "Enter Partial File Name"
    if ($filepart[$i] -eq 'XXXX') {break}
    $filepart[$i] = "*" + $filepart[$i] + "*"
    $i++
}
$i--
$filepart = $filepart[0..$i]

$j = 0
for (;;) {
    $searchtext[$j] = read-host "Enter Search String"
    if ($searchtext[$j] -eq 'XXXX') {break}
    $j++
}
$j--
$searchtext = $searchtext[0..$j]

$outlist = .{
    ForEach ($filter in $filepart) {
        Get-ChildItem -path $inpath -Filter $filter |
        ForEach-Object {
            $_ |
            Select-String -pattern $searchtext |
            Select-Object -First 1 |
            Select-Object path
        }
    }
}

$outlist
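For the date-range idea mentioned above, a possible sketch (the $startdate/$enddate prompts are hypothetical additions, not part of the script; LastWriteTime is the standard FileInfo property):

```powershell
# Hypothetical date-range filter to bolt onto the search above.
# $startdate and $enddate are new inputs; everything else is as in the script.
$startdate = [datetime](read-host "Enter Start Date (e.g. 2015-01-01)")
$enddate   = [datetime](read-host "Enter End Date (e.g. 2015-12-31)")

$outlist = .{
    ForEach ($filter in $filepart) {
        Get-ChildItem -path $inpath -Filter $filter |
        # Keep only files last written inside the requested range
        Where-Object { $_.LastWriteTime -ge $startdate -and $_.LastWriteTime -le $enddate } |
        ForEach-Object {
            $_ |
            Select-String -pattern $searchtext |
            Select-Object -First 1 |
            Select-Object path
        }
    }
}
```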