Efficient array data collecting and memory management

Hi, I’m hoping you can help. I need to be able to extract information from a very large folder structure containing over 1 million files. The information I need includes:

  • Newest file Name & Date

  • Count of files modified within 3 months of the newest file

  • Total File Count & Size

All of the above can be easily extracted once a collection is made, but I don’t know the most efficient way to collect it. I have used two different approaches but each has its flaws and I wondered if there is a better approach.

The following approach is far quicker than the next one, but it consumes a huge amount of memory. Worse, if I run it in a ForEach loop to process multiple large folders, the memory is never released after each iteration and the computer crashes before the script completes (at which point the memory would be returned):

Remove-Variable FolderFiles -ErrorAction SilentlyContinue
$FolderFiles = Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -Force | Select-Object Name, FullName, LastWriteTime, Length | Sort-Object LastWriteTime -Descending
$FolderFiles

The following approach does not consume large amounts of memory but it takes many times longer to complete each search:

Remove-Variable FolderFiles -ErrorAction SilentlyContinue
$FolderFiles = @()
Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -Force | Select-Object Name, FullName, LastWriteTime, Length | Sort-Object LastWriteTime -Descending | ForEach-Object {
    Remove-Variable fileData -ErrorAction SilentlyContinue
    $fileData = [PSCustomObject]@{
        "Name"          = $_.Name
        "FullName"      = $_.FullName
        "LastWriteTime" = $_.LastWriteTime
        "Length"        = $_.Length
    }
    $FolderFiles += $fileData
}
$FolderFiles | Format-Table

I’ve used C:\TEMP in the above examples.

Is there a more efficient way of capturing large volumes of file information into an array that can then be interrogated? I know that += isn’t the best approach, but I’m not familiar with the best way to achieve what I’m after.

Thanks.

Dealing with as many files as you mention, and running into memory issues, I would say += is likely the source of your performance problems. Each += tears down the existing array and rebuilds it with one extra element, so the cost grows quadratically with the number of items. This article may help you out. I personally use the Generic.List.
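As a sketch of the difference (using the C:\TEMP path from your example):

```powershell
# Quadratic: each += allocates a new array and copies every existing element
$files = @()
Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File | ForEach-Object { $files += $_ }

# Roughly linear: List[T] grows an internal buffer in place
$files = [System.Collections.Generic.List[object]]::new()
Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File | ForEach-Object { $files.Add($_) }
```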

If it’s about speed, you could use robocopy to collect the raw data. With its command line switch /L it only lists the files, and you can parse that listing to extract the information you’re after.
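A rough sketch of that idea. The destination folder ("C:\RoboDummy" here) is a placeholder required by robocopy’s syntax but never written to, because /L copies nothing:

```powershell
# /L              list only - nothing is copied
# /S              recurse into subdirectories
# /NJH /NJS       suppress the job header and summary
# /NDL            suppress directory lines (file lines only)
# /FP /TS /BYTES  full paths, timestamps, exact sizes in bytes
# /NP             no progress percentages
$listing = robocopy "C:\TEMP" "C:\RoboDummy" /L /S /NJH /NJS /NDL /FP /TS /BYTES /NP

# Each remaining line holds a size, a timestamp and a full path,
# which can then be split into the fields you need.
```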

In addition to the other answers regarding the inefficiency of the “+=” operator to populate the $FolderFiles array, your second example is creating two PSCustomObjects for each file. The first one is created using the Select-Object cmdlet before sorting the data. The second one is explicitly created by you in the ForEach-Object block. That second one is unnecessary (it contains the same information as the $_ variable). You’re also unnecessarily removing the “filedata” variable inside the ForEach-Object loop.
Also, if you have a million files to examine, using the Format-Table is going to be pretty useless!
All you really need is something like this:

$FolderFiles = Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -force | Select-Object Name, FullName, LastWriteTime, Length | Sort-Object LastWriteTime -Descending

The fastest and cleanest way (IMO) is just letting PowerShell collect the values for you:

$somelist = foreach($something in $someotherthing){
    Output
}

No need to clear the variable, create a container, or add to it. Just drop the object/output on the implicit output stream and it will be gathered into the variable. You do need to control errant output, though, because this captures anything that happens to emit something.

That will create a scalar or a fixed-size array, depending on whether one item or more than one is collected.


$output = foreach($num in 1){
    $num
}

$output -is [array]

$output = foreach($num in 1,2){
    $num
}

$output -is [array]

False
True

You can force an array by casting to [array] or by wrapping the whole statement in the array subexpression operator @().
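For example, wrapping the one-item case from above in @() always yields an array:

```powershell
$output = @(foreach ($num in 1) { $num })
$output -is [array]   # True, even though only one item was collected
```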

Another way to collect everything is to use the -OutVariable parameter on cmdlets while discarding the implicit output with $null = or | Out-Null:

$null = ForEach-Object -InputObject 1,2 -Process {$_} -OutVariable output

I like to do this when I have a list I need to be able to remove items from selectively, something you can’t do with a fixed-size array (-OutVariable collects into an ArrayList, which supports removal).

$output.Remove(2)

If you must create a container and add items one by one (there should be a very specific reason, and it shouldn’t happen often), I would use the generic list:

$output = [System.Collections.Generic.List[object]]::new()

1,2 | ForEach-Object {
    $output.Add($_)
}

You can have a list of just strings, ints, a custom type, etc. [object] seems to work for most things most of the time. Lists are super fast, you can add/remove just like with ArrayList, but a list won’t ever pollute your output stream the way ArrayList does.
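To illustrate the stream pollution, a small sketch:

```powershell
$al = [System.Collections.ArrayList]::new()
$al.Add('x')           # Add() returns the new index (0), which leaks into the output stream
$null = $al.Add('y')   # the usual workaround: discard the returned index

$gl = [System.Collections.Generic.List[object]]::new()
$gl.Add('x')           # List[T].Add() returns void - nothing leaks
```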

Thank you all for your suggestions. I tried a few different approaches yesterday. Firstly I tried this:

$output = [System.Collections.Generic.List[object]]::new()
Get-ChildItem -LiteralPath $WorkingFolder\$RootFolderName -Recurse -File -force | Select-Object Name, FullName, LastWriteTime, Length | ForEach-Object {
  $output.Add($_)
}

Then I tried this:

Get-ChildItem -LiteralPath $WorkingFolder\$RootFolderName -Recurse -File -force | Select-Object Name, FullName, LastWriteTime, Length | ForEach-Object -Process {$_} -OutVariable RootFolderFiles | Out-Null
$RootFolderFiles=$RootFolderFiles | Sort-Object LastWriteTime -Descending

I even tried this:

If (Test-Path "C:\Temp\GetFolderStatsv2\OutFile.txt") { Remove-Item "C:\Temp\GetFolderStatsv2\OutFile.txt" }
Get-ChildItem -LiteralPath $WorkingFolder\$RootFolderName -Recurse -File -Force | Select-Object Name, FullName, LastWriteTime, Length | ForEach-Object -Process {$_} | Export-Csv -LiteralPath "C:\Temp\GetFolderStatsv2\OutFile.txt" -Delimiter ";" -Append
$TempArray = Import-Csv -LiteralPath "C:\Temp\GetFolderStatsv2\OutFile.txt" -Delimiter ";"

$RootFolderFiles = $TempArray | ForEach-Object {
    [PSCustomObject]@{
        Name          = $_.Name
        FullName      = $_.FullName
        LastWriteTime = $($_.LastWriteTime | Get-Date) # Convert date 'String' to 'DateTime'
        Length        = [int64]$_.Length               # Convert Length 'String' to 'Numeric'
    }
}
Remove-Variable TempArray
$RootFolderFiles = $RootFolderFiles | Sort-Object LastWriteTime -Descending

but I still seem to hit the issue of the computer being drained of memory on very large numbers of files in a directory structure.

Even if you do this:

$FolderFiles = Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -Force | Select-Object Name, FullName, LastWriteTime, Length | Sort-Object LastWriteTime -Descending

There’s a very good chance you’re going to run out of memory. Sort-Object requires that all of the data being sorted be held in memory; piping the data into the sort doesn’t change that.
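For example, if all you need is the single newest file, the sort can be avoided entirely by tracking the maximum while the pipeline streams (a sketch; C:\TEMP is a placeholder path):

```powershell
$newest = $null
Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -Force -ErrorAction SilentlyContinue |
    ForEach-Object {
        # Keep only one FileInfo in memory: the current maximum
        if ($null -eq $newest -or $_.LastWriteTime -gt $newest.LastWriteTime) {
            $newest = $_
        }
    }
$newest | Select-Object Name, LastWriteTime
```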

Part of the problem is that you’re creating a PSCustomObject for each file. That adds quite a bit of overhead for the small amount of data in each file’s information.
Long ago (when memory was expensive and data were stored on mag-tape), we’d use a sort program that read a reasonable chunk of records, sorted it, wrote it to another tape, then repeated the process over and over. Then it would read the smaller chunks back and merge them, doing this over and over until all the data were in the required sequence. The process was called a k-way sort/merge.

Lacking that, you could try a few things:

  • Write the unordered information to disk and use LINQ to sort it

  • Write the data to a database and let the database figure out the sorting.

  • Use the *nix ‘sort’ which will use disk space to sort when necessary.

Or, you could try “string-ifying” the data instead of using a PSCustomObject for each file. Make the LastWriteTime property a FIXED-LENGTH text string and place it at the start of the string, separate the fields with commas, and sort the resulting strings as plain text. This should cut down on the memory needs.
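A sketch of that “string-ifying” idea (the field order and format are just one possible choice): a fixed-width, year-first timestamp sorts lexicographically in the same order as the dates it encodes, so a plain string sort is enough.

```powershell
# One compact string per file instead of a PSCustomObject:
# a fixed-length yyyyMMddHHmmss stamp first, then length, then path.
$lines = Get-ChildItem -LiteralPath "C:\TEMP" -Recurse -File -Force |
    ForEach-Object { '{0:yyyyMMddHHmmss},{1},{2}' -f $_.LastWriteTime, $_.Length, $_.FullName }

# Newest first; no per-file object overhead, just strings.
$sorted = $lines | Sort-Object -Descending
```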

A million files isn’t actually that much, I think. I tried the following code on 4 folders with about 1.2 million files totalling about 2 TiB; it took less than 6 minutes and I had no issues with memory consumption.

$Folder = 
    'c:\temp'
$result = 
    Get-ChildItem $Folder -Recurse -force -file -ErrorAction SilentlyContinue
$newestFile = 
    $result | 
        Sort-Object -Property LastWriteTime | 
            Select-Object -Last 1
$ModifiedLast3Months = 
    $newestFile.LastWriteTime.AddMonths(-3)
$ModifiedLast3MonthsFileList = 
    $result | 
        Where-Object -Property LastWriteTime -GT -Value $ModifiedLast3Months 
$newestFile | 
    Select-Object -Property Name, LastWriteTime
$ModifiedLast3MonthsFileList.Count
$Result | 
    Measure-Object -Sum -Property length | 
        Select-Object *

Depending on the number of files, and on what you actually want to do with the data you collect, I wouldn’t recommend outputting them to the console. Instead you could export them to a CSV file or another file format suitable for your analysis needs.

Hello,

I have tried another approach, without sorting data and without storing ALL items in an array in memory.
I scan the directories and keep in memory only the files newer than the current newest file minus 3 months.

param(
	[string]$rootdir = "c:\",
	[int]$age=3
)

Set-PSDebug -Off
Set-PSDebug -Strict		# error when trying to use an uninitialized variable

#-- global initialization --
$start = Get-Date
$startwatch = [system.diagnostics.stopwatch]::StartNew()

$fullscript	= $MyInvocation.MyCommand.path
$fullscript	= [System.IO.Path]::GetFullPath($fullscript)
$scriptdir	= [System.IO.Path]::GetDirectoryName($fullscript)
$scriptname	= [System.IO.Path]::GetFileNameWithoutExtension($fullscript)
$scriptext	= [System.IO.Path]::GetExtension($fullscript)
[System.Environment]::CurrentDirectory = $scriptdir


function Write-Log {
	param( $str )

	$prefix = (get-date).ToString( "yyyy/MM/dd;HH:mm:ss" )

	write-host "$prefix;$str"
}

write-log -str "begin"

$ModifiedLast3Months	= [datetime]::MinValue
$RecentDate				= [datetime]::MinValue
$RecentItem				= $null
[long]$TotalNbFile		= 0
[long]$TotalSize		= 0
$arrRecentItems			= [System.Collections.Generic.List[System.IO.FileInfo]]::new()


#-- start the scan of the directory --
write-log -str ". scan directory : $rootdir"
$StopWatch				= [system.diagnostics.stopwatch]::StartNew()
Get-ChildItem -path $RootDir -filter "*" -file -Recurse -force -ErrorAction SilentlyContinue | foreach {
	$TotalNbFile++
	$TotalSize += $_.length

	if ($_.lastwritetime -gt $ModifiedLast3Months) {
		#-- add item to recent files array --
    	$arrRecentItems.add( $_ )
	}

	if ( $_.lastwritetime -gt $RecentDate ) {
		$RecentItem = $_
		$RecentDate = $RecentItem.LastWriteTime
		$ModifiedLast3Months = $RecentDate.AddMonths(-$age)

		write-log -str "  > new recent file : $($RecentDate.ToString('yyyy/MM/dd HH:mm:ss')) - $($RecentItem.FullName)"

		$i = 0
		while ( $i -lt $arrRecentItems.count ) {
			if ( $arrRecentItems.item($i).LastWriteTime -le $ModifiedLast3Months) {
				$arrRecentItems.RemoveAt($i)
			} else {
            	$i++
			}
		}
	}
}
$stopwatch.stop()

#-- display results --
write-log -str ". results"
write-log -str "  > total scan time    : $($StopWatch.elapsed.ToString())"
write-log -str "  > total files number : $TotalNbFile"
write-log -str "  > total files size   : $TotalSize"
if ($recentItem) {
	write-log -str "  > newest file        : $($RecentItem.LastWriteTime.ToString('yyyy/MM/dd HH:mm:ss')) - $($RecentItem.FullName)"
	write-log -str "  > recent files count : $($arrRecentItems.count)"
}

write-log -str "end (in $($startwatch.elapsedmilliseconds) ms)"

On a large directory, here is an example of result :

2025/12/07;19:07:27;begin
2025/12/07;19:07:27;. scan directory : X:\xxx
...
2025/12/07;19:40:06;. results
2025/12/07;19:40:06;  > total scan time    : 00:32:39.3038650
2025/12/07;19:40:06;  > total files number : 2 802 559
2025/12/07;19:40:06;  > total files size   : 3 295 821 934 956
2025/12/07;19:40:06;  > newest file        : 2025/12/07 16:17:48 - X:\xxx\...
2025/12/07;19:40:06;  > recent files count : 184 810
2025/12/07;19:40:06;end (in 1959383 ms)

The server used for this test has mechanical hard drives.
On my workstation, the result is a bit different :

2025/12/07;18:49:15;begin
2025/12/07;18:49:15;. scan directory : c:\
...
2025/12/07;18:51:54;  > total scan time    : 00:02:39.3658454
2025/12/07;18:51:54;  > total files number : 789 077
2025/12/07;18:51:54;  > total files size   : 224 271 194 804
2025/12/07;18:51:54;  > newest file        : 2025/12/07 18:51:14 - C:\...
2025/12/07;18:51:54;  > recent files count : 187 748
2025/12/07;18:51:54;end (in 159385 ms)

May I ask how much RAM the server has that you’re planning to use for this task? … or the one you used when you ran into the issues you’re trying to circumvent now?

As I said, I used the code above to scan some directories with about 1.2 million files and I never had any issue with memory consumption. The server I used had a pretty usual config with 4 CPU cores and 16GiB of RAM.

I just thought about it, and without digging into your code … that means you may have to update the list of files modified within 3 months of the current newest file whenever the current newest file changes during the scan of the directories … right?

Hi, thanks, I will give those options a try. For clarity on what I’m trying to achieve: I am looking to archive any folders that do not contain any recently modified files. To do this I need to capture the newest file in a folder structure. In order to gauge how many files have been most recently modified (to rule out one rogue file), I also wish to count how many other files were modified around the time of the newest file (1, 3 & 6 months earlier than the newest). Getting the total file count & size is desirable but not essential. My script spits all this information into an Excel workbook.

There is no requirement to update any lists or files; it is purely a data scraping exercise. If that means capturing raw data as text and then converting it to date and int formats to sort & interrogate, then I’ll take any solution that doesn’t kill the computer trying, which btw is a W10 VM with 12GB of memory. I could flood the VM with extra memory, but that is hardly an efficient solution.

Hmmm …

<my 2 cents>

Giving a Windows system with some workload to do at least about 16GiB of RAM is hardly considered flooding it with memory.

So I got it wrong. You said you …

And how do you determine the current newest file? If, by accident, the files are returned sorted backwards, every single file your script comes across is the newest file at that moment. So in each loop iteration the newest file changes, and therefore the set of files modified within 3 months of the current newest file should change too, shouldn’t it?

Is this the VM with the folders you want to scan? I’d urgently recommend running the script locally, not remotely.

</my 2 cents>

Hi,

the folders I am scanning are network shared folders. The script runs from the W10 VM desktop and scans overnight, which works great until I hit a very large folder. I agree 16GB is not flooding it; I just meant if I had to apply a ridiculous amount just to avoid the VM memory maxing out.

I’m trying out Christophe’s idea of only collecting newer files into the array. I figured that would require two searches: one to find the newest file, and then a second scan using that file’s date stamp as a reference.

Again … I urgently recommend running the script locally on the server, or at least via Invoke-Command. That will speed up the execution a lot. And it may avoid any memory consumption errors … assuming you have enough memory in the file server.

Agree with @Olaf - Run the script on the server where the files live. Dealing with large numbers of files over a network connection is always slow.

Imagine you’re reading a huge document one page at a time, but the pages are in another state and you have to drive back and forth between where the pages are stored and where they are read, for each page.
Now imagine you are sitting in the same room as all the pages - much faster, since there’s no transportation overhead. Reading data across a network is like the first scenario.

Thank you all for your input on this conundrum. The use of a VM was never my first choice and was born from necessity but you all put forward a valid and strong argument so I will endeavour to have the script moved to the server.

Regarding the script by Christophe, which only captures files up to 3 months old, overnight testing looks positive. Memory consumption is significantly down. Before, a 2 million file folder search would use up all available computer memory and crash, whereas it now uses less than 0.5GB. This would of course increase somewhat on folders with large volumes of files within 3 months of the newest, but it is a far cry from where I was at the start of this question. Many thanks Christophe. Sometimes you can be too close to a problem to conceive an alternative approach.

Hi,

This does not depend on whether the items are sorted or not. The most recent date is recalculated as the scan progresses. When a file is newer than the reference date (newest date - age), it is added to the recent items array.

If this file is newer than the current newest file, then:

  • RecentDate is changed with this new date
  • Reference date is updated with RecentDate-age
  • Recent Items array is updated by removing items older than the new Reference date

So: only one scan of the directory and no sort!
Instead of keeping all items in memory, I keep only the recent items. When the reference date changes, I just loop over the in-memory array to remove the items that have become older than the new reference date. That is much faster than rescanning the directory.

I didn’t change the name of the reference date variable ($ModifiedLast3Months) but it is not necessarily 3 months because the age (number of months) is a parameter at the beginning of the script (like $rootdir).

I should change the variable name ModifiedLast3Months to ReferenceDate and replace age (in months) with a TimeSpan object ($TimeSpanAge). Instead of using AddMonths with a negative value, I could use $ReferenceDate = $RecentDate - $TimeSpanAge.

Here is the latest version of the script

param(
	[string]$rootdir = "c:\",
	[TimeSpan]$age=[TimeSpan]"90.00:00:00"	# 90 days by default
)

Set-PSDebug -Off
Set-PSDebug -Strict		# error when trying to use an uninitialized variable

#-- global initialization --
$start = Get-Date
$startwatch = [system.diagnostics.stopwatch]::StartNew()

$fullscript	= $MyInvocation.MyCommand.path
$fullscript	= [System.IO.Path]::GetFullPath($fullscript)
$scriptdir	= [System.IO.Path]::GetDirectoryName($fullscript)
$scriptname	= [System.IO.Path]::GetFileNameWithoutExtension($fullscript)
$scriptext	= [System.IO.Path]::GetExtension($fullscript)
[System.Environment]::CurrentDirectory = $scriptdir


function Write-Log {
	param( $str )

	$prefix = (get-date).ToString( "yyyy/MM/dd;HH:mm:ss" )

	write-host "$prefix;$str"
}


write-log -str "begin"

$ReferenceDate			= [datetime]::MinValue
$RecentDate				= [datetime]::MinValue
$RecentItem				= $null
[long]$TotalNbFile		= 0
[long]$TotalSize		= 0
$arrRecentItems			= [System.Collections.Generic.List[System.IO.FileInfo]]::new()

#-- start the scan of the directory --
write-log -str ". scan directory : $rootdir"
$StopWatch				= [system.diagnostics.stopwatch]::StartNew()
Get-ChildItem -path $RootDir -filter "*" -file -Recurse -force -ErrorAction SilentlyContinue | foreach {
	$TotalNbFile++
	$TotalSize += $_.length

	if ($_.lastwritetime -gt $ReferenceDate) {
		#-- add item to recent files array --
    	$arrRecentItems.add( $_ )
	}

	if ( $_.lastwritetime -gt $RecentDate ) {
		$RecentItem = $_
		$RecentDate = $RecentItem.LastWriteTime
		$ReferenceDate = $RecentDate - $age

		write-log -str "  > new recent file : $($RecentDate.ToString('yyyy/MM/dd HH:mm:ss')) - $($RecentItem.FullName)"

		$i = 0
		while ( $i -lt $arrRecentItems.count ) {
			if ( $arrRecentItems.item($i).LastWriteTime -le $ReferenceDate) {
				$arrRecentItems.RemoveAt($i)
			} else {
            	$i++
			}
		}
	}
}
$stopwatch.stop()

#-- display results --
write-log -str ". results"
write-log -str "  > total scan time    : $($StopWatch.elapsed.ToString())"
write-log -str "  > total files number : $TotalNbFile"
write-log -str "  > total files size   : $TotalSize"
if ($recentItem) {
	write-log -str "  > age                : $($age.ToString())"
	write-log -str "  > reference date     : $($ReferenceDate.ToString('yyyy/MM/dd HH:mm:ss'))"
	write-log -str "  > newest file        : $($RecentItem.LastWriteTime.ToString('yyyy/MM/dd HH:mm:ss')) - $($RecentItem.FullName)"
	write-log -str "  > recent files count : $($arrRecentItems.count)"
}

write-log -str "end (in $($startwatch.elapsedmilliseconds) ms)"

And a sample invocation (newest file and all files changed in the 30 days before the newest one):

.\SearchFile.ps1 -rootdir "c:\" -age ([timeSpan]"30.00:00:00")

And a sample of the result:

2025/12/11;13:53:36;begin
2025/12/11;13:53:36;. scan directory : c:\
...
2025/12/11;13:54:01;. results
2025/12/11;13:54:01;  > total scan time    : 00:00:25.1372786
2025/12/11;13:54:01;  > total files number : 87241
2025/12/11;13:54:01;  > total files size   : 36519136647
2025/12/11;13:54:01;  > age                : 15.00:00:00
2025/12/11;13:54:01;  > reference date     : 2025/11/26 13:49:38
2025/12/11;13:54:01;  > newest file        : 2025/12/11 13:49:38 - C:\xxx\...
2025/12/11;13:54:01;  > recent files count : 2630
2025/12/11;13:54:01;end (in 25167 ms)