PowerShell - Delete line from text file if it contains certain string

Hello everyone,

I have the script below, which deletes any line containing a certain string. It works fine, but it is extremely slow since file2 is big (over 50 MB). To improve performance, how do I modify it to:

  1. Delete only one line on the first match (I don't know if this will improve performance)

  2. Avoid saving file2 on every iteration, which may be causing the performance issue?

Other ideas to improve performance will be greatly appreciated. Thank you.


foreach ($string in (Get-Content c:\strings.txt))

{

(Get-Content 'c:\file2.csv') -notmatch $string | Set-Content 'c:\file2.csv'

}

You are on the right track with not writing the changes to the file in the loop. Instead, save the changes to a variable and write them out at the end. Also, save your file content to a variable so you don't have to keep opening the file on each iteration of the loop. Finally, using .NET speeds things up exponentially over using the cmdlets. It's not noticeable for small tasks, but as the files get larger it makes a big difference!

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content c:\strings.txt))
{
  $Content = $Content -replace $string,''
}
$Content | Set-Content -Path $Path

You could try Select-String which is usually faster than Get-Content:

(Select-String -Path .\test.txt -Pattern $strings -NotMatch).Line | Set-Content .\file2.txt
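For example (the paths here are illustrative, and the result is written to a new file for safety), -Pattern accepts an array, so the whole strings file can be passed in a single call rather than looping:

```powershell
# Read every search string at once - one per line in strings.txt
$strings = Get-Content C:\strings.txt
# Keep only the lines of file2.csv that match none of the strings
(Select-String -Path C:\file2.csv -Pattern $strings -NotMatch).Line |
    Set-Content C:\file2_filtered.csv
```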

Also, you could use a switch statement, reading file2 line by line and keeping only the lines that don't match any of the strings:

$patterns = (Get-Content C:\strings.txt) -join '|'
$result = switch -File C:\file2.csv {
    { $_ -notmatch $patterns } { $_ }
}

$result | Set-Content C:\file2.csv

Thank you everyone for responding. I ended up using Logon’s suggestion which is much quicker than what I had.

Thank you, but the script only deletes the matched string. I would like it to remove the entire line when a partial string is matched.

Eg. string = abc

line to be deleted = abc.def

Thank you for any further assistance.


I am glad you found my suggestion helpful. I apologize for misunderstanding your objective. If you want to remove the whole line rather than just the match, then I think Select-String is your best bet.

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content C:\strings.txt))
{
  $Content = $Content | Select-String -Pattern $string -NotMatch
}
$Content | Set-Content -Path $Path

Hi Logan, switching to $Content = $Content | Select-String -Pattern $string -NotMatch works, but it is now back to slow :). Any other suggestions? Thanks.

OK, so how about we create a string to use for -replace rather than returning all the strings that don't match? I did some simple testing and this seems to be faster than using the -NotMatch parameter of Select-String. Does this work any faster in your use case?

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content C:\strings.txt))
{
  $replaceString = ($Content | Select-String -Pattern $string).Line
  $Content = $Content -replace "$([regex]::Escape($replaceString))"
}
$Content | Set-Content -Path $Path

Hello Logan, seems a little faster but still very slow. 1.5MB strings.txt file and 44MB file2.csv file takes over 2 hours to complete. I have files that are much larger than that so it will take days :(. If all else fails then I will use SQL scripting as a workaround. Thank you for your help thus far.

Just wondering if you've tested the code I posted? Get-Content could be the bottleneck: in both of Logan's examples, Get-Content is being used. You don't need Get-Content with Select-String, and skipping it can give a performance increase.

Just a quick example with a 2.5MB file:

PS E:\Temp\Files> Measure-Command {$content = Get-Content .\WindowsUpdate.log; $content | Select-String -Pattern 'Error'} | Select-Object Milliseconds

Milliseconds
------------
         438


PS E:\Temp\Files> Measure-Command {Select-String .\WindowsUpdate.log -Pattern 'Error'} | Select-Object Milliseconds

Milliseconds
------------
          51

Incorporating Matt's recommendation to skip loading the file content into memory and pass the path to Select-String directly does appear to be quite a bit faster. Another thing I noticed in one of Matt's posts is providing the entire collection of strings as the pattern rather than looping through it, which also reduces the processing time. With his suggestions in mind, this looks like it's about as fast as I can get it. Good call on skipping loading the content into memory, Matt!

$Path = 'C:\file2.csv'
$stringColl = [System.IO.File]::ReadAllLines('C:\Strings.txt')
$content = (Select-String -Path $Path -Pattern $stringColl -SimpleMatch -NotMatch).line
$content | Set-Content -Path $Path -Force

I added the -SimpleMatch parameter to get past potential hangups with RegEx tokens appearing in the string collection being passed to -Pattern. It’s the same as escaping the string with the RegEx class static method.
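A quick illustration (the strings here are made up) of why -SimpleMatch matters: without it, regex metacharacters in the strings file change what gets matched.

```powershell
# Without -SimpleMatch, '.' is a regex wildcard, so the pattern '1.5' also matches '125'
'125' | Select-String -Pattern '1.5'                      # returns a match
# With -SimpleMatch the pattern is treated literally, so '125' does not match
'125' | Select-String -Pattern '1.5' -SimpleMatch         # returns nothing
# Equivalent literal match using the regex class static escape method instead:
'125' | Select-String -Pattern ([regex]::Escape('1.5'))   # returns nothing
```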

Hello Matt, yes I tested your script also and it appears to be a little faster but still too slow. Maybe it has something to do with constantly reading and writing to the file. Perhaps I will use SQL as it is better for managing larger data. Thank you.

There is an in-between method that might work for you that doesn't involve SQL (as a server) but does use OleDB; it might be worth investigating since you're using CSV files.

Chrissy LeMaire (founder of dbatools) has some good articles on handling large CSV files.

This script is probably a good starting point for what you're trying to do (in theory, it's just a case of modifying the query on line 43 :) )

https://blog.netnerds.net/2015/01/quickly-find-duplicates-from-csv-using-powershell/

I just ran a test using the 2 million sales records file here:

http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

which is 2 million rows and 238 MB, and it took just under 5 minutes to find the 385,000+ duplicate rows.

Looks complicated but promising. I will check it out. Thanks.

Thanks Logon. I think this will do even though speed seems to be the same. I will just let it run overnight :). Thanks again for your assistance.

It’s not too bad. Here’s how I modified it to get it working for your use case.

What I did was build an array using the strings I want to filter against. Admittedly, this is a much smaller list than yours and I’m not sure how well it will scale, but you could split it into multiple queries.

Relevant lines 42 & 43:

$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join "', '"

This makes a string to use in the query that looks like this:

country1', 'country2', 'country3', 'country4

The outer quotes, which are missing here, are added in the query itself. I then ran a select on all rows where the value of the countries column (referenced as F2) does not appear in $q - I couldn't get column names working, so I stuck with the defaults provided by OleDB.

Relevant line 46:

$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"

Once finished, the filtered data is in the datatable referenced by $dt, so you can just export it as normal:

$dt | Export-Csv E:\Temp\Files\FilteredData.csv -NoTypeInformation

Now, as I said, my filter list was much smaller than yours, but the data file was 25 MB (I made a subset of the example file previously posted). It took just 12.9 seconds to get 47,000 rows (from 200,000) into $dt.
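To illustrate how the pieces fit together (the filter values and table name below are made up for the example), the composed query ends up looking like this:

```powershell
# Hypothetical stand-in for the contents of countries.txt
$filters = 'France','Spain','Italy'
# Join the values with quote-comma-quote; outer quotes come from the query string
$q = $filters -join "', '"
# OleDB derives the table name from the file name, with '.' replaced by '#'
$tablename = 'sales_tmp#csv'
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"
$sql   # SELECT * FROM [sales_tmp#csv] WHERE F2 NOT IN ('France', 'Spain', 'Italy')
```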
My modified script is below. Full credit for this goes to Chrissy LeMaire, my edits are no more than tinkering.
Note: I can't get VS Code to use the x86 shell so run this in the ISE or terminal.
# The Text OleDB driver is only available in PowerShell x86. Start x86 shell if using x64.
# This has to be the first check this script performs.
if ($env:Processor_Architecture -ne "x86")   { 
	Write-Warning "Switching to x86 shell"
	&"$env:windir\syswow64\windowspowershell\v1.0\powershell.exe" "$PSCommandPath $args"; return 
}

# Change to your CSV file name, must end in .csv or .tsv
$csvfile = "E:\temp\sales_tmp.csv"

# Does the first row contain column names?
$firstRowColumns = $false

# What's the delimiter? Use `t for tabbed.
$csvdelimter = ","

# By default, OleDbConnection columns are named F1, F2, F3, etc. unless $firstRowColumns = $true
# Alternatively, you could make it check all rows. I'll add that to the script later and post it.
$checkColumns = "F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14"

################### No need to modify anything below ###################
$datasource = Split-Path $csvfile
$tablename = (Split-Path $csvfile -Leaf).Replace(".","#")

switch ($firstRowColumns) {
    $true { $firstRowColumns = "Yes" }
    $false { $firstRowColumns = "No" }
}

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")

# Setup OleDB using the Microsoft Text Driver.
$connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties='text;HDR=$firstRowColumns;FMT=Delimited($csvdelimter)';"

$conn = New-Object System.Data.OleDb.OleDbConnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDb.OleDbCommand
$cmd.Connection = $conn

$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join "', '"

# Perform the select on the CSV file, then add the results to a datatable using ExecuteReader
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"
$cmd.CommandText = $sql
$dt = New-Object System.Data.DataTable
$dt.BeginLoadData()
$dt.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))
$dt.EndLoadData()
$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds,2)

# Get the total row count
$conn.Open()
$cmd.CommandText = "SELECT COUNT(*) as TotalRows FROM [$tablename]"
$totalrows = $cmd.ExecuteScalar()
$conn.Close()

# Output some stats
$dupecount = $dt.Rows.Count
Write-Host "Total Elapsed Time: $totaltime seconds. $dupecount duplicates found out of $totalrows total rows. You can access these dupes using `$dt." -ForegroundColor Green



Andy, I just posted example code which should point you in the right direction but it’s been lost to the spam filter. I have requested it be released so please check back for it.

Trying now…thanks.