PowerShell - Delete line from text file if it contains certain string

Hello everyone,

I have the script below, which deletes any line containing a certain string. It works fine, but it is extremely slow since file2 is big (over 50 MB). To improve performance, how do I modify it to:

  1. Delete only one line on the first match (I don't know if this will improve performance)

  2. Avoid saving file2 on every iteration, which may be causing the performance issue?

Other ideas to improve performance will be greatly appreciated. Thank you.


foreach ($string in (Get-Content c:\strings.txt))

{

(Get-Content 'c:\file2.csv') -notmatch $string | Set-Content 'c:\file2.csv'

}

You are on the right track with not writing the changes to the file in the loop. Instead, save the changes to a variable and write them out at the end. Also, save your file content to a variable so you don't have to keep opening the file on each iteration of the loop. Finally, using .NET speeds things up exponentially over using the cmdlets. It's not noticeable for small tasks, but as the files get larger it makes a big difference!

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content c:\strings.txt))
{
  $Content = $Content -replace $string,''
}
$Content | Set-Content -Path $Path

You could try Select-String which is usually faster than Get-Content:

(Select-String -Path .\test.txt -Pattern $strings -NotMatch).Line | Set-Content .\file2.txt
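For example (the paths here are illustrative, and the result is written to a new file for safety), -Pattern accepts an array, so the whole strings file can be passed in a single call rather than looping:

```powershell
# Read every search string at once - one per line in strings.txt
$strings = Get-Content C:\strings.txt
# Keep only the lines of file2.csv that match none of the strings
(Select-String -Path C:\file2.csv -Pattern $strings -NotMatch).Line |
    Set-Content C:\file2_filtered.csv
```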

Also, you could use a switch statement, reading file2 line by line and keeping only the lines that don't match any of the strings:

$patterns = (Get-Content C:\strings.txt) -join '|'
$result = switch -File C:\file2.csv {
    { $_ -notmatch $patterns } { $_ }
}

$result | Set-Content C:\file2.csv

Thank you everyone for responding. I ended up using Logon’s suggestion which is much quicker than what I had.

Thank you, but the script only deletes the matched string. I would like it to remove the entire line when a partial string is matched.

Eg. string = abc

line to be deleted = abc.def

Thank you for any further assistance.


I am glad you found my suggestion helpful. I apologize for misunderstanding your objective. If you want to remove the whole line rather than just the match, then I think Select-String is your best bet.

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content C:\strings.txt))
{
  $Content = $Content | Select-String -Pattern $string -NotMatch
}
$Content | Set-Content -Path $Path

Hi Logan, switching to $Content = $Content | Select-String -Pattern $string -NotMatch works, but it is now back to slow :). Any other suggestions? Thanks.

OK, so how about we create a string to use for -replace rather than returning all the strings that don't match? I did some simple testing and this seems to be faster than using the -NotMatch parameter of Select-String. Does this work any faster in your use case?

$Path = 'C:\file2.csv'
$Content = [System.IO.File]::ReadAllLines($Path)
foreach ($string in (Get-Content C:\strings.txt))
{
  $replaceString = ($Content | Select-String -Pattern $string).Line
  $Content = $Content -replace "$([regex]::Escape($replaceString))"
}
$Content | Set-Content -Path $Path

Hello Logan, seems a little faster but still very slow. 1.5MB strings.txt file and 44MB file2.csv file takes over 2 hours to complete. I have files that are much larger than that so it will take days :(. If all else fails then I will use SQL scripting as a workaround. Thank you for your help thus far.

Just wondering if you've tested the code I posted? Get-Content could be the bottleneck: in both of Logan's examples, Get-Content is being used. You don't need Get-Content with Select-String, and skipping it can give a performance increase.

Just a quick example with a 2.5MB file:

PS E:\Temp\Files> Measure-Command {$content = Get-Content .\WindowsUpdate.log; $content | Select-String -Pattern 'Error'} | Select-Object Milliseconds

Milliseconds
------------
         438


PS E:\Temp\Files> Measure-Command {Select-String .\WindowsUpdate.log -Pattern 'Error'} | Select-Object Milliseconds

Milliseconds
------------
          51

Incorporating Matt's recommendation to skip loading the file content into memory and pass the path to Select-String directly does appear to be quite a bit faster. Another thing I noticed in one of Matt's posts is providing the entire collection of strings as the pattern rather than looping through it, which also reduces the processing time. With his suggestions in mind, this looks like it's about as fast as I can get it. Good call on skipping loading the content into memory, Matt!

$Path = 'C:\file2.csv'
$stringColl = [System.IO.File]::ReadAllLines('C:\Strings.txt')
$content = (Select-String -Path $Path -Pattern $stringColl -SimpleMatch -NotMatch).line
$content | Set-Content -Path $Path -Force

I added the -SimpleMatch parameter to get past potential hangups with RegEx tokens appearing in the string collection being passed to -Pattern. It’s the same as escaping the string with the RegEx class static method.
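A quick illustration (the strings here are made up) of why -SimpleMatch matters: without it, regex metacharacters in the strings file change what gets matched.

```powershell
# Without -SimpleMatch, '.' is a regex wildcard, so the pattern '1.5' also matches '125'
'125' | Select-String -Pattern '1.5'                      # returns a match
# With -SimpleMatch the pattern is treated literally, so '125' does not match
'125' | Select-String -Pattern '1.5' -SimpleMatch         # returns nothing
# Equivalent literal match using the regex class static escape method instead:
'125' | Select-String -Pattern ([regex]::Escape('1.5'))   # returns nothing
```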

Hello Matt, yes I tested your script also and it appears to be a little faster but still too slow. Maybe it has something to do with constantly reading and writing to the file. Perhaps I will use SQL as it is better for managing larger data. Thank you.

There is an in-between method that might work for you that doesn't involve SQL (as a server) but does use OleDB; it might be worth investigating since you're using CSV files.

Chrissy LeMaire (founder of dbatools) has some good articles on handling large CSV files.

This script is probably a good starting point for what you're trying to do (in theory, it's just a case of modifying the query on line 43 :) )

https://blog.netnerds.net/2015/01/quickly-find-duplicates-from-csv-using-powershell/

I just ran a test using the 2 million sales records file here:

http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

which is 2 million rows and 238 MB, and it took just under 5 minutes to find the 385,000+ duplicate rows.

Looks complicated but promising. I will check it out. Thanks.

Thanks Logon. I think this will do even though speed seems to be the same. I will just let it run overnight :). Thanks again for your assistance.

It’s not too bad. Here’s how I modified it to get it working for your use case.

What I did was build an array using the strings I want to filter against. Admittedly, this is a much smaller list than yours and I’m not sure how well it will scale, but you could split it into multiple queries.

Relevant lines 42 & 43:

$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join "', '"

This makes a string to use in the query that looks like this:

country1', 'country2', 'country3', 'country4

The outer quotes, which are missing here, are added in the query itself. I then ran a select on all rows where the value of the countries column (referenced as F2) does not appear in $q - I couldn't get column names working, so I stuck with the defaults provided by OleDB.

Relevant line 46:

$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"

Once finished, the filtered data is in the datatable referenced by $dt, so you can just export it as normal:

$dt | Export-Csv E:\Temp\Files\FilteredData.csv -NoTypeInformation

Now, as I said, my filter list was much smaller than yours, but the data file was 25 MB (I made a subset of the example file previously posted). It took just 12.9 seconds to get 47,000 rows (from 200,000) into $dt.
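To illustrate how the pieces fit together (the filter values and table name below are made up for the example), the composed query ends up looking like this:

```powershell
# Hypothetical stand-in for the contents of countries.txt
$filters = 'France','Spain','Italy'
# Join the values with quote-comma-quote; outer quotes come from the query string
$q = $filters -join "', '"
# OleDB derives the table name from the file name, with '.' replaced by '#'
$tablename = 'sales_tmp#csv'
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"
$sql   # SELECT * FROM [sales_tmp#csv] WHERE F2 NOT IN ('France', 'Spain', 'Italy')
```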
My modified script is below. Full credit for this goes to Chrissy LeMaire, my edits are no more than tinkering.
Note: I can't get VS Code to use the x86 shell so run this in the ISE or terminal.
# The Text OleDB driver is only available in PowerShell x86. Start x86 shell if using x64.
# This has to be the first check this script performs.
if ($env:Processor_Architecture -ne "x86")   { 
	Write-Warning "Switching to x86 shell"
	&"$env:windir\syswow64\windowspowershell\v1.0\powershell.exe" "$PSCommandPath $args"; return 
}

# Change to your CSV file name, must end in .csv or .tsv
$csvfile = "E:\temp\sales_tmp.csv"

# Does the first row contain column names?
$firstRowColumns = $false

# What's the delimiter? Use `t for tabbed.
$csvdelimter = ","

# By default, OleDbConnection columns are named F1, F2, F3, etc. unless $firstRowColumns = $true
# Alternatively, you could make it check all rows. I'll add that to the script later and post it.
$checkColumns = "F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14"

################### No need to modify anything below ###################
$datasource = Split-Path $csvfile
$tablename = (Split-Path $csvfile -Leaf).Replace(".","#")

switch ($firstRowColumns) {
    $true { $firstRowColumns = "Yes" }
    $false { $firstRowColumns = "No" }
}

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")

# Setup OleDB using the Microsoft Text Driver.
$connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties='text;HDR=$firstRowColumns;FMT=Delimited($csvdelimter)';"

$conn = New-Object System.Data.OleDb.OleDbConnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDb.OleDbCommand
$cmd.Connection = $conn

$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join "', '"

# Perform the select on the CSV file, then add the results to a datatable using ExecuteReader
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q')"
$cmd.CommandText = $sql
$dt = New-Object System.Data.DataTable
$dt.BeginLoadData()
$dt.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))
$dt.EndLoadData()
$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds,2)

# Get the total row count
$conn.Open()
$cmd.CommandText = "SELECT COUNT(*) as TotalRows FROM [$tablename]"
$totalrows = $cmd.ExecuteScalar()
$conn.Close()

# Output some stats
$dupecount = $dt.Rows.Count
Write-Host "Total Elapsed Time: $totaltime seconds. $dupecount duplicates found out of $totalrows total rows. You can access these dupes using `$dt." -ForegroundColor Green



Andy, I just posted example code which should point you in the right direction but it’s been lost to the spam filter. I have requested it be released so please check back for it.

Trying now…thanks.