It’s not too bad. Here’s how I modified it to get it working for your use case.
What I did was build an array using the strings I want to filter against. Admittedly, this is a much smaller list than yours and I’m not sure how well it will scale, but you could split it into multiple queries.
Relevant lines 42 & 43:
$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join "', '"
This makes an array of strings to use in the query that looks like this
country1, 'country2', 'country3', 'country4 - the outer quotes, which are missing, are added in the query itself.
I then ran a select on all rows where $q does not appear in the countries column (this is referenced as F2) - I couldn't get column names working so I stuck with the default provided by OleDB.
Relevant line 46:
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN ('$q') "
Once finished, the filtered data is in the datatable referenced by $dt so you can just export it as normal.
$dt | Export-CSV E:\Temp\Files\FilteredData.csv -NoTypeInformation
Now, as I said, my filter list was much smaller than yours, but the data file was 25MB (I made a subset of the example file previously posted). It took just 12.9 seconds to get 47000 rows (from 200,000) into $dt.
My modified script is below. Full credit for this goes to Chrissy LeMaire, my edits are no more than tinkering.
Note: I can't get VS Code to use the x86 shell so run this in the ISE or terminal.
# The Text OleDB driver is only available in PowerShell x86. Start x86 shell if using x64.
# This has to be the first check this script performs.
if ($env:Processor_Architecture -ne "x86") {
Write-Warning "Switching to x86 shell"
&"$env:windir\syswow64\windowspowershell\v1.0\powershell.exe" "$PSCommandPath $args"; return
}
Change to your CSV file name, must end in .csv or .tsv
$csvfile = “E:\temp\sales_tmp.csv”
Does the first row contain column names?
$firstRowColumns = $false
What’s the delimiter? Use `t for tabbed.
$csvdelimter = “,”
By default, OleDbconnection columns are named F1, F2, F3, etc unless $firstRowColumns = $true
Alternatively, you could make it check all rows. I’ll add that to the script later and post it.
$checkColumns = “F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14”
################### No need to modify anything below ###################
$datasource = Split-Path $csvfile
$tablename = (Split-Path $csvfile -leaf).Replace(“.”,“#”)
switch ($firstRowColumns) {
$true { $firstRowColumns = “Yes” }
$false { $firstRowColumns = “No” }
}
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName(“System.Data”)
Setup OleDB using Microsoft Text Driver.
$connstring = “Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties=‘text;HDR=$firstRowColumns;FMT=Delimited($csvdelimter)’;”
$conn = New-Object System.Data.OleDb.OleDbconnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDB.OleDBCommand
$cmd.Connection = $conn
$filters = Get-Content E:\temp\files\countries.txt
$q = $filters -join “', '”
Perform select on CSV file, then add results to a datatable using ExecuteReader
$sql = "SELECT * FROM [$tablename] WHERE F2 NOT IN (‘$q’) "
$cmd.CommandText = $sql
$dt = New-Object System.Data.DataTable
$dt.BeginLoadData()
$dt.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))
$dt.EndLoadData()
$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds,2)
Get Total Row Count
$conn.Open()
$cmd.CommandText = “SELECT COUNT(*) as TotalRows FROM [$tablename]”
$totalrows = $cmd.ExecuteScalar()
$conn.Close()
Output some stats
$dupecount = $dt.Rows.Count
Write-Host “Total Elapsed Time: $totaltime seconds. $dupecount duplicates found out of $totalrows total rows. You can access these dupes using `$dt.” -ForegroundColor Green