Remove Duplicate Rows in a CSV

I’d like to use Powershell to remove some rows and save a delimited text file without changing the file formatting. The file is a list of associates exported from workday. It recently started showing up with some duplicate rows (same employee ID on two separate rows). Until they can fix the problem and send me a file without duplicates, I’m hoping to automate a ‘fix’ on my end by having Powershell:

  1. Find rows with a duplicate employee ID (each new line starts with the employee ID)
  2. For the duplicate employee ID, remove the row where Status = Terminated
  3. For the duplicate employee ID, keep the row where Status = Active
  4. Save the file with a new name (do not change the original file formatting)
Employee_ID Name Status Action Needed
11111111 Joe Smith Terminated Remove this row
11111111 Joe Smith Active Keep this row
22222222 Jane Smith Terminated Remove this row
22222222 Jane Smith Active Keep this row
33333333 Bob Smith Terminated Remove this row
33333333 Bob Smith Active Keep this row
Or, if it isn't possible to do the above without changing the file formatting, then something like:
  1. Find rows with a duplicate employee ID
  2. Remove all rows containing the duplicate employee ID
  3. Save the file with new name (do not change the original file formatting)
 

Take a look at Group-Object:

$csv = @"
Employee_ID,Name,Status,Action Needed
11111111,Joe Smith,Terminated,Remove this row
11111111,Joe Smith,Active,Keep this row
22222222,Jane Smith,Terminated,Remove this row
22222222,Jane Smith,Active,Keep this row
33333333,Bob Smith,Terminated,Remove this row
33333333,Bob Smith,Active,Keep this row
44444444,Frank Adams,Terminated,Keep this row
55555555,Ralph Dickens,Active,Keep this row
66666666,Alice White,Active,Keep this row
"@ | ConvertFrom-Csv

$csvGroup = $csv | Group-Object -Property Employee_id

$newCSV = @()
$newCSV += ($csvGroup | Where{$_.Count -gt 1}).Group | Where{$_.Status -ne 'Terminated'}
$newCSV += ($csvGroup | Where{$_.Count -eq 1}).Group


$newCSV

Output:

Employee_ID Name          Status     Action Needed
----------- ----          ------     -------------
11111111    Joe Smith     Active     Keep this row
22222222    Jane Smith    Active     Keep this row
33333333    Bob Smith     Active     Keep this row
44444444    Frank Adams   Terminated Keep this row
55555555    Ralph Dickens Active     Keep this row
66666666    Alice White   Active     Keep this row

Rob, your example works nicely, but when I try to import and export the file, the file formatting ends up totally different. I’m importing a file with around 130k rows using an SSIS job, so Powershell can’t disrupt the original file formatting. Any recommendations with the file import/export? This is what I tried:

$csv = Import-Csv -Path "C:\PathToFile\file.csv"| ConvertFrom-Csv

$csvGroup = $csv | Group-Object -Property Employee_id

$newCSV = @()
$newCSV += ($csvGroup | Where{$_.Count -gt 1}).Group | Where{$_.Status -ne 'Terminated'}
$newCSV += ($csvGroup | Where{$_.Count -eq 1}).Group


$newCSV | Out-File "C:\PathToFile\file_modified.csv"

Export it as a CSV:

$newCSV | Export-CSV -Path "C:\PathToFile\file_modified.csv" -NoTypeinformation

There are methods to remove duplicate in SSIS:

Since this is a temporary fix, I didn’t want to modify the SSIS job in production. I was hoping to just insert a Powershell step to clean the file before SSIS runs. When they fix the file and remove the duplicates, I’ll stop using this. Unfortunately when I try Export-CSV, it

  1. gives an error
  2. creates a file, but with messed up formatting
I tried running this on my local machine and on a separate server, with the same results. Using Posh version 5.1.14409.1018. Error message:
Export-Csv : Cannot bind argument to parameter 'InputObject' because it is null.
At line:11 char:11
+ $newCSV | Export-Csv -Path "C:\PathToFile\file ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidData: (:) [Export-Csv], ParameterBindingValidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Microsoft.PowerShell.Commands.ExportCsvCommand
The original file looks something like:
Employee_ID,Name,Status
11111111,Joe Smith,Active
22222222,Jane Smith,Active
33333333,Bob Smith,Active
The exported file looks something like:
"@{Employee_ID=11111111; Name=Joe Smith; Status=Active}"
"@{Employee_ID=22222222; Name=Jane Smith; Status=Active}"
"@{Employee_ID=33333333; Name=Bob Smith; Status=Active}"

I figured out the original file is formatted UNIX UTF-8. Looks like Posh changes the formatting to DOS UTF-8. Is Posh able to modify and save the file without changing the original formatting / encoding?

Encoding can be specified for Import-CSV and Export-CSV:

-Encoding
Specifies the encoding for the exported CSV file. The default value is UTF8NoBOM.

The acceptable values for this parameter are as follows:

ASCII: Uses the encoding for the ASCII (7-bit) character set.
BigEndianUnicode: Encodes in UTF-16 format using the big-endian byte order.
OEM: Uses the default encoding for MS-DOS and console programs.
Unicode: Encodes in UTF-16 format using the little-endian byte order.
UTF7: Encodes in UTF-7 format.
UTF8: Encodes in UTF-8 format.
UTF8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
UTF8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
UTF32: Encodes in UTF-32 format.

Thanks for the ideas, Rob. I went through and tried all the encoding options. Looks like I can’t keep the original UNIX file formatting, it always converts the file format to DOS.

I have been working with an SSIS solution which imports 131 CSV files. This entry caught my eye, and while it may have been solved long ago, I’m curious if the following solution would’ve answered the OP issue.

Remarks

  • File paths removed; showing only file name.
  • Data are sourced to AdventureWorks2014.
  • IsCurrent defined as iif(EndDate IS NOT NULL,1,0)
  • The new file was opened in Excel (ugh, cuss, spit, &c.) to verify data quality.
  • PSVersion: 5.1.14409.1018
$originalFile = "EmployeeDeptHist.csv";
$newFile = "EmployeeDeptHist_CurrentOnly.csv";

Import-Csv $originalFile | Where-Object {$_.IsCurrent -eq 1} | Export-Csv -NoTypeInformation $newFile -Force;
# validate results
ii $newFile;