remove duplicates in csv

stuid lastname firstname gradelevel status entrydate
28372 ACEVEDO KEYLIANIS 0 E 8/31/2015
28166 CAHILL ALLANNA 0 N 8/31/2015
28166 CAHILL ALLANNA 0 E 9/24/2015

ive got data in a csv as above. alot more lines of course with a few more duplicates. im trying to drop the duplicates, but keep the one of the two with the most recent “entrydate”, and send it to a new csv along with the otheres that arent duplicates.
having a difficult time figuring out how to go about this.

Well… hmm. Tough thing here is that, if the dates are different, then they’re not actually duplicates as far as the computer is concerned. Do I have that correct?

my bad… the “stuid” would be determining that its a duplicate… and keeping the stuid(row) with the most recent “entrydate” is what im trying to accomplish

I might start by passing them to Group-Object, grouping on the “stuid” field. That’ll give you a group object for each student. I’d probably then pass them to a ForEach.

For each one whose group contained one item, I’d just output the item since there was no duplicate. For each one with more than one, sort them on entry date in descending order and output only the first object from the collection. That’ll give you the most recent.

Import-Csv |
Group-Object -prop stuid |
ForEach {
if ($.Count -eq 1) { $ }
else { $_[0] }
} |

That’s completely off the top of my head and will doubtless need some mangling to make it work, but that’s the general approach I’d probably start with. Like, I need to see if there’s a Count property on the collections output by Group-Object, but I bet there is. I’d need to see if that $_[0] syntax worked, but I bet it would. Or something similar.

ok… ive gotten the following code to ( i think… lol ) give me the proper results on screen.
how do i export this to a csv?

$data = import-csv C:\ELEM-Student.csv

$data | Foreach-Object {$_.entrydate = [DateTime]$_.entrydate; $_} | 
Group-Object stuid | 
Foreach-Object {$_.Group | Sort-Object entrydate | Select-Object -Last 1}

never mind… i thought it would be a little more difficult then just adding

“| export-csv c:\filename.csv”

$data = import-csv C:\SapphireAutoTool\ELEM-Student.csv

$data | Foreach-Object {$_.entrydate = [DateTime]$_.entrydate; $_} | 
Group-Object stuid | 
Foreach-Object {$_.Group | Sort-Object entrydate | Select-Object -Last 1} | export-csv c:\test1.csv -NoTypeInformation

ok, so this is definitely working correctly for me. by that i mean its giving me the results i would expect, and it seems to work fairly quicky on almost 1800 rows. would this be the best way?

Does this have to be powershell? Seams it might be just a simple to open the CSV in Excel and use the remove duplicates tool.

this is actually another step i have to add to a much larger powershell script im working on. so yes… it has to be powershell due to the way the data is coming down initially. i have no control over how its coming to me. everything is scripted so far, so im looking to script the “cleanup” of the original csv.

Then I would say what you have is probably the best way, although you might be able to improve it a little by moving your sort-object. Put it before the Group-Object so that you sort the whole list once rather than having to sort multiple times on the individual groups. You can also typecast your property for sort rather than converting it in a foreach loop first.

$data = import-csv C:\SapphireAutoTool\ELEM-Student.csv

$data | Sort-Object [datetime]entrydate |
Group-Object stuid | 
Foreach-Object {$_.Group | Select-Object -Last 1} |
export-csv c:\test1.csv -NoTypeInformation

so by your suggestion, im sorting by the “entrydate” column first, then by using “group-object stuid” im running a foreach against it to get rid of the duplicates?

Correct. You are:

  1. Getting an unsorted list
  2. Sorting the list by entrydate with the latest entries at the bottom
  3. Grouping the sorted list by stuid
  4. Getting the last entry of each group
  5. Exporting the last entry of each group to CSV

thank you curtis, and don. i appreciate the help.