LARGE csv issue (7+ GB - ~25 million lines)

Trying to quantify corruption and data loss in my ECM. I ran a query in SQL to get a list of the items that are SUPPOSED to be there. The query returned 25 million lines, which saved as a CSV comes to almost 8 GB. I think there are multiple lines per file.

I want to run PowerShell against the CSV and get a new CSV with only unique values. I plan to then use that CSV with PowerShell (Test-Path) to get a list of what is/isn’t there so I can try to remediate.

I have tried gc large.csv | Get-Unique > new.csv and ended up with an 11 GB CSV.

I am trying:

Import-Csv .\large.csv | Sort-Object filepath -Unique | Export-Csv .\new.csv -NoTypeInformation

As you can imagine, it is using huge amounts of time and huge amounts of RAM. What am I doing wrong? How can I do this more efficiently?

Thank you for your help!

Bruce Olschewski

Anything you do with a CSV that large is going to be ridiculously slow, assuming it even works at all (since you may run into Out of Memory exceptions at some point.)

It might be easier to just count the number of lines in the file. The fastest way to do that in PowerShell 3.0 or later is probably something along the lines of the sketch below (the path is a placeholder):
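
# Stream the file line by line; [System.IO.File]::ReadLines() returns an
# IEnumerable[string], so lines are read one at a time instead of the whole
# file being pulled into memory.
$count = 0
foreach ($line in [System.IO.File]::ReadLines('C:\temp\large.csv')) {
    $count++
}
$count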

This way, the .NET framework (compiled C# code) is doing all of the heavy lifting, and you’re also not having to store the whole file in memory, since this is all being done with IEnumerable.

Bruce, I think to increase efficiency you should use SELECT DISTINCT in SQL or SSIS instead of doing the heavy work outside the database. The SQL engine is much more efficient at this.

Just my 2 cents.

Best,
Daniel

Thank you Daniel! I used that and reduced it to about 17 million lines. 🙂

Is there a way I can read records out of the CSV, run Test-Path on each record, and write a $false (if missing) to another CSV, without having to load 5 GB of data into memory?

Two suggestions:

  1. Use SQL to get the unique rows, not PowerShell. The functionality is built into the T-SQL syntax.
  2. Use PowerShell to query the database directly, without storing the data in an interim CSV file. That way PowerShell can process the results row by row without wasting both disk space and memory (see the sketch below).
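
A rough sketch of suggestion 2, assuming Windows authentication; the server, database, output file, and the query text itself are placeholders to adapt to your environment:

# Stream rows straight out of SQL Server with a SqlDataReader and test each
# path as it arrives; no interim CSV is written and only one row is held in
# memory at a time.
Add-Type -AssemblyName System.Data

$connectionString = 'Server=YourSqlServer;Database=YourEcmDb;Integrated Security=True'
$query = 'SELECT DISTINCT itemnum, filepath FROM hsi.itemdatapage WHERE diskgroupnum = 104'

$connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
$command = $connection.CreateCommand()
$command.CommandText = $query
$connection.Open()
$reader = $command.ExecuteReader()

while ($reader.Read()) {
    $path = [string]$reader['filepath']
    if (-not (Test-Path -LiteralPath $path)) {
        # Record only the missing files (Export-Csv -Append needs PowerShell 3.0+).
        [pscustomobject]@{ ItemNum = $reader['itemnum']; FilePath = $path } |
            Export-Csv .\missing.csv -NoTypeInformation -Append
    }
}

$reader.Close()
$connection.Close()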

Bruce, have a look at the Netnerds blog. Chrissy LeMaire has written some great articles on working with large CSV files.

Can you explain the whole scenario and what you are trying to do? You have data in a database, and it sounds like a lot of it. SQL has an engine built to work with millions of records and perform advanced queries; PowerShell isn’t the best place for that kind of volume. If you explain what you are trying to do with the data and what the end goal is, we’ll be able to tell you the options for making it work as efficiently as possible.

Here is what I am trying to do. My document management system experienced some severe problems: the NAS that copy 2 was on died, and that is when we found data corruption in copy 1. I need to figure out what is missing as I try to recover. The database for the system is MS SQL 2005. I queried the db to get a list of what it THINKS is there (getting the doc handle and filepath as part of the query); that is the source of the huge CSV.

What I want to do is take the list of what the db thinks exists, use PowerShell to Test-Path each item, and write the missing files to another CSV so I can find what is gone and work to get my system trustworthy again.
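
If the check does end up running against the CSV itself, one way to avoid Import-Csv is to stream the file line by line. A minimal sketch, assuming the export has a header row, filepath is the last column, and the paths contain no commas:

$source = 'C:\temp\large.csv'    # placeholder path to the exported CSV
$isHeader = $true

foreach ($line in [System.IO.File]::ReadLines($source)) {
    if ($isHeader) { $isHeader = $false; continue }    # skip the header row

    # Take the last comma-separated field as the file path and strip any quotes.
    $path = ($line -split ',')[-1].Trim('"')

    if (-not (Test-Path -LiteralPath $path)) {
        # Append the whole original record to a CSV of missing files.
        Add-Content -Path .\missing.csv -Value $line
    }
}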

This is the query I used to get the large csv:
select distinct
hsi.itemdata.itemnum, hsi.itemdata.itemname, hsi.itemdata.datestored, hsi.itemdatapage.filepath
From
hsi.itemdata, hsi.itemdatapage
Where
hsi.itemdatapage.diskgroupnum = 104
and
hsi.itemdata.status 16
and
hsi.itemdata.itemnum = hsi.itemdatapage.itemnum

Now I want to read the csv (if possible) without loading it entirely into memory and do a “for each” to test each file.

Does that clarify the situation? Is there a better way?

Thanks,

Bruce

I’d argue that it may be better if you adjust your SQL query to create a new table and keep the data in SQL. Have it add another (blank) column, though.

Two options then might be to either

(1) Skip PowerShell completely, and create a stored procedure or CLR function (written in VB/C#) that carries out the checks, returning true/false for each row, and then write a new SQL query that uses it. It’s a fairly straightforward process, to be honest.

or

(2) Create a PowerShell script that retrieves a specified number of rows at a time (via SQL’s ROW_NUMBER), checks for each file’s existence, and updates the table column to true or false accordingly. You could run this as a job in SQL so you can leave it churning away. You might also be able to write your code as a workflow and use parallel execution to speed it up.
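
A rough sketch of what option 2 could look like; the working table dbo.FileCheck and its ItemNum, FilePath and FileExists columns are placeholders for whatever staging table you create:

Add-Type -AssemblyName System.Data

$connectionString = 'Server=YourSqlServer;Database=YourEcmDb;Integrated Security=True'
$batchSize = 10000
$start = 1

$connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
$connection.Open()

# Prepare the UPDATE once and reuse it for every row.
$update = $connection.CreateCommand()
$update.CommandText = 'UPDATE dbo.FileCheck SET FileExists = @e WHERE ItemNum = @i'
[void]$update.Parameters.Add('@e', [System.Data.SqlDbType]::Bit)
[void]$update.Parameters.Add('@i', [System.Data.SqlDbType]::Int)

while ($true) {
    $end = $start + $batchSize - 1

    # Pull one batch of rows using ROW_NUMBER() (available in SQL Server 2005).
    $select = $connection.CreateCommand()
    $select.CommandText = @"
SELECT ItemNum, FilePath
FROM (SELECT ItemNum, FilePath, ROW_NUMBER() OVER (ORDER BY ItemNum) AS rn
      FROM dbo.FileCheck) AS t
WHERE rn BETWEEN $start AND $end
"@
    $table = New-Object System.Data.DataTable
    (New-Object System.Data.SqlClient.SqlDataAdapter $select).Fill($table) | Out-Null
    if ($table.Rows.Count -eq 0) { break }    # no more rows to check

    foreach ($row in $table.Rows) {
        $update.Parameters['@e'].Value = [int](Test-Path -LiteralPath $row.FilePath)
        $update.Parameters['@i'].Value = $row.ItemNum
        [void]$update.ExecuteNonQuery()
    }

    $start = $end + 1
}

$connection.Close()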

I’m not a SQL guru by any manner of means, but I’d go for option 1 if speed is of the essence.

Why not just use xcopy? If you have two directories and simply want to compare one with the other and copy the missing files, I don’t see the need to involve SQL at all when all you need is to compare file structures.