Reading a large file, tweaking it, and then writing out the results

I have a 300 MB file. I need to read the file, deduplicate its content, and edit some of the lines based on some simple criteria.

My issue is that I cannot seem to get this to complete in a sensible amount of time.

I have written some code which does the job, but it is relatively slow even with small files. Unfortunately it is not appropriate when the file gets to its full size.

I have tried using various approaches, but none are performant, so I am looking for code that can

read a file
deduplicate the content (whole lines)
edit the content of a line based on some simple criteria

The code that I have goes along the lines of
$hash = @{} $outstream = [System.IO.StreamWriter] $newfile = [System.IO.File]::ReadLines($file.FullName) | % { if ($hash.$_ -eq $null -and $_ -ne [char]34) { $lastChar = $_.SubString($_.Length - 1, 1) if ($LastChar -ne [char]34) { $a = $_ + [char]34 } else { { $a = $_ } $stream.WriteLine($a) } $hash.$_ = 1 } }
This is only a snippet of the code. I am looking for approaches / guidance on how to make this more efficient.

Anyone got any suggestions?

Tractor Boy,
Welcome to the forum. :wave:t4:

Your code seems to be broken/incomplete. That happens when you don’t use the proper formatting. Could you please edit your post and correct the formatting of the code? Simply click on the preformatted text button ( </> ) and paste the code where you’ve been told.

Thanks in advance.

It might be helpful as well when you post a small part of the input data as well. (formatted as code as well please) :wink:

Thanks for your comments. I have updated the post, but the formatting doesn’t seem to be as useful using the preformatting option.

I have tried a variety of methods to get an efficient read - edit - write; the above is just one of those attempts, and purely a guide as to the type of thing that I am doing. I was hoping someone might be able to point me in a direction that I could dig into that would make my task complete quicker.

As an idea, this code is estimated to take 3.33 hours to process the file. I can carry out those same tasks manually in a tool like Notepad++ in a few minutes, so I was hoping that there may be an alternative approach that would give improvements. 30 minutes is something that would be an acceptable time for a coded solution.


If you do it right it should look something like this

$hash = @{}
$outstream = [System.IO.StreamWriter] $newfile
[System.IO.File]::ReadLines($file.FullName) | % {
    if ($hash.$_ -eq $null -and $_ -ne [char]34) {
        $lastChar = $_.SubString($_.Length - 1, 1)
        if ($lastChar -ne [char]34) {
            $a = $_ + [char]34
        }
        else {
            $a = $_
        }
        $outstream.WriteLine($a)
        $hash.$_ = 1
    }
}
I’m just not sure if everything is like your original code.
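On the speed question: the % / ForEach-Object pipeline adds per-line overhead, so on a file that size a plain foreach loop, a HashSet for the duplicate check, and a single StreamWriter will usually run dramatically faster. A rough sketch only, assuming (from your snippet) that the “fix” is appending a missing trailing double quote, and that $file and $newfile are your input item and output path:

```powershell
# Sketch, not a drop-in: the per-line "fix" rule is a guess from the snippet.
$seen   = [System.Collections.Generic.HashSet[string]]::new()
$writer = [System.IO.StreamWriter]::new($newfile)
try {
    foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
        # HashSet.Add returns $false if the line was already seen,
        # so it records and tests membership in one call
        if ($line -and $line -ne '"' -and $seen.Add($line)) {
            if (-not $line.EndsWith('"')) { $line = $line + '"' }
            $writer.WriteLine($line)
        }
    }
}
finally {
    $writer.Dispose()   # flushes and closes the output file
}
```

The foreach statement avoids the per-item scriptblock invocation cost of ForEach-Object, and ReadLines still streams the file line by line, so memory stays low apart from the HashSet of unique lines.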

Simply click on the preformatted text button ( </> ) first and then paste your code.

Could you please share a few lines of the input file as well for us to play around a little bit?

Thanks in advance.

The input file is not relevant to my problem; it’s just some text, with lines of random length.

so something like

I guess the key is that some data is duplicated, which has to be removed, and some lines need ‘fixing’.

I have just copied the code that you provided above and pasted it, but the formatting seems to go as soon as it goes into the backticks. I am obviously doing something wrong but have no idea what that may be, sorry.

I did not provide any code at all. I just tried to show you how to post code. It is YOUR CODE. I just tried to reformat it and posted it just like it is!!! :wink:

OK, but I don’t want to create a text file big enough to play around with by typing a lot of text myself just to be able to help you with your problem. :wink: Can you understand this? So it would be nice if you could post your code and some sample data correctly, so we can copy it and try it.
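If it really comes to that, something like this can fake a throwaway sample file; the path and the shape of the data are pure guesses on my part, since you haven’t shown us any:

```powershell
# Hypothetical test-data generator: random-length lowercase lines,
# some already ending with a quote, with occasional duplicates mixed in.
$rand  = [System.Random]::new()
$lines = foreach ($i in 1..100000) {
    $len  = $rand.Next(5, 80)
    $text = -join (1..$len | ForEach-Object { [char]$rand.Next(97, 123) })
    if ($rand.Next(0, 2) -eq 1) { $text += '"' }   # some lines already end with a quote
    $text
    if ($rand.Next(0, 10) -eq 0) { $text }          # emit a duplicate now and then
}
[System.IO.File]::WriteAllLines('C:\temp\sample.txt', $lines)
```

But real sample lines from your actual file would still be much better, because the real bottleneck may depend on what your “simple criteria” actually are.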

Thanks in advance. :+1:t4: