Parsing a file with string and binary data

I have a file (WARC data) that contains a mix of string and binary data. I have to modify some of the strings without affecting the binary parts. I tried this using Get-Content, looping through each line, searching for the string I needed to edit, and using Add-Content to write the result to a file. This appears to corrupt the file (I’m not particularly surprised by this).
Going deeper, the edits I need to make are like this:

  • Locate the pattern
  • Find the “.” on that line (before next CRLF)
  • Replace the “.” and all characters to the next CRLF with “Z”

Is there a way to approach this in PowerShell? Note that some of the files will be multi-GB in size, in case that matters.

Here’s my current attempt (it includes a borrowed converter; my code starts after “#######”):

filter ConvertTo-String
{
    Param (
        [Parameter( Mandatory = $True,
                    Position = 0,
                    ValueFromPipeline = $True )]
        [ValidateScript( { -not (Test-Path $_ -PathType Container) } )]
        [String]
        $Path
    )

    $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $Path), 'Open', 'Read'

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding

    $BinaryText = $StreamReader.ReadToEnd()

    # Release the file handle once the contents are in memory
    $StreamReader.Close()
    $Stream.Close()

    Write-Output $BinaryText
}
############### my code starts here

$BinaryString = ConvertTo-String D:\Acc\hanzo2\WARCfiles\warca.warc
#$DateRegex = [Regex] '\x57\x41\x52\x43\x2d\x44\x61\x74\x65\x3a.*\.' 
$DateRegex = [Regex] 'WARC-Date:.*?\.' # lazy match, so it stops at the first dot

$DateRegex.Matches($BinaryString) | foreach {
    $curindex = $_.Index
    $curlen = $_.Length
    # Matched text without the trailing dot
    $curdatestr = $BinaryString.Substring($curindex, $curlen - 1)
    $curdot = $BinaryString.IndexOf(".", $curindex)
    $cureol = $BinaryString.IndexOf("`r`n", $curindex)
    # Number of characters between the dot and the CRLF
    $lengthdif = $cureol - $curdot - 1
    # Pad with spaces so the line keeps its original length
    $newdatestr = $curdatestr + "Z" + (" " * $lengthdif)
}

Thanks for looking


How big is the file? Get-Content reads everything into memory. If the file is rather big, you’re probably better off using another approach than Get-Content. Add-Content is the wrong cmdlet for your use case; you were probably looking for Set-Content :)
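
For what it’s worth, one possible “other approach” for multi-GB files is to copy the file in chunks, decode each chunk with the same 1-to-1 codepage (28591) used in the question, and apply a length-preserving replacement. This is only a rough sketch under some assumptions: the function name Edit-WarcDate and its parameters are made up for illustration, header lines are assumed to be shorter than -Overlap, the fraction is padded with spaces (as in the attempt above) so no byte offsets shift, and any “WARC-Date:” text that happens to occur inside a payload would be rewritten too.

```powershell
function Edit-WarcDate {
    # Hypothetical helper, not an existing cmdlet.
    param(
        [Parameter(Mandatory = $true)] [string] $Source,
        [Parameter(Mandatory = $true)] [string] $Destination,
        [int] $ChunkSize = 4MB,   # bytes read per iteration
        [int] $Overlap   = 1KB    # assumed upper bound for one header line
    )

    # Codepage 28591 maps every byte to exactly one char and back
    $latin1 = [Text.Encoding]::GetEncoding(28591)

    # Group 1: text before the first dot; group 2: text from the dot to CRLF
    $regex = [regex] '(WARC-Date:[^\r\n.]*)\.([^\r\n]*)(?=\r\n)'
    $pad = [Text.RegularExpressions.MatchEvaluator] {
        param($m)
        # Replace the dot with Z and blank out the rest, keeping the length
        $m.Groups[1].Value + 'Z' + (' ' * $m.Groups[2].Length)
    }

    # .NET does not follow the PowerShell location, so resolve the paths first
    $srcPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Source)
    $dstPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

    $in  = [IO.File]::OpenRead($srcPath)
    $out = [IO.File]::Create($dstPath)
    try {
        $buffer = New-Object byte[] $ChunkSize
        $carry  = ''
        while (($read = $in.Read($buffer, 0, $buffer.Length)) -gt 0) {
            $text = $regex.Replace($carry + $latin1.GetString($buffer, 0, $read), $pad)

            # Hold back the tail so a line split across two chunks is
            # seen in one piece on the next iteration
            $keep  = [Math]::Min($Overlap, $text.Length)
            $emit  = $text.Substring(0, $text.Length - $keep)
            $carry = $text.Substring($text.Length - $keep)

            $bytes = $latin1.GetBytes($emit)
            $out.Write($bytes, 0, $bytes.Length)
        }
        # Flush whatever is still held back
        $bytes = $latin1.GetBytes($carry)
        $out.Write($bytes, 0, $bytes.Length)
    }
    finally {
        $in.Dispose()
        $out.Dispose()
    }
}
```

Usage would look something like `Edit-WarcDate -Source .\in.warc -Destination .\out.warc` (both file names are placeholders). Because the replacement never changes the length of a line, the output should have exactly the same size as the input, which makes for an easy sanity check.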

Thanks Sebastian. I was using Add-Content since I was writing the file out line by line. Is that wrong?
I added some code above with my newer attempt.
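
On the line-by-line question, a small experiment may help show why the encoding matters for the binary parts. The sketch below pushes every possible byte value through a decode/encode round trip, once with codepage 28591 (as in the borrowed converter) and once with UTF-8; the variable names are only for illustration. Line-oriented text cmdlets also split on and rewrite line endings, which is a separate way for binary sections to get damaged.

```powershell
# All 256 possible byte values, standing in for arbitrary binary data
$bytes = [byte[]](0..255)

$latin1 = [Text.Encoding]::GetEncoding(28591)
$utf8   = [Text.Encoding]::UTF8

# Decode to a string and encode back to bytes with each encoding
$viaLatin1 = $latin1.GetBytes($latin1.GetString($bytes))
$viaUtf8   = $utf8.GetBytes($utf8.GetString($bytes))

# Compare the hex dumps of the original and round-tripped bytes
$original = [BitConverter]::ToString($bytes)
$latin1Ok = [BitConverter]::ToString($viaLatin1) -ceq $original
$utf8Ok   = [BitConverter]::ToString($viaUtf8) -ceq $original

"Latin-1 round trip preserved the bytes: $latin1Ok"
"UTF-8 round trip preserved the bytes:   $utf8Ok"
```

The Latin-1 round trip returns the bytes unchanged, while the UTF-8 one does not, because byte sequences that are not valid UTF-8 are replaced during decoding.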