I have a file (WARC data) that contains a mix of string and binary data. I have to modify some of the strings without affecting the binary parts. I tried this using Get-Content, looping through each line, searching for the string I needed to edit, and using Add-Content to write the result to a new file. This appears to corrupt the file. (I’m not particularly surprised by this.)
Going deeper, the edits I need to make are like this:
- Locate the pattern
- Find the first “.” on that line (before the next CRLF)
- Replace the “.” and all characters to the next CRLF with “Z”
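
To make the intended edit concrete, here is a minimal sketch of those three steps on a small in-memory sample. The sample header line and the trailing bytes are made up for illustration, the pattern is assumed to be the WARC-Date header, and it does nothing about the multi-GB case. Note that the result here is shorter than the input, because the replaced text is not padded.

```powershell
# Latin-1 (code page 28591) maps each byte to exactly one char and back,
# so the binary parts survive a round trip through a string.
$enc = [Text.Encoding]::GetEncoding(28591)

# Hypothetical sample: one header line followed by a few non-text bytes.
[byte[]] $bytes = $enc.GetBytes("WARC-Date: 2014-01-01T12:00:00.123456Z`r`n") + [byte[]](0x00, 0xFF, 0x80)

$text = $enc.GetString($bytes)

# On a WARC-Date line, replace the first "." and everything up to
# (but not including) the CRLF with a single "Z".
$fixed = [regex]::Replace($text, '(WARC-Date:[^\r\n.]*)\.[^\r\n]*', '${1}Z')

# Back to bytes; the bytes after the CRLF are untouched.
[byte[]] $out = $enc.GetBytes($fixed)
```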
Is there a way to approach this in PowerShell? Note that some of the files will be multi-GB in size, in case that matters.
Here’s my current attempt. (It includes a borrowed converter; my code starts after “#######”.)
filter ConvertTo-String {
    [OutputType([String])]
    Param (
        [Parameter( Mandatory = $True, Position = 0, ValueFromPipeline = $True )]
        [ValidateScript( { -not (Test-Path $_ -PathType Container) } )]
        [String] $Path
    )

    $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $Path), 'Open', 'Read'

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    Write-Output $BinaryText
}

############### my code starts here
$BinaryString = ConvertTo-String D:\Acc\hanzo2\WARCfiles\warca.warc
$BinaryString

#$DateRegex = [Regex] '\x57\x41\x52\x43\x2d\x44\x61\x74\x65\x3a.*\.'
$DateRegex = [Regex] 'WARC-Date:.*\.' #matches up to first dot

$DateRegex.Matches($BinaryString) | foreach {
    $curindex = $_.index
    $curlen = $_.length
    $curdatestr = $BinaryString.Substring($curindex, $curlen - 1)
    $curdot = $BinaryString.IndexOf(".", $curindex)
    $cureol = $BinaryString.IndexOf("`r`n", $curindex)
    $lengthdif = $curEOL - $curdot - 1
    $newdatestr = $curdatestr + "Z" + (" " * $lengthdif)
    $Binarystring.Replace($BinaryString, $newdatestr, 1, $curindex)
}
Thanks for looking
\Greg