I have a file (WARC data) that contains a mix of string and binary data. I need to modify some of the strings without affecting the binary parts. I tried reading the file with Get-Content, looping through each line, searching for the string I needed to edit, and using Add-Content to write the result to a new file. This appears to corrupt the file. (I’m not particularly surprised by this.)
In more detail, the edits I need to make are like this:
- Locate the pattern (“WARC-Date:” in my case)
- Find the first “.” on that line (before the next CRLF)
- Replace the “.” and all characters up to the next CRLF with “Z”
Is there a way to approach this in PowerShell? Note that some of the files will be multi-GB in size, in case that matters.
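To make the intended edit concrete, here is a minimal sketch of those three steps on a small in-memory sample. The header text, the payload bytes and the variable names are made up for illustration, and it assumes the data fits in memory, which will not hold for the multi-GB files.

```powershell
# Hypothetical sample: a WARC-style header followed by a few raw bytes (not a real record)
$Latin1 = [Text.Encoding]::GetEncoding(28591)    # 1-to-1 char-to-byte mapping
$Header = "WARC/1.0`r`nWARC-Date: 2013-05-01T12:34:56.789Z`r`nContent-Length: 4`r`n`r`n"
[byte[]] $Payload = 0x00, 0xFF, 0x2E, 0x80       # binary part; 0x2E is a '.' that must not change
[byte[]] $Bytes = $Latin1.GetBytes($Header) + $Payload

# Decode the bytes to a string without losing any of them
$Text = $Latin1.GetString($Bytes)

# Group 1 keeps everything from "WARC-Date:" up to (not including) the first dot on the line.
# The dot and the rest of the line are then replaced with "Z"; the CRLF itself is left alone.
$Pattern = '(WARC-Date:[^.\r\n]*)\.[^\r\n]*(?=\r\n)'
$Edited = [regex]::Replace($Text, $Pattern, '${1}Z')

# Encode back to bytes with the same encoding
[byte[]] $EditedBytes = $Latin1.GetBytes($Edited)
```

Note that this shortens the line, so every byte after it moves. If the offsets of the records need to stay the same, the “Z” would have to be padded out to the original length instead.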
Here’s my current attempt. (It includes a borrowed converter; my code starts after “#######”.)
filter ConvertTo-String
{
    [OutputType([String])]
    Param (
        [Parameter( Mandatory = $True,
                    Position = 0,
                    ValueFromPipeline = $True )]
        [ValidateScript( { -not (Test-Path $_ -PathType Container) } )]
        [String]
        $Path
    )
    $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $Path), 'Open', 'Read'
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    Write-Output $BinaryText
}
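As a quick sanity check of the 1-to-1 mapping that the converter relies on, the following sketch round-trips all 256 byte values through codepage 28591. The variable names are my own and it only tests the encoding, not the converter itself.

```powershell
# Round-trip every possible byte value through codepage 28591 (ISO-8859-1)
$Latin1 = [Text.Encoding]::GetEncoding(28591)
[byte[]] $AllBytes = 0..255

$AsString  = $Latin1.GetString($AllBytes)    # one char per byte
[byte[]] $RoundTrip = $Latin1.GetBytes($AsString)

# Compare the two arrays value by value
$Same = ($AllBytes -join ',') -eq ($RoundTrip -join ',')
```

If $Same is true, a string produced by the converter can be turned back into the original bytes with GetBytes on the same encoding, provided the edits only touch the text parts.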
############### my code starts here
$BinaryString = ConvertTo-String D:\Acc\hanzo2\WARCfiles\warca.warc
$BinaryString
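#$DateRegex = [Regex] '\x57\x41\x52\x43\x2d\x44\x61\x74\x65\x3a.*\.'
$DateRegex = [Regex] 'WARC-Date:[^.\r\n]*\.' # matches up to the first dot on the line
$DateRegex.Matches($BinaryString) | ForEach-Object {
    $curindex = $_.Index
    $curlen = $_.Length
    # The match without its trailing dot
    $curdatestr = $BinaryString.Substring($curindex, $curlen - 1)
    # Ordinal searches, so the binary characters are compared by code point
    $curdot = $BinaryString.IndexOf(".", $curindex, [StringComparison]::Ordinal)
    $cureol = $BinaryString.IndexOf("`r`n", $curindex, [StringComparison]::Ordinal)
    # Characters between the dot and the CRLF, minus one for the "Z"
    $lengthdif = $cureol - $curdot - 1
    # Pad with spaces so the line keeps its original length
    $newdatestr = $curdatestr + "Z" + (" " * $lengthdif)
    # String.Replace has no (old, new, count, index) overload, so splice the new text in
    $BinaryString = $BinaryString.Remove($curindex, $newdatestr.Length).Insert($curindex, $newdatestr)
}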
Thanks for looking
\Greg