I happen to be the author of the article Don linked to, and came across this article on one of the rare occasions when I check the Google WebMaster tools. Thanks for the “acknowledgement”. My reply is a bit late, however.
First, I’ll start by saying that this is quite an imprecise request for help. The data posted shows severe inconsistencies, such as C:\test\test2\somefile2.txt appearing three times, with less and less data each time. First the comment is missing, then the “Created” field is missing. This is not even commented on? But worry not, I have accounted for this broken behaviour as well, at least partially… Named captures are indeed handy here. Such files will simply get duplicate entries, with less info for the “broken” occurrences. I made some notes in the code comments about what you might want to do instead.
Second, I made quite a few assumptions when writing this code and regexp. One being that there will always be a “Path” and “File” entry. Another being that there are no comments with two newlines in a row, since I split on multiple newlines. That last one might bite you, I suspect. If so, keep in mind that you should try to anchor on “Path:”, which seems like the safe choice given your (quite broken ;-)) test data.
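To make the “anchor on Path:” idea a bit more concrete, here is a minimal sketch of how the splitting could be done. The sample text and the variable names $Sample and $Blocks are made up for illustration, and I am assuming that no comment contains a line that itself starts with “Path:”.

```powershell
# Hypothetical sample: the first comment contains a blank line, which would
# break a split on "\n{2,}".
$Sample = "Path: C:\demo`nFile: a.txt`nComment (me):`nline one`n`nline two`n`nPath: C:\demo2`nFile: b.txt"

# Split with a zero-width lookahead so that every block starts at a line
# beginning with "Path:". The split itself consumes no text.
$Blocks = [regex]::Split($Sample, '(?m)(?=^[ \t]*Path:)') |
    ForEach-Object { $_.Trim() } |   # drop surrounding whitespace/newlines
    Where-Object { $_ }              # drop the empty leading element

$Blocks.Count
```

Each element of $Blocks could then be fed to the same -imatch as in the code further down; the blank line inside the comment stays within its own block.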
The text file I’m attaching is based on CSV generated from parsing the initial poster’s exact data as pasted from the web browser into notepad.exe. Another thing I made sure to do, despite not accounting for multiple newlines in a row in a comment, is to handle potential multi-line comments; they will be joined with a semicolon in the comment field.
The regexp could have been written differently; playing with “.”, (?s), “\r\n”, “.+?”, etc. gives you many different ways to make it work. I chose something halfway sane here. Initially I wrote it without stripping out \r, but stripping it turned out to be a lot easier than the other options.
Anyway, here’s the code I wrote that works for all your test data, pasted verbatim from the browser into notepad:
Set-StrictMode -Off
# -Raw already returns the whole file as one string, so no -join is needed.
$SingleLineData = Get-Content -Raw -Path E:\temp\old-output.txt
# This makes things a lot easier in the regex later (strip out \r).
$SingleLineData = $SingleLineData -replace '\r', ''
# Consider broken data with repeated elements... Use hash keys that represent path + file.
# Couldn't be bothered now... To handle the utterly broken test data, and the possibility
# of a path+file without "Created" and "Comment" appearing before an entry with one or both
# of those, you'd want to store an object in the hash and to look up to see if properties
# are already populated using some if statements. Won't bother.
#$AlreadyDone = @{}
# Probably should consider the case of comments with two newlines in a row.
# That would require different logic that anchors on "Path:" which seems safe
# even given this utterly broken test data...
# I copied and pasted the output from my browser to notepad, and the test data had
# \r\n for some newlines. I figure this might be an artifact carried over from the
# actual output data, so beware of that.
#@(
foreach ($BlockText in $SingleLineData -split "\n{2,}") {
    if ($BlockText -imatch "^\s*Path:\s+(?'Path'[^\n]+)\s*(?:Created:\s+(?'Created'[^\n]+))?\s*File:\s+(?'File'[^\n]+)\s*(?s:Comment\s+\([^)]+\)\s*:[^\n]*\n(?'Comment'.+))?") {
        New-Object PSObject -Property @{
            Path = $Matches['Path']
            Created = $Matches['Created']
            File = $Matches['File']
            Comment = $Matches['Comment'] -replace '[\r\n]+', ';'
        }
    }
}
#) | Export-Csv -Encoding UTF8 -NoType -Path old-output.csv
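For the repeated path+file entries mentioned in the code comments, a rough sketch of the hash-based merge could look something like the following. This is not part of the code I ran; the records, their values and the names $Records and $Merged are hypothetical, and I am assuming that the first non-empty value seen for “Created” or “Comment” is the one you want to keep.

```powershell
# Hypothetical parsed records for one file, "broken" occurrences first.
$Records = @(
    [pscustomobject]@{ Path = 'C:\demo'; File = 'demo.txt'; Created = $null;           Comment = $null }
    [pscustomobject]@{ Path = 'C:\demo'; File = 'demo.txt'; Created = 'created-value'; Comment = $null }
    [pscustomobject]@{ Path = 'C:\demo'; File = 'demo.txt'; Created = 'created-value'; Comment = 'comment-value' }
)

# Keys of a @{} hashtable are case-insensitive, which suits Windows paths.
$Merged = @{}
foreach ($Record in $Records) {
    $Key = '{0}|{1}' -f $Record.Path, $Record.File
    if (-not $Merged.ContainsKey($Key)) {
        $Merged[$Key] = $Record
        continue
    }
    # Fill in only the properties that are still empty on the stored object.
    foreach ($Name in 'Created', 'Comment') {
        if (-not $Merged[$Key].$Name -and $Record.$Name) {
            $Merged[$Key].$Name = $Record.$Name
        }
    }
}

$Merged.Values
```

With this in place you would collect the objects from the loop first and pipe $Merged.Values to Export-Csv, rather than exporting the objects directly.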
Cheers.
-Joakim