Out-String / select-string formatting

Hello guys,

Okay, this is probably a long shot, but I want to understand how it works.
I have a bunch of files that are only partially in XML format. Since they are not real XML files, Select-Xml and related cmdlets are not viable options.
However, I am looking for some “elements” in the files, and for this I am using the following script:

$path = "C:\Users\*.xml"
$out = "C:\Users\out.txt"

Get-ChildItem $path |
    Select-String -Pattern "<status>", "<logicalIdentifier>" | Out-File $out

This will give me the following result in the out.txt:

file1.xml:11:            <logicalIdentifier>1234-YYYMMDD</logicalIdentifier>
file1.xml:12:            <status>Accepted</status>
file2.xml:11:            <logicalIdentifier>2345-YYYMMDD</logicalIdentifier>
file2.xml:12:            <status>Accepted</status>
file3.xml:11:            <logicalIdentifier>3456-YYYMMDD</logicalIdentifier>
file3.xml:12:            <status>Accepted</status>
file4.xml:11:            <logicalIdentifier>4567-YYYMMDD</logicalIdentifier>
file4.xml:12:            <status>Accepted</status>

How can I format this so that:

  • it lists only the matching lines, without the location (the line number in the file)?
  • it does not display the “xml tags”? As these are not really XML files, I cannot refer to them as nodes/child nodes.

I was also experimenting with storing the values in an array, but the scriptlet gives me an empty array:

Get-ChildItem $path |
    Select-String -Pattern "<status>", "<logicalIdentifier>"  | 

 Foreach-Object {
        $first, $last = $_.Matches[0].Groups['<logicalIdentifier>', '<status>'].Value
        [PSCustomObject] @{
            ID = $first
            Status = $last
                  }
    }

I’m not sure if I got it right … try this:

Select-String -Path 'C:\Users\*.xml' -Pattern '<status>', '<logicalIdentifier>' | 
ForEach-Object {
    [PSCustomObject]@{
        FileName = $_.Filename
        Pattern  = $_.Pattern
        Match    = ($_.Line -replace '\<.+?\>').Trim()
    }
}
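
To see what the -replace in that snippet does, here is a minimal sketch on a single hard-coded line. The sample text is copied from the output in the question; $line and $value are illustrative names only. The pattern \<.+?\> lazily matches each tag, from a < to the nearest >, and -replace without a replacement string deletes every match:

```powershell
# Sample line as Select-String would expose it in $_.Line (leading spaces included)
$line = '            <logicalIdentifier>1234-YYYMMDD</logicalIdentifier>'

# -replace with only a pattern removes every match; .+? stops at the nearest '>'
$value = ($line -replace '\<.+?\>').Trim()

$value   # 1234-YYYMMDD
```

Trim() is only there to drop the indentation that is left over once the tags are gone.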

This is actually brilliant. While I am still digesting what magic was done with the RegEx replacement, could you advise how to remove the duplicates in the filename “column”?
I tried

FileName = $_.Filename | Get-Unique

and

FileName = $_.Filename | sort -Unique

so far, but both give the same results:
[Screenshot 2022-07-07 174250]

So basically one filename per pair of patterns would be enough.

There are no duplicates. You have two separate independent patterns. Each pattern will produce a match. So you have in one file one match for the pattern <logicalIdentifier> and one match for the pattern <status>.

Depending on what you need the data for you could use

to summarize the results.
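
The cmdlet named in that reply is not shown above. As one possible way to summarize per file, and purely as an assumption about what might fit, Group-Object can collapse the two rows of each file into one. The $results array below is hypothetical sample data shaped like the output of the Select-String pipeline:

```powershell
# Hypothetical sample of the objects the Select-String pipeline produces
$results = @(
    [PSCustomObject]@{ FileName = 'file1.xml'; Pattern = '<logicalIdentifier>'; Match = '1234-YYYMMDD' }
    [PSCustomObject]@{ FileName = 'file1.xml'; Pattern = '<status>';            Match = 'Accepted' }
    [PSCustomObject]@{ FileName = 'file2.xml'; Pattern = '<logicalIdentifier>'; Match = '2345-YYYMMDD' }
    [PSCustomObject]@{ FileName = 'file2.xml'; Pattern = '<status>';            Match = 'Accepted' }
)

# One group per file name; pick each pattern's value out of the group
$summary = $results | Group-Object -Property FileName | ForEach-Object {
    [PSCustomObject]@{
        FileName = $_.Name
        ID       = ($_.Group | Where-Object Pattern -eq '<logicalIdentifier>').Match
        Status   = ($_.Group | Where-Object Pattern -eq '<status>').Match
    }
}

$summary
```

This assumes each file contains exactly one match per pattern; with several matches per file, ID and Status would each hold an array.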


Here are two other options. I think what you’re trying to get is an object with the filename, the ID, and the status. Please correct me if I am incorrect.

$path = "C:\Users\*.xml"
$out = "C:\Users\out.txt"

$pattern = "(?s)<logicalIdentifier>(?<Identifier>.+?)(?=<).+?<status>(?<Status>.+?)(?=<)"

foreach($file in Get-ChildItem -Path $path){
    $content = Get-Content $file.FullName -Raw

    [regex]::Matches($content,$pattern).captures | ForEach-Object {
        [PSCustomObject]@{
            FileName   = $file.FullName
            Identifier = $_.groups['Identifier'].value
            Status     = $_.groups['Status'].value
        }
    }
}

or

$path = "C:\Users\*.xml"
$out = "C:\Users\out.txt"

foreach($file in Get-ChildItem -Path $path){
    switch -Regex -File $file.fullname {
        '<logicalIdentifier>(?<Identifier>.+?)(?=<)' {
            $id = $matches.Identifier
        }

        '<status>(?<Status>.+?)(?=<)' {
            [PSCustomObject]@{
                FileName   = $file.FullName
                Identifier = $id
                Status     = $matches.Status
            }
        }
    }
}

Note that both of these options depend on the ID coming before the status.
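
A small sketch of why the order matters in the switch variant, run against an in-memory array instead of a file. The $lines array and the "Rejected" value are made up for illustration. The status branch emits whatever $id was remembered last, so an ID that followed its status would be paired with the wrong record:

```powershell
# Hypothetical lines standing in for a file; switch -File walks a file line by line the same way
$lines = @(
    '<logicalIdentifier>1234-YYYMMDD</logicalIdentifier>'
    '<status>Accepted</status>'
    '<logicalIdentifier>2345-YYYMMDD</logicalIdentifier>'
    '<status>Rejected</status>'
)

$id = $null
$pairs = switch -Regex ($lines) {
    # Remember the most recent identifier; the assignment itself outputs nothing
    '<logicalIdentifier>(?<Identifier>.+?)(?=<)' { $id = $Matches.Identifier }

    # Emit a record only when a status line is reached
    '<status>(?<Status>.+?)(?=<)' {
        [PSCustomObject]@{ Identifier = $id; Status = $Matches.Status }
    }
}

$pairs
```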


Hello krzydoug, and thank you for sharing your ideas.
I ran both scripts, and indeed they also give what I was looking for.
Just a few things to clarify for me:

  1. In the first one, the pattern contains both the ID and the status, handled as one string, am I correct? How does the script separate these?
  2. Identifier = $_.groups['Identifier'].value: does that mean that the “Identifier” column displays the RegEx value of the relevant pattern?
  3. In this part in the second script:

You are using a subexpression ‘(?=<)’, but I am not sure why. You already declared where to look: (?<Status>.+?). I am really not familiar with the regex approach.

I used the Matches method of the [regex] class, which finds all occurrences of the pattern in the text.

I used named “capture groups” for readability. The script iterates over the captures, takes each one's Groups collection (which works like a dictionary), and pulls the ‘Identifier’ entry from it. It only has that name because of the named capture group (?<Identifier>.+?) in the pattern.
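
As a rough illustration of how one pattern yields two separate values, here is the same pattern run against a hard-coded string. The $content text, including its <item> wrapper and the "Rejected" value, is invented for the example. Each match covers one ID/status pair, and the two named groups are read back by name:

```powershell
# Hypothetical in-memory text standing in for one file read with Get-Content -Raw
$content = @'
<item>
    <logicalIdentifier>1234-YYYMMDD</logicalIdentifier>
    <status>Accepted</status>
</item>
<item>
    <logicalIdentifier>2345-YYYMMDD</logicalIdentifier>
    <status>Rejected</status>
</item>
'@

# (?s) lets . cross line breaks, so one match can span from the ID line to the status line
$pattern = '(?s)<logicalIdentifier>(?<Identifier>.+?)(?=<).+?<status>(?<Status>.+?)(?=<)'

$rows = foreach ($m in [regex]::Matches($content, $pattern)) {
    [PSCustomObject]@{
        Identifier = $m.Groups['Identifier'].Value   # text captured by (?<Identifier>...)
        Status     = $m.Groups['Status'].Value       # text captured by (?<Status>...)
    }
}

$rows
```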

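The pattern matches <status> and then one or more characters (.+), but as few as possible (?); combined, that is .+?. A lazy quantifier needs something after it to tell it where to stop: with nothing following, .+? would capture just a single character, while a greedy .+ would run on to the end of the line. The lookahead (?=<) says: keep matching until the next character is <, without including that < in the match.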
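
The effect of the stopping marker is easiest to see side by side. The following is a minimal sketch on one sample line; the variable names are illustrative:

```powershell
$line = '<status>Accepted</status>'

# Greedy: .+ runs to the end of the line, so the closing tag is captured too
$greedy = if ($line -match '<status>(?<Status>.+)') { $Matches.Status }

# Lazy with nothing after it: .+? is already satisfied by a single character
$lazyOnly = if ($line -match '<status>(?<Status>.+?)') { $Matches.Status }

# Lazy plus lookahead: expand until the next character is '<', without consuming it
$lazyStop = if ($line -match '<status>(?<Status>.+?)(?=<)') { $Matches.Status }

$greedy     # Accepted</status>
$lazyOnly   # A
$lazyStop   # Accepted
```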

I encourage you to check out this regex demo and also use this site to learn more about regex. Regex can be used almost everywhere, so it is useful outside of PowerShell as well.
