Extracting and grouping text from file

Obligatory newbie warning here!!

Could someone provide some pointers that would allow me to extract 3 values that are grouped together from a .txt file?

The contents of the text file will be multiple email notifications, and the 3 values will need to be grouped together by email, i.e. Email 1 = Value 1, Value 2, Value 3. The values can be identified in each email by a preceding word, e.g. “Customer Id:”, but other than that, they sit in a single string and are not delimited in any way.

The most progress I’ve made so far is with regex.matches($my_text).value, where “regex” refers to an expression that identifies one of the values I need, but I don’t really understand how it works, so I’m struggling to develop it further.

Thanks
Carl

Welcome to the forum. :wave:t4:

Your description is a little vague. I think it would be helpful if you posted some sample data and, ideally, the code you have so far.

There are some cmdlets available for tasks like this. I’d recommend reading the complete help, including the examples, for Select-String.

It should be the best way to extract the text you’re after. Then you could use

To group the result in the required way.

Regardless of that: when you post code, error messages, console output or sample data, please format it as code using the preformatted text button ( </> ).

Thanks in advance

Hi Olaf

Thank you for the response. I had initially tried Select-String with a -Pattern parameter, but it would return the full string, rather than just the values I wanted from the String.

For example, the text below is an example of a text file, containing 2 strings, each containing the 3 values I want.

Lorem ipsum dolor sit amet consectetur adipiscing House No: 1234567 elit Aliquam dapibus congue Street No: 6845234 arcu sed fringilla Duis id ligula vel risus tristique mattis suscipit Customer Id, 8-7654123 et metus In ipsum lectus faucibus non lacus

Lorem ipsum dolor sit amet consectetur adipiscing House No: 9876543 elit Aliquam dapibus congue Street No: 2796481 arcu sed fringilla Duis id ligula vel risus tristique mattis suscipit Customer Id, 8-6684523 et metus In ipsum lectus faucibus non lacus

The values I want from that string are just the numbers following “House No:”, “Street No:”, and “Customer Id”.

The closest I’ve got so far is:

$text = Get-Content -Path textfile.txt

$regex = [regex] "House No: (.{7})"

$regex.Matches($text).Value

This returns:

House No: 1234567
House No: 9876543

Which is definitely progress, but I have got stuck on a couple of aspects. One, why is it working without a ForEach? Two, how do I get the other values in there (Street No and Customer Id) and output them in a readable format?

Thanks
Carl

It actually returns much, much more than that. Try piping the output to Select-Object * and you will see what I mean. :wink:
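As a rough illustration of both points, here is a minimal sketch. It assumes two shortened sample lines standing in for what Get-Content returns; the variable name $found is made up for the example. When an array is handed to Matches(), PowerShell converts it into one single string (the elements are joined with a space by default), which is why no loop is needed. Each result is a Match object that carries more than just .Value.

```powershell
# Two shortened sample lines standing in for Get-Content output (an array of strings)
$text = @(
    'adipiscing House No: 1234567 elit Street No: 6845234 arcu'
    'adipiscing House No: 9876543 elit Street No: 2796481 arcu'
)

$regex = [regex] 'House No: (.{7})'

# Matches() expects a single string, so PowerShell joins the array elements
# (with a space by default) and the regex scans that one string.
$found = $regex.Matches($text)

$found.Count                                    # matches found across both lines
$found | Select-Object -Property *              # Index, Length, Groups, Value, ...
$found | ForEach-Object { $_.Groups[1].Value }  # only the captured 7 characters
```

The last line shows that the part inside the parentheses is available separately as group 1, so the "House No: " prefix does not have to be part of the output.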

In your case where you’re looking for more than one match per line it might be easier for a beginner to take a little more procedural approach. You read the file or input text line by line and treat each single line with three separate regex patterns … something like this:

$InputText = @'
Lorem ipsum dolor sit amet consectetur adipiscing House No: 1234567 elit Aliquam dapibus congue Street No: 6845234 arcu sed fringilla Duis id ligula vel risus tristique mattis suscipit Customer Id, 8-7654123 et metus In ipsum lectus faucibus non lacus
Lorem ipsum dolor sit amet consectetur adipiscing House No: 9876543 elit Aliquam dapibus congue Street No: 2796481 arcu sed fringilla Duis id ligula vel risus tristique mattis suscipit Customer Id, 8-6684523 et metus In ipsum lectus faucibus non lacus
'@

$InputText -split '\n' |
    ForEach-Object {
        [PSCustomObject]@{
            HouseNo    = $($_ -match '(?<=House\sNo:\s)(\d{7})' | Out-Null ; $Matches[1])
            StreetNo   = $($_ -match '(?<=Street\sNo:\s)(\d{7})' | Out-Null ; $Matches[1])
            CustomerID = $($_ -match '(?<=Customer\sId,\s)(\d-\d{7})' | Out-Null ; $Matches[1])
        }
    }

The output will look like this:

HouseNo StreetNo CustomerID
------- -------- ----------
1234567 6845234  8-7654123
9876543 2796481  8-6684523

But that does not mean it would be impossible to use Select-String. The regex pattern would just be a little more complex. :wink:

$Pattern = '(?<=House\sNo:\s)(\d{7}).*(?<=Street\sNo:\s)(\d{7}).*(?<=Customer\sId,\s)(\d-\d{7})'

$InputText -split '\n' |
    Select-String -Pattern $Pattern |
    ForEach-Object {
        # .Value returns the captured text instead of the whole Group object
        [PSCustomObject]@{
            LineNumber = $_.LineNumber
            HouseNo    = $_.Matches[0].Groups[1].Value
            StreetNo   = $_.Matches[0].Groups[2].Value
            CustomerID = $_.Matches[0].Groups[3].Value
        }
    }

And in this case the output looks like this:

LineNumber HouseNo StreetNo CustomerID
---------- ------- -------- ----------
         1 1234567 6845234  8-7654123
         2 9876543 2796481  8-6684523

You just have to come up with a strategy for lines where only one or two of the patterns match. :wink: In such cases you would get either no result (with the single combined pattern) or potentially wrong results (with the three separate -match calls), because the automatic variable $Matches is only updated when the -match operator returns $true, so it still holds the value from the previous successful match. :point_up_2:t4: :thinking:
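One possible strategy, sketched here under some assumptions: instead of relying on $Matches, use the static [regex]::Match() method and check its Success property, so a line without a match yields $null rather than a leftover value. Get-FirstGroup is a hypothetical helper name invented for this example, and the two sample lines are shortened stand-ins for the real data. Note that [regex]::Match() is case-sensitive by default, unlike the -match operator.

```powershell
# Hypothetical helper: returns the first capture group, or $null when the
# pattern does not match, so no stale value can leak into the result.
function Get-FirstGroup {
    param(
        [string] $Text,
        [string] $Pattern
    )
    $m = [regex]::Match($Text, $Pattern)
    if ($m.Success) { $m.Groups[1].Value } else { $null }
}

# Shortened sample data; the second line deliberately lacks two of the values
$lines = @(
    'foo House No: 1234567 bar Street No: 6845234 baz Customer Id, 8-7654123 end'
    'foo House No: 9876543 bar, no street or customer id on this line'
)

$result = foreach ($line in $lines) {
    [PSCustomObject]@{
        HouseNo    = Get-FirstGroup -Text $line -Pattern 'House\sNo:\s(\d{7})'
        StreetNo   = Get-FirstGroup -Text $line -Pattern 'Street\sNo:\s(\d{7})'
        CustomerID = Get-FirstGroup -Text $line -Pattern 'Customer\sId,\s(\d-\d{7})'
    }
}
$result
```

With this approach the second object has an empty StreetNo and CustomerID instead of repeating the values from the first line.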


Hi Olaf

That’s all really helpful, thank you :+1:

What do I do about the lines that don’t contain any matches, but are screwing up my results?

In my actual text file, because the contents are saved from an email, there are many additional lines of text that don’t contain information I want i.e. each email generates 15 lines of text, only the last one contains the information I need.

You have to be more specific. Which approach are you using, and how are your results screwed up?

When you use Select-String and there’s no match it will not return anything.
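A small sketch of that behaviour, assuming a made-up four-line "email" in which only one line contains a value (the header lines and the variable name $hits are invented for the example):

```powershell
# Invented sample: only the third line contains the value we are after
$lines = @(
    'From: someone@example.com'
    'Subject: Notification'
    'foo House No: 1234567 bar'
    'Kind regards'
)

# @() makes sure we get an array even when there is only one hit
$hits = @($lines | Select-String -Pattern 'House\sNo:\s(\d{7})')

$hits.Count                             # lines that matched
$hits[0].LineNumber                     # position of the matching line in the input
$hits[0].Matches[0].Groups[1].Value     # the captured digits
```

The three lines without a match simply produce no output object, so they do not need to be filtered out beforehand.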

What does your text file look like? Do you have the text from multiple emails in one file, or do you have 1 email per file but multiple files?

If you use the foreach loop approach, you have to make sure that the automatic variable $Matches gets removed after each match.

Sorry, I was using your first suggestion:

$InputText -split '\n' |
    ForEach-Object {
        [PSCustomObject]@{
            HouseNo    = $($_ -match '(?<=House\sNo:\s)(\d{7})' | Out-Null ; $Matches[1])
            StreetNo   = $($_ -match '(?<=Street\sNo:\s)(\d{7})' | Out-Null ; $Matches[1])
            CustomerID = $($_ -match '(?<=Customer\sId,\s)(\d-\d{7})' | Out-Null ; $Matches[1])
        }
    }

If Select-String excludes the strings that don’t match, I guess I should be using your second suggestion then?

The text file is multiple emails in one file. The number will vary, one day it might be 3, one day it might be 300 if something breaks.

Ah, so that would explain why it returns the last matching value found when a line has no matching values? How would I remove it then? And would this prevent it from running on the lines where there are no matches?

If the lines you’re after ALWAYS have all three patterns then this might be a good idea. :wink: :+1:t4:

You’re right. :+1:t4:

Hmmm … you are allowed to try to solve some minor problems by yourself from time to time!!! :wink: If you want to remove a variable you could use Remove-Variable.

If you’re unsure whether there’s a cmdlet for the task you want to achieve, you could try to find it with Get-Command.

Please read the complete help including the examples to learn how to use it.
In this special case I would have used it like this:

Get-Command -Noun *variable*

I’d expect that.

Solutions based on regular expressions depend on the input text being uniform and consistent. If one of the numbers has 8 digits instead of 7, the pattern would still match but you wouldn’t get the last digit in your results. If a colon, a comma or a white space changes, your patterns will not match anymore. You should keep that in mind and check the process and the results on a regular basis.
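To make the 8-digit case concrete, here is a short sketch. The sample string is invented, and the stricter pattern with a trailing \b (word boundary) is only one possible way to guard against it, not something taken from the code above:

```powershell
$sample = 'House No: 12345678 elit'   # eight digits instead of the expected seven

# Original-style pattern: still matches, but silently drops the eighth digit
$loose = [regex]::Match($sample, '(?<=House\sNo:\s)(\d{7})')
$loose.Value      # 1234567

# Stricter variant: \b requires the number to end after seven digits,
# so the unexpected eight-digit value is not matched at all
$strict = [regex]::Match($sample, '(?<=House\sNo:\s)(\d{7})\b')
$strict.Success   # False
```

Whether a silent truncation or a missing result is preferable depends on the data, but a missing result is usually easier to notice when checking the output.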

Hi Olaf

I’m trying to implement your Select-String suggestion below:

I’m assuming that I need to replace $Pattern with my own regex

'(?<=House\sNo:\s)(\d{7})'

Which works perfectly by itself. It returns only 2 lines in the output with the correct values, but how do I then add the other 2 regexes? I’ve looked online and it appears that I only need to separate them with a comma, but when I then run the script, it either stops providing an output altogether or still returns just the first value (HouseNo).

Do not describe it, show it. :wink: In the best case you post the code exactly as you use it (if it does not contain sensitive information).

I just noticed that I forgot to post the regex pattern I used along with my code suggestion. I added it in my answer above.
Instead of 3 separate regex patterns I used a big one and used the grouping feature to separate them in the output.
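A minimal sketch of how the groups are numbered, using the combined pattern and a shortened, made-up sample line: every pair of capturing parentheses gets a number based on its position from left to right, and group 0 is always the complete match. The named-group variant at the end is an optional alternative, not part of the suggestion above.

```powershell
$line    = 'x House No: 1234567 y Street No: 6845234 z Customer Id, 8-7654123 end'
$pattern = '(?<=House\sNo:\s)(\d{7}).*(?<=Street\sNo:\s)(\d{7}).*(?<=Customer\sId,\s)(\d-\d{7})'

$m = [regex]::Match($line, $pattern)

$m.Groups[0].Value   # the whole match, from the house number to the customer id
$m.Groups[1].Value   # 1st pair of capturing parentheses: 1234567
$m.Groups[2].Value   # 2nd pair of capturing parentheses: 6845234
$m.Groups[3].Value   # 3rd pair of capturing parentheses: 8-7654123

# Named groups (?<Name>...) avoid having to count parentheses
$named = '(?<=House\sNo:\s)(?<HouseNo>\d{7}).*(?<=Street\sNo:\s)(?<StreetNo>\d{7})'
$n = [regex]::Match($line, $named)
$n.Groups['HouseNo'].Value
$n.Groups['StreetNo'].Value
```

The lookbehinds (?<=...) do not count as groups, because they only assert what comes before the digits and do not capture anything.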

Olaf, you’re the best. Thank you, I’ve now got it working.

In an effort to better understand it then, does each regex created refer then to a numbered group based on their position?

The more I look at it, the less I understand… :face_with_raised_eyebrow:

I need to do some reading!

Thanks again,
Carl

That’s always a good idea. :+1:t4: :wink:

What a pity. It should actually be the other way around. :stuck_out_tongue_winking_eye:

You’re totally right. :+1:t4: Regular expressions can be overwhelming for beginners, but it’s worth keeping at it. They’re powerful once you know them a little bit. I don’t speak regex fluently either. If I have to look something up, I usually take a look at this site first:

And a great place to test and play is this: