regex

I need to extract server names and KB numbers from a PDF document. I know regex can handle this, but I am regex-illiterate. Was wondering if someone could lend a hand.

The PDF is broken into sections, one for each server, and each section lists that server's missing KBs before it moves to the next server. I’d like to make a script to extract the info as plain text so I can use PowerShell to test if the KBs are installed/available and what options they have. That last part I can handle myself.

So if I copy/paste the PDF text into Notepad, it looks like this:

Hostname: server-123.domain-123.com IP: 158.39.128.12 OS: Microsoft Windows Server 2008 R2 Service Pack 1 Critical Microsoft XML Parser (MSXML) and XML Core Services Unsupported Critical MS15-­‐034: Vulnerability in HTTP.sys Remote Code Execution (3042553) Critical MS15-­‐034: Vulnerability in HTTP.sys Remote Code Execution (3042553) (uncredentialed check) High Microsoft Windows Unquoted Service Path Enumeration High MS KB2269637: Insecure Library Loading Could Allow ....

You can see the paste action puts a CR after every word... annoying. Notice how sometimes the number alone is enclosed in parentheses, as in (#####), but sometimes it is written as KB##### as well.

And I want it to look like this (I can worry about duplicates later):

server-123.domain-123.com
KB3042553
KB3042553
KB2269637

The “Hostname: xxxxxxxx.domain-123.com” will always be the same throughout the document, and can be used to mark a new server (section). I would eventually like the output into a PS object that has server name and an array of associated (unique) KB#s.

Unless someone has some cool new way of reading PDF files in PowerShell, your only real option is to look into iTextSharp.dll.

It’s briefly discussed here, and a simple Google search turns up a bit more.

http://stackoverflow.com/questions/15684699/how-to-parse-pdf-content-to-database-with-powershell

Unfortunately PowerShell cannot natively read PDF documents.

That being said if you just port this over to a txt file you can do something like this:

Get-Content C:\textfile.txt | Select-String '\w\.domain-123\.com', 'KB\d+'
From there you could pipe this to Out-File accordingly.

That’s cool. I can work around by pasting it into notepad just like in my example.
The real question is the regex problem.

OK, to prove I’m not lazy I figured out this much of it.

$t = gc .\cimc.txt
$regex = '([^A-Z]\d{6,8})|KB\d{6,8}'
$t -match $regex

(3059317)
KB2269637:
(3057839)
3009008:
(3035132)

How do I get rid of the parentheses and the colons? I just want the numbers, but the parentheses were used to qualify the numbers within.
I thought another line of code and a simple ‘\d’ would do it but I guess I was wrong.

Try this:

(gc .\cimc.txt | Select-String '\d{6,8}|KB\d{6,8}:' -AllMatches).Matches.Value
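
For anyone who wants only the digits, or who is stuck on PowerShell 2.0 where the .Matches.Value shorthand on a collection is not available, here is a rough longhand sketch of the same idea. It assumes the words look like the samples posted earlier; $sample is a made-up stand-in for the file content, and the plain \d{6,8} pattern is my simplification, not the pattern from the post above.

```powershell
# Stand-in for a few lines of the pasted text (one word per line)
$sample = '(3059317)', 'KB2269637:', 'Remote', '(3057839)'

# -AllMatches keeps every hit on a line; the two ForEach-Object stages
# unroll the MatchInfo objects into plain strings, which also works on v2
$numbers = $sample |
    Select-String -Pattern '\d{6,8}' -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$numbers   # 3059317, 2269637, 3057839
```

Because the pattern only describes the digits, the parentheses, the KB prefix and the colon never end up in the result.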

You could use Trim() or -replace:

$t = gc .\cimc.txt
$regex = '[^A-Z]\d{6,8}|KB\d{6,8}'
$t -match $regex | ForEach-Object {$_.Trim("K","B","(",")",":")}

or

$t = gc .\cimc.txt
$regex = '[^A-Z]\d{6,8}|KB\d{6,8}'
($t -match $regex) -replace '[^\d]'

3042553
3042553
2269637

I was wondering why you both removed my parentheses, then I did some research. Turns out I need them escaped as “\(” and “\)” … baby steps.

Thanks Curtis. I used your Trim option, even though the replace looks more elegant. Seeing the characters helps me comprehend it when I read it later.

$RawData = gc .\cimc.txt
$regex = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
$KBNums = $RawData -match $regex | ForEach-Object {'KB' + $_.Trim("K","B","(",")",":")}

I added a few things and this code gets me a nice, consistent list of KB#s I can use for lookup.
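
One thing worth double-checking with that pattern: the | splits the whole expression, so it reads as "an opening parenthesis followed by digits" or "KB, digits, then a closing parenthesis". A word like KB2269637: fits neither side. Below is a minimal sketch of one way to group it, assuming the words look like the samples in this thread. $words, $pattern and $kbList are made-up names for illustration.

```powershell
# Sample words as they appear after pasting (one word per line)
$words = '(3042553)', 'KB2269637:', 'Remote', '3009008:'

# Either 6-8 digits wrapped in parentheses, or 6-8 digits prefixed with KB.
# Each alternative has its own capture group around the digits.
$pattern = '\((\d{6,8})\)|KB(\d{6,8})'

$kbList = foreach ($w in $words) {
    if ($w -match $pattern) {
        # Only the group belonging to the alternative that matched is populated
        $digits = if ($Matches[1]) { $Matches[1] } else { $Matches[2] }
        "KB$digits"
    }
}

$kbList   # KB3042553, KB2269637
```

Since the capture groups hold just the digits, there is nothing left to trim afterwards. The bare 3009008: is skipped here on purpose because it has neither parentheses nor a KB prefix; whether that is the right call depends on the report.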

So that gets me halfway there. Now I need the server at the top, and remove duplicates. Every time I hit a new hostname I make a new server and unique list.

The pattern can be as follows:

Hostname:
dyn-­-172-­-79-­-158-­-145.domain-123.com
IP:

or

Hostname:
mer03943.domain-123.com
IP:

…but I want to ignore any where there’s nothing in the middle. In that case the line right after “Hostname:” will always be “IP:”

Hostname:
IP:

I’ll comment again with my first attempt. I think I need a hash table? Can a value in a hash table be an array? Time to play…

Ha, I removed them just playing around with the regex, didn’t mean to leave them off :).

Yes, you can have an array as a value in a hash table.

PS C:\> $hash = @{
    String1 = "Value1";
    String2 = "Value2";
    Array1 = @("Value3", "Value4", "Value5");
    String3 = "Value6"
}

$hash

Name                           Value
----                           -----
String3                        Value6
String2                        Value2
String1                        Value1
Array1                         {Value3, Value4, Value5}
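
Applied to the server/KB problem, that could look roughly like the sketch below: one key per server, with an array of unique KB strings as the value. The $found list is invented sample data and $kbsByServer is a made-up name, so treat this as an illustration and not the finished parser.

```powershell
# Hypothetical (server, KB) pairs, as if they had already been pulled from the report
$found = @(
    @{ Server = 'svrad01';   KB = 'KB3042553' },
    @{ Server = 'svrad01';   KB = 'KB3042553' },
    @{ Server = 'svrad01';   KB = 'KB2269637' },
    @{ Server = 'svrfile02'; KB = 'KB3000483' }
)

$kbsByServer = @{}
foreach ($f in $found) {
    # First time we see a server: start it off with an empty array
    if (-not $kbsByServer.ContainsKey($f.Server)) {
        $kbsByServer[$f.Server] = @()
    }
    # Only append KBs that are not already in this server's array
    if ($kbsByServer[$f.Server] -notcontains $f.KB) {
        $kbsByServer[$f.Server] += $f.KB
    }
}

$kbsByServer['svrad01']   # KB3042553, KB2269637
```

Checking with -notcontains before appending keeps each array unique as it is built, so no separate de-duplication pass is needed.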

Cool, that’ll come in handy when I figure out this other thing.

I got the hostname and IP lines, and even the URL, but all separate.

$RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
$RegexHN = 'Hostname:'
$RegexURL = '([\da-z\.-]+)\.([\da-z\.-]+)\.([a-z]{2,6})'
$RegexIP = 'IP:'

How do I find all three together, so that I can ignore when the middle one isn’t there?

Eventually I need to parse this txt file. Every time the script runs into a HN/URL/IP trio of lines, it should add that URL string to a hash table as a key, and for its value use the array of KB strings produced by the above solution. It should stop collecting values when it hits the next URL or a HN/IP pair of lines without a URL.

and come out with a hash table like so…

$RawData = gc .\cimc.txt
$KBNums = $RawData -match $RegexKB | ForEach-Object {'KB' + $_.Trim("K","B","(",")",":")} | select -Unique

$hash = @{
$RegexURL[0] = $KBNums;
$RegexURL[1] = $KBNums;
$RegexURL[2] = $KBNums;
$RegexURL[3] = $KBNums;
$RegexURL[4] = $KBNums;
}

I know it doesn’t make sense the way I wrote it here, but you get the idea of my end goal, right?

Well, are those values always at the very top of the file like in the example content? If so, you can just pull them directly by index and test them, without doing any filtering.

$t = Get-Content .\cimc.txt

$t[0]
$t[1]
$t[2]
$t[3]

If ($t[1]) {
    "Hostname not blank"
}

Results:
Hostname:
server-123.domain-123.com
IP:
128.59.238.12
Hostname not blank

No, it’s just one file with hundreds of KB#s and dozens of server URLs.
Basically it’s a security report formatted so that IT guys pull their hair out trying to review the data. We are starting to get a lot of them and this would increase productivity a billion times (in my head).
This function should output KB numbers so that I can pipe them into a script that searches for them in WUAU and outputs patch status and other info. But I’m getting ahead of myself.

I tried using \n and \r and \n\r and \r\n but when I try something like
‘(Hostname:)\n(long regex string for URL)\n(IP:)’
I get no matches.

Yeah, -match is going to test the regex against one line at a time; it’s not going to compare against all three lines. I think what you are going to have to do is loop through the lines one at a time so that you can find your “start of record”, which is indicated by a “Hostname:” value, then check the next line to see if it has a value, and decide what to do based on that result.
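
To illustrate why the \n attempts find nothing, and what a multi-line match would need, here is a small sketch. It assumes the lines are joined back into one string first; $lines is a made-up stand-in for what Get-Content returns, and the pattern is only a guess at the Hostname/name/IP layout described in this thread.

```powershell
# Get-Content hands back an array with one string per line
$lines = 'Hostname:', 'mer03943.domain-123.com', 'IP:', 'Hostname:', 'IP:'

# -match tests each element separately, and no single element contains
# a newline, so a pattern with \n in it can never match
$perLine = @($lines -match 'Hostname:\n.+\nIP:')
$perLine.Count   # 0

# Joined into one string, the same kind of pattern can span three lines.
# (?!IP:$) rejects the case where "IP:" directly follows "Hostname:".
$text = $lines -join "`n"
$hostNames = [regex]::Matches($text, '(?m)^Hostname:\n(?!IP:$)(.+)\nIP:$') |
    ForEach-Object { $_.Groups[1].Value }

$hostNames   # mer03943.domain-123.com
```

This only pulls out the names. To tie KBs to each server you still need to know where one section ends and the next begins, which is where the line-by-line loop comes in.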

For confirmation, the first instance of

Hostname:

IP:

that you find, you want to stop searching completely. No need to finish the rest of the file?

Well, sort of. I want to come out with multiple sections. If I have to use a split at the beginning to accomplish this, I guess that will have to do, but then I’m iterating a bunch of times and creating that many txt files or some other output I really don’t want.

The file contains hostnames as URLs, but some devices such as network switches do not have hostnames but do appear on the report. It’s random. You could switch the logic to say find a URL, then start collecting KB#s until you hit the next ‘hostname:’ line, then search for the next URL. That actually sounds a lot simpler.

I’m not sure how I would loop thru inside of a regex.

I think this is kinda what you are asking for. It will at least give you the building blocks to expand upon. Of course I only have the one sample of data, so I repeated it 3 times in my input file and blanked out the line below “Hostname:” on the second instance.

Update: I went back and added comments throughout the script, so hopefully it will make sense how this script functions and give you a better understanding of how to parse seemingly haphazard text in the future.

# Regex String for matching KB number in input text
$RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'

# Initialize record variable as null
$record = $null

#get content from input file
$cimc = Get-Content .\cimc.txt

# use the variable $i in a for loop where $i starts at 0 and increments by +1
# until it is no longer less than the number of lines in the input file
For ($i=0; $i -lt ($cimc.count); $i++) {

    # For the current line in the loop switch code execution based on the value
    Switch ($cimc[$i])
    {
        # If the current line's value is "Hostname:" then we have found the
        # beginning of a new record
        "Hostname:"
        {
            # Check to see if we are already working on a record and if so,
            # send it to the Pipeline as long as the Hostname value is not blank
            If ($record -and $record.Hostname)
            {
                $record
            }#if

            # Start a new record setting Hostname as the value on the next
            # line (current line + 1), and the IPaddress as the value on
            # the next 3rd line (current line + 3)
            $record = [PSCustomObject]@{
                Hostname = $cimc[$i+1];
                IPAddress = $cimc[$i+3];
                KBs = @()
            }
        }#Hostname:

        # If the current line's value is anything else then this default action
        # will be taken
        default
        {
            # For the current line, check and see if it matches a KB number
            # based on our regular expression (KB lines that show up before
            # any record has been started are skipped)
            If ($record -and $cimc[$i] -match $RegexKB)
            {
                # Replace all characters that are not numerical with nothing,
                # prefix it with KB, and add it to the KBs array in the record
                $record.KBs += @("KB$($cimc[$i] -replace '[^\d]')")
            }#if
        }#default
    }#switch
}#for

# Check to see if the last record has a Hostname value and output it to the
# pipeline if so
If ($record.Hostname) {
    $record
}#if

Results Like:

Hostname                         IPAddress                       KBs                            
--------                         ---------                       ---                            
server-123.domain-123.com        128.59.238.12                   {KB3042553, KB3042553}         
server-123.domain-123.com        128.59.238.12                   {KB3042553, KB3042553}         

Looks good now… so much thanks

$RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
$record = $null
$RawData = Get-Content .\cimc.txt

For ($i=0; $i -lt ($RawData.count); $i++) {
    Switch ($RawData[$i]) 
    {
        "Hostname:"
        {
            # Send the record in progress to the pipeline before starting
            # (or skipping) the next section, so it is never thrown away
            If ($record -and $record.Hostname -and $record.KBs) {
                $record
            }
            If ($RawData[$i+1] -ne 'IP:') 
            {
                $record = [PSCustomObject]@{
                    Hostname = if ($RawData[$i+1] -match "^dyn[0-9-]") {
                                     (($RawData[$i+1].TrimStart('dyn-­-') -split '\.')[0]) -replace '-­-','.'; 
                               } else {
                                     ($RawData[$i+1] -split '\.')[0]
                               }
                    KBs = @()
                }
            } Else {
                # No hostname in this section, so its KBs are skipped
                $record = $null
            }
        }#Hostname:
        default {
            If ($record -and $RawData[$i] -match $RegexKB) {
                #replace all characters that are not numerical with nothing, prefix it with KB, and add it to the array
                $record.KBs += @("KB$($RawData[$i] -replace '[^\d]')")
            }
        }#default
    }#switch
}#for
If ($record.Hostname -and $record.KBs) {
    $record
}

gives

Hostname               KBs                  
--------               ---                  
174.59.10.178         {KB3017349, KB3057181, KB3058985, KB3033857}                      
174.59.10.189         {KB3042553, KB3042553, KB2722479, KB3011443...}                   
svrad01               {KB3000483, KB3041836, KB3032323, KB3046306...}                   
174.59.10.131         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
svrcitrix             {KB3042553, KB3042553, KB2500212, KB2962486...}                   
174.59.10.153         {KB3042553, KB3011443, KB3011780, KB3017349...}                   
174.59.10.159         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
174.59.10.165         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
174.59.10.190         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
174.59.10.139         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
174.59.10.149         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.111         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
174.59.10.174         {KB3042553, KB3042553, KB3000483, KB3046306...}                   
174.59.10.129         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.129         {KB3042553, KB2500212, KB3011443, KB3000483...}                   
174.59.10.125         {KB3042553, KB2500212, KB3011443, KB3000483...}                   
174.59.10.172         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.137         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
svrfile02             {KB3042553, KB3042553, KB3000483, KB3041836...}                   
svrarchive            {KB3000483, KB3041836, KB3032323, KB3046306...}                   
174.59.10.107         {KB2500212, KB3000483}                                            
webmail               {KB3042553, KB3042553, KB3000483, KB3041836...}                   
174.59.10.150         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.101         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.198         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
printer4              {KB3000414, KB2992611, KB3042553, KB2500212...}                   
174.59.10.145         {KB3042553, KB3011780, KB3017349, KB3021674...}                   
174.59.10.195         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.155         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
174.59.10.186         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
tradedev              {KB3042553, KB2962486, KB3000483, KB3041836...}                   
svrdev                {KB3042553, KB3000483, KB3041836, KB3032323...}                   
invest-­-serv         {KB3042553, KB3042553, KB3000483, KB3041836...}                   
174.59.10.192         {KB3042553, KB3042553, KB3000483, KB3041836...}                   
svrproddb             {KB3042553, KB3042553, KB3000483, KB3041836...}                   
connect4              {KB3042553, KB3000483, KB3041836, KB3032323...}                   
svrprodws             {KB3042553, KB3000483, KB3041836, KB3032323...}                   
174.59.10.191         {KB3042553, KB3000483, KB3041836, KB3032323...}                   
printer6              {KB3042553, KB3000483, KB3041836, KB3032323...}  
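
The lists above still carry repeats (KB3042553 shows up twice for several hosts). A minimal sketch of trimming a record down to unique KBs before it is sent to the pipeline follows. The $record here is a hand-built stand-in with values borrowed from the output above, not something produced by the script.

```powershell
# Hand-built record for illustration only
$record = New-Object -TypeName PSObject -Property @{
    Hostname = 'svrcitrix'
    KBs      = @('KB3042553', 'KB3042553', 'KB2500212')
}

# Select-Object -Unique keeps the first occurrence of each value, in order;
# the @() wrapper keeps KBs an array even when only one value is left
$record.KBs = @($record.KBs | Select-Object -Unique)

$record.KBs   # KB3042553, KB2500212
```

In the script, that one assignment could sit right before each place where $record is written out.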

Question, having issues running this on PS v2.0. Anything in this script that would make the output all weird and wonky? I have to write things for the lowest common denominator.

Name                           Value
----                           -----
Hostname                       251.59.174.178
KBs                            {KB3017349, KB3057181, KB3058985, KB3033857}
Hostname                       251.59.174.189
KBs                            {KB3042553, KB3042553, KB2722479, KB3011443...}
Hostname                       svrad01
KBs                            {KB3000483, KB3041836, KB3032323, KB3046306...}
Hostname                       251.59.174.131
KBs                            {KB3042553, KB3011443, KB3000483, KB3046306...}
Hostname                       svrcitrix
KBs                            {KB3042553, KB3042553, KB2500212, KB2962486...}

edit:
just found this
https://powershell.org/forums/topic/powershell-2-0-object-formatting-driving-me-crazy/#post-18711
and it seems like I may need a new thread.

Yes, PowerShell 2.0 does not support [PSCustomObject]@{}. That was introduced in 3.0.

If you have to stick to 2.0, you will have to revert to using:

New-Object -TypeName PSObject -Property @{}
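
As a rough sketch of how the record creation might look with that swap (the hostname and KB values are placeholders taken from the output earlier in the thread): a plain hash table does not promise any key order, and [ordered] only arrived in 3.0, so the Select-Object at the end is my addition to pin the column order.

```powershell
# v2-friendly replacement for [PSCustomObject]@{ ... }
$record = New-Object -TypeName PSObject -Property @{
    Hostname = 'svrad01'
    KBs      = @('KB3000483', 'KB3041836')
} | Select-Object Hostname, KBs

# The rest of the script can keep appending to KBs exactly as before
$record.KBs += 'KB3032323'

$record.Hostname    # svrad01
$record.KBs.Count   # 3
```

With real objects instead of hash tables, the output should come back as Hostname/KBs columns and not as Name/Value pairs.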

Perfect! Easy!