Find text on a web page

Hi,
I am accessing an internal web page which has a list of servers and uuids from which I would like to search for a server and get its uuid.

Each line on the html page looks like this:
server1,7a063332-2e05-53338-ff5b-5843116ad838
server2,5d883442-ab28-122b-9646-1cb44b8c344d
server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011

After using the Invoke-RestMethod command and exporting to a CSV file, the exported data ends up on a single line:
server1,7a063332-2e05-53338-ff5b-5843116ad838server2,5d883442-ab28-122b-9646-1cb44b8c344dserver5,36c915bf-3d21-a222-b0a9-6f10d22d7b011

#1 Is there any way to easily find the server and uuid directly with Invoke-RestMethod?
#2 If #1 cannot be done, how can I extract it from the CSV file?

And no, I cannot get the uuid from the server at this point. I need this list.
Thanks in advance!

You might look into Invoke-WebRequest; I’m not sure exactly what you need to do, though.

If the web server is returning that as raw text, rather than something XML- or JSON-encoded, then it can be difficult to detect the end-of-line characters, which is why you’re getting a single line of output in your CSV. I’d try to solve that. If the items were coming through as individual lines, your task would be very easy - just use Import-Csv.

Try looking at the raw data being returned by the server and see if it includes CRLF (carriage return, line feed) at the end of each line, or something else.
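If the response does turn out to be plain text with one "name,uuid" pair per line, a minimal sketch along these lines might be enough. The $raw string below is only a stand-in for whatever Invoke-RestMethod or Invoke-WebRequest actually returns, and the Hostname and UUID column names are my own choice, not something the server provides:

```powershell
# Stand-in for the server response (assumption: plain text, one "name,uuid" pair per line).
$raw = "server1,7a063332-2e05-53338-ff5b-5843116ad838`r`nserver2,5d883442-ab28-122b-9646-1cb44b8c344d`r`nserver5,36c915bf-3d21-a222-b0a9-6f10d22d7b011"

# Split on CRLF or bare LF, drop blank lines, and name the two columns.
$servers = $raw -split '\r?\n' |
    Where-Object { $_.Trim() } |
    ConvertFrom-Csv -Header Hostname, UUID

# Look up one server and keep its UUID in a variable.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server2' }).UUID
$uuid
```

If the line endings turn out to be something other than CRLF or LF, only the -split pattern should need to change.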

You could probably use the HTML DOM to pull the data from the site. In IE, hitting F12 opens a DOM explorer. Once you determine which HTML object (DIV, SPAN, TABLE, TEXTBOX, etc.) contains your data, you can do something like this: http://powershell.com/cs/blogs/tips/archive/2013/09/02/importing-website-tables-into-excel.aspx

In that blog post, they are pulling information out of a table. If the raw data is placed in a DIV, like so (standard tags replaced with braces so the HTML would not render):

{html}
{head}
{title}Awesome Website{/title}
{/head}
{body}
{div id="div_Hdr"}Server Data{/div}
{div id="div_ServerData"}
server1,7a063332-2e05-53338-ff5b-5843116ad838{br/}
server2,5d883442-ab28-122b-9646-1cb44b8c344d{br/}
server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011{br/}
{/div}
{/body}
{/html}

You could do something like:

$content = @($data.ParsedHtml.getElementsByTagName("div"))[1].innerHTML

getElementsByTagName finds every DIV on the page (there could be a LOT). In this example there are 2, so index 0 is div_Hdr and index 1 is div_ServerData (it’s an object array). You want the raw content inside the DIV, which is .innerHTML. If you explore the DOM and see that the container has an ID, it’s cleaner to just get the object by ID (or Name):

$content = $data.ParsedHtml.getElementById("div_ServerData").innerHTML

As Don alluded to, the real question is what format the data is in, so you can parse it and turn it into a usable PowerShell object. If it’s a div with {br /} line breaks, then you can split the content on those and on the commas and create a custom PSObject from each line. If it’s a table, which is ideal, just do a search on ‘parsing table dom powershell’ and take it from there. With the built-in DOM explorer in IE (most browsers have a similar feature), you should be able to narrow down pretty quickly which container object holds the data and whether there is an easy way to identify it (e.g. ID or Name) to start your parsing. Good luck!! If you figure it out, post some code for others to use.
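For the div-with-line-breaks case, a rough sketch of the splitting step could look like this. The $content string is just a stand-in for the innerHTML of the made-up div_ServerData element above, and the Hostname and UUID property names are my own choice:

```powershell
# Stand-in for the innerHTML of the hypothetical div_ServerData element.
$content = 'server1,7a063332-2e05-53338-ff5b-5843116ad838<br/>server2,5d883442-ab28-122b-9646-1cb44b8c344d<br />server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011<BR>'

# Split on <br>, <br/> or <br /> (-split is case-insensitive), then split each line on the first comma.
$servers = $content -split '<br\s*/?>' |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ } |
    ForEach-Object {
        $name, $id = $_ -split ',', 2
        [pscustomobject]@{ Hostname = $name; UUID = $id }
    }

$servers
```

The browser may hand back the line breaks as upper-case BR tags, which is why the split pattern does not assume a particular spelling.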

Edit: Any forum guys, is there a proper way to show something like HTML literally, without rendering it? I tried pre, blockquote, and nothing at all, and all of them rendered the HTML. Thanks.

As a fun example, take this website for instance. I hit F12 in my IE browser, clicked the element-selection cursor in the top middle of the toolbar, and then clicked the forum topics object on the page. I could see that the container object with all of the topics is a UL (unordered list) with an ID of bbp-forum-2683. Then I started parsing the objects until I finally arrived at the title, which was UL (bbp-forum-2683) > UL > LI > A > title. Note that there are other LI elements containing A tags in the collection, but I only wanted the forum title, so I looked at the returned object attributes and saw there was a CSS class ‘bbp-topic-title’ that I used to filter the results:

$URL = "https://powershell.org/forums/forum/windows-powershell-qa/"

# reading website data:
$data = Invoke-WebRequest -Uri $URL 

# walk from the topic list container (UL) down to each topic-title link and output its title:
$data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
    $_.getElementsByTagName("ul") | foreach{
        $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
            $_.getElementsByTagName("a") | foreach{
                $_.Title
            }
        }
    }
}

Returns:

Forums Tips and Guidelines
Check these External Forums for Specific Topics
Find text on a web page
Help emailing formatted HTML – Please
Need a script to use a threading.
Powershell v3 & v2 Compatibilty
Advanced Function Optional Parameter Problem
Help Bulk Permission changes O365 Conferance Rooms
Variable Output in Email
Detecting param variable input only
Scenario Training?
How to Format a list to be emailed?
how to temporarily un-protect certain objects??
Powershell 2.0 and Sql server 2005 help
import same.csv | For Each {
Strange behavior with the Get-Alias cmdlet
Redundancy of Code in Advanced Function

Edit: Not finding a good way to post HTML code on the new forum plugin yet. It’s even unwrapping double-escaped HTML.

Dare I try to type in some JavaScript and see if that executes? It’s impossible, right TweetDeck :slight_smile:

Wow thanks for the help everyone!

I used F12 and browsed the DOM. It showed up as this, for example (written in poor format due to the limitations of posting HTML):

after META, it says http equiv = content type content text/html

all of the lines are within the body of the html

Is the object body, or ? Hmm sorry I know nothing about html :frowning: but now it’s on my list to learn :slight_smile:

What if I just take this file and remove the line breaks, or replace them?

The result I am looking for is that I can search for the server name, then set a variable for its uuid found.

It’s possible it’s just in the BODY. You can just try:

$data.ParsedHtml.body.innerHTML

See if that contains the server data; if it does, then it’s just a matter of parsing it.

Hi,
I’m back :slight_smile:
I now have an HTML file as output , and in it a table. It looks like this, with headers:

hostname uuid
server01 5d7b0d42-760d-cf02-4c0d-b00848b20a38
server02 5d7b0d42-760d-cf02-4c0d-b00848b20a38

What I am trying to do is search for server02, and set a variable to the server’s uuid.

Can someone help?
Thanks!

Rob that is well good - pretty impressive what you can parse and how to get those details out.

You are just using standard PowerShell commands and logic once you have a PSObject:

$object = @()
$object += New-Object -TypeName PSObject -Property @{Hostname="Server1";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}
$object += New-Object -TypeName PSObject -Property @{Hostname="Server2";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}

$object | Where{$_.HostName -eq "server1"} | foreach{ Some-Command -UUID $_.UUID }
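If the goal is only to end up with the UUID in a variable, one possible variation is to index the objects by hostname first. This is just a sketch; the $servers and $byName names are mine, and the data is the sample from the post above:

```powershell
# Sample objects, using the hostnames and UUIDs posted above.
$servers = @(
    [pscustomobject]@{ Hostname = 'server01'; UUID = '5d7b0d42-760d-cf02-4c0d-b00848b20a38' }
    [pscustomobject]@{ Hostname = 'server02'; UUID = '5d7b0d42-760d-cf02-4c0d-b00848b20a38' }
)

# Build a hostname -> UUID lookup table (keys of @{} are case-insensitive).
$byName = @{}
foreach ($s in $servers) { $byName[$s.Hostname] = $s.UUID }

# Set a variable to the UUID of one server.
$uuid = $byName['server02']
$uuid
```

A lookup table mostly pays off when you need to look up many servers; for a single lookup the Where filter is just as good.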

Thanks.
But this is an html file I am searching in, so no psobject… I think I am missing something?
Thanks!
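For an HTML file on disk with a simple two-column table, a regex-based sketch like the one below might be enough to get objects out of it. Everything about the markup here is an assumption (plain td cells, exactly two per row, headers in th), the servers.html file name is hypothetical, and a regex will break on more complicated HTML, so treat it as a starting point only:

```powershell
# Stand-in for: $html = Get-Content -Path .\servers.html -Raw   (hypothetical file name)
$html = @'
<table>
<tr><th>hostname</th><th>uuid</th></tr>
<tr><td>server01</td><td>5d7b0d42-760d-cf02-4c0d-b00848b20a38</td></tr>
<tr><td>server02</td><td>5d7b0d42-760d-cf02-4c0d-b00848b20a38</td></tr>
</table>
'@

# Match rows made of exactly two <td> cells; the header row uses <th> and is skipped.
$pattern = '(?is)<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>'

# Turn each matching row into an object with Hostname and UUID properties.
$servers = [regex]::Matches($html, $pattern) | ForEach-Object {
    [pscustomobject]@{
        Hostname = $_.Groups[1].Value.Trim()
        UUID     = $_.Groups[2].Value.Trim()
    }
}

# Search for one server and set a variable to its UUID.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server02' }).UUID
$uuid
```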

You just create an empty variable and then capture what is being output in it, like so:

PS C:\> $URL = "https://powershell.org/forums/forum/windows-powershell-qa/"

# reading website data:
$data = Invoke-WebRequest -Uri $URL 
$myPSObject = @() 
# collect the title and link of each topic into $myPSObject:
$myPSObject = $data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
    $_.getElementsByTagName("ul") | foreach{
        $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
            $_.getElementsByTagName("a") | foreach{
                $_ | Select Title, HREF
            }
        }
    }
}

$myPSObject

title                                                  href                                                  
-----                                                  ----                                                  
Forums Tips and Guidelines                             https://powershell.org/forums/topic/forums-tips-a...
Check these External Forums for Specific Topics        https://powershell.org/forums/topic/check-these-e...
Little scripting help                                  https://powershell.org/forums/topic/little-script...
WINRM authentication                                   https://powershell.org/forums/topic/winrm-authent...
WINRM kerberos & Negotiate                             https://powershell.org/forums/topic/winrm-kerbero...
WinRM with non-domain joined machine using Certs       https://powershell.org/forums/topic/winrm-with-no...
Exchange cmdlet error change in PS 3 vs PS 4           https://powershell.org/forums/topic/exchange-cmdl...
Find text on a web page                                https://powershell.org/forums/topic/find-text-on-...
using select-object                                    https://powershell.org/forums/topic/using-select-...
File Copy Access is denied                             https://powershell.org/forums/topic/file-copy-acc...
Dell Warranty Information                              https://powershell.org/forums/topic/dell-warranty...
Foreign Security Principals                            https://powershell.org/forums/topic/foreign-secur...
Module review                                          https://powershell.org/forums/topic/module-review/  
winrm g & e swithch diffrence                          https://powershell.org/forums/topic/winrm-g-e-swi...
Help with setting up PSRemoting                        https://powershell.org/forums/topic/help-with-set...
Help with adding a script method (and quite possibl... https://powershell.org/forums/topic/help-with-add...
LOG for Copy-item                                      https://powershell.org/forums/topic/log-for-copy-...

PS C:\> $myPSObject | Where{$_.Title -like "*module*"}

title                                                  href                                                  
-----                                                  ----                                                  
Module review                                          https://powershell.org/forums/topic/module-review/  

Do you know of a way to get this to work with a file on disk? If you pass a file:// URI to Invoke-WebRequest, you get back a different type of object which doesn’t contain the parsed HTML objects.

I’ve been able to use the HTML Agility Pack for this, which requires downloading an extra DLL (and it helps to be familiar with XPath), but is also quite fast compared to accessing the HTML DOM through the objects that Invoke-WebRequest returns:

Add-Type -Path .\HtmlAgilityPack.dll

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load("$pwd\forum.html")

$links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')

$properties = @(
    @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
    @{ Name = 'Href';  Expression = { $_.GetAttributeValue('href', "") } }
)

$links | Select-Object -Property $properties

BTW, you can use the same library to handle parsing of live webpages as well. This has the benefit of keeping the faster performance of the HTML Agility Pack library, and consistent code regardless of where the HTML came from. To do so, use Invoke-WebRequest as before, and pass its Content property to the LoadHtml method of HtmlAgilityPack.HtmlDocument:

Add-Type -Path .\HtmlAgilityPack.dll

$URL = "https://powershell.org/forums/forum/windows-powershell-qa/" 
$data = Invoke-WebRequest -Uri $URL 

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($data.Content)

$links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')

$properties = @(
    @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
    @{ Name = 'Href';  Expression = { $_.GetAttributeValue('href', "") } }
)

$links | Select-Object -Property $properties
