Help getting table data out of HTML (Scraping)

gazza · October 22, 2018, 3:15pm

Hey guys. I am totally stuck trying to pull a table out of some HTML.
I am logging onto a website and pulling back the HTML information but for the life of me, I can’t seem to get that data into a useful format.

I’m scraping using ‘New-Object -com’ because I couldn’t work out how to use ‘Invoke-Webrequest’ and log in so I fell back to InternetExplorer.Application as this was working for me (plus I can set it to visible and see it).

After login I am collecting the information like so

$tables = @($ie.Document.getElementsByTagName('table'))

and inside that, I have a field called ‘innerHTML’ that contains my table. I just can’t work out how to convert the information in that table into something readable and I’m wanting to perform a foreach at the end to upload some cells via an API to another site.

I’ve tried converting to XML\JSON, and splitting on "`n"I’ve followed every link on the internet that I can find and tried all their solutions but none seem to work for me. I have also tried a bunch of functions sourced from the net but they also no data returned.

I have put an edited copy of the table in this pastebin… it only has 2 rooms, but my original data will have 50 (unless I can work out how to return all object and not just the first page, then it will have about 300)

https://pastebin.com/R7FfHCvX

If anyone could point me in the right direction as i’m all out of ideas?
I think that this would be a fairly common task in PowerShell so I’m guessing that I just don’t know the terminology to search for it on google.

gazza · October 22, 2018, 3:26pm

This is what the table looks like on the website if it helps.

https://imgur.com/a/3O66kmR

random-commandline · October 22, 2018, 5:38pm

What site are you logged into and what information do you need from ‘innerHTML’? Also, using the ‘getElementById()’ method will be faster.

gazza · October 22, 2018, 6:21pm

The site is zoom.us, its just a meeting room application and I’m accessing its dashboard to collect the status of the rooms in error so that I can update my CMDB and send alerts off it if required. The image in the link above will explain better what I am wanting to pull out of the table, but I am after the following Row\Cell information
for Room Name, Health, Issue

In the Pastebin, they are on the following lines.
Room Name - Line 58
Health - Line 64
Issue - Lines 67,68

I did try ‘getElementById()’ method but there is no ID for that information, just the table (unless I am missing something).

I thought that I could just dump it all into an Object and then just access the information by dot sourcing it or maybe by using ConvertTo-XML first but that doesn’t work. So I tried a whole heap of examples that I found online but none of the examples worked for me no matter what I tried. I can always get to the step of accessing the table, but whatever the next step is in the example, and no matter how I reworked the function it always failed for me.

gazza · October 22, 2018, 6:28pm

Ok, I was being stupid. I see what you mean about using the ‘getElementById()’ method.
So I can pull just the first entry. I’ll just have to work out how to pull the others, then I can do a replace or regex on the string or something.

Edit:
I still may not be able to do it that way as each section reuses the ID and according to this, it can’t be reused.

https://stackoverflow.com/questions/9435616/powershell-ie-automation-getelementbyid-with-multiple-entries

gazza · October 22, 2018, 8:17pm

If I use the ‘getElementById()’ method then I can access the data for the first row , but as each row all have the same name I can’t get the others. e.g ID=“RoomName”.
Then if I instead take the HTML and split on “<.?tr>” so that I can iterate through each row\tr then I can no longer access the data via the method (which would have solved my issue, as “outerText” has the info that I need in plain text).

I’m still not sure how this should be grabbed.
Do you have any other ideas for me?

I managed to get invoke-webrequest to work, so will try with ParsedHtml to see if that make a difference.

postanote · October 22, 2018, 10:37pm

How are you walking the web site to ensure you are on / using the right object?

For example, If I wanted to use say pwpush.com and interact with it.

$url = 'https://pwpush.com'

($FormElements = Invoke-WebRequest -Uri $url -SessionVariable fe) 

($Form = $FormElements.Forms[0]) | Format-List -Force

$Form | Get-Member

$Form.Fields


# Full pass

Clear-Host

$password = '1234'
$loginUrl = 'https://pwpush.com'

$ie = New-Object -com internetexplorer.application
$ie.visible = $true
$ie.navigate($loginUrl)

while ($ie.Busy -eq $true) { Start-Sleep -Seconds 1 }

($ie.document.getElementById('password_payload') | select -first 1).value = $password
Start-Sleep -Seconds 1 

$ie.Document.getElementsByName('commit').Item().Click();
Start-Sleep -Seconds 1

Now pulling table info from a page, you’d do similar stuff, using something like this…

# get the first table found on the website and write it to disk:
@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML

… of course that is only after you’ve walked the site to get the correct element.

_timpringle · October 23, 2018, 3:46am

Hey Gary, I’ve taken a look at their website, and think you might be making it a bit too tricky than it needs to be. They have an API available which you can use Zoom API , which will be able to give you all this information without the need for scraping. Judging by the number of rooms you’re talking about, it sounds like you have a license which would allow you to use it with an API key and will give you data back to do with what you want and not need to scrape.

For any websites, I recommend in taking a look using Chrome with Developer Tools to have a look at what’s happening in the background. Take a look particularly for XHR requests, which are API calls. But either way, you get lots of information that help hugely to automate getting data.

Scraping has its uses, but very quickly builds up technical debt that is as important to focus on as the coding itself.

gazza · October 23, 2018, 2:42pm

Wow. I had no idea they had an API. I will use that.

For completeness, I was doing the following.

$postParams = @{email="email@domain.com";password="password"}
try{
Invoke-WebRequest "https://zoom.us/signin" -SessionVariable Zoom -Body $postParams -Method Post
$Dashboard = Invoke-WebRequest "https://zoom.us/account/metrics/zp" -WebSession $Zoom -Body $postParams -Method Post
}
catch {
$err=$_.Exception}

$splitTR = $Dashboard.ParsedHtml.documentElement.innerHTML -split "<.?tr>"

But I will definitely switch to the API. It would have been good if they told me they had one when I last spoke with there support about there dashboard.

gazza · October 23, 2018, 5:37pm

So turns out that you can’t see the dev options until you have registered on another site (https://developer.zoom.us)… not sure why they didn’t tell me about the API when I was asking them questions about the dashboard but using the API obviously helped.

Anyway, that was a simple solution, took me about 10min after I sorted that out.

For completeness I was doing this to scrape (now changed to API call).


$postParams = @{email="email@domain.com";password="password"}
try{
    Invoke-WebRequest "https://zoom.us/signin" -SessionVariable Zoom -Body $postParams -Method Post
    $Dashboard = Invoke-WebRequest "https://zoom.us/account/metrics/zp" -WebSession $Zoom -Body $postParams -Method Post
    $Dashboard.ParsedHtml.nameProp}
catch {
    $err=$_.Exception}

$splitTR = $Dashboard.ParsedHtml.documentElement.innerHTML -split "<.?tr>"

Topic		Replies	Views
Extracting parent tables on HTML PowerShell Help	6	242	May 16, 2024
Regex with inner HTML PowerShell Help	2	117	May 16, 2024
Get Parsed HTML from downloaded html file PowerShell Help	3	244	May 16, 2024
Having trouble pulling a value out of a hash table PowerShell Help	7	236	May 16, 2024
Manipulating Text or HTML code PowerShell Help	1	194	May 16, 2024

Help getting table data out of HTML (Scraping)

Related Topics