Downloading Tables from webpages

Hello,

I am after creating a script where by I can download a table off a site and then look in a cell for a colour and if the colour matches red then sent a prompt stating the information in the cell next to it.

Does anyone know how to do this?

So far I can download the site however I cant pull the table or anything I have the following code :

$webClient = new-object System.Net.WebClient
$webClient.Headers.Add(“user-agent”, “PowerShell Script”)
$output = $webClient.DownloadString(“http://www.google.co.uk”)

 Obviously I have replaced google.co.uk with my actual site.

Many Thanks

James

This is going to be HARD. Depending on how the page is built, it might be IMPOSSIBLE.

In HTML, table cells are <td> tags. The problem is, you can’t just look at the tag to tell if it’s a specific color. It might be <td style=“color:red;”> but it might also be something like <td class=“xxx”>, with a style sheet applying the red color to that class. There’s about a hundred ways a cell could end up being red - depending on how the page is built, it may be impossible to statically look at it and tell. The color may even be applied from a separate CSS file.

Ah right. I have noticed now looking at the site that the word is in the colour as well so the colour of the cell is stated. So if the cell is green inside the cell it actually has the text Green. Would this make life a bit easier?

Many Thanks

James

Well, “easier” is relative. You’re dealing with a giant text document.

If it’s well-formed XHTML (and it should have a meta tag indicating that, near the top), then you can have the shell parse it as XML and manipulate it. That’s easier.

If it isn’t, you’re going to probably have to treat it as a giant hunk of text and use regular expressions to parse it. That’s harder. You’ll essentially have to construct a regex that looks for whatever text pattern you need (the table cell HTML) and captures the subexpression (the content of the table cell) into a variable.

You MIGHT, if this isn’t running on a server, be able to use the Internet Explorer COM object to parse the HTML string into a Document Object Model (DOM), which can be manipulated a bit like XML. Also not super-easier, but maybe easier than a regex.

Sadly, I’m not your regex guy. If that’s the direction you need to go, we’ll try and elicit some more helpers for ya.

Thanks for that I have had a look at the source and this is at the top which is what I presume you mean:

!DOCTYPE html PUBLIC “-//W3C//DTD XHTML 1.0 Transitional//EN” “http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

html xmlns=“http://www.w3.org/1999/xhtml

I presume this means that we can use XML?

Many Thanks

James

I think I have progressed a bit from some searching but a bit stuck now.

I can get the table from the page but thats about it currently:

$ie = New-Object -com “InternetExplorer.Application”
$ie.navigate(http://www.google.co.uk)
$doc = $ie.Document
($doc.getElementsByTagName(“table”))|%{$_.innerHTML}

Any help would be greatly apprecited.

Many Thanks

James

ok I Have got the following:

$ie = New-Object -com “InternetExplorer.Application”
$ie.navigate(“http://www.google.co.uk”)
$doc = $ie.Document
($doc.getElementsByTagName(“table”))|%{$_.innerText} | Out-File C:\Output.txt

$Red = Select-String C:\Output.txt -pattern “Red”
$Yellow = Select-String C:\Output.txt -pattern “Yellow”

If(Select-String C:\Output.txt -pattern “Email and OWA”){
$Service = Select-String C:\Output.txt -pattern “Email and OWA”
$Test = $Service -split(“Green”)
$Test
$Test[0]
$Test[1]
$Test[2]
}

Which essentially dumps the table with no formatting into a text file and I am trying to sort out the information. However there are no spaces or anything between the columns of the table which makes things difficult.

An extract of the information as shown is here:

Email and OWAPlanned maintanance affecting external emailsGreenPlease be aware that on Wednesday the 31st July and Wednesday 7th August there will be interruptions to
external e-mail flowing in and out of the organisation.
The service will stop at 5.30pm and may be down until 9.00pm.

Can anyone give any idea’s on this?

Many Thanks

James

I’d say you want to sort that out BEFORE you get rid of the HTML. The HTML tags are how you tell the difference between columns.

Ahhh right how would I go about that? I cant seem to fathom it out.

Many Thanks

James

Well, that’s the thing. It’s hard. You have to write a regex to capture the “whatever” and then a subexpression to capture just the contents of the cell. Again, I’m not especially good at that. Me, I’d try and parse the file as XML or use a DOM approach, because it’s what I’m more comfortable with.

Hmmm I have no idea which either of them mean so any information would be greatly appreciated. I did try to get it out as an XML however it didnt view correctly. I used the ConvertTo-XML cmdlet if thats any use?

Many Thanks

James

No, ConvertTo-XML isn’t right, and once you parse as XML it won’t “display” at all. It’s an object hierarchy. If your HTML document is in $html, you’d do “[xml]$xml = $html” to attempt to parse the HTML as XML and create an XML object model. If it errors, then the HTML isn’t compliant with XML standards (which isn’t unusual - HTML isn’t exactly the same as XML, although they’re close).

Using the code above I have tried the code you suggested which in my code bove is:

[xml]$xml = $doc

and I get the following error:

Cannot convert value “mshtml.HTMLDocumentClass” to type “System.Xml.XmlDocument”. Error: “The specified node cannot be
inserted as the valid child of this node, because the specified node is the wrong type.”
At line:1 char:1

  • [xml]$xml = $doc
  •   + CategoryInfo          : MetadataError: (:) [], ArgumentTransformationMetadataException
      + FullyQualifiedErrorId : RuntimeException
    
    

So does this mean that I am unable to do it? If this is the case how can I progress or is this where the regex stuff comes in?

Many Thanks

James

That probably means it won’t work, and in terms of how you progress with writing the regex - unfortunately, as I said, I’m not the regex master :(. I’m personally not going to be much help to you. But I’ll try and get a couple of other folks to take a look at this.

If you can, try to provide a BRIEF snippet of the actual HTML - you will need to attach that as a file, since you can’t post actual HTML in the forums. It’s often helpful to see what the actual data looks like.

Then, if you can provide a concise explanation of what you want to extract from the HTML. I think what you want is something like, “I need to identify table cells that contain the word ‘Red’ and extract those contents into a variable,” but it’ll be better if you re-state the problem you’re trying to solve. We’ve gone down a couple of paths in this thread, so it’d be good to make sure anyone coming in at this point can quickly understand what you’re after.

Ok thank you for your help.

I am after trying to get the row which contains either Red or Yellow into a variable and then output the information in the cells of that row out so that it can be displayed to the user.

I hope this makes sense.

I have the following code:

$ie = New-Object -com “InternetExplorer.Application”
$ie.navigate(“http://www.google.co.uk/”)
$doc = $ie.Document

($doc.getElementsByTagName(“table”))|%{$_.innerhtml} | Out-File C:\Output.txt

$Red = Select-String C:\Output.txt -pattern “Red”
$Yellow = Select-String C:\Output.txt -pattern “Yellow”

Which works however I need it better formatted which is what I am unable to work out how to do.

An example of the Output.txt is attatched and the site has obviously changed

Any Assistance would be greatly appreciated.

James

I’ll take a stab at it.
I generally approach this using a multi-line single-line regex, in a here-string (I call it “jumbo shrimp in a can”). Let me review the requirments and that data and see what I can come up with.

First problem - there doesn’t appear to be any “Red” or “Yellow” in the sample data. Are we looking for background color, or foreground color?

I believe I’ve accomplished what you are looking for with HTML DOM parsing, which is the first attempt in Powershell and my head hurts a little from beating it on my desk. However, this looks like a dashboard for some internal status, so I would highly suggest trying to get access to a database\view\webservice or something to get raw data versus page crawling. This could work today and tomorrow it will not because you are the mercy of the developers and how the parsing is setup in the script at the moment.

I took the Output1 and 2 text files and just put a starting and ending table tags and ran the script against a local copy. From reading through your conversation thus far, it looks like you are trying to get critical and warning status by filtering on the color during your parse. I would leverage Powershell and pull all results and then filter on those results versus trying to figure out all of the DOM parsing to only pull certain results. This code:

$ie = New-Object -com “InternetExplorer.Application”
$ie.navigate(“C:\Users\Rob\Desktop\output2.html”)

$htmlResults = @()

foreach($table in $ie.Document.getElementsByTagName(“table”)) {
$tableHdr = @()

foreach($th in $table.getElementsByTagName("th")) {
    $columnName = $th.getElementsByTagName("a").Item(0).innerHTML
    $tableHdr += $columnName
}


foreach($tr in $table.getElementsByTagName("tr")) {
    #filter to find the rows in the table where the data is
    if($tr.style.backgroundColor -eq "rgb(239, 243, 251)" -or $tr.style.backgroundColor -eq "white") {
        $tds = $tr.getElementsByTagName("td")
        $rowProps = New-Object PSObject
        for ($i=0; $i -lt $tds.Length; $i++) {
             if ($tds.Item($i).innerHTML -like "&lt;span*") {
                $rowProps | Add-Member -MemberType NoteProperty -Name $tableHdr[$i].ToString() -Value ($tds.Item($i).getElementsByTagName("span").Item(0).innerHTML)
             }
             else {
                $rowProps | Add-Member -MemberType NoteProperty -Name $tableHdr[$i].ToString() -Value ($tds.Item($i).innerHTML)
             }
        }
        $htmlResults += $rowProps      
    }
}

}
$htmlResults
$ie.Quit()

produces the following results (removed the Email and OWA for formatting):

Council Services Description Status Information


Carefirst - Adults No major problems Green
Citrix No major problems Green
Civica APP No major problems Green
Civica IBS No major problems Green
website No major problems Green
EDRMS No major issues Green
Payroll System No major problems Green
Frontline No major problems Green
Internet No major problems Green
Intranet No major problems Green
Tribal - Childrens’ No major problems Green
Northgate Housing System No major issues Green
Wireless network No major issues. Green
SAP No major problems Green
VOIP Telephone System No major problems Green
VPN Homeworking Access No major problems Green
PECOS No major problems Green
PARIS No major problems. Green
Other No major problems Green
Mobile Telephony Service No major problems Green
Office Online No major problems Green
iTrent HR Payroll No major problems Green

Now that you have that, if there were Red or Yellow status, you could just do simple filters:

$htmlResults | Where {$_.Status -eq “Red”}

Here’s a regex solution that works with the posted test data. There appears to be 2 color settings per entry, and none of them have values of Red or Yellow. I’m not sure which one is supposed to change, so I parsed out both of them, and you can key off of whichever one you need.

$data = gc .\Output1.txt -raw

[regex]$regex1 = @’
(?ms)
<TR style=“BACKGROUND-COLOR: \S+?”>
<TD style=“WIDTH: 20%”>.+?</TD>
<TD style=“WIDTH: 20%”>.+?</TD>
<TD style="BACKGROUND-COLOR: \S+?;.+?</TD>
<TD style=.+?Information>.*?</SPAN></TD></TR>
'@

$Entries=
($regex1.Matches($data)).groups |
select -ExpandProperty value

[regex]$regex2 = @’
(?ms)
<TR style=“BACKGROUND-COLOR: (\S+?)”>
<TD style=“WIDTH: 20%”>(.+?)</TD>
<TD style=“WIDTH: 20%”>(.+?)</TD>
<TD style="BACKGROUND-COLOR: (\S+?);.+?</TD>
<TD style=.+?Information>(.*?)</SPAN></TD></TR>
'@

Foreach ($Entry in $Entries)
{
$Entry -match $regex2 > $null
New-Object PSObject -Property @{
Color = $Matches[1]
Color2 = $Matches[4]
System = $Matches[2]
Message = $Matches[3]
Info = $Matches[5] -replace ‘<BR>’,“`n”
}
}

Rob,

Thank you for your reply the code you have supplied seems to work fine however it doesnt seem to pull all results from the table when I put it against the actual webpage.

The results I get are:

Citrix
Civica IBS
EDRMS
Network
Frontline
Intranet
Northgate Housing System
SAP
VPN Homeworking Access
PARIS
Mobile Telephony Service
iTrent HR Payroll

Which from your example isnt correct. The only thing I have changed is the address it looks at in the $ie.navigate and that was the result.

Do you have any idea’s?

Many Thanks

James