Downloading Tables from webpages

As I mentioned, it’s tough to build this kind of code because it breaks if anything on the page changes. Take a look at the tables on the page. I believe the issue is that you’re looping through every table rather than parsing only the one you care about. If the table has a name or ID, you can use .getElementsByName (which returns a collection) or .getElementById instead of looping through all tables with .getElementsByTagName("table"). If the table has an inconsistent or missing name/ID, you can pick it out by index, e.g. .Item(1) (indexes start at 0, so 1 is the second table on the page).

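$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Navigate("C:\Users\Rob\Desktop\output2.html")
# Navigate returns before the page has loaded, so wait until IE is no longer busy
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }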

$htmlResults = @()

foreach($table in $ie.Document.getElementsByTagName("table")) {
$table
}

That will get you all of the tables on the page and dump all of the attributes of that table. You are looking for a unique identifier to help filter out tables you don’t want to parse. Use the rest of the code I posted and try to limit the table parsing to just the table you want to parse.

It appears the sample data isn’t the same as what’s on the web page. The regex could probably be adjusted to work with the actual data, but I can’t do that without knowing what’s different.

Hello,

I don’t know if it will help, but here is the dump from your command for pulling the tables from the page (attached). I have tried getElementByID and that returns the same information, which is why I am confused.

I presume I am missing something in the HTML?

Many Thanks

James

What are you using to retrieve the data that you’re running the regex on? It’s made to work with a single, multiline string. The first code you posted (using system.net.webclient) should give you that.

I am sorry, I don’t understand. I am not using a regex as far as I am aware. The code I originally posted was what I threw together after some googling and playing around.

Could you please explain a little more, and I may be able to help?

Many Thanks

James

Right now you have two solutions, one using regex and one using DOM. You also have two different methods of getting the data from the web page - one using system.net.webclient and one using IE. The regex solution I posted is appropriate for raw page source data, like the data stream you’d get back from reading a web page using system.net.webclient.

I think it would be easier to use the IE method, since IE is on all machines. I must admit I haven’t tried your regex, mainly because I am not sure how to get the output file each time for it to use, so although it may work better, I don’t fully understand it.

Many Thanks

James

Try this:

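$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Navigate("C:\Users\Rob\Desktop\output2.html")
# Navigate returns before the page has loaded, so wait until IE is no longer busy
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }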

$htmlResults = @()

#foreach($table in $ie.Document.getElementsByTagName("table")) {
$table = $ie.Document.getElementsByTagName("table").Item(1)
$tableHdr = @()

foreach($th in $table.getElementsByTagName("th")) {
    $columnName = $th.getElementsByTagName("a").Item(0).innerHTML
    $tableHdr += $columnName
}


foreach($tr in $table.getElementsByTagName("tr")) {
    #filter to find the rows in the table where the data is
    if($tr.style.backgroundColor -eq "rgb(239, 243, 251)" -or $tr.style.backgroundColor -eq "white") {
        $tds = $tr.getElementsByTagName("td")
        $rowProps = New-Object PSObject
        for ($i=0; $i -lt $tds.Length; $i++) {
             if ($tds.Item($i).innerHTML -like "<span*") {
                $rowProps | Add-Member -MemberType NoteProperty -Name $tableHdr[$i].ToString() -Value ($tds.Item($i).getElementsByTagName("span").Item(0).innerHTML)
             }
             else {
                $rowProps | Add-Member -MemberType NoteProperty -Name $tableHdr[$i].ToString() -Value ($tds.Item($i).innerHTML)
             }
        }
        $htmlResults += $rowProps      
    }
}

#}

$htmlResults

$ie.Quit()

So, rather than looping through each table on the page, we’re statically selecting the table by its index. Indexing starts at 0, and table 0 on your page is just used for formatting the current date. Index 1 is the table you want to parse. If you look at your output3.txt, it shows an ASP.NET-generated ID (ctl00_ContentPlaceHolder2_GridView1), so the table can be accessed directly that way too:

$table = $ie.Document.getElementById("ctl00_ContentPlaceHolder2_GridView1")

That ID is generated automatically by ASP.NET from the control’s place in the page’s control hierarchy, so it can change if the page is restructured and is not a great ID to hardcode. Hopefully this works for you and DOM parsing makes a little more sense.

Hello,

Thank you for that. I am still only getting the following:

Citrix
Civica IBS
EDRMS
Network
Frontline
Intranet
Northgate Housing System
SAP
VPN Homeworking Access
PARIS
Mobile Telephony Service
iTrent HR Payroll

However, after having a look at the code on the page, the colours are written slightly differently. Where you have:

if($tr.style.backgroundColor -eq "rgb(239, 243, 251)"

the page code has:

if($tr.style.backgroundColor -eq "#eff3fb"

So I have updated the code with the above and it’s working as it should.

Yes, the DOM approach is simpler. I am guessing (as I will have to test) that I can now search for anything which has a status of red or yellow and then just pop that to screen.

Thank you very much for your help. I appreciate it all, from everyone in the thread.

James

Another mystery solved, glad it’s working for you.
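For anyone hitting the same mismatch: in this thread the same row color showed up as rgb(239, 243, 251) in one place and #eff3fb in another. A rough sketch of one way to make the row filter tolerate both forms is to normalize the value before comparing. ConvertTo-HexColor and $dataRowColors are made-up names for illustration, not part of the code posted above, and the sample strings stand in for $tr.style.backgroundColor.

```powershell
# Hypothetical helper: normalize a CSS color string to lowercase #rrggbb.
# Named colors such as "white" are passed through unchanged (lowercased).
function ConvertTo-HexColor {
    param([string]$Color)

    $value = $Color.Trim().ToLower()
    if ($value -match '^rgb\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)$') {
        # Convert each decimal channel to two hex digits.
        return '#{0:x2}{1:x2}{2:x2}' -f [int]$Matches[1], [int]$Matches[2], [int]$Matches[3]
    }
    return $value
}

# Colors that mark a data row, written once in normalized form (assumed from the thread).
$dataRowColors = @('#eff3fb', 'white')

# Sample values standing in for $tr.style.backgroundColor.
foreach ($sample in 'rgb(239, 243, 251)', '#EFF3FB', 'white', 'rgb(255, 0, 0)') {
    $isDataRow = $dataRowColors -contains (ConvertTo-HexColor $sample)
    '{0,-20} data row: {1}' -f $sample, $isDataRow
}
```

In the parsing loop, the condition would then become something like `if ($dataRowColors -contains (ConvertTo-HexColor $tr.style.backgroundColor))`, assuming the property returns a plain string.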

Yes it is.

I do however have one slight question which is confusing me…

I have added in the following:

`

$Red = $htmlResults | Where{$_.Status -eq "Red"}
$Yellow = $htmlResults | Where{$_.Status -eq "Yellow"}
$Green = $htmlResults | Where{$_.Status -eq "Green"}

If($Red -ne ""){
ForEach-Object{
Write-Host $Red."Services" 
Write-Host $Red.Description 
Write-Host $Red.Status
Write-Host $Red.Information
 }
}

If($Yellow -ne ""){
ForEach-Object{
Write-Host $Yellow."Services" 
Write-Host $Yellow.Description 
Write-Host $Yellow.Status
Write-Host $Yellow.Information
 }
}

`

Which works fine and pulls out the relevant information; however, if there is more than one of them, it throws all the information together.

So if I output $Yellow (On test) I get:

Services  Description                 Status Information
--------  -----------                 ------ -----------
Carefirst System upgrade planned for… Yellow System will not be availab…
Internet  Slow between 12 - 2pm       Yellow Service affected: Internet…
Intranet  Slow between 12 - 2pm       Yellow Service affected: Internet…

However if I look at the output from the code above:

`

If($Yellow -ne ""){
foreach($line in $lines){
Write-Host $Yellow."Services" 
Write-Host $Yellow.Description 
Write-Host $Yellow.Status
Write-Host $Yellow.Information
 }
}

`

I get the following output (code tags due to html in output):

`
Carefirst Internet Intranet
System upgrade planned for 4th - 6th Feb  Slow between 12 - 2pm Slow between 12 - 2pm
Yellow Yellow Yellow
System will not be available for 3 days.<BR>Please make provisions for this in your service areas.  Service affected: In
ternet Access particularly between 12 - 2pm.<BR><BR>Who is affected? All Internet users including home workers and certa
in remote offices.<BR><BR>Problem description: General slowness of Internet access caused by greater use of the Internet
 at these times. Problems connecting remotely for home workers and site to site VPN attached offices.<BR><BR>What is cur
rently being done to resolve it? Internet traffic reviewed and a rationalisation taken place. New lines have been added.
 This has released additional capacity to be used until the internet feed is upgraded.<BR><BR>Next steps: Upgrade curren
t internet feed.<BR><BR>Expected solution: Feb 2011 Service affected: Internet Access particularly between 12 - 2pm<BR><
BR>Who is affected? All Internet users including home workers and certain remote offices.<BR><BR>Problem description: Ge
neral slowness of Internet access caused by greater use of the Internet at these times. Problems connecting remotely for
 home workers and site to site VPN attached offices.<BR><BR>What is currently being done to resolve it? Internet traffic
 reviewed and a rationalisation taken place. New lines have been added. This has released additional capacity to be used
 until the internet feed is upgraded.<BR><BR>Next steps: Upgrade current internet feed<BR><BR>Expected Solution: Feb 201

`

Do you have any ideas what I have done wrong? I am thinking I need a foreach rather than a ForEach-Object, but I can’t get it to work…

Many Thanks

James

The issue is you are looping through and pulling individual properties and writing them to your console as strings. If you type $yellow in the console, it is an object and will display properly. What are you trying to do with the individual results? This code has several issues:

If($Yellow -ne ""){
    foreach($line in $lines){
        Write-Host $Yellow."Council Services"
        Write-Host $Yellow.Description
        Write-Host $Yellow.Status
        Write-Host $Yellow.Information
    }
}

What is $lines? The code is saying for every item ($line) in $lines, do x. Then $Yellow.ColumnName is going to dump the entire column as a string.

Try something like this:

$yellow | foreach{$_}

or

foreach ( $line in $yellow ) { $line }

Another command you should look at is:

$htmlResults | Group-Object -Property Status -NoElement
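To make the difference concrete without needing the web page, here is a small sketch using stand-in data. The three Yellow rows reuse values from the output above (descriptions shortened); the Citrix row and its Green status are invented purely so the grouping has more than one bucket.

```powershell
# Stand-in for $htmlResults. The Citrix/Green row is invented for illustration.
$htmlResults = @(
    New-Object PSObject -Property @{ Services = 'Carefirst'; Description = 'System upgrade planned'; Status = 'Yellow' }
    New-Object PSObject -Property @{ Services = 'Internet';  Description = 'Slow between 12 - 2pm';  Status = 'Yellow' }
    New-Object PSObject -Property @{ Services = 'Intranet';  Description = 'Slow between 12 - 2pm';  Status = 'Yellow' }
    New-Object PSObject -Property @{ Services = 'Citrix';    Description = 'No issues';              Status = 'Green' }
)

# @() keeps the result an array even if only one row (or none) matches.
$yellow = @($htmlResults | Where-Object { $_.Status -eq 'Yellow' })

# Each pass of the loop sees ONE row, so its values stay together.
foreach ($line in $yellow) {
    '{0}: {1} [{2}]' -f $line.Services, $line.Description, $line.Status
}

# Summary of how many rows there are per status.
$htmlResults | Group-Object -Property Status -NoElement
```

The key point is that the loop body uses $line (the current row), not $yellow (the whole collection).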

Hello,

Never mind, I sorted it out after some fiddling and realising I was going about it wrong!

$Red = $htmlResults | Where{$_.Status -eq "Red"}
$Yellow = $htmlResults | Where{$_.Status -eq "Yellow"}
$Green = $htmlResults | Where{$_.Status -eq "Green"}

# $Red is empty ($null) when no rows matched, so test it directly rather than against ""
If($Red){
    foreach($line in $Red) {
        $Start = "More Information: "
        # Turn the HTML line breaks into real new lines
        $NewLine = $line.Information -replace "<BR>" , "`n"
        $Info = "$Start $NewLine"

        Write-Host "Service: " $line."Council Services"
        Write-Host "Description: " $line.Description
        Write-Host "Status: " $line.Status
        Write-Host $Info
        Write-Host " "
        Write-Host " "
    }
}

If($Yellow){
    foreach($line in $Yellow){
        $Start = "More Information: "
        $NewLine = $line.Information -replace "<BR>" , "`n"
        $Info = "$Start $NewLine"

        Write-Host "Service: " $line."Council Services"
        Write-Host "Description: " $line.Description
        Write-Host "Status: " $line.Status
        Write-Host $Info
        Write-Host " "
        Write-Host " "
    }
}

Which got me the spaced output I required.

Many Thanks

James

Try:

$yellow | Format-Table -AutoSize -Wrap
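If the spaced, multi-line layout is what you want, the separate Red and Yellow loops could probably be folded into one pipeline function. This is only a sketch: Format-ServiceStatus is a made-up name, the sample rows are abbreviated or invented stand-ins for $htmlResults, and it assumes the column is called "Council Services" as in the code above.

```powershell
# Hypothetical helper: turn one row into a multi-line text block.
function Format-ServiceStatus {
    param([Parameter(ValueFromPipeline = $true)]$Row)
    process {
        # Turn HTML line breaks into real new lines (-replace is case-insensitive).
        $info = $Row.Information -replace '<BR>', "`n"
        @(
            "Service: $($Row.'Council Services')"
            "Description: $($Row.Description)"
            "Status: $($Row.Status)"
            "More Information: $info"
            ''
        ) -join "`n"
    }
}

# Stand-in data: the Internet row is abbreviated from the thread, the Citrix row is invented.
$htmlResults = @(
    New-Object PSObject -Property @{
        'Council Services' = 'Internet'
        Description        = 'Slow between 12 - 2pm'
        Status             = 'Yellow'
        Information        = 'Service affected: Internet Access<BR><BR>Who is affected? All Internet users'
    }
    New-Object PSObject -Property @{
        'Council Services' = 'Citrix'
        Description        = 'No issues'
        Status             = 'Green'
        Information        = ''
    }
)

# One pass covers both Red and Yellow; nothing is printed when no rows match.
$report = @($htmlResults |
    Where-Object { @('Red', 'Yellow') -contains $_.Status } |
    Format-ServiceStatus)

$report
```

Returning strings instead of calling Write-Host means the same output can also be piped to a file or an email body, which may or may not matter for your case.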