Convert web page source to HTML object

Hi all,

I am rewriting my script that used the InternetExplorer.Application COM object to log in, navigate, and find elements (links) in the page source. I am now using the Selenium module with the Firefox driver to do the same.

I have reached the point where I can log in and navigate; I just can't find a way to convert the page source (an HTML string) into an object that I can filter by tag name and class to select hrefs.

I found an example on a website showing how to create an HTML COM object and write the page source into it to get a DOM object. However, when I search it for 'a' tags, nothing is found, as if the data was never added to the object.

I need your help converting the page source (an HTML string) into an object I can work with.

Here is the sample code which uses ‘selenium’ module:

# Prompt for credentials and start the Selenium Firefox driver
$PSCred = Get-Credential
$FFDriver = Start-SeFirefox
# Log in
$FFDriver.Navigate().GoToURL('https://login.somewebsite.com/')
$FFDriver.FindElementByName('email').SendKeys($PSCred.UserName)
$FFDriver.FindElementByName('password').SendKeys($PSCred.GetNetworkCredential().Password)
$FFDriver.FindElementByName('password').Submit()
Start-Sleep -Seconds 3
# Navigate to the target page and grab its source
$FFDriver.Navigate().GoToURL('https://www.somewebsite.com/aaa/bbb/')
$FFDriver.Title
$Source = $FFDriver.PageSource
# Create HTML file object
$HTML = New-Object -ComObject "HTMLFile"
# Write HTML content according to DOM Level 2
$HTML.IHTMLDocument2_write($Source)
$LinkElements = $HTML.getElementsByTagName('a') | Where-Object {$_.href -like "$xxx*"} | Where-Object {$_.className -eq 'xxx'}

# Cleanup
Remove-Variable -Name PSCred
$FFDriver.Close()
$FFDriver.Quit()
$FFDriver.Dispose()
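As a side note for anyone hitting the same empty-DOM problem: one variant I have seen suggested (not verified here, so treat it as a sketch) is to hand write() a Unicode byte array instead of a string, since on some PowerShell/IE combinations the string overload leaves the DOM empty:

```powershell
# Suggested variant (unverified): pass the source as a Unicode byte array
$HTML = New-Object -ComObject 'HTMLFile'
$HTML.write([System.Text.Encoding]::Unicode.GetBytes($Source))
# If the DOM populated, the anchors should now be visible
$HTML.getElementsByTagName('a') | Select-Object -First 5
```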

ofergnant,

Please provide some examples of the data you are looking for; if you can provide a public site, that will help. Are you not able to use Invoke-WebRequest to call the website?

Invoke-WebRequest -Uri https://www.google.com
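If the page is reachable without JavaScript, the parsed response already exposes the anchors via its Links property, so the tag/href filtering can look like this (a sketch; Windows PowerShell 5.1 parses more element detail than PowerShell Core does):

```powershell
# Invoke-WebRequest parses the page and exposes <a> elements via .Links
$Response = Invoke-WebRequest -Uri 'https://www.google.com'
# Filter by href prefix, similar to the COM-object approach above
$Response.Links | Where-Object { $_.href -like 'http*' }
```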


[quote quote=173272]ofergnant,

Please provide some examples of the data you are looking for; if you can provide a public site, that will help. Are you not able to use Invoke-WebRequest to call the website?

Invoke-WebRequest -Uri https://www.google.com
[/quote]

It would be better if I could use Invoke-WebRequest to log in, but the website uses auth0 (three redirects) with JavaScript and AJAX. When I tried it, no forms were created, I guess because they take longer to load since the JS is pulled from an external source. I also wasn't sure whether I could keep the session logged in while parsing and following links, but I think the -SessionVariable parameter does that, so if I manage to log in I am good for the rest of the script.
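As a minimal sketch of the -SessionVariable pattern (assuming the login page serves a plain server-rendered HTML form, which an auth0 flow with three redirects may not; the field names and URLs are placeholders, and .Forms is only populated by Windows PowerShell 5.1):

```powershell
# Hypothetical login flow that persists cookies in a web session
$PSCred = Get-Credential
$Login  = Invoke-WebRequest -Uri 'https://login.somewebsite.com/' -SessionVariable Session
$Form   = $Login.Forms[0]
$Form.Fields['email']    = $PSCred.UserName
$Form.Fields['password'] = $PSCred.GetNetworkCredential().Password
# Post the form and reuse the same session (cookies) for every later request
Invoke-WebRequest -Uri $Form.Action -Method Post -Body $Form.Fields -WebSession $Session | Out-Null
$Page = Invoke-WebRequest -Uri 'https://www.somewebsite.com/aaa/bbb/' -WebSession $Session
```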

For now, I have found a workaround using the HtmlAgilityPack suggested here: https://powershell.org/forums/topic/find-text-on-a-web-page/ . I create the HTML object the same way I did before, only its LoadHtml method does a better job of writing the web source into the object. Using that example I managed to find links, filter them by class name, and then pull out titles and hrefs.

I still need to learn more about the Agility Pack's filter options, as the XPath syntax it uses is not one I am familiar with.

If anyone can suggest a simpler approach that makes Invoke-WebRequest work, I would prefer it, as I don't like having many dependencies in my automation scripts.

BTW, the website I am not able to log in to using Invoke-WebRequest is https://login.thetimes.co.uk

Here is the code addition:

# One-time setup: register the NuGet feed and install HtmlAgilityPack
Register-PackageSource -Name MyNuGet -Location https://www.nuget.org/api/v2 -ProviderName NuGet
Install-Package HtmlAgilityPack

$Source = $FFDriver.PageSource
# Create the HTML object and load the page source into it
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($Source)
# Get all of today's new links (the class name is site-specific)
$Links = $doc.DocumentNode.SelectNodes('//a[@class="classname"]')
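To then pull the titles and hrefs out of the matched nodes, the Agility Pack's node API can be used like this (a sketch; "classname" above stands in for the real class):

```powershell
# Project each matched <a> node into a title/href pair
$Links | ForEach-Object {
    [pscustomobject]@{
        Title = $_.InnerText.Trim()
        Href  = $_.GetAttributeValue('href', '')
    }
}
```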