Web scraping: I want to pull only one value from one of the tags in the XML.


I am trying to scrape a web page to get a value for a particular message. Below is the only part of the page's XML that I am interested in. I need to find only the entry with href="browse.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events", and within that entry I want only the first value, 0.

I tried ParsedHtml by tag name for tr, and I also tried the inner text, but nothing is working. Kindly let me know if anyone has any thoughts on this.

Consumer.Siebel.VirtualTopic.catalog_changed_ev… Consumer.Siebel.VirtualTopic.catalog_changed_events


<a href="browse.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events">Browse</a>
<a href="queueConsumers.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events">Active Consumers</a>
<a href="queueProducers.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events">Active Producers</a>
<a href="queueBrowse/

?view=rss&feedType=atom_1.0" title="Atom 1.0">
<a href="queueBrowse/

?view=rss&feedType=rss_2.0" title="RSS 2.0">

<a href="send.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events&amp;JMSDestinationType=queue">Send To</a>
<a href="purgeDestination.action;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events&amp;JMSDestinationType=queue&amp;secret=46be1146-426f-4c4e-beea-8f532fa6e7b6">Purge</a>
<a href="deleteDestination.action;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events&amp;JMSDestinationType=queue&amp;secret=46be1146-426f-4c4e-beea-8f532fa6e7b6">Delete</a>

As far as I can remember this forum doesn’t deal well with XML code pasted into the post.
So it’s better if you paste the XML into Gist and add the Gist URL in the post.


Thanks for your input. Below is the link for my xml.


Not 100% sure what you’re after.

But if you have the above data in a variable (called $htmlData in my example), you could do something like this.

$htmlData = Get-Content test.html -Raw # I put your example into a file, so you would change this to whatever suits you.
$htmlValue = $htmlData | ConvertFrom-String | Select P7

You can skip the Select P7 just to see the layout of the data.

This will only work if the data is consistent, meaning that the value will always end up in P7. Otherwise you would need something to identify the specific tag you're searching for.
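If the value does not reliably land in P7, one way to identify the specific tag is to anchor on the browse.jsp link for the queue and then take the first numeric cell that follows it. This is only a rough sketch: the $sampleHtml row below is a made-up stand-in (the real page layout was not visible in the post), and $queueName, $pattern and $firstValue are names chosen for illustration.

```powershell
# Hypothetical sample row; the real page layout is an assumption.
$sampleHtml = @'
<tr>
<td><a href="browse.jsp;jsessionid=abc123?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events">Consumer.Siebel.VirtualTopic.catalog_changed_events</a></td>
<td>0</td>
<td>3</td>
</tr>
'@

$queueName = 'Consumer.Siebel.VirtualTopic.catalog_changed_events'

# Match the browse.jsp href that ends with the queue name, then lazily
# skip ahead to the first TD whose content is only digits.
$pattern = 'browse\.jsp[^"]*JMSDestination=' + [regex]::Escape($queueName) +
           '"[\s\S]*?<td>\s*(\d+)\s*</td>'

$firstValue = $null
if ($sampleHtml -match $pattern) {
    # $Matches[1] holds the text captured by (\d+).
    $firstValue = $Matches[1]
}
$firstValue
```

Because the pattern requires the closing quote right after the queue name, it should not pick up the queueConsumers.jsp or send.jsp links that mention the same queue. If the real cells carry attributes or nested tags, the TD part of the pattern would need adjusting.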

Another, somewhat crude, option would be to split the raw data on the closing TD tag:

$htmlValues = $htmlData -split '</td>'

This will create an array by splitting the raw data on each closing TD tag.
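To show what splitting on the closing TD tag produces, here is a small sketch. The $sampleRow string is hypothetical data standing in for the raw page content, and $cells is just an illustrative name.

```powershell
# Hypothetical sample data standing in for the raw page content.
$sampleRow = '<td>0</td><td>3</td><td>12</td>'

# -split takes a regular expression and is case-insensitive by default,
# so '</td>' also matches '</TD>'.
# Where-Object drops the empty string left after the last closing tag.
$cells = $sampleRow -split '</td>' | Where-Object { $_ -ne '' }

# Each element still carries its opening tag, for example '<td>0'.
$cells[0]
```

Each element still needs the opening tag stripped afterwards, which is why this approach is on the crude side.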

Otherwise you may want to look at HTML parsers such as Html Agility Pack.
But then you're more or less leaving the PowerShell realm and going into C#, XPath and LINQ.

Wow, it worked. Can I ask for one last bit of help?

Below is the output of your script:



I want the value '0' that sits between the opening and closing TD tags. I tried regular expressions with -match and -Pattern, but nothing is working. Below is the Get-Member output for the variable storing the above value.

PS C:\Users\kd****> $htmldata | Get-Member

   TypeName: Selected.System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
P7          NoteProperty string P7=0



You could do it in multiple ways; it mostly depends on how readable you want the code to be.
But here is an example.

$htmlData = Get-Content test.html -Raw # I put your example into a file, so you would change this to whatever suits you.
$htmlValue = $htmlData | ConvertFrom-String | Select -ExpandProperty P7
$htmlTagValue = $htmlValue[4]

The first extra step is -ExpandProperty, which returns just the content of P7 rather than an object with a P7 property.
Then you can decide how you want to extract the value.
The option above is quick and dirty: if the data is not consistent every time (the same issue as with P7), you will get errors or the wrong value.
What [4] does is take the fifth character of the string; strings can be indexed as if they were arrays of characters.

To make it a bit more robust and if the value you want only contains numbers then you could do a simple regex instead.

$htmlTagValue = $htmlValue -replace '\D'

But it depends on the possible values in the tag and on what you need.
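For comparison, here is a small sketch of the extraction options side by side, including a -match with a capture group. It assumes the P7 value is the cell with its TD tags intact; $sampleValue is a hypothetical stand-in for that value, and the other variable names are illustrative.

```powershell
# Hypothetical value, assuming P7 holds the cell with its tags intact.
$sampleValue = '<td>0</td>'

# Indexing a string returns a [char]; index 4 is the first character
# after the four characters of '<td>'.
$byIndex = $sampleValue[4]

# -replace with no replacement string deletes every non-digit character.
$byReplace = $sampleValue -replace '\D'

# -match with a capture group keeps only the digits between the tags.
$byMatch = $null
if ($sampleValue -match '<td>\s*(\d+)\s*</td>') {
    $byMatch = $Matches[1]
}
```

Indexing only returns a single character, so it would truncate a value such as 12. The -replace variant keeps every digit in the string, so it would also pick up digits from tag names or attributes if any were present. The capture group is the most specific of the three.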

I sincerely thank you for your quick and detailed response. It worked perfectly. Many thanks, Mr. Fredrik Kacsmarck.