Assistance with Invoke-WebRequest

Hi everyone,

Hoping someone here can help me with some website scraping I’m trying to do using Invoke-WebRequest. I would like to scrape the comments section of a website I frequent. After poking around and doing counts and href lookups, I have figured out that it’s a WordPress blog and the comments are in a #comments subdirectory. If I do the following:

(Invoke-Webrequest -uri https://thewpblogsite.com/directory/#comments).Content, I get all the data from the comments section, but of course I only want the comments themselves. I’ve noticed that each comment is wrapped in the following tags…

<section class="comment-content comment">

</section><!-- .comment-content -->

I’ve tried a ton of ways to narrow this down in a way that PS will understand, but I’m still fairly new to PowerShell, so I’m not advanced enough to know how to extract this.

Any assistance is greatly appreciated. Thanks in advance.

Nelson

Can you share an example output? You can create a gist at gist.github.com and paste the URL here to share the XML snippet.

Sure.

Here is the website whose comments I wish to scrape, in my variable… $URI = "https://frugalvagabond.com/get-non-lucrative-residence-visa-spain/#comment"

And my other variable…

$HTML = Invoke-WebRequest -Uri $URI

Based on the tag and class name I see for the comments I want to scrape, here is what I am entering…

($HTML.ParsedHtml.getElementsByTagName("SECTION") | Where{ $_.className -like "comment-content comment" } ).innertext

But I get nothing in return.

I have run GM and OGV as well and while I think I am choosing the right tag and class, the comment data doesn’t come up. I assume it’s a different tag and class but not sure which I should be choosing since it’s all that really stands out to me and all I want is the text of the actual comments on this webpage.

Thank you in advance.

Nelson

<!-- .comment-content -->

That looks like more of a comment in HTML for a CSS class, but regardless, in order to assist we need an example of the XML or JSON that is being sent in the response to provide parsing opportunities.

Thank you, Rob. Sorry but this is all fairly new to me. Happy to get this for you but how would I go about getting that output? I see kvprasoon suggested gist.github.com, went there but I don’t see where I could enter the web address.

A small nudge in the right direction would be greatly appreciated. Thanks!

Nelson

Create the gist and just paste the link here. This forum will pull it from gist and show it here.

Not sure if this is what you’re looking for, but here is my gist of the pieces I’d like to be able to scrape…

https://gist.github.com/nelsonsaenz/7f921bd16976c82195c32609a4b815c2

Here is a start. HTML is nested and there are multiple layers, but just picking out the comments, you could do something like this:

$URI = "https://frugalvagabond.com/get-non-lucrative-residence-visa-spain/#comment"
$HTML = Invoke-WebRequest -Uri $URI

#Start at the level where all of the comments are
$div = $html.ParsedHtml.getElementById('comments')

$results = foreach ($divElem in $div) { 
    #Get the title of the blog that is being commented on
    $title = $divElem.getElementsByTagName('h2') | Select -ExpandProperty innerText

    #Loop through all of the li elements, which is basically each post
    foreach ($liElem in $divElem.getElementsByTagName('li')) { 
        #Grab the LI element ID
        $id = $liElem.id
        $paragraph = @()
        #Loop through each P, which is each paragraph and create array
        foreach ($pElem in $liElem.getElementsByTagName('p')) {
            $paragraph += $pElem.innerText
        }

        [pscustomobject]@{
            Title   = $title
            Id      = $id
            Comment = $paragraph -join [environment]::NewLine
        }
    }
}

Output:

PS C:\WINDOWS\system32> $results.Count
1106

PS C:\WINDOWS\system32> $results | Select -First 5

Title                                                                  Id               Comment                                                                                                                                                                  
-----                                                                  --               -------                                                                                                                                                                  
1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-22751 Congratulations! What a process that was and what a resource you created. Cannot wait to follow along!...                                                                
1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-22752 Woo hoo! Thank you! And thank you for being one of the early secret-keepers about this journey! I hope the post will help someone down the road (though I guess it's a...
1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-35849 Well its helping me,… and i would love to pay for it with a coffee or lunch...                                                                                           
1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-35903 Thanks, Imran. I'm thrilled it's helping! Once you make it to this end look me up and we'll grab some coffee                                                             
1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-41052 Do you need to have your visit within 3 months of leaving? If I want to go mid July do I have to wait mid April for an appointment or can I go in sooner?...             
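
As a side note on the class-based filter tried earlier in the thread: -like without wildcards only matches when the whole className string is identical, so a more tolerant approach is to split the class attribute into individual names and test for the one you care about. The following is only a sketch under assumptions. The $elements array holds hypothetical stand-in objects rather than real DOM elements from the page, and it does not claim to explain why the earlier attempt returned nothing.

```powershell
# Hypothetical stand-ins for DOM elements; only className and innerText matter here.
$elements = @(
    [pscustomobject]@{ className = 'comment-content comment';  innerText = 'First comment' }
    [pscustomobject]@{ className = 'comment-meta';             innerText = 'Posted by someone' }
    [pscustomobject]@{ className = 'comment  comment-content'; innerText = 'Second comment' }
)

# Split each class attribute on whitespace and keep elements that carry
# the 'comment-content' class, regardless of order or extra classes.
$commentText = $elements |
    Where-Object { ($_.className -split '\s+') -contains 'comment-content' } |
    Select-Object -ExpandProperty innerText

$commentText   # First comment, Second comment
```

The same Where-Object filter could be applied to whatever collection getElementsByTagName returns, assuming those elements expose className the same way.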

Rob,

Thank you so much for your help. I’ve been studying what you sent over and just had a few questions so that I can better understand how you went about this…

So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I would assume these are HTML tags and then also finding what would be the unique identifier for each comment which looks to be id.

Thank you again, I will definitely study your code as a template moving forward. Really appreciate it.

Nelson

So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I would assume these

Your question trailed off there, but if you were searching the raw text, parsing it would be more difficult. The parsing used for HTML is the Document Object Model (DOM), which was designed for JavaScript, but regardless, we can use its methods to programmatically parse HTML. If you are familiar with HTML (or XML), it is a nested structure of nodes, so there is a standard structure of HTML > BODY, and then the rest is up to the developer.

Typically, a good place to review the structure is the developer tools (F12) in the browser when you’re on the page. They let you search and expand/collapse the HTML to narrow down where to start. I started at the named DIV, so when you ask for all P tags, you only get the ones under that DIV. P is a paragraph, so if you have multiple paragraphs in your comment, we put them in an array and join them with a newline. In summary, it’s about narrowing things down as much as possible and looping through the nodes to get what you want.
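
To make the "narrow down, then loop" idea concrete, here is a minimal sketch that runs without hitting the website. It assumes a small, well-formed sample document (made up for illustration, not copied from the real page) and uses PowerShell's [xml] cast with XPath instead of ParsedHtml. Real-world HTML is often not well-formed XML, so the cast may fail on an actual page; the point is only to show the nested-node approach.

```powershell
# Hypothetical, well-formed sample markup (not taken from the real page).
$sample = [xml]@'
<html>
  <body>
    <p>Paragraph outside the comments area</p>
    <div id="comments">
      <h2>2 thoughts on "Sample Post"</h2>
      <ol>
        <li id="li-comment-1">
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </li>
        <li id="li-comment-2">
          <p>Only paragraph.</p>
        </li>
      </ol>
    </div>
  </body>
</html>
'@

# Narrow down to the container first, so later lookups stay inside it.
$container = $sample.SelectSingleNode("//*[@id='comments']")

$results = foreach ($li in $container.SelectNodes('.//li')) {
    # Collect every paragraph under this li and join them into one string.
    $paragraphs = foreach ($p in $li.SelectNodes('.//p')) { $p.InnerText }

    [pscustomobject]@{
        Id      = $li.GetAttribute('id')
        Comment = $paragraphs -join [Environment]::NewLine
    }
}

$results
```

Because every lookup starts from $container, the paragraph outside the comments area never shows up in the results, which is the same reason the earlier script starts from the element with the id 'comments'.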

Something odd is going on with this post. I think some of the HTML tags broke something, because the option to edit, quote, etc. is missing from every post in this thread, which is most likely what happened to your message as well.

Hi Rob,

OK, that makes sense, but I also thought that, since this is way outside my wheelhouse, it was very possible I was asking my question in a confusing manner.

In any event, thank you again!! Greatly appreciated.