Scrape Webpage with PS for Part Info

pjw2376 · November 1, 2022, 6:15pm

I need to scrape a bunch of webpages to obtain part numbers and other information. I have been trying unsuccessfully for two days and figured it was time to ask the experts. The data I’m looking to capture is: Ref#, PartName, Price, Qty

$URL = 'https://www.yamahapartshouse.com/oemparts/a/yam/500449cff8700209bc790600/valve'
$Request = Invoke-WebRequest -Uri $URL -UseBasicParsing

An example of the HTML for a single part looks like the following:

<div class="partlistrow">
        <form action="/cart/addoempart" data-name="Valve, Intake" data-qoh="0" data-sku="5NL-12111-00-00" id="add_1_5NL-12111-00-00" method="post">
        <input type="hidden" id="manf_1" name="manf" value="YAM" />
        <input type="hidden" id="assembly_1" name="assembly" value="500449cff8700209bc790600" />
        <input type="hidden" id="sku_1" name="sku" value="5NL-12111-00-00" />
        <div class="c0"><span>1</span></div>
        <div class="c1">
            <div class="c1a" style="width:100%;display:table;table-layout:fixed;">
                <span style="white-space: nowrap;">Valve, Intake</span>
            </div>
            <div class="clear"></div>
            <div class="c1b">
                
                    <a href="/oemparts/p/yamaha/5nl-12111-00-00/valve-intake"><span class="itemnumstrike">5NL-12111-00-00</span></a> <a href="/oemparts/p/yamaha/5nl-12111-30-00/valve-intake"><span class="itemnumnew">5NL-12111-30-00</span></a>
                
            </div>
            <div class="clear"></div>
            
        </div>
        
                <div class="c2"><span class="dbl">$116.99</span></div>    
                <div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>
                <div class="c4"><input type="submit" id="addtocart_1_5NL-12111-00-00" class="ui ui-icon-smadd btnpartadd" value="Add" /></div>
            
            
        <div class="clear"></div>
        </form>
    </div>

When I search around on how best to do this, there doesn’t seem to be as much info as I would expect. There is some talk about using “ParsedHtml”; however, it is blank in my case. There are other references to using regex, but I haven’t had much success with that either. Can someone please point me in the right direction on how to go about this? I have 42 pages that I need to pull info for. Any help is greatly appreciated.

Key HTML tags:

<div class="c0"><span>1</span></div>
<div class="c1a" style="width:100%;display:table;table-layout:fixed;">
    <span style="white-space: nowrap;">Valve, Intake</span>
</div>
<div class="c2"><span class="dbl">$116.99</span></div>
<div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>

Olaf · November 1, 2022, 8:53pm

pjw237690,
Welcome back to the forum. … long time no see.

… and I can tell you why … you cannot parse HTML with PowerShell and regex. If they do not offer an API you can use to get the data in a machine-readable format like XML or JSON or, even better, CSV, you’re pretty much out of luck. Sorry.

tonyd · November 1, 2022, 10:30pm

First off, I agree with Olaf.

In the past, I have used VBS to access the HTML DOM (Document Object Model) to get web page information, but that was ages ago. It also used IE, which for the most part is dead. I also don’t have time to test something, but you might get some ideas here:

And, should you still have IE available …
https://social.technet.microsoft.com/Forums/scriptcenter/en-US/38a40c23-99a7-41e5-b5de-e527da600cc0/html-dom-with-powershell?forum=winserverpowershell

In my case, I wrote a cheesy script to get FedEx shipping quotes. You will need to view the page source to find the element IDs, classes, or whatever else identifies the values you want returned, as they should have names defined.

Another word of caution: if the page is internet-facing and changes, your script will typically break.

krzydoug · November 1, 2022, 11:22pm

This is because you are calling Invoke-WebRequest with -UseBasicParsing. That switch skips DOM parsing, so ParsedHtml is never populated. Just consider the name of that parameter and your own words: “parsedHTML is blank.”

You should be very careful to ensure you are not breaking the terms of service of this site, or of any site for that matter. Depending on the circumstances, you may even be breaking the law. If you have any doubt, contact the site and ask for their permission. They would hopefully respond with “here is our API”, because that is going to be the best choice. You can try to parse HTML with regex, but it’s problematic: sooner or later the website’s formatting will change, and the brittle regex parsing will fail. I agree with Olaf 100% that an API is your best choice.

That being said, to parse anything in regex you just have to take it chunk by chunk. Figure out the first step, then build upon it until you can do something ridiculous like this.

$URL = 'https://www.yamahapartshouse.com/oemparts/a/yam/500449cff8700209bc790600/valve'
$Request = Invoke-WebRequest -Uri $URL -UseBasicParsing

$pattern = '(?s)c0"><span>(?<Ref>.+?)</.+nowrap;">(?<PartName>.+?)</.+itemnum(?:new)?">(?<ItemNumber>.+?)</.+dbl">(?<Price>.+?)</.+input" value="(?<Quantity>.+?)"'

$partlist = foreach($chunk in $Request.Content -split 'partlistrow'){
    if($chunk -match $pattern){
        $Matches.Remove(0)

        [PSCustomObject]$Matches
    }
}

$partlist

The output will look like this

You can then format it for console output, output to csv, etc.

$partlist | Format-Table -Property Ref, PartName, ItemNumber, Price, Quantity
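For the CSV route, here is a minimal sketch. It assumes $partlist holds objects shaped like the ones the loop above produces; the single row below is a stand-in built from the sample HTML in the first post, and 'parts.csv' is a hypothetical file name. Since the regex captures everything as text, the sketch also converts Price and Quantity to numbers before exporting.

```powershell
# Stand-in for $partlist: one row with the values from the sample HTML in the first post.
$partlist = @(
    [PSCustomObject]@{
        Ref        = '1'
        PartName   = 'Valve, Intake'
        ItemNumber = '5NL-12111-30-00'
        Price      = '$116.99'
        Quantity   = '1'
    }
)

# Strip everything except digits and the decimal point so Price can be cast to a number.
$clean = $partlist | Select-Object Ref, PartName, ItemNumber,
    @{ Name = 'Price';    Expression = { [decimal]($_.Price -replace '[^\d.]') } },
    @{ Name = 'Quantity'; Expression = { [int]$_.Quantity } }

# ConvertTo-Csv emits the same text that Export-Csv would write to disk.
$csv = $clean | ConvertTo-Csv -NoTypeInformation
$csv

# To write a file instead ('parts.csv' is a hypothetical path):
# $clean | Export-Csv -Path 'parts.csv' -NoTypeInformation
```

With numeric Price and Quantity columns, things like Sort-Object Price or Measure-Object -Sum behave as expected rather than sorting as strings.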

See this regex demo for details on the pattern
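To illustrate the "chunk by chunk" approach, here is a rough sketch that builds part of the pattern one capture group at a time and tests each step offline. The $sample here-string is a trimmed copy of the row from the first post (the href values are shortened), and the $step1 to $step3 names are just for illustration.

```powershell
# Trimmed copy of the sample row from the first post, so each step can be tested offline.
$sample = @'
<div class="partlistrow">
<div class="c0"><span>1</span></div>
<div class="c1a"><span style="white-space: nowrap;">Valve, Intake</span></div>
<div class="c1b"><a href="#"><span class="itemnumstrike">5NL-12111-00-00</span></a> <a href="#"><span class="itemnumnew">5NL-12111-30-00</span></a></div>
<div class="c2"><span class="dbl">$116.99</span></div>
<div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>
</div>
'@

# Step 1: capture only the reference number. (?s) lets . match across line breaks.
$step1 = '(?s)c0"><span>(?<Ref>.+?)</'
# Step 2: extend the same pattern to also capture the part name.
$step2 = $step1 + '.+nowrap;">(?<PartName>.+?)</'
# Step 3: extend it again to capture the price.
$step3 = $step2 + '.+dbl">(?<Price>.+?)</'

# Check each step before building on it: list the named groups it captured.
$results = foreach ($pattern in $step1, $step2, $step3) {
    if ($sample -match $pattern) {
        $Matches.Remove(0)                        # drop the whole-match entry, keep named groups
        ($Matches.Keys | Sort-Object) -join ', '
    }
}
$results

# Once the last step matches, turn the named groups into an object.
if ($sample -match $step3) {
    $Matches.Remove(0)
    $row = [PSCustomObject]$Matches
    $row
}
```

If a step stops matching, only the piece added last needs checking, which is much easier than debugging the full pattern in one go.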

tonyd · November 1, 2022, 11:49pm

Well done Krazy Doug

I really do need to learn RegEx.

Olaf · November 2, 2022, 12:06am

pjw2376 · November 2, 2022, 8:25pm

@krzydoug, that is awesome, thank you for your help with this. I guess it’s time I start learning RegEx.
