Scrape Webpage with PS for Part Info

I need to scrape a bunch of webpages to obtain part numbers and other information. I have been trying unsuccessfully for two days and figured it was time to ask the experts. The data I’m looking to capture is: Ref#, PartName, Price, Qty

$URL = 'https://www.yamahapartshouse.com/oemparts/a/yam/500449cff8700209bc790600/valve'
$Request = Invoke-WebRequest -Uri $URL -UseBasicParsing

An example of the HTML for a single part looks like the following:

<div class="partlistrow">
        <form action="/cart/addoempart" data-name="Valve, Intake" data-qoh="0" data-sku="5NL-12111-00-00" id="add_1_5NL-12111-00-00" method="post">
        <input type="hidden" id="manf_1" name="manf" value="YAM" />
        <input type="hidden" id="assembly_1" name="assembly" value="500449cff8700209bc790600" />
        <input type="hidden" id="sku_1" name="sku" value="5NL-12111-00-00" />
        <div class="c0"><span>1</span></div>
        <div class="c1">
            <div class="c1a" style="width:100%;display:table;table-layout:fixed;">
                <span style="white-space: nowrap;">Valve, Intake</span>
            </div>
            <div class="clear"></div>
            <div class="c1b">
                
                    <a href="/oemparts/p/yamaha/5nl-12111-00-00/valve-intake"><span class="itemnumstrike">5NL-12111-00-00</span></a> <a href="/oemparts/p/yamaha/5nl-12111-30-00/valve-intake"><span class="itemnumnew">5NL-12111-30-00</span></a>
                
            </div>
            <div class="clear"></div>
            
        </div>
        
                <div class="c2"><span class="dbl">$116.99</span></div>    
                <div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>
                <div class="c4"><input type="submit" id="addtocart_1_5NL-12111-00-00" class="ui ui-icon-smadd btnpartadd" value="Add" /></div>
            
            
        <div class="clear"></div>
        </form>
    </div>

When I search around on how to best do this, there doesn’t seem to be as much info as I would expect. There is some talk about using “parsedhtml”; however, it is blank in my case. There are other references to using regex, but I haven’t had much success with that either. Can someone please point me in the best direction on how to go about this? I have 42 pages that I need to pull info for. Any help is greatly appreciated.

Key HTML tags:


<div class="c0"><span>1</span></div>
 <div class="c1a" style="width:100%;display:table;table-layout:fixed;">
      <span style="white-space: nowrap;">Valve, Intake</span>
 </div>
<div class="c2"><span class="dbl">$116.99</span></div>  
<div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>


pjw237690,
Welcome back to the forum … long time no see.

… and I can tell you why … you cannot parse HTML with PowerShell and regex. If they do not offer an API you can use to get the data in a machine-readable format like XML or JSON or, even better, CSV, you’re pretty much out of luck. Sorry.

First off, I agree with Olaf.

In the past, I have used VBS to access the HTML DOM (Document Object Model) to get web page information, but that was ages ago. It also used IE, which, for the most part, is dead. I also don’t have time to test something, but you might get some ideas here:

And, should you still have IE available …
https://social.technet.microsoft.com/Forums/scriptcenter/en-US/38a40c23-99a7-41e5-b5de-e527da600cc0/html-dom-with-powershell?forum=winserverpowershell

In my case, I wrote a cheesy script to get FedEx shipping quotes. You will need to view the page source to get the element IDs, classes, or whatever else you want returned, as they should have names defined.

Another word of caution: if the page is internet-facing and changes, your script will typically break.

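This is because you are calling Invoke-WebRequest with -UseBasicParsing. Just consider the name of that parameter and your own words: “parsedHTML is blank.” With basic parsing, the response is not run through the DOM parser, so the ParsedHtml property is never populated.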

You should be very careful to ensure you are not breaking the terms of service of this site, or any site for that matter. Depending on the circumstances, you may even be breaking the law. If you have any doubt, contact the site and ask for their permission. They would hopefully respond with “here is our API”, because that is going to be the best choice. You can try to parse HTML with regex, but it’s problematic because without a doubt the website formatting will change, causing the brittle regex parsing to fail. I agree with Olaf 100% that an API is your best choice.

That being said, to parse anything in regex you just have to take it chunk by chunk. Figure out the first step, then build upon it until you can do something ridiculous like this.

$URL = 'https://www.yamahapartshouse.com/oemparts/a/yam/500449cff8700209bc790600/valve'
$Request = Invoke-WebRequest -Uri $URL -UseBasicParsing

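# (?s) lets "." match newlines; each (?<Name>...) group becomes a property on the output object
$pattern = '(?s)c0"><span>(?<Ref>.+?)</.+nowrap;">(?<PartName>.+?)</.+itemnum(?:new)?">(?<ItemNumber>.+?)</.+dbl">(?<Price>.+?)</.+input" value="(?<Quantity>.+?)"'

# Split the page on the row's class name so each chunk holds at most one part
$partlist = foreach($chunk in $Request.Content -split 'partlistrow'){
    if($chunk -match $pattern){
        # Drop the whole-match entry (key 0), keeping only the named groups
        $Matches.Remove(0)

        [PSCustomObject]$Matches
    }
}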

$partlist

The output will look like this
[image: console output of $partlist]

You can then format it for console output, output to csv, etc.

$partlist | Format-Table -Property Ref, PartName, ItemNumber, Price, Quantity

[image: Format-Table output]
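For the CSV route across several pages, a rough sketch might look like the following. The function names ConvertFrom-PartListHtml and Get-PartList, the one-second delay, and the parts.csv file name are placeholders of mine, not anything from the site or this thread. The pattern is the one above, and the sample row is trimmed from the question so the demo part runs offline. Building the object from a literal also pins the column order, which a cast from $Matches does not guarantee.

```powershell
# Pattern copied unchanged from the answer above
$pattern = '(?s)c0"><span>(?<Ref>.+?)</.+nowrap;">(?<PartName>.+?)</.+itemnum(?:new)?">(?<ItemNumber>.+?)</.+dbl">(?<Price>.+?)</.+input" value="(?<Quantity>.+?)"'

# Hypothetical helper: turns the HTML of one page into one object per part row
function ConvertFrom-PartListHtml {
    param(
        [Parameter(Mandatory)][string]$Content,
        [Parameter(Mandatory)][string]$Pattern
    )
    foreach ($chunk in $Content -split 'partlistrow') {
        if ($chunk -match $Pattern) {
            # A hashtable literal keeps the properties in the order written here
            [PSCustomObject]@{
                Ref        = $Matches['Ref']
                PartName   = $Matches['PartName']
                ItemNumber = $Matches['ItemNumber']
                Price      = $Matches['Price']
                Quantity   = $Matches['Quantity']
            }
        }
    }
}

# Hypothetical helper: downloads each page in turn and emits its part rows
function Get-PartList {
    param(
        [Parameter(Mandatory)][string[]]$Url,
        [Parameter(Mandatory)][string]$Pattern
    )
    foreach ($u in $Url) {
        $response = Invoke-WebRequest -Uri $u -UseBasicParsing
        ConvertFrom-PartListHtml -Content $response.Content -Pattern $Pattern
        Start-Sleep -Seconds 1   # assumed delay, to go easy on the site
    }
}

# Offline demo: one row, trimmed from the HTML posted in the question
$sample = @'
<div class="partlistrow">
  <div class="c0"><span>1</span></div>
  <div class="c1a"><span style="white-space: nowrap;">Valve, Intake</span></div>
  <div class="c1b"><a href="/oemparts/p/yamaha/5nl-12111-00-00/valve-intake"><span class="itemnumstrike">5NL-12111-00-00</span></a> <a href="/oemparts/p/yamaha/5nl-12111-30-00/valve-intake"><span class="itemnumnew">5NL-12111-30-00</span></a></div>
  <div class="c2"><span class="dbl">$116.99</span></div>
  <div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>
</div>
'@

$rows = ConvertFrom-PartListHtml -Content $sample -Pattern $pattern

# Write the rows to a CSV file in the temp folder (file name is arbitrary)
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'parts.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation
```

To run it against the real site, you would pass your list of page URLs to Get-PartList and pipe the result to Export-Csv in the same way.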

See this regex demo for details on the pattern
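To show what “chunk by chunk” means in practice, here is a small sketch that grows the pattern one capture at a time against a single row. The $chunk text is trimmed from the HTML in the question, and the $step1 to $step5 variable names are just mine for illustration. Each step is tested on its own before the next piece is appended, and the last step is the same pattern as in the script above.

```powershell
# One row, trimmed from the HTML posted in the question
$chunk = @'
<div class="partlistrow">
  <div class="c0"><span>1</span></div>
  <div class="c1a"><span style="white-space: nowrap;">Valve, Intake</span></div>
  <div class="c1b"><a href="/oemparts/p/yamaha/5nl-12111-00-00/valve-intake"><span class="itemnumstrike">5NL-12111-00-00</span></a> <a href="/oemparts/p/yamaha/5nl-12111-30-00/valve-intake"><span class="itemnumnew">5NL-12111-30-00</span></a></div>
  <div class="c2"><span class="dbl">$116.99</span></div>
  <div class="c3"><input type="text" id="qty_1" name="qty" class="input_1 center required qtyinput" value="1" /></div>
</div>
'@

# Step 1: grab only the reference number that follows class "c0"
$step1 = '(?s)c0"><span>(?<Ref>.+?)</'
$ref = if ($chunk -match $step1) { $Matches['Ref'] }

# Step 2: skip ahead to the nowrap span and capture the part name
$step2 = $step1 + '.+nowrap;">(?<PartName>.+?)</'
$name = if ($chunk -match $step2) { $Matches['PartName'] }

# Step 3: capture the item number; the greedy ".+" before "itemnum" lands on
# the last span whose class is "itemnum" or "itemnumnew", skipping "itemnumstrike"
$step3 = $step2 + '.+itemnum(?:new)?">(?<ItemNumber>.+?)</'
$item = if ($chunk -match $step3) { $Matches['ItemNumber'] }

# Step 4: capture the price inside the "dbl" span
$step4 = $step3 + '.+dbl">(?<Price>.+?)</'
$price = if ($chunk -match $step4) { $Matches['Price'] }

# Step 5: capture the quantity from the input's value attribute
$step5 = $step4 + '.+input" value="(?<Quantity>.+?)"'
$qty = if ($chunk -match $step5) { $Matches['Quantity'] }

# Show what each step captured
$ref, $name, $item, $price, $qty
```

If a step stops matching, you know the piece you just appended is the problem, which is a lot easier than debugging the full pattern in one go.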


Well done, Krazy Doug.

I really do need to learn RegEx.


@krzydoug, that is awesome, thank you for your help with this. I guess it’s time I start learning RegEx.
