Web Scraping with Invoke-WebRequest

Hi All,

I am new to PowerShell. I was learning about web scraping I have created below script to fetch data from flipkart and save it to excel.

$excel = New-Object -ComObject excel.application
$excel.Application.Visible = $false
$excel.DisplayAlerts = $false#dont prompt the user
$book=$excel.Workbooks.Add()
$sheet1 = $book.Worksheets.Item(1)
$sheet1.name = 'FLIPKART'
$sheet1.Activate() | Out-Null
$sheet1.Cells.Item(1,1)="Name"
$sheet1.Cells.Item(1,2)="Rating"
$sheet1.Cells.Item(1,3)="RAM/ROM"
$sheet1.Cells.Item(1,4)="Display"
$sheet1.Cells.Item(1,5)="Camera"
$sheet1.Cells.Item(1,6)="Battery"
$sheet1.Cells.Item(1,7)="Processor"
$sheet1.Cells.Item(1,8)="Warranty"
$sheet1.Cells.Item(1,9)="Prize"
$rowCounter=2;

[int]$pages=Read-Host "Enter no of pages"
1..$pages |ForEach-Object {
    $page=Invoke-WebRequest -Uri "https://www.flipkart.com/search?q=mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&as-pos=1&as-type=RECENT&suggestionId=mobiles%7CMobiles&requestId=192f78df-047c-41ee-a82b-b82d74eb9811&as-searchtext=mobiles&page=$_"

    $products=$page.ParsedHtml.getElementsByTagName("div")|Where-Object {$_.className -eq "_1-2Iqu row"}
    for($i=0;$i -le $products.Count;$i++)
    {
        if($products[$i] -ne $null)
        {
            $sheet1.Cells.Item($rowCounter,1)=$products[$i].childNodes[0].childNodes[0].innerHTML
            $sheet1.Cells.Item($rowCounter,2)=$products[$i].childNodes[0].childNodes[1].childNodes[0].innerText
            $sheet1.Cells.Item($rowCounter,3)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[0].innerText
            $sheet1.Cells.Item($rowCounter,4)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[1].innerText
            $sheet1.Cells.Item($rowCounter,5)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[2].innerText
            $sheet1.Cells.Item($rowCounter,6)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[3].innerText
            $sheet1.Cells.Item($rowCounter,7)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[4].innerText
            $sheet1.Cells.Item($rowCounter,8)=$products[$i].childNodes[0].childNodes[2].childNodes[0].childNodes[5].innerText
            $sheet1.Cells.Item($rowCounter,9)=$products[$i].childNodes[1].childNodes[0].childNodes[0].childNodes[0].innerHTML
            $rowCounter++;
        }
    }
}
$path="C:\Users\$env:USERNAME\Desktop\FLIPKART_mobiles.xlsx"
$result=Test-path -path $path
if($result)
{Remove-Item $path}
$book.SaveAs($path)
Write-Host "File saved as "$path
$excel.quit()

I am facing 2 issues

1st issue : whenever I start new PowerShell console window the script works fine for first 1 or 2 attempt after that it shows error like below:

PS C:\Users\lenovo\Desktop\Ramakant\powershell> .\temp.ps1
Enter no of pages: 1
Exception from HRESULT: 0x800A138A
At C:\Users\lenovo\Desktop\Ramakant\powershell\temp.ps1:77 char:5
+     $products=$page.ParsedHtml.getElementsByTagName("div")|Where-Obje ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], COMException
    + FullyQualifiedErrorId : System.Runtime.InteropServices.COMException

Cannot index into a null array.
At C:\Users\lenovo\Desktop\Ramakant\powershell\temp.ps1:80 char:12
+         if($products[$i] -ne $null)
+            ~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : NullArray

2nd issue : Every product has Name, Rating, Prize but other information may not available How to check if childNodes[] is present or not. it gives me error like this if any product has not any childNodes but script able to add all other available data

Enter no of pages: 3
Invalid argument.
At C:\Users\lenovo\Desktop\Ramakant\powershell\temp.ps1:88 char:13
+             $sheet1.Cells.Item($rowCounter,7)=$products[$i].childNode ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

Thanks

 

Avoid Excel unless it’s required. It’s better to create a Powershell object and then export rather than parsing like that. Looking at the structure of the page, it’s a better approach to use names, ids and classnames to parse than all of the childnode indexes:

$page = Invoke-WebRequest -Uri "https://www.flipkart.com/search?q=mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&as-pos=1&as-type=RECENT&suggestionId=mobiles%7CMobiles&requestId=192f78df-047c-41ee-a82b-b82d74eb9811&as-searchtext=mobiles&page=1"
 
$products=$page.ParsedHtml.getElementsByTagName("div") | Where-Object {$_.className -eq "_1-2Iqu row"}

$results = foreach ($product in $products) {
    
    $tVe95H = @($product.getElementsByClassName('tVe95H'))

    [pscustomobject]@{
        'Name'      = ($product.getElementsByClassName('_3wU53n'))[0].innerText
        'Rating'    = ($product.getElementsByClassName('_38sUEc'))[0].innerText
        'RAM\ROM'   = $tVe95H[0].innerText
        'Display'   = $tVe95H[1].innerText
        'Camera'    = $tVe95H[2].innerText
        'Battery'   = $tVe95H[3].innerText
        'Processor' = $tVe95H[4].innerText
        'Warranty ' = $tVe95H[5].innerText
        'Prize'     = $tVe95H[6].innerText
    }
}

$results #| Export-Csv -Path C:\myresults.csv -NoTypeInformation

Output:

Name      : POCO M2 (Pitch Black, 64 GB)
Rating    : 8,673 Ratings & 737 Reviews
RAM\ROM   : 6 GB RAM | 64 GB ROM | Expandable Upto 512 GB
Display   : 16.59 cm (6.53 inch) Full HD+ Display
Camera    : 13MP + 8MP + 5MP + 2MP | 8MP Front Camera
Battery   : 5000 mAh Lithium Polymer Battery
Processor : MediaTek Helio G80 Processor
Warranty  : 1 Year for Handset, 6 Months for Accessories
Prize     : 

Name      : POCO M2 (Slate Blue, 64 GB)
Rating    : 8,673 Ratings & 737 Reviews
RAM\ROM   : 6 GB RAM | 64 GB ROM | Expandable Upto 512 GB
Display   : 16.59 cm (6.53 inch) Full HD+ Display
Camera    : 13MP + 8MP + 5MP + 2MP | 8MP Front Camera
Battery   : 5000 mAh Lithium Polymer Battery
Processor : MediaTek Helio G80 Processor
Warranty  : 1 Year for Handset, 6 Months for Accessories
Prize     : 

Name      : Realme C11 (Rich Grey, 32 GB)
Rating    : 1,24,758 Ratings & 8,577 Reviews
RAM\ROM   : 2 GB RAM | 32 GB ROM | Expandable Upto 256 GB
Display   : 16.51 cm (6.5 inch) HD+ Display
Camera    : 13MP + 2MP | 5MP Front Camera
Battery   : 5000 mAh Battery
Processor : Mediatek Helio G35 Processor
Warranty  : Brand Warranty of 1 Year Available for Mobile and 6 Months for Accessories
Prize     : 

Name      : Realme C11 (Rich Green, 32 GB)
Rating    : 1,24,758 Ratings & 8,577 Reviews
RAM\ROM   : 2 GB RAM | 32 GB ROM | Expandable Upto 256 GB
Display   : 16.51 cm (6.5 inch) HD+ Display
Camera    : 13MP + 2MP | 5MP Front Camera
Battery   : 5000 mAh Battery
Processor : Mediatek Helio G35 Processor
Warranty  : Brand Warranty of 1 Year Available for Mobile and 6 Months for Accessories
Prize     : 

Name      : Infinix Hot 9 (Violet, 64 GB)
Rating    : 55,782 Ratings & 3,784 Reviews
RAM\ROM   : 4 GB RAM | 64 GB ROM | Expandable Upto 256 GB
Display   : 16.76 cm (6.6 inch) HD+ Display
Camera    : 13 MP + 2 MP + 2 MP + Low Light Sensor | 8MP Front Camera
Battery   : 5000 mAh Li-ion Polymer Battery
Processor : MediaTek Helio P22 (64 bit) Processor
Warranty  : 1 Year Warranty on Phone, 6 Months Warranty on Accessories
Prize     : 

Name      : Tecno Spark Go 2020 (Aqua Blue, 32 GB)
Rating    : 6,092 Ratings & 838 Reviews
RAM\ROM   : 2 GB RAM | 32 GB ROM | Expandable Upto 256 GB
Display   : 16.56 cm (6.52 inch) HD+ Display
Camera    : 13MP + AI Lens Dual Camera | 8MP Front Camera
Battery   : 5000 mAh Lithium-ion Battery
Processor : MediaTek Helio A20 Quad Core Processor
Warranty  : 12 Months Warranty + 1 Month Extended Warranty
Prize     : 

The parsing needs work as you the span can have additional populated information and there were some instances where the warranty was in the incorrect place. You can check the count of items and try to align them, but you would need to understand all the possible options in the SPAN. Also received the error after running it numerous times, not sure what is going on there as I attempted to clear variables but it still produced the error. Hopefully the above will make things easier for you.

Thanks for replay. I really appreciate your help. I will learn and will follow your advice, thanks.