Extracting parent tables from HTML

Hi, this is my second post (soon I’ll be able to be helpful to others too).

I’m trying to parse a local HTML report to extract some data and output it to CSV.

The problem is that the report has nested tables inside tds…

What I want to do is get the tables at depth 0 and, from there, iterate row by row to extract the rest of the data. I’ve been searching for how to get those tables, but the documentation is… scattered, and I didn’t find anything close to that.

Thanks for all the help!

text1   text2   text3
text4   text5   text6
text7   text8
        text9.1   text9.2   text9.3
        text9.4   text9.5   text9.6
        text9.7   text9.8   text9.9

You cannot place HTML or XML in the forum directly; you need to use a Gist to post that content. Ideally, you want to get the data from the source, and if you cannot query that directly, then you could get the report in another format like CSV or JSON that allows direct parsing. With that said, there are many examples of how to parse HTML, and luckily there are pre-written functions/cmdlets to do the parsing:

https://www.powershellgallery.com/packages/PowerHTML/0.1.6/Content/Public\ConvertFrom-HTML.ps1
https://stackoverflow.com/questions/26984199/how-to-extract-specific-values-from-a-html-table-and-copy-them-horizontally-to-a
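For example, here is a minimal sketch using ConvertFrom-Html from the PowerHTML module (it wraps HtmlAgilityPack, so the result supports XPath queries). The file path is made up and I’m assuming the -Path parameter, so treat this as a starting point rather than tested code; the XPath //table[not(ancestor::table)] selects only tables that are not nested inside another table:

# Sketch, assuming the PowerHTML module is installed (Install-Module PowerHTML)
$doc = ConvertFrom-Html -Path 'C:\reports\report.html'   # hypothetical path

# select only top-level tables: tables with no <table> ancestor
$parentTables = $doc.SelectNodes('//table[not(ancestor::table)]')

foreach ($table in $parentTables) {
    foreach ($tr in $table.SelectNodes('.//tr')) {
        # skip rows whose nearest enclosing table is not this one
        # (i.e. rows that belong to a nested table)
        if (($tr.Ancestors('table') | Select-Object -First 1) -ne $table) { continue }
        ($tr.SelectNodes('./td') | ForEach-Object { $_.InnerText.Trim() }) -join ','
    }
}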

Yes, sorry for the HTML; I converted it into the table format on the page.
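(Roughly, the underlying structure is something like the following; this is only a hypothetical reconstruction, not the real markup, wrapped in a PowerShell here-string so it can be used to test the snippets in this thread. The nested table sits inside a cell of the last parent row.)

# hypothetical reconstruction of the report structure, not the real markup:
# a parent table whose last row holds a nested table in one of its cells
$sampleHtml = @"
<table>
  <tr><td>text1</td><td>text2</td><td>text3</td></tr>
  <tr><td>text4</td><td>text5</td><td>text6</td></tr>
  <tr>
    <td>text7</td><td>text8</td>
    <td>
      <table>
        <tr><td>text9.1</td><td>text9.2</td><td>text9.3</td></tr>
        <tr><td>text9.4</td><td>text9.5</td><td>text9.6</td></tr>
        <tr><td>text9.7</td><td>text9.8</td><td>text9.9</td></tr>
      </table>
    </td>
  </tr>
</table>
"@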

For the second part, I had already seen those scripts, but neither of them says anything about working with the depth of the tables or how to deal with nested tables.

Have you actually tested any of them? It is possible that they may not parse as expected, but they do show you how to do basic parsing so that you can modify it to work as required. Many times I’ve taken example code just to understand the fundamentals and then updated it to do what I specifically need. If you have specific questions about how to modify the code and can explain what is or is not working, then post your questions with examples (in a Gist).

Hi, thanks for the reply. Here is the code:

foreach ($n_file in $node_files) {

    # create the object to handle the html document
    $html = New-Object -ComObject "HTMLFile"

    # read the content of the html file
    $source = Get-Content -Path $n_file.FullName -Raw

    # assign the content to the object
    $html.IHTMLDocument2_write($source)

    # get the tables of the document
    $tables = $html.getElementsByTagName('table')

    # create the array to store the info
    $arrayTable = @()

    # iterate over all the tables in the html file
    foreach ($table in $tables) {

        # get all the tr elements in the current table
        $trs = $table.getElementsByTagName('tr')

        # iterate over the tr elements
        foreach ($tr in $trs) {

            # get all the td elements in the current row
            $tds = $tr.getElementsByTagName('td')

            # create the string to contain the data of the row
            $line = ""

            # iterate over all the td elements in the current row
            foreach ($td in $tds) {
                # add the current td element's text to the string
                $line += $td.innerText + ","
            }

            # add the string as a row to the array
            $arrayTable += ,@($line)
        }
    }

    # build the output file path
    $output_file = $rf + "\" + ($n_file.Name -replace '\.html$', '') + "_file.csv"

    # iterate over each row of the array
    foreach ($row in $arrayTable) {
        # add the row to the output file
        Add-Content -Path $output_file -Value $row
    }
}


$node_files is a variable that holds the files I want to parse; that part works fine.

All of this works, but nested tables displace the information that comes after them.
My goal, which I’m still thinking about, is to get only the parent tables in the structure, and not the nested tables that are inside the td tags.
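One way to do that with the HTMLFile COM object would be to keep only the tables that have no table ancestor, and then read each table’s own rows and cells collections, which (per the DOM) do not include the rows or cells of nested tables. This is a rough sketch against the mshtml DOM, not tested against the real report:

# keep only tables that are not nested inside another table
$parentTables = @($html.getElementsByTagName('table')) | Where-Object {
    $ancestor = $_.parentElement
    $nested   = $false
    while ($null -ne $ancestor) {
        if ($ancestor.tagName -eq 'TABLE') { $nested = $true; break }
        $ancestor = $ancestor.parentElement
    }
    -not $nested
}

foreach ($table in $parentTables) {
    # table.rows / row.cells only cover this table's own rows and cells,
    # so nested tables no longer displace the data
    foreach ($tr in $table.rows) {
        $line = ($tr.cells | ForEach-Object { $_.innerText }) -join ','
        $arrayTable += ,@($line)
    }
}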

OK, I solved it in the most stupid way I could think of.

As the reports always come in the same format, and the nested tables have a different size than the parent tables, I only need to count the cells in each row of the table; if a row has the correct number of columns, it belongs to a parent table and I can extract the data.
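In code that could look roughly like this (the value 3 is only a placeholder for whatever column count the parent tables really have):

# count the cells in each row; rows from (or containing) nested tables have a
# different count, so only rows with the expected number of columns are kept
$expectedColumns = 3   # placeholder: the real parent tables' column count

foreach ($tr in $trs) {
    $tds = $tr.getElementsByTagName('td')
    if ($tds.length -ne $expectedColumns) { continue }
    $line = ($tds | ForEach-Object { $_.innerText }) -join ','
    $arrayTable += ,@($line)
}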