Parsing HTML files from "Takeout" to get Google Voice SMS messages

I am trying to write a script to extract the SMS message history of a Google Voice account from exported .html files. The Google Takeout service exports all of these messages as .html files, but it creates a separate .html file for each conversation, which is unwieldy. I’d like to get the data into a CSV or another format that will let me do some basic data manipulation, such as searching and filtering.

Initially, based on the information in this blog post, I have started trying to use the HTML Agility Pack to accomplish this. (I don’t particularly like making the script dependent on this pack though, so if there is a good alternative way to accomplish my goal without the need for add-ins, I’m open to suggestions.)

I’ve started on the script. Right now I’m just trying to get it working correctly with one HTML file; then I will go back and add the code to loop through all the files. The problem I’m running into is that when my script parses the single HTML file I’m using for testing, instead of getting 2 unique text messages in the output, I’m getting 2 copies of the same message. Here is the script as it stands currently:

add-type -path "D:\packages\HtmlAgilityPack.1.4.9\lib\net45\HtmlAgilityPack.dll"
$doc = New-Object HtmlAgilityPack.HtmlDocument
$source = $doc.Load("C:\Temp\texts.html")
$Title = $doc.documentnode.selectnodes("//title")
# $ChatLog = $doc.documentnode.selectnodes("//div[@class='hChatLog hfeed']")
$Messages = $doc.documentnode.selectnodes("//div[@class='message']")
$Result = $Messages | % {
    $Msg = $_
    $GMTTime = $msg.selectsinglenode("//abbr[@class='dt']").InnerText.Trim()
    $MsgTime = get-date $GMTTime
    $Tel = $msg.selectsinglenode("//a [@class='tel']")
    $SenderName = $tel.InnerText.Trim()
    $SenderNum = $tel.GetAttributeValue("href", "Number").TrimStart("tel:+")
    $Body = $msg.selectsinglenode("//q").InnerText.Trim()

    New-Object PsObject -Property @{ Time = $MsgTime; From = $SenderNum; Body = $Body } |
        Select Time,From,Body
}

$result | Sort Time | ft -auto | out-string

I’m attaching the test html file I’m working with as well, renamed to “texts.txt”.

Based on the html file and the script above, can anyone tell me what I’m doing wrong that causes the first text message to be included twice in the output instead of 2 unique messages?

Little XPath snag there. When your query starts with “//” (or “/”, for that matter), that means to search from the root of the entire document, no matter which node you call it on. When you want to search with some other node as the root, but still use the // functionality, you need to prefix it with a period, e.g.:

$GMTTime = $msg.SelectSingleNode(".//abbr[@class='dt']").InnerText.Trim()
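
To see the difference in isolation, here is a minimal sketch that uses the built-in [xml] type rather than the HTML Agility Pack, so it runs without any add-ins. The two-message markup is a made-up sample, not the Takeout format, and it assumes the same root-versus-relative rule applies to both parsers, which matches the behaviour described above.

```powershell
# Hypothetical two-message sample; it is well-formed XML, so [xml] can load it.
[xml]$doc = @'
<div>
  <div class="message"><q>first</q></div>
  <div class="message"><q>second</q></div>
</div>
'@

$messages = $doc.SelectNodes("//div[@class='message']")

# '//q' is evaluated from the document root, so every iteration finds the first <q>.
$absolute = foreach ($msg in $messages) { $msg.SelectSingleNode("//q").InnerText }

# './/q' is evaluated relative to the current message node.
$relative = foreach ($msg in $messages) { $msg.SelectSingleNode(".//q").InnerText }

$absolute -join ', '   # first, first
$relative -join ', '   # first, second
```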

On a side note, I haven’t found anything better than the HTML Agility Pack for quickly parsing HTML from any source, including on disk. If you use Invoke-WebRequest with an HTTP or HTTPS address, the HtmlWebResponseObject you get back exposes a similar DOM object, but I’m not aware of a way to get .NET to use that class when loading a file from disk. (file:// URLs return a different sort of object.)

Thanks, I probably never would have found that. It seems to me that by selecting the Message nodes and passing them to the ForEach, the individual message node should be the root as far as anything in the foreach is concerned. Clearly, it doesn’t work that way.

Okay, I have got that fixed, and have a few more questions I’m hoping someone can help with.

I’ve added the code now to bring in the entire directory tree containing the text message history. For reference, the folder structure looks like:

“.\Takeout\Voice\Texts\ContactName\ContactName - Text - DateTimeStamp.html”

Here’s the script as it stands currently:

add-type -path "D:\packages\HtmlAgilityPack.1.4.9\lib\net45\HtmlAgilityPack.dll"
$SMSContacts = get-childitem c:\temp\takeout\voice\texts | select -last 1
ForEach ($Contact in $SMSContacts) {
    $Name = $Contact.Name
    $Conversations = get-childitem $Contact.fullname
    ForEach ($Conversation in $Conversations) {
        $doc = New-Object HtmlAgilityPack.HtmlDocument
        $source = $doc.Load($Conversation.fullname)
        $Title = $doc.documentnode.selectnodes("//title")
        $Messages = $doc.documentnode.selectnodes("//div[@class='message']")
        $Result = ForEach ($Msg in $messages) {
            $GMTTime = $msg.selectsinglenode(".//abbr[@class='dt']").InnerText.Trim()
            $MsgTime = get-date $GMTTime
            $Tel = $msg.selectsinglenode(".//a [@class='tel']")
            $SenderName = $tel.InnerText.Trim()
            $SenderNum = $tel.GetAttributeValue("href", "Number").TrimStart("tel:+")
            $Body = $msg.selectsinglenode(".//q").InnerText.Trim()
            # A message sent by "Me" is outgoing; anything else is incoming.
            if ($SenderName -eq "Me") {
                $Direction = "Outgoing"
            } else {
                $Direction = "Incoming"
            }

            New-Object PsObject -Property @{ Contact = $Name; Time = $MsgTime; Type = $Direction; From = $SenderNum; Body = $Body } |
                Select Contact,Type,Time,From,Body
        }

        $result | export-csv -path c:\temp\texts.csv -append #| Sort Time | ft -auto | out-string
    }
}
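
A side note on the TrimStart("tel:+") call in the script above: TrimStart takes a set of characters rather than a literal prefix, so it removes any leading run of t, e, l, : and +. That is harmless when the rest of the value is digits, but a -replace with an anchored pattern strips only the prefix itself. A small sketch with made-up sample values (the second one is deliberately artificial, just to make the difference visible):

```powershell
$href = 'tel:+15551234567'        # hypothetical sample href value
$odd  = 'tel:+let-me-know'        # artificial value that starts with letters from the set

# TrimStart strips any leading characters found in the set {t, e, l, :, +}.
$trimmedHref = $href.TrimStart('tel:+'.ToCharArray())
$trimmedOdd  = $odd.TrimStart('tel:+'.ToCharArray())

# -replace with an anchored regex removes only the literal "tel:" prefix and an optional "+".
$replacedHref = $href -replace '^tel:\+?', ''
$replacedOdd  = $odd  -replace '^tel:\+?', ''

$trimmedHref    # 15551234567
$replacedHref   # 15551234567
$trimmedOdd     # -me-know
$replacedOdd    # let-me-know
```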

I’ve accomplished my original goal of parsing all the text message data and exporting it to csv. Now, I’d like to add similar functionality for Call history and maybe even voicemails. I’d also like to make this something I could post for others to use. I know I’ll need to add parameters for the file paths and clean up any other static info in the script. When it comes to breaking it up into functions though, I don’t have much experience with that. I’m currently envisioning 3 functions, Get-GVSMS, Get-GVCalls, Get-GVVoicemail. I thought about adding a 4th function which the other 3 would all call, which would contain the bits that get the appropriate list of source files to process, but then considered that would be a fairly small bit of code and probably not worth breaking it out separately.

So if I just go with those 3 functions, does that make sense? Or would I be better off with a single function for the user to run, with parameters for which type of data to process?

Also, as I think about it, should I take the export-csv out and just leave it to the end user to decide what to do with the data? Or, if I leave the export-csv in, perhaps the functions should use “Convert” instead of “Get” for the verb?

As for whether you want to have separate commands or a single command, that’s up to you, but returning objects and allowing the user to decide what to do with them is the right thing to do. Personally, I would probably lean toward a single Import-GVFile command which returns an object that contains properties Calls, SMS, and Voicemails. (These properties would contain collections of objects corresponding to each type.) Then the caller can get everything they need from the file in one command, and choose what to do with it from there. Some possible use cases:

# Just get the voicemails and stick them in a CSV file

Import-GVFile -Path .\texts.html |
Select-Object -ExpandProperty Voicemails |
Export-Csv -Path .\Voicemails.csv -NoTypeInformation

# Create CSV files for all 3 object types:

$gvData = Import-GVFile -Path .\texts.html

$gvData.Voicemails | Export-Csv -Path .\Voicemails.csv -NoTypeInformation
$gvData.Calls | Export-Csv -Path .\Calls.csv -NoTypeInformation
$gvData.SMS | Export-Csv -Path .\SMS.csv -NoTypeInformation

# etc
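
For illustration, here is a rough skeleton of what such a command could look like. It is only a sketch: the real parsing is replaced by a placeholder record, and sorting files by name is an assumption. Only the " - Text - " pattern appears in this thread; the " - Voicemail - " pattern and the "everything else is a call" fallback are guesses that would need checking against a real Takeout export.

```powershell
function Import-GVFile {
    [CmdletBinding()]
    param(
        # One or more exported .html file paths; strings can also be piped in.
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [string[]]$Path
    )

    begin {
        $sms        = New-Object 'System.Collections.Generic.List[object]'
        $calls      = New-Object 'System.Collections.Generic.List[object]'
        $voicemails = New-Object 'System.Collections.Generic.List[object]'
    }

    process {
        foreach ($file in $Path) {
            # Placeholder record; a real version would parse the file here
            # (for example with the HTML Agility Pack code shown earlier).
            $record = New-Object PSObject -Property @{ Source = $file }

            # Assumed naming convention: the record type is part of the file name.
            switch -Regex ($file) {
                ' - Text - '      { $sms.Add($record); break }
                ' - Voicemail - ' { $voicemails.Add($record); break }
                default           { $calls.Add($record) }
            }
        }
    }

    end {
        # Emit a single object and leave exporting to the caller.
        New-Object PSObject -Property @{
            SMS        = $sms.ToArray()
            Calls      = $calls.ToArray()
            Voicemails = $voicemails.ToArray()
        }
    }
}

# Usage with made-up file names:
$gvData = 'Bob - Text - 2015.html', 'Bob - Voicemail - 2015.html', 'Bob - Missed - 2015.html' |
    Import-GVFile
$gvData.SMS.Count          # 1
$gvData.Voicemails.Count   # 1
$gvData.Calls.Count        # 1
```

Because the function only returns objects, a call such as $gvData.SMS | Export-Csv stays the caller's choice, which is the point made above.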

Just wanted to wrap things up neatly and give the thread a conclusion. I finished my import-gvhistory.ps1 script yesterday and posted it here:

Thanks again for the help!