PowerShell: parsing HTML links (meta tags)

Hi, I need to copy the link from the top line into the lines below it, in several HTML files. Each HTML file has a unique “canonical” link. For example:

<link rel="canonical" href="https://website.com/en/america.html" />

<html code>
<html code>

<div class="somers"><a href="https://website/darertss.html" class="flags bg" hreflang="bg" title="bk"></a>
<a href="https://website.com/pas-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>
<a href="https://website.com/latinamer.html" class="flags uk" hreflang="uk" title="uk"></a>
<a href="https://website.com/sacrdo.html" class="flags uk" hreflang="uk" title="uk"></a>

The output should be

<div class="somers"><a href="https://website.com/en/america.html" class="flags bg" hreflang="bg" title="bk"></a>
<a href="https://website.com/en/america.html" class="flags sk" hreflang="sk" title="sk"></a>
<a href="https://website.com/en/america.html" class="flags uk" hreflang="uk" title="uk"></a>
<a href="https://website.com/en/america.html" class="flags uk" hreflang="uk" title="uk"></a>

My PowerShell code is almost right, but it only replaces the first line (the one with <div class…). I need to replace all the lines:

$sourcedir = "C:\Folder1\"
$resultsdir = "C:\Folder1\"
Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
    $content = Get-Content -Path $_.FullName -Raw
    $replacementValue = (Select-String -InputObject $content -Pattern '(?<=<a href=").+(?=</a>)').Matches.Value
    $replaceValue = (Select-String -InputObject $content -Pattern '(?<=<link rel="canonical" href=").+(?=" />)').Matches.Value
    $content.Replace("$replacementValue", "$replaceValue") | Out-File -FilePath $resultsdir\$($_.name)
}

I also tried -AllMatches, but it doesn’t work. Can anyone update my code so that it replaces all the lines?

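$sourcedir = "E:\Temp\Folder1\"
$resultsdir = "E:\Temp\Folder2\"

Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
    $content = Get-Content -Path $_.FullName -Raw
    # Read the canonical URL from this file's <link rel="canonical"> tag
    $replaceValue = (Select-String -InputObject $content -Pattern '(?<=<link rel="canonical" href=")[^"]+(?=" />)').Matches.Value
    # -replace substitutes every match, so each https://....html URL becomes the canonical URL
    # (the dot before "html" is escaped, and [^"] keeps a match inside one attribute value)
    $content = $content -replace 'https://[^"]+?\.html', $replaceValue
    Set-Content -Path (Join-Path $resultsdir $_.Name) -Value $content
}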
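
To see why -replace changes every line, here is a minimal sketch that works on an in-memory string instead of files. The sample text and the names $sample, $canonical and $result are made up for illustration, and the pattern is a slightly stricter variant (escaped dot, no quotes inside the match) that I am assuming fits your files.

```powershell
# Hypothetical sample standing in for the content of one HTML file
$sample = @'
<link rel="canonical" href="https://website.com/en/america.html" />
<div class="somers"><a href="https://website/darertss.html" class="flags bg" hreflang="bg" title="bk"></a>
<a href="https://website.com/pas-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>
'@

# Extract the canonical URL with the .NET regex class (no file needed)
$canonical = [regex]::Match($sample, '(?<=<link rel="canonical" href=")[^"]+(?=" />)').Value

# -replace is global: every https://....html URL is swapped for the canonical one
$result = $sample -replace 'https://[^"]+?\.html', $canonical

$result
```

By contrast, Select-String without -AllMatches returns only the first match, which is most likely why the original script changed only one line.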

Thanks a lot, @matt-bloomfield.

By the way, @matt-bloomfield, I’m thinking of a similar case. For example:

  <html code>
  <html code>
	
	<link rel="canonical" href="https://website.com/en/laptop.html" />

    <html code>
    <html code>

<meta property="og:url" content="https://website.com/accente-pronunce.html"/>

    <html code>
    <html code>
	

"@id": "https://website.com/mom-and-dad.html"

So I want to copy the canonical link into the other links below: the og:url meta tag and the @id value.

I don’t know why my code is not working (I changed your code a little bit):

$sourcedir = "C:\Folder1\"
$resultsdir = "C:\Folder1\"

Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
    $content = Get-Content -Path $_.FullName -Raw
    $replaceValue = (Select-String -InputObject $content -Pattern '(?<=<link rel="canonical" href=").+(?=" />)').Matches.Value
    $content = $content -replace '(?<=<meta property="og:url" content=").+(?="/>)',$replaceValue
    $content = $content -replace '(?<="@id": ").+(?=")',$replaceValue
    Set-Content -Path $resultsdir\$($_.name) $content

If you look at my code, I simplified the regular expression to just replace https://<anything>.html. This works regardless of what else you have between the tags, so it also covers the og:url meta tag and the @id value.
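
As a rough sketch of that on your second example, using a made-up in-memory sample instead of real files (the names $sample, $canonical and $result are only for illustration, and the pattern here is a slightly stricter variant with an escaped dot that I am assuming suits your files):

```powershell
# Hypothetical sample with the three places where the URL appears.
# A single-quoted here-string is used so the line starting with "@id" is plain text.
$sample = @'
<link rel="canonical" href="https://website.com/en/laptop.html" />
<meta property="og:url" content="https://website.com/accente-pronunce.html"/>
"@id": "https://website.com/mom-and-dad.html"
'@

# Canonical URL taken from the <link rel="canonical"> tag
$canonical = [regex]::Match($sample, '(?<=<link rel="canonical" href=")[^"]+(?=" />)').Value

# One pattern covers the href, the content attribute and the "@id" value alike
$result = $sample -replace 'https://[^"]+?\.html', $canonical

$result
```

The trade-off is that this rewrites every https://....html URL in the file. If some links must stay untouched, the targeted lookbehind patterns are the safer choice.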


DONE. Thanks @matt-bloomfield

$sourcedir = "C:\Folder1\"
$resultsdir = "C:\Folder1\"

Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
    $content = Get-Content -Path $_.FullName -Raw
    # Canonical URL of this file
    $replaceValue = (Select-String -InputObject $content -Pattern '(?<=<link rel="canonical" href=").+(?=" />)').Matches.Value
    # Overwrite the og:url meta tag and the "@id" value with the canonical URL
    $content = $content -replace '(?<=<meta property="og:url" content=").+(?="/>)',$replaceValue
    $content = $content -replace '(?<="@id": ").+(?=")',$replaceValue
    # Write the file once, after both replacements
    Set-Content -Path $resultsdir\$($_.name) $content
}