Variable holding Word's Content.Text differs from the plain-text file written by Set-Content; regex handles the two differently

$documentText = @"








Frau
Anna Mustermanowa
Hauptstr. 1
48996 Ministadt












per beA
per Mail: anna.mustermanowa@example.com



AKTEN.NR:     SACHBEARBEITER/SEKRETARIAT    STÄDTL,
2904/24/SB    Sonja Bearbeinenko    +49 211 123190.00    21.11.2024
    Telefax:     +49 211 123190.00
    E-Mail: anwalt@ra.example.com

Superman ./. Mustermanowa
Worum es da so geht


Sehr geehrte Frau Mustermanowa,





"@

$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value

$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value

Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"

outputs

Superman
./.
Mustermanowa
2904/24

whereas

$wordApp = [Runtime.Interopservices.Marshal]::GetActiveObject('Word.Application')
$doc = $wordApp.ActiveDocument
$documentText = $doc.Content.Text
Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8

$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value

$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value

Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"

[System.Runtime.Interopservices.Marshal]::ReleaseComObject($wordApp) | Out-Null

outputs

Superman -Mail: anwalt@ra.example.com0.0049 211 123190.00       21.11.2024
./.
Mustermanowa
2904/24

The here-string in the first example was produced by the Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8 call in the second example: I pasted the contents of debug.txt into it.

How do I get the same special characters and line-break structure from Content.Text inside a variable as I get by writing it to a text file with Set-Content?

Basically I want the same regex behaviour in the second code sample as in the first one.

Have you tried reading the file back in after using Set-Content? Not sure if that’s clearing any control codes, but your here-string value would imply that it is:

$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8

That should give you the same as the here-string.
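
To check whether the control codes really differ between the two strings, it can help to count the carriage returns and line feeds in each. A minimal sketch; the two sample strings and the helper name Get-LineEndingCount are hypothetical stand-ins, not taken from the original script:

```powershell
# Hypothetical samples: bare CR separators vs. CRLF separators
$fromWord   = "Superman ./. Mustermanowa`rWorum es da so geht`r"
$hereString = "Superman ./. Mustermanowa`r`nWorum es da so geht`r`n"

# Hypothetical helper: counts CR and LF characters in a string
function Get-LineEndingCount {
    param([string]$Text)
    [pscustomobject]@{
        CR = [regex]::Matches($Text, '\r').Count
        LF = [regex]::Matches($Text, '\n').Count
    }
}

$wordCounts = Get-LineEndingCount -Text $fromWord
$hereCounts = Get-LineEndingCount -Text $hereString
$wordCounts
$hereCounts
```

If the real $documentText shows many CRs and no LFs while the here-string shows LFs, the line endings are the difference to look at.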

I think I’d be inclined to match the line that contains the data you want and split it, rather than use lookahead and lookbehind; to my mind it’s simpler:

$Mandant, $Gegner = ((Select-String -InputObject $documentText -Pattern '.*\./\..*').Matches.Value -split '\./\.').Trim()
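
The same match-the-line-and-split idea can also be sketched by splitting the text into lines first, which avoids depending on how `.` treats different line endings. This assumes at least one line contains ./. and takes the first one; the sample string is a hypothetical stand-in for the document text:

```powershell
# Stand-in for the document text; paragraphs are separated by bare CR here
$documentText = "E-Mail: anwalt@ra.example.com`r`rSuperman ./. Mustermanowa`rWorum es da so geht`r"

# Split on CRLF, bare CR, or bare LF, then keep the first line containing ./.
$line = ($documentText -split '\r\n|\r|\n') |
    Where-Object { $_ -match '\./\.' } |
    Select-Object -First 1

# Split that line on the literal ./. and trim both parts
$Mandant, $Gegner = ($line -split '\./\.').Trim()

Write-Output $Mandant
Write-Output $Gegner
```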

I assumed there would be no difference, but in fact there is:

$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8

$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value

$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value

Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"

outputs

> .\test.ps1
Superman
./.
Mustermanowa
2904/24
> gc .\debug.txt -raw
Sehr geehrte Frau Mustermanowa,le.com0.0049 211 123190.00       21.11.2024

> gc .\debug.txt -raw | Format-Hex


           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00000000   0D 0D 0D 0D 0D 0D 0D 0D 0D 46 72 61 75 0D 41 6E  .........Frau.An
00000010   6E 61 20 4D 75 73 74 65 72 6D 61 6E 6F 77 61 0D  na Mustermanowa.
00000020   48 61 75 70 74 73 74 72 2E 20 31 0D 34 38 39 39  Hauptstr. 1.4899
00000030   36 20 4D 69 6E 69 73 74 61 64 74 0D 0D 0D 0D 0D  6 Ministadt.....
00000040   0D 0D 0D 0D 0D 0D 0D 0D 70 65 72 20 62 65 41 0D  ........per beA.
00000050   70 65 72 20 4D 61 69 6C 3A 20 61 6E 6E 61 2E 6D  per Mail: anna.m
00000060   75 73 74 65 72 6D 61 6E 6F 77 61 40 65 78 61 6D  ustermanowa@exam
00000070   70 6C 65 2E 63 6F 6D 0D 0D 0D 0D 41 4B 54 45 4E  ple.com....AKTEN
00000080   2E 4E 52 3A 20 09 53 41 43 48 42 45 41 52 42 45  .NR: .SACHBEARBE
00000090   49 54 45 52 2F 53 45 4B 52 45 54 41 52 49 41 54  ITER/SEKRETARIAT
000000A0   09 53 54 3F 44 54 4C 2C 0D 32 39 30 34 2F 32 34  .ST?DTL,.2904/24
000000B0   2F 53 42 09 53 6F 6E 6A 61 20 42 65 61 72 62 65  /SB.Sonja Bearbe
000000C0   69 6E 65 6E 6B 6F 09 2B 34 39 20 32 31 31 20 31  inenko.+49 211 1
000000D0   32 33 31 39 30 2E 30 30 09 32 31 2E 31 31 2E 32  23190.00.21.11.2
000000E0   30 32 34 0D 09 54 65 6C 65 66 61 78 3A 20 09 2B  024..Telefax: .+
000000F0   34 39 20 32 31 31 20 31 32 33 31 39 30 2E 30 30  49 211 123190.00
00000100   0D 09 45 2D 4D 61 69 6C 3A 20 61 6E 77 61 6C 74  ..E-Mail: anwalt
00000110   40 72 61 2E 65 78 61 6D 70 6C 65 2E 63 6F 6D 0D  @ra.example.com.
00000120   0D 53 75 70 65 72 6D 61 6E 20 2E 2F 2E 20 4D 75  .Superman ./. Mu
00000130   73 74 65 72 6D 61 6E 6F 77 61 0D 57 6F 72 75 6D  stermanowa.Worum
00000140   20 65 73 20 64 61 20 73 6F 20 67 65 68 74 0D 0D   es da so geht..
00000150   0D 53 65 68 72 20 67 65 65 68 72 74 65 20 46 72  .Sehr geehrte Fr
00000160   61 75 20 4D 75 73 74 65 72 6D 61 6E 6F 77 61 2C  au Mustermanowa,
00000170   0D 0D 0D 0D 0D 0A                                ......
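
The dump shows paragraphs separated by a bare 0D (CR), with a single 0A (LF) only at the very end. In .NET regex, `.` matches every character except `\n`, so it also matches `\r`, and `.*` can run across CR-separated paragraphs. That is a likely explanation for the long $Mandant match. A minimal sketch of one way around it: convert bare CR to CRLF before matching. The sample string is a hypothetical stand-in for $doc.Content.Text:

```powershell
# Stand-in for $doc.Content.Text: paragraphs separated by bare CR, as in the hex dump
$documentText = "Frau`rAnna Mustermanowa`r2904/24/SB`tSonja Bearbeinenko`r`rSuperman ./. Mustermanowa`rWorum es da so geht`r"

# Convert every CR that is not already followed by LF into CRLF
$normalized = $documentText -replace '\r(?!\n)', "`r`n"

# The original patterns, unchanged, now stop at each line end
$Mandant = [regex]::Match($normalized, '[^\r\n].*(?=\.\/\.)').Value
$Gegner  = [regex]::Match($normalized, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az      = [regex]::Match($normalized, '\d{4}/\d{2}').Value

Write-Output $Mandant.Trim()
Write-Output $Gegner
Write-Output $Az
```

Because the lookahead sits directly before ./., $Mandant keeps the trailing space, hence the Trim().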

Here’s the solution: https://www.reddit.com/r/PowerShell/comments/1gwegdu/comment/ly9ylmw/

To match the behavior of the first code sample, use Get-Content -Raw -Encoding UTF8 after saving the Word content with Set-Content. This will preserve line breaks and special characters, allowing your regex to work as expected.

Example:

Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8
$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8

$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value
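
If the round trip through a file is not wanted, another option is to make the patterns independent of the line-ending style by using `[^\r\n]` wherever `.` was meant as "any character on this line". This is a sketch under that assumption, not necessarily the method from the linked comment; the sample string is a hypothetical stand-in for $doc.Content.Text with bare CR separators, as seen in the hex dump:

```powershell
# Stand-in for $doc.Content.Text with bare CR paragraph separators
$documentText = "2904/24/SB`tSonja Bearbeinenko`r`tE-Mail: anwalt@ra.example.com`r`rSuperman ./. Mustermanowa`rWorum es da so geht`r"

# [^\r\n]+ cannot cross CR or LF, so the match stays on the ./. line
$Mandant = [regex]::Match($documentText, '[^\r\n]+(?=\./\.)').Value.Trim()
$Gegner  = [regex]::Match($documentText, '(?<=\./\.\s)[^\r\n]*').Value
$Az      = [regex]::Match($documentText, '\d{4}/\d{2}').Value

Write-Output $Mandant
Write-Output "./."
Write-Output $Gegner
Write-Output $Az
```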