eugrus
November 21, 2024, 11:17am
1
$documentText = @"
Frau
Anna Mustermanowa
Hauptstr. 1
48996 Ministadt
per beA
per Mail: anna.mustermanowa@example.com
AKTEN.NR: SACHBEARBEITER/SEKRETARIAT STÄDTL,
2904/24/SB Sonja Bearbeinenko +49 211 123190.00 21.11.2024
Telefax: +49 211 123190.00
E-Mail: anwalt@ra.example.com
Superman ./. Mustermanowa
Worum es da so geht
Sehr geehrte Frau Mustermanowa,
"@
$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value
Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"
outputs
Superman
./.
Mustermanowa
2904/24
whereas
$wordApp = [Runtime.Interopservices.Marshal]::GetActiveObject('Word.Application')
$doc = $wordApp.ActiveDocument
$documentText = $doc.Content.Text
Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8
$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value
Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($wordApp) | Out-Null
outputs
Superman -Mail: anwalt@ra.example.com0.0049 211 123190.00 21.11.2024
./.
Mustermanowa
2904/24
The here-string in the first example was produced by the Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8 call in the second one.
How do I get the same structure of special characters and line breaks from Content.Text inside a variable as I get by writing it to a text file with Set-Content?
Basically I want the same regex behaviour in the second code sample as in the first one.
Have you tried reading the file back in after using Set-Content? I'm not sure whether that clears any control codes, but your here-string value implies that it does:
$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8
That should give you the same as the here-string.
I think I'd be inclined to match the line that contains the data you want and split it, rather than using lookahead and lookbehind, as to my mind that's simpler:
$Mandant, $Gegner = ((Select-String -InputObject $documentText -Pattern '.*\./\..*').matches.value -split '\./\.').trim()
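A minimal sketch of the match-and-split idea, assuming the text separates paragraphs with a bare carriage return (which is how Word's Content.Text marks paragraph ends). The `$sample` string is a hypothetical stand-in for `$documentText`, not the real document.

```powershell
# Hypothetical stand-in for $documentText: paragraphs end with a bare CR (char 13)
$cr = [char]13
$sample = "E-Mail: anwalt@ra.example.com${cr}${cr}Superman ./. Mustermanowa${cr}Worum es da so geht"

# Break the text into lines on CRLF, CR, or LF, then keep the first line containing ./.
$line = ($sample -split '\r\n|\r|\n') |
    Where-Object { $_ -match '\./\.' } |
    Select-Object -First 1

# Split on the literal separator (dots escaped, because -split takes a regex) and trim
$Mandant, $Gegner = ($line -split '\./\.').Trim()

Write-Output $Mandant
Write-Output $Gegner
```

Splitting into lines first keeps the pattern independent of whichever line-ending convention the text happens to use.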
eugrus
November 21, 2024, 2:55pm
3
I had assumed there would be no difference, but in fact there is:
$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8
$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value
Write-Output "$Mandant"
Write-Output "./."
Write-Output "$Gegner"
Write-Output "$Az"
outputs
> .\test.ps1
Superman
./.
Mustermanowa
2904/24
> gc .\debug.txt -raw
Sehr geehrte Frau Mustermanowa,le.com0.0049 211 123190.00 21.11.2024
> gc .\debug.txt -raw | Format-Hex
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 0D 0D 0D 0D 0D 0D 0D 0D 0D 46 72 61 75 0D 41 6E .........Frau.An
00000010 6E 61 20 4D 75 73 74 65 72 6D 61 6E 6F 77 61 0D na Mustermanowa.
00000020 48 61 75 70 74 73 74 72 2E 20 31 0D 34 38 39 39 Hauptstr. 1.4899
00000030 36 20 4D 69 6E 69 73 74 61 64 74 0D 0D 0D 0D 0D 6 Ministadt.....
00000040 0D 0D 0D 0D 0D 0D 0D 0D 70 65 72 20 62 65 41 0D ........per beA.
00000050 70 65 72 20 4D 61 69 6C 3A 20 61 6E 6E 61 2E 6D per Mail: anna.m
00000060 75 73 74 65 72 6D 61 6E 6F 77 61 40 65 78 61 6D ustermanowa@exam
00000070 70 6C 65 2E 63 6F 6D 0D 0D 0D 0D 41 4B 54 45 4E ple.com....AKTEN
00000080 2E 4E 52 3A 20 09 53 41 43 48 42 45 41 52 42 45 .NR: .SACHBEARBE
00000090 49 54 45 52 2F 53 45 4B 52 45 54 41 52 49 41 54 ITER/SEKRETARIAT
000000A0 09 53 54 3F 44 54 4C 2C 0D 32 39 30 34 2F 32 34 .ST?DTL,.2904/24
000000B0 2F 53 42 09 53 6F 6E 6A 61 20 42 65 61 72 62 65 /SB.Sonja Bearbe
000000C0 69 6E 65 6E 6B 6F 09 2B 34 39 20 32 31 31 20 31 inenko.+49 211 1
000000D0 32 33 31 39 30 2E 30 30 09 32 31 2E 31 31 2E 32 23190.00.21.11.2
000000E0 30 32 34 0D 09 54 65 6C 65 66 61 78 3A 20 09 2B 024..Telefax: .+
000000F0 34 39 20 32 31 31 20 31 32 33 31 39 30 2E 30 30 49 211 123190.00
00000100 0D 09 45 2D 4D 61 69 6C 3A 20 61 6E 77 61 6C 74 ..E-Mail: anwalt
00000110 40 72 61 2E 65 78 61 6D 70 6C 65 2E 63 6F 6D 0D @ra.example.com.
00000120 0D 53 75 70 65 72 6D 61 6E 20 2E 2F 2E 20 4D 75 .Superman ./. Mu
00000130 73 74 65 72 6D 61 6E 6F 77 61 0D 57 6F 72 75 6D stermanowa.Worum
00000140 20 65 73 20 64 61 20 73 6F 20 67 65 68 74 0D 0D es da so geht..
00000150 0D 53 65 68 72 20 67 65 65 68 72 74 65 20 46 72 .Sehr geehrte Fr
00000160 61 75 20 4D 75 73 74 65 72 6D 61 6E 6F 77 61 2C au Mustermanowa,
00000170 0D 0D 0D 0D 0D 0A ......
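The dump shows only 0D (carriage return) bytes between paragraphs, apart from the final 0D 0A. In .NET regular expressions `.` matches every character except LF, so `.*` runs straight across a bare CR. The sketch below illustrates that on a short hypothetical sample rather than on the real document.

```powershell
# Hypothetical stand-in for $doc.Content.Text: paragraphs separated by bare CR (0x0D)
$text = "Frau`rAnna Mustermanowa`rSuperman ./. Mustermanowa`r"

# Count the two kinds of line-break characters in the string
$crCount = ([regex]::Matches($text, "`r")).Count
$lfCount = ([regex]::Matches($text, "`n")).Count
Write-Output "CR: $crCount, LF: $lfCount"

# '.' does not stop at CR, so the greedy match starts at the first character
# of the text and swallows every "line" before the ./. separator
$greedy = [regex]::Match($text, '[^\r\n].*(?=\./\.)').Value
$spansLines = $greedy.Contains("`r")
Write-Output "Match contains CR: $spansLines"
```

When such a match is written to the console, each CR moves the cursor back to column 0, so later segments overwrite earlier ones. That would explain the jumbled output lines above.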
eugrus
November 25, 2024, 9:32am
4
To match the behavior of the first code sample, use Get-Content -Raw -Encoding UTF8 after saving the Word content with Set-Content. This will preserve line breaks and special characters, allowing your regex to work as expected.
Example:
Set-Content -Path "debug.txt" -Value $documentText -Encoding UTF8
$documentText = Get-Content -Path "debug.txt" -Raw -Encoding UTF8
$Mandant = [regex]::match($documentText, '[^\r\n].*(?=\.\/\.)').Value
$Gegner = [regex]::match($documentText, '(?<=\.\/\.\s)[^\r\n]*').Value
$Az = [regex]::match($documentText, '\d{4}/\d{2}').Value
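An alternative that avoids the temporary file is to normalize the line endings in memory before applying the same patterns. This is only a sketch, assuming the paragraph separators are bare carriage returns as the hex dump suggests. The sample string and the `$normalized` variable are illustrative names, not part of the original script.

```powershell
# Hypothetical stand-in for $doc.Content.Text (paragraphs end with a bare CR)
$documentText = "2904/24/SB`tSonja Bearbeinenko`rSuperman ./. Mustermanowa`rWorum es da so geht`r"

# Convert CRLF and bare CR to LF so that '.' stops at the end of each line
$normalized = $documentText -replace '\r\n?', "`n"

# Same patterns as in the thread; Trim() drops the space left before ./.
$Mandant = [regex]::Match($normalized, '[^\r\n].*(?=\./\.)').Value.Trim()
$Gegner  = [regex]::Match($normalized, '(?<=\./\.\s)[^\r\n]*').Value
$Az      = [regex]::Match($normalized, '\d{4}/\d{2}').Value

Write-Output $Mandant
Write-Output "./."
Write-Output $Gegner
Write-Output $Az
```

Replacing `.*` with `[^\r\n]*` in the first pattern would be another way to get the same effect without touching the text.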