Regex does not work

I have a regex that I use in VSC to remove subtitle duplication; it can be seen working here. With the help of Bing AI (as I have not formally learnt PowerShell), I came up with this function to do what I’ve done manually before:

function Remove-SubtitleDuplication {
    $oldText = '(\d+\n\d+.*?\n(.*?))\n+(?:\d+\n\d+.*?\n\2\n+)+'
    $newText = '$1\n\n'

    Get-ChildItem -Recurse -Filter "*.srt" | ForEach-Object {
        $content = Get-Content -Encoding UTF8 $_.FullName -Raw
        $content = $content -replace $oldText, $newText
        Set-Content -Path $_.FullName -Value $content -Encoding UTF8
    }
}

I kept iterating on this code via Bing AI but nothing worked. Neither (?msi) nor (?s) helped. I don’t know what I am doing wrong. Is VSC’s implementation of regex different from PowerShell’s?

lordlance,
Welcome to the forum. :wave:t3:

Are these files from Windows? Sometimes on Windows line breaks are represented by \r\n instead of just a \n. Could you share an original sample file?
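
One way to check which line endings a text uses, as a minimal sketch: the helper name Get-LineEndingStyle and the sample cue are made up for illustration and are not from the thread. It counts \r\n pairs and bare \n characters separately.

```powershell
# Hypothetical helper: reports which line-break style a string contains.
function Get-LineEndingStyle {
    param([string]$Text)

    # Windows-style breaks: carriage return followed by line feed.
    $crlf = [regex]::Matches($Text, '\r\n').Count
    # Unix-style breaks: a line feed NOT preceded by a carriage return.
    $lf = [regex]::Matches($Text, '(?<!\r)\n').Count

    if ($crlf -gt 0 -and $lf -gt 0) { 'Mixed' }
    elseif ($crlf -gt 0) { 'CRLF' }
    elseif ($lf -gt 0) { 'LF' }
    else { 'None' }
}

# Made-up subtitle cue with Windows line endings.
Get-LineEndingStyle -Text "1`r`n00:00:01,000 --> 00:00:02,000`r`nHello`r`n"
```

For a real file, the text could be read with Get-Content -Raw and passed to the same helper.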

I’m not on my laptop but the SRT is generated by Whisper AI. Don’t know if that changes things.

The regex101 link has sample text, by the way.

Well, I don’t either. :man_shrugging:t3: I’d suspect it depends on which system these files were saved on. :thinking:

Since it works there I’d try to reproduce it the way you have it. :point_up:t3: :wink:

I did reproduce it. The same regex.

:thinking:

:thinking:

I used the text from the regex101 sample and created a sample file myself. It works for me this way:

$oldText = '(\d+\r?\n\d+.*?\r?\n(.*?))(?:\r?\n)+(?:\d+\r?\n\d+.*?\r?\n\2(?:\r?\n)+)+'
$newText = "`$1`r`n`r`n"
$NewContent = 
    [System.IO.File]::ReadAllText('full\path\to\your\file.srt') -replace $oldText, $newText
$NewContent
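
One detail about $newText is worth spelling out. In a .NET replacement string only $ substitutions are special, so a \n written there is inserted as a literal backslash followed by n. PowerShell's own escapes (`r, `n) work only inside double quotes, where the $ of $1 then has to be escaped with a backtick. A small sketch with a made-up two-line sample, not the thread's actual data:

```powershell
# Hypothetical sample: two lines separated by a Windows line break.
$sample = "a`r`nb"

# Single-quoted replacement: \n is NOT an escape here, it stays as backslash + n.
$literal = $sample -replace '(a)\r?\n', '$1\n'

# Double-quoted replacement: `$1 reaches the regex engine as $1,
# and `r`n becomes a real carriage return + line feed.
$real = $sample -replace '(a)\r?\n', "`$1`r`n"

$literal.Contains('\n')   # True: a literal backslash-n ended up in the text
$real -ceq $sample        # True: the line break was written back intact
```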

I think your script works. I am trying to pass a file with spaces inside double quotes to the user-made whisp command. Here is my script:

function whisperFile {
    $processes = @()
    Get-ChildItem $fileExtensions | ForEach-Object {
        $process = Start-Process -FilePath "pwsh.exe" -ArgumentList "-NoExit -Command & whisp `"`$($_.FullName)`"" -PassThru
        $processes += $process
    }
    $processes | Wait-Process
    
    Remove-SubtitleDuplication
}

For file names with spaces, the spaces are still not handled and only the first word is passed to whisp. I don’t know about character-escaping in pwsh so I used Bing AI to help.

Side note - You made the regex more elaborate. What was specifically wrong with my regex?

I mentioned it in my first answer. On Windows line breaks often consist of two characters … \r\n. I added that extension to your regex. :man_shrugging:t3:

Using an AI for code production actually requires enough knowledge of the target programming or scripting language from the user. Since you cannot trust the output of an AI you have to validate it yourself.

I know about \r and \n. That is not why I asked. You made the code larger besides just using \r and \n. You have 3 groups instead of just the 2 in mine. That is what I am asking about.

I have made lots of pwsh functions using just Bing AI (and sometimes help of others) so it works. This time though because it was regex-related I knew it was probably something related to regex implementation.

And because you actually cannot validate it correctly you end up with clunky, cumbersome and inefficient code. :man_shrugging:t3:

If you’re looking for a single line break with \n, it’s fine to extend it to \r?\n so that it also matches a possibly present \r (carriage return). But if you’re looking for one or more consecutive line breaks with \n+, you need to group \r? and \n together so that the quantifier repeats the whole pair; otherwise it won’t match correctly when there is more than one. So you use a non-capturing group like this: (?:\r?\n)+
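
The difference can be seen on a short made-up string containing two consecutive Windows line breaks. This is only an illustrative sketch; the variable names are not from the scripts above.

```powershell
# Hypothetical text: "x", two CRLF line breaks, then "y".
$text = "x`r`n`r`ny"

# \n+ matches only the first \n, because the next character is \r, not \n.
$plain = [regex]::Match($text, '\n+').Value.Length

# (?:\r?\n)+ repeats the whole line break, so both CRLF pairs are consumed.
$grouped = [regex]::Match($text, '(?:\r?\n)+').Value.Length

'{0} character(s) matched by \n+, {1} by (?:\r?\n)+' -f $plain, $grouped
```

With CRLF input the first pattern covers 1 character and the second covers 4. The same reasoning explains why \d+\n never matches in such a file: a \r sits between the digits and the \n.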

It won’t be flawless, but to a degree you can validate whether what you want is getting done or not.

If your standards are that low - you’re right. :man_shrugging:t3: :smirk: