Select portions of text and merge them

Hi, I don’t know if this is easy to do; I’m not very experienced with text editing in PowerShell.

Here is my starting point: I have a few PDF files with a variable number of pages (generally 2 to 5), and I need to count the occurrences of a specific word (11256) in them. I’ll need to process a single file per day.

I found a library to import the PDF and convert it to text: PowerShell Gallery | PSWritePDF 0.0.20

But there seems to be a problem.

$file1 = Convert-PDFToText -FilePath "C:\Chiusure servizi\2025-01-24_16-52-16 servizio 11284_chiusura analitica_30MGA--CASSA-----12.pdf"
#$file1 | Out-file -FilePath "C:\MonitorPrint\IVOP_txt_pdf\temporaneo.txt" -Append

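# $Matches is an automatic variable in PowerShell, so a different name is used here
$found = Select-String -InputObject $file1 -Pattern "11256" -AllMatches

# Same variable name for the assignment and the output (the original had two different spellings)
$occurrences = $found.Matches.Count
Write-Output $occurrences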

The file gets imported as text, but the pages come out wrong: some get duplicated for no apparent reason. In my last import, a PDF with 3 pages was imported as (in order): Page 1, Page 1, Page 2, Page 1, Page 2, Page 3.

To get to the point, I would like to fix the text after the import. Every page begins with the same header, “Export Data”, and has the page number at the bottom, like so: “1 / x, 2 / x …”.

My idea is to extract the text between two headers (it is not a problem if the header gets duplicated) and save each part to a variable. Then I would check the page number in each variable and discard the pages I already have.

Thank you in advance for the help

Looks like the author has already opened a bug for this:

Don’t think there’s going to be much more the forum can suggest than to reach out to the owners of the product on GitHub.


I know I could ask the developer, but I’m not sure he is still active.

I was looking for a home-made solution in the meantime. Is it possible to extract a portion of text between two specific words?

Thanks in advance

You should be able to do that with a regular expression (regex):
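
A minimal sketch of that idea, not a tested fix for PSWritePDF itself. It assumes that every page starts with the header “Export Data” and that the last “n / x” inside a page is its footer, as described in the question. The sample text, the variable names and both patterns are illustrative assumptions; with a real file you would build `$text` from the `Convert-PDFToText` output instead (for example `$text = $file1 -join "`n"`).

```powershell
# Hypothetical sample standing in for the Convert-PDFToText output;
# it mimics the reported order: page 1, 1, 2, 1, 2, 3.
$pages = @(
    "Export Data`nitem 11256`n1 / 3",
    "Export Data`nitem 11256`n1 / 3",
    "Export Data`nitem 11256 and 11256`n2 / 3",
    "Export Data`nitem 11256`n1 / 3",
    "Export Data`nitem 11256 and 11256`n2 / 3",
    "Export Data`nnothing here`n3 / 3"
)
$text = $pages -join "`n"

# (?s) lets '.' span line breaks; the lazy '.*?' stops right before the next
# header (or at the end of the text), so each match is one page chunk.
$chunkPattern  = '(?s)Export Data.*?(?=Export Data|\z)'
# Page footer such as "2 / 3"; the first group captures the page number.
$footerPattern = '(\d+)\s*/\s*\d+'

# Page number -> text of the first chunk seen for that page.
$uniquePages = @{}
foreach ($chunk in [regex]::Matches($text, $chunkPattern)) {
    $footers = [regex]::Matches($chunk.Value, $footerPattern)
    if ($footers.Count -eq 0) { continue }   # e.g. a doubled header with no page body
    # Assumption: the last "n / x" in the chunk is the page footer.
    $pageNumber = [int]$footers[$footers.Count - 1].Groups[1].Value
    if (-not $uniquePages.ContainsKey($pageNumber)) {
        $uniquePages[$pageNumber] = $chunk.Value
    }
}

# Merge the unique pages back together in page order and count the word.
$cleanText = ($uniquePages.Keys | Sort-Object | ForEach-Object { $uniquePages[$_] }) -join "`n"
$occurrences = [regex]::Matches($cleanText, '11256').Count
Write-Output $occurrences
```

The lookahead `(?=Export Data|\z)` keeps the next header out of the current match, so consecutive pages are not swallowed into one chunk. Chunks without a page number are skipped, which also covers a duplicated header.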
