Select portion of text and merge them

Darkveemon1 · January 24, 2025, 6:23pm

Hi, I don’t know if this is easy to do; I’m not very used to text editing in PowerShell.

Here is my starting point: I have a few PDF files with a variable number of pages (generally 2 to 5), and I need to count the occurrences of a specific word (11256) in them. I’ll need to process a single file per day.

I found a library to import the PDF and convert it to text: PowerShell Gallery | PSWritePDF 0.0.20

But there seems to be a problem.

$file1 = Convert-PDFToText -FilePath "C:\Chiusure servizi\2025-01-24_16-52-16 servizio 11284_chiusura analitica_30MGA--CASSA-----12.pdf"
#$file1 | Out-file -FilePath "C:\MonitorPrint\IVOP_txt_pdf\temporaneo.txt" -Append

# $Matches is an automatic variable in PowerShell, so the result is stored under a different name
$found = Select-String -InputObject $file1 -Pattern "11256" -AllMatches

$occurrences = $found.Matches.Count
Write-Output $occurrences

The file gets imported as text, but the pages come out wrong: some are duplicated for no apparent reason. In my last import, a 3-page PDF came out as (in order): Page 1, Page 1, Page 2, Page 1, Page 2, Page 3.

To get to the point, I would like to fix the text after the import. Every page begins with the same header, “Export Data”, and has the page number at the bottom, like so: “1 / x, 2 / x …”.

My idea is to extract the text between two headers (it is not a problem if the header gets duplicated) and save each part to a variable; then I would check the page number in each variable and discard the pages I already have.

Thank you in advance for the help

matt-bloomfield · January 24, 2025, 7:16pm

Looks like the author has already opened a bug for this:

github.com/EvotecIT/PSWritePDF

Convert-PDFToText - possible bug

opened 07:05AM - 28 Apr 24 UTC

PrzemyslawKlys

bug

Reported on LinkedIn, to be verified.

It seems that the function Convert-PDFToText is working a bit incorrectly. I have to test further, but for the moment (in my environment) it works like this: assuming that a PDF has multiple pages with PageText1, PageText2, ... PageTextN, after running the function I get a result where the text from every next page includes all the text from the previous pages, something like "PageText1PageText1PageText2PageText1PageText2PageText3" for a PDF of 3 pages.

It seems that (in my environment) I could fix it by explicitly declaring a new `TextExtractionStrategy` for every call of **GetTextFromPage**, so line 1754

`[iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($ExtractedPage, $iTextExtractionStrategy)`

is converted to

`[iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($ExtractedPage, [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new())`

After this fix, extraction worked as expected.

Don’t think there’s going to be much more the forum can suggest than to reach out to the owners of the product on GitHub.
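
In the meantime, the workaround described in the issue could probably be applied from your own script by calling iText directly and creating a new strategy for each page. This is an untested sketch, assuming that importing PSWritePDF has already loaded the iText 7 types into the session; `Get-PdfPageText` is a hypothetical helper name, not a PSWritePDF command:

```powershell
# Assumes "Import-Module PSWritePDF" has already been run, so the iText types are available.
# Get-PdfPageText is a hypothetical helper, not part of PSWritePDF.
function Get-PdfPageText {
    param(
        [Parameter(Mandatory)]
        [string]$FilePath
    )

    $reader   = [iText.Kernel.Pdf.PdfReader]::new($FilePath)
    $document = [iText.Kernel.Pdf.PdfDocument]::new($reader)
    try {
        $pageCount = $document.GetNumberOfPages()
        for ($n = 1; $n -le $pageCount; $n++) {
            # A new strategy per page, so text from earlier pages is not carried over
            $strategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new()
            # Emits one string per page
            [iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($document.GetPage($n), $strategy)
        }
    }
    finally {
        # Closing the document also releases the underlying reader
        $document.Close()
    }
}
```

Something like `$pages = Get-PdfPageText -FilePath $path` should then give one string per page, which can be searched for 11256 without any duplicates.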

Darkveemon1 · January 24, 2025, 7:55pm

I know I could ask the developer, but I’m not sure he is still active.

In the meantime I was looking for a home-made solution. Is it possible to extract a portion of text between 2 specific words?

Ty in advance

matt-bloomfield · January 24, 2025, 8:38pm

You should be able to do that with a regular expression (regex):
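
A minimal sketch of that idea, using made-up sample text in place of the real `Convert-PDFToText` output. It assumes every page starts with “Export Data” and that the “N / total” page number sits on its own line; the variable names and sample rows are only for illustration:

```powershell
# Generic "text between two words": (?s) lets . cross line breaks, .*? stops at the first END
$between = [regex]::Match('aaa START middle END bbb', '(?s)START(.*?)END').Groups[1].Value.Trim()

# Hypothetical sample pages, joined in the duplicated order reported above: 1, 1, 2, 1, 2, 3
$page1 = "Export Data`nrow 11256 a`nrow 11256 b`n1 / 3"
$page2 = "Export Data`nrow 11256 c`n2 / 3"
$page3 = "Export Data`nrow 99999 d`n3 / 3"
$text  = ($page1, $page1, $page2, $page1, $page2, $page3) -join "`n"

# Split in front of every header; the lookahead keeps "Export Data" inside each chunk
$chunks = $text -split '(?=Export Data)' | Where-Object { $_.Trim() }

# Keep only the first chunk seen for each page number (string keys on purpose:
# an integer index on an ordered dictionary is positional, not a key lookup)
$pages = [ordered]@{}
foreach ($chunk in $chunks) {
    if ($chunk -match '(?m)^\s*(\d+)\s*/\s*\d+\s*$') {
        $pageNumber = $Matches[1]
        if (-not $pages.Contains($pageNumber)) {
            $pages[$pageNumber] = $chunk
        }
    }
}

# Count the word in the de-duplicated text only
$cleanText   = @($pages.Values) -join "`n"
$occurrences = [regex]::Matches($cleanText, '11256').Count
Write-Output $occurrences
```

Chunks with no page number (for example a duplicated header on its own) are simply skipped. If the real footer shares a line with other text, the page-number pattern would need adjusting.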
