We need to find PDF documents stored on domain computers that contain sensitive data and log the computer name along with the file path to a CSV file for auditing. Eventually another script will move the files to a network share. When I run my script below, it appears to pick up 2 of the results in the output, but it never writes the CSV and also throws a seemingly endless string of errors. I can't seem to make any headway on this one! Is there anything I might be doing incorrectly here?
# Define the list of target phrases to search for in the PDFs
$searchPhrases = @(
"distribution statement b",
"distribution statement c",
"distribution statement d",
"distribution statement e",
"distribution statement f"
)
# Define the path to pdftotext (ensure this is correct on each machine or available in the environment PATH)
$pdftotextPath = "C:\Temp\xpdf-tools-win-4.05\bin64\pdftotext.exe" # Adjust accordingly
# Define the output CSV file for the results
$outputCsv = "C:\pdf_search_results.csv"
# Initialize an empty array to hold the results
$results = @()
# Get a list of all domain computers
#$computers = Get-ADComputer -Filter * | Select-Object -ExpandProperty Name
$computers = Get-ADComputer -Identity test-xps | Select-Object -ExpandProperty Name
# Loop through each computer and search for PDF files
foreach ($computer in $computers) {
Write-Host "Searching computer: $computer"
# Define the path to search for PDFs on the remote computer (you can modify this)
$pdfFolderPath = "\\$computer\C$\Users"
# Ensure the path exists before continuing
if (Test-Path $pdfFolderPath) {
# Get all PDF files in the folder
$pdfFiles = Get-ChildItem -Path $pdfFolderPath -Recurse -Filter "*.pdf" -ErrorAction SilentlyContinue
foreach ($pdfFile in $pdfFiles) {
# Convert PDF to text using pdftotext
$outputText = & $pdftotextPath -layout $pdfFile.FullName -
# Check if any of the search phrases exist in the PDF content
foreach ($phrase in $searchPhrases) {
if ($outputText -match $phrase) {
# Add the results to the array with hostname and full file path
$results += [PSCustomObject]@{
Hostname = $computer
FilePath = $pdfFile.FullName
PhraseMatched = $phrase
}
Write-Host "Found phrase '$phrase' in file: $($pdfFile.FullName) on $computer"
}
}
}
}
else {
Write-Host "Path not found: $pdfFolderPath on $computer"
}
}
# Export results to CSV if any matches were found
if ($results.Count -gt 0) {
$results | Export-Csv -Path $outputCsv -NoTypeInformation
Write-Host "Search completed. Results saved to $outputCsv"
} else {
Write-Host "No matches found."
}
Searching computer: TEST-XPS
Found phrase 'distribution statement b' in file: \\TEST-XPS\C$\Users\Administrator\Downloads\2003-4CH1.pdf on TEST-XPS
Found phrase 'distribution statement c' in file: \\TEST-XPS\C$\Users\Administrator\Downloads\NAVAIR 17-35TR-07.pdf on TEST-XPS
pdftotext.exe : Syntax Error: Couldn't find 'UniGB-UCS2-H' CMap file for 'Adobe-GB1' collection
At line:37 char:27
+ ... $outputText = & $pdftotextPath -layout $pdfFile.FullName -
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Syntax Error: C...GB1' collection:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
Syntax Error: Unknown CMap 'UniGB-UCS2-H' for character collection 'Adobe-GB1'
Syntax Error: Failed to parse font object for 'STSong-Light-UniGB-UCS2-H'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
Syntax Error: Unknown font tag 'F1'
You could wrap the pdftotext call in a try/catch block so that the error is caught and you can log that you could not process that particular PDF.
It looks like the output might be truncated, but if this is a terminating error (causing the script to stop) then you’re never hitting your export block which is why the CSV file is never written.
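A minimal sketch of the per-file try/catch idea, so that one bad PDF is logged and skipped rather than ending the whole run. The function name Invoke-PerFileExtraction and the -Extractor script block are hypothetical stand-ins (the script block is where the pdftotext call would go), and the demo extractor just simulates a failure. As far as I know, stderr output from a native command is not by itself a terminating error, so checking $LASTEXITCODE after the real call may also be needed.

```powershell
# Hypothetical helper: run an extractor against each file and record failures
# instead of letting one bad PDF stop the loop.
function Invoke-PerFileExtraction {
    param (
        [string[]]$Path,
        [scriptblock]$Extractor   # stand-in for the pdftotext call
    )
    foreach ($file in $Path) {
        try {
            $text = & $Extractor $file
            [PSCustomObject]@{ FilePath = $file; Text = $text; Error = $null }
        }
        catch {
            # Record the failure and carry on with the next file
            [PSCustomObject]@{ FilePath = $file; Text = $null; Error = $_.Exception.Message }
        }
    }
}

# Demo extractor (assumption): throws for one file to simulate a PDF that cannot be processed
$demoExtractor = {
    param($file)
    if ($file -like '*bad*') { throw "Could not process $file" }
    "text extracted from $file"
}

$extraction = Invoke-PerFileExtraction -Path 'good.pdf', 'bad.pdf', 'other.pdf' -Extractor $demoExtractor

# Report only the files that failed
$extraction | Where-Object Error | ForEach-Object { Write-Host "Skipped $($_.FilePath): $($_.Error)" }
```

Because every file produces an object, successful or not, the failures can be exported to their own log alongside the matches.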
I think you really need to review your approach to this, because at the moment you’re running the script once and targeting every computer in your domain (including servers), and if any one of those generates a terminating error you’ll get no output, even for the computers that were successful.
You should consider Invoke-Command and running the script once per computer so that you can log which computers were successfully scanned. Select-String will also be more performant than looping over the phrases.
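To illustrate the Select-String suggestion, here is a rough sketch that checks all of the phrases in one pass instead of looping over them. The sample lines are made up and stand in for the pdftotext output, which I have not inspected, so treat this as an assumption about its shape (one string per line).

```powershell
# The phrases from the original script
$searchPhrases = @(
    'distribution statement b',
    'distribution statement c',
    'distribution statement d',
    'distribution statement e',
    'distribution statement f'
)

# Sample lines standing in for pdftotext output (made up for illustration)
$sampleText = @(
    'Technical Manual',
    'DISTRIBUTION STATEMENT C. Sample wording for illustration only.',
    'Page 2'
)

# -SimpleMatch treats each phrase literally; matching is case-insensitive by default
$hits = $sampleText | Select-String -Pattern $searchPhrases -SimpleMatch

foreach ($hit in $hits) {
    # Pattern is the phrase that matched, Line is the text it was found in
    Write-Host "Matched '$($hit.Pattern)' in: $($hit.Line)"
}
```

Each hit records which phrase matched, so the PhraseMatched column from the first script can still be filled in without the inner foreach.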
Thank you so much for the suggestion. I found something else that uses pdftotext.exe, and while it runs without any error messages, it also yields no results in the CSV file. I have several PDF documents I am testing with, and several of them contain the phrase “distribution statement c” or “distribution statement b”, but the script still does not log these file paths as expected.
This time I have used a .txt file containing the computer names. Do I just have something wrong with the results output near the bottom of the script?
# Set the domain computers to search (you can modify this part to get computer names dynamically)
$computers = Get-Content "C:\temp\computers.txt"
$outputCsv = "C:\temp\pdf_search_results.csv"
$searchText = "distribution statement b", "distribution statement c", "distribution statement d", "distribution statement e", "distribution statement f" # The text string you are searching for
# Initialize an array to store results
$results = @()
# Function to search PDFs on a specific computer
function Search-PDFsOnComputer {
param (
[string]$computerName,
[string]$searchText
)
# Define the path to where PDF files are likely stored (can adjust paths as needed)
#$pdfPaths = Get-ChildItem -Path "\\$computerName\C$\Users" -Recurse -Include *.pdf -ErrorAction SilentlyContinue
$pdfPaths = Get-ChildItem -Path "\\$computerName\c$\users\Administrator\Documents" -Recurse -Include *.pdf -ErrorAction SilentlyContinue
foreach ($pdf in $pdfPaths) {
# Extract text from PDF using pdftotext (ensure pdftotext.exe is in your PATH)
$pdfText = & "C:\Temp\xpdf-tools-win-4.05\bin64\pdftotext.exe" $pdf.FullName - 2>$null
# Check if the text exists in the PDF content
if ($pdfText -match $searchText) {
# Store the result
$results += [PSCustomObject]@{
ComputerName = $computerName
FilePath = $pdf.FullName
MatchFound = $true
}
}
}
}
# Loop through all computers and search for PDFs
foreach ($computer in $computers) {
Write-Host "Searching PDFs on $computer..."
Search-PDFsOnComputer -computerName $computer -searchText $searchText
}
# Output results to CSV
$results | Export-Csv -Path $outputCsv -NoTypeInformation
Write-Host "Search complete. Results written to $outputCsv"
Not getting any output suggests you’re not getting any results? Have you checked by writing $results to the console?
The reason you’re not getting any results is because -match expects a regular expression and you’re providing a list of strings. You can see this doesn’t work with a quick test:
$searchText = "distribution statement b", "distribution statement c", "distribution statement d", "distribution statement e", "distribution statement f" # The text string you are searching for
$pdfText = 'distribution statement b'
$pdfText -match $searchText
False
Change your search text to a regular expression and this part will work:
$searchText = "distribution statement [bcdef]" # The text string you are searching for
$pdfText = 'distribution statement b','distribution statement d'
$pdfText -match $searchText
I will offer the caveat that I have no idea what the output from pdftotext looks like and whether -match is the appropriate operator. If you still don’t get any output consider using Select-String instead.
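If you would rather keep the phrases as a list, one option is to build the regular expression from it. The sketch below also shows what happens to the list when it is cast to [string], as the [string]$searchText parameter in the function does. The sample $pdfText lines are made up for illustration.

```powershell
$searchPhrases = 'distribution statement b', 'distribution statement c', 'distribution statement d', 'distribution statement e', 'distribution statement f'

# Casting the list to [string] joins the phrases with spaces into one long string,
# which is what -match ends up comparing against
[string]$joined = $searchPhrases

# Escape each phrase and join with | so the regex matches any one of them
$pattern = ($searchPhrases | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Made-up sample lines standing in for the extracted PDF text
$pdfText = 'Page 1', 'Distribution Statement D applies (sample line)', 'Page 2'

$noHits = @($pdfText -match $joined)    # empty: no line contains the joined string
$hits   = @($pdfText -match $pattern)   # contains the 'Distribution Statement D' line

Write-Host "Joined string hits: $($noHits.Count); alternation hits: $($hits.Count)"
```

When the left-hand side of -match is an array, the operator returns the matching elements rather than True or False, which is why the results are wrapped in @() and counted here.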
Along the same lines as Matt, I would also test this line manually. I also don't have the tool, but it looks like you are trying to redirect the error stream to $null (2>$null), and I think PS might be seeing the - as an arithmetic subtraction operator.
Thank you both so much for the guidance; that seems to have done the trick! This is my final script, which is functioning flawlessly!
# Set the domain computers to search
$computers = Get-Content "C:\temp\computers.txt" # Or use a query to get all domain computers
$outputCsv = "C:\Temp\pdf_search_results.csv"
$searchText = "distribution statement [bcdef]" # Regular expression matching distribution statements b through f
# Initialize an ArrayList to store results (instead of a normal array)
$results = New-Object System.Collections.ArrayList
# Function to search PDFs on a specific computer
function Search-PDFsOnComputer {
param (
[string]$computerName,
[string]$searchText
)
# Define the path to where PDF files are likely stored (you can modify this)
$pdfPaths = Get-ChildItem -Path "\\$computerName\C$\Users\" -Recurse -Include *.pdf -ErrorAction SilentlyContinue
if ($pdfPaths) {
Write-Host "Found $($pdfPaths.Count) PDFs on $computerName"
} else {
Write-Host "No PDFs found on $computerName"
}
foreach ($pdf in $pdfPaths) {
# Extract text from PDF using pdftotext (adjust the hard-coded path below to match where pdftotext.exe is installed)
try {
# Capture the text output of the PDF using pdftotext
$pdfText = & "C:\Temp\xpdf-tools-win-4.05\bin64\pdftotext.exe" $pdf.FullName - 2>$null
# If no text is extracted or if pdftotext fails, skip this file
if (-not $pdfText) {
Write-Host "No text extracted from PDF: $($pdf.FullName)"
continue
}
# Check if the extracted text matches the search string
if ($pdfText -match $searchText) {
# Create a custom object to store the result
$resultObject = [PSCustomObject]@{
ComputerName = $computerName
FilePath = $pdf.FullName
MatchFound = $true
}
# Use Add() to append to the ArrayList
$null = $results.Add($resultObject)
Write-Host "Match found in $($pdf.FullName)"
}
} catch {
# Catch any errors from pdftotext and log them
Write-Host "Error processing PDF $($pdf.FullName): $_"
}
}
}
# Loop through all computers and search for PDFs
foreach ($computer in $computers) {
Write-Host "Searching PDFs on $computer..."
Search-PDFsOnComputer -computerName $computer -searchText $searchText
}
# Output results to CSV if any matches are found
if ($results.Count -gt 0) {
$results | Export-Csv -Path $outputCsv -NoTypeInformation
Write-Host "Search complete. Results written to $outputCsv"
} else {
Write-Host "No matches found. No data written to CSV."
}
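A note for anyone adapting this: the switch to an ArrayList probably mattered as much as the regex change. Using += inside a function assigns to a new local variable, so the script-level $results in the earlier version would have stayed empty even when a match was found, whereas .Add() modifies the existing list. A rough sketch of the difference follows, along with an alternative where the function simply emits objects and the caller collects them. Find-PhraseInText and the sample values are hypothetical, not part of the script above.

```powershell
# += inside a function assigns to a new local variable; the caller's array is unchanged
$plusEqualsResults = @()
function Add-WithPlusEquals { $plusEqualsResults += 'match' }
Add-WithPlusEquals
Write-Host "After +=: $($plusEqualsResults.Count) item(s)"   # still 0

# Alternative (hypothetical helper): emit result objects and let the caller collect them
function Find-PhraseInText {
    param (
        [string]$ComputerName,
        [string]$FilePath,
        [string[]]$Text,      # extracted text, one string per line
        [string]$Pattern      # regular expression to look for
    )
    if ($Text -match $Pattern) {
        [PSCustomObject]@{
            ComputerName = $ComputerName
            FilePath     = $FilePath
            MatchFound   = $true
        }
    }
}

# Made-up sample inputs; only the first one contains a matching phrase
$collected = @(
    Find-PhraseInText -ComputerName 'PC1' -FilePath 'a.pdf' -Text 'distribution statement c' -Pattern 'distribution statement [bcdef]'
    Find-PhraseInText -ComputerName 'PC1' -FilePath 'b.pdf' -Text 'nothing to see' -Pattern 'distribution statement [bcdef]'
)
Write-Host "Collected $($collected.Count) result(s)"
```

Emitting objects keeps the function independent of any variable defined outside it, which also makes it easier to run per computer later on.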