Hi guys,
New to PowerShell. I've been asked by legal/HR to search a folder for files that reference account numbers, either in the file name or in the contents of the Word files. The account numbers come from another file that has over 1000 numbers. The folder contains over 60k documents.
How do I even start…help please?
I assume it will be at least a 2 step process, 1 for the file name and the other for what’s in the file.
Searching the filenames is trivial with PowerShell but you’re going to struggle to search the contents of the Word documents. It’s doable with COM objects but it will be a pain.
I would suggest looking at a tool like Everything instead.
Depending on why you’ve been asked to do this, I would also suggest pushing this back on HR/Legal to pay for a professional resource to gather this information. If this is evidence gathering for some sort of legal proceedings there will be processes that need to be followed and there will be professionals out there that do this investigative stuff for a living.
You can read a text file with Get-Content. The text file should contain your list of a/c numbers with each number on a separate line.
You can get files with Get-ChildItem.
You don’t want to run Get-ChildItem 1000 times so build a list of files once, and then look for matches within the list.
This is a very basic example to demonstrate the idea above. I suspect it will match too many files to be useful, but without knowing how the filenames are structured it’s hard to be more specific.
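A minimal sketch of that idea, for anyone following along. The function name Find-FileNameMatch and both parameter names are made up for illustration, and it assumes the account list is a plain text file with one number per line. It does a simple substring match on the file name, so a short number will also match inside a longer one.

```powershell
# Hypothetical helper: the name and parameters are illustrative only.
function Find-FileNameMatch {
    param(
        [string]$AccountListPath,  # text file, one account number per line (assumed)
        [string]$FolderPath        # root folder to search
    )

    # Read the account numbers once, trimming whitespace and skipping blank lines.
    $accounts = Get-Content -Path $AccountListPath |
        Where-Object { $_.Trim() } |
        ForEach-Object { $_.Trim() }

    # Build the file list once instead of calling Get-ChildItem per account number.
    $files = Get-ChildItem -Path $FolderPath -File -Recurse

    foreach ($file in $files) {
        foreach ($account in $accounts) {
            # Plain substring test against the file name only, not the contents.
            if ($file.Name.Contains($account)) {
                [pscustomobject]@{
                    Account = $account
                    File    = $file.FullName
                }
            }
        }
    }
}
```

Calling it would look something like: Find-FileNameMatch -AccountListPath C:\Temp\accounts.txt -FolderPath D:\Docs | Export-Csv matches.csv -NoTypeInformation (both paths are placeholders).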
“We don’t provide complete scripts on request so you should try to build on this example and come back to us if you get stuck.”
Not a problem, I’m here to learn
What is the context of the folder where these files reside? Any chance the back end is a SharePoint site? I wrote some C# code once to search massive amounts of Office documents and was impressed at how little code that took. I would think one could do the same with PowerShell.
Just thinking outside the box. I also agree with Matt that this is a somewhat daunting task. One other outside-the-box thought: if the documents are .DOCX, you can rename/copy them to .ZIP and process the underlying XML from the archive. That might be a ton of work though.
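A rough sketch of the .DOCX-as-ZIP idea, in case it helps. Get-DocxText is a made-up name, and this assumes the files are real .docx (not the older .doc format). The .NET ZipFile class opens the archive by content, so the rename to .ZIP should not actually be needed. It only reads the main body part, word/document.xml, so text in headers, footers and footnotes (stored in other parts of the archive) is not covered.

```powershell
# Loads the .NET type that provides [System.IO.Compression.ZipFile].
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns the body text of a .docx as a plain string.
function Get-DocxText {
    param([string]$Path)

    # .NET resolves relative paths against its own working directory,
    # so convert to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $xml = ''
    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try {
        # The main document body lives in this part of the archive.
        $entry = $zip.GetEntry('word/document.xml')
        if ($entry) {
            $reader = New-Object System.IO.StreamReader($entry.Open())
            try { $xml = $reader.ReadToEnd() } finally { $reader.Dispose() }
        }
    }
    finally {
        $zip.Dispose()
    }

    # Turn paragraph ends into newlines, then strip the remaining tags.
    # Removing tags without adding spaces rejoins text that Word split
    # across several runs inside one paragraph.
    return ($xml -replace '</w:p>', "`n" -replace '<[^>]+>', '')
}
```

The returned string can then be checked against each account number with the same kind of substring test used for the file names.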