Multi-Threading & File-Based Logging

I have a script that would GREATLY benefit from being multi-threaded. The script basically looks up users in two systems, compares them, makes updates if needed, then logs its actions. Most of the time is spent doing the lookups, so it would GREATLY benefit from being able to spin up several threads (as jobs, I'm assuming).

Currently it takes 2.5 seconds per user to process. That’s fine for 500 or so users, but scaling it to 10,000 becomes ugly.

The first hurdle I need to figure out is how to handle logging. Currently the process does the lookups, and the main function returns at various points depending on the results it finds, writing an entry to one or two different CSV files. When I'm all done, I read in those files, cast them to HTML, and send out a nice little report.

My problem is, I can’t figure out how to handle logging if I turn this into a multi-threaded process. I’m open to suggestions.

Without posting the entire set of code, here is the basis of how it works (a rough sketch follows the list):

Reads in Userlist, dumps that userlist to the main function (big ol’ ugly foreach loop).
The main function writes to log 1, to logs 1 & 2, or just to log 3, depending on whether users existed in both locations, needed updates, were removed, etc. (via Out-File).
When done, I read in the 3 CSV files, cast them to HTML, then email out a report.
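Heavily simplified, the current single-threaded shape is something like this (the Get-SystemAUser / Get-SystemBUser / Update-SystemBUser calls and file names are stand-ins for the real lookups):

```powershell
# Simplified sketch of the current single-threaded flow; the lookup and
# update cmdlet names are placeholders for the real systems.
$users = Import-Csv .\Userlist.csv

foreach ($user in $users) {
    $a = Get-SystemAUser -Identity $user.SamAccountName   # slow lookup #1
    $b = Get-SystemBUser -Identity $user.SamAccountName   # slow lookup #2

    if (-not $b) {
        "$($user.SamAccountName),MissingInSystemB" | Out-File .\Log3.csv -Append
        continue
    }

    if ($a.Title -ne $b.Title) {
        Update-SystemBUser -Identity $user.SamAccountName -Title $a.Title
        "$($user.SamAccountName),Updated" | Out-File .\Log2.csv -Append
    }

    "$($user.SamAccountName),Checked" | Out-File .\Log1.csv -Append
}

# When everything is done, read the logs back, cast to HTML, and mail the report
$report = Import-Csv .\Log1.csv -Header User,Action | ConvertTo-Html | Out-String
Send-MailMessage -To 'ops@example.com' -From 'sync@example.com' `
    -Subject 'User sync report' -Body $report -BodyAsHtml -SmtpServer 'mail.example.com'
```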

Any suggestions on how to keep logging straight and work correctly multi-threading this mess?

It depends a lot on how you implement the multi-threaded-ness.

For example, I’d probably modify the script to write objects to the pipeline. Just give those objects whatever properties you need to show up as columns in your final CSV - but the trick is to have all the output in those objects. Give the script input parameters so it knows what to do - including, perhaps, an array of paths to process.
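Something along these lines, just to illustrate the idea (the lookup/update cmdlet names below are placeholders, and I've used a user-name parameter rather than paths since that fits your scenario):

```powershell
# Process-Users.ps1 -- sketch of a worker script that emits objects instead of
# writing log files; Get-SystemAUser / Get-SystemBUser / Update-SystemBUser
# stand in for the real lookups and updates.
param(
    [Parameter(Mandatory)]
    [string[]]$UserName
)

foreach ($name in $UserName) {
    $a = Get-SystemAUser -Identity $name
    $b = Get-SystemBUser -Identity $name

    if (-not $b) {
        $result = 'MissingInSystemB'
    }
    elseif ($a.Title -ne $b.Title) {
        Update-SystemBUser -Identity $name -Title $a.Title
        $result = 'Updated'
    }
    else {
        $result = 'NoChange'
    }

    # One object per user; the Result property replaces "which log it went to"
    [pscustomobject]@{
        UserName = $name
        Result   = $result
        Checked  = Get-Date
    }
}
```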

Then, use Start-Job to start the job, giving it a block of paths. Repeat that, getting multiple jobs running, each job being a copy of the same script dealing with different blocks of paths. PowerShell will cache up all their output in memory.
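Roughly like this (chunk size and script name are just examples, using the worker sketch above):

```powershell
# Split the user list into chunks and start one background job per chunk.
$allUsers  = @(Import-Csv .\Userlist.csv | Select-Object -ExpandProperty SamAccountName)
$chunkSize = 500

for ($i = 0; $i -lt $allUsers.Count; $i += $chunkSize) {
    $chunk = $allUsers[$i..([Math]::Min($i + $chunkSize, $allUsers.Count) - 1)]
    # The unary comma passes the whole chunk as a single array argument
    Start-Job -FilePath .\Process-Users.ps1 -ArgumentList (,$chunk) | Out-Null
}
```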

You can then use Receive-Job to get the completed output from ALL of them in one go, piping it to Export-CSV to get your CSV file.
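For example (file names are arbitrary):

```powershell
# Wait for every job to finish, collect all of their output objects at once,
# and write a single CSV; the old "logs" become filtered views of the same data.
$results = Get-Job | Wait-Job | Receive-Job
$results | Export-Csv .\UserSyncResults.csv -NoTypeInformation
$results | Where-Object Result -eq 'Updated' | Export-Csv .\Updated.csv -NoTypeInformation
Get-Job | Remove-Job
```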

I think you get a lot more flexibility when your script doesn’t do “logging,” but instead outputs objects. PowerShell works with those a lot more flexibly than it does files.

If I were setting out to optimize this, I'd lose the per-event file logging and rewrite the script to send the log events to separate output streams (e.g. use Write-Output for everything going to Log1, Write-Host for everything going to Log2, and Write-Verbose for everything going to Log3). When you run that as a background job, you can access and read the output buffers of the child jobs separately. You can even start reading data from those buffers while the job is still running. Once you have the background jobs, it's just a matter of accumulating the output from the buffers of each job into a common variable in your main script, then writing out your CSV and rendering your HTML from the data in those variables.
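A minimal sketch of what that looks like (note that Write-Host only becomes capturable as the Information stream in PowerShell 5+, and the Verbose buffer only fills if the job turns verbose output on):

```powershell
# Sketch: reading each stream of a background job's child job separately.
$job = Start-Job -ScriptBlock {
    $VerbosePreference = 'Continue'         # otherwise the Verbose buffer stays empty
    Write-Output  'row destined for Log1'   # success (output) stream
    Write-Host    'row destined for Log2'   # information stream in PowerShell 5+
    Write-Verbose 'row destined for Log3'   # verbose stream
}

Wait-Job $job | Out-Null

# Each child job buffers its streams separately, even while still running
$log1 = $job.ChildJobs[0].Output
$log2 = $job.ChildJobs[0].Information
$log3 = $job.ChildJobs[0].Verbose
```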

I ended up leveraging MSMQ queues for this. The parent script dumps all the work to be done into one queue, then spins up 20+ jobs which pull items from the queue, process them, and drop the results (depending on the outcome) into other dedicated queues. When all the jobs finish, the parent drains those queues and builds a nice report from the results, which it emails out.
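For anyone curious, the basic MSMQ plumbing from PowerShell looks roughly like this (queue names and message bodies are just examples; System.Messaging requires the MSMQ Windows feature to be installed):

```powershell
# Minimal MSMQ sketch: the parent fills a work queue, workers Receive from it.
Add-Type -AssemblyName System.Messaging

$queuePath = '.\private$\usersync-work'
if (-not [System.Messaging.MessageQueue]::Exists($queuePath)) {
    [System.Messaging.MessageQueue]::Create($queuePath) | Out-Null
}

$queue = New-Object System.Messaging.MessageQueue $queuePath
$queue.Formatter = New-Object System.Messaging.XmlMessageFormatter -ArgumentList (,([Type[]]@([string])))

# Parent: enqueue one message per user (body and label are both the user name here)
Import-Csv .\Userlist.csv | ForEach-Object {
    $queue.Send($_.SamAccountName, $_.SamAccountName)
}

# Worker (inside each job): pull items until the queue runs dry
while ($true) {
    try   { $msg = $queue.Receive([TimeSpan]::FromSeconds(2)) }
    catch { break }   # Receive throws once the timeout passes with nothing left
    $user = $msg.Body
    # ...look up / compare / update $user, then Send the outcome to a results queue
}
```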

End result: it does about 1,000 users in under 3 minutes (whereas it used to take 30+ minutes for half that).