Efficient text log processing

by bitzbyte at 2012-09-13 11:44:20

I have an application that produces several text logs, each running into the hundreds of MB, in the following general format:

[<WindowsEventID>] (<Timestamp>):<LogLine*>

* within the <LogLine> token there may be a username that needs to be searched for as well

I would like to search the files with PS as well as break the result lines down into the tokens listed above. Currently I’m using Select-String to do both at the same time with the patterns below:

[code2=powershell]# Search with user info
"^\[(\d{4,5})\] \(([^)]+)\):.*<username>.*"[/code2]
[code2=powershell]# General search query
"^\[(\d{4,5})\] \(([^)]+)\):(.*)"[/code2]
This works, but is rather slow – 30 seconds (upper end of acceptable) to 2 minutes (way too long when you’re troubleshooting with someone on the phone) is not unusual. Is there a more efficient way I could go about this? I’ve debated breaking it into two pieces and searching in the first pass, then splitting up the result lines in a second pass, but on the surface I don’t see that being much better.

Thanks!
by DonJ at 2012-09-13 11:52:19
Ow. Regex makes my head hurt.

I suspect you just may be running into the less-than-awesome file I/O routines that underlie Select-String. I have in mind that v3 was supposed to include some improvements to that, but maybe that’s just wishful thinking. You might try with v3, though, and see if there’s an improvement.

The other approach would be to parallelize this - not only break the files up into chunks, which you’d need to do, but have those chunks run at the same time. You could do that with background jobs, or in v3 you could write a fairly simple workflow.
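A very rough sketch of the jobs idea, with placeholder paths and the pattern from your post (this parallelizes per file rather than per chunk, but it shows the shape):

[code2=powershell]# One background job per log file, so Select-String runs across files in parallel
$pattern = '^\[(\d{4,5})\] \(([^)]+)\):(.*)'
$jobs = Get-ChildItem C:\Logs\*.log | ForEach-Object {
    Start-Job -ArgumentList $_.FullName, $pattern -ScriptBlock {
        param($path, $pattern)
        Select-String -Path $path -Pattern $pattern
    }
}
$jobs | Wait-Job | Receive-Job[/code2]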
by willsteele at 2012-09-13 12:06:48
Although I am not sure it will directly benefit what you are doing with the Select-String cmdlet, the following video offers some very interesting ways to think about optimizing processing: http://bits_video.s3.amazonaws.com/022012-SUPS01_archive.f4v. It is possible that a straight read of the file, passing each line to -match, could be faster, but I don’t have an exact command set to offer at the moment. Let me see if I can come up with a good demo.
by mjolinor at 2012-09-13 12:53:59
I would approach this using a combination of get-content with -readcount, and -match operations.

The -match operator can be applied to an array, but the file is too big to read into memory all at once.

Use get-content with -readcount to break it up into manageable chunks, then foreach through those, using -match against each one.
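Something along these lines (the path and block size are just examples):

[code2=powershell]# Read the log in blocks of lines; -match against each block (an array)
# returns only the lines in that block that match the pattern
$pattern = '^\[(\d{4,5})\] \(([^)]+)\):(.*)'
Get-Content C:\Logs\app.log -ReadCount 5000 | ForEach-Object {
    $_ -match $pattern
}[/code2]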
by proxb at 2012-09-13 17:45:07
It’s been a while since I have done it, but I seem to remember having decent performance on a large file using Switch with -regex and -file parameters.
switch -Regex -File <filename> {
    "pattern" { 'Stuff' }
}
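Something like this, using the regex from the original post and $matches inside the action block (the path is a placeholder):

[code2=powershell]# switch -Regex streams the file and populates $matches for each matching line
switch -Regex -File C:\Logs\app.log {
    '^\[(\d{4,5})\] \(([^)]+)\):(.*)' {
        New-Object PSObject -Property @{
            EventID   = $matches[1]
            Timestamp = $matches[2]
            LogLine   = $matches[3]
        }
    }
}[/code2]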
by bitzbyte at 2012-09-14 05:40:13
Thanks for the suggestions everyone! I’ll do some testing and report back with what I find.