How to extract text (Easy? Think again)

by moorecj79 at 2013-04-08 18:02:13

Hi all,

I have a 200MB txt that contains information as follows:

D:\SecLogs\dc1_20130302_0630.txt:2889: Client Address: 162.128.10.75
D:\SecLogs\dc1_20130302_0630.txt:2931: Client Address: 122.178.120.75
D:\SecLogs\dc1_20130302_0630.txt:3163: Client Address: 129.163.160.208
D:\SecLogs\dc1_20130302_0630.txt:3456: Source Network Address: 122.166.170.75
D:\SecLogs\dc1_20130302_0630.txt:3526: Client Address: 102.128.150.75
D:\SecLogs\dc1_20130302_0630.txt:3562: Client Address: 111.163.110.175
D:\SecLogs\dc1_20130302_0630.txt:3598: Client Address: 172.168.170.75
D:\SecLogs\dc1_20130302_0630.txt:3655: Client Address: 132.168.131.61
D:\SecLogs\dc1_20130302_0630.txt:3691: Client Address: 152.168.129.93
D:\SecLogs\dc1_20130302_0630.txt:3727: Client Address: 112.168.160.233

I need to extract all of the IP addresses. I’ve spent many hours trying to work this out and, to be honest, it’s doing my head in! Any assistance would be greatly appreciated.
by mjolinor at 2013-04-08 18:11:56
Try this:

get-content file.txt -ReadCount 5000 |
foreach {
$_ -replace '.+Client Address: (\S+)','$1'
} | set-content ip_addrs.txt
by moorecj79 at 2013-04-08 18:32:20
Didn’t work for me… Hmmm.
by mjolinor at 2013-04-08 18:59:18
Can you be a little more descriptive? “Didn’t work” covers a lot of territory.
by coderaven at 2013-04-08 19:05:50
Here’s an example:

$input_path = 'c:\ps\ip_addresses.txt'
$output_file = 'c:\ps\extracted_ip_addresses.txt'
$regex = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
by moorecj79 at 2013-04-08 19:49:30
mjolinor: My apologies… my response was too vague! Basically, it appears the output didn’t change at all. I’ll play around with it a bit more.

coderaven: That worked perfectly! Thank you very much.
by mjolinor at 2013-04-08 20:53:03
I probably should have tested that.

Try this:

$input_path = '<input file path>'
$output_path = '<output file path>'
$regex = '.+\s(\S+)\s*$'

get-content $input_path -ReadCount 5000 |
foreach {
$_ -match $regex -replace $regex,'$1'
} | sc $output_path


I duplicated your sample data for 200K records, and came up with about a 14MB file.

That extracted the IP addresses out of it in about 22 seconds.

I ran the same regex through the select-string solution, and it took a minute and 40 seconds to do the same extract.
by MasterOfTheHat at 2013-04-09 06:46:45
Interesting that there is that much of a difference in execution time. Did you run it through Measure-Command to figure that out or … ?

For better performance and better accuracy, moorecj79 may want to combine the two methods (see the sketch after the comparison below). Notice that coderaven’s regex is actually looking for an IP address*, whereas mjolinor’s just pulls the last whitespace-delimited token on each line:
#coderaven
$regex = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
#mjolinor
$regex = '.+\s(\S+)\s*$'
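
A rough sketch of that combination (untested; $input_path and $output_path are placeholders, and the capture group is my addition so -replace can extract the address):

# hypothetical combination: coderaven's IP-shaped pattern run through
# mjolinor's batched -match/-replace pipeline
$regex = '.*\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b.*'

get-content $input_path -ReadCount 5000 |
foreach {
    # -match filters each batch down to lines containing an IP-shaped token,
    # then -replace reduces each surviving line to the captured address
    ($_ -match $regex) -replace $regex,'$1'
} | set-content $output_path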

But, as has been stated, the -match in a ForEach-Object seems to be the faster approach.

* The “proper” regex for a valid IP address is an argument spread all over the web…
by mjolinor at 2013-04-09 07:51:02
The regex was customized for the data, and you could use either one.

Edit: Yes, I did run both scenarios through Measure-Command to get those results.
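
The harness was just Measure-Command wrapped around each pipeline, something like this (a sketch; $regex, $input_path and $output_path as defined above):

# batched read plus array -match/-replace
$batched = Measure-Command {
    get-content $input_path -ReadCount 5000 |
    foreach { $_ -match $regex -replace $regex,'$1' } |
    sc $output_path
}
# line-by-line select-string producing MatchInfo objects;
# Groups[1] pulls the captured token so both runs do the same extract
$selected = Measure-Command {
    select-string -Path $input_path -Pattern $regex -AllMatches |
    % { $_.Matches } | % { $_.Groups[1].Value } > $output_path
}
'{0:N1}s vs {1:N1}s' -f $batched.TotalSeconds, $selected.TotalSeconds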

I suspect the select-string is slower for a couple of reasons. One is that it’s creating matchinfo objects, which are fairly rich objects that provide a lot of information and functionality like pre-context and post-context. It’s a wonderful thing if you need that, but kind of wasted overhead if you don’t. The other factor is that I think select-string is probably doing a foreach through the file internally. This is easy on memory (you only have to hold one line in memory at a time), but not so great for response time.
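
For example, each hit comes back with the line number and surrounding lines attached when you ask for them (a quick sketch; the pattern and -Context values here are arbitrary):

$hit = select-string -Path $input_path -Pattern 'Client Address' -Context 1,1 |
    select -First 1
$hit.LineNumber          # line the match was found on
$hit.Context.PreContext  # the line before it
$hit.Context.PostContext # the line after it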

The -replace operator will do an entire array at once, and just returns the strings. You could read the entire file in and do a single replace on that and get all the results at once, but this is hard on memory consumption, and with large files seems to get slower. I think this is due to disk contention: you’re trying to read the file while memory management is writing that same data back out to the swap file.

The get-content with -ReadCount is a compromise. It will start handing you arrays of however many lines you’ve specified in the ReadCount. That requires enough memory to hold one array at a time, and you can tune it to suit your available resources by altering the ReadCount number. Higher numbers take more memory but reduce the number of iterations. The “sweet spot” is the largest batch that fits in memory without spilling into the swap file.
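
If you want to find that sweet spot empirically, loop over a few batch sizes and time each one (a sketch, not a definitive benchmark; the sizes are arbitrary):

# time the same extract at several -ReadCount settings
foreach ($rc in 1000, 5000, 20000, 50000) {
    $t = Measure-Command {
        get-content $input_path -ReadCount $rc |
        foreach { $_ -match $regex -replace $regex,'$1' } |
        sc $output_path
    }
    '{0,6} lines per batch : {1:N1} s' -f $rc, $t.TotalSeconds
}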

IMHO.
by nohandle at 2013-04-10 06:49:40
Mjolinor: Do you still have the test data? I was just wondering if you could measure this approach for me :)
gc $input_path | foreach {$_.split(' ')[-1].trim()} | sc $output_path
by mjolinor at 2013-04-10 07:16:31
That comes in at about 20 seconds on my system.

Not really unexpected, since the text split and trim methods are more efficient than regex, and are probably preferable where you can use them. The regex is much more flexible, and you can leverage the ability to use it with arrays to make up the performance difference.
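
For comparison, batching the split approach with -ReadCount takes an inner loop, since .split() is a string method rather than an array operator (untested sketch; paths as placeholders):

get-content $input_path -ReadCount 5000 |
foreach {
    # each $_ here is an array of up to 5000 lines
    foreach ($line in $_) { $line.split(' ')[-1].trim() }
} | sc $output_path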

FWIW you can re-create the test data I’m using like this:

(@'
D:\SecLogs\dc1_20130302_0630.txt:2889: Client Address: 162.128.10.75
D:\SecLogs\dc1_20130302_0630.txt:2931: Client Address: 122.178.120.75
D:\SecLogs\dc1_20130302_0630.txt:3163: Client Address: 129.163.160.208
D:\SecLogs\dc1_20130302_0630.txt:3456: Source Network Address: 122.166.170.75
D:\SecLogs\dc1_20130302_0630.txt:3526: Client Address: 102.128.150.75
D:\SecLogs\dc1_20130302_0630.txt:3562: Client Address: 111.163.110.175
D:\SecLogs\dc1_20130302_0630.txt:3598: Client Address: 172.168.170.75
D:\SecLogs\dc1_20130302_0630.txt:3655: Client Address: 132.168.131.61
D:\SecLogs\dc1_20130302_0630.txt:3691: Client Address: 152.168.129.93
D:\SecLogs\dc1_20130302_0630.txt:3727: Client Address: 112.168.160.233
'@).split("`n") |
#where {$_ -notmatch '^#'} |
foreach {$_.trim()} | sv text

($text * 2e4) | sc <output file path>

That’s using text copied from the post upstream.
I’ve got an add-in I wrote for the V3 ISE that does most of the work for you:

http://mjolinor.wordpress.com/2011/12/31/clip-toarray-for-ps-v3-ise/
by nohandle at 2013-04-10 08:04:04
Thanks a lot.

I did create a test data set using a very similar approach; testing it on your station makes the results relevant. (I could of course retest all of them on mine but you know… I am lazy :D)
by mjolinor at 2013-04-10 08:12:27
I’m using about a three-year-old laptop with, IIRC, a 5600 RPM hard drive, and this is a pretty disk-intensive test. Somebody with 15K drives or an SSD could probably get substantially better results. For benchmark comparisons it’s important to keep everything the same between tests.