How to optimize code that tries to remove rows from collection

Hello.

I have a collection called $shortlog with around 250,000 rows. It contains the fields date, time, csusername, and cip. I need to remove all lines where, for the same csusername, all cip values are the same. Or, to say it differently, I want to keep only records for users who connected from more than one IP address. This is how I do it:

$users = $shortlog | Select-Object csusername | Sort-Object -Property csusername | Get-Unique -AsString
foreach ($user in $users) {
    $name = $user.csusername
    $count = ($shortlog | Where-Object {$_.csusername -eq $name} | Select-Object cip | Sort-Object -Property cip | Get-Unique -AsString | Measure-Object).Count
    if ($count -eq 1) {
        $shortlog = $shortlog | Where-Object {$_.csusername -ne $name}
    }
}
Could this be optimized? With approximately 350 users, I am going to search the 250,000-row collection 350 times.
Thank you
Best regards
Honza

I think it can. I don’t have a large data set to prove it, but I count 9 pipes in your script. You can think of each pipe as a loop iterating over the data sent to it, so there is a good deal of overhead. In contrast, my solution below has 3 pipes. To know for sure, you can wrap each solution in a script block and use the Measure-Command cmdlet to determine exactly which one is more efficient.
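For example, timing the two approaches side by side might look like this (the script-block bodies are placeholders for your loop and my pipeline):

```powershell
# Wrap each approach in a script block; Measure-Command returns a TimeSpan.
$original = {
    # ... your original foreach/Where-Object loop over $shortlog here ...
}
$grouped = {
    # ... the Group-Object version here ...
}
"Original: {0:n1}s" -f (Measure-Command $original).TotalSeconds
"Grouped : {0:n1}s" -f (Measure-Command $grouped).TotalSeconds
```

Run each a few times and ignore the first pass, since it may include one-off costs like JIT and disk caching.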

$testData = @'
date,time,csusername,cip
1/5/21,10:44,me,192.168.1.1
1/5/21,10:22,you,192.168.1.2
1/4/21,10:00,me,192.168.1.1
1/4/21,9:00,me,192.168.1.2
1/3/21,8:00,you,192.168.1.2
'@ | ConvertFrom-Csv


$testData | 
    Group-Object csusername, cip | 
        Where-Object count -gt 1 | 
            Select-Object -ExpandProperty Group


That is not it. The result of your script is:

date   time  csusername cip
----   ----  ---------- ---
1/5/21 10:44 me         192.168.1.1
1/4/21 10:00 me         192.168.1.1
1/5/21 10:22 you        192.168.1.2
1/3/21 8:00  you        192.168.1.2

but the result I need is:

date   time  csusername cip
----   ----  ---------- ---
1/5/21 10:44 me         192.168.1.1
1/4/21 10:00 me         192.168.1.1
1/4/21 9:00  me         192.168.1.2

csusername “me” connected from more than one IP address, so I need all of his records.

csusername “you” connected from a single IP address, so I do not need any of his records.

Honza

I get it now. Sorry about that. It’s a little more complicated, but I got it down to 5 pipes. Let me know if it is more efficient.

$testData = @'
date,time,csusername,cip
1/5/21,10:44,me,192.168.1.1
1/5/21,10:22,you,192.168.1.2
1/4/21,10:00,me,192.168.1.1
1/4/21,9:00,me,192.168.1.2
1/3/21,8:00,you,192.168.1.2
'@ | ConvertFrom-Csv


$users = $testData | 
    Group-Object csusername, cip -NoElement | 
        Group-Object @{e={($_.name -split ",")[0]}} |
            Where-Object Count -gt 1 |
                Select-Object -ExpandProperty Name

$Results = $testData | Where-Object csusername -in $users


Actually, if there is flexibility in the output format, you can save the final pipe by including the group. This will give you a list of each user and unique IP with a count of how many times they had that IP, but the details of each record (date, time) would be in the Group property.

$testData | 
    Group-Object csusername, cip | 
        Group-Object @{e={($_.name -split ",")[0]}} |
            Where-Object Count -gt 1 |
                Select-Object -ExpandProperty group

Or this may be more efficient (since the data is already filtered), and it includes the details:

$testData | 
    Group-Object csusername, cip | 
        Group-Object @{e={($_.name -split ",")[0]}} |
            Where-Object Count -gt 1 |
                Select-Object -ExpandProperty group |
                    Select-Object -ExpandProperty group


What kind of witchcraft is this?

My version ran for 2 hours, 16 minutes, 35 seconds. Your version ran for 13 seconds, and the results seem to be the same. I am shocked! I need to check the results again. I liked playing with PowerShell, but after today … I should not touch it again.

Thank you

Honza


Well, I’m no witch doctor, but glad I could help. :-) Let me know if you have questions or need me to break anything down for you.

I have double-checked it and the results match. Impressive. I still do not understand your solution, but obviously I cannot understand how to fly when I have just invented the wheel. How do I mark your reply as the answer? Is that function available here?

I might have another part that would benefit from optimization, if you are willing to help. But it is long, and as I am not a programmer, it might be difficult to understand. So I will try to create a simpler example:

I am working with the same collection. Let us say that the following data are part of $shortlog:

date,time,csusername,cip
1/5/21,10:44,me,192.168.1.1
1/5/21,10:22,you,192.168.1.2
1/4/21,10:00,me,192.168.1.1
1/4/21,9:00,me,192.168.1.2
1/3/21,8:00,you,192.168.1.2

I need to create a list of distinct users:

$users = $shortlog | Select-Object csusername | Sort-Object -Property csusername | Get-Unique -AsString
And for each user I need to get the rows with his/her records and compare every pair of records in the collection. That is, I need to compare:

A-B, A-C, A-D …

B-C, B-D, B-E …

C-D, C-E, C-F …

etc.

I have created a copy of the collection and removed its first row; after each pass of the outer cycle I remove the first row of the second collection again. It looks like this:

foreach ($user in $users) {
    $name = $user.csusername
    $collection1= $shortlog | Where-Object {$_.csusername -eq $name}
    $collection2 = $collection1
    $collection2 = {$collection2}.invoke()
    foreach ($collrow1 in $collection1) {
        $ip1 = $collrow1.cip
        if ($null -ne $collection2) {
            $collection2.RemoveAt(0)
            foreach ($collrow2 in $collection2) {
                $ip2 = $collrow2.cip
                if ($ip1 -ne $ip2) {"There would be some processing here"}
            }
        }
    }
}
In the sample collection with two users it is no big deal. But if I have 350 users, the main cycle runs 350 times, and the inner cycles for a user with 500 records run 124,750 times [500! / (2! × (500 − 2)!) = 124,750]. So in total I am making tens of millions of comparisons, and it takes too long.
I am just hoping that this makes some sense. I can post the original part of my script if it helps. It is no secret; I just keep my brain busy. :-)
Is there a trick for this?
Thank you.
Honza

Honza, glad the results worked out for you. These forums do not have a mark-as-answer feature.

I might be able to help with your second issue, but I am still not quite clear on your requirements. It looks like you are comparing unique IPs for each user. Is that right? In other words, for each unique IP the user has, do some processing. To do that you can just use the -Unique switch parameter of Select-Object.

$testData | 
    Select-Object csusername, cip -Unique |
        ForEach-Object {
            #do processing here access object properties with $_.csusername and $_.cip
            #if this turns in to more than a couple lines of code I would make it a
            #foreach loop instead of the foreach-object cmdlet
        }

While I’m waiting for clarification on that: the “trick” in the previous solution is the Group-Object cmdlet. When you pipe a collection to Group-Object and specify the properties to group on, it returns a collection of objects with 4 properties. I really only use 3 of those: Count, Name, and Group. Count is the number of objects in the group, Name is the properties it was grouped on (as a comma-delimited string), and Group is a collection of all the objects in that group.
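A quick illustration of those properties, using the same kind of test data (what matters here is the shape of the output, not the exact formatting):

```powershell
$testData = @'
date,time,csusername,cip
1/5/21,10:44,me,192.168.1.1
1/5/21,10:22,you,192.168.1.2
1/4/21,9:00,me,192.168.1.2
'@ | ConvertFrom-Csv

$groups = $testData | Group-Object csusername
$groups | Select-Object Count, Name            # one row per user, with how many log lines they have
($groups | Where-Object Name -eq 'me').Group   # the original rows for user "me"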


Mike, I will try to explain. The MS Exchange server is my hobby. Strange, I know. I have decided to create reporting/alerting for suspicious logons to an on-premises Exchange server. That means logons that occurred from places farther apart than a person could travel in the time between them.

I know there are tools for that. But I like to build it on my own. It helps me learn about IIS logs, PowerShell, and much more. And it actually works already. It is not a piece of art, but it works. At 9 AM I take the IIS logs from the previous and current day and process them. Thanks to your optimization it takes under 3 hours to process.

What it does:

Reads the log files and stores the lines containing Microsoft-Server-ActiveSync, EWS, or OWA in the variable $iislog.

Then I create $shortlog, as I need only several columns. Some conditions are redundant, but they do not hurt.

$shortlog = $iislog | Where-Object {
        ($_.csuseragent -notlike "MicrosoftNinja*") -and
        ($_.scstatus -eq 200) -and
        ($_.csusername -ne "-") -and
        ($_.cip -ne "127.0.0.1") -and
        ($_.cip -ne "::1") -and
        (($_.csuristem -like "*OWA*") -or ($_.csuristem -like "*EWS*") -or ($_.csuristem -like "*Microsoft-Server-ActiveSync*"))
    } | Select-Object date, time, csusername, cip
Then I unify the usernames, because users use different account formats: firstname.lastname@domain, firstname.lastname, domain\firstname.lastname, etc. I keep just firstname.lastname in the variable, all in lowercase.
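A normalization like that can be sketched with two regex replacements (a sketch only; adjust the patterns to your actual account formats):

```powershell
# Normalize 'DOMAIN\first.last', 'first.last@domain' and 'first.last' to 'first.last'
$shortlog | ForEach-Object {
    $_.csusername = ($_.csusername -replace '^.*\\' -replace '@.*$').ToLower()
}
```

The first replacement strips any `domain\` prefix, the second strips any `@domain` suffix, and `ToLower()` takes care of case differences.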

I also replace our internal private IP addresses with our external IP address, because I use a public IP address database to acquire latitude and longitude, and a private address would not work.

I read data for previously found IP addresses from a CSV. If an address in the log does not match an address in the CSV, I send a request to the online database and then store the result in my CSV for next time.
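That lookup cache is usually fastest as a hashtable keyed by IP address (a sketch; the CSV path and the Get-IpLocation function are placeholders for your actual file and online lookup):

```powershell
$cachePath = 'C:\logs\ipcache.csv'   # placeholder path
$cache = @{}
if (Test-Path $cachePath) {
    # Load previously resolved addresses; hashtable lookups are O(1) per IP.
    Import-Csv $cachePath | ForEach-Object { $cache[$_.cip] = $_ }
}
foreach ($row in $shortlog) {
    if (-not $cache.ContainsKey($row.cip)) {
        # Placeholder for your request to the online geolocation database.
        $cache[$row.cip] = Get-IpLocation $row.cip
    }
}
# Persist the cache (old and newly resolved entries) for the next run.
$cache.Values | Export-Csv $cachePath -NoTypeInformation
```

This avoids re-scanning the CSV for every log row, which matters at 250,000 rows.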

I remove all users with just 1 IP address in the log. That is the part you helped with:

$shortlog = $shortlog |
    Group-Object csusername, cip |
    Group-Object @{e={($_.name -split ",")[0]}} |
    Where-Object Count -gt 1 |
    Select-Object -ExpandProperty Group |
    Select-Object -ExpandProperty Group |
    Sort-Object csusername, date, time

Then I remove redundant records from the log. If the user connected 5 times in a row from a single address, I just need the first and last occurrence.

$userrecords = $shortlog | Where-Object {$_.csusername -eq $name} | Sort-Object date, time
# Helping variable. This part makes things a lot faster. We do not need to compare all records.
# If the user connected from the same IP address several times in a row, we just need the first
# and last record; everything in between can be removed. But that is valid only for a run of
# connections that was not interrupted by a connection from another address.
# (AAAABBBBAAAACCCCAAAAADDDDD) becomes (AABBAACCAADD)
$collindex = New-Object System.Collections.ArrayList
$stillthesame = 0
foreach ($userrecord in $userrecords) {
    $ip1 = $userrecord.cip
    # We need to know the position of the record in the array so that we can delete it if needed.
    $index1 = $userrecords.IndexOf($userrecord)
    # We need to know the IP address in the next record so that we can compare it. For that we need index + 1.
    $index2 = $index1 + 1
    $ip2 = $userrecords[$index2].cip
    # Compare IP addresses
    if ($ip1 -eq $ip2) {
        # If they are the same but $stillthesame is 0, it is the first address in the run and we need
        # to keep it. But we know that we have two identical addresses, so we change the variable to 1.
        if ($stillthesame -eq 0) {
            $stillthesame = 1
        }
        # If $stillthesame is already 1, the previous and next records hold the same IP address, so this
        # record can be deleted. We just add the index to the array list; the records are removed later.
        else {
            $collindex.Add($index1) > $null
        }
    }
    # If the addresses were not the same, reset the helping variable.
    else {
        $stillthesame = 0
    }
}
# We have gathered the indexes of records that can be deleted because they contain redundant addresses.
# We need to delete them from the end: if we deleted from the start, the positions of later records
# would shift and we would delete the wrong ones. So we sort the indexes in descending order.
$collindex = $collindex | Sort-Object -Descending
# We need to remove records from the array, which is not normally possible because it has a fixed size.
# We convert it to a System.Collections.ObjectModel.Collection first.
$final = {$userrecords}.Invoke()
# Now we can remove the records at the indexes that we gathered.
Write-Host "Removing redundant records for $name"
Get-Date -Format HH:mm:ss
foreach ($index in $collindex) {
    $final.RemoveAt($index)
}


And then the main part comes:

I go user by user, and for each user I compare every pair of his records. Not just unique records: every pair. If the IP addresses are NOT the same, I calculate the distance between the two points. I have the date and time for each record, so I can compute the speed. Then I can use distance and speed to set a severity.

# Anything under a speed of 100 km/h and a distance of 300 km is not important. Increase severity with speed.
if (($speed -gt 100) -and ($distance -gt 300)) {
    $sev = "Low"
    if ($speed -gt 500) {
        $sev = "Medium"
        if ($speed -gt 1000) {
            $sev = "High"
        }
    }
    # Mobile operators usually return a location in the capital city. If there is a big distance within one country, severity stays low.
    if ($country1 -eq $country2) {
        $sev = "Low"
    }
}
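For the distance itself, the great-circle (haversine) formula works with the latitude/longitude pairs from the IP database (a sketch; inputs in decimal degrees, result in kilometres):

```powershell
function Get-DistanceKm {
    param([double]$lat1, [double]$lon1, [double]$lat2, [double]$lon2)
    $toRad = [math]::PI / 180
    $dLat = ($lat2 - $lat1) * $toRad
    $dLon = ($lon2 - $lon1) * $toRad
    # Haversine formula; 6371 km is the mean Earth radius.
    $a = [math]::Sin($dLat / 2) * [math]::Sin($dLat / 2) +
         [math]::Cos($lat1 * $toRad) * [math]::Cos($lat2 * $toRad) *
         [math]::Sin($dLon / 2) * [math]::Sin($dLon / 2)
    6371 * 2 * [math]::Atan2([math]::Sqrt($a), [math]::Sqrt(1 - $a))
}

# Prague to London, roughly a thousand kilometres:
Get-DistanceKm 50.08 14.43 51.51 -0.13
```

Dividing that distance by the time span between the two logons (in hours) gives the implied travel speed in km/h.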

And I store the results for suspicious logons.

Back to the main part. What do I do? Let us say this is the log for my account:

"date";"time";"csusername";"cip"
"2021-01-11";"13:03:14";"jan.kovar";"188.244.55.168"
"2021-01-11";"13:03:50";"jan.kovar";"37.9.192.134"
"2021-01-11";"13:03:55";"jan.kovar";"37.9.192.134"
"2021-01-11";"13:06:46";"jan.kovar";"188.244.55.168"
"2021-01-11";"13:09:50";"jan.kovar";"37.9.192.134"
"2021-01-11";"13:09:55";"jan.kovar";"37.9.192.134"
"2021-01-11";"16:35:36";"jan.kovar";"37.48.0.203"
"2021-01-11";"16:35:45";"jan.kovar";"37.9.192.134"
"2021-01-11";"16:46:04";"jan.kovar";"89.24.45.135"
"2021-01-11";"16:46:16";"jan.kovar";"37.48.0.203"
"2021-01-11";"16:48:56";"jan.kovar";"89.24.45.135"

This is the content of $collection1. I need to compare these rows, but I do not know a better way, so I create $collection2, which is the same as $collection1. I have two cycles: an outer foreach over $collection1 and an inner foreach over $collection2. I do not want to compare a row with itself, so in each pass of $collection1 I remove the first row from $collection2 with $collection2.RemoveAt(0). Like this:

foreach ($collrow1 in $collection1) {
    $ip1 = $collrow1.cip
    $date1 = $collrow1.date
    $time1 = $collrow1.time
    # Creating a date + time variable
    $dt1 = $date1 + " " + $time1
    # For the last record in collection 1, collection 2 would be empty, so check that it is not.
    if ($null -ne $collection2) {
        # Remove the first row from the second collection. The collections are the same, and we want to
        # start by comparing the first two rows, not a row with itself. The first row is removed in each
        # cycle, so we compare: A-B, A-C, B-C.
        $collection2.RemoveAt(0)
        # Compare all remaining records from collection 2 with the selected record from collection 1
        foreach ($collrow2 in $collection2) {
            $ip2 = $collrow2.cip
            $date2 = $collrow2.date
            $time2 = $collrow2.time
            # Creating a date + time variable
            $dt2 = $date2 + " " + $time2

            # Calculating the time difference in hours
            $howlong = (New-TimeSpan -Start $dt1 -End $dt2).TotalHours
            # Same IP addresses = same location, no reason to calculate distance. If they differ, calculate.
            if ($ip1 -ne $ip2) {
                "Do the math, calculate distance, speed and severity"
            }
        }
    }
}

Does it make sense?

Thank you

Honza


Wow, I admire the problem set. It looks like you need to compare objects within a collection. Specifically, assuming your array is sorted in time order by user, calculate the time span from the current row to the next as long as the next row is the same user but a different IP. Does that sound right? If so, the only way to compare objects within the same array is with a for loop instead of a foreach. Below is a simplified example of what I mean. In it I add a new property that gives you the time span to the next connection; if there is no subsequent row to compare, it is simply set to zero. What complicates it a little is that date and time are two string properties instead of a single [datetime] object, but I work around that by creating [datetime] objects when needed.

$testdata = @'
date,time,csusername,cip
1/5/21,10:44,me,192.168.1.1
1/5/21,10:22,me,192.168.1.2
1/3/21,15:23,you,192.168.1.3
1/4/21,10:00,me,192.168.1.1
1/4/21,9:00,me,192.168.1.2
1/6/21,8:00,you,192.168.1.5
1/3/21,8:00,me,192.168.1.2
1/1/21,10:00,you,192.168.1.1
'@ | ConvertFrom-Csv | Sort-Object csusername, @{e={Get-Date -Date ("{0} {1}" -f $_.date, $_.time)}}

for ($i = 0; $i -lt $testdata.Count; $i++)
{ 
    if ($i -eq $testdata.count - 1 -or 
       ($testdata[$i+1].cip -eq $testdata[$i].cip -and $testdata[$i].csusername -eq $testdata[$i+1].csusername) -or
       $testdata[$i+1].csusername -ne $testdata[$i].csusername ) {
            $span = [System.TimeSpan]0
    } #if nothing to compare
    else {
        $splat = @{
            end   = Get-Date -Date ("{0} {1}" -f $testdata[$i+1].date, $testdata[$i+1].time)
            start = Get-Date -Date ("{0} {1}" -f $testdata[$i].date, $testdata[$i].time)
        }
        $span = New-TimeSpan @splat
    } #else
    $testdata[$i] | Add-Member -MemberType NoteProperty -Name "span" -Value $span
}


Actually, the collection always contains records for only a single user. I do some processing beforehand, so I always process one user at a time. So it is really all data for me, then all data for you, and so on. Does that change anything?


BTW, this method would make the last step in your process moot, because it would account for single connections by a user.

It wouldn’t change anything; there is just no need to check for those cases or to sort on csusername.
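The per-user version of the earlier example then reduces to something like this (same adjacent-row comparison, minus the username checks; $collection1 holds one user's time-sorted records):

```powershell
for ($i = 0; $i -lt $collection1.Count - 1; $i++) {
    # Only compare when the next connection came from a different IP address.
    if ($collection1[$i + 1].cip -ne $collection1[$i].cip) {
        $span = New-TimeSpan `
            -Start (Get-Date -Date ("{0} {1}" -f $collection1[$i].date, $collection1[$i].time)) `
            -End   (Get-Date -Date ("{0} {1}" -f $collection1[$i + 1].date, $collection1[$i + 1].time))
        # distance/speed/severity processing goes here, using $span.TotalHours
    }
}
```

Note this compares each record only to the next one rather than to every later record, which is what keeps it linear instead of the combinatorial pair count.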