Specs
• OS: Windows 7
• PSVersion 3.0
Situation
Comparing objects (about 30,000 items) in an array with 4 other big arrays.
All the arrays share a unique ComputerName value, but each dataset has other data that needs to be added/combined with the 1st array, and sometimes edited in the process.
The Problem
Speed! The solution I have now is slow, and I am looking for a faster approach.
Tried solutions
Compare-Object can’t do the trick in this situation, or I might have missed a lesson.
ForEach in a ForEach
—a. This is slow.
[array]::IndexOf
—a. This is faster than the ForEach in a ForEach.
—b. Readability is a downside.
.INPUT
Example DataSet 1 in CSV, up until 5,000 items
Username;Computername
UniqueUser1;UniqueComputer1
UniqueUser2;UniqueComputer2
UniqueUser3;UniqueComputer3
Example DataSet 2 in CSV, up until 5,000 items
Username;KeyData
UniqueUser1;KeyData101
UniqueUser2;KeyData102
UniqueUser3;KeyData103
Code Sample 1
#region Foreach in a Foreach
$DataSet1 = Import-Csv -Path C:\_Temp\DataSet1.csv -Delimiter ';'
$DataSet2 = Import-Csv -Path C:\_Temp\DataSet2.csv -Delimiter ';'
$DataSet1 | Add-Member -MemberType NoteProperty -Value '' -Name 'KeyData'
Measure-Command -Expression {
    Foreach ($CSVDataSet1_Line in $DataSet1) {
        Foreach ($CSVDataSet2_Line in $DataSet2) {
            If ($CSVDataSet1_Line.Username -eq $CSVDataSet2_Line.Username) {
                $CSVDataSet1_Line.KeyData = $CSVDataSet2_Line.KeyData
            }# If
        }# Foreach 2
    }# Foreach 1
}# Measure
#endregion
Minutes : 2
Seconds : 47
Milliseconds : 166
Code Sample 2
#region Index of
$DataSet1 = Import-Csv -Path C:\_Temp\DataSet1.csv -Delimiter ';'
$DataSet2 = Import-Csv -Path C:\_Temp\DataSet2.csv -Delimiter ';'
$DataSet1 | Add-Member -MemberType NoteProperty -Value '' -Name 'KeyData'
Measure-Command -Expression {
    Foreach ($CSVDataSet1_Line in $DataSet1) {
        $RecordID = [array]::IndexOf(
            $DataSet2.Username,
            $CSVDataSet1_Line.Username
        )
        If ($RecordID -ne -1) {
            If ($CSVDataSet1_Line.Username -eq $DataSet2[$RecordID].Username) {
                $CSVDataSet1_Line.KeyData = $DataSet2[$RecordID].KeyData
            }
        }
    }# Foreach
}# Measure
Minutes : 0
Seconds : 40
Milliseconds : 286
#endregion
.OUTPUT
Username;Computername;KeyData
UniqueUser1;UniqueComputer1;KeyData101
UniqueUser2;UniqueComputer2;KeyData102
UniqueUser3;UniqueComputer3;KeyData103
Has anyone else struggled with a similar problem and found a faster solution?
donj
April 24, 2017, 8:14am
2
I’m not really sure there is, short of custom-writing something in C# so you can get the benefit of compiled, vs interpreted, execution. Comparing huge sets is naturally memory- and processor-intensive. If there’s a way to format the data so Compare-Object will work, that’d be the quickest solution. Honestly, 2-3 minutes isn’t bad given the size of the data and the fact that you’re in an interpreted, dynamic language.
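For completeness, here is a minimal sketch of what a Compare-Object attempt might look like on the example data (assuming the semicolon-delimited CSVs from the original post). Note that when comparing on a single -Property, the output objects carry only that property unless -PassThru is used, which is part of why the cmdlet doesn’t directly solve the merge problem:

# Hypothetical sketch -- matching on Username with Compare-Object
$DataSet1 = Import-Csv -Path C:\_Temp\DataSet1.csv -Delimiter ';'
$DataSet2 = Import-Csv -Path C:\_Temp\DataSet2.csv -Delimiter ';'

# -IncludeEqual -ExcludeDifferent keeps only Usernames present in both sets
$common = Compare-Object -ReferenceObject $DataSet1 -DifferenceObject $DataSet2 `
                         -Property Username -IncludeEqual -ExcludeDifferent

# $common holds objects with only Username (plus SideIndicator), so the
# KeyData still has to be looked up afterwards -- the merge work remains.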
It seems your data doesn’t have a complex structure, so you can try converting the arrays to hashtables and comparing keys.
I tried to simulate your case, and here is what I got:
D:\> $DataSet1 = 1..30000 | Foreach-Object { [PSCustomObject]@{UserName="User$_"; ComputerName="Computer$_"; KeyData='placeholder' } }
D:\> $DataSet2 = 1..5000 | Foreach-Object { $rnd = Get-Random -min 1 -max 30000; [PSCustomObject]@{username="User$rnd"; KeyData="KeyData $_ $rnd" } }
D:\> $DataSet1 | ForEach-Object -Begin { $hash=@{} } -Process { $hash[$_.username] = $_ }
D:\> measure-command {
    foreach ($d2 in $DataSet2) {
        if ($hash.ContainsKey($d2.username)) {
            $hash[$d2.username].KeyData = $d2.KeyData
        }
    }
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 54
Ticks : 545249
TotalDays : 6,31075231481482E-07
TotalHours : 1,51458055555556E-05
TotalMinutes : 0,000908748333333333
TotalSeconds : 0,0545249
TotalMilliseconds : 54,5249
D:\> $result = $DataSet1 | ? { $_.keydata -ne 'placeholder' }
D:\> $result.count
4616
D:\> $result[0]
UserName ComputerName KeyData
-------- ------------ -------
User2 Computer2 KeyData 1785 2
Don’t mind the result count; it is less than 5000 because of the random numbers, but I think you will be impressed by the speed. Even with the hash creation it’s less than a second.
Here is the same concept as the one reposted by Don on behalf of someone else, just cleaned up a little to work with CSVs and to not drop records that don’t need to combine data.
DataSet1.csv
Username,Computername
UniqueUser1,UniqueComputer1
UniqueUser2,UniqueComputer2
...
UniqueUser4999,UniqueComputer4999
UniqueUser5000,UniqueComputer5000
DataSet2.csv
UserName,KeyData
UniqueUser2613,KeyData2613
UniqueUser2736,KeyData2736
...
UniqueUser4313,KeyData4313
UniqueUser1695,KeyData1695
Code
Measure-Command {
    $d1 = Import-Csv "C:\Temp\DataSet1.csv"
    $d2 = Import-Csv "C:\Temp\DataSet2.csv"
    $hash = @{}
    $d1 | ForEach-Object {
        $hash[$_.UserName] = [pscustomobject]@{
            UserName     = $_.UserName
            ComputerName = $_.ComputerName
            KeyData      = $null
        }
    }
    $d2 | ForEach-Object {
        If ($hash[$_.UserName]) {
            $hash[$_.UserName].KeyData = $_.KeyData
        } Else {
            $hash[$_.UserName] = [pscustomobject]@{
                UserName     = $_.UserName
                ComputerName = $_.ComputerName
                KeyData      = $_.KeyData
            }
        }
    }
    $hash.Values | Export-Csv "C:\Temp\CombinedDataSet.csv" -NoTypeInformation
}
Measure Results
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 238
Ticks : 12380238
TotalDays : 1.43289791666667E-05
TotalHours : 0.0003438955
TotalMinutes : 0.02063373
TotalSeconds : 1.2380238
TotalMilliseconds : 1238.0238
CombinedDataSet.csv Results
"UserName","ComputerName","KeyData"
"UniqueUser2613","UniqueComputer2613","KeyData2613"
"UniqueUser2736","UniqueComputer2736","KeyData2736"
...
"UniqueUser4313","UniqueComputer4313","KeyData4313"
"UniqueUser1695","UniqueComputer1695","KeyData1695"
Big thanks to the spam poster! Too bad the formatting was off, but with the rewrite from Curtis I managed to create a working solution.
I still need to rewrite the script and see if I can make other values like DateTime work, or work around them, but the speed is just incredible!
@Don , ( Big fan btw, keep writing books and Pluralsight videos, they’re great! )
The example data took a couple of minutes, but the actual script I am running takes about 30-60 minutes.
DataSet1 = 14,000 - 17,000 items
And I need to add data from 3 other data sets where the ComputerName matches.
DataSet2 = 20,000 - 40,000
DataSet3 = 5,000 - 10,000
DataSet4 = 300 - 1,000
Example solution for anyone reading the article later on:
#Import Example Data
$DataSet1 = Import-Csv -Path C:\_Temp\DataSet1.csv -Delimiter ';'
$DataSet1 | Add-Member -MemberType NoteProperty -Value 'placeholder' -Name 'KeyData'
$DataSet2 = Import-Csv -Path C:\_Temp\DataSet2.csv -Delimiter ';'
#Create Empty Hash Table
$hash=@{}
#Creating $Hash with Data From $DataSet1
$DataSet1 | ForEach-Object -Process { $hash[$_.username] = $_ }
#Name Value
#---- -----
#UniqueUser1 @{Username=UniqueUser1; Computername=UniqueComputer1; KeyData=placeholder}
measure-command {
    foreach ($d2 in $DataSet2) {
        if ($hash.ContainsKey($d2.username)) {
            $hash[$d2.username].KeyData = $d2.KeyData
        }
    }
}
#Seconds : 0
#Milliseconds : 35
#Outputting $Hash values to CSV File
$hash.values | Export-Csv "C:\_Temp\CombinedDataSet.csv" -NoTypeInformation -Force -Delimiter ';'
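Since the real script has to merge three more data sets on ComputerName (per the sizes listed above), the same hashtable trick extends naturally: build the lookup once from DataSet1, then run one cheap linear pass per extra set. A sketch, assuming the extra CSVs are semicolon-delimited and share a ComputerName column (the file names here are hypothetical):

# Build the lookup once, keyed on ComputerName
$DataSet1 = Import-Csv -Path C:\_Temp\DataSet1.csv -Delimiter ';'
$hash = @{}
foreach ($row in $DataSet1) { $hash[$row.ComputerName] = $row }

# One linear pass per additional data set (example file names)
foreach ($file in 'DataSet2.csv', 'DataSet3.csv', 'DataSet4.csv') {
    $extra = Import-Csv -Path "C:\_Temp\$file" -Delimiter ';'
    foreach ($row in $extra) {
        if ($hash.ContainsKey($row.ComputerName)) {
            # Copy every column except the key onto the matching record
            foreach ($prop in $row.PSObject.Properties) {
                if ($prop.Name -ne 'ComputerName') {
                    $hash[$row.ComputerName] |
                        Add-Member -MemberType NoteProperty -Name $prop.Name `
                                   -Value $prop.Value -Force
                }
            }
        }
    }
}

$hash.Values | Export-Csv "C:\_Temp\CombinedDataSet.csv" -NoTypeInformation -Delimiter ';'

The -Force switch on Add-Member overwrites a property if a later data set supplies the same column name; drop it if the first value should win.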
that spammer was me
In my experience, with arrays of about 100,000 records this method works very well. But some additional speedup for the complete set of data can be achieved if the $data | foreach {} pattern is replaced with foreach ($i in $data) {}.
Here are some speed measurements for array filtering:
# Prepare to filter 50 records from 100000 by name with different methods:
#records array
$all = 1..100000 | %{ [PSCustomObject]@{ID=$_; Name="Name$_"; Guid=[Guid]::NewGuid().ToString()} }
#records to filter
$wn = 1..50 | %{ Get-Random -Minimum 1 -Maximum 99999 }
$wn = $all[$wn]
# filter by regex
$m = '^' + ($wn.Name -join '$|^') + '$'
$wn_name = $wn.Name
# filter by hash
$hash = @{}; $wn.Name | %{ $hash[$_] = 1 }
# Measurements:
# where-object + regex
C:\> measure-command { $all | ?{ $_.Name -match $m} }
Seconds : 3
Milliseconds : 50
# where-object + contains
C:\> measure-command { $all | ?{ $wn_name -contains $_.Name } }
Seconds : 3
Milliseconds : 682
#where-object + hash
C:\> measure-command { $all | ?{ $hash.ContainsKey($_.Name) } }
Seconds : 2
Milliseconds : 675
# .where method + hash
C:\> measure-command { $all.where( { $hash.ContainsKey($_.Name) } ) }
Seconds : 0
Milliseconds : 870
# foreach + regex
C:\> measure-command { foreach ($a in $all) { if ( $a.Name -match $m ) { $a } } }
Seconds : 0
Milliseconds : 648
# foreach + contains
C:\> measure-command { foreach ($a in $all) { if ( $wn_name -contains $a.Name ) { $a } } }
Seconds : 1
Milliseconds : 245
# foreach + hash
C:\> measure-command { foreach ($a in $all) { if ( $hash.ContainsKey($a.Name) ) { $a } } }
Seconds : 0
Milliseconds : 293
and…
I don’t know what the problem with DateTime is, but it can always be converted from a string with [DateTime]::Parse() and [DateTime]::ParseExact().
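For example (a sketch with a made-up date format; the real format string has to match whatever the CSV actually contains):

# Culture-independent parsing of a known, fixed format
$stamp = [DateTime]::ParseExact('24-04-2017 08:14', 'dd-MM-yyyy HH:mm',
    [System.Globalization.CultureInfo]::InvariantCulture)

# [DateTime]::Parse() uses the current culture's rules instead,
# so a string like '24-04-2017' may fail or flip day/month on other machines
$loose = [DateTime]::Parse('2017-04-24')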