I’m practicing extracting data from a text file and exporting it as columns to a CSV file in PowerShell ISE, and I’m confused as to why what the title describes is happening. I’m getting the data from "Multinational list of popular first names and surnames?", specifically the archive link in the answer with 24 votes and the text file nam_dict. I removed the text at the top that explains what the file is, leaving only the gender, name, numbers, and $. The modified text file was too big to upload to Pastebin or here, so file.io it is.
Example of what it looks like:
M Aad 4 $
M Aadam 1 $
F Aadje 1 $
M Ådne + 1 $
M Aadu 12 $
?F Aaf 1 $
F Aafke 4 $
? Aafke 1 $
F Elv<i/>ra 4 $
M <Z^>ydr<u/>nas 5 $
M <Z^>ygantas 1 $
M Zygfryd 1 $
M <Z^>ygimantas 2 $
I then feed this into ISE:
Get-Content .\names_dict_master.txt | ForEach-Object {
    $sex = ($_ -split ' ' | Select-String -Pattern '^\?M$|^\?F$|^M$|^F$|\?$').Matches.Value
    $name = ($_ -split ' ' | Select-String -Pattern '\p{L}').Value -join ''
    [PSCustomObject]@{
        Sex = $sex
        Name = $name
    }
} | Export-Csv -Path .\names_dict_master_powershell.csv -Delimiter ',' -NoTypeInformation
and get the following (not the complete file, but this is what it looks like):
"Sex","Name"
"M",""
"M",""
"F",""
"M",""
"M",""
"?F",""
"F",""
"?",""
"F",""
"M",""
"M",""
I don’t get why some of them return blank under the first column or just don’t match.
I’ve also tried .Matches.Value instead of -join '' on name, but that didn’t work and returns mostly System.Object[]. From reading "Regular expression to match non-ASCII characters?" on Stack Overflow, I know that \p{L}, \p{L}\p{M}*+ and /[\p{L}-]+/ug are options. I tried the first one, and that’s the result in the Pastebin file. The second one gave me an error:
At line:3 char:30
+ ... name = ($_ -split ' ' | Select-String -Pattern '\p{L}\p{M}*+') -join ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [Select-String], ArgumentException
+ FullyQualifiedErrorId : InvalidRegex,Microsoft.PowerShell.Commands.SelectStringCommand
The third one, also from "Regular expression to match non-ASCII characters?", gives me the same result as the first one. Keeping the names as they originally look would be ideal, so I’m probably going to have to add characters like <> to the regex. It would make sense if only some characters were excluded, as in <Z^>ydr<u/>nas, but instead everything is excluded.
Thank you to anyone who replies. I’m pretty sure I’m making the columns correctly; I just have no clue why some values are returned and some aren’t.