String split on CRLF produces extra member

by stocksp at 2012-08-21 10:46:00

I have a string consisting of <div> elements separated by CRLF.
Stored in $tmp.

If I just do
$lines = $tmp.Split("`r`n")

$lines contains an extra blank line between each <div>.
$lines[0] is correct, $lines[1] is a blank line, and so on through the whole file.

To get what I want I’m using
$lines = $tmp -replace("`r`n", "|")
$lines = $lines.Split("|")

Which I know is ugly…
How can I get a ‘clean’ array (no blank lines) without the -replace ‘hack’?
by willsteele at 2012-08-21 11:02:44
Because of the way the CR and LF characters are handled, this can be a little challenging. I have fought this before. An alternative may be to split on a character code instead of an escaped character. CR is 0x0D and LF is 0x0A. Perhaps splitting on one or the other of those, instead of both, could help.

$lines = $tmp -split 0x0D

Without knowing your exact data I am guessing a bit, but this will probably work.
by DonJ at 2012-08-21 11:15:05
I’m curious, how did you read in the string to begin with? I ask because Get-Content, when reading a text file, will normally handle this for you, putting each line into its own object. Did you maybe query this from a Web server or something?
by stocksp at 2012-08-21 11:28:42
My data is very simple. It looks like this in Notepad++:

<div style="position: absolute; top: 170px; left: 32px; width:7px; font:8pt Arial; color: #000000">1</div>
<div style="position: absolute; top: 170px; left: 54px; width:13px; font:8pt Arial; color: #000000">15</div>
<div style="position: absolute; top: 170px; left: 76px; width:13px; font:8pt Arial; color: #000000">14</div>

If I ‘show symbols’, the editor shows a ‘CR’ and ‘LF’ at the end of each line. Very standard stuff.

I tried
$lines = $tmp -split 0x0D, 0x0A

and it ‘almost’ works … a couple of the lines are mangled (missing <div>s).

I assume I’m not passing both characters correctly to -split.
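
A likely cause of the mangling, offered as a sketch rather than a diagnosis of the actual file: -split converts its right-hand operand to a string, so the integer literal 0x0D becomes the pattern "13" rather than a carriage return. In the comma form, the second value is read as the maximum number of substrings, not as a second delimiter. The $sample string and variable names below are made up for illustration.

```powershell
# Hypothetical sample: two short lines separated by CRLF.
$sample = "width:13px`r`nwidth:7px"

# 0x0D is the integer 13, which -split turns into the regex pattern "13",
# so the split happens inside "13px" instead of at the carriage return.
$byNumber = $sample -split 0x0D

# Casting to [char] first yields an actual CR character as the delimiter.
$cr = [char]0x0D
$byChar = $sample -split $cr

$byNumber[0]   # width:
$byChar[0]     # width:13px
```

Any line containing "13" (such as width:13px) would be cut in the wrong place, which would match the missing pieces described above.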
by stocksp at 2012-08-21 11:36:28
The data I’m working with is really nasty HTML that a program is spitting out (it’s an image of a print file). I need it as a single string for removing large chunks of garbage. Once I’ve stripped it down to the area I’m after, then I can break it up into lines.
by poshoholic at 2012-08-21 11:38:50
This issue is easy to resolve once you understand what is happening behind the scenes.

When you use the System.String Split method and you pass it "`r`n", you’re calling the Char[] overload of this method. That overload allows you to pass in an array of characters, and it will split the string on any character it finds in that array. By passing in "`r`n", it will split on "`r" and it will also split on "`n". That is why you end up with extra blank entries. To fix this you need to do one of the following:

Option A: Force it to split on entire strings, not characters.

[script=powershell]$lines = $tmp.Split([string[]]"`r`n", 'None')[/script]

Option B: Use the regex -split operator instead.

[script=powershell]$lines = $tmp -split "`r`n"[/script]

I prefer option B, and personally I use a slightly modified version of it like this:

[script=powershell]$lines = $tmp -split "`r`n|`r|`n"[/script]

This version splits a string on the `r`n combo first, then it checks for `r by itself, and then `n by itself. I’ve dealt with strings with newline characters coming from enough sources to know that you don’t always get `r`n as a pair of characters for newlines, so I prefer the robustness of that last technique to make sure I get the results I want no matter what the source is.
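
To see the difference between these approaches side by side, here is a minimal sketch. The sample string is invented (it is not the poster’s actual data) and deliberately mixes a CRLF with a bare LF. The explicit [char[]] cast is an assumption added here so the character-array behaviour is reproduced regardless of which Split overload a given PowerShell version binds to for a plain string argument.

```powershell
# Hypothetical stand-in for $tmp: three <div> lines, the first ending in CRLF,
# the second in a bare LF.
$tmp = "<div>1</div>`r`n<div>15</div>`n<div>14</div>"

# Char[] behaviour: splits on CR or LF individually, so the CRLF pair
# leaves an empty string between its two characters.
$byChars = $tmp.Split([char[]]"`r`n")

# Option A: treat "`r`n" as one whole-string delimiter.
$byString = $tmp.Split([string[]]"`r`n", 'None')

# Option B: regex split on the CRLF pair only.
$byRegex = $tmp -split "`r`n"

# Robust variant: CRLF first, then a lone CR, then a lone LF.
$byAny = $tmp -split "`r`n|`r|`n"

$byChars.Count    # 4 (includes one empty string)
$byString.Count   # 2 (the bare LF is not a delimiter)
$byRegex.Count    # 2
$byAny.Count      # 3 (one entry per <div>)
```

Only the last form yields one entry per <div> for this mixed input, which is the robustness argument made above.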
by willsteele at 2012-08-21 11:45:49
Ah, that second option was one I recall having seen Mjolinor use. Thanks for pointing that one out Kirk. Good approach.