Help with script to search large document for Keywords and update another sheet

Hi all,

Im very new to Powershell, and im working on an app for my son who has recently been diagnosed with Crohns disease theres not much help out there for diets and so im trying to learn app coding and a bit of PS to help get him the tools he needs

Right so the why out the way, i have a big list of ingredients pulled from an API that ive put in a txt file, words are comma delaminated and i want to search for a string to find possible combinations for a set of ingredients on packets or barcodes

The quick code ive done is as below

$Path = “D:\OFG App\Ingredients”
$Files = Get-Childitem $Path -Include *.txt -Recurse | Where-Object { !($_.psiscontainer) }
$Patterns = “Straw”, “Bean”, “Soy”

$Application = New-Object -comobject word.application
$Application.visible = $False

Foreach ($File in $Files) {
Foreach ($Pattern in $Patterns){
$Document = $Application.documents.open($File.FullName,$false,$true)
$Range = $Document.content.Text
$Delimited = $Range.Split(“,”)
$Delimited | Select-String -Pattern $Pattern -AllMatches -Context 1,1
#| Select Pattern, LineNumber, Filename
$Document.close()
}
}
$Application.quit()

Just seems to hang on execution no error either just hung, any advice anyone can offer to get this to work and output these words/ strings to a CSV appending on a loop would be fantastic im just a bit out of my depth now :smiley:

Matt,
Welcome to the forum. :wave:t3:

When you post code, sample data, console output or error messages please format it as code using the preformatted text button ( </> ). Simply place your cursor on an empty line, click the button and paste your code.

Thanks in advance

How to format code in PowerShell.org 1 <---- Click :point_up_2:t4: :wink:

Is it about TXT files …

or Word documents

?
This

Could be shortened to this

$Files = Get-Childitem $Path -Include *.txt -Recurse -File

If it is about TXT files …

  • Select-String can be provided with the folder path and will search in all files in this folder.
  • Select-String can be provided with an array of patterns. You don’t have to use a loop to iterate over them.

Could you share a small excerpt of the input data and the according expected output? (both formatted as code please)

Thanks Olaf, i was originally working with *.docx and *.doc but assumed .txt should still work with the same comobject. happy to use either

Sample code from the source file looks like this, i dont really mind how the output is returned i just want to search for something such as Soy and have things like Soya, Soy Milk returned, understandably i will get random text too - this would purely be for me to get a list of variations of known ingredients for my son that he cant have


Includes a minimum of 8 (But could include nearly all) of the following, both whole and split: Adzuki Beans, Baby Lima Beans, Black Beans, Blackeye Peas, Cranberry Beans, Dark Red Kidney Beans, Garbanzos, Great Northern Bean, Large Lima beans, Light Red Kidney Beans, Navy Beans, Pink Beans, Navy Beans, Small Red Beans, White Kidney Beans, Yellow Beans, Green and Yellow Split Peas, Lentils

Chicken breast & southwest rice: [glazed chicken breast with rib meat - chicken breast with rib meat, water, salt, sodium phosphate, dextrose. glaze: water, modified corn starch, maltodextrin, methylcellulose, xanthan gum, caramel color, citric acid. brown rice - whole grain parboiled brown rice. water. gum blend thickener - water, guar gum, locust bean gum, and xanthan gum. diced tomatoes - vine-ripened tomatoes, tomato juice, less than 2% of: calcium chloride, citric acid. diced onions. diced green peppers. shredded cheddar cheese - pasteurized milk, cheese cultures, salt, enzymes, annatto, vegetable color (if colored), corn starch (anti-caking agent). chicken gravy mix - corn starch, wheat flour, chicken flavors (chicken broth, chicken fat, natural flavors, citric acid), salt, dried whey, dextrose, shiitake extract powder (shiitake extract, yeast extract), torula yeast, garlic powder, refined soybean oil, onion powder, spice, roasted onion flavor (canola oil), and oleoresin turmeric. northwest rub - spices, dehydrated garlic, sea salt, dehydrated onion, natural dehydrated lemon and spice extractive.] black beans and corn - [black beans and corn-black beans, corn, diced red peppers, diced onions, water.]
Soybean oil, sugar, soy sauce (water, wheat, soybeans, salt, sodium benzoate; less than 1/10 of 1% as a preservative), rice wine vinegar, water, toasted sesame oil, ground ginger, sodium benzoate, edta (calcium disodium).


Rolled oats, rolled wheat, agave nectar, brown sugar, whole wheat flour, dark sweet cherries (bing cherries & sunflower oil), natural cherry flavored cranberries, unsweetened coconut, walnuts, applesauce (apples & water), pomegranate juice concentrate, ch
Rolled oats, agave nectar, brown sugar, rolled wheat, whole wheat flour, blueberries, unsweetened coconut, walnuts, applesauce (apples & water), mangosteen juice concentrate, salt, soda, cinnamon, all natural vanilla flavoring, natural walnut flavoring.
Rolled oats, brown sugar, rolled wheat, growth hormone-free butter, whole wheat flour, unsweetened coconut, walnuts, maple syrup, salt, soda, natural maple flavoring.
Rolled oats, brown sugar, rolled wheat, whole wheat flour, unsweetened coconut, walnut, salt, soda, natural vanilla, corn syrup, honey, water, cinnamon,`Preformatted text`

also have the same data in xlsx if its easier to work with

Broccoli.
Pineapple, water, pineapple juice from concentrate, ascorbic acid (vitamin c [to protect color]), citric acid.
Beef broth, contains less than 1% of: salt, maltodextrin, yeast extract, beef flavor (salt, beef extract powder [beef stock, autolyzed yeast extract, salt, flavors]), monosodium glutamate, autolyzed yeast extract, caramel color, disodium inosinate, disodium guanylate.
Chicken broth, contains less than 2% of: salt, chicken fat, dextrose, hydrolyzed soy, corn and wheat gluten proteins, onion powder, modified cornstarch, autolyzed yeast, natural flavor, torula yeast, garlic powder, disodium inosinate, disodium guanylate, soybean oil, thiamin hydrochloride, tartaric acid, gum arabic, monosodium glutamate, oleoresin carrot.
Chicken broth, enriched egg noodles (wheat flour, eggs, egg whites, niacin, ferrous sulfate [iron], thiamin mononitrate, riboflavin, folic acid), chicken meat (cooked chicken meat, water, modified food starch, salt, sodium phosphate), contains less than 2% of: salt, modified food starch, chicken fat, monosodium glutamate, chicken flavor (onion powder, autolyzed yeast extract, salt, chicken broth, modified cornstarch, natural flavor), spice extract, onion powder, beta carotene (for color), garlic powder.
Water, mushrooms, soybean oil, modified food starch, wheat flour, contains less than 2% of: salt, cream, whey, monosodium glutamate, soy protein concentrate, yeast extract, natural flavor, garlic powder, titanium dioxide (color).
Tomato puree (water, tomato paste), high fructose corn syrup, water, wheat flour, contains less than 2% of: salt, modified cornstarch, potassium chloride, sea salt, citric acid, ascorbic acid, natural flavoring.
Water, prepared white beans, bacon (cured with water, salt, sugar, sodium phosphate, sodium erythorbate, sodium nitrite), tomato puree (water, tomato paste), carrots, contains less than 2% of: modified food starch, salt, sugar, wheat flour, onion powder, monosodium glutamate, dried torula yeast, natural smoke flavor, citric acid, vegetable juice concentrate (carrots, celery, onion), natural flavor, beta carotene (for color).
Chicken stock, enriched pasta (semolina [wheat], egg white, niacin, ferrous sulfate [iron], thiamin mononitrate, riboflavin, folic acid), chicken meat (cooked chicken meat, water, modified food starch, salt, sodium phosphate), carrots, contains less than 2% of: salt, modified food starch, chicken fat, celery, monosodium glutamate, flavoring, lactic acid, sugar, beta carotene (for color), onion powder, yeast extract.

I’d recommend using plain text if possible.

That actually does not help since I still didn’t get what you’re trying to achieve. :man_shrugging:t3: That’s why I asked for sample input data AND ACCORDING EXPECTED output. :wink:

i guess, when i search on Soy, ultimately i want to end up with a list of words like this

Soy
soymilk
Soyaoil
Soybread

but im expecting an output like this from the code

eda,Soy,bread
milk,soymilk, chip
oat,Soyaoil, flour
Soybread,flour,sug

As i need to add characters either side to find variations on the word “Soy”

If I save the sample data you posted in a file named “foods.txt” I can run following snippet

Select-String -Path .\Foods.txt -Pattern '\b\w*soy\w*\b' -AllMatches |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value

and get the following output:

Value
-----
soybean
Soybean
soy
soybeans

Is it this what you’re looking for?

thats amazing thank you what do the \b\w* and \w*\b’ denote

\b is a word boundary
\w is a word cahracter
* means zero or more character

Here you can read more about:

ok great i think that makes sense, so in the case where i need to expand on charachters either before or after the word how could i do that ?

iE

Soya    
Soya    
soya    
Soya    
soyan...
soya    
Soya  

I could probably do with 10 characters either side of the search term if thats possible ?

'\bsoy\w*\b'

or

'\b\w*soy\b'

:man_shrugging:t3:

Here you can paste your input data and “play” with the regex to get what you need.

What do you mean?

perfect thanks

Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern '\b\w*soya\w*\w\s\w*\w\b' |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value
soyal Elaborado
soyabean oil   
_soya_ sauce   
Soyabean Oil   
_Soya_ Lecithin
_Soya_ lecithin
_Soya_ Lecithin
soyabean oil   
soyabean pro...
_Soya_ Lecithin
_Soya_ Mince   
_Soya_ Beans   
_Soya_ Protein 
_Soya_ Protei

gets me kind of where i need to be thanks so much for the help !!

Hi, quick question i have this for the most part working as i want however some words are returned like this, any idea what im doing wrong here ?

concentrated strawberry and    
concentrated strawberry and apple juice
concentrated strawberry and lemon ju...
Concentrated Strawberry Juice  

certain words appear truncated IE with “apple juice” is fine with “lemon juice” is not

likley completely wrong but im running this to capture a key word and varoius surrounding words

foreach ($w in $word){

Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\w*\s$word\w*\s\w*\b" |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\w*$word\w*\s\w*\b" |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\s\w*$word\w*\s\w*\b" |
            Select-Object -ExpandProperty Matches |
                Select-Object -Property Value | Out-File $outfilepath -Append utf8

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern '\b$word\s\w*\b' |
            Select-Object -ExpandProperty Matches |
                 Select-Object -Property Value | Out-File $outfilepath -Append utf8
        

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern '\b$word\w*\s\w*\b' |
            Select-Object -ExpandProperty Matches |
                Select-Object -Property Value | Out-File $outfilepath -Append utf8

    Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern '\b\s\w*$word\w*\s*\w*\s\w*\s*\b' |
     Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

            Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern '\b\w*\s\w*\s\w*$word\w*\s*\w*\s\w*\s*\b' |
                Select-Object -ExpandProperty Matches |
                    Select-Object -Property Value | Out-File $outfilepath -Append utf8

Please keep in mind … we cannot see your screen and we cannot read your mind and we do not have access to your input data unless you post them here. So if there are no sensitive information post the content or at least a relevant part of the file Ingredients.txt.

Your code seem incomplete …

What is $word? And BTW: … using a singular variable name for an array is a little confusing. And using single character variable names in general is even more confusing. $word in $WordList would be way more intuitive. :wink:

Shouldn’t your pattern use $w instead of $word?

This pattern cannot work this way because variable will not get expanded in single quotes. :point_up_2:t3:

Regardless of all that … could you please try to explain in simple terms what you’re actually trying to achieve? I’d suspect there might be a much simpler way. Your approach looks repetitive and cumbersome.

thanks Olaf, yeah i think im not doing a great job essentially im trying to search a massive list if ingredients mostly (99%) comma deliminated in a perfect world i would want to search on a keyword like “soy” , “nutmeg”, “paprika” and if that word is found return any text between the opening , and the closing, for example if my search was for “corn flour” or “spices” the whole of the entry from comma to comma gets returned (as below)

i would be looking to run this agaisnt a set of non allowed ingredients and want each words output to go into its own txt file

im a complete novice, around my 4th day on powershell and this is what i have, it works to a degree, not efficient at all and if i could get the whole ingredient entry from comma to comma it would be perfect

$outfilepath = "d:\wordlistcompile.txt"
$sortedfile = "d:\sortedfile_$word`.txt"
#$word = 'Cinnamon'
#$word = read-host "Enter word to search"

$wordlist = "FREEZE DRIED STRAWBERRY","Soya Bean","Soya","Benzoic Acid","(E210)","Sodium Benzoate","(E211)","Potassium Benzoate","(E212)","Calcium Benzoate","(E213)","Ethyl 4-Hydroxybenzoate","(E214)","Ethyl Para-Hydroxybenzoate","(E214)","Ethyl 4-Hydroxybenzoate","(E215)","Sodium Salt","(E215)","Sodium Ethyl Para-Hydroxybenzoate","(E215)","Sodium Ethyl Para-Hydroxybenzoate","(E216)","E216","E217","(E217)","Methyl 4-Hydroxybenzoate","(E218)","Methyl Para-Hydroxybenzoate","(E218)","(E219)","Methyl 4-Hydroxybenzoate","Sodium Methyl Para-Hydroxybenzoate","Flavourings","Spice","Spices","Cinnamon","Chocolate","Cocoa","Soy","Soya","Tomato","Tomatoes","Pickle","Vinegar","Gin","Martini","Dried Fruit","Chicory","Cinzano","Dubonnet","Tia Maria","Aroma","Clove Oil","Cinnamal","Berries","Cherry","Cherries","Mango","Mangoes","Lychee","Lychees","Strawberry","Strawberries","Raspberry","Prune","Prunes","Peach","Peaches","Papaya","Nectarine","Nectarines","Tea","Nutmeg","Methylparaben","Ethylparaben","Propylparaben","Butylparaben","Plum","Plums","Blackberry","Blackberies","Mulberry","Mulberries","Olives","Avocado","Pumpkin","Kidney Bean","Spinach","Cranberries","Cranberry","Blueberry","Tomato Puree","Dates","Raisins","Sultanas"

foreach ($word in $wordlist){

write-host "Working on $word part 1 of 7" -ForegroundColor Yellow

Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\w*\s$word\w*\s\w*\b" |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 2 of 7" -ForegroundColor Yellow

Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\w*$word\w*\s\w*\b" |
    Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 3 of 7" -ForegroundColor Yellow

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\s\w*$word\w*\s\w*\b" |
            Select-Object -ExpandProperty Matches |
                Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 4 of 7" -ForegroundColor Yellow

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b$word\s\w*\b" |
            Select-Object -ExpandProperty Matches |
                 Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 5 of 7" -ForegroundColor Yellow        

        Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b$word\w*\s\w*\b" |
            Select-Object -ExpandProperty Matches |
                Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 6 of 7" -ForegroundColor Yellow

    Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\s\w*$word\w*\s\w*\s\w*\s\b" |
     Select-Object -ExpandProperty Matches |
        Select-Object -Property Value | Out-File $outfilepath -Append utf8

write-host "Working on $word part 7 of 7" -ForegroundColor Yellow

            Select-String -Path "D:\OFG App\Ingredients\Ingredients.txt" -Pattern "\b\w*\s\w*\s\w*$word\w*\s\w*\s\w*\s\b" |
                Select-Object -ExpandProperty Matches |
                    Select-Object -Property Value | Out-File $outfilepath -Append utf8

                sleep 10

write-host "Start Post-Processing on $word" -ForegroundColor Cyan

        gc $outfilepath | sort | get-unique > "d:\sortedfile_$word`.txt"

        sleep 5

        Remove-Item $outfilepath 

write-host "Post-Processing complete for $word moving to next" -ForegroundColor Magenta


}

sleep 5

$masterList = "D:\IngredientMasterList.txt"
$allsortedFiles = "D:\sortedfile*"

gc -Path "D:\sortedfile*" | sort| Out-File $masterList

sleep 5


write-host "removing trailing space.. of file $masterList" -ForegroundColor Yellow
$content = Get-Content $masterList
 $content | Foreach {$_.TrimEnd()} | Set-Content masterList.txt
write-host ""
 write-host "Done!" -ForegroundColor Green

 #Remove Duplicates
 write-host "Removing Duplicates from $masterList" -ForegroundColor Yellow
#(gc $masterlist | Group-Object | %{$_.group | select -First 1}) | Set-Content $masterlist
write-host ""
 write-host "Done!" -ForegroundColor Green

#Create CSV from MasterFile
#import-csv $masterList -delimiter "," | export-csv "D:\Master.csv"

#deliminate entries in MasterList 
write-host "creating Delimination on $masterList" -ForegroundColor Yellow
gc $masterList | %{$_ -replace '$',','} | out-file "D:\MasterForImport.txt"
write-host ""
 write-host "Done!" -ForegroundColor Green


 write-host "Bulk Operation Completed on $masterList" -ForegroundColor Green

 write-host ""
 write-host "Done!" -ForegroundColor Green

A picture of your input data is not helpful. You should make it as easy for us to help you as possible.

Your explanation and your code gets worse and worse actually. :smirk:

What do you think should we do with a screenshot of a part of your input data? Should we type it all manually to be able to use it? :man_shrugging:t3:

Your input data seem inconsistent though. You mentioned you get them from an API. I’d expect them to be in a consistent format then. What is it actually? Recipes? Is it CSV? Is it line break delimited plain text? Is it just an comma delimitted array? Depending on what the input data is you may treat it differently.

Topic regex …
What should this do?

… or this …

?

I’d highly recommend that you learn more from the regex info site I linked above and use the regex playground on https://regex101.com.

If you’re looking for a match between commas you may use regex lookarounds.

Given you have this input string for example:

$String =
'Includes a minimum of 8 (But could include nearly all) of the following, both whole and split: Adzuki Beans, Baby Lima Beans, Black Beans, Blackeye Peas, Cranberry Beans, Dark Red Kidney Beans, Garbanzos, Great Northern Bean, Large Lima beans, Light Red Kidney Beans, Navy Beans, Pink Beans, Navy Beans, Small Red Beans, White Kidney Beans, Yellow Beans, Green and Yellow Split Peas, Lentils'

And you’re looking for the search string “Red Kidney” you could use lookarounds like this:

$SearchString = 'Red Kidney'
$Pattern = '(?<=,)([\w\s]*{0}[\w\s]*)(?=,)' -f [REGEX]::Escape($SearchString)

(Select-String -InputObject $String -Pattern $Pattern -AllMatches).Matches.Value.trim()

Please click the link below to get an explanation of the regex.

Thanks i was just trying to explain from the picture what my goal was, not provide input data. i had provided input data earlier in the post but can provide as needed

Thanks Olaf, looks like I’ve completely misunderstood the regex i was thinking Space\Word Search word space\word etc.

You have been a massive help, thank you so much for your time, and explaining everything i now have this working as required !

The input data was from a 35GB json file that I’ve extracted just the ingredient data from to a txt file, this portion of data is untouched and would be as the original, where there are unfortunately inconsistencies but that’s fine i don’t need everything its just for collating variations on key words so this works perfectly

Yeah … regex is some what hard to digest at the first look. :man_shrugging:t3: :wink:

Really? wow … well … I’m glad it helped. :+1:t3: :love_you_gesture:t3:

Wow again … I’d suspect that PowerShell might be the wrong tool for job … but on the other hand … we use what we have or can. :man_shrugging:t3: