Blue/Green Methodology with DSC

Looking for input on the following approach when it comes to implementing DSC within a large organization.

I’m planning on installing 2 Pull servers named Green and Blue.

So if we are running under ‘Green’, all servers are pulling their info from the Green pull server.

To roll changes forward, we’d stage the new config data/files on the Blue server and then push the change out to the herd by switching the ‘active’ pull server to Blue.

So my back-out plan with this method, if an issue is found in the Blue config, would be to re-configure the LCM on the nodes pointing at Blue so they pull from Green again.
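For the switch itself, I’m picturing something like this minimal LCM meta-configuration sketch (the URL, registration key, and configuration name are placeholders, not our real values):

```powershell
# Minimal sketch of the switch back to Green; URL, registration key and
# configuration name below are placeholders, not real values.
[DSCLocalConfigurationManager()]
configuration PullFromGreen
{
    Node 'localhost'
    {
        Settings
        {
            RefreshMode       = 'Pull'
            ConfigurationMode = 'ApplyAndAutoCorrect'
        }

        ConfigurationRepositoryWeb GreenPullServer
        {
            ServerURL          = 'https://green.contoso.com:8080/PSDSCPullServer.svc'
            RegistrationKey    = '<registration key here>'
            ConfigurationNames = @('WebServer')
        }
    }
}

# Compile the meta-MOF and point the node back at Green
PullFromGreen -OutputPath .\PullFromGreen
Set-DscLocalConfigurationManager -Path .\PullFromGreen -Verbose
```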

Both servers will be under source control, with all changes / publishing / etc being handled via automation.

I’m planning on using SQL Server as the backing database (the DSC pull servers are on Windows Server 2019) so I can easily query which nodes are pulling from each server.
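On the SQL side, something along these lines is what I have in mind; the table and column names are what I believe the default pull server database uses, so treat them as assumptions to verify against a real instance:

```powershell
# Sketch: list which nodes are registered with this pull server.
# Table/column names (dbo.RegistrationData, NodeName, IPAddress,
# ConfigurationNames) are assumptions based on the default pull server
# schema - verify against your own SQL instance.
# Invoke-Sqlcmd comes from the SqlServer module.
$query = @"
SELECT NodeName, IPAddress, ConfigurationNames
FROM   dbo.RegistrationData
ORDER  BY NodeName
"@

Invoke-Sqlcmd -ServerInstance 'SQL01' -Database 'DSC' -Query $query
```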

Some benefits I see of this approach:

  1. Provides a way to slowly deploy changes that affect a large number of nodes.
  2. Provides a means to quickly back out changes and return to a last known good config.

Anyone ever build something like this, or have comments / suggestions?

thanks,
mike j

What you need to be careful with in this approach is reversing any changes. E.g. your Green config ensures a folder and a file exist. You then add a setting in Blue ensuring the SMB1 Windows feature is not present and push this out. All nodes pointing to Blue would then have SMB1 removed. But if you then notice this breaks something, switching back to Green would not change anything, because the config there only ensures the folder and file exist.
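To make that asymmetry concrete, here is a minimal sketch of the two configs being described (node, path and feature names are just illustrative):

```powershell
# Green: only knows about the folder and file.
configuration GreenConfig
{
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node 'SERVER01'
    {
        File AppFolder
        {
            DestinationPath = 'C:\App'
            Type            = 'Directory'
            Ensure          = 'Present'
        }
    }
}

# Blue: same folder/file plus the removal of SMB1. Pointing the LCM back at
# Green later does NOT reinstall SMB1, because Green's config never mentions it.
configuration BlueConfig
{
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node 'SERVER01'
    {
        File AppFolder
        {
            DestinationPath = 'C:\App'
            Type            = 'Directory'
            Ensure          = 'Present'
        }

        WindowsFeature SMB1
        {
            Name   = 'FS-SMB1'
            Ensure = 'Absent'
        }
    }
}
```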

The better way is to have a UAT environment (if you can) to test these changes, then roll out to production to a small number of servers, and if everything is still okay there, to all of them.

I understand what you’re saying, but I think that it applies in all scenarios when it comes to DSC.

The key to backing out changes caused by DSC is to ensure you never need to do it, because it is not black and white.

As for UAT, I’ve got various levels of test available. I have approximately 2,000 Windows servers that I plan on managing with DSC (getting away from GPOs).

I’m interested to hear from anyone using DSC at scale how they manage change.

Do you have multiple pull servers ( Sandbox, Test, Prod )?

Do you use code to push changes through your release pipeline?

How do you manage policy changes that require a reboot? My business users get fussy when I tell them a server will reboot sometime within a window, vs. telling them it will reboot at exactly 3:30 AM and be unavailable for 10 minutes.

Hi Mike,

I think having different pull servers is not necessary here, as a pull server’s role is only to serve over HTTP the MOF it knows to be the current configuration. What you really want is to switch between MOF versions for a given node.

The overall approach to follow here is a Release pipeline as documented in “The Release Pipeline Model” whitepaper by Steven Murawski & Michael Greene.
To quickly summarise the idea, the pipeline is the automated assembly line that will compile unique (and immutable) artefacts from your changes (in source control), before helping you progressively build trust in them by use of tests, staged releases, gates, or whatever you may need.

With DSC, you have several artefacts used and composed at different stages of the process:

  • MOFs & Checksum
  • Meta MOFs
  • DSC resources & Composites in PSModules
  • PowerShell Modules Zipped for Pull Server with checksum
  • Certificates

And some others I’m omitting here.

What we implemented at my last customer, at a similar scale, was the following:

1. Separate dependencies in their own pipelines

If your infrastructure needs custom DSC resources & composites, separate them into coherent modules and build trust in them (compile & test & release) independently, in PS module pipelines, using Test-Kitchen for DSC resources/configs. This lets you separate the data (parameters) from the code, and better handle complexity by reducing the scope of each change (hence reducing batch size and the risk that comes with each change).
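As a rough illustration of what “build trust independently” can mean for a module pipeline (the module, paths and configuration name here are made up):

```powershell
# Hypothetical build steps for a DSC resource/composite module pipeline.
# 'MyCompositeModule' and the paths are placeholders for your own layout.

# Static analysis of the module source
Invoke-ScriptAnalyzer -Path .\MyCompositeModule -Recurse -Severity Warning, Error

# Unit tests (Pester) for the resources/composites
Invoke-Pester -Path .\Tests\Unit

# Smoke test: make sure a sample configuration using the module still compiles
. .\Tests\SampleConfiguration.ps1     # defines a 'SampleConfiguration' configuration
SampleConfiguration -OutputPath .\BuildOutput\SmokeTest
```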

2. Separate Data from Code

As the code has been tested in 1. and is now trusted, you will now 'compose' your infrastructure configuration in a "Control Repository" (aka control repo, which is Puppet terminology, but it's spot on). It's basically a git repository that all changes to your infrastructure are funnelled through (for collaboration), and which will eventually be applied to your nodes: the starting point of your release pipeline for your whole infrastructure.

That repository is made of all the directives that make up your configuration: "on node x apply resource y of version v.v.v with parameter z". Traditionally it's done with DSC configuration files, but there are many ways to do it (i.e. ways of managing configuration data, abstracting in composites and so on). The ones I'll mention here are Puppet's Roles & Profiles, Chef's Roles and Runlists, Ansible's Roles & Playbooks, and my DSC Roles & Configurations. You'll still have a bit of code in there, probably to manage the automation, but only the glue: the actual resource logic stays in the modules from step 1.
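A very small sketch of what that separation can look like in plain DSC terms (the module, role and node names are made up): the configuration is the thin glue, and the data sits next to it in the control repo.

```powershell
# Data: which nodes get which role, with which parameters.
# In practice this would live in its own .psd1 file(s) in the control repo.
$ConfigurationData = @{
    AllNodes = @(
        @{ NodeName = 'SERVER01'; Role = 'WebServer'; SiteName = 'Intranet' }
        @{ NodeName = 'SERVER02'; Role = 'WebServer'; SiteName = 'Extranet' }
    )
}

# Code: a thin configuration mapping the data onto tested composite resources.
configuration ControlRepo
{
    # 'WebServerRole' is a hypothetical composite from a module built in step 1.
    Import-DscResource -ModuleName MyCompositeModule -Name WebServerRole

    Node $AllNodes.Where{ $_.Role -eq 'WebServer' }.NodeName
    {
        WebServerRole Web
        {
            SiteName = $Node.SiteName
        }
    }
}

ControlRepo -ConfigurationData $ConfigurationData -OutputPath .\BuildOutput\MOF
```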

3. Control the changes, Automatically

Before a change is allowed to be pushed/merged to that control repository, make sure you validate it: Linting, Syntax, security, everything that gives quick feedback to the user before you even start compiling MOFs.

For example, you may want to ensure no node is set to allow plain-text passwords in its MOF.
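As a rough sketch of that kind of gate (the path is a placeholder; PSDscAllowPlainTextPassword is the standard configuration-data key):

```powershell
# Pre-compilation check (sketch): fail the build if any node's configuration
# data allows plain-text passwords. The path is a placeholder for the control repo.
$dataFiles = Get-ChildItem -Path .\ConfigurationData -Filter *.psd1 -Recurse

foreach ($file in $dataFiles) {
    $data = Import-PowerShellDataFile -Path $file.FullName
    $offenders = $data.AllNodes | Where-Object { $_.PSDscAllowPlainTextPassword -eq $true }
    if ($offenders) {
        throw "Plain-text passwords allowed in $($file.Name) for: $($offenders.NodeName -join ', ')"
    }
}
```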

4. Compile & Version the Artefacts

Now you'll mix and mash this together, and compile the artefacts listed earlier. This will give you another opportunity for feedback, because if some parameters end up wrong, or you have resources duplicating changes, the DSC Compilation will fail and tell you. You want to version it, because you want each artefact to be Unique, so that you build trust in a unique (immutable) version, not just in something that changes over time.

So that’s important, and we’re starting to actually solve your problem.

Now you have different versions of MOFs you built over time (amongst other Artefacts):
SERVER01_v0.0.1.MOF
SERVER01_v0.0.4.MOF
SERVER01_v1.2.3.MOF
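In build-script terms, that step can be as simple as compiling into a folder named after the version and generating the pull-server checksums; a sketch, where the version and paths are placeholders rather than a prescribed layout:

```powershell
# Sketch of the compile-and-version step; $Version would normally come from
# the build system (a build number, GitVersion, a tag...).
$Version    = '1.2.3'
$OutputPath = ".\BuildOutput\v$Version\MOF"

. .\ControlRepo.ps1     # defines the 'ControlRepo' configuration shown earlier
ControlRepo -ConfigurationData .\ConfigurationData\AllNodes.psd1 -OutputPath $OutputPath

# The pull server needs a checksum file next to every MOF
New-DscChecksum -Path $OutputPath -Force
```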

5. Separate Release & Delivery

We actually learned that lesson the hard way.

The Release first:
Once you’ve compiled those artefacts, store them somewhere where it’s easy to find and use.
We just dumped the build artefacts to a file system:

  • one folder per version released (e.g. v0.0.1)
  • a subfolder per type of artefact: MOF, LCMConfigs (MetaMOF), ZippedModules, RSOP (custom artefact)
  • then, under MOF, a folder for each of the stages we had (more on this later): TEST, STAGE0, STAGE1, STAGE2
  • each containing every server’s MOF & MOF checksum: TESTSERVER01.MOF…

Over time that will be a lot of artefacts, but you can now manage your history.
That is the release (making those artefacts available in a ‘feed’).

Now the Delivery (deployment):
That would be a separate ‘process’ (or build job), which has its own rules (don’t deploy on a Friday, or whatever makes sense to you, again, to build trust in the system).
This job would most likely copy (simple file copy) what you want to release to the Pull Server: cp \v1.2.3\MOF\TEST* \pullserver\MOFs
You’d also store delivery metadata somewhere (what has been released, last known good, when it was released, approvals, ticket reference, and whatever helps you).
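Pulling those two together, the delivery job itself stays boringly simple; a sketch, where the share paths, stage name, and metadata file are all made up for illustration:

```powershell
# Hypothetical delivery step: promote one released version's TEST MOFs to the
# pull server. Every path and name here is a placeholder for your own layout.
$Version    = '1.2.3'
$Stage      = 'TEST'
$ReleaseDir = "\\releases\dsc\v$Version\MOF\$Stage"
$PullServer = '\\pullserver\c$\Program Files\WindowsPowerShell\DscService\Configuration'

# MOFs and checksums were generated at build time, so this is a pure copy
Copy-Item -Path (Join-Path $ReleaseDir '*') -Destination $PullServer -Force

# Record what went where, so 'last known good' is always easy to find
[pscustomobject]@{
    Version  = $Version
    Stage    = $Stage
    Deployed = Get-Date
    By       = $env:USERNAME
} | Export-Csv -Path '\\releases\dsc\deployments.csv' -Append -NoTypeInformation
```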

How does that solve your problem? If you want to go back and use a previous MOF, it’s just a copy job. I’d warn you the same way @Alex Aymonier did: you’re not dealing with immutable infrastructure, so it’s never a true rollback but a fix forward using a previous MOF, so it may or may not work (but you know this already…).

Another element is how you promote those changes so that you trust them: the promotion process.

For a given version, say v1.2.3, you will have compiled all nodes’ MOFs for every staging environment. The idea is that the place you deploy to first should not impact the business if there’s a failure: your first staging environment, usually test.
Then, when you think that deployment is ‘successful’ (your threshold of trust for next stage), you deploy to the following stage.
Our STAGE1 was the developers’ test servers: production servers that hundreds of developers were testing on, so an impact there would be annoying and hinder productivity, but there would be no end-customer impact. So the MTTR requirement wasn’t too demanding.

Then Stage 2 and 3 were the production environment and the DR environment respectively.
The point is that you build up trust in that version v1.2.3 until you’re happy for it to go everywhere. If it fails to meet your standard at some point, you try to ‘shift left’ (bring tests and feedback earlier in the process for the next deployments), create v1.2.4, and start the process again. NEVER jump the pipeline.

6. Continually Improve

Bear in mind that the key questions for how to divide those stages are:

  • what will help me trust that change?
  • what is the risk of failure vs the potential impact at that stage?
  • how can I go as fast as possible within those limitations?

And your answer to those questions will evolve over time as your pipeline and configuration evolve, and you gain in visibility as to what’s happening to your infrastructure.

You will start fairly small, managing maybe a single role with very limited features and a slow release cadence. Then you will need to go faster and grow bigger at the same time.

Good luck!

Thanks for taking the time with that reply. Lots of good insight in there.

 

That is an excellent response, Gael. It really deserves to be an article rather than a forum post, to make it easier for people to find. Thanks for sharing 🙂