A rant that will hopefully make someone in Redmond listen and try to fix it.

The beginning

In the beginning there was light. And servers. But server provisioning and setup became a quagmire of tools and snowflake setups. (Fast-forward a bit.) And in the light someone had a great idea, and thus “Infrastructure as code” came into this world.

Infrastructure as code

Infrastructure as code is a great concept, a pillar of DevOps. At its core it’s about treating our cloud infrastructure as code: describe it with deployable code, have a CI/CD pipeline triggered on push, a review pipeline based on PRs, and rollbacks with git revert. Azure’s tool of choice for infrastructure as code is the ARM template (ARM stands for Azure Resource Manager). ARM templates are JSON (more about that sad technology choice later) and they are composed of parameters, variables, resources and outputs (among other things). Parameters can also be passed from outside the template: from a parameters file or from the CLI, allowing us to reuse templates in different scenarios (staging/prod for example) or to fetch parameters dynamically (from a Key Vault, for example). Since JSON hardly suffices to describe complex scenarios, ARM templates have a notion of functions, which are written inside bracketed JSON strings. For example the concat function:

{"name": "[concat('my-prefix', parameters('environment'), '-this-service')]"}
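To make the structure concrete, here is a minimal sketch of a template skeleton showing the four main sections (the parameter and variable names are illustrative, not from a real deployment):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": {
      "type": "string",
      "defaultValue": "staging"
    }
  },
  "variables": {
    "serviceName": "[concat('my-prefix-', parameters('environment'), '-this-service')]"
  },
  "resources": [],
  "outputs": {
    "serviceName": {
      "type": "string",
      "value": "[variables('serviceName')]"
    }
  }
}
```

Deploying this with the parameters file or CLI overriding `environment` is what makes the same template reusable across staging and prod.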

ARM templates also have a “killer feature” which is lacking in other solutions: you can download templates of your current setup from the Azure portal UI, allowing a transition from a click-driven setup to infrastructure as code. Of course this doesn’t always work, but sometimes it’s a life saver, especially when dealing with under-documented APIs.

Up to now I’ve described a decent technology stack. While it has some built-in problems, like being very verbose even for trivial tasks (which is also an AWS CloudFormation issue), it is a good solution to the problem of infrastructure as code. If only this were true :(

But we were optimistic, so we set up a git repo with our infrastructure and a CI pipeline to validate and deploy the changes. Look Mom! We’re doing infrastructure as code like the big boys! What could go wrong?

The failures of the ARM templates

Strict JSON (Life without comments)

Developers are the users of ARM templates, and for developers, code maintainability and readability are top priorities. To achieve that with our code, we separate concerns by breaking code into modules, classes and functions. We think about meaningful naming, and we comment on our logic and decisions for the future reader (which might be us in a month). Remember that developers spend about 10x more time reading code than writing it. Choosing strict, schema-based JSON as the implementation language makes this very hard. For example, you can’t add fields not supported by the specific API schema — perhaps fields for your own documentation, or fields used by scripts that wrap the ARM template. While I would understand if this were a “warning” or an optional “strict” schema mode, making it an error seems wrong.

While researching this post I found out that I could add a `metadata` object containing a `comments` property to every object, which is supposed to be a solution, but it only allows comments in certain places. Also, when quickly iterating on ARM templates, trying to figure out the exact correct setup, I usually add and remove blocks of code; being able to just comment out the unneeded blocks would be helpful. But no. Strict JSON won’t allow this. Why not YAML? TOML? Or even JSON with comments?
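For completeness, the workaround looks roughly like this (a sketch — the storage account resource here is just a placeholder): you attach a `metadata` object with a `comments` property to a resource:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2019-06-01",
  "name": "[variables('storageName')]",
  "location": "[resourceGroup().location]",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "metadata": {
    "comments": "Holds build artifacts for the CI pipeline; safe to recreate."
  }
}
```

But that’s a property, not a comment: it can’t be used to disable a block of resources while iterating, and it’s only accepted where the schema allows it.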

Parameters — Parameters everywhere

ARM templates have a concept of parameters, which can be defined with a default value and type in the template and overridden with a “parameters file” and/or parameters from the CLI. Parameters, along with variables (which are composable from parameters, other variables and functions), allow our template to be dynamic. In our staging environment we might want 2 VMs, while in another environment 8. Different types of VMs, regions, etc. This is useful. The problem starts when we use a single override parameters file for multiple templates, a scenario we’ve encountered in our CI process for infrastructure. We used one file because we had some parameters shared by multiple templates, which is a reasonable use case. The problem is that when a parameters file contains a parameter not defined in the template, you get an error. So you find yourself either using lots of parameters files, duplicating repeated values and hoping not to miss one when there is a change; or the awful syndrome of parametitis: replicating the shared file’s parameters in all templates, including those that won’t be using them. Currently we have hundreds of lines of parameters in each template, while each of those templates only uses a few; making them harder to read and reason about.
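A sketch of the failure mode (the parameter names are hypothetical): a shared parameters file like the one below works fine for a template that declares all three parameters, but deploying any template that doesn’t declare, say, `redisSku` fails with an error about an undeclared parameter:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": { "value": "prod" },
    "vmCount": { "value": 8 },
    "redisSku": { "value": "Premium" }
  }
}
```

So either every template declares `redisSku` (parametitis), or the file is split per template and `environment` and `vmCount` get duplicated.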

Validation

An important part of an infrastructure-as-code tool is the ability to validate your template; preferably even do a dry run, allowing you to assert what would be created and verify the correctness of your template. Azure ARM doesn’t offer a dry-run option AFAIK, but it does have a validate command. The only problem is that the validation is VERY limited. The actual validation run checks the JSON template, missing parameters (see above) and perhaps some missing dependencies. Unfortunately, when we actually deploy the template we discover there is another validation step, which runs only when a specific component is deployed (which might be very far into the run) and only then fails.

What’s the point of a validation command if it’s not validating? An even worse problem I’ve recently encountered is validations which, instead of checking the AST (abstract syntax tree) that is presumably built from the template, try to validate certain properties against actually deployed components instead of the declared ones, ignoring the declared dependency tree. So the template fails unless I deploy those components first and only then add the dependent component. This is an anti-pattern in so many ways.

Composability

An important part of software development is composability. We don’t want to write gigantic 10,000 LOC files. We want small files we can read, and we want a way to compose and reuse them in different scenarios.

For example, let’s say I have a template for creating the resources for a specific microservice. I would rather have a predefined, working template for each resource, for example a PostgreSQL DB and a Redis cache, and just compose them in the template defining the resources I need for this service. While ARM has the notion of a linked template (templateUri), it’s limited to templates served from an HTTP location. This is a really bad call by the ARM template team. Why not allow local templates? The usual workaround suggested is uploading the templates to blob storage (an Azure storage account) and serving them from there with a SAS token. But why? This just adds complexity: uploading stuff to be linked — what if the template doesn’t validate? Or another error hits us during deployment? Now we also have to take care of cleaning up the remotely uploaded resources. Obviously this isn’t a real solution, and so we find ourselves with gigantic, unreadable, hardly workable ARM templates.
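For reference, a linked template is declared as a nested deployment resource pointing at that HTTP location — a rough sketch, where the storage URI and the elided SAS token are placeholders:

```json
{
  "type": "Microsoft.Resources/deployments",
  "apiVersion": "2019-10-01",
  "name": "redis-cache",
  "properties": {
    "mode": "Incremental",
    "templateLink": {
      "uri": "https://mytemplates.blob.core.windows.net/templates/redis.json?sv=...",
      "contentVersion": "1.0.0.0"
    },
    "parameters": {
      "environment": { "value": "[parameters('environment')]" }
    }
  }
}
```

Everything hinges on that `uri`: the `redis.json` template has to be uploaded, reachable, and authorized before the parent deployment even starts — which is exactly the cleanup and lifecycle burden described above.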

Idempotency

An important requirement for an infrastructure-as-code system is being idempotent. Running the same deployment code more than once shouldn’t change the outcome. Usually it also means that deployment is fast for infrastructure components already deployed as described. As time goes by, your cloud infrastructure environment will, almost inevitably, grow, and with it the code describing it. Since we usually redeploy all of it on every infrastructure push, we must be sure that it won’t break or change things that are already set up as described. Sadly, for some parts, this can’t be assured with ARM. Specifically with Azure Kubernetes Service, I’ve encountered multiple situations where redeploying the same cluster just broke the cluster in various ways.

And the worst part is, it’s just fragile: various API changes introduce new bugs. I’ve recently found 2 bugs in Azure ARM templates for AKS. With the latest API version and a standard load balancer you can’t redeploy a deployed AKS cluster unless it has the exact node count you created it with and hasn’t autoscaled up or down. One validation hits you if you change the count to fit the current number of nodes, and another if you keep the original count. We’ve simply disabled AKS deployment in our templates, because this bug rendered it unworkable. Rate limits are hit for big deployments (while it’s hard to break those deployments apart, see above), and those rate limits are hard-coded; support can’t raise them.

The whole thing feels like they are not really using it in their own processes; bugs this simple to catch are delivered straight to production.

Would I continue to use ARM templates? Currently yes, because I don’t want to give up on infrastructure as code. But it’s a really broken experience, and without fixing it I can’t see advanced DevOps practitioners choosing the Azure cloud willingly.