



The second step to building a comprehensive automation system, after accepting the pros and cons of a comprehensive approach, is to create an authoritative database. A single authoritative database. One data source.

Why one?

0, 1, Infinity. There is no 2.

Zero One Infinity is a theory that is at home in operational automation.

You can have none of something. You can have one of something. You cannot stop at 2 somethings once you pass 1 of something.

Once you have made more than one of something, there will be more reasons to make more versions of that thing. It is inevitable. If there was a compelling reason not to handle both cases in a general system (one of that thing), then there will be additional compelling reasons for more versions of that thing to be created.

This should be seen as an Operational Law: 0, 1, Infinity. There is no 2.

Using this as an operating law allows us to quickly apply it to a problem and determine whether we should not create a given solution, create a single solution system that can handle all problems in its domain, or create a solution that manages an infinite number of solution systems.

These are the only ways to approach a problem that yield a comprehensive solution. If we violate this law and create a second system, we have knowingly created something whose count will keep increasing, without designing it to manage the many possibilities that growth creates. It is better to treat any >1 scenario as if it will need an infinite (potentially all resources available) approach than to paint ourselves into a corner by planning to fail to handle this growth.

An Authoritative Source

If automation is going to be able to replace humans for a given task, it must be able to perform functions that are an acceptable replacement for a human’s efforts.

What is an acceptable replacement for a human’s efforts? My goal for automation is that the automation system does exactly what I would have done in the same situation, if I knew all the details of the situation, and was making the best decision possible.

When I show up to an event to deal with something I want to know what is going on, and so I start collecting information from graphs, monitoring statistics, host and process performance and log information, network utilization and availability, and every other source of information I have the resources to collect in a reasonable amount of time before having to make a decision about an action to take about the observed event.

When I have gathered this information, I have created an authoritative source in my mind of what the situation is. I have my understanding of the architecture, how it was built, what pieces are in play, how the request traffic flows, and the goals for each request.

Using this authoritative information I can make a determination about an action to take, such as restarting a service, or redirecting traffic to a different location, to resolve whatever problem has occurred. The process I go through is documentable: many organizations create run books or play books for their NOC. That means it can also be documented in code.
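As a minimal sketch of what documenting a run book in code could look like: the observation fields, the 5% error threshold and the action names below are all hypothetical, chosen only to mirror the restart and redirect examples above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Facts gathered about an event before deciding (hypothetical fields)."""
    service_running: bool
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    alternate_site_up: bool


@dataclass
class RunBookStep:
    """One documented rule: when it applies, and which action to take."""
    description: str
    applies: Callable[[Observation], bool]
    action: str


# The run book is data. Steps are checked in order, like a written play book.
RUN_BOOK: List[RunBookStep] = [
    RunBookStep(
        "Service process is down",
        lambda o: not o.service_running,
        "restart_service",
    ),
    RunBookStep(
        "High error rate and the alternate site is healthy",
        lambda o: o.error_rate > 0.05 and o.alternate_site_up,  # assumed threshold
        "redirect_traffic",
    ),
]


def decide(obs: Observation, run_book: List[RunBookStep] = RUN_BOOK) -> str:
    """Return the action of the first matching step for an observed event.

    If no documented step matches, hand the event to a human.
    """
    for step in run_book:
        if step.applies(obs):
            return step.action
    return "escalate_to_human"
```

Keeping the rules as data rather than as nested conditionals means the run book itself can later be stored in, and read from, the authoritative database.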

Documenting automation processes in code has been going on for as long as people have been writing code, but an authoritative database that can be used for every step in the operational life-cycle is presently exceedingly rare, even though it is the first concrete step toward a comprehensive automation system.

The Single Source of Data, For Everything

An authoritative database that contains all the information necessary to run the comprehensive automation system fits into our “0, 1, Infinity” rule by being 1. There is 1 authoritative database that contains all the information necessary to run all the automation in the overall system, comprising all the systems.

There are of course an “infinite” number of databases in any overall system. Almost every program has configuration files, or takes arguments which need to be stored somewhere (perhaps in a script, a logically interpreted database). Each of these files or databases is a separate data source. Each may contain uniquely represented data for driving that program which does not exist and is not needed anywhere else in the system.

So which is it? An infinite number of databases or a single database?

A single authoritative database is capable of providing source information to seed other databases, and also of being an authoritative source on where other authoritative data lives. As long as there is a single root of authority that can be reliably queried for authoritative data, or for the location of authoritative data, it is an authoritative data system.
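A minimal sketch of that idea, assuming a hypothetical `RootAuthority` class: the root either holds a value itself or holds a referral to the source that does, so callers only ever ask one place. The class, method and key names are illustrative and not part of any real library.

```python
from typing import Any, Callable, Dict


class Referral:
    """Marks a key whose authoritative value lives in another source."""

    def __init__(self, source_name: str) -> None:
        self.source_name = source_name


class RootAuthority:
    """Single root that answers directly or follows a referral (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, Any] = {}
        self._sources: Dict[str, Callable[[str], Any]] = {}

    def register_source(self, source_name: str, lookup: Callable[[str], Any]) -> None:
        """Register another data source and the function used to query it."""
        self._sources[source_name] = lookup

    def set_value(self, key: str, value: Any) -> None:
        """Store a value the root itself is authoritative for."""
        self._records[key] = value

    def set_referral(self, key: str, source_name: str) -> None:
        """Record where the authoritative value for this key lives."""
        if source_name not in self._sources:
            raise KeyError(f"unknown source: {source_name}")
        self._records[key] = Referral(source_name)

    def resolve(self, key: str) -> Any:
        """Return the authoritative value, following a referral if needed."""
        record = self._records[key]  # KeyError: the root knows nothing about it
        if isinstance(record, Referral):
            return self._sources[record.source_name](key)
        return record


# Usage: one value is held by the root, one lives in a program's own config.
apache_conf = {"web.max_clients": 256}   # stands in for a separate database
root = RootAuthority()
root.register_source("apache_conf", apache_conf.__getitem__)
root.set_value("web.platform", "linux")
root.set_referral("web.max_clients", "apache_conf")
```

The separate configuration data still exists, but there is exactly one place to ask where anything lives.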

This is comparable to the Internet’s DNS, which has authoritative root servers that give you information on the location of authoritative domain servers. The root servers are the ultimate initial authority, but they do not have all the information.

DNS follows the 0, 1, Infinity rule. There is 1 DNS system for the Internet, and therefore there is 1 Internet. There are many (infinite) networks, and one even calls itself “Internet2”, but there is only a single Internet. Any efforts to try to create 2 of this something have been roundly rejected.

The Structure of an Automation Database

To understand what needs to be stored in an automation database, it is important to understand the entire system you are trying to comprehensively automate.

I break comprehensive automation of system/network operations down into 4 phases, giving me 100% coverage of the operational life-cycle, and thus I have hooks to hang all my data on. Having started with everything I want, I can start to sub-divide those areas to finer granularity, until I have it down to a database schema that can describe every element required to satisfy my 4 comprehensive phases.

My 4 life-cycle phases of comprehensive system and network automation are:

1. Provisioning

2. Configuration

3. Test functionality and performance

4. Analyze results and take actions, generating more results to analyze

This life-cycle repeats from 1 to 4 to 1 again when new hosts are required for additional scaling or to replace a broken host. After a successful configuration, meaning all the functional and performance tests pass, a host will oscillate between phases 3 and 4, testing and analysis, until a functional or performance Service Level Agreement violation occurs. Phase 2, Configuration, is also re-run periodically, or on a system update event, such as adding or removing a host from a configuration file.
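The transitions described above can be sketched as a small state machine. This is only an illustration under my own assumptions: the `Phase` names and the two flags are hypothetical, and a real system would derive them from the results analyzed in phase 4.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The four life-cycle phases, numbered as in the list above."""
    PROVISION = 1
    CONFIGURE = 2
    TEST = 3
    ANALYZE = 4


def next_phase(current: Phase,
               needs_new_host: bool = False,
               config_changed: bool = False) -> Phase:
    """Return the phase that follows `current`.

    Phases 1 to 3 always advance. After phase 4 the host normally returns
    to phase 3, unless analysis calls for a new or replacement host
    (back to phase 1) or a configuration update (back to phase 2).
    """
    if current == Phase.PROVISION:
        return Phase.CONFIGURE
    if current == Phase.CONFIGURE:
        return Phase.TEST
    if current == Phase.TEST:
        return Phase.ANALYZE
    # current == Phase.ANALYZE
    if needs_new_host:
        return Phase.PROVISION
    if config_changed:
        return Phase.CONFIGURE
    return Phase.TEST
```

For example, `next_phase(Phase.ANALYZE)` returns `Phase.TEST`, which is the steady-state oscillation between phases 3 and 4.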

These are only 4 steps, but by using set theory to divide my problem, I can provide 100% coverage of efforts. Having provided 100% coverage of efforts, I can attach data to these efforts.

What does it take to complete an effort? Provisioning, configuration, testing, analysis and taking action are all related to one another, and share data. In fact, the data they do not share is significantly smaller than the data they do share, as they are all operating on the same elements.

Where do humans fit in?

Humans fit in in a few places:

Humans occupy the previously unlisted #0 spot on the now incorrectly named 4 Phase Life-Cycle: configuration of goals (SLAs), services provided (operational services), resources to use (physical hardware and vendors), and procedures to follow to collect information, make decisions about taking actions, and take those actions.

Humans are called upon during phase 4 when the automation system encounters a situation it cannot cope with. These conditions are mainly restricted to repeated failed attempts to cure SLA violations, as all monitoring is based on goals to be solved, again using set theory to divide the problem space so that it gets 100% coverage initially at a coarse grain, but becomes finer with additional information collection and layered decision making.

During SLA violations humans would be required to adjust the SLA levels, or add additional physical hardware or vendor options to cure any current SLA violations. Humans might also write additional scripts, or specify additional packages to be included. Since environments are comprehensively provisioned and configured by the comprehensive automation system, staging and QA environments can be created that will accurately test the changes being introduced to the environment.

Humans also need to verify that their intentions are being met by the changes, and that the changes were effective. This is largely a subjective process, and unsuitable for automation.

Four Phase Life-Cycle Schema Elements

Provisioning, configuration, testing, analysis and taking action in an operational automation system require the following elements:

Hardware inventory (for physical installations)

Platform specifications (OS installation information, such as Linux)

Package specifications (installation information for services that run on host operating systems, such as Apache)

Provisioning method (Kickstarting, VM provisioning, vendor instance provisioning)

A host/machine concept for working with a provisioned host

Storage inventory (for physical installations)

Storage specifications (for how to use physical, virtual or vendor storage)

A concept of an operational service (not a service that runs on an OS, but some level of service above that, such as “Web Page Requests”, instead of Apache)

Service interface configuration (how operational services interact with other operational services)

Security and access information

Script and data locations for running jobs and configuration

Physical locations (for where physical hardware or vendors are located)

External locations of collected information (time series: RRD, databases, etc) and meta-data about that collection (schedule to collect, last collected, historical success)

Process specifications (how to do everything, comprehensively and with as much detail as has been added, assuming an infinite number of possible specifications)
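To give a feel for how a few of these elements might relate, here is a deliberately tiny sketch using Python's built-in `sqlite3` module. Every table and column name is a hypothetical placeholder of mine and not the schema outline itself. It covers only physical locations, hardware inventory, platform specifications, the host concept and the operational service concept.

```python
import sqlite3

# Hypothetical tables for five of the elements listed above.
SCHEMA = """
CREATE TABLE location (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE hardware (
    id          INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL REFERENCES location(id),
    model       TEXT NOT NULL
);
CREATE TABLE platform (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- e.g. an OS installation profile
);
CREATE TABLE operational_service (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- e.g. 'Web Page Requests', not 'Apache'
);
CREATE TABLE host (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    hardware_id INTEGER REFERENCES hardware(id),   -- NULL for VM or vendor hosts
    platform_id INTEGER NOT NULL REFERENCES platform(id),
    service_id  INTEGER NOT NULL REFERENCES operational_service(id)
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def hosts_for_service(conn: sqlite3.Connection, service_name: str) -> list:
    """Return the names of hosts that provide an operational service."""
    rows = conn.execute(
        "SELECT host.name FROM host "
        "JOIN operational_service ON operational_service.id = host.service_id "
        "WHERE operational_service.name = ? ORDER BY host.name",
        (service_name,),
    )
    return [row[0] for row in rows]
```

Because every phase operates on the same hosts and services, all four phases can read from tables like these rather than each keeping a private copy.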

Each of these areas requires quite a bit of expansion, which I will begin in my next article. It will go into more detail about the Four Phase Life-Cycle and start creating the schema outline. To do this comprehensively, the structure must first be described with 100% coverage, and then additional data can be filled in without affecting the whole.

One system to deal with infinite possibilities.

Read the next article: The Four Phase Life-Cycle