As @Ádám Sándor already explained in his recent blog posts, https://blog.container-solutions.com/enterprise-grade-ci-cd-with-gitops and https://blog.container-solutions.com/building-a-large-scale-continuous-delivery-platform-a-case-study, implementing an enterprise-grade CI/CD pipeline with a GitOps approach is a major challenge. One way to make it simpler is to structure and systematize what a Kubernetes and application delivery CI/CD pipeline actually is. Dividing the whole process into several atomic, modular phases makes it much easier to manage and creates a common language across the entire IT organization and its specialized engineering teams. Below you can find a proposal for a structured CI/CD pipeline for enterprise-scale Kubernetes and application delivery.

Let’s start with the following diagram:

The idea is to divide the entire CI/CD deployment pipeline into the following smaller, atomic phases:

Core (Phase 1): this is where we deploy all the basic cloud-, security- and network-specific components (with use of Terraform or some other Infrastructure as Code tooling), e.g.:

VPC, Resource Group, VMware Virtual Datacenter, etc.

Networking (subnets, routing, NAT, VPC peering, etc.)

Security (e.g. IAM roles, WAF integrations, VPN/IPsec tunnels, etc.)

Storage (e.g. EFS, NetApp Cloud Volumes, etc.)

Bastion or jumpbox (if needed)

CloudWatch, Stackdriver, billing alerts, etc.

etc.
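One way to keep this phase independently deployable is to give it its own Terraform root module and its own remote state file. A sketch, assuming an S3 state backend and illustrative bucket/key names:

```shell
# Each phase lives in its own directory with its own remote state file,
# so it can be planned, applied or destroyed without touching the others.
# Bucket, key and region below are assumptions, not prescribed values.
mkdir -p phases/1-core
cat > phases/1-core/backend.tf <<'EOF'
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"            # assumed bucket name
    key    = "phase-1-core/terraform.tfstate"  # one state file per phase
    region = "eu-west-1"                       # assumed region
  }
}
EOF
```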

Kubernetes (Phase 2): this is where we deploy the actual Kubernetes engine, no matter which solution is our preference (with use of Terraform or some other Infrastructure as Code tooling):

GKE

AKS

EKS

Or “vanilla” Kubernetes with Kops, Kubespray, Rancher, etc.
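For the phases to stay independent, Phase 2 should consume Phase 1's outputs without sharing its state. A sketch using Terraform's terraform_remote_state data source (bucket, key and output names are illustrative assumptions):

```shell
# Phase 2 reads Phase 1's exported values (e.g. the VPC id) through a
# terraform_remote_state data source instead of sharing a state file.
mkdir -p phases/2-kubernetes
cat > phases/2-kubernetes/main.tf <<'EOF'
data "terraform_remote_state" "core" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"            # assumed bucket name
    key    = "phase-1-core/terraform.tfstate"  # assumed Phase 1 state key
  }
}

# e.g. an EKS/GKE/AKS module wired into the Phase 1 network:
# module "eks" {
#   source = "..."   # your module of choice
#   vpc_id = data.terraform_remote_state.core.outputs.vpc_id
# }
EOF
```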

Middleware (Phase 3): these are example middleware services that are prerequisites for the actual product applications (deployed with use of Helm, Kustomize, Helmsman, Helmfile, Flux or similar CI/CD templating and tooling):

Custom and/or third-party Operators

RBAC and OPA (Open Policy Agent)

Vault and its Automated Secret Injector

Service Catalog and its Service Brokers

EFS provisioner

Prometheus, Datadog, Sysdig, Instana, etc.

Fluentd, Logstash, Filebeat, etc.

Active Directory/LDAP integrations (single sign-on, e.g. Keycloak)

cert-manager, kube-dns, Velero, etc.

etc.
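As an illustration of how this phase could be described declaratively, here is a minimal Helmfile sketch covering two of the middleware components above (the chart repositories are the projects' official ones; release names and the directory layout are illustrative):

```shell
# A minimal Phase 3 helmfile declaring two middleware releases.
mkdir -p phases/3-middleware
cat > phases/3-middleware/helmfile.yaml <<'EOF'
repositories:
  - name: jetstack
    url: https://charts.jetstack.io
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: cert-manager
    namespace: cert-manager
    chart: jetstack/cert-manager
  - name: kube-prometheus-stack
    namespace: monitoring
    chart: prometheus-community/kube-prometheus-stack
EOF
# helmfile -f phases/3-middleware/helmfile.yaml apply   # runs the deploy
```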

Applications (Phase 4): this is where the actual product applications are deployed (with use of Helm, Kustomize, Helmsman, Helmfile, Flux or similar CI/CD templating and tooling):

Application A (e.g. WordPress)

Application B (e.g. an internal billing system)

Application C (e.g. a CRM)

etc.

Validation (Phase 5): this is where we run all the final system and integration tests. If the full suite takes too long, we may limit the Validation phase to a reasonable subset (e.g. 10 minutes) and run the whole test suite (e.g. 4 hours) exclusively as a nightly CI/CD job.
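One simple way to implement such a split is to branch on an environment variable that only the nightly job sets. A sketch (the variable and suite names are illustrative):

```shell
# Run the ~10-minute smoke subset on every deployment; reserve the full
# ~4-hour suite for the nightly CI/CD job, which exports NIGHTLY=true.
if [ "${NIGHTLY:-false}" = "true" ]; then
  suite="full"      # complete suite, nightly job only
else
  suite="smoke"     # fast subset, every deployment
fi
echo "running $suite test suite"
# e.g.: ./run-tests.sh "$suite"   # hypothetical test runner
```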

This division allows us to manage each layer 100% independently without affecting the rest of the CI/CD pipeline. It also allows us to split responsibility (with an appropriate reflection in Git repository privileges and the code review workflow) across different engineering teams, e.g.:

Phase 1: Network Team + Security Team

Phase 2: DevOps Team

Phase 3: DevOps Team

Phase 4: Product Teams

Phase 5: Product Teams + Quality Assurance Team + SRE
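This responsibility split can be enforced in the repository itself, for example with a GitHub CODEOWNERS file. A sketch, assuming a per-phase directory layout and illustrative team handles:

```shell
# Each phase directory is owned by the team responsible for it, so that
# team's review is required before changes to that phase can be merged.
cat > CODEOWNERS <<'EOF'
phases/1-core/          @org/network-team @org/security-team
phases/2-kubernetes/    @org/devops-team
phases/3-middleware/    @org/devops-team
phases/4-applications/  @org/product-teams
EOF
```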

The first four phases start with a small pre-step called L (Linter) and end with another small post-step, V (Validator). The reasoning behind this is as follows:

Linter: we want to keep our Infrastructure as Code templates and the DSL (domain-specific language) used for our CI/CD pipeline definitions clean and reusable, just like any other source code in our organization. Therefore, we need appropriate tooling that checks whether our configuration and templates meet all the requirements of community-driven and organization-wide standards.

Examples:

Example Lint options for Terraform:

- terraform fmt

- terraform validate

- Kitchen-Terraform (https://kitchen.ci and its plugins)

- pre-commit-terraform (https://pre-commit.com and its plugins)

- etc.

Example Lint options for Helm and YAML:

- helm lint

- helm template

- helmfile lint

- helmsman -dry-run

- yamllint

- etc.
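Several of these linters can also be wired into the pre-commit framework so they run locally before every commit. A sketch using the pre-commit-terraform hook repository (pin rev to an actual release tag of that project):

```shell
# terraform_fmt and terraform_validate are real hook ids provided by the
# pre-commit-terraform project; the rev value is a placeholder to replace.
cat > .pre-commit-config.yaml <<'EOF'
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: <pin-a-release-tag-here>   # replace with an actual tag
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
EOF
# pre-commit install && pre-commit run --all-files
```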

Validator: we want to make sure every phase was executed 100% correctly (with no errors). Possible tools to be used for this purpose:

Examples:

Phase 1 Validator:

- Serverspec/Specinfra v2 (https://serverspec.org)

- Testinfra (https://testinfra.readthedocs.io)

- Security testing tools (e.g. a public S3 bucket scanner)

- etc.
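Whichever tools are chosen, the validator itself can be a small shell harness that runs every check, reports PASS/FAIL and fails the CI job if anything failed. A sketch with placeholder checks (the real commands, e.g. an SSH probe of the bastion or an S3 scanner invocation, are left as comments):

```shell
# Run each check, report its result, and count failures so the CI job
# can be failed when the count is non-zero.
failures=0
check() {                 # usage: check "description" command...
  desc="$1"; shift
  if "$@"; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
    failures=$((failures + 1))
  fi
}
check "bastion reachable"    true  # placeholder, e.g.: nc -z bastion.example 22
check "no public S3 buckets" true  # placeholder for your scanner invocation
echo "$failures check(s) failed"
```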

Phase 2 Validator:

- Certified Kubernetes Conformance Program (https://github.com/cncf/k8s-conformance)

- Sonobuoy (https://github.com/vmware-tanzu/sonobuoy)

- Kube-bench (https://github.com/aquasecurity/kube-bench)

- etc.
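A Phase 2 validator can, for instance, wrap Sonobuoy's quick conformance mode in a script that CI invokes once the cluster is reachable. A sketch (the script only makes sense against a live cluster with a valid kubeconfig):

```shell
mkdir -p validators
cat > validators/phase2.sh <<'EOF'
#!/bin/sh
set -e
# Quick smoke-level conformance run against the freshly created cluster.
sonobuoy run --mode=quick --wait
results_tarball="$(sonobuoy retrieve)"
sonobuoy results "$results_tarball"
sonobuoy delete --wait
EOF
chmod +x validators/phase2.sh
```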

Phase 3 Validator:

- Check if all the Operators are functioning correctly

- Check if Service Catalog and all its Service Brokers are functioning correctly

- Check all critical RBAC & OPA rules

- Check if Vault server is accessible and functioning correctly

- Check if EFS storage is accessible and functioning correctly

- etc.
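As one concrete example, the Vault check can query Vault's health endpoint (GET $VAULT_ADDR/v1/sys/health, a real Vault API) and assert that the server is initialized and unsealed. The sketch below parses a hard-coded sample response to show the logic; a live check would fetch it with curl instead:

```shell
# Sample of the JSON that /v1/sys/health returns on a healthy active node.
response='{"initialized":true,"sealed":false,"standby":false}'
# A live check would instead do:
#   response=$(curl -s "$VAULT_ADDR/v1/sys/health")
if echo "$response" | grep -q '"sealed":false'; then
  vault_status="OK"
else
  vault_status="SEALED"
fi
echo "vault: $vault_status"
```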

Phase 4 Validator:

- Application A: End-to-end tests provided by the Product A Team

- Application B: End-to-end tests provided by the Product B Team

- Application C: End-to-end tests provided by the Product C Team

- etc.
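Putting the Linter and Validator steps together with the deploy itself, each phase expands into three CI jobs. A GitLab-CI-style sketch for Phase 1 (job names, scripts and the validator path are illustrative, not a prescribed layout):

```shell
# Lint runs before the deploy and the Validator after it; a failure in
# any job stops the pipeline before the next phase starts.
cat > .gitlab-ci.yml <<'EOF'
stages: [lint, deploy, validate]

phase1-lint:
  stage: lint
  script:
    - terraform fmt -check
    - terraform validate

phase1-deploy:
  stage: deploy
  script:
    - terraform apply -auto-approve

phase1-validate:
  stage: validate
  script:
    - ./validators/phase1.sh
EOF
```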

The biggest advantage of the proposed solution is that every phase can be updated in a similar way to a slider mechanism. If, for example, we want to rebuild Phase 3 only, there is no need to recreate the whole cluster (i.e. terminate phases 4, 3, 2, 1 and then recreate phases 1, 2, 3, 4 completely from scratch, 8 actions in total), because we can achieve the same result by terminating phases 4 and 3 and then recreating those two only (4 actions in total instead of 8).
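The arithmetic behind the slider generalizes: with four deployable phases, rebuilding from phase P onwards costs 2 * (4 - P + 1) actions (destroy phases 4 down to P, then recreate P up to 4). A tiny sketch:

```shell
# Number of destroy + create actions needed to rebuild from phase P on.
phases=4
rebuild_cost() { echo $(( 2 * (phases - $1 + 1) )); }
rebuild_cost 3   # destroy 4,3 then create 3,4 -> prints 4
rebuild_cost 1   # full rebuild               -> prints 8
```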

The dynamic slider allows us to save time; however, we should remember that every significant change applied to such an infrastructure should be tested in at least the following two scenarios:

Incremental: by adding all the new changes to the DSL templates as incremental steps (e.g. terraform apply)

Destroy/Create: by recreating the entire cluster completely from scratch, to be sure it is still feasible to recreate the entire infrastructure starting from zero (e.g. terraform destroy + terraform apply). This is a very important test from a Disaster Recovery point of view.

There is a substantial difference between the Incremental and Destroy/Create workflows. For example, adding 100 new DNS entries to our DSL templates as an Incremental change may be a quick and easy operation, but if we try to do the same with the Destroy/Create approach (assuming that another 200 DNS entries are already defined at earlier stages), the final result will be 300 DNS entries added at once in a very short period of time, creating a very high risk of exceeding the Cloud Provider's API rate limits and slowing down or even completely blocking the entire process.
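Two common mitigations are lowering Terraform's concurrency on Destroy/Create runs (the -parallelism flag, which defaults to 10) and chunking bulk record creation into throttled batches. A sketch of the batching logic (the apply helper and the back-off interval are illustrative):

```shell
# Apply 300 DNS records in batches of 50 with a pause between batches,
# instead of all at once, to stay under provider API rate limits.
total=300
batch=50
batches=0
i=0
while [ "$i" -lt "$total" ]; do
  # apply_dns_batch "$i" "$((i + batch - 1))"   # hypothetical helper
  # sleep 30                                    # back off between batches
  batches=$((batches + 1))
  i=$((i + batch))
done
echo "applied $total records in $batches batches"
```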