Cloud-based and on-premises Openshift deployments each come with their own unique set of challenges. From a consulting perspective, I generally view cloud as easier in terms of orchestration, but with the potential for deeper technical issues.

The main challenges people seem to face with OCP on AWS are integration with the cloud provider plugin, registry storage, DNS, and keeping the AWS and Openshift layers working in harmony:





Kubernetes Cloud Provider Plugin

Don’t talk to me about Azure



Cloud provider plugins allow Kubernetes to integrate with the platform hosting it. The general objective of these plugins is to add features and improve reliability. At the time of writing, the AWS Kubernetes plugin adds two features: creating Elastic Load Balancers (ELBs) and dynamic storage via Elastic Block Store (EBS): if you create a PV in Kubernetes, the plugin requests a disk with that amount of storage and attaches it.

This plugin is currently underutilized, but integration is still recommended because of features planned for the future. The ELB-provisioning feature is largely nullified by the Openshift Router.
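For reference, enabling the AWS cloud provider in an openshift-ansible install comes down to a few inventory variables. A minimal sketch (the key values here are placeholders; with IAM roles in place the two key lines are dropped):

```ini
[OSEv3:vars]
# Enable the AWS cloud provider plugin
openshift_cloudprovider_kind=aws
# Delete these two lines if using IAM roles:
openshift_cloudprovider_aws_access_key=<access_key>
openshift_cloudprovider_aws_secret_key=<secret_key>
```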





Storage, Stateful Applications, and Limitations

Stateless apps are easy



Elastic Block Store volumes are block devices. They are not shared storage, and they are bound to their respective Availability Zones. These limitations need to be kept in mind. The first thing this affects is the internal docker registry when there are multiple replicas of the pod. The recommended workaround is to use an S3 bucket as registry storage. This approach has pretty solid performance, so even if you have another storage solution in place for OCP on AWS, it is still the recommended practice.

To escape the possible limitations of EBS, you could use NFS (not recommended for anything significant, but fine in a lab) or something more reliable like Openshift Container Storage (containerized or external).
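For the cases where EBS is fine (single-replica, zone-pinned workloads), dynamic provisioning is driven by a StorageClass using the in-tree aws-ebs provisioner. A sketch (the class name and parameters here are illustrative):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-encrypted   # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
# Remember: EBS volumes are zonal, so any pod using a claim from
# this class is pinned to the AZ the volume was created in.
```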





DNS

In the vast majority of installs in new environments, you will run into DNS issues. Cloud providers are no different.



DNS is painful for users new to OCP/AWS to troubleshoot, especially in environments that deviate from standard procedure.

Most guidelines I see online assume Route53 is being used for DNS. If you're using GovCloud, there is no Route53 available, making problem solving even more interesting. Route53 is easy to manage; branching away from it is where we start running into problems.

Most cloud provider plugins (including AWS's) require the Kubernetes NodeName to match whatever name the cloud provider has the node registered under. In Amazon, this is often ip-x-y-z-q.ec2.internal. Most people don't care for this because oc get nodes isn't quite as pretty as in most clusters, and it's harder to keep track of nodes:

```
[root@ip-x-y-z-f ~]# oc get nodes
NAME                          STATUS    ROLES     AGE       VERSION
ip-10-240-1-13.ec2.internal   Ready     infra     3d        v1.11.0+d4cacc0
ip-10-240-1-22.ec2.internal   Ready     compute   3d        v1.11.0+d4cacc0
ip-10-240-1-44.ec2.internal   Ready     infra     3d        v1.11.0+d4cacc0
ip-10-240-2-55.ec2.internal   Ready     compute   3d        v1.11.0+d4cacc0
ip-10-240-2-66.ec2.internal   Ready     master    3d        v1.11.0+d4cacc0
ip-10-240-3-23.ec2.internal   Ready     compute   3d        v1.11.0+d4cacc0
ip-10-240-3-15.ec2.internal   Ready     master    3d        v1.11.0+d4cacc0
ip-10-240-3-38.ec2.internal   Ready     master    3d        v1.11.0+d4cacc0
ip-10-240-3-61.ec2.internal   Ready     infra     3d        v1.11.0+d4cacc0
```

To check your meta-data hostname:

```
curl http://169.254.169.254/latest/meta-data/hostname
```

In a VPC, the second IP after the network base address is reserved for DNS. 169.254.169.253 is also available, but it only returns default values (not usable if custom FQDNs are configured).
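To illustrate the "base plus two" rule: for a VPC CIDR like 10.240.0.0/16, the resolver sits at 10.240.0.2. A quick sketch (the vpc_dns helper is hypothetical, and it assumes the base address's last octet stays below 254 after adding two):

```shell
#!/usr/bin/env bash
# Compute the VPC-provided DNS resolver address for a CIDR:
# the resolver lives at the network base address plus two.
vpc_dns() {
  local base="${1%/*}"            # strip the /prefix length
  local a b c d
  IFS=. read -r a b c d <<< "$base"
  echo "${a}.${b}.${c}.$((d + 2))"
}

vpc_dns 10.240.0.0/16   # prints 10.240.0.2
```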

So: the installer needs to resolve the nodes via Amazon private DNS, and the hostname needs to be set to what Amazon knows it as. If you use custom DNS but change the hostname, the control plane will fail to come up, because that ID is based off the name in the Ansible inventory. If you change the hostname to account for this, the cloud provider plugin fails to initialize. If you use only private AWS DNS, the install will fail because the masters cannot verify the install, which requires successfully resolving the load balancer.

There are two solutions to this:

1. Add the private resolutions to your non-Amazon DNS.
2. Configure dnsmasq to fall back on the Amazon DNS server for private (ec2.internal) routes.

This is a pretty cool workaround a coworker showed me:

```yaml
# ansible-playbook aws_custom_route_dns.yml -i openshift_inventory
- hosts: all
  become: yes
  tasks:
    - name: Add Amazon hostnames and FQDN to /etc/hosts
      lineinfile:
        line: ' '
        regexp: '^'
        state: present
        path: /etc/hosts
        backrefs: yes
    - name: Create ec2 dns file
      lineinfile:
        line: 'server=/ec2.internal/169.254.169.253'
        state: present
        path: /etc/dnsmasq.d/aws-dns.conf
        create: true
        owner: root
        group: root
        mode: 0644
      notify: restart_dnsmasq_service
  handlers:
    - name: restart_dnsmasq_service
      service:
        name: dnsmasq
        state: restarted
```






Registry Storage via S3 Bucket

This feels weird, but it’s pretty cool



This is supported out of the box and can be stood up automatically by the Openshift installer, provided the S3 bucket exists and you either provide the keys or have the correct IAM roles in place.

```ini
[OSEv3:vars]
# AWS Registry Configuration
openshift_hosted_manage_registry=true
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=AKIAJ6VLREDHATSPBUA   # Delete this line if using IAM Roles
openshift_hosted_registry_storage_s3_secretkey=g/8PmTYDQVGssFWWFvfawHpDbZyGkjGNZhbWQpjH   # Delete this line if using IAM Roles
openshift_hosted_registry_storage_s3_bucket=openshift-registry-storage
openshift_hosted_registry_storage_s3_region=us-east-1
openshift_hosted_registry_storage_s3_chunksize=26214400
openshift_hosted_registry_storage_s3_rootdirectory=/registry
openshift_hosted_registry_pullthrough=true
openshift_hosted_registry_acceptschema2=true
openshift_hosted_registry_enforcequota=true
openshift_hosted_registry_replicas=3
```

The generated storage section of the registry configuration looks like this:

```yaml
storage:
  delete:
    enabled: true
  cache:
    blobdescriptor: inmemory
  s3:
    accesskey: AKLOLOMGBBQSPBUA
    secretkey: g/8PmTYDQVGssFWWFvfawdaleislongkjGNZhbWQpjH
    region: us-east-1
    bucket: openshift-registry-storage
    encrypt: False
    secure: true
    v4auth: true
    rootdirectory: /registry
    chunksize: "26214400"
```

This is kind of confusing on the Kubernetes side, because this configuration is stored as a secret. oc describe dc docker-registry -n default gives no insight that S3 storage is being used (it shows EmptyDir). The only way to confirm it using kubectl/oc is:

```
oc get secret registry-config \
  -o jsonpath='{.data.config\.yml}' -n default | base64 -d
```

Or you can just view your bucket via the AWS console, where you'll see the registry files show up in /registry.





IAM Roles

IAM roles allow or deny access to AWS resources. In this context, we use IAM roles to grant Kubernetes permission to request EBS volumes and to connect to the S3 registry.

This role grants access to the registry bucket; attach it to the infra nodes:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::S3_BUCKET_NAME"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::S3_BUCKET_NAME/*"
    }
  ]
}
```

For the cloud provider plugin, attach this role to the masters:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:DescribeVolume*",
        "ec2:CreateVolume",
        "ec2:CreateTags",
        "ec2:DescribeInstances",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:DeleteVolume",
        "ec2:DescribeSubnets",
        "ec2:CreateSecurityGroup",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeRouteTables",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupIngress",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:DescribeLoadBalancerAttributes"
      ],
      "Resource": "*",
      "Effect": "Allow",
      "Sid": "1"
    }
  ]
}
```

All other nodes need:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Resource": "*",
      "Effect": "Allow",
      "Sid": "1"
    }
  ]
}
```

Implementation Knowledge Gap

There are tons of well-written Ansible playbooks that build all of the infrastructure from scratch; you just give them a key and they work. But they assume 100% AWS components, are not flexible, and could be deprecated overnight.

The largest challenge we face on the operations side of cloud-hosted Openshift is knowledge gaps, sustained by how fast (and in how many directions) things can change. It is crucial to be able to react and adapt effectively to changes in Openshift, Kubernetes, AWS, or your organization's architecture.