After a bit of a sabbatical, I am back to looking at PKS (Pivotal Container Service) again. I wanted to look at the new version 1.3, but I had to do a bit of work on my environment to allow me to do this. Primarily, I needed to upgrade my NSX-T environment from version 2.1 to 2.3. I followed this blog post from vmtechie, which provides a useful step-by-step guide. Kudos to our VMware NSX-T team, as the upgrade worked without a hitch. My next step was to start work on the PKS deployment. I did a brand new deployment of PKS 1.3, using the following set of blog posts from Keith Lee, which provided invaluable help. Now, because I have a weird lab network setup, I have my PKS components on one VLAN and my vSphere management components on another VLAN. This meant I had to multi-home a number of pieces of infrastructure, such as my vCenter appliance, my NSX Manager and my PCF Ops Manager. This caused me a bit of grief when it came to configuring the certificates (to establish trust between BOSH, PKS and the NSX Manager). Fortunately, like all these things, I learnt a little bit more about PKS troubleshooting during this exercise, and I’ve decided to share some of these tidbits with you in this post.

On a few occasions, my PKS deployment got as far as this point and then keeled over (this, by the way, is the verbose log output from the PKS tile):

+ tags: {}

Task 24

Task 24 | 13:58:12 | Preparing deployment: Preparing deployment (00:00:06)

Task 24 | 13:58:31 | Preparing package compilation: Finding packages to compile (00:00:01)

Task 24 | 13:58:32 | Compiling packages: golang-1-linux/8fb48ae1b653b7d0b49d0cbcea856bb8da8a5700 (00:02:55)

Task 24 | 14:01:27 | Compiling packages: bosh-dns/138f3bd2440ba97f0a7d8912facb5d4a2b320850 (00:00:36)

Task 24 | 14:02:18 | Creating missing vms: pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0) (00:02:37)

Task 24 | 14:04:55 | Updating instance pivotal-container-service: pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0) (canary) (00:07:28)

             L Error: 'pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0)' is not running after update. Review logs for failed jobs: pks-nsx-t-osb-proxy

Task 24 | 14:12:23 | Error: 'pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0)' is not running after update. Review logs for failed jobs: pks-nsx-t-osb-proxy

Task 24 Started Mon Feb 4 13:58:12 UTC 2019

Task 24 Finished Mon Feb 4 14:12:23 UTC 2019

Task 24 Duration 00:14:11

Task 24 error

Updating deployment:

Expected task '24' to succeed but state is 'error'

Exit code 1

===== 2019-02-04 14:12:23 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.50.0.140 --deployment=pivotal-container-service-c1fe57749018d20ae221 deploy /var/tempest/workspaces/default/deployments/pivotal-container-service-c1fe57749018d20ae221.yml"; Duration: 852s; Exit Status: 1

Exited with 1.

This now begs the question – how do you review those logs? Well, what you need to do is set up BOSH CLI, query the deployment and the VM ID for the PKS VM, and then open a shell to the VM. From the shell, you can query the status of the services/jobs and examine the logs. OK, let’s break it down into distinct steps:

1. Set up the BOSH CLI

For those of you already familiar with PKS, you will be aware of the necessity to have a dedicated VM where multiple CLIs for BOSH, PKS and Kubernetes reside. I have already described how to set up the BOSH CLI in section 8 of this earlier post on PKS. The useful commands for querying deployment status are “bosh task” and “bosh vms”. Once my PKS deployment had failed as shown above, I was able to use them to query the state of the deployment.
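For completeness, here is a minimal sketch of how the BOSH CLI gets pointed at the Ops Manager director before running those commands. The director IP and client name are the ones from my lab; the client secret and CA cert path are placeholders, as in a real environment these come from the “Bosh Commandline Credentials” page in Ops Manager.

```shell
# Hypothetical setup sketch -- substitute your own credentials from
# the Ops Manager "Bosh Commandline Credentials" page.
export BOSH_ENVIRONMENT=192.50.0.140
export BOSH_CLIENT=ops_manager
export BOSH_CLIENT_SECRET='<secret-from-ops-manager>'   # placeholder
export BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate  # placeholder path

echo "BOSH director set to ${BOSH_ENVIRONMENT}"
```

With these exported, plain `bosh task` and `bosh vms` work without repeating `-e` on every invocation.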

cormac@pks-cli:~$ bosh vms

Using environment '192.50.0.140' as client 'ops_manager'

Task 36. Done

Deployment 'pivotal-container-service-21ced5842e2c62f5dd13'

Instance Process State AZ IPs VM CID VM Type Active

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa failing CH-AZ 192.50.0.141 vm-edc97e9f-79f7-4709-a2e3-0a64b1feb37c large true

1 vms

Succeeded

cormac@pks-cli:~$

Now, to actually do further troubleshooting, I can use the following command to open an SSH session to the VM (-e specifies the environment and -d the deployment; the last argument is the instance):

cormac@pks-cli:~$ bosh -e 192.50.0.140 -d pivotal-container-service-21ced5842e2c62f5dd13 ssh pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa

Using environment '192.50.0.140' as client 'ops_manager'

Using deployment 'pivotal-container-service-21ced5842e2c62f5dd13'

Task 37. Done

Unauthorized use is strictly prohibited. All access and activity

is subject to logging and monitoring.

Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-43-generic x86_64)

The programs included with the Ubuntu system are free software;

the exact distribution terms for each program are described in the

individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by

applicable law.

Last login: Mon Feb 4 17:12:46 2019 from 192.50.0.6

To run a command as administrator (user “root”), use “sudo <command>”.

See “man sudo_root” for details.

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~$

Now that we are in the PKS VM shell, we can become superuser and query the state of the processes/jobs using the “monit summary” command. As you can see below, there is an issue with pks-nsx-t-osb-proxy, which was also reported in the output above. I had read a few community posts stating that this job has an issue if certificates haven’t been set up correctly. Let’s see if we can figure out what the problem is.

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~$ sudo su -

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~# monit summary

The Monit daemon 5.2.5 uptime: 22m

Process ‘pks-api’ running

Process ‘broker’ running

Process ‘pks-nsx-t-osb-proxy’ not monitored

Process ‘mariadb_ctrl’ running

Process ‘galera-healthcheck’ running

Process ‘gra-log-purger-executable’ running

Process ‘cluster_health_logger’ running

Process ‘uaa’ running

Process ‘telemetry-server’ running

Process ‘bosh-dns’ running

Process ‘bosh-dns-resolvconf’ running

Process ‘bosh-dns-healthcheck’ running

System ‘system_localhost’ running
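When a deployment has many more jobs than this, it can help to filter the summary down to just the unhealthy ones. A minimal sketch, using a captured sample of the output above (real “monit summary” output may differ slightly in spacing, so treat the pattern as illustrative):

```shell
# Sample of "monit summary" output written to a file purely for
# illustration; in practice you would pipe the real command instead.
cat <<'EOF' > /tmp/monit-summary.txt
Process 'pks-api'                 running
Process 'broker'                  running
Process 'pks-nsx-t-osb-proxy'     not monitored
Process 'uaa'                     running
EOF

# Show only processes that are NOT in the "running" state.
grep '^Process' /tmp/monit-summary.txt | grep -v "running$"
```

On the real VM, `monit summary | grep -v running` achieves the same quick triage.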

Let’s examine the appropriate logs next:

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~# cd /var/vcap/sys/log/pks-nsx-t-osb-proxy/
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:/var/vcap/sys/log/pks-nsx-t-osb-proxy# ls -ltr
total 20
-rw-r--r-- 1 root root     0 Feb 4 16:50 pks-nsx-t-osb-proxy.stdout.log
-rw-r--r-- 1 root root    93 Feb 4 17:13 pks_nsx_t_osb_proxy_ctl.stderr.log
-rw-r--r-- 1 root root  4020 Feb 4 17:13 pks_nsx_t_osb_proxy_ctl.stdout.log
-rw-r--r-- 1 root root 10200 Feb 4 17:13 pks-nsx-t-osb-proxy.stderr.log
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:/var/vcap/sys/log/pks-nsx-t-osb-proxy# tail pks-nsx-t-osb-proxy.stderr.log
time="2019-02-04T16:50:55Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 16:50:55 Error initializing a NSX-T client: Error getting network manager for cluster Get https://10.27.51.114/api/v1/node: dial tcp 10.27.51.114:443: connect: network is unreachable
time="2019-02-04T16:51:46Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 16:51:46 Error initializing a NSX-T client: Error getting network manager for cluster Get https://10.27.51.114/api/v1/node: dial tcp 10.27.51.114:443: connect: network is unreachable
time="2019-02-04T16:52:36Z" level=error msg="Failed to get node properties" pks-networking=networkManager
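Incidentally, a quick way to confirm this sort of connectivity problem from the PKS VM is to hit the NSX Manager API endpoint directly with curl. A sketch, using my lab’s NSX Manager IP (substitute your own); `-k` skips certificate verification since we only care about reachability at this point:

```shell
NSX_MANAGER=192.50.0.5   # NSX Manager IP from my lab -- substitute yours

# -k: ignore cert errors; we only want to know if the API is reachable.
if curl -k -s --connect-timeout 5 "https://${NSX_MANAGER}/api/v1/node" -o /dev/null; then
  echo "NSX Manager API reachable from this VM"
else
  echo "NSX Manager API NOT reachable from this VM"
fi
```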

I realised at this point that my PKS VM, deployed on VLAN 50, needed access to the NSX Manager, which was currently only on VLAN 51. I had to multi-home my NSX Manager so that communication between the NSX Manager and PKS could proceed (again, this is due to my funky network layout, and probably something you would not design for in real-life scenarios). Next time through the loop, the same failure was reported in the PKS logs, but now there was a different message in the pks-nsx-t-osb-proxy.stderr.log.

2019/02/04 17:29:36 Error initializing a NSX-T client: Error getting network manager for cluster Get https://192.50.0.5/api/v1/node: x509: certificate is valid for 10.27.51.114, not 192.50.0.5
time="2019-02-04T17:30:28Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 17:30:28 Error initializing a NSX-T client: Error getting network manager for cluster Get https://192.50.0.5/api/v1/node: x509: certificate is valid for 10.27.51.114, not 192.50.0.5
time="2019-02-04T17:31:16Z" level=error msg="Failed to get node properties" pks-networking=networkManager

Ugh – so I had built my NSX Manager cert for the VLAN 51 interface (IP address), but of course I needed it built for the VLAN 50 interface (a different IP address). I generated a new cert, re-added it to the NSX Manager, and re-added it to both the PKS tile and the BOSH tile in the PCF Ops Manager UI. This time everything worked when I redeployed. The whole procedure for creating and registering certs can be found in the PKS documentation here.
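The trap here is that the certificate’s Subject Alternative Name (SAN) must cover every IP address the NSX Manager is reached on. As a hedged illustration only (the filenames and CN below are made up, and PKS/NSX-T has its own documented CSR procedure linked above), this openssl sketch generates a self-signed cert covering both of my lab IPs and then confirms which addresses it is valid for:

```shell
# Generate a self-signed cert whose SAN lists both NSX Manager IPs.
# Requires OpenSSL 1.1.1+ for -addext. Filenames/CN are illustrative.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout nsx-manager.key -out nsx-manager.crt \
  -subj "/CN=nsx-manager" \
  -addext "subjectAltName=IP:10.27.51.114,IP:192.50.0.5"

# Confirm which addresses the cert is actually valid for.
openssl x509 -in nsx-manager.crt -noout -ext subjectAltName
```

Against a live endpoint, `openssl s_client -connect <ip>:443 </dev/null | openssl x509 -noout -ext subjectAltName` shows the same information, which is a quick way to spot the x509 mismatch above before BOSH does.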

cormac@pks-cli:~$ bosh task
Using environment '192.50.0.140' as client 'ops_manager'

Task 127

Task 127 | 08:48:46 | Preparing deployment: Preparing deployment (00:00:06)
Task 127 | 08:49:06 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 127 | 08:49:06 | Updating instance pivotal-container-service: pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa (0) (canary) (00:02:12)

Task 127 Started Tue Feb 5 08:48:46 UTC 2019
Task 127 Finished Tue Feb 5 08:51:18 UTC 2019
Task 127 Duration 00:02:32
Task 127 done

Succeeded

cormac@pks-cli:~$ bosh vms
Using environment '192.50.0.140' as client 'ops_manager'

Task 132. Done

Deployment 'pivotal-container-service-21ced5842e2c62f5dd13'

Instance                                                        Process State  AZ     IPs           VM CID                                   VM Type  Active
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa  running        CH-AZ  192.50.0.141  vm-edc97e9f-79f7-4709-a2e3-0a64b1feb37c  large    true

1 vms

Succeeded

cormac@pks-cli:~$

Sweet. Now I’m ready to build some Kubernetes clusters.