This post follows up part 1. It documents my experience and process building a reliable test process for our helm chart. The general idea is that the “clean-build-test-install-verify” flow should work consistently across repeated runs, with successes and failures alike having no impact on subsequent runs.

Part 1 covered the general setup, process, and analysis for experiment run 1. This post covers the analysis for runs 2–8.

Run 2

Changes since run 1:

Terminated failing node mentioned in part 1

Added -o wide to debugging output to provide more information on potential failing pods.

This run was a complete failure. There was no working tiller for the entire run. Various helm commands failed with “transport closed” or “connection refused” type errors. There was not a single success across 10 hours. Unfortunately there was nothing in the tiller logs to debug.

Run 3

Changes since run 2:

Deleted broken tiller pod

The results were similar to run 1: consecutive successes followed by consecutive failures. There was a single alert from our monitoring system about a failing kubelet PLEG (Pod Lifecycle Event Generator), which coincided with the failures. All failing pods were confined to a single failing node (visible with -o wide). This made me conclude that this is a node issue and not something in helm/tiller.

Luckily I remembered to check dmesg as part of following up on my periodic Docker daemon lock-up issue. The logs revealed processes killed for out-of-memory errors. The node had 98% memory utilization. That’s not good for anything. This led me to conclude that pod/node density was too high, and the next run needed to address this. It also made me wonder: what is the kubelet’s memory pressure check, and why didn’t it alert in this case?
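For reference, the kind of check I ran looks roughly like this sketch. The oom_kills helper is hypothetical (the real check was just eyeballing dmesg on the node):

```shell
# Hypothetical helper: filter OOM-killer lines out of kernel log text.
# The kernel logs entries like "Out of memory: Kill process <pid> (<name>) ...".
oom_kills() {
  grep -iE 'out of memory|killed process'
}

# On the node itself: dmesg | oom_kills
```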

Run 4

The changes for this run focus on memory pressure and monitoring:

Set --eviction-hard=memory.available<500Mi as described here.

Added resource requests and limits to all pods in the chart. Test installs happen using the sandbox stage. This stage is unique because it includes datastore (e.g. mongodb, redis, or mysql) pods for all applications in the chart. Unfortunately these pods were forgotten in the commit that added resource requests/limits to other containers. On the other hand, it revealed another issue with running the JVM in containers. We fixed this with our open source local DynamoDB image, which uses -XX:+UseCGroupMemoryLimitForHeap to set JVM heap limits from Docker cgroups. You can read up on JVM heap settings in Docker. Here’s more info on the new -XX:+UseCGroupMemoryLimitForHeap setting.

Added alarms for memory usage > 90% on Kubernetes nodes.

The first change ensures that pods are evicted (rescheduled onto another node) in high memory pressure scenarios. Our cluster uses m4.large instances with 8GB of memory, so this eviction policy kicks in at roughly 94% memory utilization.
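A quick sanity check on that percentage, treating 8GB as 8192Mi:

```shell
# Eviction triggers when available memory drops below 500Mi,
# i.e. when (8192 - 500) of 8192 Mi is in use:
echo $(( (8192 - 500) * 100 / 8192 ))   # prints 93, i.e. roughly 94% utilization
```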

The second change minimizes the possibility that the eviction policy is triggered at all. Declaring resource requests ensures pods are only scheduled where they fit. Declaring resource limits caps each container’s resource utilization to prevent unexpected growth which may trigger evictions or downtime.

These two changes should keep nodes in a healthier state and perhaps minimize the chance of a Docker daemon lock up. Computers don’t like it when they run out of memory — especially when there’s no swap!

The results are encouraging! The experiment ran for 12 hours with a high success rate: 51 successes and 3 failures. There were no alarms during the experiment either. Failures did occur, but they did not impact subsequent runs; there were successful runs after failed runs. Here’s a snippet from the log:

failures=0 success=1

failures=0 success=2

... snipped ...

failures=0 success=47

failures=0 success=48

failures=1 success=48

failures=2 success=48

failures=2 success=49

failures=2 success=50

failures=2 success=51

failures=3 success=51

The failures are internal errors from helm test. Here’s an example:

Run options: --seed 11474

# Running:

..E...E.....

Finished in 4.355899s, 2.7549 runs/s, 9.1829 assertions/s.

1) Error:

GenericRPCsTest#test_ping_email_service:

Thrift::TransportException: Could not connect to test-efritin-kviberg-efr-email-service-thrift:9090:

/usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/socket.rb:57:in `open'

/usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/framed_transport.rb:37:in `open'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:91:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:9:in `block in call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:8:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:9:in `block in call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:8:in `call'

/usr/local/bundle/gems/middleware-0.1.0/lib/middleware/runner.rb:31:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:188:in `block in invoke'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:187:in `invoke'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:63:in `block (3 levels) in build'

test/stage1/generic_rpcs_test.rb:60:in `assert_rpcs'

test/stage1/generic_rpcs_test.rb:27:in `test_ping_email_service'

2) Error:

GenericRPCsTest#test_ping_search_service:

Thrift::TransportException: Could not connect to test-efritin-kviberg-efr-search-service-thrift:9090:

/usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/socket.rb:57:in `open'

/usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/framed_transport.rb:37:in `open'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:91:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:9:in `block in call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:8:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:9:in `block in call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:8:in `call'

/usr/local/bundle/gems/middleware-0.1.0/lib/middleware/runner.rb:31:in `call'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:188:in `block in invoke'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'

/usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:187:in `invoke'

/app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:63:in `block (3 levels) in build'

test/stage1/generic_rpcs_test.rb:60:in `assert_rpcs'

These errors indicate that some containers were not ready when the tests executed. That should not be possible, because the await-release step blocks until all pods are ready, and readiness and liveness probes are defined for all containers. One of the following reasons likely caused the problem:

Failed liveness probe

Failed readiness probe

Container restarted for scheduling reasons

There is no way to know the cause at the time of this analysis. The test-release script requires more logging in the failed test condition.

It seems (curiously) that there is a relation between successive failures. Consider this log snippet:

failures=0 success=48

failures=1 success=48

failures=2 success=48

failures=2 success=49

failures=2 success=50

failures=2 success=51

failures=3 success=51

There were 48 successful executions, followed by 2 failures, 3 successes, and another failure. The test log shows the same error (the earlier log snippet) for all failing runs. This leads me to think there is something fishy going on with the specific containers in question, but there is no supporting evidence at this time. Hopefully the next run provides more data.

Run 5

This run’s goal is to acquire more data around pod state after failed test runs. This should provide enough data to correctly determine the failure reason. I’ve made the following changes for run 5:

Print all pods in the release in the failure case

Describe all pods in a non-running state in the failure case. This shows the event log — more importantly the liveness and readiness probe history — along with other useful information.

Here’s the updated script/test-release .

Updated script/test-release with pod status and describe pod output
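The mechanics of the second change can be sketched roughly like this. non_running_pods is an illustrative helper, not the actual script; the real script/test-release is linked above:

```shell
# Illustrative sketch: given `kubectl get pods` output on stdin, print the
# names of pods whose STATUS column is anything other than Running.
non_running_pods() {
  tail -n +2 | awk '$3 != "Running" { print $1 }'   # drop header, test STATUS
}

# In the failure branch of script/test-release (requires a cluster):
#   kubectl get pods -l "release=${release}" | non_running_pods \
#     | xargs -r -n1 kubectl describe pod
```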

The experiment ran for approximately 18 hours, and the results are encouraging! There were 90 runs with 85 successes and 5 failures. Again, failures in one run did not impact subsequent runs. The results show that failed liveness probes did cause some of the previous failures. Here’s the log, grepping for the specific success and failure messages. Note that the output also captures useful bits of kubectl describe to debug such failures:

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3

failures=1 success=0

failures=1 success=1

failures=1 success=2

failures=1 success=3

failures=1 success=4

failures=1 success=5

failures=1 success=6

failures=1 success=7

failures=1 success=8

failures=1 success=9

failures=1 success=10

failures=1 success=11

failures=1 success=12

failures=1 success=13

failures=1 success=14

failures=1 success=15

failures=1 success=16

failures=1 success=17

failures=1 success=18

failures=1 success=19

failures=1 success=20

failures=1 success=21

failures=1 success=22

failures=1 success=23

failures=1 success=24

failures=1 success=25

failures=2 success=25

failures=2 success=26

failures=2 success=27

failures=2 success=28

failures=2 success=29

failures=2 success=30

failures=2 success=31

failures=2 success=32

failures=2 success=33

failures=2 success=34

failures=2 success=35

failures=2 success=36

failures=2 success=37

failures=3 success=37

failures=3 success=38

failures=3 success=39

failures=3 success=40

failures=3 success=41

failures=3 success=42

failures=3 success=43

failures=3 success=44

failures=3 success=45

failures=3 success=46

failures=3 success=47

failures=3 success=48

failures=3 success=49

failures=3 success=50

failures=3 success=51

failures=3 success=52

failures=3 success=53

failures=3 success=54

failures=3 success=55

failures=3 success=56

failures=3 success=57

failures=3 success=58

failures=3 success=59

failures=3 success=60

failures=3 success=61

failures=3 success=62

failures=3 success=63

failures=3 success=64

failures=3 success=65

failures=3 success=66

failures=3 success=67

failures=3 success=68

failures=3 success=69

failures=3 success=70

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3

failures=4 success=70

failures=4 success=71

failures=4 success=72

failures=4 success=73

failures=4 success=74

failures=4 success=75

failures=4 success=76

failures=4 success=77

failures=4 success=78

failures=4 success=79

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3

failures=5 success=79

failures=5 success=80

failures=5 success=81

failures=5 success=82

failures=5 success=83

failures=5 success=84

failures=5 success=85

This run provides useful data for analyzing the 5 failures. Here’s the analysis of each failure from the log data.

1. This is an interesting failure. The log shows script/await-release finished immediately. This has never happened before. I cannot believe that all pods were available just after helm install completed, so I chalk this up to a curious bug in script/await-release. The test failure logs show that two RPCs timed out. The get pod output shows the relevant pods were still running their init containers. This run proves there is a bug somewhere, but it’s uncertain where. I will review script/await-release before run 6.

2. This is a similar failure to #1. One RPC timed out to one of the same containers from run 1. Except this log shows that script/await-release finished correctly and all pods were running. I would expect this test to pass given the information on hand. Default RPC timeouts need confirmation; it’s possible that various load factors prevented the server from responding in time. Centralized logging would be useful here as well. I cannot think of a straightforward way to associate tests with pods for more accurate log collection in the failure case. It’s likely the only solution is to dump logs for all pods.

3. Same situation as #2.

4. Similar situation to #1. The test failure logs show that script/await-release completed successfully, but the pods were restarted at some point during the test run.

5. Similar to #1. script/await-release immediately reported that all pods were available. The failed test results show that RPCs failed because pods were still running init containers.

These results lead me to conclude that:

There is a bug in the script/await-release check for Running pods. It’s possible that kubectl get pods shows them as Running for a moment before init containers start. The logs show that pods had zero restarts as well.

Liveness probe initial delay seconds may need to be increased

RPC timeouts may need to be increased

Recording pod state before running helm test would help. This could help debug which pods underwent status changes during the test run.

script/await-release should print kubectl get pods output when everything is OK. This would help debug why grep -vqF 'Running' returned zero. I have a hunch that kubectl get pods is returning no more than the header line, so the tail -n +2 makes the pipeline return an empty string. Although I have no idea why that would occur, I’ll add a test that the output is non-empty as a compound condition for release readiness. This, combined with printing the state of all pods, would confirm assumptions.
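That hunch is easy to demonstrate in isolation: grep exits non-zero when it selects no lines, so an empty pod list is indistinguishable from “no non-Running pods” in this pipeline:

```shell
# grep -vqF 'Running' exits 1 ("no match") on empty input, which a
# readiness check built on this pipeline misreads as "every pod is Running".
printf '' | grep -vqF 'Running'
echo $?   # prints 1
```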

Run 6

This run’s goal is to debug issues in script/await-release . Changes in this run:

Capture and print release pod status in the event of a failed helm test

Print all pod status when script/await-release registers everything as ready

Add a guard clause that kubectl get pods -l "release=${release}" | tail -n +2 returns a non-empty string before testing for non-Running pods
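Assuming the original pipeline shape, the readiness check with the new guard clause looks something like this sketch (release_ready is an illustrative name, not the actual script):

```shell
# Illustrative sketch: the release is ready only when the pod list is
# non-empty AND no line reports a status other than Running.
release_ready() {
  pods="$(tail -n +2)"           # strip the kubectl header line
  [ -n "${pods}" ] || return 1   # guard: empty output is NOT ready
  ! printf '%s\n' "${pods}" | grep -vqF 'Running'
}

# Usage (requires a cluster):
#   kubectl get pods -l "release=${release}" | release_ready
```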

I’ve not changed anything regarding timeouts or liveness probes in this run. I have a feeling the probes themselves are configured with appropriate delays and timeouts. It’s more likely that the pods were never ready in the first place, thus invalidating the prerequisites for script/test-release.

I started the experiment and monitored it until failures occurred. I noticed two failures, and terminated the run when it was clear that the changes were not reducing them.

The log shows some strange behavior in script/await-release. The following gist shows the problem area along with a log snippet. The log shows that the script detected everything as “running”, but the following lines show the text grep ran against: pods in both Pending and Running states. The script should not have counted this as OK.

My only guess is strange grep behavior at this point.

Run 7

The goal for this run is to further debug failures in script/await-release. I’ve made a single change: set -x to print all commands. This allows debugging variable state and seeing exactly what string grep ran against.
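For illustration, set -x traces each expanded command to stderr before it runs, so the run log captures the exact strings each pipeline stage received (the variable here is made up):

```shell
# With tracing on, the shell prints lines like
#   + grep -cF Pending
# to stderr before executing each command.
set -x
status="Pending"                        # hypothetical variable for the demo
echo "${status}" | grep -cF 'Pending'   # prints 1 (one matching line)
```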

I eagerly babysat the test runs and waited for a failure. I got one after 10 hours or so. Here’s the overall result before analyzing each individual failure:

failures=0 success=1

failures=0 success=2

failures=0 success=3

... snipped ...

failures=0 success=39

failures=1 success=39

failures=2 success=39

failures=2 success=40

failures=2 success=41

failures=2 success=42

failures=2 success=43

failures=2 success=44

There are two failures in this run. The first is the most interesting: script/await-release incorrectly exited before everything was ready. Here’s a gist with script/await-release and the log from the last inner loop.

The log text has long lines. There are annotations at the end of each line to aid understanding and analysis. Unfortunately Medium’s default width is too small, so scroll right to pick them up or open the gist on GitHub.