I was preparing a demo of the Knative and Istio features of IBM Cloud Pak for Applications on top of OpenShift Container Platform (OCP) 4.2. Everything went smoothly, and the Knative service was able to scale down to zero pods and activate automatically when traffic came in, until I turned on the tracing capability and started visualizing the traffic with Kiali.

1. Turn on Kiali

To turn on Kiali, modify the ServiceMeshControlPlane custom resource named basic-install. For tracing to work, the Envoy sidecar is a must, as it captures the tracing and monitoring data. Let's also enable the sidecarInjectorWebhook so that the sidecar is injected into the Pod upon its creation.

spec:
  istio:
    kiali:
      enabled: true
    tracing:
      enabled: true
    global:
      defaultPodDisruptionBudget:
        enabled: false
      disablePolicyChecks: false
      multitenant: true
      omitSidecarInjectorConfigMap: false
      proxy:
        autoInject: disabled
    grafana:
      enabled: true
    sidecarInjectorWebhook:
      enabled: true
    mixer:
      enabled: true
      policy:
        enabled: false
      telemetry:
        enabled: true
    prometheus:
      enabled: true
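The change can be made by editing the custom resource in place. A minimal sketch, assuming the resource keeps the default name basic-install in the istio-system namespace:

```shell
# Open the ServiceMeshControlPlane for editing; the Operator
# reconciles the mesh components after the change is saved.
oc -n istio-system edit smcp basic-install
```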

Notice that telemetry and prometheus are enabled for monitoring metrics.

After the changes, the Operator magic starts to work and the Kiali UI becomes available. The webhook starts to inject the sidecar into the Pods of every Namespace that joins the service mesh. Everything seems good.
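To double-check that injection is happening, one can look at the container count of a pod in a mesh member namespace. A sketch (the pod name placeholder is mine):

```shell
# An injected pod reports 2/2 READY: the application container
# plus the istio-proxy sidecar.
oc -n knative-serving get pods

# List the container names of one pod to confirm istio-proxy is there.
oc -n knative-serving get pod <pod-name> \
  -o jsonpath='{.spec.containers[*].name}'
```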

However…

2. Knative activator goes into a crash loop

As the knative-serving namespace is a member of the service mesh, the OpenShift Serverless operator reconciles the deployment of the Knative activator to carry the following annotation,

sidecar.istio.io/inject: 'true'

The activator pod now has the istio sidecar injected, and it is no longer able to function: it fails at the very first startup phase with the following error.

Error loading/parsing logging configuration:Get https://172.30.0.1:443/api/v1/namespaces/knative-serving/configmaps/config-logging: http: server gave HTTP response to HTTPS client

As the activator fails, the functionality it serves, listed below, fails with it, and the Knative service no longer works.

- Receiving and buffering requests for inactive Revisions.
- Reporting metrics to the autoscaler.
- Retrying requests to a Revision after the autoscaler scales that Revision based on the reported metrics.

3. Simulate the problem

Let's try to reproduce the problem. In the same namespace, create a debug pod using the same service account as the activator pod.

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: debug
  annotations:
    sidecar.istio.io/inject: 'true'
  name: debug
spec:
  serviceAccountName: controller
  containers:
  - image: curlimages/curl
    name: debug
    command:
    - sh
    - -c
    - 'sleep 3600; exit'
  restartPolicy: Never
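Assuming the manifest above is saved as debug-pod.yaml (the file name is mine), create the pod and wait until it is ready:

```shell
oc -n knative-serving apply -f debug-pod.yaml
# The sidecar adds a little startup time, so wait for readiness.
oc -n knative-serving wait --for=condition=Ready pod/debug --timeout=120s
```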

Then exec into the pod and run the following commands,

export CACERT=/run/secrets/kubernetes.io/serviceaccount/ca.crt
export TOKEN=$(cat /run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer $TOKEN" --cacert $CACERT https://172.30.0.1/api/v1/namespaces/knative-serving/configmaps/config-logging

Then I get this error,

curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number

It is the same as in the activator pod: a plain HTTP response is given back to an HTTPS request.
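This symptom can be reproduced locally without any cluster. The sketch below (assuming python3 and curl are available) serves plain HTTP on a local port and then speaks TLS to it:

```shell
# Start a plain-HTTP server in the background on an arbitrary free port.
python3 -m http.server 18080 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Speak TLS to the plain-HTTP server: the handshake fails because the
# server answers with plain HTTP bytes where a TLS record is expected.
curl -s https://localhost:18080
RC=$?
echo "curl exit code: $RC"   # 35, the same error class as in the pod

kill $SERVER_PID
```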

Test 1. Is the normal http functioning?

The answer, as shown below, is yes.

kubectl -n knative-serving exec debug -i -c debug -- sh -c 'curl http://httpbin.org/headers'
...
{
  "headers": {
    "Accept": "*/*",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "curl/7.69.1-DEV",
    "X-Amzn-Trace-Id": "Root=1-5e747c1d-41fbfb8274a15620acf0d30b",
    "X-B3-Sampled": "0",
    "X-B3-Spanid": "c645c09f1ab9c9c0",
    "X-B3-Traceid": "685234a1177a2ebcc645c09f1ab9c9c0",
    "X-Envoy-Expected-Rq-Timeout-Ms": "15000",
    "X-Istio-Attributes": "...nVnLmtuYXRpdmUtc2VydmluZw=="
  }
}

Test 2. Is the https working?

kubectl -n knative-serving exec debug -i -c debug -- sh -c 'curl https://httpbin.org/headers'
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
command terminated with exit code 35

The https is not working.

Test 3. Is the http in the service mesh working?

The tests above target an external service. In the activator's error message, 172.30.0.1 refers to the kubernetes.default service. The default namespace is actually not in the service mesh, as shown by this command,

oc -n istio-system get smmr

NAME      MEMBERS
default   [knative-serving tekton-pipelines kabanero plant-by-websphere plant-by-websphere-dev redis-service]

So let's pick a service inside the service mesh, say in knative-serving,

oc -n knative-serving get svc
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                     AGE
activator-service   ClusterIP   172.30.99.179    <none>        80/TCP,81/TCP,9090/TCP      10d
autoscaler          ClusterIP   172.30.108.136   <none>        8080/TCP,9090/TCP,443/TCP   10d
controller          ClusterIP   172.30.218.61    <none>        9090/TCP                    10d
webhook             ClusterIP   172.30.125.111   <none>        443/TCP                     10d

Select the autoscaler service.



kubectl -n knative-serving exec debug -i -c debug -- sh -c 'curl -s http://autoscaler.knative-serving:8080'
Bad Request

The HTTP server responded, so the in-mesh http service is working.

Test 4. Is the https in the service mesh working?

kubectl -n knative-serving exec debug -i -c debug -- sh -c 'curl -ks https://autoscaler.knative-serving'
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": { },
  "code": 403
}

So https in-mesh is also working.

4. Work around the problem

With the above set of tests, the behavior of the problem is clear: an HTTPS service outside of the service mesh cannot be accessed properly. The workaround is quite simple: let the default namespace join the service mesh.

Edit the ServiceMeshMemberRoll to add the default namespace. Afterwards we should see something like the below,
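For illustration, a sketch of the edited ServiceMeshMemberRoll manifest (the apiVersion is assumed from OpenShift Service Mesh 1.x):

```yaml
apiVersion: maistra.io/v1
kind: ServiceMeshMemberRoll
metadata:
  name: default
  namespace: istio-system
spec:
  members:
  - knative-serving
  - tekton-pipelines
  - kabanero
  - plant-by-websphere
  - plant-by-websphere-dev
  - redis-service
  - default
```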

oc -n istio-system get smmr default
NAME      MEMBERS
default   [knative-serving tekton-pipelines kabanero plant-by-websphere plant-by-websphere-dev redis-service default]

After a short while, the activator pod goes back to normal. Phew!

5. The real culprit

A workaround is just a workaround, though. What is the real problem? After some googling, I found this issue (https://github.com/istio/istio/issues/14264), which describes the real problem.

First, identify the proxy.

istioctl proxy-status

NAME                                                            CDS      LDS      EDS      RDS      PILOT                          VERSION
activator-67f8b69697-sqwgb.knative-serving                      SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
cluster-local-gateway-6647b96c4d-8t9bp.istio-system             SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
debug.knative-serving                                           SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
istio-ingressgateway-8465bbf788-x52pk.istio-system              SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
pbw-dev-v1-deployment-5bc69b6f49-gfqqd.plant-by-websphere-dev   SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
pbw-v1-deployment-5bc474485b-vgszq.plant-by-websphere           SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*
pbw-v2-deployment-6c7b986d8-bgbc4.plant-by-websphere            SYNCED   SYNCED   SYNCED   SYNCED   istio-pilot-59fc69bd66-t7zbr   1.1.11*

Then, look specifically for the route named 443,

istioctl -n knative-serving proxy-config routes activator-67f8b69697-sqwgb.knative-serving --name 443 -o json

[
    {
        "name": "443",
        "virtualHosts": [
            {
                "name": "tekton-dashboard.tekton-pipelines.svc.cluster.local:443",
                "domains": [
                    "tekton-dashboard.tekton-pipelines.svc.cluster.local",
                    "tekton-dashboard.tekton-pipelines.svc.cluster.local:443",
                    "tekton-dashboard.tekton-pipelines",
                    "tekton-dashboard.tekton-pipelines:443",
                    "tekton-dashboard.tekton-pipelines.svc.cluster",
                    "tekton-dashboard.tekton-pipelines.svc.cluster:443",
                    "tekton-dashboard.tekton-pipelines.svc",
                    "tekton-dashboard.tekton-pipelines.svc:443",
                    "172.30.118.86",
                    "172.30.118.86:443"
                ],
...

Notice that there is an HTTP route configuration named 443: outbound traffic to this service on port 443 is being routed as plain HTTP. Check the K8s service,

kubectl -n tekton-pipelines get svc tekton-dashboard -o yaml
...
spec:
  clusterIP: 172.30.118.86
  ports:
  - name: http
    port: 443
    protocol: TCP
    targetPort: 8443
  selector:
    app: tekton-dashboard
  sessionAffinity: None
  type: ClusterIP
...

Port 443 is named http. This is the culprit: Istio infers the protocol from the port name, so the sidecars treat outbound traffic on port 443 as plain HTTP. Editing the service and renaming the port from "http" to "https" resolves the problem. (We no longer need the workaround of adding the default namespace to the service mesh.)
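The rename can also be done as a one-line patch. A sketch, assuming the port to rename is the first entry (index 0) in the list, as in the spec above:

```shell
# Rename the service port from "http" to "https" so that Istio's
# port-name-based protocol detection treats port 443 as TLS, not HTTP.
kubectl -n tekton-pipelines patch svc tekton-dashboard --type=json \
  -p '[{"op": "replace", "path": "/spec/ports/0/name", "value": "https"}]'
```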

The root cause on the Istio side is tracked in https://github.com/istio/istio/issues/16458, which is still open at the time of writing.