L7 Circuit Breaking

Cilium Service Mesh defines a CiliumClusterwideEnvoyConfig CRD that allows users to configure the Envoy component built into Cilium agents.

Circuit breaking is an important pattern for creating resilient microservice applications. Circuit breaking allows you to write applications that limit the impact of failures, latency spikes, and other undesirable effects of network peculiarities.

In this example, you will configure circuit-breaking rules with CiliumClusterwideEnvoyConfig and then test the configuration by intentionally “tripping” the circuit breaker.

Deploy Test Applications

$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/1.16.4/examples/kubernetes/servicemesh/envoy/test-application-proxy-circuit-breaker.yaml

The test workloads consist of:

  • One client Deployment, fortio-deploy

  • One Service, echo-service

View information about these Pods:

$ kubectl get pods --show-labels -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                       NOMINATED NODE   READINESS GATES   LABELS
echo-service-59557f5857-xh84s    2/2     Running   0          7m37s   10.0.0.125   cilium-control-plane   <none>           <none>            kind=echo,name=echo-service,other=echo,pod-template-hash=59557f5857
fortio-deploy-687945c6dc-6qnh4   1/1     Running   0          7m37s   10.0.0.109   cilium-control-plane   <none>           <none>            app=fortio,pod-template-hash=687945c6dc
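
You can also confirm that the Service exists. This is an optional sanity check and assumes the resources were deployed into the default namespace:

$ kubectl get svc echo-service

The Service should expose port 8080, the port that fortio targets later in this guide.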

Configuring Envoy Circuit Breaker

Apply the envoy-circuit-breaker.yaml file, which defines a CiliumClusterwideEnvoyConfig.

$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/1.16.4/examples/kubernetes/servicemesh/envoy/envoy-circuit-breaker.yaml

Note

These Envoy resources are not validated by Kubernetes at all, so any errors in them are only seen by the Cilium Agents observing these CRDs. This means that kubectl apply reports success even though parsing and/or installing the resources for the node-local Envoy instance may have failed. Currently, the only way to verify this is to inspect the Cilium Agent logs for errors and warnings. Additionally, the Cilium Agent prints warning logs for any conflicting Envoy resources in the cluster.

Note

The Cilium Ingress controller configures the required Envoy resources under the hood. If you create Envoy resources explicitly, check the Cilium Agent logs to make sure there is no conflict.
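
Since such errors only surface in the agent logs, it can help to scan them directly. A minimal sketch, assuming Cilium runs in the kube-system namespace with the standard k8s-app=cilium label:

$ kubectl -n kube-system logs -l k8s-app=cilium --tail=-1 | grep -i envoy | grep -iE 'error|warn'

No output means no Envoy-related errors or warnings were logged.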

Verify the CiliumClusterwideEnvoyConfig was created correctly.

$ kubectl get ccec envoy-circuit-breaker -oyaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideEnvoyConfig
...
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: "default/echo-service"
  connect_timeout: 5s
  lb_policy: ROUND_ROBIN
  type: EDS
  circuit_breakers:
    thresholds:
    - priority: "DEFAULT"
      max_requests: 2
      max_pending_requests: 1
  outlier_detection:
    split_external_local_origin_errors: true
    consecutive_local_origin_failure: 2
services:
- name: echo-service
  namespace: default

In the CiliumClusterwideEnvoyConfig settings, you specified max_pending_requests: 1 and max_requests: 2. These thresholds mean that if more than two requests are in flight at once, or more than one request is waiting in the queue, Envoy opens the circuit and fails the excess requests and connections.
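
These thresholds map directly onto Envoy's cluster-level circuit_breakers configuration shown above. As an illustrative sketch only (the numbers are examples, not recommendations), a more permissive variant could also cap connections explicitly:

circuit_breakers:
  thresholds:
  - priority: "DEFAULT"
    max_connections: 4
    max_pending_requests: 2
    max_requests: 4

max_connections, max_pending_requests, and max_requests are standard Envoy threshold fields; raising them allows more concurrency before the circuit opens.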

Tripping Envoy Circuit Breaker

Store the name of the fortio Pod in an environment variable:

$ export FORTIO_POD=$(kubectl get pods -l app=fortio -o 'jsonpath={.items[0].metadata.name}')
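
To confirm that the variable is set, echo it; it should print the fortio Pod name from the earlier listing:

$ echo $FORTIO_POD
fortio-deploy-687945c6dc-6qnh4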

Use the following command to call the Service with two concurrent connections (the -c 2 flag) and send 20 requests in total (the -n 20 flag):

$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 2 -qps 0 -n 20 http://echo-service:8080

Output:

$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 2 -qps 0 -n 20 http://echo-service:8080
{"ts":1692767216.838976,"level":"info","file":"scli.go","line":107,"msg":"Starting Φορτίο 1.57.3 h1:kdPlBiws3cFsLcssZxCt2opFmHj14C3yPBokFhMWzmg= go1.20.6 amd64 linux"}
Fortio 1.57.3 running at 0 queries per second, 4->4 procs, for 20 calls: http://echo-service:8080
{"ts":1692767216.839520,"level":"info","file":"httprunner.go","line":100,"msg":"Starting http test","run":"0","url":"http://echo-service:8080","threads":"2","qps":"-1.0","warmup":"parallel","conn-reuse":""}
Starting at max qps with 2 thread(s) [gomax 4] for exactly 20 calls (10 per thread + 0)
{"ts":1692767216.842149,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"1","run":"0"}
{"ts":1692767216.854289,"level":"info","file":"periodic.go","line":832,"msg":"T001 ended after 13.462339ms : 10 calls. qps=742.8129688310479"}
{"ts":1692767216.854985,"level":"info","file":"periodic.go","line":832,"msg":"T000 ended after 14.158587ms : 10 calls. qps=706.2851681456631"}
Ended after 14.197088ms : 20 calls. qps=1408.7
{"ts":1692767216.855035,"level":"info","file":"periodic.go","line":564,"msg":"Run ended","run":"0","elapsed":"14.197088ms","calls":"20","qps":"1408.739595049351"}
Aggregated Function Time : count 20 avg 0.0013703978 +/- 0.000461 min 0.00092124 max 0.002696039 sum 0.027407957
# range, mid point, percentile, count
>= 0.00092124 <= 0.001 , 0.00096062 , 10.00, 2
> 0.001 <= 0.002 , 0.0015 , 90.00, 16
> 0.002 <= 0.00269604 , 0.00234802 , 100.00, 2
# target 50% 0.0015
# target 75% 0.0018125
# target 90% 0.002
# target 99% 0.00262644
# target 99.9% 0.00268908
Error cases : count 1 avg 0.00133143 +/- 0 min 0.00133143 max 0.00133143 sum 0.00133143
# range, mid point, percentile, count
>= 0.00133143 <= 0.00133143 , 0.00133143 , 100.00, 1
# target 50% 0.00133143
# target 75% 0.00133143
# target 90% 0.00133143
# target 99% 0.00133143
# target 99.9% 0.00133143
# Socket and IP used for each connection:
[0]   1 socket used, resolved to 10.96.182.43:8080, connection timing : count 1 avg 0.000426815 +/- 0 min 0.000426815 max 0.000426815 sum 0.000426815
[1]   2 socket used, resolved to 10.96.182.43:8080, connection timing : count 2 avg 0.0004071275 +/- 0.0001215 min 0.000285596 max 0.000528659 sum 0.000814255
Connection time histogram (s) : count 3 avg 0.00041369 +/- 9.966e-05 min 0.000285596 max 0.000528659 sum 0.00124107
# range, mid point, percentile, count
>= 0.000285596 <= 0.000528659 , 0.000407128 , 100.00, 3
# target 50% 0.000346362
# target 75% 0.00043751
# target 90% 0.0004922
# target 99% 0.000525013
# target 99.9% 0.000528294
Sockets used: 3 (for perfect keepalive, would be 2)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
10.96.182.43:8080: 3
Code 200 : 19 (95.0 %)
Code 503 : 1 (5.0 %)
Response Header Sizes : count 20 avg 370.5 +/- 85 min 0 max 390 sum 7410
Response Body/Total Sizes : count 20 avg 2340.15 +/- 465.7 min 310 max 2447 sum 46803
All done 20 calls (plus 0 warmup) 1.370 ms avg, 1408.7 qps

From the output above, you can see that one of the 20 requests received a 503 response code: it was rejected by the circuit breaker.
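
As a counter-check, you can run the same load with a single connection. With max_pending_requests: 1 and max_requests: 2, one connection stays within the thresholds, so you would expect all 20 requests to return 200:

$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 1 -qps 0 -n 20 http://echo-service:8080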

Bring the number of concurrent connections up to 4.

Output:

$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 4 -qps 0 -n 20 http://echo-service:8080
{"ts":1692767495.818546,"level":"info","file":"scli.go","line":107,"msg":"Starting Φορτίο 1.57.3 h1:kdPlBiws3cFsLcssZxCt2opFmHj14C3yPBokFhMWzmg= go1.20.6 amd64 linux"}
Fortio 1.57.3 running at 0 queries per second, 4->4 procs, for 20 calls: http://echo-service:8080
{"ts":1692767495.819105,"level":"info","file":"httprunner.go","line":100,"msg":"Starting http test","run":"0","url":"http://echo-service:8080","threads":"4","qps":"-1.0","warmup":"parallel","conn-reuse":""}
Starting at max qps with 4 thread(s) [gomax 4] for exactly 20 calls (5 per thread + 0)
{"ts":1692767495.822424,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"3","run":"0"}
{"ts":1692767495.822428,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"0","run":"0"}
{"ts":1692767495.822603,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"1","run":"0"}
{"ts":1692767495.823855,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"0","run":"0"}
{"ts":1692767495.825250,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"1","run":"0"}
{"ts":1692767495.825285,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"0","run":"0"}
{"ts":1692767495.827282,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"0","run":"0"}
{"ts":1692767495.827514,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"2","run":"0"}
{"ts":1692767495.829886,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"0","run":"0"}
{"ts":1692767495.830156,"level":"info","file":"periodic.go","line":832,"msg":"T000 ended after 9.136284ms : 5 calls. qps=547.268451812575"}
{"ts":1692767495.830326,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"2","run":"0"}
{"ts":1692767495.831175,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"3","run":"0"}
{"ts":1692767495.832826,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"3","run":"0"}
{"ts":1692767495.834028,"level":"warn","file":"http_client.go","line":1104,"msg":"Non ok http code","code":"503","status":"HTTP/1.1 503","thread":"3","run":"0"}
{"ts":1692767495.834116,"level":"info","file":"periodic.go","line":832,"msg":"T003 ended after 13.09904ms : 5 calls. qps=381.7073617608619"}
{"ts":1692767495.834865,"level":"info","file":"periodic.go","line":832,"msg":"T001 ended after 13.846811ms : 5 calls. qps=361.09397318992796"}
{"ts":1692767495.835370,"level":"info","file":"periodic.go","line":832,"msg":"T002 ended after 14.352324ms : 5 calls. qps=348.3756358900482"}
Ended after 14.386516ms : 20 calls. qps=1390.2
{"ts":1692767495.835489,"level":"info","file":"periodic.go","line":564,"msg":"Run ended","run":"0","elapsed":"14.386516ms","calls":"20","qps":"1390.1906479650806"}
Aggregated Function Time : count 20 avg 0.0024801033 +/- 0.001782 min 0.000721482 max 0.008055527 sum 0.049602066
# range, mid point, percentile, count
>= 0.000721482 <= 0.001 , 0.000860741 , 10.00, 2
> 0.001 <= 0.002 , 0.0015 , 45.00, 7
> 0.002 <= 0.003 , 0.0025 , 80.00, 7
> 0.003 <= 0.004 , 0.0035 , 85.00, 1
> 0.005 <= 0.006 , 0.0055 , 95.00, 2
> 0.008 <= 0.00805553 , 0.00802776 , 100.00, 1
# target 50% 0.00214286
# target 75% 0.00285714
# target 90% 0.0055
# target 99% 0.00804442
# target 99.9% 0.00805442
Error cases : count 13 avg 0.0016602806 +/- 0.0006006 min 0.000721482 max 0.00281812 sum 0.021583648
# range, mid point, percentile, count
>= 0.000721482 <= 0.001 , 0.000860741 , 15.38, 2
> 0.001 <= 0.002 , 0.0015 , 61.54, 6
> 0.002 <= 0.00281812 , 0.00240906 , 100.00, 5
# target 50% 0.00175
# target 75% 0.00228634
# target 90% 0.00260541
# target 99% 0.00279685
# target 99.9% 0.00281599
# Socket and IP used for each connection:
[0]   5 socket used, resolved to 10.96.182.43:8080, connection timing : count 5 avg 0.0003044688 +/- 0.0001472 min 0.000120654 max 0.00053878 sum 0.001522344
[1]   3 socket used, resolved to 10.96.182.43:8080, connection timing : count 3 avg 0.00041437933 +/- 9.571e-05 min 0.000330279 max 0.000548277 sum 0.001243138
[2]   3 socket used, resolved to 10.96.182.43:8080, connection timing : count 3 avg 0.00041114067 +/- 0.0001352 min 0.000306734 max 0.00060203 sum 0.001233422
[3]   4 socket used, resolved to 10.96.182.43:8080, connection timing : count 4 avg 0.00038631225 +/- 0.0002447 min 0.000175125 max 0.00080311 sum 0.001545249
Connection time histogram (s) : count 15 avg 0.0003696102 +/- 0.0001758 min 0.000120654 max 0.00080311 sum 0.005544153
# range, mid point, percentile, count
>= 0.000120654 <= 0.00080311 , 0.000461882 , 100.00, 15
# target 50% 0.000437509
# target 75% 0.000620309
# target 90% 0.00072999
# target 99% 0.000795798
# target 99.9% 0.000802379
Sockets used: 15 (for perfect keepalive, would be 4)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
10.96.182.43:8080: 15
Code 200 : 7 (35.0 %)
Code 503 : 13 (65.0 %)
Response Header Sizes : count 20 avg 136.5 +/- 186 min 0 max 390 sum 2730
Response Body/Total Sizes : count 20 avg 1026.9 +/- 1042 min 241 max 2447 sum 20538
All done 20 calls (plus 0 warmup) 2.480 ms avg, 1390.2 qps

Now you can see the expected circuit-breaking behavior: only 35% of the requests succeeded, and the rest were rejected by the circuit breaker.

Code 200 : 7 (35.0 %)
Code 503 : 13 (65.0 %)
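
If you only want this summary, you can filter the fortio output down to the Code lines; redirecting stderr merges fortio's log messages so that grep sees everything:

$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 4 -qps 0 -n 20 http://echo-service:8080 2>&1 | grep '^Code'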

Cleaning up

Remove the rules.

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/1.16.4/examples/kubernetes/servicemesh/envoy/envoy-circuit-breaker.yaml

Remove the test application.

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/1.16.4/examples/kubernetes/servicemesh/envoy/test-application-proxy-circuit-breaker.yaml
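
To confirm the cleanup, check that both the configuration and the test Pods are gone:

$ kubectl get ccec envoy-circuit-breaker
$ kubectl get pods -l app=fortio

The first command should return a NotFound error, and the second should report that no resources were found.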