Monitoring & Metrics
cilium-agent and cilium-operator can be configured to serve Prometheus metrics. Prometheus is a pluggable metrics collection and storage system that can act as a data source for Grafana, a metrics visualization frontend. Unlike push-based collectors such as statsd, Prometheus pulls metrics from each source.
To run Cilium with Prometheus metrics enabled, deploy it with the global.prometheus.enabled=true Helm value set.
All metrics are exported under the cilium Prometheus namespace. When running and collecting in Kubernetes, they are tagged with a pod name and namespace.
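To illustrate what the cilium namespace means in practice, the sketch below filters Prometheus text-format output down to the samples whose metric name starts with cilium_. The cilium_samples helper and the sample payload are made up for illustration and were not captured from a real agent; the parser only handles the simple "name{labels} value" form.

```python
def cilium_samples(exposition_text, namespace="cilium"):
    """Return (name, labels, value) tuples for samples in the given namespace.

    Only the simple `name{labels} value` and `name value` forms are handled;
    timestamps and escaped label values are out of scope for this sketch.
    """
    prefix = namespace + "_"
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is whatever follows the last space on the line.
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        if name.startswith(prefix):
            samples.append((name, labels.rstrip("}"), float(value)))
    return samples


# Hypothetical scrape output (assumption, for illustration only).
SAMPLE = """\
# HELP cilium_endpoint_count Number of endpoints managed by this agent
# TYPE cilium_endpoint_count gauge
cilium_endpoint_count 4
cilium_drop_count_total{direction="INGRESS",reason="Policy denied"} 12
go_goroutines 87
"""

for name, labels, value in cilium_samples(SAMPLE):
    print(name, labels, value)
```

Metric names in the tables of this page are listed without the namespace, so endpoint_count appears in scrape output as cilium_endpoint_count.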
Installation
When deployed with the Helm value global.prometheus.enabled=true, all Cilium components carry the following annotations, which signal to Prometheus that their metrics should be scraped:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
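These annotations are a convention: they only have an effect when the Prometheus scrape configuration is set up to honor them. The sketch below mimics that decision in plain Python to show how the two values combine into a scrape target. The scrape_target helper, the pod IP, and the /metrics path are assumptions for illustration, not part of Cilium or Prometheus.

```python
def scrape_target(pod_ip, annotations, path="/metrics"):
    """Return the URL a scraper would poll, or None if scraping is not requested.

    `path` defaults to /metrics, which is an assumption of this sketch.
    """
    # Annotation values are strings, so compare against "true", not True.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None
    return "http://{}:{}{}".format(pod_ip, int(port), path)


# Hypothetical pod IP; the annotations are the ones listed above.
annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9090",
}
print(scrape_target("10.0.0.12", annotations))
print(scrape_target("10.0.0.12", {}))
```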
Example Prometheus & Grafana Deployment
If you don’t have an existing Prometheus and Grafana stack running, you can
deploy a stack with:
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.7/examples/kubernetes/addons/prometheus/monitoring-example.yaml
It will run Prometheus and Grafana in the cilium-monitoring namespace. You can then expose Grafana to access it via your browser:
kubectl -n cilium-monitoring port-forward service/grafana 3000:3000
Open your browser and access http://localhost:3000/
cilium-agent
To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr option. This option takes an IP:Port pair, but passing an empty IP (e.g. :9090) will bind the server to all available interfaces (there is usually only one in a container).
An example can be found in examples/kubernetes/addons/prometheus/monitoring-example.yaml
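The IP:Port convention for --prometheus-serve-addr can be sketched as follows. This is a Python illustration of how such a value splits into host and port, not the agent's own parsing code; split_serve_addr is a hypothetical helper.

```python
def split_serve_addr(addr):
    """Split an `IP:Port` string into (host, port).

    An empty host means "all available interfaces", as with `:9090`.
    """
    host, sep, port = addr.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError("expected IP:Port, got %r" % (addr,))
    # Drop the brackets around IPv6 literals such as "[::1]:9090".
    host = host.strip("[]")
    return host, int(port)


print(split_serve_addr(":9090"))
print(split_serve_addr("127.0.0.1:9090"))
```

An empty host is also what Python's own socket API uses for "any interface" on IPv4, so the returned tuple could be passed to socket.bind() in a test harness.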
Exported Metrics
Endpoint
Name | Labels | Description |
endpoint_count | | Number of endpoints managed by this agent |
endpoint_regenerations | outcome | Count of all endpoint regenerations that have completed |
endpoint_regeneration_time_stats_seconds | scope | Endpoint regeneration time stats |
endpoint_state | state | Count of all endpoints |
Services
Name | Labels | Description |
services_events_total | | Number of services events labeled by action type |
Datapath
Name | Labels | Description |
datapath_errors_total | area, name, family | Total number of errors that occurred in datapath management |
datapath_conntrack_gc_runs_total | status | Number of times that the conntrack garbage collector process was run |
datapath_conntrack_gc_key_fallbacks_total | | Number of times a key fallback was needed when iterating over the BPF map |
datapath_conntrack_gc_entries | family | The number of alive and deleted conntrack entries at the end of a garbage collector run |
datapath_conntrack_gc_duration_seconds | status | Duration in seconds of the garbage collector process |
BPF
Name | Labels | Description |
bpf_syscall_duration_seconds | operation, outcome | Duration of BPF system calls performed |
bpf_map_ops_total | mapName, operation, outcome | Number of BPF map operations performed |
Drops/Forwards (L3/L4)
Name | Labels | Description |
drop_count_total | reason, direction | Total dropped packets |
drop_bytes_total | reason, direction | Total dropped bytes |
forward_count_total | direction | Total forwarded packets |
forward_bytes_total | direction | Total forwarded bytes |
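The drop and forward metrics are running totals, so a single reading says little on its own; what usually matters is how fast the value grows between two scrapes. The sketch below computes a per-second rate from two readings. It is a simplified stand-in for what a query-side function such as PromQL's rate() does, and the counter_rate helper and the sample numbers are assumptions for illustration.

```python
def counter_rate(prev_value, prev_ts, cur_value, cur_ts):
    """Per-second increase of a counter between two scrapes.

    Timestamps are in seconds. No extrapolation is done, unlike PromQL rate().
    """
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    # A counter that went backwards was reset (for example by an agent
    # restart), so treat the current value as the increase since the reset.
    if delta < 0:
        delta = cur_value
    return delta / elapsed


# Hypothetical readings of drop_count_total taken 30 seconds apart.
print(counter_rate(100, 0, 160, 30))
```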
Policy
Name | Labels | Description |
policy_count | | Number of policies currently loaded |
policy_regeneration_total | | Total number of policies regenerated successfully |
policy_regeneration_time_stats_seconds | scope | Policy regeneration time stats labeled by the scope |
policy_max_revision | | Highest policy revision number in the agent |
policy_import_errors | | Number of times a policy import has failed |
policy_endpoint_enforcement_status | | Number of endpoints labeled by policy enforcement status |
Policy L7 (HTTP/Kafka)
Name | Labels | Description |
proxy_redirects | protocol | Number of redirects installed for endpoints |
proxy_upstream_reply_seconds | | Seconds waited for upstream server to reply to a request |
policy_l7_total | type | Number of total L7 requests/responses |
Identity
Name | Labels | Description |
identity_count | | Number of identities currently allocated |
Events external to Cilium
Name | Labels | Description |
event_ts | source | Last timestamp when we received an event |
Controllers
Name | Labels | Description |
controllers_runs_total | status | Number of times that a controller process was run |
controllers_runs_duration_seconds | status | Duration in seconds of the controller process |
SubProcess
Name | Labels | Description |
subprocess_start_total | subsystem | Number of times that Cilium has started a subprocess |
Kubernetes
Name | Labels | Description |
kubernetes_events_received_total | scope, action, validity, equal | Number of Kubernetes events received |
kubernetes_events_total | scope, action, outcome | Number of Kubernetes events processed |
k8s_cnp_status_completion_seconds | attempts, outcome | Duration in seconds of completing a CNP status update |
IPAM
Name | Labels | Description |
ipam_events_total | | Number of IPAM events received labeled by action and datapath family type |
KVstore
Name | Labels | Description |
kvstore_operations_duration_seconds | action, kind, outcome, scope | Duration of kvstore operation |
kvstore_events_queue_seconds | action, scope | Duration in seconds that a received event was blocked before it could be queued |
kvstore_quorum_errors_total | error | Number of quorum errors |
Agent
Name | Labels | Description |
agent_bootstrap_seconds | scope, outcome | Duration of various bootstrap phases |
api_process_time_seconds | | Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code. |
API Rate Limiting
Name | Labels | Description |
cilium_api_limiter_adjustment_factor | api_call | Most recent adjustment factor for automatic adjustment |
cilium_api_limiter_processed_requests_total | api_call, outcome | Total number of API requests processed |
cilium_api_limiter_processing_duration_seconds | api_call, value | Mean and estimated processing duration in seconds |
cilium_api_limiter_rate_limit | api_call, value | Current rate limiting configuration (limit and burst) |
cilium_api_limiter_requests_in_flight | api_call, value | Current and maximum allowed number of requests in flight |
cilium_api_limiter_wait_duration_seconds | api_call, value | Mean, min, and max wait duration |
cilium_api_limiter_wait_history_duration_seconds | api_call | Histogram of wait duration per API call processed |
FQDN
Name | Labels | Description |
fqdn_gc_deletions_total | | Number of FQDNs that have been cleaned on FQDN garbage collector job |
cilium-operator
cilium-operator can be configured to serve metrics by running with the option --enable-metrics. By default, the operator will expose metrics on port 6942; the port can be changed with the option --metrics-address.
Exported Metrics
All metrics are exported under the cilium_operator_ Prometheus namespace.
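Because the operator tables list metric names without the namespace, queries need the prefix added. The sketch below builds a PromQL-style selector from a namespace, a metric name, and optional label matchers. The selector helper is hypothetical, the label value "used" is an assumed example rather than a documented value, and label values are not escaped.

```python
def selector(namespace, name, **labels):
    """Build a PromQL instant-vector selector for a namespaced metric."""
    # Accept the namespace with or without its trailing underscore.
    full_name = namespace.rstrip("_") + "_" + name
    if not labels:
        return full_name
    # Sort label names so the output is deterministic.
    matchers = ",".join(
        '{}="{}"'.format(key, value) for key, value in sorted(labels.items())
    )
    return full_name + "{" + matchers + "}"


# "used" is an assumed label value, chosen only for illustration.
print(selector("cilium_operator_", "eni_ips", type="used"))
print(selector("cilium_operator_", "eni_nodes_at_capacity"))
```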
ENI
Name | Labels | Description |
eni_ips | type | Number of IPs allocated |
eni_allocation_ops | subnetId | Number of IP allocation operations |
eni_interface_creation_ops | subnetId, status | Number of ENIs allocated |
eni_available | | Number of ENIs with addresses available |
eni_nodes_at_capacity | | Number of nodes unable to allocate more addresses |
eni_aws_api_duration_seconds | operation, responseCode | Duration of interactions with AWS API |
eni_resync_total | | Number of synchronization operations to synchronize AWS EC2 metadata |
eni_ec2_rate_limit | operation | Number of times the EC2 client rate limiter kicked in |