Solution: Monitoring Amazon EKS Infrastructure
Monitoring Amazon Elastic Kubernetes Service infrastructure is one of the most common scenarios for which Amazon Managed Grafana is used. This page describes a template that provides a solution for this scenario. The solution can be installed using the AWS Cloud Development Kit (AWS CDK) or Terraform.
This solution configures an Amazon Managed Grafana workspace to provide metrics for your Amazon EKS cluster. The metrics are used to generate dashboards and alerts.
The metrics help you to operate Amazon EKS clusters more effectively by providing insights into the health and performance of the Kubernetes control and data plane. You can understand your Amazon EKS cluster from the node level, to pods, down to the Kubernetes level, including detailed monitoring of resource usage.
The following image shows a sample of the dashboard folder for the solution.
You can choose a dashboard to see more details. For example, choosing to view the Compute Resources for workloads shows a dashboard like the one in the following image.
The metrics are scraped with a 1-minute scrape interval. The dashboards show metrics aggregated to 1 minute, 5 minutes, or more, based on the specific metric.
Logs are shown in dashboards, as well, so that you can query and analyze logs to find root causes of issues. The following image shows a log dashboard.
This solution creates and uses resources in your AWS account. You will be charged for standard usage of the resources created, including Amazon Managed Service for Prometheus (including its managed collector), Amazon Managed Grafana, and Amazon CloudWatch Logs.
The pricing calculators, available from the pricing page for each product, can help you understand potential costs for your solution. The following information can help you calculate a base cost for the solution running in the same Availability Zone as the Amazon EKS cluster.
Product | Calculator metric | Value |
---|---|---|
Amazon Managed Service for Prometheus | Active series | 8,000 (base) + 15,000 (per node) |
Amazon Managed Service for Prometheus | Avg Collection Interval | 60 (seconds) |
Amazon Managed Service for Prometheus (managed collector) | Number of collectors | 1 |
Amazon Managed Service for Prometheus (managed collector) | Number of samples | 15 (base) + 150 (per node) |
Amazon Managed Service for Prometheus (managed collector) | Number of rules | 161 |
Amazon Managed Service for Prometheus (managed collector) | Average rules extraction interval | 60 (seconds) |
Amazon Managed Grafana | Number of active editors/administrators | 1 (or more, based on your users) |
CloudWatch (Logs) | Standard Logs: Data ingested | 24.5 GB (base) + 0.5 GB (per node) |
CloudWatch (Logs) | Log Storage/Archival (Standard and Vended Logs) | Yes to store logs: Assuming 1 month retention |
CloudWatch (Logs) | Expected Logs Data Scanned | Each log insights query from Grafana will scan all log contents from the group over the specified time period. |
These numbers are the base numbers for a solution running Amazon EKS with no additional software installed, and they give you an estimate of the base costs. They also leave out network usage costs, which will vary based on whether the Amazon Managed Grafana workspace, Amazon Managed Service for Prometheus workspace, and Amazon EKS cluster are in the same Availability Zone, AWS Region, and VPC.
Note

When an item in this table includes a (base) value and a per-resource value (for example, (per node)), add the base value to the per-resource value multiplied by the number of that resource you have. For example, for Average active time series, enter 8,000 + (15,000 * the number of nodes in your cluster). If you have 2 nodes, you would enter 38,000, which is 8,000 + (2 * 15,000).
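If you prefer to compute these inputs rather than do the arithmetic by hand, the following is a minimal sketch in bash (the NODES value is yours to set; the base and per-node figures come from the table above) that prints the numbers to enter into the pricing calculators:

```bash
#!/usr/bin/env bash
# Rough pricing-calculator inputs for this solution, using the (base) and
# (per node) values from the table above. NODES is the number of nodes in
# your Amazon EKS cluster.
NODES=2

# Amazon Managed Service for Prometheus
echo "Active series:            $(( 8000 + NODES * 15000 ))"
echo "Samples per second:       $(( 15 + NODES * 150 ))"

# CloudWatch Logs ingestion (GB per month); bash has no floats, so use awk.
awk -v n="$NODES" 'BEGIN { printf "Logs ingested (GB/month): %.1f\n", 24.5 + n * 0.5 }'
```

With NODES=2, this prints 38,000 active series, matching the worked example in the note above.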
Before using this solution, make sure that you have completed all of the prerequisites.
This solution configures AWS infrastructure to support reporting and monitoring metrics from an Amazon EKS cluster. You can install it using either the AWS Cloud Development Kit (AWS CDK) or Terraform.
This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those metrics are stored in Amazon Managed Service for Prometheus, and then displayed in Amazon Managed Grafana dashboards. By default, the scraper collects all Prometheus-compatible metrics that are exposed by the cluster. Installing software in your cluster that produces more metrics will increase the metrics collected. If you want, you can reduce the number of metrics by updating the scraper with a configuration that filters the metrics.
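If you do want to filter, the scraper accepts a Prometheus-compatible scrape configuration, so one common approach is a metric_relabel_configs rule that keeps only the metric names you care about. The snippet below is only a sketch under that assumption; the job name and regex are illustrative, and how you apply the updated configuration to your scraper depends on how you installed the solution.

```bash
# Sketch only: a Prometheus-style metric filter you could merge into the
# scraper configuration. The job name and regex are illustrative.
cat > scraper-filter-snippet.yaml <<'EOF'
scrape_configs:
  - job_name: kubelet            # illustrative job name
    metric_relabel_configs:
      # Keep only kubelet and node metrics; drop everything else this job scrapes.
      - source_labels: [__name__]
        regex: 'kubelet_.*|node_.*'
        action: keep
EOF
```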
The following metrics are tracked with this solution, in a base Amazon EKS cluster configuration with no additional software installed.
Metric | Description / Purpose |
---|---|
 | Gauge of APIServices which are marked as unavailable broken down by APIService name. |
 | Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit). |
 | Maximal number of currently used inflight request limit of this apiserver per request kind in last second. |
 | Percent of the cache slots currently occupied by cached DEKs. |
 | Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem. |
 | Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem that were rejected. |
 | Nominal number of execution seats configured for each priority level. |
 | The bucketed histogram of duration of initial stage (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem. |
 | The count of initial stage (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem. |
 | Indicates an API server request. |
 | Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release. |
 | Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component. |
 | The bucketed histogram of response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component. |
 | The Service Level Objective (SLO) response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component. |
 | Number of requests which apiserver terminated in self-defense. |
 | Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code. |
 | Cumulative cpu time consumed. |
 | Cumulative count of bytes read. |
 | Cumulative count of reads completed. |
 | Cumulative count of bytes written. |
 | Cumulative count of writes completed. |
 | Total page cache memory. |
 | Size of RSS. |
 | Container swap usage. |
 | Current working set. |
 | Cumulative count of bytes received. |
 | Cumulative count of packets dropped while receiving. |
 | Cumulative count of packets received. |
 | Cumulative count of bytes transmitted. |
 | Cumulative count of packets dropped while transmitting. |
 | Cumulative count of packets transmitted. |
 | The bucketed histogram of etcd request latency in seconds for each operation and object type. |
 | Number of goroutines that currently exist. |
 | Number of OS threads created. |
 | The bucketed histogram of duration in seconds for cgroup manager operations. Broken down by method. |
 | Duration in seconds for cgroup manager operations. Broken down by method. |
 | This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise. |
 | The node's name. The count is always 1. |
 | The bucketed histogram of duration in seconds for relisting pods in PLEG. |
 | The count of duration in seconds for relisting pods in PLEG. |
 | The bucketed histogram of interval in seconds between relisting in PLEG. |
 | The count of duration in seconds from kubelet seeing a pod for the first time to the pod starting to run. |
 | The bucketed histogram of duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync. |
 | The count of duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync. |
 | Number of containers currently running. |
 | Number of pods that have a running pod sandbox. |
 | The bucketed histogram of duration in seconds of runtime operations. Broken down by operation type. |
 | Cumulative number of runtime operation errors by operation type. |
 | Cumulative number of runtime operations by operation type. |
 | The amount of resources allocatable for pods (after reserving some for system daemons). |
 | The total amount of resources available for a node. |
 | The number of requested limit resource by a container. |
 | The number of requested limit resource by a container. |
 | The number of requested request resource by a container. |
 | The number of requested request resource by a container. |
 | Information about the Pod's owner. |
 | Resource quotas in Kubernetes enforce usage limits on resources such as CPU, memory, and storage within namespaces. |
 | The CPU usage metrics for a node, including usage per core and total usage. |
 | Seconds the CPUs spent in each mode. |
 | The cumulative amount of time spent performing I/O operations on disk by a node. |
 | The total amount of time spent performing I/O operations on disk by the node. |
 | The total number of bytes read from disk by the node. |
 | The total number of bytes written to disk by the node. |
 | The amount of available space in bytes on the filesystem of a node in a Kubernetes cluster. |
 | The total size of the filesystem on the node. |
 | The 1-minute load average of a node's CPU usage. |
 | The 15-minute load average of a node's CPU usage. |
 | The 5-minute load average of a node's CPU usage. |
 | The amount of memory used for buffer caching by the node's operating system. |
 | The amount of memory used for disk caching by the node's operating system. |
 | The amount of memory available for use by applications and caches. |
 | The amount of free memory available on the node. |
 | The total amount of physical memory available on the node. |
 | The total number of bytes received over the network by the node. |
 | The total number of bytes transmitted over the network by the node. |
 | Total user and system CPU time spent in seconds. |
 | Resident memory size in bytes. |
 | Number of HTTP requests, partitioned by status code, method, and host. |
 | The bucketed histogram of request latency in seconds. Broken down by verb, and host. |
 | The bucketed histogram of duration of storage operations. |
 | The count of duration of storage operations. |
 | Cumulative number of errors during storage operations. |
 | A metric indicating whether the monitored target (e.g., node) is up and running. |
 | The total number of volumes managed by the volume manager. |
 | Total number of adds handled by workqueue. |
 | Current depth of workqueue. |
 | The bucketed histogram of how long in seconds an item stays in workqueue before being requested. |
 | The bucketed histogram of how long in seconds processing an item from workqueue takes. |
The following tables list the alerts that are created by this solution. The alerts are created as rules in Amazon Managed Service for Prometheus, and are displayed in your Amazon Managed Grafana workspace.
You can modify the rules, including adding or deleting rules, by editing the rules configuration file in your Amazon Managed Service for Prometheus workspace.
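For example, if you manage the workspace with the AWS CLI, a sketch of the download, edit, and upload cycle might look like the following. The workspace ID and rule namespace name are placeholders; use `aws amp list-rule-groups-namespaces` to find the names that the solution actually created.

```bash
# Sketch: edit the alerting/recording rules in an Amazon Managed Service for
# Prometheus workspace. The workspace ID and namespace name are placeholders.
WORKSPACE_ID=ws-EXAMPLE1111-2222-3333-4444-555555555555
NAMESPACE=example-rules-namespace

# Download the current rules file (the API returns it base64-encoded).
aws amp describe-rule-groups-namespace \
  --workspace-id "$WORKSPACE_ID" --name "$NAMESPACE" \
  --query 'ruleGroupsNamespace.data' --output text | base64 --decode > rules.yaml

# ... edit rules.yaml to add, change, or delete rules ...

# Upload the modified rules file back to the workspace.
aws amp put-rule-groups-namespace \
  --workspace-id "$WORKSPACE_ID" --name "$NAMESPACE" --data fileb://rules.yaml
```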
These two alerts are special alerts that are handled slightly differently than typical alerts. Instead of alerting you to an issue, they give you information that is used to monitor the system. The description includes details about how to use these alerts.
Alert | Description and usage |
---|---|
Watchdog | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. You can integrate this with your notification mechanism to send a notification when this alert is not firing. For example, you could use the DeadMansSnitch integration in PagerDuty (see the example configuration after this table). |
InfoInhibitor | This is an alert that is used to inhibit info alerts. By themselves, info-level alerts can be very noisy, but they are relevant when combined with other alerts. This alert fires whenever there's a severity=info alert, and stops firing when another alert with severity warning or critical starts firing in the same namespace. Route this alert to a null receiver, and use it to inhibit alerts with severity=info. |
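The following is a minimal sketch of how these two alerts are commonly handled in an Alertmanager configuration. The solution does not create this for you; the receiver names and SNS topic ARN are placeholders, and if you use the Alertmanager bundled with Amazon Managed Service for Prometheus this content goes inside its alert manager definition and may need additional settings (such as sigv4 for the SNS receiver).

```bash
# Sketch only: typical Alertmanager handling for Watchdog and InfoInhibitor.
# Receiver names and the SNS topic ARN are placeholders.
cat > alertmanager-snippet.yaml <<'EOF'
route:
  receiver: default
  routes:
    # Send Watchdog to a dedicated receiver so a "dead man's switch" service
    # can page you if the always-firing alert ever stops arriving.
    - receiver: watchdog
      matchers:
        - alertname = "Watchdog"
      repeat_interval: 5m
receivers:
  - name: default
  - name: watchdog
    sns_configs:
      - topic_arn: arn:aws:sns:us-east-1:111122223333:watchdog-topic   # placeholder
inhibit_rules:
  # Use InfoInhibitor to silence noisy info-level alerts in a namespace while
  # a warning or critical alert is firing there.
  - source_matchers:
      - alertname = "InfoInhibitor"
    target_matchers:
      - severity = "info"
    equal: ["namespace"]
EOF
```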
The following alerts give you information or warnings about your system.
Alert | Severity | Description |
---|---|---|
 | warning | Network interface is often changing its status |
 | warning | File system is predicted to run out of space within the next 24 hours. |
 | critical | File system is predicted to run out of space within the next 4 hours. |
 | warning | File system has less than 5% space left. |
 | critical | File system has less than 3% space left. |
 | warning | File system is predicted to run out of inodes within the next 24 hours. |
 | critical | File system is predicted to run out of inodes within the next 4 hours. |
 | warning | File system has less than 5% inodes left. |
 | critical | File system has less than 3% inodes left. |
 | warning | Network interface is reporting many receive errors. |
 | warning | Network interface is reporting many transmit errors. |
 | warning | Number of conntrack entries are getting close to the limit. |
 | warning | Node Exporter text file collector failed to scrape. |
 | warning | Clock skew detected. |
 | warning | Clock not synchronizing. |
 | critical | RAID Array is degraded |
 | warning | Failed device in RAID array |
 | warning | Kernel is predicted to exhaust file descriptors limit soon. |
 | critical | Kernel is predicted to exhaust file descriptors limit soon. |
 | warning | Node is not ready. |
 | warning | Node is unreachable. |
 | info | Kubelet is running at capacity. |
 | warning | Node readiness status is flapping. |
 | warning | Kubelet Pod Lifecycle Event Generator is taking too long to relist. |
 | warning | Kubelet Pod startup latency is too high. |
 | warning | Kubelet client certificate is about to expire. |
 | critical | Kubelet client certificate is about to expire. |
 | warning | Kubelet server certificate is about to expire. |
 | critical | Kubelet server certificate is about to expire. |
 | warning | Kubelet has failed to renew its client certificate. |
 | warning | Kubelet has failed to renew its server certificate. |
 | critical | Target disappeared from Prometheus target discovery. |
 | warning | Different semantic versions of Kubernetes components running. |
 | warning | Kubernetes API server client is experiencing errors. |
 | warning | Client certificate is about to expire. |
 | critical | Client certificate is about to expire. |
 | warning | Kubernetes aggregated API has reported errors. |
 | warning | Kubernetes aggregated API is down. |
 | critical | Target disappeared from Prometheus target discovery. |
 | warning | The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests. |
 | critical | Persistent Volume is filling up. |
 | warning | Persistent Volume is filling up. |
 | critical | Persistent Volume Inodes is filling up. |
 | warning | Persistent Volume Inodes are filling up. |
 | critical | Persistent Volume is having issues with provisioning. |
 | warning | Cluster has overcommitted CPU resource requests. |
 | warning | Cluster has overcommitted memory resource requests. |
 | warning | Cluster has overcommitted CPU resource requests. |
 | warning | Cluster has overcommitted memory resource requests. |
 | info | Namespace quota is going to be full. |
 | info | Namespace quota is fully used. |
 | warning | Namespace quota has exceeded the limits. |
 | info | Processes experience elevated CPU throttling. |
 | warning | Pod is crash looping. |
 | warning | Pod has been in a non-ready state for more than 15 minutes. |
 | warning | Deployment generation mismatch due to possible roll-back |
 | warning | Deployment has not matched the expected number of replicas. |
 | warning | StatefulSet has not matched the expected number of replicas. |
 | warning | StatefulSet generation mismatch due to possible roll-back |
 | warning | StatefulSet update has not been rolled out. |
 | warning | DaemonSet rollout is stuck. |
 | warning | Pod container waiting longer than 1 hour |
 | warning | DaemonSet pods are not scheduled. |
 | warning | DaemonSet pods are misscheduled. |
 | warning | Job did not complete in time |
 | warning | Job failed to complete. |
 | warning | HPA has not matched desired number of replicas. |
 | warning | HPA is running at max replicas |
 | critical | kube-state-metrics is experiencing errors in list operations. |
 | critical | kube-state-metrics is experiencing errors in watch operations. |
 | critical | kube-state-metrics sharding is misconfigured. |
 | critical | kube-state-metrics shards are missing. |
 | critical | The API server is burning too much error budget. |
 | critical | The API server is burning too much error budget. |
 | warning | The API server is burning too much error budget. |
 | warning | The API server is burning too much error budget. |
 | warning | One or more targets are down. |
 | critical | Etcd cluster insufficient members. |
 | warning | Etcd cluster high number of leader changes. |
 | critical | Etcd cluster has no leader. |
 | warning | Etcd cluster high number of failed gRPC requests. |
 | critical | Etcd cluster gRPC requests are slow. |
 | warning | Etcd cluster member communication is slow. |
 | warning | Etcd cluster high number of failed proposals. |
 | warning | Etcd cluster high fsync durations. |
 | warning | Etcd cluster has higher than expected commit durations. |
 | warning | Etcd cluster has failed HTTP requests. |
 | critical | Etcd cluster has a high number of failed HTTP requests. |
 | warning | Etcd cluster HTTP requests are slow. |
 | warning | Host clock not synchronizing. |
 | warning | Host OOM kill detected. |
There are a few things that can cause the setup of the project to fail. Be sure to check the following.
-  You must complete all Prerequisites before installing the solution.
-  The cluster must have at least one node in it before attempting to create the solution or access the metrics.
-  Your Amazon EKS cluster must have the AWS CNI, CoreDNS, and kube-proxy add-ons installed. If they are not installed, the solution will not work correctly. They are installed by default when you create the cluster through the console. You may need to install them if the cluster was created through an AWS SDK.
-  Amazon EKS pods installation timed out. This can happen if there is not enough node capacity available. There are multiple causes of these issues, including:
   -  The Amazon EKS cluster was initialized with Fargate instead of Amazon EC2. This project requires Amazon EC2.
   -  The nodes are tainted and therefore unavailable. You can use `kubectl describe node NODENAME | grep Taints` to check the taints, and `kubectl taint node NODENAME TAINT_NAME-` to remove them. Make sure to include the `-` after the taint name.
   -  The nodes have reached their capacity limit. In this case, you can create a new node or increase the capacity.
You do not see any dashboards in Grafana: using the incorrect Grafana workspace ID.
Run the following command to get information about Grafana:
kubectl describe grafanas external-grafana -n grafana-operator
You can check the results for the correct workspace URL. If it is not the one you are expecting, re-deploy with the correct workspace ID.
Spec: External: API Key: Key: GF_SECURITY_ADMIN_APIKEY Name: grafana-admin-credentials URL: https://
g-123example
.grafana-workspace.aws-region
.amazonaws.com Status: Admin URL: https://g-123example
.grafana-workspace.aws-region
.amazonaws.com Dashboards: ... -
-  You do not see any dashboards in Grafana: you are using an expired API key.

   To look for this case, you will need to get the Grafana operator and check the logs for errors. Get the name of the Grafana operator with this command:

   kubectl get pods -n grafana-operator

   This will return the operator name, for example:

   NAME                                READY   STATUS    RESTARTS   AGE
   grafana-operator-1234abcd5678ef90   1/1     Running   0          1h2m

   Use the operator name in the following command:

   kubectl logs grafana-operator-1234abcd5678ef90 -n grafana-operator

   Error messages such as the following indicate an expired API key:

   ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"} github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile

   In this case, create a new API key and deploy the solution again (an example command for creating a new key appears at the end of this section). If the problem persists, you can force synchronization by using the following command before redeploying:

   kubectl delete externalsecret/external-secrets-sm -n grafana-operator
-  CDK installs – Missing SSM parameter. If you see an error like the following, run `cdk bootstrap` and try again.

   Deployment failed: Error: aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME: SSM parameter /cdk-bootstrap/xxxxxxx/version not found. Has the environment been bootstrapped? Please run 'cdk bootstrap' (see https://docs.aws.amazon.com/cdk/latest/guide/bootstrapping.html)
Deployment can fail if the OIDC provider already exists. You will see an error like the following (in this case, for CDK installs):
| CREATE_FAILED | Custom::AWSCDKOpenIdConnectProvider | OIDCProvider/Resource/Default Received response status [FAILED] from custom resource. Message returned: EntityAlreadyExistsException: Provider with url https://oidc.eks.
REGION
.amazonaws.com/id/PROVIDER ID
already exists.In this case, go to the IAM portal and delete the OIDC provider and try again.
-  Terraform installs – You see an error message that includes `cluster-secretstore-sm failed to create kubernetes rest client for update of resource` and `failed to create kubernetes rest client for update of resource`.

   This error typically indicates that the External Secrets Operator is not installed or enabled in your Kubernetes cluster. It is installed as part of the solution deployment, but sometimes it is not ready when the solution needs it.

   You can verify that it's installed with the following command:

   kubectl get deployments -n external-secrets

   If it's installed, it can take some time for the operator to be fully ready to be used. You can check the status of the needed Custom Resource Definitions (CRDs) by running the following command:

   kubectl get crds | grep external-secrets

   This command should list the CRDs related to the External Secrets Operator, including `clustersecretstores.external-secrets.io` and `externalsecrets.external-secrets.io`. If they are not listed, wait a few minutes and check again.

   Once the CRDs are registered, you can run `terraform apply` again to deploy the solution.
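For the expired API key case above, assuming you use Grafana workspace API keys and the AWS CLI, a replacement key can be created with a command like the following sketch (the workspace ID, key name, and lifetime are placeholders). Store the new key wherever your deployment expects it (for example, the secret referenced by the solution) and then redeploy.

```bash
# Sketch: create a new Amazon Managed Grafana workspace API key.
# The workspace ID, key name, and lifetime below are placeholders.
aws grafana create-workspace-api-key \
  --workspace-id g-123example \
  --key-name grafana-operator-key \
  --key-role ADMIN \
  --seconds-to-live 432000   # placeholder lifetime (5 days)
```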