Solution: Monitoring Amazon EKS Infrastructure

https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html

Monitoring Amazon Elastic Kubernetes Service infrastructure is one of the most common scenarios for which Amazon Managed Grafana is used. This page describes a template that provides you with a solution for this scenario. The solution can be installed using AWS Cloud Development Kit (AWS CDK) or with Terraform.

This solution configures an Amazon Managed Grafana workspace to provide metrics for your Amazon EKS cluster. The metrics are used to generate dashboards and alerts.

The metrics help you to operate Amazon EKS clusters more effectively by providing insights into the health and performance of the Kubernetes control plane and data plane. You can understand your Amazon EKS cluster from the node level, to pods, down to the Kubernetes level, including detailed monitoring of resource usage.

The following image shows a sample of the dashboard folder for the solution.

You can choose a dashboard to see more details. For example, choosing to view the Compute Resources for workloads shows a dashboard like the one in the following image.

The metrics are scraped at a 1-minute scrape interval. The dashboards show metrics aggregated to 1 minute, 5 minutes, or more, depending on the specific metric.

Logs are shown in dashboards, as well, so that you can query and analyze logs to find root causes of issues. The following image shows a log dashboard.

This solution creates and uses resources in your workspace. You will be charged for standard usage of the resources created, including Amazon Managed Service for Prometheus, Amazon Managed Grafana, and Amazon CloudWatch Logs.

The pricing calculators, available from the pricing page for each product, can help you understand potential costs for your solution. The following information can help you estimate a base cost for the solution running in the same availability zone as the Amazon EKS cluster.

Amazon Managed Service for Prometheus
  • Active series: 8,000 (base) + 15,000 (per node)
  • Avg Collection Interval: 60 (seconds)

Amazon Managed Service for Prometheus (managed collector)
  • Number of collectors: 1
  • Number of samples: 15 (base) + 150 (per node)
  • Number of rules: 161
  • Average rules extraction interval: 60 (seconds)

Amazon Managed Grafana
  • Number of active editors/administrators: 1 (or more, based on your users)

CloudWatch (Logs)
  • Standard Logs: Data ingested: 24.5 GB (base) + 0.5 GB (per node)
  • Log Storage/Archival (Standard and Vended Logs): Yes to store logs (assuming 1 month retention)
  • Expected Logs Data Scanned: Each Logs Insights query from Grafana will scan all log contents from the group over the specified time period.

These are the base numbers for a solution running EKS with no additional software, and they give you an estimate of the base costs. They also leave out network usage costs, which will vary based on whether the Amazon Managed Grafana workspace, Amazon Managed Service for Prometheus workspace, and Amazon EKS cluster are in the same availability zone, AWS Region, and VPC.

Note

When an item in this table includes a (base) value and a per-resource value (for example, (per node)), add the base value to the per-resource value multiplied by the number of that resource you have. For example, for Average active time series, enter 8,000 + (the number of nodes in your cluster × 15,000). If you have 2 nodes, you would enter 38,000, which is 8,000 + (2 × 15,000).
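As a quick sketch of that arithmetic (the variable names here are ours, not part of any AWS calculator), the inputs for a hypothetical 2-node cluster can be computed directly:

```shell
# Sketch: compute pricing-calculator inputs from the base + per-node
# figures in the table above, for a hypothetical 2-node cluster.
NODES=2
ACTIVE_SERIES=$(( 8000 + NODES * 15000 ))   # active series estimate
SAMPLES=$(( 15 + NODES * 150 ))             # samples estimate
echo "Active series: $ACTIVE_SERIES"
echo "Samples: $SAMPLES"
```

With 2 nodes this prints an active series estimate of 38,000, matching the worked example in the note above.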

This solution requires that you have completed all prerequisites before using it.

This solution configures AWS infrastructure to support reporting and monitoring metrics from an Amazon EKS cluster. You can install it using either AWS Cloud Development Kit (AWS CDK) or with Terraform.

This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those metrics are stored in Amazon Managed Service for Prometheus, and then displayed in Amazon Managed Grafana dashboards. By default, the scraper collects all Prometheus-compatible metrics that are exposed by the cluster. Installing software in your cluster that produces more metrics will increase the metrics collected. If you want, you can reduce the number of metrics by updating the scraper with a configuration that filters the metrics.
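As an illustration of what such a filtering configuration can look like (the job name and regular expression below are hypothetical, not the solution's actual scraper configuration), a Prometheus-style scrape configuration can keep only matching metric names:

```yaml
# Sketch only: a scraper configuration excerpt that keeps a subset of
# metrics. The job name and regex are illustrative.
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: kubernetes-nodes
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "(kubelet_|node_|container_|apiserver_).*"
        action: keep
```

Dropping unneeded series this way directly reduces the active series and samples counts used in the cost estimate above.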

The following metrics are tracked with this solution, in a base Amazon EKS cluster configuration with no additional software installed.

  • aggregator_unavailable_apiservice: Gauge of APIServices which are marked as unavailable, broken down by APIService name.
  • apiserver_admission_webhook_admission_duration_seconds_bucket: Admission webhook latency histogram in seconds, identified by name and broken out for each operation, API resource, and type (validate or admit).
  • apiserver_current_inflight_requests: Maximal number of currently used inflight request limit of this apiserver per request kind in the last second.
  • apiserver_envelope_encryption_dek_cache_fill_percent: Percent of the cache slots currently occupied by cached DEKs.
  • apiserver_flowcontrol_current_executing_requests: Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem.
  • apiserver_flowcontrol_rejected_requests_total: Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem that were rejected.
  • apiserver_flowcontrol_request_concurrency_limit: Nominal number of execution seats configured for each priority level.
  • apiserver_flowcontrol_request_execution_seconds_bucket: The bucketed histogram of duration of initial (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem.
  • apiserver_flowcontrol_request_queue_length_after_enqueue_count: The count of initial (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem.
  • apiserver_request: Indicates an API server request.
  • apiserver_requested_deprecated_apis: Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release.
  • apiserver_request_duration_seconds: Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_duration_seconds_bucket: The bucketed histogram of response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_slo_duration_seconds: The Service Level Objective (SLO) response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_terminations_total: Number of requests which apiserver terminated in self-defense.
  • apiserver_request_total: Counter of apiserver requests, broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
  • container_cpu_usage_seconds_total: Cumulative CPU time consumed.
  • container_fs_reads_bytes_total: Cumulative count of bytes read.
  • container_fs_reads_total: Cumulative count of reads completed.
  • container_fs_writes_bytes_total: Cumulative count of bytes written.
  • container_fs_writes_total: Cumulative count of writes completed.
  • container_memory_cache: Total page cache memory.
  • container_memory_rss: Size of RSS.
  • container_memory_swap: Container swap usage.
  • container_memory_working_set_bytes: Current working set.
  • container_network_receive_bytes_total: Cumulative count of bytes received.
  • container_network_receive_packets_dropped_total: Cumulative count of packets dropped while receiving.
  • container_network_receive_packets_total: Cumulative count of packets received.
  • container_network_transmit_bytes_total: Cumulative count of bytes transmitted.
  • container_network_transmit_packets_dropped_total: Cumulative count of packets dropped while transmitting.
  • container_network_transmit_packets_total: Cumulative count of packets transmitted.
  • etcd_request_duration_seconds_bucket: The bucketed histogram of etcd request latency in seconds for each operation and object type.
  • go_goroutines: Number of goroutines that currently exist.
  • go_threads: Number of OS threads created.
  • kubelet_cgroup_manager_duration_seconds_bucket: The bucketed histogram of duration in seconds for cgroup manager operations, broken down by method.
  • kubelet_cgroup_manager_duration_seconds_count: Duration in seconds for cgroup manager operations, broken down by method.
  • kubelet_node_config_error: This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise.
  • kubelet_node_name: The node's name. The count is always 1.
  • kubelet_pleg_relist_duration_seconds_bucket: The bucketed histogram of duration in seconds for relisting pods in PLEG.
  • kubelet_pleg_relist_duration_seconds_count: The count of duration in seconds for relisting pods in PLEG.
  • kubelet_pleg_relist_interval_seconds_bucket: The bucketed histogram of interval in seconds between relisting in PLEG.
  • kubelet_pod_start_duration_seconds_count: The count of duration in seconds from kubelet seeing a pod for the first time to the pod starting to run.
  • kubelet_pod_worker_duration_seconds_bucket: The bucketed histogram of duration in seconds to sync a single pod, broken down by operation type: create, update, or sync.
  • kubelet_pod_worker_duration_seconds_count: The count of duration in seconds to sync a single pod, broken down by operation type: create, update, or sync.
  • kubelet_running_containers: Number of containers currently running.
  • kubelet_running_pods: Number of pods that have a running pod sandbox.
  • kubelet_runtime_operations_duration_seconds_bucket: The bucketed histogram of duration in seconds of runtime operations, broken down by operation type.
  • kubelet_runtime_operations_errors_total: Cumulative number of runtime operation errors by operation type.
  • kubelet_runtime_operations_total: Cumulative number of runtime operations by operation type.
  • kube_node_status_allocatable: The amount of resources allocatable for pods (after reserving some for system daemons).
  • kube_node_status_capacity: The total amount of resources available for a node.
  • kube_pod_container_resource_limits (CPU): The CPU limit set for a container.
  • kube_pod_container_resource_limits (Memory): The memory limit set for a container.
  • kube_pod_container_resource_requests (CPU): The CPU requested by a container.
  • kube_pod_container_resource_requests (Memory): The memory requested by a container.
  • kube_pod_owner: Information about the Pod's owner.
  • kube_resourcequota: Resource quotas in Kubernetes enforce usage limits on resources such as CPU, memory, and storage within namespaces.
  • node_cpu: The CPU usage metrics for a node, including usage per core and total usage.
  • node_cpu_seconds_total: Seconds the CPUs spent in each mode.
  • node_disk_io_time_seconds: The cumulative amount of time spent performing I/O operations on disk by a node.
  • node_disk_io_time_seconds_total: The total amount of time spent performing I/O operations on disk by the node.
  • node_disk_read_bytes_total: The total number of bytes read from disk by the node.
  • node_disk_written_bytes_total: The total number of bytes written to disk by the node.
  • node_filesystem_avail_bytes: The amount of available space in bytes on the filesystem of a node in a Kubernetes cluster.
  • node_filesystem_size_bytes: The total size of the filesystem on the node.
  • node_load1: The 1-minute load average of a node's CPU usage.
  • node_load15: The 15-minute load average of a node's CPU usage.
  • node_load5: The 5-minute load average of a node's CPU usage.
  • node_memory_Buffers_bytes: The amount of memory used for buffer caching by the node's operating system.
  • node_memory_Cached_bytes: The amount of memory used for disk caching by the node's operating system.
  • node_memory_MemAvailable_bytes: The amount of memory available for use by applications and caches.
  • node_memory_MemFree_bytes: The amount of free memory available on the node.
  • node_memory_MemTotal_bytes: The total amount of physical memory available on the node.
  • node_network_receive_bytes_total: The total number of bytes received over the network by the node.
  • node_network_transmit_bytes_total: The total number of bytes transmitted over the network by the node.
  • process_cpu_seconds_total: Total user and system CPU time spent in seconds.
  • process_resident_memory_bytes: Resident memory size in bytes.
  • rest_client_requests_total: Number of HTTP requests, partitioned by status code, method, and host.
  • rest_client_request_duration_seconds_bucket: The bucketed histogram of request latency in seconds, broken down by verb and host.
  • storage_operation_duration_seconds_bucket: The bucketed histogram of duration of storage operations.
  • storage_operation_duration_seconds_count: The count of duration of storage operations.
  • storage_operation_errors_total: Cumulative number of errors during storage operations.
  • up: A metric indicating whether the monitored target (e.g., node) is up and running.
  • volume_manager_total_volumes: The total number of volumes managed by the volume manager.
  • workqueue_adds_total: Total number of adds handled by workqueue.
  • workqueue_depth: Current depth of workqueue.
  • workqueue_queue_duration_seconds_bucket: The bucketed histogram of how long in seconds an item stays in workqueue before being requested.
  • workqueue_work_duration_seconds_bucket: The bucketed histogram of how long in seconds processing an item from workqueue takes.

The following tables list the alerts that are created by this solution. The alerts are created as rules in Amazon Managed Service for Prometheus, and are displayed in your Amazon Managed Grafana workspace.

You can modify the rules, including adding or deleting rules, by editing the rules configuration file in your Amazon Managed Service for Prometheus workspace.
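For orientation, a rules configuration file holds Prometheus rule groups of roughly the following shape. The expression shown here is a simplified stand-in, not the exact rule the solution installs:

```yaml
# Illustrative rule group; the expr is a simplified example only.
groups:
  - name: kubernetes-apps
    rules:
      - alert: KubePodNotReady
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"}) > 0
        for: 15m
        labels:
          severity: warning
```

Editing the file and saving it back to the workspace updates the rules that Amazon Managed Service for Prometheus evaluates.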

These two alerts are special: instead of alerting you to an issue, they give you information that is used to monitor the system. The descriptions include details about how to use these alerts.

Alert Description and usage

Watchdog

This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. You can integrate this with your notification mechanism to send a notification when this alert is not firing. For example, you could use the DeadMansSnitch integration in PagerDuty.
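As a sketch of that integration (the receiver name and webhook URL below are hypothetical), an Alertmanager configuration could route Watchdog to a dead man's switch receiver:

```yaml
# Illustrative Alertmanager excerpt: the always-firing Watchdog alert is
# routed to a heartbeat receiver; a missed heartbeat signals a broken
# alerting pipeline. The URL is a placeholder.
route:
  receiver: default
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: watchdog-heartbeat
      repeat_interval: 5m
receivers:
  - name: default
  - name: watchdog-heartbeat
    webhook_configs:
      - url: https://example.com/heartbeat   # e.g. a DeadMansSnitch endpoint
```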

InfoInhibitor

This is an alert that is used to inhibit info alerts. By themselves, info-level alerts can be very noisy, but they are relevant when combined with other alerts. This alert fires whenever there's a severity=info alert, and stops firing when another alert with a severity of warning or critical starts firing on the same namespace. This alert should be routed to a null receiver and configured to inhibit alerts with severity=info.
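A sketch of that routing and inhibition in Alertmanager configuration terms (the receiver names are illustrative):

```yaml
# Illustrative Alertmanager excerpt: InfoInhibitor goes to a null
# receiver, and while it fires it suppresses severity=info alerts in
# the same namespace.
route:
  receiver: default
  routes:
    - matchers:
        - alertname = InfoInhibitor
      receiver: "null"
receivers:
  - name: default
  - name: "null"
inhibit_rules:
  - source_matchers:
      - alertname = InfoInhibitor
    target_matchers:
      - severity = info
    equal:
      - namespace
```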

The following alerts give you information or warnings about your system.

  • NodeNetworkInterfaceFlapping (warning): Network interface is often changing its status.
  • NodeFilesystemSpaceFillingUp (warning): File system is predicted to run out of space within the next 24 hours.
  • NodeFilesystemSpaceFillingUp (critical): File system is predicted to run out of space within the next 4 hours.
  • NodeFilesystemAlmostOutOfSpace (warning): File system has less than 5% space left.
  • NodeFilesystemAlmostOutOfSpace (critical): File system has less than 3% space left.
  • NodeFilesystemFilesFillingUp (warning): File system is predicted to run out of inodes within the next 24 hours.
  • NodeFilesystemFilesFillingUp (critical): File system is predicted to run out of inodes within the next 4 hours.
  • NodeFilesystemAlmostOutOfFiles (warning): File system has less than 5% inodes left.
  • NodeFilesystemAlmostOutOfFiles (critical): File system has less than 3% inodes left.
  • NodeNetworkReceiveErrs (warning): Network interface is reporting many receive errors.
  • NodeNetworkTransmitErrs (warning): Network interface is reporting many transmit errors.
  • NodeHighNumberConntrackEntriesUsed (warning): Number of conntrack entries is getting close to the limit.
  • NodeTextFileCollectorScrapeError (warning): Node Exporter text file collector failed to scrape.
  • NodeClockSkewDetected (warning): Clock skew detected.
  • NodeClockNotSynchronizing (warning): Clock not synchronizing.
  • NodeRAIDDegraded (critical): RAID array is degraded.
  • NodeRAIDDiskFailure (warning): Failed device in RAID array.
  • NodeFileDescriptorLimit (warning): Kernel is predicted to exhaust file descriptors limit soon.
  • NodeFileDescriptorLimit (critical): Kernel is predicted to exhaust file descriptors limit soon.
  • KubeNodeNotReady (warning): Node is not ready.
  • KubeNodeUnreachable (warning): Node is unreachable.
  • KubeletTooManyPods (info): Kubelet is running at capacity.
  • KubeNodeReadinessFlapping (warning): Node readiness status is flapping.
  • KubeletPlegDurationHigh (warning): Kubelet Pod Lifecycle Event Generator is taking too long to relist.
  • KubeletPodStartUpLatencyHigh (warning): Kubelet Pod startup latency is too high.
  • KubeletClientCertificateExpiration (warning): Kubelet client certificate is about to expire.
  • KubeletClientCertificateExpiration (critical): Kubelet client certificate is about to expire.
  • KubeletServerCertificateExpiration (warning): Kubelet server certificate is about to expire.
  • KubeletServerCertificateExpiration (critical): Kubelet server certificate is about to expire.
  • KubeletClientCertificateRenewalErrors (warning): Kubelet has failed to renew its client certificate.
  • KubeletServerCertificateRenewalErrors (warning): Kubelet has failed to renew its server certificate.
  • KubeletDown (critical): Target disappeared from Prometheus target discovery.
  • KubeVersionMismatch (warning): Different semantic versions of Kubernetes components running.
  • KubeClientErrors (warning): Kubernetes API server client is experiencing errors.
  • KubeClientCertificateExpiration (warning): Client certificate is about to expire.
  • KubeClientCertificateExpiration (critical): Client certificate is about to expire.
  • KubeAggregatedAPIErrors (warning): Kubernetes aggregated API has reported errors.
  • KubeAggregatedAPIDown (warning): Kubernetes aggregated API is down.
  • KubeAPIDown (critical): Target disappeared from Prometheus target discovery.
  • KubeAPITerminatedRequests (warning): The Kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
  • KubePersistentVolumeFillingUp (critical): Persistent Volume is filling up.
  • KubePersistentVolumeFillingUp (warning): Persistent Volume is filling up.
  • KubePersistentVolumeInodesFillingUp (critical): Persistent Volume inodes are filling up.
  • KubePersistentVolumeInodesFillingUp (warning): Persistent Volume inodes are filling up.
  • KubePersistentVolumeErrors (critical): Persistent Volume is having issues with provisioning.
  • KubeCPUOvercommit (warning): Cluster has overcommitted CPU resource requests.
  • KubeMemoryOvercommit (warning): Cluster has overcommitted memory resource requests.
  • KubeCPUQuotaOvercommit (warning): Cluster has overcommitted CPU resource requests.
  • KubeMemoryQuotaOvercommit (warning): Cluster has overcommitted memory resource requests.
  • KubeQuotaAlmostFull (info): Namespace quota is going to be full.
  • KubeQuotaFullyUsed (info): Namespace quota is fully used.
  • KubeQuotaExceeded (warning): Namespace quota has exceeded the limits.
  • CPUThrottlingHigh (info): Processes experience elevated CPU throttling.
  • KubePodCrashLooping (warning): Pod is crash looping.
  • KubePodNotReady (warning): Pod has been in a non-ready state for more than 15 minutes.
  • KubeDeploymentGenerationMismatch (warning): Deployment generation mismatch due to possible roll-back.
  • KubeDeploymentReplicasMismatch (warning): Deployment has not matched the expected number of replicas.
  • KubeStatefulSetReplicasMismatch (warning): StatefulSet has not matched the expected number of replicas.
  • KubeStatefulSetGenerationMismatch (warning): StatefulSet generation mismatch due to possible roll-back.
  • KubeStatefulSetUpdateNotRolledOut (warning): StatefulSet update has not been rolled out.
  • KubeDaemonSetRolloutStuck (warning): DaemonSet rollout is stuck.
  • KubeContainerWaiting (warning): Pod container waiting longer than 1 hour.
  • KubeDaemonSetNotScheduled (warning): DaemonSet pods are not scheduled.
  • KubeDaemonSetMisScheduled (warning): DaemonSet pods are misscheduled.
  • KubeJobNotCompleted (warning): Job did not complete in time.
  • KubeJobFailed (warning): Job failed to complete.
  • KubeHpaReplicasMismatch (warning): HPA has not matched the desired number of replicas.
  • KubeHpaMaxedOut (warning): HPA is running at max replicas.
  • KubeStateMetricsListErrors (critical): kube-state-metrics is experiencing errors in list operations.
  • KubeStateMetricsWatchErrors (critical): kube-state-metrics is experiencing errors in watch operations.
  • KubeStateMetricsShardingMismatch (critical): kube-state-metrics sharding is misconfigured.
  • KubeStateMetricsShardsMissing (critical): kube-state-metrics shards are missing.
  • KubeAPIErrorBudgetBurn (critical): The API server is burning too much error budget.
  • KubeAPIErrorBudgetBurn (critical): The API server is burning too much error budget.
  • KubeAPIErrorBudgetBurn (warning): The API server is burning too much error budget.
  • KubeAPIErrorBudgetBurn (warning): The API server is burning too much error budget.
  • TargetDown (warning): One or more targets are down.
  • etcdInsufficientMembers (critical): Etcd cluster has insufficient members.
  • etcdHighNumberOfLeaderChanges (warning): Etcd cluster has a high number of leader changes.
  • etcdNoLeader (critical): Etcd cluster has no leader.
  • etcdHighNumberOfFailedGRPCRequests (warning): Etcd cluster has a high number of failed gRPC requests.
  • etcdGRPCRequestsSlow (critical): Etcd cluster gRPC requests are slow.
  • etcdMemberCommunicationSlow (warning): Etcd cluster member communication is slow.
  • etcdHighNumberOfFailedProposals (warning): Etcd cluster has a high number of failed proposals.
  • etcdHighFsyncDurations (warning): Etcd cluster has high fsync durations.
  • etcdHighCommitDurations (warning): Etcd cluster has higher than expected commit durations.
  • etcdHighNumberOfFailedHTTPRequests (warning): Etcd cluster has failed HTTP requests.
  • etcdHighNumberOfFailedHTTPRequests (critical): Etcd cluster has a high number of failed HTTP requests.
  • etcdHTTPRequestsSlow (warning): Etcd cluster HTTP requests are slow.
  • HostClockNotSynchronizing (warning): Host clock not synchronizing.
  • HostOomKillDetected (warning): Host OOM kill detected.

There are a few things that can cause the setup of the project to fail. Be sure to check the following.

  • You must complete all Prerequisites before installing the solution.

  • The cluster must have at least one node in it before attempting to create the solution or access the metrics.

  • Your Amazon EKS cluster must have the AWS CNI, CoreDNS, and kube-proxy add-ons installed. If they are not installed, the solution will not work correctly. They are installed by default when you create the cluster through the console. You may need to install them if the cluster was created through an AWS SDK.

  • Installation of Amazon EKS pods timed out. This can happen if there is not enough node capacity available. There are multiple causes of these issues, including:

    • The Amazon EKS cluster was initialized with Fargate instead of Amazon EC2. This project requires Amazon EC2.

    • The nodes are tainted and therefore unavailable.

      You can use kubectl describe node NODENAME | grep Taints to check the taints. Then kubectl taint node NODENAME TAINT_NAME- to remove the taints. Make sure to include the - after the taint name.

    • The nodes have reached the capacity limit. In this case you can create a new node or increase the capacity.

  • You do not see any dashboards in Grafana: you are using the incorrect Grafana workspace ID.

    Run the following command to get information about Grafana:

    kubectl describe grafanas external-grafana -n grafana-operator

    You can check the results for the correct workspace URL. If it is not the one you are expecting, re-deploy with the correct workspace ID.

    Spec:
      External:
        API Key:
          Key:   GF_SECURITY_ADMIN_APIKEY
          Name:  grafana-admin-credentials
        URL:     https://g-123example.grafana-workspace.aws-region.amazonaws.com
    Status:
      Admin URL:  https://g-123example.grafana-workspace.aws-region.amazonaws.com
      Dashboards:
        ...
  • You do not see any dashboards in Grafana: You are using an expired API key.

    To check for this case, get the name of the Grafana operator pod and look in its logs for errors. Get the name with this command:

    kubectl get pods -n grafana-operator

    This will return the operator name, for example:

    NAME                               READY   STATUS    RESTARTS   AGE
    grafana-operator-1234abcd5678ef90   1/1     Running   0          1h2m

    Use the operator name in the following command:

    kubectl logs grafana-operator-1234abcd5678ef90 -n grafana-operator

    Error messages such as the following indicate an expired API key:

    ERROR   error reconciling datasource    {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
    github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile

    In this case, create a new API key and deploy the solution again. If the problem persists, you can force synchronization by using the following command before redeploying:

    kubectl delete externalsecret/external-secrets-sm -n grafana-operator
  • CDK installs – Missing SSM parameter. If you see an error like the following, run cdk bootstrap and try again.

    Deployment failed: Error: aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME: SSM 
    parameter /cdk-bootstrap/xxxxxxx/version not found. Has the environment been 
    bootstrapped? Please run 'cdk bootstrap' (see https://docs.aws.amazon.com/cdk/latest/
    guide/bootstrapping.html)
  • Deployment can fail if the OIDC provider already exists. You will see an error like the following (in this case, for CDK installs):

    | CREATE_FAILED | Custom::AWSCDKOpenIdConnectProvider | OIDCProvider/Resource/Default
    Received response status [FAILED] from custom resource. Message returned: 
    EntityAlreadyExistsException: Provider with url https://oidc.eks.REGION.amazonaws.com/id/PROVIDER ID already exists.

    In this case, go to the IAM console, delete the OIDC provider, and try again.

  • Terraform installs – You see an error message that includes cluster-secretstore-sm failed to create kubernetes rest client for update of resource.

    This error typically indicates that the External Secrets Operator is not installed or enabled in your Kubernetes cluster. This is installed as part of the solution deployment, but sometimes is not ready when the solution needs it.

    You can verify that it's installed with the following command:

    kubectl get deployments -n external-secrets

    If it's installed, it can take some time for the operator to be fully ready to be used. You can check the status of the needed Custom Resource Definitions (CRDs) by running the following command:

    kubectl get crds | grep external-secrets

    This command should list the CRDs related to the external secrets operator, including clustersecretstores.external-secrets.io and externalsecrets.external-secrets.io. If they are not listed, wait a few minutes and check again.

    Once the CRDs are registered, you can run terraform apply again to deploy the solution.

{
"by": "mhausenblas",
"descendants": 0,
"id": 40245554,
"score": 2,
"time": 1714726405,
"title": "Solution: Monitoring Amazon EKS Infrastructure",
"type": "story",
"url": "https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html"
}
{
"author": null,
"date": null,
"description": "Explains how to use a pre-built observability solution to monitor Amazon Elastic Kubernetes Service infrastructure with Amazon Managed Grafana and Amazon Managed Service for Prometheus.",
"image": "https://docs.aws.amazon.com/images/grafana/latest/userguide/images/eks-solution-dashboard-folder.png",
"logo": "https://logo.clearbit.com/amazon.com",
"publisher": "Amazon",
"title": "Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana - Amazon Managed Grafana",
"url": "https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html"
}
{
"url": "https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html",
"title": "Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana - Amazon Managed Grafana",
"description": "Explains how to use a pre-built observability solution to monitor Amazon Elastic Kubernetes Service infrastructure with Amazon Managed Grafana and Amazon Managed Service for Prometheus.",
"links": [
"https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html"
],
"image": "",
"content": "<p>Monitoring Amazon Elastic Kubernetes Service infrastructure is one of the most common scenarios for which\n Amazon Managed Grafana are used. This page describes a template that provides you with a solution \n for this scenario. The solution can be installed using <a target=\"_blank\" href=\"https://docs.aws.amazon.com/cdk/v2/guide/home.html\">AWS Cloud Development Kit (AWS CDK)</a> or with\n <a href=\"https://www.terraform.io/\" target=\"_blank\"><span>Terraform</span></a>.</p><p>This solution configures an Amazon Managed Grafana workspace to provide metrics for your Amazon EKS\n cluster. The metrics are used to generate dashboards and alerts.</p><p>The metrics help you to operate Amazon EKS clusters more effectively by providing insights\n into the health and performance of the Kubernetes control and data plane. You can\n understand your Amazon EKS cluster from the node level, to pods, down to the Kubernetes \n level, including detailed monitoring of resource usage.</p><p>The following image shows a sample of the dashboard folder for the\n solution.</p><p>You can choose a dashboard to see more details, for example, choosing to view the\n Compute Resources for workloads will show a dashboard, such as that shown in the\n following image.</p><p>The metrics are scraped with a 1 minute scrape interval. The dashboards show metrics \n aggregated to 1 minute, 5 minutes, or more, based on the specific metric.</p><p>Logs are shown in dashboards, as well, so that you can query and analyze logs to find\n root causes of issues. The following image shows a log dashboard.</p><p>This solution creates and uses resources in your workspace. You will be charged for\n standard usage of the resources created, including:</p><p>The pricing calculators, available from the pricing page for each product, can help \n you understand potential costs for your solution. 
The following information can help\n get a base cost, for the solution running in the same availability zone as the Amazon EKS \n cluster.</p><div><table><thead>\n <tr>\n <th>Product</th>\n <th>Calculator metric</th>\n <th>Value</th>\n </tr>\n </thead>\n <tr>\n <td><p>Amazon Managed Service for Prometheus</p></td>\n <td><p>Active series</p></td>\n <td><p>8000 (base)</p>\n <p>15,000 (per node)</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Avg Collection Interval</p></td>\n <td><p>60 (seconds)</p></td>\n </tr>\n <tr>\n <td><p>Amazon Managed Service for Prometheus (managed collector)</p></td>\n <td><p>Number of collectors</p></td>\n <td><p>1</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Number of samples</p></td>\n <td><p>15 (base)</p>\n <p>150 (per node)</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Number of rules</p></td>\n <td><p>161</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Average rules extraction interval</p></td>\n <td><p>60 (seconds)</p></td>\n </tr>\n <tr>\n <td><p>Amazon Managed Grafana</p></td>\n <td><p>Number of active editors/administrators</p></td>\n <td><p>1 (or more, based on your users)</p></td>\n </tr>\n <tr>\n <td><p>CloudWatch (Logs)</p></td>\n <td><p>Standard Logs: Data ingested</p></td>\n <td><p>24.5 GB (base)</p>\n <p>0.5 GB (per node)</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Log Storage/Archival (Standard and Vended Logs)</p></td>\n <td><p>Yes to store logs: Assuming 1 month retention</p></td>\n </tr>\n <tr>\n <td></td>\n <td><p>Expected Logs Data Scanned</p></td>\n <td><p>Each log insights query from Grafana will scan all log \n contents from the group over the specified time \n period.</p></td>\n </tr>\n </table></div><p>These numbers are the base numbers for a solution running EKS with no additional\n software. This will give you an estimate of the base costs. 
It also leaves out \n network usage costs, which will vary based on whether the Amazon Managed Grafana workspace, \n Amazon Managed Service for Prometheus workspace, and Amazon EKS cluster are in the same availability zone, AWS Region,\n and VPC.</p><div><h6>Note</h6><p>When an item in this table includes a <code>(base)</code> value and a value\n per resource (for example, <code>(per node)</code>), add the base value\n to the per-resource value multiplied by the number of that resource that you have. For example,\n for <b>Average active time series</b>, enter a number \n that is <code>8,000 + (the number of nodes in your cluster * 15,000)</code>.\n If you have 2 nodes, you would enter <code>38,000</code>, which is \n <code>8,000 + ( 2 * 15,000 )</code>.</p></div><p>This solution requires that you have done the following before using it.</p><p>This solution configures AWS infrastructure to support reporting and monitoring\n metrics from an Amazon EKS cluster. You can install it using either <a target=\"_blank\" href=\"https://docs.aws.amazon.com/cdk/v2/guide/home.html\">AWS Cloud Development Kit (AWS CDK)</a> or\n <a href=\"https://www.terraform.io/\" target=\"_blank\"><span>Terraform</span></a>.</p><p>This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those \n metrics are stored in Amazon Managed Service for Prometheus, and then displayed in Amazon Managed Grafana dashboards. By default,\n the scraper collects all <a target=\"_blank\" href=\"https://docs.aws.amazon.com/prometheus/latest/userguide/prom-compatible-metrics.html\">Prometheus-compatible \n metrics</a> that are exposed by the cluster. Installing software in your cluster \n that produces more metrics will increase the metrics\n collected. 
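As a hedged illustration of constraining what the scraper collects: standard Prometheus configuration supports a <code>metric_relabel_configs</code> rule with <code>action: keep</code> that drops every series whose name does not match a regex. The snippet below only assembles such a fragment in Python; whether your scraper accepts this exact fragment depends on the scraper-configuration documentation, and the metric names in the keep list are examples drawn from this solution's metric table:

```python
# Hedged sketch: build a Prometheus-style "keep" filter fragment.
# metric_relabel_configs with action: keep is standard Prometheus
# configuration; adapt it to your scraper's configuration format.
keep_metrics = [
    "up",
    "node_cpu_seconds_total",
    "container_memory_working_set_bytes",
]
fragment = (
    "metric_relabel_configs:\n"
    "  - source_labels: [__name__]\n"
    f"    regex: ({'|'.join(keep_metrics)})\n"
    "    action: keep\n"
)
print(fragment)
```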
If you want, you can reduce the number of metrics by <a target=\"_blank\" href=\"https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-configuration\">updating the scraper with a configuration that filters the \n metrics</a>.</p><p>The following metrics are tracked with this solution, in a base Amazon EKS cluster \n configuration with no additional software installed.</p><div><table><thead>\n <tr>\n <th>Metric</th>\n <th>Description / Purpose</th>\n </tr>\n </thead>\n <tr>\n <td>\n <p><code>aggregator_unavailable_apiservice</code></p>\n </td>\n <td>\n <p>Gauge of APIServices which are marked as unavailable broken down\n by APIService name.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_admission_webhook_admission_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>Admission webhook latency histogram in seconds, identified by name\n and broken out for each operation and API resource and type\n (validate or admit).</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_current_inflight_requests</code></p>\n </td>\n <td>\n <p>Maximal number of currently used inflight request limit of this\n apiserver per request kind in last second.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_envelope_encryption_dek_cache_fill_percent</code></p>\n </td>\n <td>\n <p>Percent of the cache slots currently occupied by cached\n DEKs.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_flowcontrol_current_executing_requests</code></p>\n </td>\n <td>\n <p>Number of requests in initial (for a WATCH) or any (for a\n non-WATCH) execution stage in the API Priority and Fairness\n subsystem.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_flowcontrol_rejected_requests_total</code></p>\n </td>\n <td>\n <p>Number of requests in initial (for a WATCH) or any (for a\n non-WATCH) execution stage in the API Priority and Fairness\n subsystem that were rejected.</p>\n </td>\n </tr>\n <tr>\n <td>\n 
<p><code>apiserver_flowcontrol_request_concurrency_limit</code></p>\n </td>\n <td>\n <p>Nominal number of execution seats configured for each priority\n level.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_flowcontrol_request_execution_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration of initial stage (for a WATCH)\n or any (for a non-WATCH) stage of request execution in the API\n Priority and Fairness subsystem.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_flowcontrol_request_queue_length_after_enqueue_count</code></p>\n </td>\n <td>\n <p>The count of initial stage (for a WATCH) or any (for a non-WATCH)\n stage of request execution in the API Priority and Fairness\n subsystem.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_request</code></p>\n </td>\n <td>\n <p>Indicates an API server request.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_requested_deprecated_apis</code></p>\n </td>\n <td>\n <p>Gauge of deprecated APIs that have been requested, broken out by\n API group, version, resource, subresource, and\n removed_release.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_request_duration_seconds</code></p>\n </td>\n <td>\n <p>Response latency distribution in seconds for each verb, dry run\n value, group, version, resource, subresource, scope and\n component.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_request_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of response latency distribution in seconds\n for each verb, dry run value, group, version, resource, subresource,\n scope and component.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_request_slo_duration_seconds</code></p>\n </td>\n <td>\n <p>The Service Level Objective (SLO) response latency distribution in\n seconds for each verb, dry run value, group, version, resource,\n subresource, scope and component.</p>\n </td>\n </tr>\n <tr>\n <td>\n 
<p><code>apiserver_request_terminations_total</code></p>\n </td>\n <td>\n <p>Number of requests which apiserver terminated in\n self-defense.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>apiserver_request_total</code></p>\n </td>\n <td>\n <p>Counter of apiserver requests broken out for each verb, dry run\n value, group, version, resource, scope, component, and HTTP response\n code.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_cpu_usage_seconds_total</code></p>\n </td>\n <td>\n <p>Cumulative cpu time consumed.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_fs_reads_bytes_total</code></p>\n </td>\n <td>\n <p>Cumulative count of bytes read.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_fs_reads_total</code></p>\n </td>\n <td>\n <p>Cumulative count of reads completed.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_fs_writes_bytes_total</code></p>\n </td>\n <td>\n <p>Cumulative count of bytes written.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_fs_writes_total</code></p>\n </td>\n <td>\n <p>Cumulative count of writes completed.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_memory_cache</code></p>\n </td>\n <td>\n <p>Total page cache memory.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_memory_rss</code></p>\n </td>\n <td>\n <p>Size of RSS.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_memory_swap</code></p>\n </td>\n <td>\n <p>Container swap usage.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_memory_working_set_bytes</code></p>\n </td>\n <td>\n <p>Current working set.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_receive_bytes_total</code></p>\n </td>\n <td>\n <p>Cumulative count of bytes received.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_receive_packets_dropped_total</code></p>\n </td>\n <td>\n <p>Cumulative count of packets dropped while receiving.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_receive_packets_total</code></p>\n 
</td>\n <td>\n <p>Cumulative count of packets received.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_transmit_bytes_total</code></p>\n </td>\n <td>\n <p>Cumulative count of bytes transmitted.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_transmit_packets_dropped_total</code></p>\n </td>\n <td>\n <p>Cumulative count of packets dropped while transmitting.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>container_network_transmit_packets_total</code></p>\n </td>\n <td>\n <p>Cumulative count of packets transmitted.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcd_request_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of etcd request latency in seconds for each\n operation and object type.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>go_goroutines</code></p>\n </td>\n <td>\n <p>Number of goroutines that currently exist.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>go_threads</code></p>\n </td>\n <td>\n <p>Number of OS threads created.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_cgroup_manager_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration in seconds for cgroup manager\n operations. Broken down by method.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_cgroup_manager_duration_seconds_count</code></p>\n </td>\n <td>\n <p>Duration in seconds for cgroup manager operations. Broken down by\n method.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_node_config_error</code></p>\n </td>\n <td>\n <p>This metric is true (1) if the node is experiencing a\n configuration-related error, false (0) otherwise.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_node_name</code></p>\n </td>\n <td>\n <p>The node's name. 
The count is always 1.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pleg_relist_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration in seconds for relisting pods\n in PLEG.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pleg_relist_duration_seconds_count</code></p>\n </td>\n <td>\n <p>The count of duration in seconds for relisting pods in\n PLEG.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pleg_relist_interval_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of interval in seconds between relisting in\n PLEG.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pod_start_duration_seconds_count</code></p>\n </td>\n <td>\n <p>The count of duration in seconds from kubelet seeing a pod for the\n first time to the pod starting to run.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pod_worker_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration in seconds to sync a single\n pod. Broken down by operation type: create, update, or sync.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_pod_worker_duration_seconds_count</code></p>\n </td>\n <td>\n <p>The count of duration in seconds to sync a single pod. Broken down\n by operation type: create, update, or sync.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_running_containers</code></p>\n </td>\n <td>\n <p>Number of containers currently running.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_running_pods</code></p>\n </td>\n <td>\n <p>Number of pods that have a running pod sandbox.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_runtime_operations_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration in seconds of runtime\n operations. 
Broken down by operation type.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_runtime_operations_errors_total</code></p>\n </td>\n <td>\n <p>Cumulative number of runtime operation errors by operation\n type.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kubelet_runtime_operations_total</code></p>\n </td>\n <td>\n <p>Cumulative number of runtime operations by operation type.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_node_status_allocatable</code></p>\n </td>\n <td>\n <p>The amount of resources allocatable for pods (after reserving some\n for system daemons).</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_node_status_capacity</code></p>\n </td>\n <td>\n <p>The total amount of resources available for a node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_pod_container_resource_limits (CPU)</code></p>\n </td>\n <td>\n <p>The number of requested limit resource by a container.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_pod_container_resource_limits (Memory)</code></p>\n </td>\n <td>\n <p>The number of requested limit resource by a container.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_pod_container_resource_requests (CPU)</code></p>\n </td>\n <td>\n <p>The number of requested request resource by a container.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_pod_container_resource_requests (Memory)</code></p>\n </td>\n <td>\n <p>The number of requested request resource by a container.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_pod_owner</code></p>\n </td>\n <td>\n <p>Information about the Pod's owner.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>kube_resourcequota</code></p>\n </td>\n <td>\n <p>Resource quotas in Kubernetes enforce usage limits on resources\n such as CPU, memory, and storage within namespaces.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_cpu</code></p>\n </td>\n <td>\n <p>The CPU usage metrics for a node, including usage per core and\n total usage.</p>\n </td>\n </tr>\n <tr>\n <td>\n 
<p><code>node_cpu_seconds_total</code></p>\n </td>\n <td>\n <p>Seconds the CPUs spent in each mode.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_disk_io_time_seconds</code></p>\n </td>\n <td>\n <p>The cumulative amount of time spent performing I/O operations on\n disk by a node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_disk_io_time_seconds_total</code></p>\n </td>\n <td>\n <p>The total amount of time spent performing I/O operations on disk\n by the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_disk_read_bytes_total</code></p>\n </td>\n <td>\n <p>The total number of bytes read from disk by the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_disk_written_bytes_total</code></p>\n </td>\n <td>\n <p>The total number of bytes written to disk by the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_filesystem_avail_bytes</code></p>\n </td>\n <td>\n <p>The amount of available space in bytes on the filesystem of a node\n in a Kubernetes cluster.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_filesystem_size_bytes</code></p>\n </td>\n <td>\n <p>The total size of the filesystem on the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_load1</code></p>\n </td>\n <td>\n <p>The 1-minute load average of a node's CPU usage.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_load15</code></p>\n </td>\n <td>\n <p>The 15-minute load average of a node's CPU usage.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_load5</code></p>\n </td>\n <td>\n <p>The 5-minute load average of a node's CPU usage.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_memory_Buffers_bytes</code></p>\n </td>\n <td>\n <p>The amount of memory used for buffer caching by the node's\n operating system.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_memory_Cached_bytes</code></p>\n </td>\n <td>\n <p>The amount of memory used for disk caching by the node's operating\n system.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_memory_MemAvailable_bytes</code></p>\n 
</td>\n <td>\n <p>The amount of memory available for use by applications and\n caches.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_memory_MemFree_bytes</code></p>\n </td>\n <td>\n <p>The amount of free memory available on the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_memory_MemTotal_bytes</code></p>\n </td>\n <td>\n <p>The total amount of physical memory available on the node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_network_receive_bytes_total</code></p>\n </td>\n <td>\n <p>The total number of bytes received over the network by the\n node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>node_network_transmit_bytes_total</code></p>\n </td>\n <td>\n <p>The total number of bytes transmitted over the network by the\n node.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>process_cpu_seconds_total</code></p>\n </td>\n <td>\n <p>Total user and system CPU time spent in seconds.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>process_resident_memory_bytes</code></p>\n </td>\n <td>\n <p>Resident memory size in bytes.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>rest_client_requests_total</code></p>\n </td>\n <td>\n <p>Number of HTTP requests, partitioned by status code, method, and\n host.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>rest_client_request_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of request latency in seconds. 
Broken down\n by verb, and host.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>storage_operation_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of duration of storage operations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>storage_operation_duration_seconds_count</code></p>\n </td>\n <td>\n <p>The count of duration of storage operations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>storage_operation_errors_total</code></p>\n </td>\n <td>\n <p>Cumulative number of errors during storage operations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>up</code></p>\n </td>\n <td>\n <p>A metric indicating whether the monitored target (e.g., node) is\n up and running.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>volume_manager_total_volumes</code></p>\n </td>\n <td>\n <p>The total number of volumes managed by the volume manager.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>workqueue_adds_total</code></p>\n </td>\n <td>\n <p>Total number of adds handled by workqueue.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>workqueue_depth</code></p>\n </td>\n <td>\n <p>Current depth of workqueue.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>workqueue_queue_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of how long in seconds an item stays in\n workqueue before being requested.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>workqueue_work_duration_seconds_bucket</code></p>\n </td>\n <td>\n <p>The bucketed histogram of how long in seconds processing an item\n from workqueue takes.</p>\n </td>\n </tr>\n </table></div><p>The following tables list the alerts that are created by this solution. 
The alerts\n are created as rules in Amazon Managed Service for Prometheus, and are displayed in your Amazon Managed Grafana workspace.</p><p>You can modify the rules, including adding or deleting rules by <a target=\"_blank\" href=\"https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-rules-edit.html\">editing the \n rules configuration file</a> in your Amazon Managed Service for Prometheus workspace.</p><p>These two alerts are special alerts that are handled slightly differently than\n typical alerts. Instead of alerting you to an issue, they give you information\n that is used to monitor the system. The description includes details about how to use\n these alerts.</p><div><table><thead>\n <tr>\n <th>Alert</th>\n <th>Description and usage</th>\n </tr>\n </thead>\n <tr>\n <td><p><code>Watchdog</code></p></td>\n <td><p>This is an alert meant to ensure that the entire alerting \n pipeline is functional. This alert is always firing, therefore it \n should always be firing in Alertmanager and always fire against a \n receiver. You can integrate this with your notification mechanism \n to send a notification when this alert is <em>not</em> \n firing. For example, you could use the \n <b>DeadMansSnitch</b> integration in \n PagerDuty.</p></td>\n </tr>\n <tr>\n <td><p><code>InfoInhibitor</code></p></td>\n <td><p>This is an alert that is used to inhibit info alerts. By \n themselves, info-level alerts can be very noisy, but they are relevant \n when combined with other alerts. This alert fires whenever there's a \n <code>severity=info</code> alert, and stops firing when another alert \n with a severity of <code>warning</code> or <code>critical</code> \n starts firing on the same namespace. 
This alert should be routed to \n a null receiver and configured to inhibit alerts with \n <code>severity=info</code>.</p></td>\n </tr>\n </table></div><p>The following alerts give you information or warnings about your system.</p><div><table><thead>\n <tr>\n <th>Alert</th>\n <th>Severity</th>\n <th>Description</th>\n </tr>\n </thead>\n <tr>\n <td>\n <p><code>NodeNetworkInterfaceFlapping</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Network interface is often changing its status</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemSpaceFillingUp</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>File system is predicted to run out of space within the next 24\n hours.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemSpaceFillingUp</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>File system is predicted to run out of space within the next 4\n hours.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemAlmostOutOfSpace</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>File system has less than 5% space left.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemAlmostOutOfSpace</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>File system has less than 3% space left.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemFilesFillingUp</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>File system is predicted to run out of inodes within the next 24\n hours.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemFilesFillingUp</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>File system is predicted to run out of inodes within the next 4\n hours.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemAlmostOutOfFiles</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>File system has less than 5% inodes left.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFilesystemAlmostOutOfFiles</code></p>\n </td>\n 
<td><code>critical</code></td>\n <td>\n <p>File system has less than 3% inodes left.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeNetworkReceiveErrs</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Network interface is reporting many receive errors.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeNetworkTransmitErrs</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Network interface is reporting many transmit errors.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeHighNumberConntrackEntriesUsed</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Number of conntrack entries is getting close to the limit.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeTextFileCollectorScrapeError</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Node Exporter text file collector failed to scrape.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeClockSkewDetected</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Clock skew detected.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeClockNotSynchronizing</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Clock not synchronizing.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeRAIDDegraded</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>RAID array is degraded.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeRAIDDiskFailure</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Failed device in RAID array.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFileDescriptorLimit</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kernel is predicted to exhaust the file descriptor limit soon.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>NodeFileDescriptorLimit</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Kernel is predicted to exhaust the file descriptor limit soon.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeNodeNotReady</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Node is not ready.</p>\n 
</td>\n </tr>\n <tr>\n <td>\n <p><code>KubeNodeUnreachable</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Node is unreachable.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletTooManyPods</code></p>\n </td>\n <td><code>info</code></td>\n <td>\n <p>Kubelet is running at capacity.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeNodeReadinessFlapping</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Node readiness status is flapping.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletPlegDurationHigh</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet Pod Lifecycle Event Generator is taking too long to\n relist.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletPodStartUpLatencyHigh</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet Pod startup latency is too high.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletClientCertificateExpiration</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet client certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletClientCertificateExpiration</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Kubelet client certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletServerCertificateExpiration</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet server certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletServerCertificateExpiration</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Kubelet server certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletClientCertificateRenewalErrors</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet has failed to renew its client certificate.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletServerCertificateRenewalErrors</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubelet has failed to renew its server 
certificate.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeletDown</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Target disappeared from Prometheus target discovery.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeVersionMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Different semantic versions of Kubernetes components\n running.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeClientErrors</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubernetes API server client is experiencing errors.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeClientCertificateExpiration</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Client certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeClientCertificateExpiration</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Client certificate is about to expire.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAggregatedAPIErrors</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubernetes aggregated API has reported errors.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAggregatedAPIDown</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Kubernetes aggregated API is down.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPIDown</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Target disappeared from Prometheus target discovery.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPITerminatedRequests</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>The Kubernetes API server has terminated <code>{{ $value |\n humanizePercentage }}</code> of its incoming requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePersistentVolumeFillingUp</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Persistent Volume is filling up.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePersistentVolumeFillingUp</code></p>\n </td>\n <td><code>warning</code></td>\n 
<td>\n <p>Persistent Volume is filling up.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePersistentVolumeInodesFillingUp</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Persistent Volume inodes are filling up.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePersistentVolumeInodesFillingUp</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Persistent Volume inodes are filling up.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePersistentVolumeErrors</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Persistent Volume is having issues with provisioning.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeCPUOvercommit</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Cluster has overcommitted CPU resource requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeMemoryOvercommit</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Cluster has overcommitted memory resource requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeCPUQuotaOvercommit</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Cluster has overcommitted CPU resource requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeMemoryQuotaOvercommit</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Cluster has overcommitted memory resource requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeQuotaAlmostFull</code></p>\n </td>\n <td><code>info</code></td>\n <td>\n <p>Namespace quota is going to be full.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeQuotaFullyUsed</code></p>\n </td>\n <td><code>info</code></td>\n <td>\n <p>Namespace quota is fully used.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeQuotaExceeded</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Namespace quota has exceeded the limits.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>CPUThrottlingHigh</code></p>\n </td>\n <td><code>info</code></td>\n <td>\n <p>Processes experience elevated CPU throttling.</p>\n </td>\n </tr>\n 
<tr>\n <td>\n <p><code>KubePodCrashLooping</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Pod is crash looping.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubePodNotReady</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Pod has been in a non-ready state for more than 15 minutes.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeDeploymentGenerationMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Deployment generation mismatch due to possible roll-back</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeDeploymentReplicasMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Deployment has not matched the expected number of replicas.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStatefulSetReplicasMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>StatefulSet has not matched the expected number of\n replicas.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStatefulSetGenerationMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>StatefulSet generation mismatch due to possible roll-back</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStatefulSetUpdateNotRolledOut</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>StatefulSet update has not been rolled out.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeDaemonSetRolloutStuck</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>DaemonSet rollout is stuck.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeContainerWaiting</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Pod container waiting longer than 1 hour</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeDaemonSetNotScheduled</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>DaemonSet pods are not scheduled.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeDaemonSetMisScheduled</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>DaemonSet pods are misscheduled.</p>\n </td>\n </tr>\n <tr>\n 
<td>\n <p><code>KubeJobNotCompleted</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Job did not complete in time</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeJobFailed</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Job failed to complete.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeHpaReplicasMismatch</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>HPA has not matched desired number of replicas.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeHpaMaxedOut</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>HPA is running at max replicas</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStateMetricsListErrors</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>kube-state-metrics is experiencing errors in list\n operations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStateMetricsWatchErrors</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>kube-state-metrics is experiencing errors in watch\n operations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStateMetricsShardingMismatch</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>kube-state-metrics sharding is misconfigured.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeStateMetricsShardsMissing</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>kube-state-metrics shards are missing.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPIErrorBudgetBurn</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>The API server is burning too much error budget.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPIErrorBudgetBurn</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>The API server is burning too much error budget.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPIErrorBudgetBurn</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>The API server is burning too much error budget.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>KubeAPIErrorBudgetBurn</code></p>\n 
</td>\n <td><code>warning</code></td>\n <td>\n <p>The API server is burning too much error budget.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>TargetDown</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>One or more targets are down.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdInsufficientMembers</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Etcd cluster has insufficient members.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighNumberOfLeaderChanges</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has a high number of leader changes.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdNoLeader</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Etcd cluster has no leader.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighNumberOfFailedGRPCRequests</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has a high number of failed gRPC requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdGRPCRequestsSlow</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Etcd cluster gRPC requests are slow.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdMemberCommunicationSlow</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster member communication is slow.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighNumberOfFailedProposals</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has a high number of failed proposals.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighFsyncDurations</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has high fsync durations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighCommitDurations</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has higher than expected commit durations.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighNumberOfFailedHTTPRequests</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster has failed HTTP 
requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHighNumberOfFailedHTTPRequests</code></p>\n </td>\n <td><code>critical</code></td>\n <td>\n <p>Etcd cluster has a high number of failed HTTP requests.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>etcdHTTPRequestsSlow</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Etcd cluster HTTP requests are slow.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>HostClockNotSynchronizing</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Host clock not synchronizing.</p>\n </td>\n </tr>\n <tr>\n <td>\n <p><code>HostOomKillDetected</code></p>\n </td>\n <td><code>warning</code></td>\n <td>\n <p>Host OOM kill detected.</p>\n </td>\n </tr>\n </table></div><p>There are a few things that can cause the setup of the project to fail. Be sure to\n check the following.</p><div>\n <ul><li>\n <p>You must complete all <a target=\"_blank\" href=\"https://docs.aws.amazon.com/grafana/latest/userguide/solution-eks.html#solution-eks-prerequisites\">Prerequisites</a> \n before installing the solution.</p>\n </li><li>\n <p>The cluster must have at least one node in it before attempting to create the\n solution or access the metrics.</p>\n </li><li>\n <p>Your Amazon EKS cluster must have the <code>AWS CNI</code>, <code>CoreDNS</code>, \n and <code>kube-proxy</code> add-ons installed. If they are not installed, the \n solution will not work correctly. They are installed by default when you create\n the cluster through the console. You may need to install them if the cluster was\n created through an AWS SDK.</p>\n </li><li>\n <p>Amazon EKS pod installation timed out. This can happen if there is not enough \n node capacity available. There are multiple causes of these issues, \n including:</p>\n <div>\n <ul><li>\n <p>The Amazon EKS cluster was initialized with Fargate instead of Amazon EC2. 
This \n project requires Amazon EC2.</p>\n </li><li>\n <p>The nodes are <a target=\"_blank\" href=\"https://docs.aws.amazon.com/eks/latest/userguide/node-taints-managed-node-groups.html\">tainted</a> \n and therefore unavailable.</p>\n <p>You can use <code>kubectl describe node \n <code>NODENAME</code> | grep Taints</code> to check the \n taints. Then run <code>kubectl taint node <code>NODENAME</code> \n <code>TAINT_NAME</code>-</code> to remove the taints. \n Make sure to include the <code>-</code> after the taint name.</p>\n </li><li>\n <p>The nodes have reached the capacity limit. In this case, you can \n create a new node or increase the capacity.</p>\n </li></ul></div>\n </li><li>\n <p>You do not see any dashboards in Grafana: You are using the incorrect Grafana \n workspace ID.</p>\n <p>Run the following command to get information about Grafana:</p>\n <pre><code>kubectl describe grafanas external-grafana -n grafana-operator</code></pre>\n <p>You can check the results for the correct workspace URL. If it is not the one\n you are expecting, re-deploy with the correct workspace ID.</p>\n <pre><code>Spec:\n External:\n API Key:\n Key: GF_SECURITY_ADMIN_APIKEY\n Name: grafana-admin-credentials\n URL: https://<code>g-123example</code>.grafana-workspace.<code>aws-region</code>.amazonaws.com\nStatus:\n Admin URL: https://<code>g-123example</code>.grafana-workspace.<code>aws-region</code>.amazonaws.com\n Dashboards:\n ...</code></pre>\n </li><li>\n <p>You do not see any dashboards in Grafana: You are using an expired API key.</p>\n <p>To check for this case, you will need to examine the Grafana operator logs \n for errors. 
Get the name of the Grafana operator with this command:</p>\n <pre><code>kubectl get pods -n grafana-operator</code></pre>\n <p>This will return the operator name, for example:</p>\n <pre><code>NAME READY STATUS RESTARTS AGE\n<code>grafana-operator-1234abcd5678ef90</code> 1/1 Running 0 1h2m</code></pre>\n <p>Use the operator name in the following command:</p>\n <pre><code>kubectl logs <code>grafana-operator-1234abcd5678ef90</code> -n grafana-operator</code></pre>\n <p>Error messages such as the following indicate an expired API key:</p>\n <pre><code>ERROR error reconciling datasource {\"controller\": \"grafanadatasource\", \"controllerGroup\": \"grafana.integreatly.org\", \"controllerKind\": \"GrafanaDatasource\", \"GrafanaDatasource\": {\"name\":\"grafanadatasource-sample-amp\",\"namespace\":\"grafana-operator\"}, \"namespace\": \"grafana-operator\", \"name\": \"grafanadatasource-sample-amp\", \"reconcileID\": \"72cfd60c-a255-44a1-bfbd-88b0cbc4f90c\", \"datasource\": \"grafanadatasource-sample-amp\", \"grafana\": \"external-grafana\", \"error\": \"status: 401, body: {\\\"message\\\":\\\"Expired API key\\\"}\\n\"}\ngithub.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile</code></pre>\n <p>In this case, create a new API key and deploy the solution again. If the\n problem persists, you can force synchronization by using the following command\n before redeploying:</p>\n <pre><code>kubectl delete externalsecret/external-secrets-sm -n grafana-operator</code></pre>\n </li><li>\n <p><em>CDK installs</em> – Missing SSM parameter. If you see \n an error like the following, run <code>cdk bootstrap</code> and try \n again.</p>\n <pre><code>Deployment failed: Error: aws-observability-solution-eks-infra-<code>$EKS_CLUSTER_NAME</code>: SSM \nparameter /cdk-bootstrap/<code>xxxxxxx</code>/version not found. Has the environment been \nbootstrapped? 
Please run 'cdk bootstrap' (see https://docs.aws.amazon.com/cdk/latest/\nguide/bootstrapping.html)</code></pre>\n </li><li>\n <p>Deployment can fail if the OIDC provider already exists. You will see an \n error like the following (in this case, for CDK installs):</p>\n <pre><code>| CREATE_FAILED | Custom::AWSCDKOpenIdConnectProvider | OIDCProvider/Resource/Default\nReceived response status [FAILED] from custom resource. Message returned: \nEntityAlreadyExistsException: Provider with url https://oidc.eks.<code>REGION</code>.amazonaws.com/id/<code>PROVIDER ID</code> already exists.</code></pre>\n <p>In this case, go to the IAM console, delete the OIDC provider, and try \n again.</p>\n </li><li>\n <p><em>Terraform installs</em> – You see an error message that \n includes <code>cluster-secretstore-sm failed to create kubernetes rest client \n for update of resource</code> and <code>failed to create kubernetes rest \n client for update of resource</code>.</p>\n <p>This error typically indicates that the External Secrets Operator is not \n installed or enabled in your Kubernetes cluster. The operator is installed as part of \n the solution deployment, but it is sometimes not yet ready when the solution needs \n it.</p>\n <p>You can verify that it's installed with the following command:</p>\n <pre><code>kubectl get deployments -n external-secrets</code></pre>\n <p>If it's installed, it can take some time for the operator to be fully ready\n for use. You can check the status of the needed Custom Resource Definitions \n (CRDs) by running the following command:</p>\n <pre><code>kubectl get crds | grep external-secrets</code></pre>\n <p>This command should list the CRDs related to the External Secrets Operator,\n including <code>clustersecretstores.external-secrets.io</code> and \n <code>externalsecrets.external-secrets.io</code>. 
If they are not listed, wait \n a few minutes and check again.</p>\n <p>Once the CRDs are registered, you can run <code>terraform apply</code> again\n to deploy the solution.</p>\n </li></ul></div>