A Kubernetes readiness probe can improve your service’s quality and reduce operational issues, making your service more resilient and robust.
However, the probe can also seriously degrade your service’s operation if you don’t implement it carefully and properly.
Many factors can cause readiness probe failed errors, and you’ll need to pinpoint the root cause or make a few adjustments to address them.
While there are many causes of readiness probe failures and multiple ways to fix them, we’ll cover three Kubernetes troubleshooting tactics for addressing these errors below.
Readiness Probe: A Quick Overview
Kubernetes uses readiness probes to determine when it’s safe to send traffic to a pod and when to transition the pod to the Ready state.
A readiness probe assesses whether a specific pod will accept traffic when it is used as a backend endpoint for a Service.
The probe runs for the rest of the pod’s life, which means it continues to run even after the pod reaches the Ready state.
Developers use readiness probes to tell Kubernetes that a running container shouldn’t receive traffic.
It’s useful when waiting for apps to perform time-consuming (initial) tasks, such as warming caches, loading files, and establishing network connections.
The readiness probe is configured via the spec.containers[].readinessProbe attribute of the pod configuration.
If you get a readiness probe failed error, or if the probe returns a failed state, Kubernetes removes the pod’s IP address from the endpoints of all Services that target it.
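As a concrete illustration, a minimal HTTP readiness probe might look like the following. The image name, port, and /healthz path are placeholders for your own app, not values mandated by Kubernetes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: example/app:1.0    # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5  # wait before the first probe
      periodSeconds: 10       # probe every 10 seconds
```

Until the endpoint starts returning success, the pod stays out of the Service’s endpoints and receives no traffic.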
Common Causes of Readiness Probe Failure
Before learning to fix readiness probe failed errors, you need to know some common causes of the error first.
Readiness probes are commonly used to verify state throughout the container lifecycle, which means delays or interruptions in the probe’s response can interrupt the service.
The following are some of the usual conditions that can cause apps to fail the readiness probe incorrectly.
Readiness probe responses can be conditional based on components outside the application’s direct control.
For instance, you can set up a readiness probe with an httpGet handler, where the app checks a cache service’s or database’s availability before responding to the probe.
If a database sends a late response or is down, the whole application becomes unavailable, which might (or might not) make sense depending on your app configuration.
The behavior makes sense if the app can’t function (at all) without the third-party component.
However, if the app can keep functioning by falling back to a local cache, for instance, the external cache or database shouldn’t be connected to probe responses.
Generally, the pod shouldn’t fail the readiness probe if it’s technically ready (even if it can’t function properly).
One workaround is to implement a degraded mode: if the database is unreachable, serve read requests that can be answered from the local cache and return 503 (Service Unavailable) for write requests.
Make sure that downstream services are resilient to an upstream service failure.
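The degraded-mode idea above can be sketched as a small request-handling policy. This is an illustrative sketch only (Python here purely for clarity); the function name, the local_cache contents, and the db_available flag are all assumptions, not part of any Kubernetes or library API:

```python
# Sketch of a degraded-mode policy: serve reads from a local cache
# when the database is down, and reject writes with 503.
local_cache = {"greeting": "hello"}  # hypothetical warm cache

def handle_request(method: str, key: str, db_available: bool):
    """Return an (http_status, body) tuple under the degraded-mode policy."""
    if db_available:
        # Normal mode: the database answers everything.
        return 200, f"served {key} from database"
    if method == "GET" and key in local_cache:
        # Degraded mode: fall back to the local cache for reads.
        return 200, local_cache[key]
    # Writes (and uncached reads) cannot be served without the database.
    return 503, "service unavailable"
```

Crucially, the readiness probe would keep passing in degraded mode, so the pod stays in the Service’s endpoints and keeps serving what it can.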
Readiness probes can also respond late in certain circumstances, for instance, when the app has to perform heavy computation or read large volumes of data with low latency.
To minimize readiness probe failed errors, consider this behavior during setup. Also, always test your app thoroughly before running it in production to help detect potential errors early.
How to fix readiness probe failures
The reality is that troubleshooting Kubernetes nodes successfully depends on how quickly you can contextualize the problem with what’s going on in the rest of the cluster. You’ll often perform your investigation during production fires, and the main challenge is correlating service-level occurrences with other events in the underlying infrastructure.
When readiness probe failures cause nodes to go offline, it’s critical to understand what’s happening on all of the nodes involved. You’ll also need context about other potentially significant events in the environment. With that said, below are several basic troubleshooting tips to help you address readiness probe failed errors.
1. Check the readiness probe
If your pod is running but shows a 0/1 ready state (or 0/2 if the pod has multiple containers), you need to verify its readiness by assessing the health check (the readiness probe).
In this scenario, a common factor that can cause a readiness probe failure error can be application issues.
First, run the command below to check the logs.
kubectl logs <pod_identifier> -c <container_name> -n <namespace>
Next, run this command to verify the events.
kubectl describe pod <pod_identifier> -n <namespace>
If the readiness probe health check fails, do the following.
Check the READY column in the kubectl get pods output to confirm whether the readiness probe is executing correctly.
Then re-check the logs (kubectl logs <pod_identifier> -c <container_name> -n <namespace>) and the events (kubectl describe pod <pod_identifier> -n <namespace>) for clues about why the probe is failing.
2. Increase the readiness probe timeout
The readiness probe keeps getting called throughout the container’s lifetime, once every periodSeconds seconds.
It lets the container make itself temporarily unavailable when one dependency isn’t available or while performing maintenance, running large batch jobs, or other similar tasks.
If you don’t account for the fact that the readiness probe keeps getting called after the container starts, you might design the probe incorrectly, which can lead to serious issues at runtime.
Even if you understand this behavior, you can still run into problems if the probe doesn’t account for unexpected system dynamics.
If the container examines a shared dependency within the readiness probe, configure the probe timeout longer than the specific dependency’s maximum response time.
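For example, if a dependency the probe touches can take up to three seconds to answer, you would want to give the probe more headroom than the default one-second timeout. The values below are illustrative, not recommendations:

```yaml
readinessProbe:
  httpGet:
    path: /healthz    # hypothetical health endpoint
    port: 8080
  periodSeconds: 10   # how often the probe runs
  timeoutSeconds: 5   # longer than the dependency's worst-case response time
```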
3. Raise the readiness probe failure threshold
The failureThreshold is the number of times the readiness probe needs to fail before the pod is considered not ready; the default count is three.
To help prevent a readiness probe failure, consider increasing the failure threshold in proportion to the probe’s frequency (set by the periodSeconds parameter).
For instance, you could raise the threshold from the default of three failures to, say, five or six.
The goal is to avoid failing the readiness probe prematurely, before response latencies return to normal and temporary system dynamics have elapsed.
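A probe tuned to tolerate short-lived latency spikes might look like this. The numbers are illustrative; six failures at ten-second intervals gives the app a full minute to recover before it is marked not ready:

```yaml
readinessProbe:
  httpGet:
    path: /healthz       # hypothetical health endpoint
    port: 8080
  periodSeconds: 10      # probe every 10 seconds
  failureThreshold: 6    # mark unready only after 6 consecutive failures
```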
Make Readiness Probes Work for You
While proper configuration can prevent many readiness probe failures, you can’t always avoid potential errors.
Identify and address underlying issues, learn the essential troubleshooting tips, and track and report the frequency of container restarts.