node-problem-detector

node-problem-detector aims to make various node problems visible to the upstream layers in cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to apiserver. node-problem-detector can either run as a DaemonSet or run standalone. It also runs as a Kubernetes Addon enabled by default in the GKE cluster.

Background

There are tons of node problems could possibly affect the pods running on the node such as:

  • Infrastructure daemon issues: ntp service down;
  • Hardware issues: Bad cpu, memory or disk, ntp service down;
  • Kernel issues: Kernel deadlock, corrupted file system;
  • Container runtime issues: Unresponsive runtime daemon;
  • ...

Currently these problems are invisible to the upstream layers in cluster management stack, so Kubernetes will continue scheduling pods to the bad nodes.

To solve this problem, we introduced this new daemon node-problem-detector to collect node problems from various daemons and make them visible to the upstream layers. Once upstream layers have the visibility to those problems, we can discuss the remedy system.

Problem API

node-problem-detector uses

  • Event: Temporary problem that has limited impact on pod but is informative should be reported as Event.

Problem Daemon

A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific kind of node problems and reports them to node-problem-detector.

A problem daemon could be:

  • A tiny daemon designed for dedicated usecase of Kubernetes.
  • An existing node health monitoring daemon integrated with node-problem-detector.

Currently, a problem daemon is running as a goroutine in the node-problem-detector binary. In the future, we'll separate node-problem-detector and problem daemons into different containers, and compose them with pod specification.

List of supported problem daemons:

Problem DaemonNodeConditionDescription
KernelMonitorKernelDeadlockA system log monitor monitors kernel log and reports problem according to predefined rules.
AbrtAdaptorNoneMonitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the link.
CustomPluginMonitorOn-demand(According to users configuration)A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. See proposal here.

Tell us about a new Kubernetes application

Newsletter

Never miss a thing! Sign up for our newsletter to stay updated.

About

Discover and share new Kubernetes applications

Navigation