PCE Health Metrics Reference

The health metrics consist of a set of key-value pairs. The following table describes the possible keys that can appear. Each entry lists the metric's category, its key, a description, and the severity levels (if any) that its values trigger.

Disk Space

disk

The PCE node reports disk space for the PCE application directories (configured in the runtime_env.yml file):

  • ephemeral_data_root
  • runtime_data_root
  • log_dir
  • persistent_data_root

When all these directories are on a single mount point, the node reports: disk=n%

When multiple mount points exist, the node reports the first discovered path by name, such as:

ephemeral_data_root=n%

log_dir=n%

When the PCE encounters an error determining this information, the node reports: disk=?

These thresholds trigger the following severity levels:

  • NOTICE: disk_space >= 90% or disk_inodes >= 90%
  • WARNING: disk_space >= 95% or disk_inodes >= 95%

The inode value is reported only when one of the conditions above is met; otherwise, only disk space is reported. When a node has multiple disk mounts, the message might look like:

ephemeral_data_root_inodes=n, etc.
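
Purely as an illustration of how these values map to severities, the following sketch parses a sample metric string and applies the NOTICE and WARNING thresholds above. The helper name and the sample values are invented; they are not PCE output.

    # Hypothetical sketch: classify reported disk usage against the
    # NOTICE (>= 90%) and WARNING (>= 95%) thresholds described above.
    def disk_severity(percent_used):
        if percent_used >= 95:
            return "WARNING"
        if percent_used >= 90:
            return "NOTICE"
        return None  # below both thresholds; no severity reported

    # Example metric string as it might appear for a multi-mount node.
    metrics = "log_dir=82% ephemeral_data_root=91% persistent_data_root=96%"
    for pair in metrics.split():
        key, value = pair.split("=")
        if value == "?":
            print(key, "could not be determined")
            continue
        severity = disk_severity(int(value.rstrip("%")))
        if severity:
            print(f"{severity}: {key}={value}")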

Physical Memory

memory

The PCE node reports basic physical memory usage, indicated as:

memory=n%

These thresholds trigger the following severity levels:

  • NOTICE: memory >= 80%
  • WARNING: memory >= 95%

CPU Load

cpu

The CPU load is calculated as a percentage between two time slices and represents CPUs of all nodes in the cluster. For example, cpu=100% means all cores are maximized.

These thresholds trigger the following severity levels:

  • NOTICE: cpu >= 95% for more than 1 minute
  • WARNING: cpu >= 95% for more than 5 minutes
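
The sketch below shows one common way such a between-two-time-slices percentage can be computed on Linux, using the aggregate counters in /proc/stat. It is illustrative only, not the PCE's internal implementation, and the 5-second interval is arbitrary.

    # Illustrative only: compute CPU load as a percentage between two time
    # slices from the aggregate "cpu" line of /proc/stat.
    import time

    def cpu_times():
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]   # drop the leading "cpu" label
        values = list(map(int, fields))
        idle = values[3] + values[4]            # idle + iowait
        return idle, sum(values)

    idle1, total1 = cpu_times()
    time.sleep(5)                               # the "time slice"
    idle2, total2 = cpu_times()

    busy_pct = 100 * (1 - (idle2 - idle1) / (total2 - total1))
    print(f"cpu={busy_pct:.0f}%")               # 100% means all cores are busy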

Cluster Leader

leader

The IP address of the current leader, or unavailable when no leader exists or it is unreachable.

 

Cluster Status

cluster

The overall health of the cluster, reported by the leader only:

  • cluster=healthy: Everything is operating properly and all PCE services are running.
  • cluster=degraded: The cluster is running but has unhealthy nodes.
  • cluster=down: The cluster is missing a required service for < 5 minutes.
  • cluster=failed: The cluster is missing a required service for >= 5 minutes.

These status values trigger the following severity levels:

  • NOTICE: cluster=degraded (<2 minutes)
  • WARNING: cluster=degraded (>=2 minutes)
  • WARNING: cluster=down (<2 minutes)
  • ERROR: cluster=down (>= 2 minutes)
  • CRITICAL (FATAL): cluster=failed
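
Because the severity depends on both the reported state and how long it has persisted, the following sketch restates the escalation rules above. The function name and example calls are hypothetical.

    # Hypothetical sketch of the escalation rules above: severity depends on
    # both the cluster state and how long that state has persisted.
    def cluster_severity(state, minutes_in_state):
        if state == "failed":
            return "CRITICAL"                    # reported as CRITICAL (FATAL)
        if state == "down":
            return "ERROR" if minutes_in_state >= 2 else "WARNING"
        if state == "degraded":
            return "WARNING" if minutes_in_state >= 2 else "NOTICE"
        return None                              # healthy

    print(cluster_severity("degraded", 1))       # NOTICE
    print(cluster_severity("down", 3))           # ERROR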

Missing Nodes

missing

The number of nodes that are missing from the cluster. If no nodes are missing, this metric is not reported.

 

Replication Lag

database_replication_lag

The number of seconds the database replica is lagging behind the database master. Output by database replica nodes only.

These thresholds trigger the following severity level:

  • WARNING: >=30 seconds
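
The PCE derives this metric itself; purely as a point of reference, replica lag in seconds can be observed on a PostgreSQL standby (the PCE databases are PostgreSQL, as noted elsewhere in this table) roughly as follows. The connection string is a placeholder, not a real PCE endpoint.

    # Illustration only: observing replica lag in seconds on a PostgreSQL
    # standby. This is not how the PCE computes database_replication_lag.
    import psycopg2

    conn = psycopg2.connect("dbname=example host=replica.example.com")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        lag_seconds = cur.fetchone()[0] or 0
    print("WARNING" if lag_seconds >= 30 else "OK", f"lag={lag_seconds:.0f}s")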

Disk Latency

policy_disk_latency_milliseconds,
traffic_disk_latency_milliseconds

(19.3.2 and later) Average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. The metric is calculated exactly the same way iostat calculates await.

Values: delay (milliseconds), disk

Usefulness: Indicates disk I/O, which is especially useful when the DB services are under heavy load.

  • Normal: <= 300
  • Warning: > 300 and < 800
  • Critical: >= 800
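
Because the metric is calculated the same way iostat calculates await, the sketch below shows that calculation from two samples of /proc/diskstats. It is illustrative only; the device name and sampling interval are placeholders.

    # Illustrative await calculation from two /proc/diskstats samples, the same
    # formula iostat uses: (delta read ms + delta write ms) / delta completed I/Os.
    import time

    def diskstats(device):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    reads, read_ms = int(fields[3]), int(fields[6])
                    writes, write_ms = int(fields[7]), int(fields[10])
                    return reads + writes, read_ms + write_ms
        raise ValueError(f"device {device!r} not found")

    ios1, ms1 = diskstats("sda")
    time.sleep(5)
    ios2, ms2 = diskstats("sda")

    delta_ios = ios2 - ios1
    await_ms = (ms2 - ms1) / delta_ios if delta_ios else 0.0
    print(f"await={await_ms:.1f} ms")   # Normal <= 300, Critical >= 800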

Policy Database: Size

policy_database_size_gb

(19.3.2 and later) Informational. Size of the Policy Database data directory. Provides an indication of the disk space requirements of the Policy DB. Depending on size, reported in units of bytes, kilobytes, megabytes, gigabytes, or terabytes.

Policy Database: Disk Utilization

policy_database_utilization_percentage

(19.3.2 and later) Usage ratio of the disk partition holding the Policy DB. Consequences of the Policy DB running out of disk space can be critical.

  • Normal: < 90
  • Warning: [90 - 95]
  • Critical: >= 95

Policy Database: Transaction ID Max Age

policy_database_transaction_id_max_age

(19.3.2 and later) Maximum transaction ID (TxID) age of the Policy DB. This does not apply to the Traffic DB. Indicates the risk of the DB running out of TxIDs, which could cause a DB lockdown requiring expensive recovery procedures. The PCE will attempt to automatically detect and recover before this occurs (requires reboot).

  • Normal: < 1 billion
  • Warning: [1 billion - 2 billion]
  • Critical: >= 2 billion

Policy Database: Vacuum Backlog

policy_database_vacuum_backlog_percentage

(19.3.2 and later) Percentage of vacuum-ready rows (also known as dead rows) over the total number of rows of the Policy database, computed over a period of up to 12 hours. This does not apply to the Traffic DB. Indicates how well the auto-vacuum of the DB is performing. If the percentage is persistently above the Postgres default setting of about 20% of the total number of rows, it is an indication that the auto-vacuum is not working effectively.

  • Normal: < 40
  • Warning: 40 - 80, and the current number of vacuum-ready rows is above the Postgres default minimum to trigger a vacuum (20% of rows + 50)
  • Critical: >= 80, and the current number of vacuum-ready rows is above the Postgres default minimum to trigger a vacuum (20% of rows + 50)
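
The "(20% of rows + 50)" figure corresponds to the PostgreSQL defaults autovacuum_vacuum_scale_factor = 0.2 and autovacuum_vacuum_threshold = 50. The sketch below shows how those defaults translate into a vacuum trigger point and a dead-row percentage; the row counts are invented, and the percentage formula is one plausible reading of "over the total number of rows".

    # Hedged illustration of the Postgres default autovacuum trigger and the
    # dead-row percentage this metric reports. The row counts are made up.
    SCALE_FACTOR = 0.2   # autovacuum_vacuum_scale_factor (Postgres default)
    THRESHOLD    = 50    # autovacuum_vacuum_threshold   (Postgres default)

    live_rows = 1_000_000
    dead_rows = 450_000

    vacuum_trigger = THRESHOLD + SCALE_FACTOR * live_rows     # 200,050 rows
    backlog_pct = 100 * dead_rows / (live_rows + dead_rows)   # ~31% of total rows

    if backlog_pct >= 80 and dead_rows > vacuum_trigger:
        print("Critical")
    elif backlog_pct >= 40 and dead_rows > vacuum_trigger:
        print("Warning")
    else:
        print("Normal")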

VEN Heartbeat Performance: Latency

ven_heartbeat_average_latency_seconds,
ven_heartbeat_high_latency_seconds

(19.3.2 and later) (milliseconds)

ven_heartbeat_average_latency_seconds is the average over the measurement time period.

ven_heartbeat_high_latency_seconds is the 95th percentile over the measurement time period.

Backend processing time of VEN heartbeat requests. Does not include the time spent in the load balancer queues, as the queue time may be influenced by a number of other external factors.

The VEN heartbeat uses the same PCE services and components as the policy computation and is therefore a good overall indicator for the health of the policy subsystem, including whether system resources are being overwhelmed. Historically, it has reliably indicated I/O and/or policy cache bottleneck(s).

  • Warning: average > 500ms
  • Critical: average > 5 sec
VEN Heartbeat Performance: Success

ven_heartbeat_success_count_per_hour

(19.3.2 and later) Active VENs send a heartbeat API request to the PCE approximately every 5 minutes. This metric captures the number of VEN heartbeat requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. If the PCE has just started up, this number is expected to ramp up over the first hour.

The number of successful VEN heartbeat requests per hour summed across all PCE core nodes should be approximately the number of VENs times 12 (heartbeats happen every 5 minutes per VEN). A low number of successful VEN heartbeats likely indicates issues with VEN connectivity or PCE performance. Depending on the VEN disconnect/offline settings, a low VEN heartbeat success rate may cause traffic to be dropped to/from enforced workloads.

  • Warning: for any non-2xx code, greater than 1% of total requests for the time window
  • Critical: for any non-2xx code, greater than 20% of total requests for the time window
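
As a quick sanity check of the "number of VENs times 12" expectation described above, consider the following sketch; every number in it is invented.

    # Purely illustrative sanity check of the "VEN count x 12 per hour" rule of
    # thumb described above. All numbers are made up.
    ven_count = 10_000
    expected_per_hour = ven_count * 12             # one heartbeat per VEN every 5 min

    observed_per_core = [41_500, 39_800, 38_200]   # per core node, past hour
    observed_total = sum(observed_per_core)        # summed across all core nodes

    shortfall_pct = 100 * (expected_per_hour - observed_total) / expected_per_hour
    print(f"expected={expected_per_hour}, observed={observed_total}, "
          f"shortfall={shortfall_pct:.1f}%")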

VEN Heartbeat Performance: Failure

ven_heartbeat_failure_percent,
ven_heartbeat_failure_count_per_hour

(19.3.2 and later)

  • Warning: 5%
  • Critical/Error: 20%

Policy Performance: Latency

ven_policy_average_latency_seconds,
ven_policy_high_latency_seconds

(19.3.2 and later) (milliseconds) Average response time for policy. Latency indicates policy complexity and system load/bottlenecks. This metric captures the backend processing time of VEN policy requests. It does not include the time spent in the load balancer queues, as queue time may be influenced by a number of other external factors.

The cost to compute the VEN policy instructions depends on a large number of factors, including but not limited to the rate of change in the environment, the number of rules, the number of actors (workloads, labels, etc.) used in the rules, and the density of desired connectivity between workloads. Abnormally high VEN policy request latency may indicate issues with inadequate system resources, policy changes that result in higher than intended policy complexity, or an abnormally high rate of change to the workload context.

  • Warning: average > 10 sec
  • Critical: average > 30 sec

Policy Performance: Request Count

ven_policy_request_count_per_hour

(19.3.2 and later) (requests/hour) When a new policy is provisioned or the workload context (IP address, label membership, etc.) is changed on the PCE, policy instructions are sent to affected VENs. This metric captures the number of VEN policy requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. When a PCE first starts or is restarted, this number may increase sharply over a short time period as every VEN checks to ensure policy sync.

The VEN policy request rate provides an indicator of the rate of policy change across the organization, and therefore, an estimate of the load on the PCE. VEN policy requests are sometimes more expensive to process than other API requests, and frequent policy changes may result in decreased overall performance and longer policy convergence times. Frequent policy changes may also be a symptom of underlying network or infrastructure issues, such as (but not limited to) frequent IP address changes or improperly cloned VENs.

  • Warning: > 1M req/hour

Collector: Flow Summaries

collector_summaries_per_second

(19.3.2 and later) Total flow summaries processing rate for a single core PCE node, over the last hour. The sum of these rates across core nodes should roughly match the flow summary ingest rate, or the PCE will show an increasing backlog size.

  • Warning: > 12,000
  • Critical: > 15,000

Collector: Success Rate

collector_post_success_count_per_hour

(19.3.2 and later) Informational. Total flow summary posts accepted by a core machine over the last hour. Posts can be of different sizes and so take different amounts of time to process, but you should see roughly the same rates for each core.

If counts differ across core machines, ensure intra-PCE latency is within the 10ms limit.

 
Collector: Failure Rate

collector_post_failure_count_per_hour

(19.3.2 and later) Informational. Total flow summary failure rate over the last hour. Under normal operational circumstances, this value should be approximately the same for all core nodes.

Collector: Failure Percentage

collector_post_failure_percentage

(19.3.2 and later) Failed posts as a percentage of total posts over the last hour.

  • Warning: > 10%
  • Critical: > 20%

Traffic Summary: Ingest

traffic_summaries_per_second,
total_traffic_summaries_per_second

(19.3.2 and later) The mean rate at which flow summaries are added to the PostgreSQL database over the last hour.

  • Warning: > 12,000
  • Critical: > 15,000

Traffic Summary: Database Size

traffic_database_size_gb,
traffic_database_size_days

(19.3.2 and later) (gigabytes; days) Informational.

Traffic Summary: Database Size: % of Allocated

traffic_database_utilization_percentage

(19.3.2 and later) Informational. The system is behaving normally even if it is near or at the configured disk limits. The oldest flows will be dropped to enforce the limit, however, which may not be desirable.

  • Warning: > 10%
  • Critical: > 50%

Traffic Summary: Backlog Size

traffic_backlog_size_gb

(19.3.2 and later) Amount of flow data in the backlog that is not yet in the traffic database, in gigabytes. If the backlog size exceeds a certain limit (default is 10 GB; configurable in the runtime environment), flows are dropped.

Traffic Summary: Backlog Size: % of Allocated

traffic_backlog_utilization_percentage

(19.3.2 and later) Increasing values indicate that the buffered new flow data is growing, meaning the PCE is unable to keep up with the rate of data posted. The PCE collector flow summary rate and the PCE traffic summary ingest rate need to be roughly equal, or this buffered backlog will grow.

  • Warning: > 80%
  • Critical: > 90%
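
Tying this row to the Collector: Flow Summaries and Traffic Summary: Ingest metrics above, the sketch below compares the two rates and checks backlog utilization against the thresholds; all values are invented.

    # Hedged illustration: compare the summed collector flow summary rate with
    # the traffic summary ingest rate, and check backlog utilization against
    # the thresholds above. All numbers are invented for the example.
    collector_summaries_per_second = [4_200, 4_100, 4_300]   # one value per core node
    total_traffic_summaries_per_second = 11_900               # ingest into the traffic DB
    traffic_backlog_utilization_percentage = 83

    collector_total = sum(collector_summaries_per_second)     # 12,600
    if collector_total > total_traffic_summaries_per_second:
        print(f"ingest lagging by {collector_total - total_traffic_summaries_per_second} "
              "summaries/sec; backlog will grow")

    if traffic_backlog_utilization_percentage > 90:
        print("Critical: backlog above 90% of its allocation")
    elif traffic_backlog_utilization_percentage > 80:
        print("Warning: backlog above 80% of its allocation")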

Supercluster Replication Lag

Number of seconds since a replication event generated by a PCE was processed on another PCE. The Supercluster replication engine relies on events to ensure data gets replicated. These are not the same as the PCE audit events.

An increasing replication lag usually indicates some issue with the PCE replication engine or network connectivity. The larger the replication lag, the longer it may take a PCE to catch up with other regions once the underlying issue is addressed.

  • Warning: This is an indication that the inter-PCE data replication is not working as intended. One or more PCEs may not have the data generated by one or more other PCEs. The Supercluster expects that the replication lag will not fall behind by a large margin. If it does, the user may lose some data if the PCE that is ahead fails and is not recoverable.