PCE Health Metrics Reference

The health metrics consist of a set of key-value pairs. Each entry below describes a metric category, the keys that can appear, the meaning of each key, and the severity levels that apply.

Disk Space

disk,
disk_space_percent_thresholds,
disk_inode_percent_thresholds

The PCE node reports disk space for the PCE application directories (configured in the runtime_env.yml file):

  • ephemeral_data_root
  • runtime_data_root
  • log_dir
  • persistent_data_root

When all these directories are on a single mount point, the node reports: disk=n%

When multiple mount points exist, the node reports each mount point by name, such as:

ephemeral_data_root=n%

log_dir=n%

When the PCE encounters an error determining this information, the node reports: disk=?.

disk_space_percent_thresholds consists of two values that determine the disk space usage percentages that result in NOTICE or WARNING notifications.

disk_inode_percent_thresholds consists of two values that determine the disk inode usage percentages that result in NOTICE or WARNING notifications.

Default: the following thresholds trigger the corresponding severity levels:

  • NOTICE: disk_space >= 90% or disk_inodes >= 90%
  • WARNING: disk_space >= 95% or disk_inodes >= 95%

These default thresholds can be modified using disk_space_percent_thresholds or disk_inode_percent_thresholds.

The disk inode value is reported only when one of the conditions above is met; otherwise, only disk space is reported. When a node has multiple disk mounts, the message might look like:

ephemeral_data_root_inodes=n, etc.
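
For illustration, the following sketch (Python; all names are hypothetical, not part of the PCE) shows how a monitoring script might parse these key=value pairs and apply a two-value NOTICE/WARNING threshold pair such as the disk space defaults above:

    # Hypothetical monitoring sketch: parse health key=value pairs and
    # classify each percentage against a (NOTICE, WARNING) threshold pair.
    def classify_usage(percent, thresholds=(90, 95)):
        """Return a severity for a usage percentage, or None if healthy."""
        notice, warning = thresholds
        if percent >= warning:
            return "WARNING"
        if percent >= notice:
            return "NOTICE"
        return None

    def parse_health_pairs(line):
        """Parse pairs such as 'ephemeral_data_root=91% log_dir=96%'."""
        pairs = {}
        for token in line.split():
            key, _, value = token.partition("=")
            # 'disk=?' means the PCE could not determine the value
            pairs[key] = None if value == "?" else int(value.rstrip("%"))
        return pairs

    for key, pct in parse_health_pairs("ephemeral_data_root=91% log_dir=96%").items():
        if pct is None:
            continue
        severity = classify_usage(pct)  # default 90/95 thresholds
        if severity:
            print(f"{severity}: {key}={pct}%")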

Physical Memory

memory,
memory_percent_thresholds

Each PCE node reports basic physical memory usage, indicated as:

memory=n%

memory_percent_thresholds consists of two values that determine the memory usage percentages that result in NOTICE or WARNING notifications.

Default: the following values trigger the corresponding severity levels:

  • NOTICE: memory >= 80%
  • WARNING: memory >= 95%

These default thresholds can be modified using memory_percent_thresholds.

CPU Load

cpu,
cpu_max_percent,
cpu_tolerance_seconds

Each PCE node reports its CPU load as cpu=n%. The CPU load is calculated as a percentage between two time slices and represents the combined load across all CPU cores on the node. For example, cpu=100% means all cores are maximized. A notification (NOTICE or WARNING) is issued when the CPU load exceeds a given percentage for a given amount of time.

cpu_max_percent is the CPU usage percentage above which the notification timer begins.

cpu_tolerance_seconds controls the notification timer. It consists of two values that determine how long the CPU is above the maximum usage percentage before a NOTICE or WARNING occurs.

Default: the following values trigger the corresponding severity levels:

  • NOTICE: cpu >= 95% for more than 1 minute
  • WARNING: cpu >= 95% for more than 5 minutes

These default thresholds can be modified using cpu_max_percent and cpu_tolerance_seconds.
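
A minimal sketch of this timer behavior (Python; illustrative names and structure, not PCE internals) might look like:

    import time

    CPU_MAX_PERCENT = 95              # cpu_max_percent default
    TOLERANCE_SECONDS = (60, 300)     # (NOTICE, WARNING) defaults above

    def check_cpu(load_percent, state):
        """Raise a severity once load stays above the maximum long enough."""
        now = time.monotonic()
        if load_percent < CPU_MAX_PERCENT:
            state["since"] = None     # excursion over; reset the timer
            return None
        if state["since"] is None:
            state["since"] = now      # excursion starts; timer begins
        elapsed = now - state["since"]
        notice_s, warning_s = TOLERANCE_SECONDS
        if elapsed > warning_s:
            return "WARNING"
        if elapsed > notice_s:
            return "NOTICE"
        return None

    state = {"since": None}
    # Poll periodically, e.g.: check_cpu(read_cpu_percent(), state)
    # (read_cpu_percent is a placeholder for whatever sampling you use)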

Cluster Leader

leader

The IP address of the current leader, or unavailable when no leader exists or the leader is unreachable.


Cluster Status

cluster

The overall health of the cluster, reported by the leader only:

  • cluster=healthy: Everything is operating properly and all PCE services are running.
  • cluster=degraded: The cluster is running but has unhealthy nodes.
  • cluster=down: The cluster has been missing a required service for < 5 minutes.
  • cluster=failed: The cluster has been missing a required service for >= 5 minutes.

These status values trigger the following severity levels:

  • NOTICE: cluster=degraded (<2 minutes)
  • WARNING: cluster=degraded (>=2 minutes)
  • WARNING: cluster=down (<2 minutes)
  • ERROR: cluster=down (>= 2 minutes)
  • CRITICAL (FATAL): cluster=failed

Missing Nodes

missing

The number of nodes that are missing from the cluster. If no nodes are missing, this metric is not reported.


Replication Lag

database_replication_lag

The number of seconds the database replica is lagging behind the primary database. Output by database replica nodes only.

This threshold triggers the following severity level:

  • WARNING: >=30 seconds

Disk Latency

policy_disk_latency_milliseconds,
traffic_disk_latency_milliseconds

(19.3.2 and later) Average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. The metric is calculated exactly the same way iostat calculates await.

Values: delay (milliseconds), disk

Usefulness: Indicates disk I/O, which is especially useful when the DB services are under heavy load.

  • Normal: <= 300
  • Warning: > 300 and < 800
  • Critical: >= 800
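
As a rough illustration of the await calculation referenced above, the following Python sketch derives it from two samples of /proc/diskstats on Linux (an approximation for illustration, not the PCE's implementation):

    import time

    def disk_totals(device):
        """Return (I/Os completed, ms spent on I/O) for reads plus writes."""
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    reads, ms_reading = int(fields[3]), int(fields[6])
                    writes, ms_writing = int(fields[7]), int(fields[10])
                    return reads + writes, ms_reading + ms_writing
        raise ValueError(f"device {device!r} not found")

    ios1, ms1 = disk_totals("sda")
    time.sleep(5)                     # sample interval
    ios2, ms2 = disk_totals("sda")

    delta_ios = ios2 - ios1
    # await = time spent on I/O (queued + serviced) / I/Os completed
    await_ms = (ms2 - ms1) / delta_ios if delta_ios else 0.0
    print(f"await = {await_ms:.1f} ms")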

Policy Database: Size

policy_database_size_gb

(19.3.2 and later) Informational. Size of the Policy Database data directory. Provides an indication of the disk space requirements of the Policy DB. Depending on size, reported in units of bytes, kilobytes, megabytes, gigabytes, or terabytes.

Policy Database: Disk Utilization

policy_database_utilization_percentage

(19.3.2 and later) Usage ratio of the disk partition holding the Policy DB. Consequences of the Policy DB running out of disk space can be critical.

  • Normal: < 90
  • Warning: [90 - 95]
  • Critical: >= 95

Policy Database: Transaction ID Max Age

policy_database_transaction_id_max_age

(19.3.2 and later) Maximum transaction ID (TxID) age of the Policy DB. This does not apply to the Traffic DB. Indicates the risk of the DB running out of TxIDs, which could cause a DB lockdown requiring expensive recovery procedures. The PCE attempts to detect this condition and recover automatically before it occurs (recovery requires a reboot).

  • Normal: < 1 billion
  • Warning: [1 billion - 2 billion]
  • Critical: >= 2 billion
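
Transaction ID age can be inspected on any PostgreSQL database through the standard pg_database catalog. A generic sketch (Python with psycopg2; the DSN is a placeholder, and this is not specific to the PCE schema):

    import psycopg2

    conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # age(datfrozenxid) is the standard per-database TxID age measure
        cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database;")
        for datname, txid_age in cur.fetchall():
            if txid_age >= 2_000_000_000:
                level = "Critical"
            elif txid_age >= 1_000_000_000:
                level = "Warning"
            else:
                level = "Normal"
            print(f"{datname}: TxID age {txid_age} ({level})")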

Policy Database: Vacuum Backlog

policy_database_vacuum_backlog_percentage

(19.3.2 and later) Percentage of vacuum-ready rows (also known as dead rows) over the total number of rows of the Policy database, computed over a period of up to 12 hours. This does not apply to the Traffic DB. Indicates how well the DB auto-vacuum is performing. If the percentage is persistently above the Postgres default setting of about 20% of the total number of rows, it is an indication that the auto-vacuum is not working effectively.

  • Normal: < 40
  • Warning: 40 - 80, and the current number of vacuum-ready rows is above the Postgres default minimum to trigger a vacuum (20% of total rows + 50)
  • Critical: >= 80, and the current number of vacuum-ready rows is above the Postgres default minimum to trigger a vacuum (20% of total rows + 50)
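
The "(20% of total rows + 50)" figure corresponds to PostgreSQL's default autovacuum trigger: autovacuum_vacuum_threshold (50) plus autovacuum_vacuum_scale_factor (0.2) times the table's row count. A sketch of the comparison this metric implies (Python; hypothetical helper names):

    def autovacuum_trigger(total_rows, threshold=50, scale_factor=0.2):
        """Dead-row count at which PostgreSQL's default autovacuum fires."""
        return threshold + scale_factor * total_rows

    def vacuum_backlog_severity(dead_rows, total_rows):
        backlog_pct = 100.0 * dead_rows / max(total_rows, 1)
        above_trigger = dead_rows > autovacuum_trigger(total_rows)
        if backlog_pct >= 80 and above_trigger:
            return "Critical"
        if backlog_pct >= 40 and above_trigger:
            return "Warning"
        return "Normal"

    print(vacuum_backlog_severity(dead_rows=45_000, total_rows=100_000))  # Warning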

VEN Heartbeat Performance: Latency

ven_heartbeat_average_latency_seconds,
ven_heartbeat_high_latency_seconds

(19.3.2 and later) (milliseconds)

ven_heartbeat_average_latency_seconds is the average latency over the measurement time period.

ven_heartbeat_high_latency_seconds is the 95th percentile latency over the measurement time period.

Backend processing time of VEN heartbeat requests. Does not include the time spent in the load balancer queues, as the queue time may be influenced by a number of other external factors.

The VEN heartbeat uses the same PCE services and components as the policy computation and is therefore a good overall indicator for the health of the policy subsystem, including whether system resources are being overwhelmed. Historically, it has reliably indicated I/O and/or policy cache bottleneck(s).

  • Warning: average > 500ms
  • Critical: average > 5 sec
VEN Heartbeat Performance: Success

ven_heartbeat_success_count_per_hour

(19.3.2 and later)

Active VENs send a heartbeat API request to the PCE approximately every 5 minutes. This metric captures the number of VEN heartbeat requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. If the PCE has just started up, this number is expected to ramp up over the first hour.

The number of successful VEN heartbeat requests per hour summed across all PCE core nodes should be approximately the number of VENs times 12 (heartbeats happen every 5 minutes per VEN). A low number of successful VEN heartbeats likely indicates issues with VEN connectivity or PCE performance. Depending on the VEN disconnect/offline settings, a low VEN heartbeat success rate may cause traffic to be dropped to/from enforced workloads.

  • Warning: for any non-2xx code, greater than 1% of total requests for the time window
  • Critical: for any non-2xx code, greater than 20% of total requests for the time window
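
As a back-of-the-envelope check of the expected heartbeat volume described above (Python; illustrative names and numbers):

    HEARTBEATS_PER_VEN_PER_HOUR = 60 // 5   # one heartbeat every ~5 minutes

    def heartbeat_health(success_count_per_hour, active_vens):
        """Compare observed successful heartbeats with the expected volume."""
        expected = active_vens * HEARTBEATS_PER_VEN_PER_HOUR
        ratio = success_count_per_hour / expected if expected else 0.0
        return expected, ratio

    expected, ratio = heartbeat_health(success_count_per_hour=10_800,
                                       active_vens=1_000)
    print(f"expected ~{expected}/hour, seeing {ratio:.0%} of expected")
    # expected ~12000/hour, seeing 90% of expected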

VEN Heartbeat Performance: Failure

ven_heartbeat_failure_percent,
ven_heartbeat_failure_count_per_hour

(19.3.2 and later)

  • Warning: 5%
  • Critical/Error: 20%

Policy Performance: Latency

ven_policy_average_latency_seconds,
ven_policy_high_latency_seconds

(19.3.2 and later) (milliseconds)

Average response time for policy. Latency indicates policy complexity and system load/bottlenecks. This metric captures the backend processing time of VEN policy requests. It does not include the time spent in the load balancer queues, as queue time may be influenced by a number of other external factors.

The cost to compute the VEN policy instructions depends on a large number of factors, including but not limited to the rate of change in the environment, the number of rules, the number of actors (workloads, labels, etc.) used in the rules, and the density of desired connectivity between workloads. Abnormally high VEN policy request latency may indicate issues with inadequate system resources, policy changes that result in higher than intended policy complexity, or an abnormally high rate of change to the workload context.

  • Warning: average > 10 sec
  • Critical: average > 30 sec

Policy Performance: Request Count

ven_policy_request_count_per_hour

(19.3.2 and later) (requests/hour)

When a new policy is provisioned or the workload context (IP address, label membership, etc.) is changed on the PCE, policy instructions are sent to affected VENs. This metric captures the number of VEN policy requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. When a PCE first starts or is restarted, this number may increase sharply over a short time period as every VEN checks to ensure policy sync.

The VEN policy request rate provides an indicator of the rate of policy change across the organization, and therefore, an estimate of the load on the PCE. VEN policy requests are sometimes more expensive to process than other API requests, and frequent policy changes may result in decreased overall performance and longer policy convergence times. Frequent policy changes may also be a symptom of underlying network or infrastructure issues, such as (but not limited to) frequent IP address changes or improperly cloned VENs.

  • Warning: > 1M req/hour

Collector: Flow Summaries

collector_summaries_per_second

(19.3.2 and later) Total flow summaries processing rate for a single core PCE node, over the last hour. The sum of these rates across core nodes should roughly match the flow summary ingest rate; otherwise, the PCE will show an increasing backlog size.

  • Warning: > 12,000
  • Critical: > 15,000

Collector: Success Rate

collector_post_success_count_per_hour

(19.3.2 and later) Informational. Total flow summary posts accepted by a core node over the last hour. Posts vary in size, so some take longer to process than others, but you should see roughly the same rates for each core.

If counts differ across core nodes, ensure intra-PCE latency is within the 10 ms limit.

Collector: Failure Rate

collector_post_failure_count_per_hour

(19.3.2 and later) Informational. Total flow summary post failure rate over the last hour.

Under normal operational circumstances, this value should be approximately the same for all core nodes.

Collector: Failure Percentage

collector_post_failure_percentage

(19.3.2 and later) The ratio of failed flow summary posts to total posts over the last hour.

  • Warning: > 10%
  • Critical: > 20%

Traffic Summary: Ingest

traffic_summaries_per_second,
total_traffic_summaries_per_second

(19.3.2 and later) The mean rate at which flow summaries are added to the PostgreSQL database over the last hour.

  • Warning: > 12,000
  • Critical: > 15,000

Traffic Summary: Database Size

traffic_database_size_gb,
traffic_database_size_days

(19.3.2 and later) Informational. (gigabytes; days)

Traffic Summary: Database Size: % of Allocated

traffic_database_utilization_percentage

(19.3.2 and later) Informational. The system behaves normally even when it is near or at the configured disk limit. However, the oldest flows are dropped to enforce the limit, which may not be desirable.

  • Warning: > 10%
  • Critical: > 50%

Traffic Summary: Backlog Size

traffic_backlog_size_gb

(19.3.2 and later) Amount of flow data in the backlog that is not yet in the traffic database, in gigabytes. If the backlog size exceeds a certain limit (the default is 10 GB; it can be set in the runtime environment), flows are dropped.

Traffic Summary: Backlog Size: % of Allocated

traffic_backlog_utilization_percentage

(19.3.2 and later) Increasing values indicate that the buffered new flow data is growing, meaning the PCE is unable to keep up with the rate of data posted. The PCE collector flow summary rate and the PCE traffic summary ingest rate need to be roughly equal, or this buffered backlog will grow.

  • Warning: > 80%
  • Critical: > 90%
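
A sketch of this balance condition (Python; the per-summary size is an assumed figure, not a PCE number):

    def backlog_growth_gb_per_hour(collector_rate, ingest_rate,
                                   bytes_per_summary=500):
        """Estimate hourly backlog growth from a per-second rate imbalance.

        bytes_per_summary is an assumed average record size, not a PCE figure.
        """
        surplus_per_sec = max(collector_rate - ingest_rate, 0)
        return surplus_per_sec * bytes_per_summary * 3600 / 1e9

    growth = backlog_growth_gb_per_hour(collector_rate=13_000, ingest_rate=11_000)
    print(f"backlog growing by ~{growth:.1f} GB/hour")   # ~3.6 GB/hour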

Supercluster Replication Lag

(For PCE superclusters only) Number of seconds since a replication event generated by a PCE was processed on another PCE. The supercluster replication engine relies on events to ensure data gets replicated. These are not the same as the PCE audit events.

An increasing replication lag usually indicates some issue with the PCE replication engine or network connectivity. The larger the replication lag, the longer it may take a PCE to catch up with other regions once the underlying issue is addressed.

  • Warning: Indicates that inter-PCE data replication is not working as intended; one or more PCEs may not have the data generated by one or more other PCEs. The supercluster expects that replication lag will not fall behind by a large margin. If it does, some data may be lost if the PCE that is ahead fails and is not recoverable.