PCE Health Metrics Reference
The health metrics consist of a set of key-value pairs. The following table describes the keys that can appear.
Category | Key | Description | Severity Levels |
---|---|---|---|
Disk Space | | The PCE node reports disk space for the PCE application directories (configured in the PCE runtime configuration). When all these directories are on a single mount point, the node reports a single disk space value. When multiple mount points exist, the node reports each mount point, identified by the first discovered path. When the PCE encounters an error determining this information, the node reports an error value. | Default thresholds trigger the corresponding severity levels; these defaults are configurable. A severity value is reported only when one of the thresholds is crossed; otherwise, only the disk space usage is reported. When a node has multiple disk mounts, each mount is reported separately. |
Physical Memory | | Each PCE node reports basic physical memory usage. | Default thresholds trigger the corresponding severity levels; these defaults are configurable. |
CPU Load | | Each PCE node reports CPU usage load. | Default thresholds trigger the corresponding severity levels; these defaults are configurable. |
Cluster Leader | leader | The IP address of the current leader, or unavailable when no leader exists or it is unreachable. | |
Cluster Status | cluster | The overall health of the cluster, reported by the leader only. | The reported status value determines the severity level. |
Missing Nodes | missing | The number of nodes that are missing from the cluster. If no nodes are missing, this metric is not reported. | |
Replication Lag | | The number of seconds the database replica is lagging behind the primary database. Output by database replica nodes only. | Thresholds trigger the corresponding severity level. |
Disk Latency | policy_disk_latency_milliseconds, traffic_disk_latency_milliseconds | (19.3.2 and later) Average time, in milliseconds, for I/O requests issued to the device to be served, including the time the requests spend in the queue and the time spent servicing them. The metric is calculated exactly the same way iostat calculates await (see the sketch after this table). Values: delay (milliseconds), per disk. Usefulness: indicates disk I/O performance, which is especially useful when the DB services are under heavy load. | |
Policy Database: Size | policy_database_size_gb | (19.3.2 and later) Informational. Size of the Policy Database data directory. Provides an indication of the disk space requirements of the Policy DB. Depending on size, reported in units of bytes, kilobytes, megabytes, gigabytes, or terabytes. | |
Policy Database: Disk Utilization | policy_database_utilization_percentage | (19.3.2 and later) Usage ratio of the disk partition holding the Policy DB. The consequences of the Policy DB running out of disk space can be critical. | |
Policy Database: Transaction ID Max Age | policy_database_transaction_id_max_age | (19.3.2 and later) Maximum transaction ID (TxID) age of the Policy DB. Does not apply to the Traffic DB. Indicates the risk of the DB running out of TxIDs, which could cause a DB lockdown requiring expensive recovery procedures. The PCE attempts to automatically detect this condition and recover before it occurs (requires a reboot). | |
Policy Database: Vacuum Backlog | policy_database_vacuum_backlog_percentage | (19.3.2 and later) Percentage of vacuum-ready rows (also known as dead rows) over the total number of rows of the Policy database, computed over a period of up to 12 hours. Does not apply to the Traffic DB. Indicates how well the DB auto-vacuum is performing. If the percentage is persistently above the Postgres default of about 20% of the total number of rows, the auto-vacuum is not working effectively (see the Postgres sketch after this table). | |
VEN Heartbeat Performance: Latency | ven_heartbeat_average_latency_seconds, ven_heartbeat_high_latency_seconds | (19.3.2 and later) (milliseconds) Backend processing time of VEN heartbeat requests. Does not include time spent in the load balancer queues, because queue time may be influenced by a number of other external factors. The VEN heartbeat uses the same PCE services and components as policy computation and is therefore a good overall indicator of the health of the policy subsystem, including whether system resources are being overwhelmed. Historically, it has reliably indicated I/O and/or policy cache bottlenecks. | |
VEN Heartbeat Performance: Success | | (19.3.2 and later) Active VENs send a heartbeat API request to the PCE approximately every 5 minutes. This metric captures the number of VEN heartbeat requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. If the PCE has just started up, this number is expected to ramp up over the first hour. The number of successful VEN heartbeat requests per hour, summed across all PCE core nodes, should be approximately the number of VENs times 12, because heartbeats happen every 5 minutes per VEN (see the sketch after this table). A low number of successful VEN heartbeats likely indicates issues with VEN connectivity or PCE performance. Depending on the VEN disconnect/offline settings, a low VEN heartbeat success rate may cause traffic to be dropped to or from enforced workloads. | |
VEN Heartbeat Performance: Failure | ven_heartbeat_failure_percent, ven_heartbeat_failure_count_per_hour | (19.3.2 and later) | Warning: 5%; Critical/Error: 20% |
Policy Performance: Latency | | (19.3.2 and later) (milliseconds) Average response time for policy requests. Latency indicates policy complexity and system load or bottlenecks. This metric captures the backend processing time of VEN policy requests. It does not include time spent in the load balancer queues, because queue time may be influenced by a number of other external factors. The cost to compute the VEN policy instructions depends on a large number of factors, including but not limited to the rate of change in the environment, the number of rules, the number of actors (workloads, labels, and so on) used in the rules, and the density of desired connectivity between workloads. Abnormally high VEN policy request latency may indicate inadequate system resources, policy changes that result in higher than intended policy complexity, or an abnormally high rate of change to the workload context. | |
Policy Performance: Request Count | ven_policy_request_count_per_hour | (19.3.2 and later) (requests/hour) When a new policy is provisioned or the workload context (IP address, label membership, and so on) changes on the PCE, policy instructions are sent to the affected VENs. This metric captures the number of VEN policy requests seen on the PCE in approximately the past hour. The count may be transiently inaccurate due to concurrent log rotation or other gaps in the application log files. When a PCE first starts or is restarted, this number may increase sharply over a short time period as every VEN checks that its policy is in sync. The VEN policy request rate provides an indicator of the rate of policy change across the organization, and therefore an estimate of the load on the PCE. VEN policy requests are sometimes more expensive to process than other API requests, and frequent policy changes may result in decreased overall performance and longer policy convergence times. Frequent policy changes may also be a symptom of underlying network or infrastructure issues, such as (but not limited to) frequent IP address changes or improperly cloned VENs. | |
Collector: Flow Summaries | collector_summaries_per_second | (19.3.2 and later) Total flow summary processing rate for a single core PCE node over the last hour. The sum of these rates across core nodes should roughly match the flow summary ingest rate; otherwise, the PCE will show an increasing backlog size (see the sketch after this table). | |
Collector: Success Rate | collector_post_success_count_per_hour | (19.3.2 and later) Informational. Total flow summary posts accepted by a core node over the last hour. Posts can be of different sizes and so take different amounts of time to process, but you should see roughly the same rates for each core node. If counts differ across core nodes, ensure intra-PCE latency is within the 10 ms limit. | |
Collector: Failure Rate | collector_post_failure_count_per_hour | (19.3.2 and later) Informational. Total flow summary post failures on a core node over the last hour. Under normal operating conditions, this value should be approximately the same for all core nodes. | |
Collector: Failure Percentage | collector_post_failure_percentage | (19.3.2 and later) Failed posts as a percentage of total posts over the last hour. | |
Traffic Summary: Ingest | traffic_summaries_per_second, total_traffic_summaries_per_second | (19.3.2 and later) The mean rate at which flow summaries are added to the PostgreSQL database over the last hour. | |
Traffic Summary: Database Size | traffic_database_size_gb, traffic_database_size_days | (19.3.2 and later) Informational. (gigabytes; days) | |
Traffic Summary: Database Size: % of Allocated | traffic_database_utilization_percentage | (19.3.2 and later) Informational. The system behaves normally even when it is near or at the configured disk limits. However, the oldest flows are dropped to enforce the limit, which may not be desirable. | |
Traffic Summary: Backlog Size | traffic_backlog_size_gb | (19.3.2 and later) Amount of flow data in the backlog that is not yet in the traffic database, in gigabytes. If the backlog size exceeds a certain limit (the default is 10 GB and can be set in the runtime environment), flows are dropped. | |
Traffic Summary: Backlog Size: % of Allocated | traffic_backlog_utilization_percentage | (19.3.2 and later) Increasing values indicate that the buffered new flow data is growing, meaning the PCE is unable to keep up with the rate of data posted. The PCE collector flow summary rate and the PCE traffic summary ingest rate need to be roughly equal, or this buffered backlog will grow (see the sketch after this table). | |
Supercluster Replication Lag | | (For PCE superclusters only) Number of seconds since a replication event generated by a PCE was processed on another PCE. The supercluster replication engine relies on events to ensure data gets replicated. These are not the same as the PCE audit events. An increasing replication lag usually indicates some issue with the PCE replication engine or network connectivity. The larger the replication lag, the longer it may take a PCE to catch up with other regions once the underlying issue is addressed. | |
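The Disk Latency row above notes that the metric is calculated the same way iostat calculates await. As a rough illustration only, and not the PCE's actual implementation, the following Python sketch derives an await-style value from two samples of /proc/diskstats; the device name and sampling interval are placeholder assumptions.

```python
# Rough illustration of an iostat-style "await" calculation from /proc/diskstats.
# Not the PCE's implementation; it only shows how the number is commonly derived:
# total time spent on I/O divided by the number of completed I/Os.
import time

def read_diskstats(device):
    """Return (ios_completed, ms_spent_on_io) for one block device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads_completed = int(fields[3])
                ms_reading = int(fields[6])
                writes_completed = int(fields[7])
                ms_writing = int(fields[10])
                return reads_completed + writes_completed, ms_reading + ms_writing
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

def average_await_ms(device="sda", interval_seconds=5):
    """Average milliseconds per I/O request over the sampling interval."""
    ios_before, ms_before = read_diskstats(device)
    time.sleep(interval_seconds)
    ios_after, ms_after = read_diskstats(device)
    delta_ios = ios_after - ios_before
    delta_ms = ms_after - ms_before
    return 0.0 if delta_ios == 0 else delta_ms / delta_ios

if __name__ == "__main__":
    print(f"await = {average_await_ms():.2f} ms")  # "sda" is only an example device
```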
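The VEN Heartbeat Performance rows state that VENs heartbeat roughly every 5 minutes (about 12 per hour) and that the failure percentage maps to Warning at 5% and Critical/Error at 20%. The sketch below simply restates that arithmetic; the function names and sample counts are hypothetical.

```python
# Restates the VEN heartbeat arithmetic from the table: ~12 heartbeats per VEN
# per hour, and failure-percentage thresholds of 5% (warning) and 20% (critical/error).

HEARTBEATS_PER_VEN_PER_HOUR = 12  # one heartbeat roughly every 5 minutes

def expected_heartbeats_per_hour(ven_count):
    """Approximate successful heartbeats/hour, summed across all core nodes."""
    return ven_count * HEARTBEATS_PER_VEN_PER_HOUR

def heartbeat_failure_severity(failure_percent):
    """Map ven_heartbeat_failure_percent onto the documented severity levels."""
    if failure_percent >= 20:
        return "critical/error"
    if failure_percent >= 5:
        return "warning"
    return "ok"

if __name__ == "__main__":
    # Example with made-up numbers: 4,000 VENs and a 7% failure rate.
    print(expected_heartbeats_per_hour(4000))   # -> 48000
    print(heartbeat_failure_severity(7.0))      # -> "warning"
```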
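The Policy Database: Transaction ID Max Age and Vacuum Backlog rows correspond to standard PostgreSQL statistics (transaction ID age and dead-row counts). The sketch below queries those statistics directly as an illustration of the underlying values only; it is not how the PCE computes or aggregates its metrics, and the connection string is a placeholder.

```python
# Illustration of the PostgreSQL statistics behind the TxID-age and vacuum-backlog
# metrics. Not the PCE's own collection method; the DSN is a placeholder.
import psycopg2

DSN = "dbname=example_db user=example_user"  # placeholder connection string

def main():
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            # Maximum transaction ID age across databases (wraparound risk).
            cur.execute("SELECT max(age(datfrozenxid)) FROM pg_database")
            (max_txid_age,) = cur.fetchone()
            print(f"max transaction ID age: {max_txid_age}")

            # Dead ("vacuum-ready") rows as a percentage of all rows.
            cur.execute(
                "SELECT coalesce(sum(n_dead_tup), 0), "
                "       coalesce(sum(n_live_tup), 0) "
                "FROM pg_stat_user_tables"
            )
            dead, live = cur.fetchone()
            total = dead + live
            pct = 100.0 * dead / total if total else 0.0
            # Persistently above ~20% suggests autovacuum is falling behind.
            print(f"dead-row percentage: {pct:.1f}%")

if __name__ == "__main__":
    main()
```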
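The Collector: Flow Summaries and Traffic Summary: Backlog rows describe the same balance condition: the summed per-core collector flow-summary rate should roughly match the traffic summary ingest rate, or the backlog grows. The sketch below restates that check; the rates and tolerance are made-up example values.

```python
# Restates the balance condition from the Collector / Traffic Summary rows:
# if core nodes collectively produce flow summaries faster than the traffic
# database ingests them, the backlog grows. Rates and tolerance are made up.

def backlog_is_growing(per_core_collector_rates, traffic_ingest_rate, tolerance=0.10):
    """True when collector output exceeds ingest by more than the tolerance."""
    collector_total = sum(per_core_collector_rates)  # collector_summaries_per_second per core
    return collector_total > traffic_ingest_rate * (1 + tolerance)

if __name__ == "__main__":
    # Example: three core nodes producing summaries vs. the database ingest rate.
    core_rates = [220.0, 235.0, 228.0]   # summaries/second, hypothetical
    ingest_rate = 600.0                  # traffic_summaries_per_second, hypothetical
    if backlog_is_growing(core_rates, ingest_rate):
        print("traffic backlog is likely growing")       # 683 > 660 -> prints
    else:
        print("collector output and ingest are roughly balanced")
```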