Monitor PCE Health

This section describes how to monitor the health of the PCE.

PCE Health Monitoring Techniques

You can monitor the PCE software health using the following methods:

  • PCE web console: The Health page in the PCE web console provides health information about your on-premises PCE, whether you deployed a 2x2 cluster, 4x2 cluster, or SNC.
  • REST API: The PCE Health API can be used to obtain health information.
  • Syslog: When you configure syslog with the PCE software, the PCE reports system_health messages to syslog for all nodes in the PCE cluster.
  • PCE command-line interface: Run commands to obtain health status for the entire PCE cluster and each node in the cluster.

Minimum Required Monitoring

The PCE provides several different methods you can use to monitor PCE health, as described in PCE Health Monitoring Techniques.

No matter which technique you use, there is one main signal to watch: the overall system status. Monitor it as follows:

  • If you are using the PCE web console, keep an eye on the PCE Health status near the top of the page. It indicates whether the PCE is in a Normal, Warning, or Critical state of health. For details, see Health Monitoring Using PCE Web Console.
  • If you are using the API, similarly, monitor the status field. For details, see Health Monitoring Using Health REST API.
  • If you are using the PCE syslog to monitor PCE health, watch for any messages that contain the text sev=WARN or sev=ERR. In such messages, check the other fields for details. For more details, see Health Monitoring Using Syslog.

The rest of this section provides details about the meaning of the various PCE health metrics and what to do when you see a warning or error state.

PCE Health Status Codes

The following list shows, for each level of health, the status shown in the PCE web console (or PCE Health API), the severity code shown in syslog, the corresponding color code in the PCE web console, and the most commonly encountered causes.

Normal (healthy) or sev=INFO (green)

  • All required nodes and services are running.
  • CPU usage, memory usage, and disk usage on all nodes are less than 95%, and all other metrics are below their thresholds.
  • Database replication lag is less than or equal to 30 seconds.
  • (In a PCE Supercluster only) Supercluster replication lag is less than or equal to 120 seconds.

Warning or sev=WARN (yellow)

  • One or more nodes are unreachable.
  • One or more optional services are missing, or one or more required services have been degraded.
  • The CPU usage, memory usage, or disk usage of any node is greater than or equal to 95%, or another health metric has exceeded its warning threshold. For more information, see PCE Health Metrics Reference.
  • Database replication lag is greater than 30 seconds.
  • (In a PCE Supercluster only) Supercluster replication lag is greater than 120 seconds.

Critical or sev=ERR (red)

  • One or more required services are missing.
  • A health metric has exceeded its critical/error threshold. For more information, see PCE Health Metrics Reference.

If a warning threshold has been exceeded, a warning icon appears in three places in the PCE web console: the upper right of the PCE Health dashboard, the General summary area of the dashboard, and next to the appropriate tab.

Health Monitoring Using PCE Web Console

Click the Health icon at the top of the PCE web console to see the general health of the PCE.

Tabs categorize the health information by Node, Application, Database Replication, and Supercluster.

The Node tab shows node information, including the health metric Disk Latency. It also displays a hardware requirements message for each node, indicating whether the provisioned hardware meets the requirements documented in the Capacity Planning topic. If a node has sufficient resources to meet the specifications, the message "Node Specs Meet requirements" appears with a green checkmark. If it does not, the alert "Node Specs Do not meet requirements" appears with a yellow triangle. The requirements vary depending on the type of PCE cluster (single-node, 2x2 multi-node, 4x2 multi-node, and so on), so the hardware requirements check uses the cluster_type runtime parameter, which is set on every node, to select the correct set of requirements.

The Application tab shows a variety of information, including database health metrics. (For details, see PCE Health Metrics Reference.) The tab is divided into sections:

  • Collector Summary (flow rate, success vs. failure rates)
  • Traffic Summary (ingestion, backlog, database utilization)
  • Policy Database Summary (database size, transaction ID age, vacuum backlog)
  • VEN Heartbeat (success vs. failure, latency)
  • VEN Policy (request rate, latency)

The Database Replication tab shows the database replication lag.

The Supercluster tab shows the Supercluster replication lag (applicable only in a PCE Supercluster).

PCE Health Status Indicator

The PCE web console provides an indicator that reflects overall status. Near the top of the PCE Health page, an indicator labeled PCE Health shows Normal, Warning, or Critical. You can find more details on the tab that corresponds to the issue.

Health Monitoring Using Health REST API

With the PCE Health API, you can display PCE health information using the following syntax:

GET [api_version]/health
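
For example, a minimal request with curl, assuming API version v2, a PCE at pce.mycompany.com on port 8443, and an API key and secret stored in shell variables (all of these are placeholders for your environment):

$ curl -u "$API_KEY:$API_SECRET" https://pce.mycompany.com:8443/api/v2/health

You can pipe the JSON response to a tool such as jq to extract the status field described in Minimum Required Monitoring.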

For details, see PCE Health in the REST API Developer Guide.

Health Monitoring Using Syslog

Each PCE node reports its status to the local syslog daemon once every minute. The PCE uses the program name illumio_pce/system_health for these messages.

Example Syslog Messages

Example syslog message from a non-leader PCE node:

2015-12-17T00:40:31+00:00 level=info host=ip-10-0-0-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=312831.757 sev=INFO pid=9231 tid=12334020 rid=0 leader=10.0.24.26 database_replication_lag=3.869344 cpu=2% disk=11% memory=19% 

Example syslog message from a leader PCE node for a healthy PCE cluster:

2015-12-23T22:52:59+00:00 level=info host=ip-10-0-24-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=911179.836 sev=INFO pid=5633 tid=10752960 rid=0 cluster=healthy cpu=2% disk=10% memory=37%

Example syslog message from a leader PCE node for a degraded PCE cluster with one node missing:

2015-12-23T22:56:00+00:00 level=notice host=ip-10-0-24-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=911360.719 sev=WARN pid=5633 tid=10752960 rid=0 cluster=degraded missing=1 cpu=34% disk=10% memory=23% 
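
To watch for the sev=WARN and sev=ERR messages described in Minimum Required Monitoring, you can follow the system log and filter on the program name and severity. A minimal sketch; the log file path is an assumption that depends on your syslog configuration:

$ tail -F /var/log/messages | grep -E 'illumio_pce/system_health.*sev=(WARN|ERR)'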

Health Monitoring Using PCE Command Line

This section gives several techniques you can use at the command line to monitor PCE health.

Monitor a PCE Cluster

The following command displays the status of the PCE cluster, including where each individual service is running:

$ sudo -u ilo-pce illumio-pce-ctl cluster-status

Return codes (see the scripting sketch after the example output below):

  • 0 - NOT RUNNING
  • 1 - RUNNING
  • 2 - PARTIAL (not all required services running)

For example:

$ ./illumio-pce-ctl cluster-status
 
SERVICES (runlevel: 5)          NODES (Reachable: 4 of 4)
======================          ===========================
agent_background_worker_service 10.0.26.49      10.0.6.171
agent_service                   10.0.26.49      10.0.6.171
agent_traffic_redis_cache       10.0.11.96      10.0.25.197
agent_traffic_redis_server      10.0.25.197
agent_traffic_service           10.0.26.49      10.0.26.49      10.0.6.171      10.0.6.171
auditable_events_service        10.0.26.49      10.0.6.171
collector_service               10.0.26.49      10.0.26.49      10.0.6.171      10.0.6.171
database_monitor                10.0.11.96      10.0.25.197
database_service                10.0.25.197
database_slave_service          10.0.11.96
ev_service                      10.0.26.49      10.0.6.171
executor_service                10.0.26.49      10.0.6.171
fileserver_service              10.0.25.197
fluentd_source_service          10.0.26.49      10.0.6.171
login_service                   10.0.26.49      10.0.6.171
memcached                       10.0.26.49      10.0.6.171
node_monitor                    10.0.11.96      10.0.25.197     10.0.26.49      10.0.6.171
pg_listener_service             10.0.11.96
search_index_service            10.0.26.49      10.0.6.171
server_load_balancer            10.0.26.49      10.0.6.171
service_discovery_agent         10.0.25.197
service_discovery_server        10.0.11.96      10.0.26.49      10.0.6.171
set_server_redis_server         10.0.11.96
traffic_worker_service          10.0.26.49      10.0.6.171
web_server                      10.0.26.49      10.0.6.171
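
Because cluster-status returns distinct exit codes, you can call it from a monitoring script. A minimal sketch, assuming the standard ilo-pce installation user:

sudo -u ilo-pce illumio-pce-ctl cluster-status > /dev/null
case $? in
  0) echo "ALERT: PCE cluster is not running" ;;
  1) echo "PCE cluster is running" ;;
  2) echo "ALERT: not all required PCE services are running" ;;
esac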

This command displays the members of the PCE cluster:

$ sudo -u ilo-pce illumio-pce-ctl cluster-members

For example:

[illumio@core0 illumio-pce]$ ./illumio-pce-ctl cluster-members
Reading /var/illumio-pce/data/runtime_env.yml.
Node                 Address         Status  Type    
core0.mycompany.com  10.6.1.19:8301  alive   server
data0.mycompany.com  10.6.1.20:8301  alive   server
core1.mycompany.com  10.6.1.32:8301  alive   server
data1.mycompany.com  10.6.1.31:8301  alive   client

Monitor Database Replication

On either data node, run the following command to display the status of replication between the primary database and replica:

$ sudo -u ilo-pce illumio-pce-db-management show-replication-info

The PCE updates this information every two minutes.

IMPORTANT:

To prevent data loss during a database failover operation, monitor the PCE databases for excessive database replication lag.

For example:

$ ./illumio-pce-db-management show-replication-info
Reading /var/illumio/data/runtime_env.yml.
INSTALL_ROOT=/var/illumio/software
RENV=development
 
Current Time: 2016-02-16 22:42:03 UTC
 
Master: (10.6.1.73)
Last Sampling Time : 2016-02-16 22:41:14 UTC
Transaction Log location : 0/41881E8
 
Slave(s):
IP Address: 10.6.1.72
Last Sampling Time : 2016-02-16 22:41:16 UTC
Streaming : true
Receive Log Location : 0/41881E8
Replay Log Location : 0/4099048
Receive Lag (bytes) : 0
Replay Lag (bytes) : 979360
Transaction Lag (secs) : 4.633377
Last Transaction Replayed Time: 2016-02-16 22:37:12.920179 UTC
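
To automate the lag check recommended above, you can extract the transaction lag from this output. A minimal sketch that assumes the output format shown in the example:

$ sudo -u ilo-pce illumio-pce-db-management show-replication-info | awk -F':' '/Transaction Lag/ {print $2}'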

PCE Health Troubleshooting

This section describes the actions to take when you see a non-normal status while monitoring PCE health. The recommended response depends on which metric has departed from the Normal state. If you cannot diagnose and fix the problem yourself, contact Illumio Support.

The health metrics appear in the PCE web console, in the API response status field, and in the syslog severity field. When multiple conditions result in differing levels of severity, the most critical level is reported. If you receive a non-normal level for any of the following metrics, take the suggested action. For additional details, see PCE Health Metrics Reference.

Disk Latency

Warning/Critical: High disk latency on data nodes indicates that the database and traffic services should be investigated for possible performance issues. Higher disk latency typically indicates a disk I/O bottleneck.

CPU

When the PCE is under heavy load, CPU usage increases, and the Warning status is reported. Typically, the load decreases without intervention in less than 20 minutes. If the Warning condition persists for 30 minutes or more, decrease the load on the CPU or increase capacity.

Memory

When the PCE is under heavy load, memory usage increases, and the Warning status is reported. Typically, the load decreases without intervention in less than 20 minutes. If the Warning condition persists for 30 minutes or more, increase the available memory.

Disk Space

The PCE manages disk space using log rotation, which is usually sufficient to address a Warning condition. If the Warning level persists for more than one day and the amount of disk space consumed keeps increasing, notify Illumio Support.

Policy Database Summary

  • disk_usage (database disk utilization):

    Warning: Plan to increase the capacity of the disk partition holding the Policy DB or make more room by deleting unnecessary data as soon as possible.

    Critical: Immediately increase the capacity of the disk partition holding the Policy DB or make more room by deleting unnecessary data.

  • txid_max_age (transaction ID maximum age):

    Warning: Contact Illumio Support and plan a manual full vacuum as soon as possible.

    Critical: Immediately contact Illumio Support.

  • vacuum_backlog (vacuum backlog):

    Warning, Critical: If the situation persists, contact Illumio Support so that the reason for the underperformance of the auto-vacuum can be investigated.

VEN heartbeat performance

  • avg_latency, hi_latency (latency):

    If the VEN heartbeat latency is high, examine the application logs on core nodes and system resource utilization across the entire PCE cluster. IOPS-related issues may often be diagnosed by examining database logs and observing long wait times for committing database transactions to disk.

  • rate, result (response stats):

    Warning/Critical: Examine the application logs on core nodes for more information about the precise cause of the failure.

Policy performance

  • avg_latency, hi_latency (latency):

    If latency is abnormally high, investigate the cause. For example, examine the logs to try to find out why the policy is changing.

  • rate (request count):

    If abnormally large, investigate the cause (see latency). The default threshold is conservative by design. Each organization has its own expected rate of change of VEN policy, so there is no universal correct warning threshold. You can modify the threshold to better match expectations (see Configurable Thresholds for Health Metrics). If the number of VEN policy requests is too high, examine application logs to find the reasons for the policy changes, and determine whether the policy changes are expected.

Collector summary

  • Flow summaries rate, node:

    A 4x2 PCE cluster is configured to handle approximately 10,000 flow summaries per second by default. If fewer posts are reported and you see a large number of failed posts, the collector count can be increased with help from Illumio Support.

  • Success rate, node:

    This metric is informational. However, if counts differ across core machines, ensure intra-PCE latency is within the 10ms limit.

  • Failure percentage ratio, node:

    On startup, or when connections are reestablished, VEN post rates can overwhelm the PCE, causing it to reject posts. This is normal unless it persists. If this ratio is consistently large (for example, 0.1), VENs may not be able to upload flow data, and they will start dropping flow data after 24 hours. The solution is usually to add more collectors.

Traffic summary

  • Ingest rate, node:

    A 4x2 PCE cluster is configured to handle approximately 10,000 flows per second by default. If this rate is exceeded, and a backlog begins to grow, the PCE will eventually prune the backlog and lose data. Adding additional flow_analytics daemons will distribute the work, but eventually PostgreSQL itself could become the bottleneck, requiring the use of DX.

  • Backlog size, node:

    If the size of the backlog increases continuously, this indicates performance issues with the flow analytics service, which processes the flows in the backlog. Contact Illumio Support if the backlog exceeds the safe threshold.

  • Backlog size percentage:

    Increasing values indicate that the buffered new flow data is growing, meaning the PCE is unable to keep up with the rate of data posted. The PCE collector flow summary rate and the PCE traffic summary ingest rate need to be roughly equal, or this buffered backlog will grow.

Database Replication Lag

Warning: Check whether the PCE is running properly, and verify that there is no network issue between the nodes. If the replication lag keeps increasing, contact Illumio Support.

Supercluster Replication Lag

Warning: Check whether all PCEs are running properly, and verify that there is no network issue between the lagging PCEs. If the replication lag keeps increasing, contact Illumio Support.

Configurable Thresholds for Health Metrics

You can configure the thresholds that define the normal (green), warning (yellow), and critical (red) status for each health metric. Use the command illumio-pce-env metrics --write to adjust these thresholds; the command can modify any Boolean, number, float, or string value, or an array of these types (nested arrays are not supported). For example:

illumio-pce-env metrics --write CollectorHealth:failure_warning_percent=15.0

After setting the desired threshold values, copy /var/lib/illumio-pce/data/illumio/metrics.conf to every node in the cluster to ensure consistent application of the thresholds.
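
For example, a minimal sketch for distributing the file; the node names are placeholders for the other nodes in your cluster:

for node in core1.mycompany.com data0.mycompany.com data1.mycompany.com; do
  scp /var/lib/illumio-pce/data/illumio/metrics.conf $node:/var/lib/illumio-pce/data/illumio/
done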

Examples of when you might want to use this feature:

  • At a larger installation, the memory threshold is set to 80%, but memory usage routinely spikes to 95%. Every time memory utilization exceeds the threshold, the PCE Health page displays a warning. By configuring a higher threshold, you can reduce the frequency of warnings.
  • Database replication lag can exceed a threshold for a brief time, raising a warning, but the system will catch up with replication after some time. To reduce these warnings, you can configure a longer time period for database replication lag to be tolerated. Note: This is not the same as configuring the threshold of the replication lag itself, but the permissible period of time for the lag to be non-zero.
  • The default thresholds might be acceptable when the PCE is first installed, but as more VENs are paired to the PCE over time, the default thresholds might need adjustment.

To set health metrics thresholds:

  1. Run the following command to get a list of the available metrics, their current settings, and the thresholds you can modify:

    illumio-pce-env metrics --list

    Example output:

    Engine                      Param                       Value    Default
    CollectorHealth             failure_warning_percent              10.0
                                failure_critical_percent             20.0
                                summary_warning_rate                 12000
                                summary_critical_rate                15000
    DiskLatencyMetric
    FlowAnalyticsHealth         backlog_warning_percent              10.0
                                backlog_critical_percent             50.0
                                summary_warning_rate                 12000
                                summary_critical_rate                15000
    PolicyDBDiskHealthMetric
    PolicyDBTxidHealthMetric
    PolicyDBVacuumHealthMetric
    PolicyHealth
    TrafficDBMigrateProgress

    If nothing appears in the Param column for a given metric, you can't modify the thresholds for that metric. This example output shows that the Collector Health metric has four thresholds you can modify.

  2. Run the following command:

    illumio-pce-env metrics --write MetricName:threshold_name=value

    For MetricName, threshold_name, and value, substitute the desired values. For example:

    illumio-pce-env metrics --write CollectorHealth:failure_warning_percent=15.0

    NOTE: Do not insert any space characters around the equals sign (=).

  3. Copy /var/lib/illumio-pce/data/illumio/metrics.conf to every node in the cluster. The path to metrics.conf might be different if you have customized persistent_data_root in runtime_env.yml.

  4. Restart the PCE.

  5. Verify that the PCE loaded and applied the new configuration. When a metrics configuration is detected, the PCE loads and applies it, and in ilo_node_monitor.log you should see a message like "Loaded metric configuration for MetricName."
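
    For example, a quick way to confirm this at the command line; the log file location is an assumption that depends on your logging configuration:

    grep 'Loaded metric configuration' /var/log/illumio-pce/ilo_node_monitor.log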

The metrics command provides other options as well. This section discussed only the most useful ones. For complete information, run the command with the -h option to see the help text:

illumio-pce-env metrics -h