Monitor PCE Health
This section describes how to monitor the health of the PCE.
PCE Health Monitoring Techniques
You can monitor the PCE software health using the following methods:
- PCE web console: The Health page in the PCE web console provides health information about your on-premises PCE, whether you deployed a 2x2 cluster, 4x2 cluster, or SNC.
- REST API: The PCE Health API can be used to obtain health information.
- Syslog: When you configure syslog with the PCE software, the PCE reports system_health messages to syslog for all nodes in the PCE cluster.
- PCE command-line interface: Run commands to obtain health status for the entire PCE cluster and each node in the cluster.
- Database health check tool: (Illumio Core 19.3.0 and 19.3.1 only) Run the dbcheck command-line tool to obtain database health status on each data node in the cluster.
Minimum Required Monitoring
The PCE provides several different methods you can use to monitor PCE health, as described in PCE Health Monitoring Techniques.
No matter which technique you use, the one signal you must always watch is the overall system status. Monitor it as follows:
- If you are using the PCE web console, keep an eye on the PCE Health status near the top of the page. It indicates whether the PCE is in a Normal, Warning, or Critical state of health. For details, see Health Monitoring Using PCE Web Console.
- If you are using the API, similarly, monitor the status field. For details, see Health Monitoring Using Health REST API.
- If you are using the PCE syslog to monitor PCE health, watch for any messages that contain the text sev=WARN or sev=ERR. In such messages, check the other fields for details. For more details, see Health Monitoring Using Syslog.
The rest of this section provides details about the meaning of the various PCE health metrics and what to do if a warning or error state is seen.
PCE Health Status Codes
The following table lists the status shown in the PCE web console (or PCE Health API), the severity code shown in syslog, the corresponding color code in the PCE web console, and the most commonly encountered causes for each level of health.
Status/Severity | Color | Typical Meaning |
---|---|---|
Normal (healthy) or sev=INFO | Green | |
Warning or sev=WARN | Yellow | |
Critical or sev=ERR | Red | |
In Illumio Core 19.3.2 and later, if a warning threshold has been exceeded, a warning icon appears in three places in the web console UI: the upper right of the PCE Health dashboard, the General summary area of the dashboard, and next to the appropriate tab.
Health Monitoring Using PCE Web Console
Click the Health icon at the top of the PCE web console to see the general health of the PCE.
In Illumio Core 19.3.2 and later:
Tabs categorize the health information by Node, Application, Database Replication, and Supercluster.
The Node tab shows node information, including the health metric Disk Latency.
The Application tab shows a variety of information, including database health metrics. (For details, see PCE Health Metrics Reference.) The tab is divided into sections:
- Collector Summary (flow rate, success vs. failure rates)
- Traffic Summary (ingestion, backlog, database utilization)
- Policy Database Summary (database size, transaction ID age, vacuum backlog)
- VEN Heartbeat (success vs. failure, latency)
- VEN Policy (request rate, latency)
The Database Replication tab shows the database replication lag.
The Supercluster tab shows the supercluster replication lag.
In Illumio Core 19.3.0 and 19.3.1:
- PCE Health: Displays general PCE cluster health information, such as the PCE name, its runlevel, and overall PCE health (normal, warning, or error).
- PCE Node Health: For each node in the cluster, displays the node's hostname, IP address, runlevel, and whether the PCE software is running properly. Displays the node type (core or data) and which data nodes are the database replica and master. Displays the replication delay for the database replica.
- Database Replication Lag: Displays which of your data nodes is the master and which is the replica. Displays the amount of latency between the database replica and database master nodes for both the policy and traffic databases. The PCE continually replicates both its policy and traffic databases.
- PCE Service Alert: Displays the number of degraded or failed services in the cluster with a detailed view to see where the service failures have occurred; namely, services that have been degraded or are no longer running on specific nodes.
PCE Health Status Indicator
The PCE web console provides an indicator that reflects overall status. Near the top of the PCE Health page in the PCE web console, a status indicator labeled PCE Health shows normal, warning, or critical. You can find more details on the tab that corresponds to the issue.
Health Monitoring Using Health REST API
With the Health Check API, you can display PCE health information using the following syntax:
GET [api_version]/health
For details, see PCE Health in the REST API Developer Guide.
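As a minimal sketch of monitoring the status field, the helper functions below read a Health API response body on standard input and report whether every status is normal. They assume the response is JSON and that status values are plain strings; the function names are hypothetical, and the simple text matching also picks up any nested "status" keys. Fetching the response is left out because the host, port, API version, and credentials are specific to each deployment.

```shell
# Print each "status" value found in a JSON document read from stdin.
# Assumption: the response is JSON and status values are plain strings.
health_statuses() {
    tr ',{}' '\n\n\n' |
        sed -n 's/^[[:space:]]*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed only if at least one status was found and all of them are "normal"
# (compared case-insensitively).
health_is_normal() {
    statuses=$(health_statuses)
    [ -n "$statuses" ] && ! printf '%s\n' "$statuses" | grep -qiv '^normal$'
}
```

A monitoring job could pipe the response body from any HTTP client into health_is_normal and raise an alert when the function fails.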
Health Monitoring Using Syslog
Each PCE node reports its status to the local syslog daemon once every minute. The PCE uses the program name illumio_pce/system_health for these messages.
Example Syslog Messages
Example syslog message from a non-leader PCE node:
2015-12-17T00:40:31+00:00 level=info host=ip-10-0-0-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=312831.757 sev=INFO pid=9231 tid=12334020 rid=0 leader=10.0.24.26 database_replication_lag=3.869344 cpu=2% disk=11% memory=19%
Example syslog message from a leader PCE node for a healthy PCE cluster:
2015-12-23T22:52:59+00:00 level=info host=ip-10-0-24-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=911179.836 sev=INFO pid=5633 tid=10752960 rid=0 cluster=healthy cpu=2% disk=10% memory=37%
Example syslog message from a leader PCE node for a degraded PCE cluster with one node missing:
2015-12-23T22:56:00+00:00 level=notice host=ip-10-0-24-26 ip=127.0.0.1 program=illumio_pce/system_health| sec=911360.719 sev=WARN pid=5633 tid=10752960 rid=0 cluster=degraded missing=1 cpu=34% disk=10% memory=23%
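The following sketch shows one way to pick out the messages that need attention, based on the message format in the examples above. The function names are hypothetical, and where the syslog daemon writes these messages depends on your syslog configuration.

```shell
# Print only system_health messages whose severity is WARN or ERR.
# Reads syslog lines on stdin.
filter_unhealthy() {
    grep 'program=illumio_pce/system_health' |
        grep -E '(^|[[:space:]])sev=(WARN|ERR)([[:space:]]|$)'
}

# Print the value of one key=value field from a single message line.
# $1 = field name (for example sev, cluster, missing), $2 = message line.
field_value() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}
```

For example, piping a log file through filter_unhealthy and then calling field_value with cluster or missing on each matching line gives the details to check.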
Health Monitoring Using PCE Command Line
This section gives several techniques you can use at the command line to monitor PCE health.
Monitor a PCE Cluster
The following command displays the status of the PCE cluster, including where each individual service is running:
$ sudo -u ilo-pce illumio-pce-ctl cluster-status
Return codes:
- 0 - NOT RUNNING
- 1 - RUNNING
- 2 - PARTIAL (not all required services running)
For example:
$ ./illumio-pce-ctl cluster-status
SERVICES (runlevel: 5) NODES (Reachable: 4 of 4)
====================== ===========================
agent_background_worker_service 10.0.26.49 10.0.6.171
agent_service 10.0.26.49 10.0.6.171
agent_traffic_redis_cache 10.0.11.96 10.0.25.197
agent_traffic_redis_server 10.0.25.197
agent_traffic_service 10.0.26.49 10.0.26.49 10.0.6.171 10.0.6.171
auditable_events_service 10.0.26.49 10.0.6.171
collector_service 10.0.26.49 10.0.26.49 10.0.6.171 10.0.6.171
database_monitor 10.0.11.96 10.0.25.197
database_service 10.0.25.197
database_slave_service 10.0.11.96
ev_service 10.0.26.49 10.0.6.171
executor_service 10.0.26.49 10.0.6.171
fileserver_service 10.0.25.197
fluentd_source_service 10.0.26.49 10.0.6.171
login_service 10.0.26.49 10.0.6.171
memcached 10.0.26.49 10.0.6.171
node_monitor 10.0.11.96 10.0.25.197 10.0.26.49 10.0.6.171
pg_listener_service 10.0.11.96
search_index_service 10.0.26.49 10.0.6.171
server_load_balancer 10.0.26.49 10.0.6.171
service_discovery_agent 10.0.25.197
service_discovery_server 10.0.11.96 10.0.26.49 10.0.6.171
set_server_redis_server 10.0.11.96
traffic_worker_service 10.0.26.49 10.0.6.171
web_server 10.0.26.49 10.0.6.171
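A monitoring script can act on the return code and on the node count in the header line. The sketch below assumes the return codes listed above and the "NODES (Reachable: X of Y)" header format shown in the example; the function names are hypothetical.

```shell
# Translate a cluster-status return code into the label listed above.
describe_cluster_rc() {
    case "$1" in
        0) echo "NOT RUNNING" ;;
        1) echo "RUNNING" ;;
        2) echo "PARTIAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Succeed only if the header reports that all nodes are reachable.
# Reads cluster-status output on stdin.
all_nodes_reachable() {
    counts=$(sed -n 's/.*(Reachable: \([0-9][0-9]*\) of \([0-9][0-9]*\)).*/\1 \2/p' | head -n 1)
    [ -n "$counts" ] && [ "${counts% *}" = "${counts#* }" ]
}
```

To use these, save the command output to a file, capture $? immediately after the command, pass the code to describe_cluster_rc, and feed the file to all_nodes_reachable.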
This command displays the members of the PCE cluster:
$ sudo -u ilo-pce illumio-pce-ctl cluster-members
For example:
[illumio@core0 illumio-pce]$ ./illumio-pce-ctl cluster-members
Reading /var/illumio-pce/data/runtime_env.yml.
Node Address Status Type
core0.mycompany.com 10.6.1.19:8301 alive server
data0.mycompany.com 10.6.1.20:8301 alive server
core1.mycompany.com 10.6.1.32:8301 alive server
data1.mycompany.com 10.6.1.31:8301 alive client
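To flag members that have dropped out, a script can scan the Status column. This sketch assumes the four-column layout shown above and treats any status other than alive as a problem; the function name is hypothetical.

```shell
# Print the name of every node whose Status column is not "alive".
# Reads cluster-members output on stdin. Only rows whose second column
# looks like address:port are considered, which skips the header lines.
nodes_not_alive() {
    awk 'NF == 4 && $2 ~ /:[0-9]+$/ && $3 != "alive" { print $1 }'
}
```

An empty result means every listed member is alive.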
Monitor Database Replication
On either data node, run the following command to display the status of replication between the master database and replica:
$ sudo -u ilo-pce illumio-pce-db-management show-replication-info
The PCE updates this information every two minutes.
To prevent data loss during a database failover operation, monitor the PCE databases for excessive database replication lag.
For example:
$ ./illumio-pce-db-management show-replication-info
Reading /var/illumio/data/runtime_env.yml.
INSTALL_ROOT=/var/illumio/software
RENV=development
Current Time: 2016-02-16 22:42:03 UTC
Master: (10.6.1.73)
Last Sampling Time : 2016-02-16 22:41:14 UTC
Transaction Log location : 0/41881E8
Slave(s):
IP Address: 10.6.1.72
Last Sampling Time : 2016-02-16 22:41:16 UTC
Streaming : true
Receive Log Location : 0/41881E8
Replay Log Location : 0/4099048
Receive Lag (bytes) : 0
Replay Lag (bytes) : 979360
Transaction Lag (secs) : 4.633377
Last Transaction Replayed Time: 2016-02-16 22:37:12.920179 UTC
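To watch for excessive lag automatically, a script can extract the Transaction Lag value from this output and compare it with a limit. The sketch below assumes the output format shown above; the function names are hypothetical, and the limit you pass is your own choice, not a value recommended by this guide.

```shell
# Print the "Transaction Lag (secs)" value from show-replication-info output.
# Reads the command output on stdin.
transaction_lag_secs() {
    awk -F':' '/Transaction Lag \(secs\)/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Succeed if the reported lag is at or below the limit in seconds ($1).
# Fails if no lag value can be found in the input.
lag_within() {
    lag=$(transaction_lag_secs)
    [ -n "$lag" ] && awk -v lag="$lag" -v max="$1" 'BEGIN { exit !(lag + 0 <= max + 0) }'
}
```

Because the PCE updates this information every two minutes, checking more often than that is unlikely to yield new values.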
Database Health Check Tool
(Illumio Core 19.3.0 and 19.3.1 only. In 19.3.2, the functionality of this tool is fully integrated into the PCE, so the dbcheck tool is no longer needed. In 19.3.1, the dbcheck tool ships with the PCE. For earlier versions, it is available as a download from the Illumio Knowledge Base.)
The command-line tool dbcheck checks the health of PCE databases and indicates whether database maintenance is needed. The tool returns either "The PCE databases appear to be healthy" or "The PCE might need maintenance. Please contact Illumio support."
The tool checks for the following signs of database health:
- All required databases are correctly operating.
- Databases are not at risk of transaction ID starvation.
- Databases are not falling behind in background tasks, such as vacuum.
- The disk usage of databases is within safe usage thresholds.
Run dbcheck on the data nodes of a PCE. Run the tool on both PCE data nodes, since the primary and traffic databases could be running on either node. The command syntax is:
$ sudo -u ilo-pce [ILLUMIO_RUNTIME_ENV=runtime_env_location] dbcheck [--verbose] [--output outFile]
- ILLUMIO_RUNTIME_ENV: If the Illumio Runtime Environment file is in a custom location, specify the custom location.
- --verbose: Returns additional information, including PCE version, runlevel, node type, PCE FQDN, and disk statistics.
- --output: Sends the command output to the specified file.
You can run the tool from the command line. You can also schedule the tool to run periodically using cron and send the command output to an administrator or a system monitoring tool.
The tool returns the standard Linux exit codes: 0 is success and 1 is failure.
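For periodic runs, a small wrapper can capture the tool's output and turn the exit code into a one-line result that a monitoring system can consume. This is a sketch only: the function name is hypothetical, and it assumes that a non-zero exit code means the check needs attention.

```shell
# Run a health-check command and report when it exits non-zero.
# $1 = file that receives the command output; remaining args = the command.
run_health_check() {
    out=$1
    shift
    if "$@" >"$out" 2>&1; then
        echo "OK"
    else
        echo "CHECK FAILED (see $out)"
        return 1
    fi
}
```

A scheduled script could call the wrapper with an output file of your choosing followed by the dbcheck command line shown above, and then mail or forward the result when the function fails.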
PCE Health Troubleshooting
This section describes what action to take if you see a non-normal status when monitoring PCE health. The recommended response depends on which metric has departed from the Normal state. If you are not able to diagnose and fix the problem yourself, contact Illumio Support.
The health status can appear in the PCE web console, the API response status field, or the syslog severity field. When multiple conditions result in differing levels of severity, the more critical level is reported. If you receive a non-normal level for any of the following metrics, take the suggested actions. For additional details, see PCE Health Metrics Reference.
Name | Troubleshoot |
---|---|
Disk Latency | Warning/Critical: Disk latency on data nodes is an indication that DB/Traffic service needs to be investigated further for possible performance issues. Typically higher disk latency numbers indicate Disk I/O bottlenecks. |
CPU | When the PCE is under heavy load, CPU usage increases, and the Warning status is reported. Typically, the load should decrease without intervention in less than 20 minutes. If the Warning condition persists for 30 minutes or more, decrease the load on the CPU or increase capacity. |
Memory | When the PCE is under heavy load, memory usage increases, and the Warning status is reported. Typically, the load should decrease without intervention in less than 20 minutes. If the Warning condition persists for 30 minutes or more, increase the available memory. |
Disk Space | The PCE manages disk space using log rotation, and this is usually sufficient to address any Warning condition. If the Warning level persists for more than one day, and the amount of disk space consumed keeps increasing, notify Illumio Support. |
Policy Database Summary | |
VEN heartbeat performance | |
Policy performance | |
Collector summary | |
Traffic summary | |
Database Replication Lag | Warning: Check whether the PCE is running properly and verify that there is no network issue between the nodes. If the replication lag keeps increasing, contact Illumio Support. |
Supercluster Replication Lag | Warning: Check whether all PCEs are running properly and verify that there is no network issue between the lagging PCEs. If the replication lag keeps increasing, contact Illumio Support. |
Configurable Thresholds for Health Metrics
(For Illumio Core 19.3.2 and later)
You can configure the thresholds that define the normal, warning, and critical status for each health metric. Each health metric has predefined thresholds for normal (green), warning (yellow), and critical (red). You can use the command illumio-pce-env metrics --write to adjust these thresholds. This command can modify any Boolean, number, float, or string value, or an array of these types (no nested arrays). For example:
illumio-pce-env metrics --write CollectorHealth:failure_warning_percent=15.0
After setting the desired threshold values, copy /var/lib/illumio-pce/data/illumio/metrics.conf to every node in the cluster to ensure consistent application of the thresholds.
Examples of when you might want to use this feature:
- At a larger installation, the default memory threshold is set to 80%, but memory usage routinely spikes to 95%. Every time the memory utilization exceeds the threshold, the PCE Health page displays a warning. By configuring a higher threshold, you can reduce the frequency of warnings.
- Database replication lag can exceed a threshold for a brief time, raising a warning, but the system will catch up with replication after some time. To reduce these warnings, you can configure a longer time period for database replication lag to be tolerated. Note: This is not the same as configuring the threshold of the replication lag itself, but the permissible period of time for the lag to be non-zero.
- The default thresholds might be acceptable at first, but as more VENs are paired to the PCE over time, the default thresholds might need adjustment.
To set health metrics thresholds:
- Run the following command to get a list of the available metrics, their current settings, and the thresholds you can modify:
  illumio-pce-env metrics --list
  Example output:
  Engine Param Value Default
  CollectorHealth failure_warning_percent 10.0
    failure_critical_percent 20.0
    summary_warning_rate 12000
    summary_critical_rate 15000
  DiskLatencyMetric
  FlowAnalyticsHealth backlog_warning_percent 10.0
    backlog_critical_percent 50.0
    summary_warning_rate 12000
    summary_critical_rate 15000
  PolicyDBDiskHealthMetric
  PolicyDBTxidHealthMetric
  PolicyDBVacuumHealthMetric
  PolicyHealth
  TrafficDBMigrateProgress
  If nothing appears in the Param column for a given metric, you can't modify the thresholds for that metric. This output shows that the Collector Health metric has four thresholds you can modify.
- Run the following command:
  illumio-pce-env metrics --write MetricName:threshold_name=value
  For MetricName, threshold_name, and value, substitute the desired values. For example:
  illumio-pce-env metrics --write CollectorHealth:failure_warning_percent=15.0
  NOTE: Do not insert any space characters around the equals sign (=).
- Copy /var/lib/illumio-pce/data/illumio/metrics.conf to every node in the cluster. (The path to metrics.conf might be different if you have customized persistent_data_root in runtime_env.yml.)
- Restart the PCE.
- When a metrics configuration is detected, the PCE loads and applies it. In ilo_node_monitor.log, you should see a message like "Loaded metric configuration for MetricName."
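Because a stray space around the equals sign is easy to introduce when the --write argument is built in a script, a quick pre-check can help. The sketch below is a hypothetical helper, not part of the PCE tooling; it only checks the general MetricName:threshold_name=value shape and does not verify that the metric or threshold exists.

```shell
# Succeed if $1 has the shape MetricName:threshold_name=value
# and contains no space characters.
valid_threshold_arg() {
    case "$1" in
        *" "*) return 1 ;;      # reject any space, including around "="
        ?*:?*=?*) return 0 ;;   # non-empty metric, threshold name, and value
        *) return 1 ;;
    esac
}
```

A script could call valid_threshold_arg on the argument first and run illumio-pce-env metrics --write only when the check succeeds.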
The metrics command provides other options as well; this section has covered only the most useful ones. For complete information, run the command with the -h option to see the help text:
illumio-pce-env metrics -h