PCE Health

The Public Experimental Health Check API displays health information about a 4X2 Supercluster or a PCE virtual appliance.

NOTE:

This API is only available for Illumio Core PCE installed on-premises and is not available for Illumio Cloud customers.

About the PCE Health API

With this API, you can see the following health information:

- How long the PCE has been running, its runlevel, and its overall health (normal, warning, or error).
- Each node's hostname, IP address, uptime, runlevel, and whether the PCE software is running properly.
- Each node's type (core or data), which data node is the database master, and which is the database slave. The replication delay for the database slave is also displayed.
- Information about PCE service alerts, such as the number of degraded or failed services in the cluster, so you can see where service failures have occurred.

The new health API schema (health_definitions_schema.json) is designed to be consumed by both the UI and API end users. Metrics are listed in two sections: one for each individual node and one for general metrics.

Health Metrics

Application-level metrics have been added to the PCE Health API to allow proactive monitoring and to provide insight into system performance. These metrics cover all core PCE subsystems: the core application (VEN heartbeats, policy), the PCE platform (database health and disk latency), and data (traffic pipeline and storage).

You can monitor the PCE health and performance by looking at nodes, clusters, database replication, and other services. Monitoring can be performed in different ways, such as using the Health page in the PCE's web console UI, messages in the PCE syslog, and the Illumio REST API.

While the periodic syslog messages can be used for historical (time-series) monitoring, the API uses predefined, customizable thresholds to flag metric values as warning or critical.

The following metric properties are used by the Health API:

- metric (name, value, and units)
  - An example of a metric: { metric: "Disk Usage", last_updated: "2020-03-12T08:46:25-07:00", entries: [...] }
  - An example of values: [{ status: "normal", name: "usage", type: "percent", value: 12 }, { name: "disk", value: "persistent" }]
- last_updated timestamp (not available in the UI)
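
As a rough illustration of how a client might read these properties, the following Python sketch filters a list of value objects shaped like the example above and keeps only those whose status is not normal. The function name `flagged_values` and the second sample list are hypothetical; only the key names (`status`, `name`, `type`, `value`) come from the example.

```python
def flagged_values(values):
    """Return the value objects that carry a status other than "normal".

    Value objects without a "status" key (such as the disk name) are skipped.
    """
    return [v for v in values if v.get("status") not in (None, "normal")]


# Values copied from the example above.
values = [
    {"status": "normal", "name": "usage", "type": "percent", "value": 12},
    {"name": "disk", "value": "persistent"},
]

# Hypothetical values, invented here to show a flagged result.
warning_values = [
    {"status": "warning", "name": "usage", "type": "percent", "value": 96},
    {"name": "disk", "value": "persistent"},
]

print(flagged_values(values))
print([v["name"] for v in flagged_values(warning_values)])
```

How value objects are nested inside `entries` is not shown here, so the sketch works on a values list directly instead of guessing at that structure.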

If you want to enable or disable individual metrics, use the CLI commands described in Configure Thresholds for Health Metrics. You can also use the same threshold configuration to turn off all metrics.

Health metric schema

The existing UI schema is extended to allow generic metrics in two sections: the node section and general section.

The overall metric schema may look like this:

            "metric": {
                "description": "One or more entries encompassing the metric.",
                "type": "object",
                "required": [
                    "metric"
                ],
                "properties": {
                    "metric": {
                        "type": "string"
                    },
                    "entries": {
                        "type": "array",
                        "items": {
                            "anyOf": [
                                {
                                    "$ref": "#/definitions/cluster"
                                },
                                {
                                    "$ref": "#/definitions/entry"
                                }
                            ]
                        }
                    },
                    "last_updated": {
                        "type": "string",
                        "format": "date-time"
                    },
                    "display": {
                        "description": "An optional hint for the UI to display the data in a specific form.",
                        "type": "string",
                        "enum": [
                            "table",
                            "join"
                        ]
                    }
                }
            }

NOTE: The optional "display" field is not used for the node metrics.
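
To make the constraints in the schema above concrete, here is a minimal Python sketch that checks a metric object by hand. It is not a full JSON Schema validator: it does not resolve the `cluster` and `entry` definitions or verify the date-time format, and the function name `check_metric` is hypothetical.

```python
ALLOWED_DISPLAY = {"table", "join"}


def check_metric(obj):
    """Return a list of problems found in a metric object (empty if none)."""
    if not isinstance(obj, dict):
        return ["metric must be an object"]
    problems = []
    # "metric" is the only required property and must be a string.
    if not isinstance(obj.get("metric"), str):
        problems.append("'metric' is required and must be a string")
    if "entries" in obj and not isinstance(obj["entries"], list):
        problems.append("'entries' must be an array")
    if "last_updated" in obj and not isinstance(obj["last_updated"], str):
        problems.append("'last_updated' must be a date-time string")
    if "display" in obj and obj["display"] not in ALLOWED_DISPLAY:
        problems.append("'display' must be 'table' or 'join'")
    return problems


ok = {"metric": "Disk Usage", "last_updated": "2020-03-12T08:46:25-07:00", "entries": []}
print(check_metric(ok))
print(check_metric({"metric": "Disk Usage", "display": "grid"}))
```

For production use, a schema validation library run against the published schema file would be a more complete choice than hand-written checks.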

Configure Thresholds for Health Metrics

Configure the thresholds that define the normal, warning, and critical status for each health metric.
Use the new command illumio-pce-env metrics --write to adjust these thresholds.

See the PCE Administration Guide for more information.

PCE Health API Method

Functionality	HTTP	URI
Get the health information of the PCE Cluster and its nodes	`GET`	`[api_version]/health`

Check PCE Health

URI to Check PCE Health

            GET [api_version]/health

Curl Command to Check PCE Health

            curl -i -X GET https://pce.my-company.com:8443/api/v2/health -H 'Accept: application/json' -u $KEY:$TOKEN
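
The same request can be sketched in Python with only the standard library. The function name `build_health_request` and the placeholder credentials `my-key` and `my-token` are hypothetical; the URL and the `Accept` header are taken from the curl command, and HTTP basic authentication is what curl's `-u` option sends.

```python
import base64
import urllib.request


def build_health_request(base_url, api_key, api_token):
    """Build a GET request for [api_version]/health with HTTP basic auth."""
    credentials = f"{api_key}:{api_token}".encode("utf-8")
    auth = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/health",
        headers={"Accept": "application/json", "Authorization": f"Basic {auth}"},
        method="GET",
    )


# Placeholder credentials; substitute your own API key and token.
request = build_health_request(
    "https://pce.my-company.com:8443/api/v2", "my-key", "my-token"
)
print(request.get_method(), request.full_url)

# To send the request, pass it to urllib.request.urlopen() and parse the
# response body with json.load(). That step is omitted because it needs a
# reachable PCE.
```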

PCE Health Response Properties

Property	Description	Type	Required
`status`	Current health status of the PCE. Possible values: `normal`: When PCE health is in a normal state, all of the following are true: All required services are running. All nodes are running. CPU usage of all nodes is less than 95%. Memory usage of all nodes is less than 95%. Disk usage of all nodes is less than 95%. Database replication lag is less than or equal to 30 seconds. `warning`: When PCE health is in a warning state, one or more of the following are true: One or more nodes are unreachable. One or more optional services are missing, or one or more required services are degraded. CPU usage of any node is greater than or equal to 95%. Memory usage of any node is greater than or equal to 95%. Disk usage of any node is greater than or equal to 95%. Database replication lag is greater than 30 seconds. `critical`: PCE health is in a critical state when one or more required services are missing. If a PCE enters a critical state, it might not be possible to authenticate to the PCE or get an API response, depending on which services are missing.	String	Yes
`type`	Type of the PCE: `standalone`: Indicates that this PCE is an on-premises 2x2 or 4x2 PCE cluster. Or one of the following types: `leader`: Indicates that this PCE is the leader of a Supercluster. `member`: Indicates that this PCE is a member of a Supercluster.	String
`fqdn`	The fully qualified domain name (FQDN) of the PCE.	String
`available_seconds`	The length of time that this PCE has been available, measured in seconds.	Number
`notifications`	Health warnings related to the PCE, which contain the following properties: `status`: Severity status of this notification. Possible values include: `normal`, `warning`, or `critical`. `token`: Description token of the notification. `message`: Description of this notification.	Array String String String	Yes Yes No
`listen_only_mode_enabled_at`	Timestamp at which PCE Listen Only Mode was enabled. Format: `date-time`. For information about enabling or disabling listen-only mode for a PCE, see the PCE Administration Guide.	String, Null
`upgrade_pending`
`nodes`	The nodes that comprise your PCE cluster. For each node of your PCE, this API call returns the following properties: `hostname`: The node hostname. `ip_address`: The node IP address. `type` `runlevel`: The current runlevel of the PCE software on the node. `Minimum: 0` and `Maximum: 6`. For more information about runlevels and their usage, see the PCE Administration Guide. `uptime_seconds`: Seconds since this node was last restarted. `cpu`: CPU usage of this node: `status`: Either `normal`, `warning`, or `critical`. `percent`: Percentage of the node CPU being used. It can be `minimum:0` to `maximum:100`. `disk`: Disk usage of this node per individual location: `Location` `Value` (see `health_status_percent.schema.json`) `status`: Either `normal`, `warning`, or `critical`. `percent`: Percentage of the disk location being used. It can be `minimum:0` to `maximum:100`. `memory`: Memory usage of this node: `status`: Either `normal`, `warning`, or `critical`. `percent`: Percentage of the node memory being used. It can be `minimum:0` to `maximum:100`. `services`: The status of all PCE services running on the node. Possible statuses for PCE services include: `running`: The service is fully running and operational. `not running`: The service has stopped running. `partial`: The service is running but in a partial state. `optional` `unknown` `metrics`: Node metrics `metric` `entries` `last_updated_at`: Format: `date-time`. `generated_at`: Timestamp when this information was generated, in the `date-time` format.	Array Str&Null Str&Null Num&Null Num&Null Num&Null Object String Number Array String Array	Yes Yes No No No No Yes Yes Yes Yes Yes Yes No Yes
`network`	PCE 2x2 or 4x2 Deployment: For a PCE 2x2 or 4x2 deployment, the `network` property provides latency information between the database master and database slave data nodes in your PCE for policy and traffic data. This property also indicates which data node in your PCE is the database master and which is the database slave. This `type` of database replication is called `intracluster` in the REST API. Sub-properties include: `replication`: The category of properties that provide database replication latency information for a PCE cluster. (For a PCE Supercluster, this information is provided for each PCE in the Supercluster.) `type`: Type of replication. `intracluster` for a PCE 2x2 or 4x2 deployment. `details`: Includes the following properties: `database_name`: Name of the database being replicated. `master_fqdn`: The FQDN of the database master node. `slave_fqdn`: The FQDN of the database slave node. `value`: The amount of replication lag between the database master and database slave for both policy and traffic data. `status`: Lag status: `normal`, `warning`, or `critical`. `lag_seconds`: The amount of lag measured in seconds between the master and slave databases for both policy and traffic data. Supercluster Deployment: If you have deployed a PCE Supercluster, the PCE health call also returns information about the database replication between the PCE you are currently logged into and all other PCEs in the Supercluster. In a Supercluster deployment, the security policy provisioned on the leader is replicated to all other PCEs in the Supercluster. Additionally, all PCEs in the Supercluster (leader and members) replicate copies of each workload's context, such as IP addresses, to all other PCEs in the Supercluster. This other `type` of database replication for a Supercluster is called `intercluster` in the REST API, and information is provided for all PCEs in the Supercluster. Properties include: `replication`: The category of properties that provide database replication latency information for a PCE cluster. `type`: Type of replication. `intercluster` for a PCE Supercluster deployment. `details`: Includes the following properties: `fqdn`: The FQDN of the database master of the other PCEs listed in this section. `value`: The amount of replication lag between the PCE you are logged into and one of the other PCEs in the Supercluster. `status`: Either `normal`, `warning`, or `critical`. `lag_seconds`: The amount of lag measured in seconds between the PCE you are logged into and the other PCE listed in this section.	Object Object String Object String String String Object String Number	Yes Yes Yes Yes Yes Yes Yes Yes Yes
`metric`	One or more entries encompassing the metric. `metric`: Additional reported metrics `entries` `last_updated_at`: Format: `date-time`. `display`: An optional hint for the UI. `display:table` asks the UI to show the metric's data in a table. If `display:join` is used on the metric immediately after one with `display:table`, that metric is visually joined to the previous metric's table. `display: enclosed` requests that the UI wrap this property in parentheses.	Object String Array String String	No
`generated_at`	The timestamp of when the PCE information was generated. Format: date-time	String, Null	Yes
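
As one possible way to consume these properties, the Python sketch below pulls the overall status, any node whose CPU or memory status is not normal, and any non-normal notifications out of a parsed health object. The function name `summarize_health` and the sample data are hypothetical, and the sketch assumes a single health object; if your PCE returns a list of such objects, call the function once per element. Disk usage is left out because its per-location layout is not spelled out in the table.

```python
def summarize_health(health):
    """Collect the overall status plus non-normal node resources and notifications."""
    node_problems = []
    for node in health.get("nodes", []):
        for resource in ("cpu", "memory"):
            usage = node.get(resource) or {}
            status = usage.get("status", "normal")
            if status != "normal":
                node_problems.append(
                    f"{node.get('hostname')}: {resource} {status} ({usage.get('percent')}%)"
                )
    alerts = [n for n in health.get("notifications", []) if n.get("status") != "normal"]
    return {
        "status": health.get("status"),
        "node_problems": node_problems,
        "alerts": alerts,
    }


# Hypothetical response fragment, invented for illustration only.
sample = {
    "status": "warning",
    "fqdn": "pce.my-company.com",
    "notifications": [
        {"status": "warning", "token": "example_token", "message": "Example message"}
    ],
    "nodes": [
        {
            "hostname": "core0",
            "cpu": {"status": "normal", "percent": 20},
            "memory": {"status": "warning", "percent": 96},
        }
    ],
}

summary = summarize_health(sample)
print(summary["status"], summary["node_problems"])
```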
