PCE Health

The Public Experimental Health Check API displays health information about a 4X2 Supercluster or a PCE virtual appliance.

NOTE:

This API is only available for Illumio Core PCE installed on-premises and is not available for Illumio Cloud customers.

About the PCE Health API

With this API, you can see the following health information: 

  • How long the PCE has been running, its runlevel, and overall health (normal, warning, or error).
  • Each node's hostname, IP address, uptime, runlevel, and whether the PCE software is running properly.
  • Each node's type (core or data), which data node is the database master and which is the database slave, and the replication delay for the database slave.
  • Information about PCE service alerts, such as the number of degraded or failed services in the cluster, so you can see where service failures have occurred.
  • The new health API schema (health_definitions_schema.json) is designed to be consumed by the UI as well as the API end-user. Metrics are listed in two sections: for an individual node, and the general metrics section.

Health Metrics

Application-level metrics have been added to the PCE Health API to allow proactive monitoring and to provide insight into system performance. These metrics cover all core PCE subsystems: the core application (VEN heartbeats, policy), the PCE platform (database health and disk latency), and data (traffic pipeline and storage).

You can monitor the PCE health and performance by looking at nodes, clusters, database replication, and other services. Monitoring can be performed in different ways, such as using the Health page in the PCE's web console UI, messages in the PCE syslog, and the Illumio REST API.

While the periodic syslog messages can be used for historical (time-series) monitoring, the API uses predefined and customizable thresholds to flag metrics as warning or critical.

The following metric properties are used by the Health API:

  • metric (name, value, and units)
    • An example of a metric: { metric: "Disk Usage", last_updated: "2020-03-12T08:46:25-07:00", entries: [...] }
    • An example of values: [{ status: "normal", name: "usage", type: "percent", value: 12 }, { name: "disk", value: "persistent" }]
  • last_updated timestamp (not available in the UI)
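
As an illustration of how the name, value, type, and status properties fit together, here is a minimal Python sketch that renders value objects shaped like the example above. The helper name format_value is hypothetical and not part of the API, and the sample list is copied from the example rather than from a live response.

```python
def format_value(value):
    """Render one value object as 'name=value (type) [status]'.

    'type' and 'status' are treated as optional because the second
    example value above carries neither of them.
    """
    text = f"{value['name']}={value['value']}"
    if "type" in value:
        text += f" ({value['type']})"
    if "status" in value:
        text += f" [{value['status']}]"
    return text


# Sample values copied from the example above.
values = [
    {"status": "normal", "name": "usage", "type": "percent", "value": 12},
    {"name": "disk", "value": "persistent"},
]

lines = [format_value(v) for v in values]
for line in lines:
    print(line)
```

Treating type and status as optional keeps the helper usable for descriptive values, such as the disk name, that carry no threshold status.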

If you want to enable or disable individual metrics, use the CLI commands described in Configurable Thresholds for Health Metrics. You can also use the configurable thresholds technique to turn off all metrics.

Health metric schema

The existing UI schema is extended to allow generic metrics in two sections: the node section and general section.

The overall metric schema may look like this:

"metric": {
	"description": "One or more entries encompassing the metric.",
	"type": "object",
	"required": [
		"metric"
	],
	"properties": {
		"metric": {
			"type": "string"
		},
		"entries": {
			"type": "array",
			"items": {
				"anyOf": [
					{
						"$ref": "#/definitions/cluster"
					},
					{
						"$ref": "#/definitions/entry"
					}
				]
			}
		},
		"last_updated": {
			"type": "string",
			"format": "date-time"
		},
		"display": {
			"description": "An optional hint for the UI to display the data in a specific form.",
			"type": "string",
			"enum": [
				"table",
				"join"
			]
		}
	}
}
        
NOTE: The optional "display" field is not used for the node metrics.
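
The constraints in this schema can be checked by hand before a metric object is processed. The sketch below is a simplified illustration, not a full JSON Schema validator: it only covers the required metric string, the entries array, the last_updated date-time, and the display enum, and it does not resolve the #/definitions/cluster or #/definitions/entry references. The function name check_metric is an assumption made for this example.

```python
from datetime import datetime

# Values allowed by the "display" enum in the schema.
ALLOWED_DISPLAY = {"table", "join"}


def check_metric(obj):
    """Return a list of problems found in one metric object (empty if none)."""
    problems = []
    # "metric" is the only required property and must be a string.
    if not isinstance(obj.get("metric"), str):
        problems.append("'metric' is required and must be a string")
    # "entries" is optional, but must be an array when present.
    if "entries" in obj and not isinstance(obj["entries"], list):
        problems.append("'entries' must be an array")
    # "last_updated" must parse as an ISO 8601 date-time when present.
    if "last_updated" in obj:
        try:
            datetime.fromisoformat(obj["last_updated"])
        except (TypeError, ValueError):
            problems.append("'last_updated' must be a date-time string")
    # "display" is an optional UI hint restricted to the enum values.
    if "display" in obj and obj["display"] not in ALLOWED_DISPLAY:
        problems.append("'display' must be 'table' or 'join'")
    return problems


good = {"metric": "Disk Usage", "last_updated": "2020-03-12T08:46:25-07:00", "entries": []}
bad = {"entries": {}, "display": "grid"}

print(check_metric(good))
print(check_metric(bad))
```

For complete validation, including the referenced definitions, a JSON Schema library run against health_definitions_schema.json would be the more thorough choice.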

Configure Thresholds for Health Metrics

Configure the thresholds that define the normal, warning, and critical status for each health metric.
Use the new command illumio-pce-env metrics --write to adjust these thresholds.

See the PCE Administration Guide for more information.

PCE Health API Method

Functionality | HTTP | URI
Get the health information of the PCE cluster and its nodes | GET | [api_version]/health

Check PCE Health

URI to Check PCE Health

            GET [api_version]/health
        

Curl Command to Check PCE Health

            curl -i -X GET https://pce.my-company.com:8443/api/v2/health -H 'Accept: application/json' -u $KEY:$TOKEN
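
The same request can be prepared from a script. The following Python sketch builds an equivalent GET request with the standard library only; it assumes HTTP Basic authentication (which is what curl -u sends), and the function name build_health_request and the placeholder credentials are hypothetical. The request is constructed but not sent, so the example runs without access to a PCE.

```python
import base64
import urllib.request


def build_health_request(base_url, api_key, token):
    """Build a GET request for [api_version]/health, using api/v2 as in the curl example."""
    url = f"{base_url.rstrip('/')}/api/v2/health"
    # curl -u KEY:TOKEN sends the pair as an HTTP Basic Authorization header.
    credentials = base64.b64encode(f"{api_key}:{token}".encode()).decode()
    return urllib.request.Request(
        url,
        method="GET",
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )


# Placeholder host and credentials; replace them with your own values.
request = build_health_request("https://pce.my-company.com:8443", "api_key_id", "api_secret")
print(request.get_method(), request.full_url)

# To send the request and parse the body (requires a reachable PCE):
#   import json
#   with urllib.request.urlopen(request) as response:
#       health = json.load(response)
```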

PCE Health Response Properties

Property Description Type Required  
status

Current health status of the PCE. Possible values:

  • normal: When PCE health is in a normal state, it means:
    • All required services are running.
    • All nodes are running.
    • CPU usage of all nodes is less than 95%.
    • Memory usage of all nodes is less than 95%.
    • Disk usage of all nodes is less than 95%.
    • Database replication lag is less than or equal to 30 seconds.
  • warning: When PCE health is in a warning state, it means that at least one of the following is true:
    • One or more nodes are unreachable.
    • One or more optional services are missing, or one or more required services have been degraded.
    • CPU usage of any node is greater than or equal to 95%.
    • Memory usage of any node is greater than or equal to 95%.
    • Disk usage of any node is greater than or equal to 95%.
    • Database replication lag is greater than 30 seconds.
  • critical: A PCE is considered to be in a critical state when one or more required services are missing. If a PCE enters a critical state, it might not be possible to authenticate to the PCE or get an API response, depending on which services are missing from the PCE.

Type: String. Required: Yes.
type

Type of the PCE. One of the following:

  • standalone: Indicates that this PCE is an on-premises 2x2 or 4x2 PCE cluster.
  • leader: Indicates that this PCE is the leader of a Supercluster.
  • member: Indicates that this PCE is a member of a Supercluster.

Type: String.

fqdn

The fully qualified domain name (FQDN) of the PCE.

Type: String.

available_seconds

The length of time that this PCE has been available, measured in seconds.

Type: Number.
notifications

Health warnings related to the PCE, which contain the following properties:

  • status: Severity status of this notification. Possible values: normal, warning, or critical. Type: String. Required: Yes.
  • token: Description token of the notification. Type: String. Required: Yes.
  • message: Description of this notification. Type: String. Required: No.

Type: Array.
listen_only_mode_enabled_at

Timestamp at which PCE Listen Only Mode was enabled. Format: date-time.

For information about enabling or disabling Listen Only Mode for a PCE, see the PCE Administration Guide.

Type: String, Null.
upgrade_pending        
nodes

The nodes that comprise your PCE cluster. For each node of your PCE, this API call returns the following properties:

  • hostname: The node hostname.
  • ip_address: The node IP address.
  • type
  • runlevel: The current runlevel of the PCE software on the node. Minimum: 0. Maximum: 6. For more information about runlevels and their usage, see the PCE Administration Guide.
  • uptime_seconds: Number of seconds since this node was last restarted.
  • cpu: CPU usage of this node:
    • status: Either normal, warning, or critical.
    • percent: Percentage of the node CPU being used. Minimum: 0. Maximum: 100.
  • disk: Disk usage of this node, per individual location:
    • Location
    • Value (see health_status_percent.schema.json):
      • status: Either normal, warning, or critical.
      • percent: Percentage of the disk being used. Minimum: 0. Maximum: 100.
  • memory: Memory usage of this node:
    • status: Either normal, warning, or critical.
    • percent: Percentage of the node memory being used. Minimum: 0. Maximum: 100.
  • services: The status of all PCE services running on the node. Possible statuses for PCE services include:
    • running: The service is fully running and operational.
    • not running: The service has stopped running.
    • partial: The service is running but in a partial state.
    • optional
    • unknown
  • metrics: Node metrics:
    • metric
    • entries
    • last_updated_at: Format: date-time.
  • generated_at: Timestamp of when this information was generated, in date-time format.
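Type (values in table order): Array; String, Null; String, Null; Number, Null; Number, Null; Number, Null; Object; String; Number; Array; String; Array

Required (values in table order): Yes; Yes; No; No; No; No; Yes; Yes; Yes; Yes; Yes; Yes; No; Yes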
 
network

PCE 2x2 or 4x2 Deployment

For a PCE 2x2 or 4x2 deployment, the network property provides latency information between the database master and database slave data nodes in your PCE for policy and traffic data.

This property also indicates which data node in your PCE is the database master and which is the database slave.

This type of database replication is called intracluster in the REST API.

Sub-properties include:

replication: The category of properties that provide database replication latency information for a PCE cluster. (For a PCE Supercluster, this information is provided for each PCE in the Supercluster.) Type: Object. Required: Yes.

  • type: Type of replication. intracluster for a PCE 2x2 or 4x2 deployment. Type: String. Required: Yes.
  • details: Type: Object. Required: Yes. Includes the following properties:
    • database_name: Name of the database being replicated. Type: String. Required: Yes.
    • master_fqdn: The FQDN of the database master node. Type: String. Required: Yes.
    • slave_fqdn: The FQDN of the database slave node. Type: String. Required: Yes.
  • value: The amount of replication lag between the database master and the database slave for both policy and traffic data. Type: Object. Required: Yes.
    • status: Lag status: normal, warning, or critical. Type: String. Required: Yes.
    • lag_seconds: The amount of lag, measured in seconds, between the master and slave databases for both policy and traffic data. Type: Number. Required: Yes.

Supercluster Deployment

If you have deployed a PCE Supercluster, the PCE health call also returns information about the database replication between the PCE you are currently logged into and all other PCEs in the Supercluster.

In a Supercluster deployment, the security policy provisioned on the leader is replicated to all other PCEs in the Supercluster. Additionally, all PCEs in the Supercluster (leader and members) replicate copies of each workload's context, such as IP addresses, to all other PCEs in the Supercluster.

This other type of database replication for a Supercluster is called intercluster in the REST API, and information is provided for all PCEs in the Supercluster.

Properties include:

replication: The category of properties that provide database replication latency information for a PCE cluster.

  • type: Type of replication. intercluster for a PCE Supercluster deployment.
  • details: Includes the following properties:
    • fqdn: The FQDN of the database master of the other PCE listed in this section.
  • value: The amount of replication lag between the PCE you are logged into and one of the other PCEs in the Supercluster.
    • status: Either normal, warning, or critical.
    • lag_seconds: The amount of lag, measured in seconds, between the PCE you are logged into and the other PCE listed in this section.

Type: Object.
 
metric

One or more entries encompassing the metric.

  • metric: Additional reported metrics. Type: String.
    • entries: Type: Array.
    • last_updated_at: Format: date-time. Type: String.
    • display: An optional hint for the UI. Type: String.
      With display:table, the UI displays the metric's data in a table.
      If a metric with display:join immediately follows a metric with display:table, it is visually joined to the previous metric's table.
      With display:enclosed, the UI wraps the property in parentheses.

Type: Object. Required: No.
generated_at

The timestamp of when the PCE information was generated. Format: date-time.

Type: String, Null. Required: Yes.
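
To show how the response properties above might be consumed, here is a short Python sketch. It restates the documented status rules with the 95% and 30-second values listed for the status property (the PCE computes status itself and thresholds are configurable, so this is for illustration only), and it extracts node and replication details from a parsed response. All function names, the sample hostnames, and the sample numbers are hypothetical; the sketch also assumes the response has already been parsed from JSON into a dictionary.

```python
def classify_health(*, required_services_missing=0, nodes_unreachable=0,
                    optional_services_missing=0, required_services_degraded=0,
                    usage_percents=(), replication_lag_seconds=0):
    """Restate the documented status rules (95% usage, 30 s replication lag)."""
    if required_services_missing:
        return "critical"
    if (nodes_unreachable or optional_services_missing or required_services_degraded
            or any(p >= 95 for p in usage_percents)
            or replication_lag_seconds > 30):
        return "warning"
    return "normal"


def nodes_needing_attention(health):
    """Return (hostname, [resources]) for nodes whose cpu or memory status is not normal."""
    flagged = []
    for node in health.get("nodes", []):
        bad = [name for name in ("cpu", "memory")
               if (node.get(name) or {}).get("status", "normal") != "normal"]
        if bad:
            flagged.append((node.get("hostname"), bad))
    return flagged


def replication_lags(health):
    """Return (type, status, lag_seconds) for each replication entry."""
    replication = health.get("network", {}).get("replication", [])
    if isinstance(replication, dict):  # accept a single object as well as a list
        replication = [replication]
    return [(r.get("type"),
             r.get("value", {}).get("status"),
             r.get("value", {}).get("lag_seconds")) for r in replication]


# Hand-written sample data; hostnames and numbers are made up.
sample = {
    "status": "warning",
    "nodes": [
        {"hostname": "core0.example.com",
         "cpu": {"status": "normal", "percent": 12},
         "memory": {"status": "warning", "percent": 96}},
        {"hostname": "data0.example.com",
         "cpu": {"status": "normal", "percent": 5},
         "memory": {"status": "normal", "percent": 40}},
    ],
    "network": {"replication": {"type": "intracluster",
                                "value": {"status": "normal", "lag_seconds": 2}}},
}

print(classify_health(usage_percents=[12, 96]))
print(nodes_needing_attention(sample))
print(replication_lags(sample))
```

Checking for missing required services first mirrors the table: a critical state takes precedence over any warning condition.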
