Manage Data and Disk Capacity

The PCE can collect and store a large amount of data. Events, Explorer, and the internal syslog all generate data that is stored in PCE databases and log files. When stored data is not managed carefully, disks can fill up, which can cause a variety of symptoms: failed backups, failing API calls, and general PCE functionality issues. Even when these issues do not occur, a large amount of stored data produces larger database backups, which take longer to create and restore.

To successfully manage these issues, consider the following recommendations:

  • Identify: Know your organization's policies, backup strategies, and monitoring strategies.
  • Detect: Monitor ongoing disk usage.
  • Respond: Know how to troubleshoot and fix issues related to data storage.
  • Recover: Set up your PCE deployment to reduce disk usage.

Identify Data Management Strategies

Identify your organization's policies and strategies related to data storage and retention, backups, and monitoring. This knowledge forms the basis for any ongoing data management activities. You'll need the following information:

  • Records retention policy: How many days of events data must be available at all times? When your policy requires fewer days of events data than the PCE's default, you can decrease the PCE's events retention period, which helps avoid filling up disk space.
  • System backup policy: Are full backups always necessary, or would weekly full backups be sufficient, supplemented by smaller daily backups that do not include events data?
  • Disk usage trends: How fast is data usage growing in your Illumio Core deployment? What is the additional data usage each day?
  • Monitoring tools: What disk monitoring tools are in place? If none, is there a useful tool that could be added? Do the monitoring tools integrate with the PCE Health API?
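
The disk usage trend questions above come down to simple arithmetic on periodic readings. The following is a minimal sketch, assuming you record the used space (in GB) on a given day with your own tooling; the function names and sample figures are hypothetical and are not part of the PCE.

```python
from datetime import date


def daily_growth_gb(samples):
    """Average growth in GB per day between the first and last sample.

    samples: list of (date, used_gb) tuples, oldest first.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")
    return (last_used - first_used) / elapsed_days


def days_until_full(samples, capacity_gb):
    """Days left before the disk fills at the average growth rate.

    Returns None when usage is flat or shrinking.
    """
    growth = daily_growth_gb(samples)
    if growth <= 0:
        return None
    return (capacity_gb - samples[-1][1]) / growth


# Hypothetical readings, for illustration only.
samples = [(date(2024, 1, 1), 120.0), (date(2024, 1, 8), 134.0)]
print(daily_growth_gb(samples))         # 2.0
print(days_until_full(samples, 200.0))  # 33.0
```

A linear estimate like this ignores spikes and the effect of the events retention period, so treat the result as a rough planning figure rather than a forecast.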

Detect Disk Usage

Monitor disk usage so that you stay aware of current status and trends, especially unusual activity such as sudden spikes.

  • Watch the PCE Health page. For information, see Monitor PCE Health.
    • Check the Disk Usage figures.
    • When disk usage is too high, the PCE displays warnings, such as “Disk Critical.”
    • You can call the page's underlying PCE Health API with external monitoring tools.
  • Check the system health messages that are sent to syslog from each node in the cluster.
  • Use the command illumio-pce-ctl events-db disk-usage-show to get the number of events in the database, the amount of disk used by the Events database, and the average number of events per day. For more information, see View Events Using PCE Command Line in the Events Administration Guide.
  • Run your own disk monitoring tools or use standard Linux commands, such as df and du.
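
If you prefer a script to running df by hand, a check along the following lines could be run on each node. This is a minimal sketch that uses only the Python standard library; the 80 and 90 percent thresholds are assumptions chosen for illustration and are not the thresholds the PCE uses for its own warnings.

```python
import shutil

# Assumed thresholds; tune them to your own alerting policy.
WARNING_PERCENT = 80.0
CRITICAL_PERCENT = 90.0


def percent_used(total_bytes, used_bytes):
    """Used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def classify(percent):
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if percent >= CRITICAL_PERCENT:
        return "critical"
    if percent >= WARNING_PERCENT:
        return "warning"
    return "ok"


def check_path(path):
    """Return (percent_used, status) for the filesystem that holds path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return pct, classify(pct)


if __name__ == "__main__":
    pct, status = check_path("/")
    print(f"/ is {pct:.1f}% used: {status}")
```

The percentage can differ slightly from the Use% column of df, which excludes blocks reserved for the root user from its calculation.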

Respond to Disk Capacity Issues

You can prevent many disk capacity issues by deploying the PCE with sufficient resources. Be sure your disk meets the recommendations in PCE Capacity Planning in the PCE Installation and Upgrade Guide.

When you are running out of storage space, use Linux tools to find the parts of the disk that are being utilized heavily. Then, depending on your findings, try some of these techniques:

  • Are the PCE log files taking up disk space? Look for extra, older files you can move or delete from the log directory (usually /var/logs/illumio-pce).
  • Are other system logs taking too much space? Rotate and compress them, or delete them.
  • After a PCE successfully joins a Supercluster, a directory called postgresql.bak is sometimes left behind in the <postgresql directory>, especially on the database master node. The PCE keeps this directory in case the cluster-join command fails and you need to recover. Once the cluster-join has completed and disk space has become the higher priority, you can delete postgresql.bak and all its contents.
  • Delete any large or unnecessary files in the /tmp directory; for example, core files.
  • Remove copies of backups stored on PCE nodes. In general, don't use the PCE as a place to store backup files.
  • Reduce the retention period for events data, making sure it is still acceptable according to your organization's records retention policy. The PCE automatically deletes records older than the retention period from the database. The default retention period for events is 30 days, and you can decrease it to as little as 1 day. However, exercise caution: balance the need to minimize disk usage against your company's data retention policies and your need to retain data for analysis. For information about how to change the data retention period, see Configure Events Settings in PCE Web Console in the Events Administration Guide.
  • The PCE provides short-term storage of events data. Consider forwarding events data to Splunk or other SIEM software for long-term storage in accordance with your organization's data retention policies.
  • Consider excluding events from most database dumps. Use the option --no-include-events for the illumio-pce-db-management dump command. When your organization's policies permit it, perform a full database dump (which includes events data) once during each events data retention period.
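
Several of the techniques above start with finding older or oversized files, for example in the log directory or in /tmp. The following is a minimal sketch of such a search; the function name and its parameters are hypothetical. It only lists candidates and never deletes anything, so you can review the output before moving or removing files.

```python
import os
import stat
import sys
import time


def find_old_files(root, min_age_days, min_size_bytes=0, now=None):
    """List regular files under root last modified more than min_age_days ago.

    Returns (size_bytes, path) pairs, largest first. Nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, and other special files
            if info.st_mtime < cutoff and info.st_size >= min_size_bytes:
                matches.append((info.st_size, path))
    return sorted(matches, reverse=True)


# Usage: python find_old_files.py <directory> <min_age_days>
if __name__ == "__main__" and len(sys.argv) == 3:
    for size, path in find_old_files(sys.argv[1], int(sys.argv[2])):
        print(f"{size:>14}  {path}")
```

Run it with enough privileges to read the directory in question. Files that the PCE is still writing to have a recent modification time and therefore do not show up as candidates.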

Recover Disk Usage

  • Extend the disk: When the current disk or partition is smaller than the recommended size, increase the partition size. You can configure different local partition settings in the file runtime_env.yml.
  • Add a partition or slice for logs or backups: Copy the old files in /var/logs/illumio-pce to a new disk. Mount the new disk to the same location on the PCE with the same permissions as the original disk.
  • Create a new disk or partition: Mount a new disk or partition to a suitable location for saving backup files.
  • Move the Explorer database to its own disk: Mount a new dedicated disk and move files from the existing traffic datastore to this dedicated disk. For information, see How to Move an Existing Explorer Database to a Separate Disk in the Illumio Knowledge Base (login required).
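
When you move logs to a new disk, the new location needs the same permissions as the original. As a quick sanity check before you switch mounts, you could compare the owner, group, and mode bits of the two directories. This is a minimal sketch; the function names are hypothetical, and it compares only the top-level directories, not every file beneath them.

```python
import os
import stat


def ownership_and_mode(path):
    """Return (uid, gid, permission bits) for path."""
    info = os.stat(path)
    return info.st_uid, info.st_gid, stat.S_IMODE(info.st_mode)


def same_permissions(original, replacement):
    """True when both paths have the same owner, group, and mode bits."""
    return ownership_and_mode(original) == ownership_and_mode(replacement)
```

For example, calling same_permissions with the current log directory and the temporary mount point of the new disk returns False if either the ownership or the mode differs.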