Use Cluster Health Dashboard

StarTree Cloud cluster health dashboard

The StarTree Cloud cluster health dashboard summarizes the health checks run against your Pinot cluster. It shows the pass/fail status of each check and lets you filter checks by instance or table, giving you a single view of the overall health of the cluster.

The ClusterHealthCheckTask periodic task runs every 20 minutes by default. Dashboard check results are cached in memory and overwritten on every run.

To run or fetch checks on demand, use these controller API calls:

  • GET - /periodictask/run?taskName=ClusterHealthCheckTask (to run the checks now)
  • GET - /clusterHealth (to fetch cluster health)
  • GET - /clusterHealth/list (to list all available cluster health checks)
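
The calls above can be scripted. The following is a minimal sketch using only the Python standard library. The three paths and the taskName value come from the list above; everything else is an assumption for illustration: the controller URL is a placeholder, the helper names build_url and controller_get are hypothetical, and the Bearer token header is only one possible authentication scheme, so adapt it to how your cluster is secured. The response format is not documented here, so the sketch returns parsed JSON when possible and the raw text otherwise.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint paths taken from the controller API list above.
RUN_TASK_PATH = "/periodictask/run"
HEALTH_PATH = "/clusterHealth"
LIST_PATH = "/clusterHealth/list"
TASK_NAME = "ClusterHealthCheckTask"


def build_url(controller_url: str, path: str, params: Optional[dict] = None) -> str:
    """Join the controller base URL, an endpoint path, and optional query parameters."""
    url = controller_url.rstrip("/") + path
    if params:
        url += "?" + urlencode(params)
    return url


def controller_get(controller_url: str, path: str,
                   params: Optional[dict] = None,
                   token: Optional[str] = None,
                   timeout: float = 30.0):
    """Send a GET request to the controller and return the decoded response body."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the cluster accepts a Bearer token. Change this header
        # if your deployment uses a different authentication scheme.
        headers["Authorization"] = "Bearer " + token
    request = Request(build_url(controller_url, path, params),
                      headers=headers, method="GET")
    with urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # The response schema is not documented here, so fall back to raw text.
        return body


def refresh_and_fetch(controller_url: str, token: Optional[str] = None):
    """Trigger the health check task, then fetch the cached cluster health results."""
    controller_get(controller_url, RUN_TASK_PATH, {"taskName": TASK_NAME}, token)
    return controller_get(controller_url, HEALTH_PATH, token=token)
```

For example, refresh_and_fetch("https://pinot-controller.example.com", token="...") would trigger a run and then read the results, where the host name is a placeholder. The task may not have finished by the time the second request is sent, so a caller might need to wait briefly or fetch again.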

To view the cluster health dashboard

Log into StarTree Cloud and do the following:

  1. Click the organization, then select the workspace you want to view monitoring metrics for.
  2. Click the Services tab.
  3. Click the link next to My Apps.
  4. Click the Pinot Control Panel tile.

A dashboard appears with a list of checks. For each check, it shows whether the check passed or failed, along with additional details about the check.

List of health checks

Current health checks are listed here.

| Check | Description |
| --- | --- |
| IDEAL_STATE_EV_MISMATCH_CHECK | Checks if a table has any segments whose ExternalView state does not match the IdealState |
| SEGMENT_COLUMN_MISMATCH_CHECK | Checks if a table has any segments whose columns do not match the current table schema |
| SEGMENT_COUNT_CHECK | Checks if a table has too many segments |
| SEGMENT_SIZE_CHECK | Checks if a table has too many small-sized segments |
| REPLICATION_CHECK | Checks if a table has a replication of 3 or more |
| TABLE_COLUMN_COUNT_CHECK | Checks if a table has too many columns in the schema |
| TIME_COLUMN_GRANULARITY_CHECK | Checks if a table has any time columns with granularity set to MILLISECONDS or MICROSECONDS |
| UPSERT_TABLE_SEGMENT_ASSIGNMENT_CHECK | For an UPSERT table, checks if all segments of a partition are assigned to a single server |
| INSTANCE_HEALTH_API_CHECK | Checks if the instance /health API is live |
| SEGMENT_SKEW_HEALTH_CHECK | Verifies whether any server hosting a table with more than 50 segments has a segment count that exceeds 50% of the mean segment count across all servers for that table |
| CONSUMING_PARTITION_SKEW_HEALTH_CHECK | Verifies whether any server hosting a table with more than 10 consuming segments has a consuming segment count that exceeds 50% of the mean consuming segment count across all servers for that table |
| TABLE_SKEW_CHECK | Checks whether the number of tables hosted by a server exceeds 50% of the average number of tables hosted across all servers |
| CLUSTER_LEVEL_SEGMENT_SKEW_CHECK | Verifies whether the number of segments on any server exceeds 50% of the mean segment count across all servers |
| CLUSTER_LEVEL_CONSUMING_PARTITION_SKEW_CHECK | Verifies whether the number of consuming segments on any server exceeds 50% of the mean consuming segment count across all servers |
| HELIX_HOST_NAME_INSTANCE_NAME_MISMATCH_CHECK | Checks if the instance ID/name matches the expected value derived from the instance config |
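
The skew checks listed above all compare a per-server count against the mean across servers. The sketch below illustrates that kind of comparison. It is not the actual check implementation. It rests on two assumptions: "exceeds 50% of the mean" is read as "is more than 50% above the mean", and the minimum of 50 segments applies to the table's total segment count. The function name skewed_servers and its parameters are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def skewed_servers(counts_per_server: Dict[str, int],
                   threshold: float = 0.5,
                   min_total: int = 50) -> List[str]:
    """Return the servers whose count is more than `threshold` above the mean.

    Assumption: the check only applies when the total count across servers
    is greater than `min_total`; otherwise no server is reported.
    """
    if not counts_per_server:
        return []
    if sum(counts_per_server.values()) <= min_total:
        return []
    average = mean(counts_per_server.values())
    limit = average * (1 + threshold)  # e.g. 1.5 x the mean for threshold=0.5
    return sorted(name for name, count in counts_per_server.items()
                  if count > limit)
```

With counts of 40, 10, and 10 segments on three servers, the mean is 20 and the limit under this reading is 30, so only the server holding 40 segments would be reported.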