Managed Offline Flow

Pinot is most commonly used to provide real-time analytics on streaming data, which is achieved using a real-time table. However, after running such a system for a while, we may want to update data that has already been ingested into this table. Perhaps a value in a column has been renamed, or we want to remove some duplicate records.

Segments in real-time tables can't be replaced, but segments in offline tables can. Managed offline flow is how Pinot handles moving data from a real-time table to an offline table.
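The flow is driven by a RealtimeToOfflineSegmentsTask configured on the real-time table. As a minimal sketch, assuming the hybrid table in this recipe is called events, the Controller is reachable on localhost:9000, and you have jq installed, you can inspect that task configuration via the Controller REST API:

# Fetch the REALTIME table config and show its task section, which contains the
# RealtimeToOfflineSegmentsTask settings (table name "events" comes from this recipe).
curl -s "http://localhost:9000/tables/events?type=realtime" | jq '.REALTIME.task'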

In this recipe we'll learn how to use Pinot's managed offline flow.

The code for this recipe is available at https://github.com/startreedata/pinot-recipes/tree/main/recipes/managed-offline-flow

Prerequisites

To follow the code examples in this guide, you need Docker installed locally and a copy of the recipes repository.

Clone this repository and navigate to this recipe:

git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/managed-offline-flow

Makefile

make recipe

Running this recipe will spin up the required infrastructure and start producing data into Kafka.
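If you want to confirm that events are flowing before moving on, one optional check is to consume a few messages from the Kafka topic. The container name, script name, and topic name below are assumptions and may differ from this recipe's Docker Compose setup:

# Consume a handful of messages to confirm the data generator is producing into Kafka.
# Adjust the container and topic names to match the recipe's docker-compose file.
docker exec -it kafka kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic events \
  --from-beginning \
  --max-messages 5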

Managed Offline Flow

Run the next Make task:

make manage_offline_flow

The Make command above performs the following tasks (a sketch of the equivalent REST calls follows this list):

  • Sets the necessary properties in the Pinot Controller to enable the managed offline flow task: RealtimeToOfflineSegmentsTask.timeoutMs and RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance.
  • Schedules the task to run.
  • Prints logs related to the task.
  • Updates the hybrid table's time boundary so that you can see the records that have been moved to the offline table.
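For reference, here is a rough sketch of what the first two steps look like as direct calls to the Controller REST API. The timeout and concurrency values are illustrative placeholders rather than the recipe's exact settings; check the Makefile for the values it actually uses:

# Set the cluster-level properties that enable the RealtimeToOfflineSegmentsTask.
curl -X POST "http://localhost:9000/cluster/configs" \
  -H "Content-Type: application/json" \
  -d '{
        "RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
        "RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "1"
      }'

# Schedule the task for the real-time table so that segments start moving to the offline table.
curl -X POST "http://localhost:9000/tasks/schedule?taskType=RealtimeToOfflineSegmentsTask&tableName=events_REALTIME"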

View realtime and offline segments

Navigate to http://localhost:9000/#/query and run the following query:

select $segmentName, count(*) cnt
from events
group by $segmentName
order by cnt desc

To watch records migrate from REALTIME to OFFLINE, run make realtime to generate more data, then make manage_offline_flow to migrate the older data, and re-run the query above. See the README for this recipe on GitHub for sample output.
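You can also check where the hybrid table's time boundary currently sits, since it determines which records are served from the offline table and which from the real-time table. This is a hedged sketch: it assumes the Broker's REST port is mapped to 8099 on the host and that your Pinot version exposes the broker debug time boundary endpoint; adjust or skip this check if your setup differs:

# Ask the Broker for the current time boundary of the hybrid "events" table.
# The host port 8099 is an assumption about this recipe's docker-compose mapping.
curl -s "http://localhost:8099/debug/timeBoundary/events"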

Clean up

make clean

Troubleshooting

To clean up old Docker state (stopped containers, unused networks, and dangling images) that may be interfering with your testing of this recipe, run the following command:

docker system prune