Managed Offline Flow
Pinot is most commonly used to provide real-time analytics on streaming data, which is achieved using a real-time table. After running such a system for a while, however, we'll often want to update the data that has already been ingested into that table: perhaps a value in a column has been renamed, or we want to remove some duplicate records.
Segments in real-time tables can't be replaced, but segments in offline tables can. The managed offline flow is Pinot's mechanism for moving data from a real-time table to an offline table.
In this recipe we'll learn how to use Pinot's managed offline flow.
| Pinot Version | 0.9.3 |
| Code | startreedata/pinot-recipes/managed-offline-flow |
This is the code for the following recipe: https://github.com/startreedata/pinot-recipes/tree/main/recipes/managed-offline-flow
Prerequisites
To follow the code examples in this guide, you must install Docker locally and download the recipes.
Clone this repository and navigate to this recipe:
git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/managed-offline-flow
Spin up the recipe by running the following Make task:
make recipe
Running this recipe will spin up the Pinot cluster and a Kafka broker, and start producing data into Kafka.
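If you want to confirm that everything is up before moving on, you can list the running containers. This is just a sanity check; the exact container names depend on the recipe's Docker setup, but you should see Pinot components and Kafka in the output:
# List the containers started by the recipe.
docker ps --format "table {{.Names}}\t{{.Status}}"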
Run the next Make task:
make manage_offline_flow
The Make command above performs the following tasks:
- Sets the properties needed to enable the managed offline flow task on the Pinot Controller: RealtimeToOfflineSegmentsTask.timeoutMs and RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance (see the sketch of the equivalent REST calls after this list).
- Schedules the task to run.
- Prints logs related to the task.
- Updates the hybrid table's time boundary so that you can see the records that have been moved to the offline table.
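Under the hood, the Make task drives the Pinot Controller's REST API. The following is a minimal sketch of the kind of calls involved, not the exact commands in the Makefile: the property values shown are illustrative, and events_REALTIME refers to the real-time side of this recipe's events table.
# Set the task properties as cluster configs on the Controller
# (illustrative values; the recipe's Makefile sets its own).
curl -X POST http://localhost:9000/cluster/configs \
  -H "Content-Type: application/json" \
  -d '{
    "RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
    "RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "1"
  }'

# Ask the Controller to schedule the task for the real-time table;
# a Pinot Minion then picks it up and builds the offline segments.
curl -X POST "http://localhost:9000/tasks/schedule?taskType=RealtimeToOfflineSegmentsTask&tableName=events_REALTIME"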
View realtime and offline segments
Navigate to http://localhost:9000/#/query and run the following query:
select $segmentName, count(*) cnt
from events
group by $segmentName
order by cnt desc
Run the statement above to see how many records are in each segment. To watch records migrate from REALTIME to OFFLINE, run make realtime to generate more data and make manage_offline_flow to move the older data, then re-run the query. See this recipe's README on GitHub for sample output.
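You can also run the same query over HTTP rather than through the query console. This sketch assumes the Controller is exposed on port 9000, as it is in this recipe, and submits SQL to the Controller's /sql endpoint, which forwards it to a broker. Appending _REALTIME or _OFFLINE to the table name queries just that physical table, which is a handy way to check where the records currently live:
# Query the hybrid table over HTTP via the Controller's /sql endpoint.
curl -X POST http://localhost:9000/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "select $segmentName, count(*) cnt from events group by $segmentName order by cnt desc"}'

# Count only the records that have been moved to the offline table.
curl -X POST http://localhost:9000/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "select count(*) from events_OFFLINE"}'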
Clean up
make clean
Troubleshooting
If old Docker containers, networks, or images are interfering with your testing of this recipe, clean them up by running the following command:
docker system prune