In real-time OLAP databases, we typically use append-only data structures for fast data ingestion. But how do we deal with updating records to, say, capture the latest status of an order or the most recent location of a delivery vehicle? Pinot handles this for real-time data ingestion with its upsert functionality.
So how does it work?
Events are still ingested into the store regardless of whether they are new records or updates of existing ones, as shown in the diagram below:
All records are still ingested
But Pinot will also populate an in-memory dictionary to keep track of the latest docId for each primary key, as shown in the diagram below:
Pinot's in-memory upserts dictionary
That dictionary is then used to populate a segment's validDocIds, which is used at query time so that only the latest record for each primary key is returned.
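To make the bookkeeping concrete, here is a minimal sketch in Python. It is a toy model, not Pinot's implementation: the class name `UpsertSegmentSketch`, its fields, and the `order_id` primary key are all hypothetical, and it assumes the record that arrives later always wins.

```python
from typing import Dict, Hashable, List, Set


class UpsertSegmentSketch:
    """Toy model of an append-only segment with upsert metadata (hypothetical)."""

    def __init__(self) -> None:
        # Append-only store: a record's position in the list is its docId.
        self.records: List[dict] = []
        # In-memory dictionary: primary key -> latest docId for that key.
        self.latest_doc_id: Dict[Hashable, int] = {}
        # docIds that queries are allowed to see.
        self.valid_doc_ids: Set[int] = set()

    def ingest(self, record: dict, primary_key: str = "order_id") -> int:
        # Every event is appended, whether it is new or an update.
        doc_id = len(self.records)
        self.records.append(record)

        key = record[primary_key]
        previous = self.latest_doc_id.get(key)
        if previous is not None:
            # The older record for this key is no longer valid.
            self.valid_doc_ids.discard(previous)

        self.latest_doc_id[key] = doc_id
        self.valid_doc_ids.add(doc_id)
        return doc_id

    def query(self) -> List[dict]:
        # Only valid docIds are read; superseded records are skipped.
        return [self.records[d] for d in sorted(self.valid_doc_ids)]


segment = UpsertSegmentSketch()
segment.ingest({"order_id": 1, "status": "PLACED"})
segment.ingest({"order_id": 2, "status": "PLACED"})
segment.ingest({"order_id": 1, "status": "DELIVERED"})
print(segment.query())
```

All three events stay in `records`, but the query only reads the docIds left in `valid_doc_ids`, so order 1 appears once with its most recent status.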
Something to keep in mind is that upserts work on a per-partition basis only, so you need to make sure that the partitions in your streaming data platform are keyed by the primary key.
Partition data by primary key
If you don't do this, the upsert functionality won't work and you'll see duplicate data.
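The effect of the partitioning key can be sketched with a small simulation. This is an illustration under stated assumptions rather than how Pinot or any streaming platform is implemented: `visible_rows`, `NUM_PARTITIONS`, and the event fields are hypothetical, and a simple modulo stands in for the platform's hash partitioner.

```python
from typing import Dict, List

NUM_PARTITIONS = 2  # hypothetical partition count


def visible_rows(events: List[dict], partition_field: str) -> List[dict]:
    """Apply upserts independently per partition and return what a query sees."""
    # partition -> (primary key -> latest event seen in that partition)
    latest: Dict[int, Dict[int, dict]] = {}
    for event in events:
        # Modulo stands in for the streaming platform's partitioner.
        partition = event[partition_field] % NUM_PARTITIONS
        latest.setdefault(partition, {})[event["order_id"]] = event
    return [row for per_key in latest.values() for row in per_key.values()]


events = [
    {"order_id": 7, "seq": 0, "status": "PLACED"},
    {"order_id": 7, "seq": 1, "status": "SHIPPED"},
    {"order_id": 7, "seq": 2, "status": "DELIVERED"},
]

# Keyed by the primary key: all events for order 7 share a partition.
keyed_by_primary_key = visible_rows(events, "order_id")
# Keyed by another field: events for order 7 are spread across partitions.
keyed_by_other_field = visible_rows(events, "seq")

print(len(keyed_by_primary_key))
print(len(keyed_by_other_field))
```

When the stream is keyed by `order_id`, each partition sees every update for the orders it owns and a single row survives. When it is keyed by `seq`, each partition keeps its own "latest" record for order 7, which is the duplicate data described above.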