Upserts

In real-time OLAP databases, we typically use append-only data structures for fast data ingestion. But how do we deal with updating records to, say, capture the latest status of an order or the most recent location of a delivery vehicle? Pinot handles this for real-time data ingestion with its upsert functionality.

Tip: If you want to learn how to configure upserts, see the full upserts or partial upserts developer guides.

So how does it work?

Events are still ingested into the store regardless of whether they are new records or updates of existing ones, as shown in the diagram below:

Figure: All records are still ingested

But Pinot will also populate an in-memory dictionary to keep track of the latest docId for each primary key, as shown in the diagram below:

Figure: Pinot's in-memory upserts dictionary

That dictionary is then used to populate each segment's validDocIds, which queries use to return only the latest record for each primary key and skip the superseded ones.
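The bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not Pinot's implementation: the class `UpsertSegmentSketch` and its fields are hypothetical names, it models a single segment, and it assumes the most recently ingested record for a key is the latest one.

```python
from dataclasses import dataclass, field


@dataclass
class UpsertSegmentSketch:
    """Toy model (hypothetical): an append-only store plus upsert metadata."""

    docs: list = field(default_factory=list)           # append-only record store
    latest_doc_id: dict = field(default_factory=dict)  # primary key -> latest docId
    valid_doc_ids: set = field(default_factory=set)    # docIds visible to queries

    def ingest(self, primary_key, record):
        """Append the record, then point the primary key at its docId."""
        doc_id = len(self.docs)
        self.docs.append(record)  # every event is stored, new or update

        # Assumption: arrival order decides which record is "latest".
        previous = self.latest_doc_id.get(primary_key)
        if previous is not None:
            self.valid_doc_ids.discard(previous)  # older version is no longer valid
        self.latest_doc_id[primary_key] = doc_id
        self.valid_doc_ids.add(doc_id)
        return doc_id

    def query(self):
        """Return only the records whose docId is still valid."""
        return [self.docs[i] for i in sorted(self.valid_doc_ids)]


segment = UpsertSegmentSketch()
segment.ingest("order-1", {"order_id": "order-1", "status": "PLACED"})
segment.ingest("order-2", {"order_id": "order-2", "status": "PLACED"})
segment.ingest("order-1", {"order_id": "order-1", "status": "SHIPPED"})

stored = len(segment.docs)   # all three events are kept in the store
visible = segment.query()    # only the latest record per key is returned
```

In this sketch all three events stay in the store, while the query sees two records: the only version of `order-2` and the `SHIPPED` version of `order-1`.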

Keep in mind that upserts work within an individual partition only, so you need to make sure that the partitions in your streaming data platform are keyed by the primary key. That way, every event for a given key lands in the same partition.

Figure: Partition data by primary key

If you don't do this, the upsert functionality won't work and you'll see duplicate data.
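To illustrate the partitioning requirement, here is a minimal sketch of key-based routing. The helpers `partition_for` and `route` are hypothetical, and the MD5-based hash is a stand-in for whatever partitioner your streaming platform actually uses. The point is only that a deterministic hash of the primary key sends every event for that key to the same partition.

```python
import hashlib
from collections import defaultdict


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a primary key to a partition with a deterministic hash (stand-in)."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route(events, num_partitions, key_field):
    """Group events by the partition their primary key hashes to."""
    partitions = defaultdict(list)
    for event in events:
        partition = partition_for(event[key_field], num_partitions)
        partitions[partition].append(event)
    return partitions


events = [
    {"order_id": "order-1", "status": "PLACED"},
    {"order_id": "order-2", "status": "PLACED"},
    {"order_id": "order-1", "status": "SHIPPED"},
]
partitions = route(events, num_partitions=4, key_field="order_id")

# Collect the partitions that received at least one event for order-1.
order_1_partitions = {
    partition
    for partition, routed in partitions.items()
    if any(event["order_id"] == "order-1" for event in routed)
}
```

Because both `order-1` events hash to the same partition, a per-partition upsert dictionary sees both and can invalidate the older one. With round-robin or random partitioning, the two events could land in different partitions, and each would be treated as the latest record for its key.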