Skip to main content

Off-Heap Upsert

Motivation

The current implementation of upsert relies on an in-memory map on each server to store metadata. This metadata is stored for every unique key in the table. This quickly leads to an increase in heap memory usage as the number of primary keys increase in the table.

Another limitation of keeping this state in memory is that we have to recreate it every time the servers need a restart.

Solution

We designed Off-Heap upsert to solve this particular problem. With the help of this feature, Upsert can now be scaled easily on each server while using a fraction of the memory.

The upsert state is now managed on disk. This however leads to a negligible slowdown in ingestion speed. The reason being we use a cache on top of the disk storage so all the recent reads should go to memory. We also only append the latest writes to the memory store which are flushed to disk later thus improving the write speeds as well.

The query path is not changed at all so users should still get the same p99 latencies. In fact, the latencies might be better due to reduced pressure on heap memory.

The off-heap upsert implementation follows the extensible plugin ecosystem principle of Apache Pinot. We have designed it to be able to support any data store as the state backend. However, currently, it uses RocksDB as the default backend.

Usage

Off-heap upsert supports two deployment modes -

  • Server Level Store - Here each server has a single backend and all tables update their state to this backend. This is pretty efficient as it leads to less overall CPU and memory utilization of the backend.

  • Table Level Store - Here each table has a separate backend. It provides better performance for that table.

By default, users should rely on the server-level store and only use the table-level store in case of extremely rare performance bottlenecks.

Use Server Level Store

The server-level store requires a change in server configs at the moment. We are working to make this configurable by the user itself.

  • Raise a request via the Support portal to enable the off-heap store in StarTree. The ticket should contain approximate count of distinct primary keys that will be present in all the tables in which offheap-upsert will be enabled. Just an approximate global count is required and not a granular table-level count.

  • Once the off-heap store is enabled in your deployment, you can start using it in your tables.

Enable in a new table

For a new table, users can simply add the following configuration in the table config.

    "upsertConfig" : {
"mode": "FULL",
"hashFunction": "NONE",
"enableSnapshot": true,
"metadataManagerClass": "ai.startree.pinot.upsert.rocksdb.RocksDBTableUpsertMetadataManager"
}

Partial upserts are also supported. The configuration above is just an example. You can just add enableSnapshot and metadataManagerClass to your existing upsert configuration.

That's it. Now the table should start using the off-heap backend for upsert.

Enable in an existing table

The steps to enable in an existing table are the same as above. However, it requires a server restart to transfer the older metadata to this new backend.

In the same Support ticket, You can also mention the existing table names in which the feature needs to be enabled.

Use Table Level Store

As mentioned before, Table level store should only be used in case you are experiencing any major ingestion lag because of off-heap upsert. The table-level store will lead to higher CPU usage and memory usage but will provide better performance.

Enable in a new table

For a new table, users can simply add the following configuration in the table config.

    "upsertConfig" : {
"mode": "FULL",
"hashFunction": "NONE",
"enableSnapshot": true,
"metadataManagerClass": "ai.startree.pinot.upsert.rocksdb.RocksDBTableUpsertMetadataManager",
"metadataManagerConfigs": {
"useIsolatedStore": "true"
}
}

Users can also tune the backend according to the performance requirement. e.g. To change the RocksDB cache size to 2GB (default is 1GB)

  "metadataManagerConfigs": {
"rocksdb.blockcache.size_bytes": "2147483648",
}

You can overwrite any Rocks DBOptions as well as CFOptions mentioned in the official RocksDB wiki.

  • To override DBOptions, use the prefix rocksdb.db. followed by the option name
  • To override ColumnFamilyOptions, use the prefix rocksdb.columnfamily. followed by the option name.
  • To override BlockBasedTable options, use the prefix rocksdb.blocktableformat. followed by the option name.

Partial upserts are also supported. The configuration above is just an example. You can just add enableSnapshot, metadataManagerClass, and metadataManagerConfigs to your existing upsert configuration.

That's it. Now the table should start using the off-heap backend for upsert.

Enable in an existing table

The steps to enable in an existing table are the same as above. However, it requires a server restart to transfer the older metadata to this new backend.

In the same Support ticket, You can also mention the existing table names in which the feature needs to be enabled.

Conclusion

With off-heap upsert, users should now be able to use Apache Pinot's powerful real-time upsert feature on even larger datasets at a fraction of the cost. We are working to make configuring Off-Heap upsert even easier for users. This involves eliminating the need to require server restarts as well as the need to make changes to instance configs.