Skip to main content

Infer Pinot schema from Avro schema

In this recipe, we'll learn how to infer a Pinot schema from an Avro schema.

Prerequisites

To follow the code examples in this guide, you must install Docker locally and download recipes.

  1. If you haven't already, download recipes.
  2. In the terminal, go to the recipe by running the following command:
    cd pinot-recipes/recipes/infer-schema-avro-data

JSON Data

We're going to infer a Pinot schema from the following Avro schema:

/avro/mastodon-topic-value.avsc
{
"type":"record",
"namespace":"com.simonaubury",
"name":"MastodonDetails",
"fields":[
{"name":"m_id","type":"long","default":0},
{"name":"created_at","type": "int","logicalType":"date"},
{"name":"created_at_str","type":"string","default":"unknown"},
{"name":"app","type":"string","default":"unknown"},
{"name":"url","type":"string","default":"null"},
{"name":"base_url","type":"string","default":"null"},
{"name":"language","type":"string","default":"null"},
{"name":"favourites","type":"int","default":0},
{"name":"username","type":"string","default":""},
{"name":"bot","type":"boolean"},
{"name":"tags","type":"int","default":0},
{"name":"characters","type":"int","default":0},
{"name":"words","type":"int","default":0},
{"name":"mastodon_text","type":"string","default":"null"}
]
}

Infer schema

Now we're going to infer a schema for this input file. We can do this using the AvroSchemaToPinotSchema command.

You can generate a schema file that creates a dimension column field per Aro field, by running the following command:

docker run \
-v ${PWD}/config:/config \
-v ${PWD}/avro:/avro \
apachepinot/pinot:0.12.0-arm64 AvroSchemaToPinotSchema \
-avroSchemaFile /avro/mastodon-topic-value.avsc \
-pinotSchemaName="mastodon" \
-outputDir="/config" \
-dimensions="" \
-timeColumnName "created_at" \
-metrics "favourites,words,characters,tags"

A JSON file will be written to ./config/mastodon.json, the contents of which are shown below:

./config/mastodon.json
{
"schemaName": "mastodon",
"dimensionFieldSpecs": [
{"name": "m_id", "dataType": "LONG"},
{"name": "created_at_str", "dataType": "STRING"},
{"name": "app", "dataType": "STRING"},
{"name": "url", "dataType": "STRING"},
{"name": "base_url", "dataType": "STRING"},
{"name": "language", "dataType": "STRING"},
{"name": "username", "dataType": "STRING"},
{"name": "bot", "dataType": "BOOLEAN"},
{"name": "mastodon_text", "dataType": "STRING"}
],
"metricFieldSpecs": [
{"name": "favourites", "dataType": "INT"},
{"name": "tags", "dataType": "INT"},
{"name": "characters", "dataType": "INT"},
{"name": "words", "dataType": "INT"}
],
"dateTimeFieldSpecs": [
{
"name": "created_at",
"dataType": "INT",
"format": "1:DAYS:EPOCH",
"granularity": "1:DAYS"
}
]
}