Infer Pinot schema from JSON data
In this recipe we'll learn how to infer a Pinot schema from a JSON input file.
Pinot Version | 0.9.0 |
Code | startreedata/pinot-recipes/infer-schema-json-data |
Pre-requisites
You will need to install Docker locally to follow the code examples in this guide.
Download Recipe
First, clone the GitHub repository to your local machine and navigate to this recipe:
git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/infer-schema-json-data
If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.
JSON Data
We're going to infer a Pinot schema from the following input file:
{
"id":"7044874109",
"type":"PushEvent",
"actor":{
"id":18542751,
"login":"LimeVista",
"display_login":"LimeVista",
"gravatar_id":"",
"url":"https://api.github.com/users/LimeVista",
"avatar_url":"https://avatars.githubusercontent.com/u/18542751?"
},
"repo":{
"id":115911530,
"name":"LimeVista/Tapes",
"url":"https://api.github.com/repos/LimeVista/Tapes"
},
"payload":{
"push_id":2226018068,
"size":1,
"distinct_size":1,
"ref":"refs/heads/master",
"head":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
"before":"892d872c5d3f24cc6837900c9f4618dc2fe92930",
"commits":[
{
"sha":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
"author":{
"name":"Lime",
"email":"4cc153d999e24274955157fc813e6f92f821525d@outlook.com"
},
"message":"Merge branch 'master' of https://github.com/LimeVista/Tapes\\n\\n# Conflicts:\\n#\\t.gitignore",
"distinct":true,
"url":"https://api.github.com/repos/LimeVista/Tapes/commits/c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774"
}
]
},
"public":true,
"created_at":"2018-01-01T11:00:00Z"
}
Infer schema
Now we're going to infer a schema for this input file.
We can do this using the JsonToPinotSchema
command.
Schema with only dimension fields
You can generate a schema file that creates a dimension column field per JSON field, by running the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github" \
-outputDir="/config" \
-dimensions=""
A JSON file will be written to ./config/github.json
, the contents of which are shown below:
{
"schemaName" : "github",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
}, {
"name" : "created_at",
"dataType" : "STRING"
} ]
}
Schema with time column
We should probably specify the created_at
field as a date time field, which we can do using the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github_with_ts" \
-outputDir="/config" \
-timeColumnName=created_at
A JSON file will be written to ./config/github.json
, the contents of which are shown below:
{
"schemaName" : "github_with_ts",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
} ],
"dateTimeFieldSpecs" : [ {
"name" : "created_at",
"dataType" : "STRING",
"format" : "1:DAYS:EPOCH",
"granularity" : "1:DAYS"
} ]
}
Schema with unnested fields
At the moment the payload.commits
JSON array is stored as a string.
We can unnest each of the values in the documents in the array, by running the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github_unnest" \
-outputDir="/config" \
-timeColumnName=created_at \
-fieldsToUnnest=payload.commits
A JSON file will be written to ./config/github_unnest.json
, the contents of which are shown below:
{
"schemaName" : "github_unnest",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits.sha",
"dataType" : "STRING"
}, {
"name" : "payload.commits.author.name",
"dataType" : "STRING"
}, {
"name" : "payload.commits.author.email",
"dataType" : "STRING"
}, {
"name" : "payload.commits.message",
"dataType" : "STRING"
}, {
"name" : "payload.commits.distinct",
"dataType" : "BOOLEAN"
}, {
"name" : "payload.commits.url",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
} ],
"dateTimeFieldSpecs" : [ {
"name" : "created_at",
"dataType" : "STRING",
"format" : "1:DAYS:EPOCH",
"granularity" : "1:DAYS"
} ]
}