Infer Pinot schema from JSON data
In this recipe we'll learn how to infer a Pinot schema from a JSON input file.
Pinot Version | 0.9.0 |
Code | startreedata/pinot-recipes/infer-schema-json-data |
Prerequisites
To follow the code examples in this guide, you must install Docker (opens in a new tab) locally and download recipes.
Navigate to recipe
- If you haven't already, download recipes.
- In terminal, go to the recipe by running the following command:
cd pinot-recipes/recipes/infer-schema-json-data
JSON Data
We're going to infer a Pinot schema from the following input file:
{
"id":"7044874109",
"type":"PushEvent",
"actor":{
"id":18542751,
"login":"LimeVista",
"display_login":"LimeVista",
"gravatar_id":"",
"url":"https://api.github.com/users/LimeVista",
"avatar_url":"https://avatars.githubusercontent.com/u/18542751?"
},
"repo":{
"id":115911530,
"name":"LimeVista/Tapes",
"url":"https://api.github.com/repos/LimeVista/Tapes"
},
"payload":{
"push_id":2226018068,
"size":1,
"distinct_size":1,
"ref":"refs/heads/master",
"head":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
"before":"892d872c5d3f24cc6837900c9f4618dc2fe92930",
"commits":[
{
"sha":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
"author":{
"name":"Lime",
"email":"4cc153d999e24274955157fc813e6f92f821525d@outlook.com"
},
"message":"Merge branch 'master' of https://github.com/LimeVista/Tapes\\n\\n# Conflicts:\\n#\\t.gitignore",
"distinct":true,
"url":"https://api.github.com/repos/LimeVista/Tapes/commits/c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774"
}
]
},
"public":true,
"created_at":"2018-01-01T11:00:00Z"
}
/data/github.json
Infer schema
Now we're going to infer a schema for this input file.
We can do this using the JsonToPinotSchema
command.
Schema with only dimension fields
You can generate a schema file that creates a dimension column field per JSON field, by running the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github" \
-outputDir="/config" \
-dimensions=""
A JSON file will be written to ./config/github.json
, the contents of which are shown below:
{
"schemaName" : "github",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
}, {
"name" : "created_at",
"dataType" : "STRING"
} ]
}
./config/github.json
Schema with time column
We should probably specify the created_at
field as a date time field, which we can do using the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github_with_ts" \
-outputDir="/config" \
-timeColumnName=created_at
A JSON file will be written to ./config/github.json
, the contents of which are shown below:
{
"schemaName" : "github_with_ts",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
} ],
"dateTimeFieldSpecs" : [ {
"name" : "created_at",
"dataType" : "STRING",
"format" : "1:DAYS:EPOCH",
"granularity" : "1:DAYS"
} ]
}
./config/github_with_ts.json
Schema with unnested fields
At the moment the payload.commits
JSON array is stored as a string.
We can unnest each of the values in the documents in the array, by running the following command:
docker run \
-v ${PWD}/data/github.json:/data/github.json \
-v ${PWD}/config:/config \
apachepinot/pinot:0.9.3 JsonToPinotSchema \
-jsonFile /data/github.json \
-pinotSchemaName="github_unnest" \
-outputDir="/config" \
-timeColumnName=created_at \
-fieldsToUnnest=payload.commits
A JSON file will be written to ./config/github_unnest.json
, the contents of which are shown below:
{
"schemaName" : "github_unnest",
"dimensionFieldSpecs" : [ {
"name" : "id",
"dataType" : "STRING"
}, {
"name" : "type",
"dataType" : "STRING"
}, {
"name" : "actor.id",
"dataType" : "INT"
}, {
"name" : "actor.login",
"dataType" : "STRING"
}, {
"name" : "actor.display_login",
"dataType" : "STRING"
}, {
"name" : "actor.gravatar_id",
"dataType" : "STRING"
}, {
"name" : "actor.url",
"dataType" : "STRING"
}, {
"name" : "actor.avatar_url",
"dataType" : "STRING"
}, {
"name" : "repo.id",
"dataType" : "INT"
}, {
"name" : "repo.name",
"dataType" : "STRING"
}, {
"name" : "repo.url",
"dataType" : "STRING"
}, {
"name" : "payload.push_id",
"dataType" : "LONG"
}, {
"name" : "payload.size",
"dataType" : "INT"
}, {
"name" : "payload.distinct_size",
"dataType" : "INT"
}, {
"name" : "payload.ref",
"dataType" : "STRING"
}, {
"name" : "payload.head",
"dataType" : "STRING"
}, {
"name" : "payload.before",
"dataType" : "STRING"
}, {
"name" : "payload.commits.sha",
"dataType" : "STRING"
}, {
"name" : "payload.commits.author.name",
"dataType" : "STRING"
}, {
"name" : "payload.commits.author.email",
"dataType" : "STRING"
}, {
"name" : "payload.commits.message",
"dataType" : "STRING"
}, {
"name" : "payload.commits.distinct",
"dataType" : "BOOLEAN"
}, {
"name" : "payload.commits.url",
"dataType" : "STRING"
}, {
"name" : "public",
"dataType" : "BOOLEAN"
} ],
"dateTimeFieldSpecs" : [ {
"name" : "created_at",
"dataType" : "STRING",
"format" : "1:DAYS:EPOCH",
"granularity" : "1:DAYS"
} ]
}
./config/github_unnest.json