Infer Pinot schema from JSON data

Infer Pinot schema from JSON data

In this recipe we'll learn how to infer a Pinot schema from a JSON input file.

Prerequisites

To follow the code examples in this guide, you must install Docker (opens in a new tab) locally and download recipes.

Navigate to recipe

  1. If you haven't already, download recipes.
  2. In terminal, go to the recipe by running the following command:
cd pinot-recipes/recipes/infer-schema-json-data

JSON Data

We're going to infer a Pinot schema from the following input file:

{
  "id":"7044874109",
  "type":"PushEvent",
  "actor":{
    "id":18542751,
    "login":"LimeVista",
    "display_login":"LimeVista",
    "gravatar_id":"",
    "url":"https://api.github.com/users/LimeVista",
    "avatar_url":"https://avatars.githubusercontent.com/u/18542751?"
  },
  "repo":{
    "id":115911530,
    "name":"LimeVista/Tapes",
    "url":"https://api.github.com/repos/LimeVista/Tapes"
  },
  "payload":{
    "push_id":2226018068,
    "size":1,
    "distinct_size":1,
    "ref":"refs/heads/master",
    "head":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
    "before":"892d872c5d3f24cc6837900c9f4618dc2fe92930",
    "commits":[
      {
        "sha":"c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774",
        "author":{
          "name":"Lime",
          "email":"4cc153d999e24274955157fc813e6f92f821525d@outlook.com"
        },
        "message":"Merge branch 'master' of https://github.com/LimeVista/Tapes\\n\\n# Conflicts:\\n#\\t.gitignore",
        "distinct":true,
        "url":"https://api.github.com/repos/LimeVista/Tapes/commits/c5fc8b32a9ead1eba315d97410cb4ac1e6ca1774"
      }
    ]
  },
  "public":true,
  "created_at":"2018-01-01T11:00:00Z"
}
 

/data/github.json

Infer schema

Now we're going to infer a schema for this input file. We can do this using the JsonToPinotSchema command.

Schema with only dimension fields

You can generate a schema file that creates a dimension column field per JSON field, by running the following command:

docker run \
  -v ${PWD}/data/github.json:/data/github.json \
  -v ${PWD}/config:/config \
  apachepinot/pinot:0.9.3 JsonToPinotSchema \
  -jsonFile /data/github.json \
  -pinotSchemaName="github" \
  -outputDir="/config" \
  -dimensions=""

A JSON file will be written to ./config/github.json, the contents of which are shown below:

{
  "schemaName" : "github",
  "dimensionFieldSpecs" : [ {
    "name" : "id",
    "dataType" : "STRING"
  }, {
    "name" : "type",
    "dataType" : "STRING"
  }, {
    "name" : "actor.id",
    "dataType" : "INT"
  }, {
    "name" : "actor.login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.display_login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.gravatar_id",
    "dataType" : "STRING"
  }, {
    "name" : "actor.url",
    "dataType" : "STRING"
  }, {
    "name" : "actor.avatar_url",
    "dataType" : "STRING"
  }, {
    "name" : "repo.id",
    "dataType" : "INT"
  }, {
    "name" : "repo.name",
    "dataType" : "STRING"
  }, {
    "name" : "repo.url",
    "dataType" : "STRING"
  }, {
    "name" : "payload.push_id",
    "dataType" : "LONG"
  }, {
    "name" : "payload.size",
    "dataType" : "INT"
  }, {
    "name" : "payload.distinct_size",
    "dataType" : "INT"
  }, {
    "name" : "payload.ref",
    "dataType" : "STRING"
  }, {
    "name" : "payload.head",
    "dataType" : "STRING"
  }, {
    "name" : "payload.before",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits",
    "dataType" : "STRING"
  }, {
    "name" : "public",
    "dataType" : "BOOLEAN"
  }, {
    "name" : "created_at",
    "dataType" : "STRING"
  } ]
}

./config/github.json

Schema with time column

We should probably specify the created_at field as a date time field, which we can do using the following command:

docker run \
  -v ${PWD}/data/github.json:/data/github.json \
  -v ${PWD}/config:/config \
  apachepinot/pinot:0.9.3 JsonToPinotSchema \
  -jsonFile /data/github.json \
  -pinotSchemaName="github_with_ts" \
  -outputDir="/config" \
  -timeColumnName=created_at

A JSON file will be written to ./config/github.json, the contents of which are shown below:

{
  "schemaName" : "github_with_ts",
  "dimensionFieldSpecs" : [ {
    "name" : "id",
    "dataType" : "STRING"
  }, {
    "name" : "type",
    "dataType" : "STRING"
  }, {
    "name" : "actor.id",
    "dataType" : "INT"
  }, {
    "name" : "actor.login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.display_login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.gravatar_id",
    "dataType" : "STRING"
  }, {
    "name" : "actor.url",
    "dataType" : "STRING"
  }, {
    "name" : "actor.avatar_url",
    "dataType" : "STRING"
  }, {
    "name" : "repo.id",
    "dataType" : "INT"
  }, {
    "name" : "repo.name",
    "dataType" : "STRING"
  }, {
    "name" : "repo.url",
    "dataType" : "STRING"
  }, {
    "name" : "payload.push_id",
    "dataType" : "LONG"
  }, {
    "name" : "payload.size",
    "dataType" : "INT"
  }, {
    "name" : "payload.distinct_size",
    "dataType" : "INT"
  }, {
    "name" : "payload.ref",
    "dataType" : "STRING"
  }, {
    "name" : "payload.head",
    "dataType" : "STRING"
  }, {
    "name" : "payload.before",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits",
    "dataType" : "STRING"
  }, {
    "name" : "public",
    "dataType" : "BOOLEAN"
  } ],
  "dateTimeFieldSpecs" : [ {
    "name" : "created_at",
    "dataType" : "STRING",
    "format" : "1:DAYS:EPOCH",
    "granularity" : "1:DAYS"
  } ]
}

./config/github_with_ts.json

Schema with unnested fields

At the moment the payload.commits JSON array is stored as a string. We can unnest each of the values in the documents in the array, by running the following command:

docker run \
  -v ${PWD}/data/github.json:/data/github.json \
  -v ${PWD}/config:/config \
  apachepinot/pinot:0.9.3 JsonToPinotSchema \
  -jsonFile /data/github.json \
  -pinotSchemaName="github_unnest" \
  -outputDir="/config" \
  -timeColumnName=created_at \
  -fieldsToUnnest=payload.commits

A JSON file will be written to ./config/github_unnest.json, the contents of which are shown below:

{
  "schemaName" : "github_unnest",
  "dimensionFieldSpecs" : [ {
    "name" : "id",
    "dataType" : "STRING"
  }, {
    "name" : "type",
    "dataType" : "STRING"
  }, {
    "name" : "actor.id",
    "dataType" : "INT"
  }, {
    "name" : "actor.login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.display_login",
    "dataType" : "STRING"
  }, {
    "name" : "actor.gravatar_id",
    "dataType" : "STRING"
  }, {
    "name" : "actor.url",
    "dataType" : "STRING"
  }, {
    "name" : "actor.avatar_url",
    "dataType" : "STRING"
  }, {
    "name" : "repo.id",
    "dataType" : "INT"
  }, {
    "name" : "repo.name",
    "dataType" : "STRING"
  }, {
    "name" : "repo.url",
    "dataType" : "STRING"
  }, {
    "name" : "payload.push_id",
    "dataType" : "LONG"
  }, {
    "name" : "payload.size",
    "dataType" : "INT"
  }, {
    "name" : "payload.distinct_size",
    "dataType" : "INT"
  }, {
    "name" : "payload.ref",
    "dataType" : "STRING"
  }, {
    "name" : "payload.head",
    "dataType" : "STRING"
  }, {
    "name" : "payload.before",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits.sha",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits.author.name",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits.author.email",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits.message",
    "dataType" : "STRING"
  }, {
    "name" : "payload.commits.distinct",
    "dataType" : "BOOLEAN"
  }, {
    "name" : "payload.commits.url",
    "dataType" : "STRING"
  }, {
    "name" : "public",
    "dataType" : "BOOLEAN"
  } ],
  "dateTimeFieldSpecs" : [ {
    "name" : "created_at",
    "dataType" : "STRING",
    "format" : "1:DAYS:EPOCH",
    "granularity" : "1:DAYS"
  } ]
}

./config/github_unnest.json