JSON Transformation Functions

In this recipe we'll learn how to use JSON transformation functions to extract values from nested JSON documents during the data ingestion process.

Prerequisites

You will need to install Docker locally to follow the code examples in this guide.

Download Recipe

First, clone the GitHub repository to your local machine and navigate to this recipe:

git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/json-transformation-functions

If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

docker-compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.

Dataset

We're going to import the following JSON file:

data/import.json
{"name":"Pete", "meta": {"age":24}, "subjectsAndGrades":[{"name":"Maths", "grade":"A"}, {"name":"English", "grade":"B"}]}
{"name":"John", "meta": {"age":28}, "subjectsAndGrades":[{"name":"Maths", "grade":"A"}, {"name":"Computer Science", "grade":"C"}]}

Pinot Schema and Table

Now let's create a Pinot Schema and Table.

First, the schema:

config/schema.json
{
  "schemaName": "people",
  "dimensionFieldSpecs": [
    {
      "name": "name",
      "dataType": "STRING"
    },
    {
      "name": "age",
      "dataType": "INT"
    },
    {
      "name": "subjects",
      "dataType": "STRING",
      "singleValueField": false
    },
    {
      "name": "grades",
      "dataType": "STRING",
      "singleValueField": false
    }
  ]
}

The subjects and grades columns will both contain arrays of values, which we configure by setting "singleValueField":false.

We'll also have the following table config:

config/table.json
{
  "tableName": "people",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "schemaName": "people"
  },
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "subjects",
        "transformFunction": "jsonPathArray(subjectsAndGrades, '$.[*].name')"
      },
      {
        "columnName": "grades",
        "transformFunction": "jsonPathArray(subjectsAndGrades, '$.[*].grade')"
      },
      {
        "columnName": "age",
        "transformFunction": "JSONPATHLONG(meta, '$.age')"
      }
    ],
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "metadata": {}
}

In this config we define transform configs (ingestionConfig.transformConfigs) that extract the subject names and grades from the subjectsAndGrades property using the jsonPathArray function, and one that extracts the age from the meta property using the JSONPATHLONG function.
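As a rough illustration of what these transform functions do, the following Python sketch mimics them with plain dictionary and list operations (Pinot actually evaluates JsonPath expressions internally; the transform helper below is hypothetical, written only for this example):

```python
import json

def transform(record):
    """Mimic the table's transformConfigs on one input document."""
    return {
        "name": record["name"],
        # jsonPathArray(subjectsAndGrades, '$.[*].name')
        "subjects": [s["name"] for s in record["subjectsAndGrades"]],
        # jsonPathArray(subjectsAndGrades, '$.[*].grade')
        "grades": [s["grade"] for s in record["subjectsAndGrades"]],
        # JSONPATHLONG(meta, '$.age')
        "age": int(record["meta"]["age"]),
    }

doc = json.loads(
    '{"name":"Pete", "meta": {"age":24}, '
    '"subjectsAndGrades":[{"name":"Maths", "grade":"A"}, '
    '{"name":"English", "grade":"B"}]}'
)
row = transform(doc)
print(row["subjects"])  # ['Maths', 'English']
print(row["age"])       # 24
```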

You can create the table and schema by running the following command:

docker exec -it pinot-controller-json bin/pinot-admin.sh AddTable   \
-tableConfigFile /config/table.json \
-schemaFile /config/schema.json \
-exec

You should see a message similar to the following if everything is working correctly:

2022/02/25 13:21:11.963 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost 172.30.0.3 -controllerPort 9000 -user null -password [hidden] -exec
2022/02/25 13:21:13.337 INFO [AddTableCommand] [main] {"status":"Table people_OFFLINE succesfully added"}

Ingestion Job

Now we’re going to import the JSON file into Pinot. We'll do this with the following ingestion spec:

config/job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.json'
outputDirURI: '/opt/pinot/data/movies/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'people'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

The import job will map fields in each JSON document to the corresponding column in the people schema. Fields that don't have a matching column in the schema will be skipped.

In this case the name field will be automatically mapped to the name column. The subjectsAndGrades field is processed by the two jsonPathArray transformation functions, and the resulting values are imported into the subjects and grades columns. The meta field is processed by the JSONPATHLONG transformation function to extract the age property, which is stored in the age column.

Run the following command to launch the ingestion job:

docker exec -it pinot-controller-json bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml

Querying

Once that's completed, navigate to localhost:9000/#/query and click on the people table or copy/paste the following query:

select * 
from people
limit 10

You will see the following output:

age  grades  name  subjects
24   A,B     Pete  Maths,English
28   A,C     John  Maths,Computer Science
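Multi-value columns can also be used in query predicates, where a row matches if any value in the array satisfies the condition. For example, a query like the following (an extra example, not part of the recipe) should return everyone who took Maths:

```sql
select name, grades
from people
where subjects = 'Maths'
limit 10
```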