Skip to main content

Ingesting JSON files

In this recipe we'll learn how to ingest JSON documents from a newline-delimited JSON (jsonlines) file.

Mark Needham shows how to ingest JSON files

Pre-requisites

You will need to install Docker locally to follow the code examples in this guide.

Download Recipe

First, clone the GitHub repository to your local machine and navigate to this recipe:

git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/ingest-json-files

If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

docker-compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.

Dataset

We're going to import the following JSON file:

data/import.json
{"title": "Valentine's Day", "genre": "Comedy", "year": 2010, "id": 361248901147483647}
{"title": "The Ugly Truth", "genre": "Comedy", "year": 2009, "id": 332567813147483648}
{"title": "P.S. I Love You", "genre": "Romance", "year": 2007, "id": 346905752147483649}
{"title": "Dear John", "genre": "Drama", "year": 2010, "id": 300441473147483650}
{"title": "The Curious Case of Benjamin Button", "genre": "Fantasy", "year": 2008, "id": 394030854147483651}

Pinot Schema and Table

Now let's create a Pinot Schema and Table.

First, the schema:

config/schema.json
{
"schemaName": "movies",
"dimensionFieldSpecs": [
{
"name": "id",
"dataType": "LONG"
},
{
"name": "title",
"dataType": "STRING"
},
{
"name": "genre",
"dataType": "STRING"
},
{
"name": "year",
"dataType": "INT"
}
]
}

We'll also have the following table config:

config/table.json
{
"tableName": "movies",
"tableType": "OFFLINE",
"segmentsConfig": {
"replication": 1
},
"tenants": {
"broker":"DefaultTenant",
"server":"DefaultTenant"
},
"tableIndexConfig": {
"loadMode": "MMAP"
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY"
}
},
"metadata": {}
}

You can create the table and schema by running the following command:`

docker exec -it pinot-controller-json bin/pinot-admin.sh AddTable   \
-tableConfigFile /config/table.json \
-schemaFile /config/schema.json \
-exec

You should see a message similar to the following if everything is working correctly:

2022/02/25 09:39:00.839 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost 172.29.0.3 -controllerPort 9000 -user null -password [hidden] -exec
2022/02/25 09:39:01.308 INFO [AddTableCommand] [main] {"status":"Table movies_OFFLINE succesfully added"}

Ingestion Job

Now we’re going to import the JSON file into Pinot. We'll do this with the following ingestion spec:

executionFrameworkSpec:
name: 'standalone'
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.json'
outputDirURI: '/opt/pinot/data/movies/'
overwriteOutput: true
pinotFSSpecs:
- scheme: file
className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
dataFormat: 'json'
className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
tableName: 'movies'
pinotClusterSpecs:
- controllerURI: 'http://localhost:9000'

The import job will map fields in each JSON document to a corresponding column in the movies schema. If one of the fields doesn't exist in the schema it will be skipped.

tip

You can also apply transformation functions to JSON documents during the ingestion process. For more details, see the JSON Transformation Functions guide.

You can run the following command to run the import:

docker exec -it pinot-controller-json bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml

Querying

Once that's completed, navigate to localhost:9000/#/query and click on the movies table or copy/paste the following query:

select * 
from movies
limit 10

You will see the following output:

genreidtitleyear
Drama300441473147483650Dear John2010
Comedy332567813147483648The Ugly Truth2009
Romance346905752147483649P.S. I Love You2007
Comedy361248901147483647Valentine's Day2010
Fantasy394030854147483651The Curious Case of Benjamin Button2008

Query Results