Ingesting JSON files
In this recipe, we'll learn how to ingest JSON documents from a newline-delimited JSON (jsonlines) file.
Mark Needham (opens in a new tab) shows how to ingest JSON files
To automatically ingest JSON data from Amazon S3 to Pinot, check out this video (opens in a new tab).
Pinot Version | 1.0.0 |
Code | startreedata/pinot-recipes/ingest-json-files |
Prerequisites
To follow the code examples in this guide, you must install Docker (opens in a new tab) locally and download recipes.
Navigate to recipe
- If you haven't already, download recipes.
- In terminal, go to the recipe by running the following command:
cd pinot-recipes/recipes/ingest-json-files
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
docker-compose up
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml (opens in a new tab) file on GitHub.
Dataset
We're going to import the following JSON file:
data/import.json
{"title": "Valentine's Day", "genre": "Comedy", "year": 2010, "id": 361248901147483647}
{"title": "The Ugly Truth", "genre": "Comedy", "year": 2009, "id": 332567813147483648}
{"title": "P.S. I Love You", "genre": "Romance", "year": 2007, "id": 346905752147483649}
{"title": "Dear John", "genre": "Drama", "year": 2010, "id": 300441473147483650}
{"title": "The Curious Case of Benjamin Button", "genre": "Fantasy", "year": 2008, "id": 394030854147483651}
Pinot Schema and Table
Now let's create a Pinot Schema and Table.
First, the schema:
config/schema.json
{
"schemaName": "movies",
"dimensionFieldSpecs": [
{
"name": "id",
"dataType": "LONG"
},
{
"name": "title",
"dataType": "STRING"
},
{
"name": "genre",
"dataType": "STRING"
},
{
"name": "year",
"dataType": "INT"
}
]
}
We'll also have the following table config:
config/table.json
{
"tableName": "movies",
"tableType": "OFFLINE",
"segmentsConfig": {
"replication": 1
},
"tenants": {
"broker":"DefaultTenant",
"server":"DefaultTenant"
},
"tableIndexConfig": {
"loadMode": "MMAP"
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY"
}
},
"metadata": {}
}
You can create the table and schema by running the following command:`
docker run \
--network json \
-v $PWD/config:/config \
apachepinot/pinot:1.0.0 AddTable \
-tableConfigFile /config/table.json \
-schemaFile /config/schema.json \
-controllerHost "pinot-controller-json" \
-exec
You should see a message similar to the following if everything is working correctly:
2022/02/25 09:39:00.839 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost 172.29.0.3 -controllerPort 9000 -user null -password [hidden] -exec
2022/02/25 09:39:01.308 INFO [AddTableCommand] [main] {"status":"Table movies_OFFLINE succesfully added"}
Ingestion Job
Now we’re going to import the JSON file into Pinot. We'll do this with the following ingestion spec:
executionFrameworkSpec:
name: 'standalone'
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.jsonl'
outputDirURI: '/opt/pinot/data/movies/'
overwriteOutput: true
pinotFSSpecs:
- scheme: file
className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
dataFormat: 'json'
className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
tableName: 'movies'
pinotClusterSpecs:
- controllerURI: 'http://pinot-controller-json:9000'
pushJobSpec:
pushParallelism: 2
pushAttempts: 2
The import job will map fields in each JSON document to a corresponding column in the movies
schema.
If one of the fields doesn't exist in the schema it will be skipped.
You can also apply transformation functions to JSON documents during the ingestion process. For more details, see the JSON Transformation Functions guide.
You can run the following command to run the import:
docker run \
--network json \
-v $PWD/config:/config \
-v $PWD/data:/data \
apachepinot/pinot:1.0.0 LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml
Querying
Once that's completed, navigate to localhost:9000/#/query (opens in a new tab) and click on the movies
table or copy/paste the following query:
select *
from movies
limit 10
You will see the following output:
genre | id | title | year |
---|---|---|---|
Drama | 300441473147483650 | Dear John | 2010 |
Comedy | 332567813147483648 | The Ugly Truth | 2009 |
Romance | 346905752147483649 | P.S. I Love You | 2007 |
Comedy | 361248901147483647 | Valentine's Day | 2010 |
Fantasy | 394030854147483651 | The Curious Case of Benjamin Button | 2008 |
Query Results