How to handle null values
In this recipe we'll learn how to handle null or missing values in Apache Pinot tables.
Pinot Version | 0.10.0 |
Code | startreedata/pinot-recipes/null-values |
Pre-requisites
You will need to install Docker locally to follow the code examples in this guide.
Download Recipe
First, clone the GitHub repository to your local machine and navigate to this recipe:
git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/null-values
If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
docker-compose up
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.
Dataset
We're going to import the following JSON file:
{"title": "Valentine's Day", "genre": "Comedy", "year": 2010, "id": 361248901147483647}
{"title": "The Ugly Truth", "year": 2009, "id": 332567813147483648}
{"title": "P.S. I Love You", "genre": "Romance", "year": 2007, "id": 346905752147483649}
{"title": "Dear John", "year": 2010, "id": 300441473147483650}
{"title": "The Curious Case of Benjamin Button", "genre": "Fantasy", "year": 2008, "id": 394030854147483651}
The Ugly Truth
and Dear John
are both missing a value for genre
, which we'll explore in this guide.
Pinot Schema and Table
Now let's create a Pinot Schema and Table.
First, the schema:
{
"schemaName": "movies",
"dimensionFieldSpecs": [
{
"name": "id",
"dataType": "LONG"
},
{
"name": "title",
"dataType": "STRING"
},
{
"name": "genre",
"dataType": "STRING"
},
{
"name": "year",
"dataType": "INT"
}
]
}
You can create the schema by running the following command:
docker exec -it pinot-controller-json bin/pinot-admin.sh AddSchema \
-schemaFile /config/schema.json \
-exec
We're also going to create two tables: one that allows null values and one that doesn't.
This table doesn't handle null values:
{
"tableName": "movies_no_nulls",
"tableType": "OFFLINE",
"segmentsConfig": {
"replication": 1,
"schemaName": "movies"
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"tableIndexConfig": {
"loadMode": "MMAP"
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY"
}
},
"metadata": {}
}
You can create the table by running the following command:`
docker exec -it pinot-controller-json bin/pinot-admin.sh AddTable \
-tableConfigFile /config/table_no_nulls.json \
-exec
And this table allows null values:
{
"tableName": "movies_nulls",
"tableType": "OFFLINE",
"tableIndexConfig": {
"loadMode": "MMAP",
"nullHandlingEnabled": "true"
},
"segmentsConfig": {
"replication": 1,
"schemaName": "movies"
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY"
}
},
"metadata": {}
}
The highlighted config is how we indicate that we want this table to have null values.
You can create the table by running the following command:`
docker exec -it pinot-controller-json bin/pinot-admin.sh AddTable \
-tableConfigFile /config/table_nulls.json \
-exec
Ingestion Job
Now we’re going to import the JSON file into these tables. We'll do this with the following ingestion spec:
executionFrameworkSpec:
name: 'standalone'
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.json'
outputDirURI: '/opt/pinot/data/${tableName}'
overwriteOutput: true
pinotFSSpecs:
- scheme: file
className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
dataFormat: 'json'
className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
tableName: ${tableName}
pinotClusterSpecs:
- controllerURI: 'http://localhost:9000'
The import job will map fields in each JSON document to a corresponding column in the movies
schema.
You can run the following command to run the import on the movies_no_null
table:
docker exec -it pinot-controller-json bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml \
-values tableName='movies_no_nulls'
And the following to run the import on the movies_nuls
table:
docker exec -it pinot-controller-json bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml \
-values tableName='movies_nulls'
Querying
Once that's completed, navigate to localhost:9000/#/query and run the following query to return the rows that have a genre in the movies_no_nulls
table:
select *
from movies_no_nulls
WHERE genre IS NOT NULL
You will see the following output:
genre | id | title | year |
---|---|---|---|
Comedy | 361248901147483647 | Valentine's Day | 2010 |
null | 332567813147483648 | The Ugly Truth | 2009 |
Romance | 346905752147483649 | P.S. I Love You | 2007 |
null | 300441473147483650 | Dear John | 2010 |
Fantasy | 394030854147483651 | The Curious Case of Benjamin Button | 2008 |
Query Results
We can see from the results that the null genres haven't been filtered out for the movies_no_nulls
table.
This is because Pinot uses a default value of "null"
for string columns.
If we want to filter these null values, we'd have to run the following query:
select *
from movies_no_nulls
WHERE genre <> 'null'
The genre
column in the movies_nulls
table, on the other hand, supports null values, which we can see by running the following query:
select *
from movies_nulls
WHERE genre IS NOT NULL
You will see the following output:
genre | id | title | year |
---|---|---|---|
Comedy | 361248901147483647 | Valentine's Day | 2010 |
Romance | 346905752147483649 | P.S. I Love You | 2007 |
Fantasy | 394030854147483651 | The Curious Case of Benjamin Button | 2008 |
Query Results