
Import data files from different directories

In this recipe we'll learn how to import CSV files from different directories.

Prerequisites

You will need to install Docker locally to follow the code examples in this guide.

Download Recipe

First, clone the GitHub repository to your local machine and navigate to this recipe:

git clone git@github.com:startreedata/pinot-recipes.git
cd pinot-recipes/recipes/csv-files-spaces-column-names

If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

docker-compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.
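
Before moving on, it's worth checking that all four containers came up. This is an optional sanity check, and the exact service names depend on the docker-compose.yml in the recipe:

# Optional: confirm that the controller, broker, server, and Zookeeper are all running
docker-compose ps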

Dataset

We're going to import the following CSV files:

input/2000_2009/2000_2009.csv
title,genre,year,id
"The Da Vinci Code","Thriller",2006,331567813147483648
"Ice Age: The Meltdown","Children",2006,337567813147483648
"P.S. I Love You","Romance",2007,346905752147483649
"The Curious Case of Benjamin Button","Fantasy",2008,394030854147483651
"Harry Potter and the Half-Blood Prince","Fantasy",2009,323030454547483651
"The Ugly Truth","Comedy",2009,332567813147483648

input/2010_2019/2010_2019.csv
title,genre,year,id
"Dear John","Drama",2010,300441473147483650
"Valentine's Day","Comedy",2010,361248901147483647
"Minions","Comedy",2015,333567813147483648

input/2020_present/2020_present.csv
title,genre,year,id
"Uncharted","Action",2022,493030854147483651

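Note that each file lives in a different sub-directory under input, which is the whole point of this recipe. If you want to confirm the layout locally, a quick check from the recipe directory looks like this (find is just one way to do it):

# List every CSV file under the input directory, one per sub-directory
find input -name '*.csv'

# input/2000_2009/2000_2009.csv
# input/2010_2019/2010_2019.csv
# input/2020_present/2020_present.csv
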
Pinot Schema and Table

Now let's create a Pinot Schema and Table.

A common pattern when creating a schema is to create columns that map directly to the names of the fields in our data source, which is exactly what we'll do here:

config/schema.json
{
  "schemaName": "movies",
  "dimensionFieldSpecs": [
    {
      "name": "id",
      "dataType": "LONG"
    },
    {
      "name": "title",
      "dataType": "STRING"
    },
    {
      "name": "genre",
      "dataType": "STRING"
    },
    {
      "name": "year",
      "dataType": "INT"
    }
  ]
}

We'll also have the following table config:

config/table.json
{
  "tableName": "movies",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    },
    "transformConfigs": []
  },
  "metadata": {}
}

You can create the table and schema by running the following command:

docker exec -it pinot-controller-csv bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table.json \
  -schemaFile /config/schema.json \
  -exec

You should see a message similar to the following if everything is working correctly:

2021/11/25 12:02:04.606 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost 192.168.144.3 -controllerPort 9000 -user null -password [hidden] -exec
2021/11/25 12:02:05.084 INFO [AddTableCommand] [main] {"status":"Table movies_OFFLINE succesfully added"}
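
If you want to double-check that the table and schema really were created, you can also ask the controller's REST API directly. These are standard Pinot controller endpoints, assuming the controller is reachable on localhost:9000 as in this docker-compose setup:

# Should include "movies" in the list of schemas
curl -s localhost:9000/schemas

# Should include "movies" in the list of tables (stored as movies_OFFLINE)
curl -s localhost:9000/tables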

Ingestion Job

Now we're going to import the CSV files into Pinot. We'll do this with the following ingestion spec:

config/job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
segmentNameGeneratorSpec:
  configs:
    use.global.directory.sequence.id: true
jobType: SegmentCreationAndTarPush
inputDirURI: '/input'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/data'
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'movies'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

The part of this file that we're most interested in is the following section:

segmentNameGeneratorSpec:
  configs:
    use.global.directory.sequence.id: true

This config makes sure that the sequence id (which is used in the segment name) is incremented globally across all directories, rather than restarting for each directory.

If we don't set this config, segments created from CSV files in different directories all get the same sequence id, so each segment replaces the one created from the previous CSV file. We would end up with only one segment, containing the rows from the last CSV file that got processed.

info

We only need to use this config when importing data for tables that don't have a timestamp field. If a timestamp field is specified, timestamps are used in the segment name and segments are much less likely to be overwritten.

You can run the following command to run the import:

docker exec -it pinot-controller-csv bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /config/job-spec.yml
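
Once the job has finished, a quick way to confirm that all three CSV files were ingested is to list the table's segments via the controller API. You should see three segments, one per CSV file; the names themselves depend on Pinot's segment name generator, so the ones in the comment are only illustrative:

# Lists the segments for the movies table - expect three of them,
# e.g. movies_OFFLINE_0, movies_OFFLINE_1, movies_OFFLINE_2.
# Without use.global.directory.sequence.id we'd be left with just one.
curl -s localhost:9000/segments/movies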

Querying

Once that's completed, navigate to localhost:9000/#/query and click on the movies table or copy/paste the following query:

select * 
from movies
limit 10

You will see the following output:

genre | id | title | year
Action | 493030854147483651 | Uncharted | 2022
Drama | 300441473147483650 | Dear John | 2010
Comedy | 361248901147483647 | Valentine's Day | 2010
Comedy | 333567813147483648 | Minions | 2015
Thriller | 331567813147483648 | The Da Vinci Code | 2006
Children | 337567813147483648 | Ice Age: The Meltdown | 2006
Romance | 346905752147483649 | P.S. I Love You | 2007
Fantasy | 394030854147483651 | The Curious Case of Benjamin Button | 2008
Fantasy | 323030454547483651 | Harry Potter and the Half-Blood Prince | 2009
Comedy | 332567813147483648 | The Ugly Truth | 2009

Query Results
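
You can also run queries against Pinot's REST API rather than the query console. The example below posts a SQL query to the broker; it assumes the broker's HTTP port (8099 by default) is published to the host by the docker-compose file, so adjust the port if yours differs:

# Count the ingested movies per year - seeing years from 2006 through 2022
# confirms that rows from all three directories made it into the table
curl -s -X POST http://localhost:8099/query/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "select year, count(*) from movies group by year order by year"}'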