Importing CSV files with columns containing spaces

In this recipe we'll learn how to use transformation functions to ingest a CSV file whose column names contain spaces.

Prerequisites

To follow the code examples in this guide, you must install Docker locally and download the recipes repository.

Navigate to recipe

  1. If you haven't already, download the recipes repository.
  2. In a terminal, navigate to this recipe by running the following command:
cd pinot-recipes/recipes/csv-files-spaces-column-names

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

docker-compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml file on GitHub.

💡

The docker-compose command will fail on Mac M1/M2 machines. To fix this, add platform: linux/amd64 under each service:

docker-compose.yml
version: '3.7'
services:
  zookeeper:
    platform: linux/amd64
    ...
...

Dataset

We're going to import the following CSV file, in which the Case Number column contains a space:

ID         Case Number
10224738   HY411648
10224739   HY411615
11646166   JC213529
10224740   HY411595

data/import.csv

Pinot Schema and Table

Now let's create a Pinot Schema and Table.

A common pattern when creating a schema is to create columns that map directly to the names of the fields in our data source. We can't do that in this case since column names can't contain spaces, so instead we'll have the following:

{
    "schemaName": "crimes",
    "dimensionFieldSpecs": [
      {
        "name": "ID",
        "dataType": "INT"
      },
      {
        "name": "CaseNumber",
        "dataType": "STRING"
      }
    ]
}

config/schema.json

We'll also have the following table config:

{
    "tableName": "crimes",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "replication": 1
    },
    "tenants": {
      "broker":"DefaultTenant",
      "server":"DefaultTenant"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP"
    },
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "APPEND",
        "segmentIngestionFrequency": "DAILY"
      },
      "transformConfigs": [
        {"columnName": "CaseNumber", "transformFunction": "\"Case Number\"" }
      ]
    },
    "metadata": {}
}

config/table.json

💡

The entry under ingestionConfig.transformConfigs makes sure that data in the Case Number field in the data source is ingested into the CaseNumber column of the table. To learn more about writing these functions, see the ingestion transformation documentation.
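
The same quoting trick works for any field whose name contains spaces: wrap the source field name in escaped double quotes inside the transform function. As a sketch, if the source file also had a hypothetical Primary Type field (not part of this recipe's dataset), the config would look like this:

"transformConfigs": [
  {"columnName": "CaseNumber", "transformFunction": "\"Case Number\""},
  {"columnName": "PrimaryType", "transformFunction": "\"Primary Type\""}
]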

You can create the table and schema by running the following command:

docker run \
   --network csv \
   -v $PWD/config:/config \
   apachepinot/pinot:1.0.0 AddTable \
     -schemaFile /config/schema.json \
     -tableConfigFile /config/table.json \
     -controllerHost "pinot-controller-csv" \
     -exec

You should see a message similar to the following if everything is working correctly:

2021/11/25 12:02:04.606 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost 192.168.144.3 -controllerPort 9000 -user null -password [hidden] -exec
2021/11/25 12:02:05.084 INFO [AddTableCommand] [main] {"status":"Table crimes_OFFLINE succesfully added"}
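
You can also sanity-check that the schema and table were created by calling the controller's REST API. This assumes the controller's port 9000 is published to the host, which is how we'll reach the query console later in this recipe:

curl http://localhost:9000/schemas/crimes
curl http://localhost:9000/tables/crimes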

Ingestion Job

Now we’re going to import the CSV file into Pinot. We'll do this with the following ingestion spec:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.csv'
outputDirURI: '/opt/pinot/data/crimes/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'crimes'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller-csv:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

config/job-spec.yml
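
Our CSV file uses the default comma delimiter and has a header row, so the record reader needs no extra configuration. If your file were different, recordReaderSpec accepts a configs block; the following is a sketch with illustrative values rather than part of this recipe:

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ';'
    multiValueDelimiter: '|'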

Run the following command to launch the import job:

docker run \
   --network csv \
   -v $PWD/config:/config \
   -v $PWD/data:/data \
   apachepinot/pinot:1.0.0 LaunchDataIngestionJob \
  -jobSpecFile /config/job-spec.yml
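
Once the job completes, one way to confirm that a segment was created is to list the table's segments via the controller API (again assuming port 9000 is reachable on the host):

curl http://localhost:9000/segments/crimes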

Querying

Once that's completed, navigate to localhost:9000/#/query and click on the crimes table or copy/paste the following query:

select * 
from crimes 
limit 10

You will see the following output:

CaseNumber   ID
HY411648     10224738
HY411615     10224739
JC213529     11646166
HY411595     10224740

Query Results
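
Since the space-containing source field is now an ordinary CaseNumber column, you can filter on it like any other column. For example, using one of the case numbers from the sample data:

select ID, CaseNumber
from crimes
where CaseNumber = 'HY411648'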