UK House Prices

UK House Prices

This dataset contains the price paid for houses sold in the UK from 2015 until 2022. You can learn more about the dataset at Price Paid Data (opens in a new tab)

Prerequisites

To follow the code examples in this guide, you must install Docker (opens in a new tab) locally and download recipes.

Navigate to recipe

  1. If you haven't already, download recipes.
  2. In terminal, go to the recipe by running the following command:
cd pinot-recipes/recipes/house-prices-dataset

Download data

The dataset contains .csv files with house sales from specific years. You will need to download at least one of these ffiles from https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#yearly-file (opens in a new tab) and put it in the raw directory.

You'll then need to call clean_csv.py with the name of the CSV file in the raw directory. For example, if you download the 2021 dataset, you'd run this command, passing in pp-2021.csv:

python clean_csv.py pp-2021.csv

The CSV file will have a header line appended to the front and will be written to the data directory.

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

 
docker compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and Zookeeper. You can find the docker-compose.yml (opens in a new tab) file on GitHub.

Pinot Schema and Table

Now let's create a Pinot Schema and Table.

First, the schema:

{
  "schemaName": "house_prices",
  "dimensionFieldSpecs": [
    {"name": "UniqueTransactionIdentifier", "dataType": "STRING"},
    {"name": "Postcode", "dataType": "STRING"},
    {"name": "PropertyType", "dataType": "STRING"},
    {"name": "OldNew", "dataType": "STRING"},
    {"name": "Duration", "dataType": "STRING"},
    {"name": "PAON", "dataType": "STRING"},
    {"name": "SAON", "dataType": "STRING"},
    {"name": "Street", "dataType": "STRING"},
    {"name": "Locality", "dataType": "STRING"},
    {"name": "TownCity", "dataType": "STRING"},
    {"name": "District", "dataType": "STRING"},
    {"name": "County", "dataType": "STRING"},
    {"name": "PPDCategoryType", "dataType": "STRING"},
    {"name": "RecordStatus", "dataType": "STRING"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "TransferDate",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "metricFieldSpecs": [{"name": "Price", "dataType": "INT"}]
}

config/schema.json

We'll also have the following table config:

{
  "tableName": "house_prices",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "schemaName": "house_prices",
    "timeColumnName": "TransferDate"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "TransferDate",
        "transformFunction": "FromDateTime(DateOfTransfer, 'YYYY-MM-dd HH:mm')"
      }
    ]
  },
  "metadata": {}
}

config/table.json

You can create the table and schema by running the following command:`

docker run \
   --network houseprices \
   -v $PWD/config:/config \
   apachepinot/pinot:1.0.0 AddTable \
     -schemaFile /config/schema.json \
     -tableConfigFile /config/table.json \
     -controllerHost "pinot-controller-houseprices" \
    -exec

Ingesting CSV files

We're going to use Pinot's SQL insert syntax to import the CSV files. Navigate to the Pinot UI (opens in a new tab) and click on Query Console You can then run the following query to run the ingestion job:

SET taskName = 'houseprices-dataload';
SET input.fs.className = 'org.apache.pinot.spi.filesystem.LocalPinotFS';
SET includeFileNamePattern='glob:**/*.csv';
INSERT INTO house_prices 
FROM FILE 'file:///data/';

Wait a few minutes for the job to complete. You can check on the status by navigating to the Minion Task Manager (opens in a new tab) page.

Querying the dataset

Once the data is loaded, we can run some queries.

We can find the highest average prices paid by county:

select County, avg(Price) AS averagePrice, max(Price) AS maxPrice, count(*) AS numberOfSales
from house_prices
group by County
order by averagePrice DESC
limit 10
CountyaveragePricenumberOfSales
GREATER LONDON820453241466
WINDSOR AND MAIDENHEAD7137895637
SURREY67417347885
BUCKINGHAMSHIRE59256921111
HERTFORDSHIRE56259642077
OXFORDSHIRE55395526880
WOKINGHAM5303177072
SLOUGH5224383314
BRIGHTON AND HOVE51856810847
BRACKNELL FOREST5016404986

Or find the most expensive property sold by town/city in 2021:

select TownCity, County, avg(Price) AS averagePrice, mode(Price), 
       max(Price) AS maxPrice, 
       count(*) AS numberOfSales
from house_prices
WHERE year(TransferDate) = 2021
group by TownCity, County
order by maxPrice DESC
limit 10
TownCityCountymaxPricenumberOfSales
LONDONGREATER LONDON52300000081988
OXFORDOXFORDSHIRE4141086602795
MANCHESTERGREATER MANCHESTER29200000019965
RETFORDNOTTINGHAMSHIRE186586777864
GREENHITHEKENT168000000411
BRISTOLCITY OF BRISTOL1661053019560
SLOUGHSLOUGH1550000001946
MILTON KEYNESMILTON KEYNES1467500005565
LUTTERWORTHLEICESTERSHIRE135087821686
WEMBLEYGREATER LONDON130500000907