Getting started with Druid: A high-performance, column-oriented, distributed data store.

The goal of this post is twofold:

  1. Generate arbitrary data from http://www.json-generator.com/ and ingest it into Druid.
  2. Perform filtering and aggregation on that data.

We'll start with timeseries data and then look for ways to work with non-timeseries data.

Generating data

I used the following JSON spec to generate data from json-generator:
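The original template isn't reproduced here, so below is a hypothetical one in json-generator's tag syntax. The field names (registered, gender, eyeColor, balance) are illustrative choices, not taken from the post; the key point is that the records need a timestamp field for Druid, plus a few string dimensions and a numeric field to aggregate.

```
[
  '{{repeat(1000)}}',
  {
    registered: '{{date(new Date(2016, 0, 1), new Date(2016, 11, 31), "YYYY-MM-ddThh:mm:ss")}}',
    gender: '{{gender()}}',
    eyeColor: '{{random("blue", "brown", "green")}}',
    balance: '{{floating(0, 4000, 2)}}'
  }
]
```

json-generator expands each `{{...}}` tag per record, so this yields 1000 JSON objects, one per line once flattened, which is the shape Druid's batch ingestion expects.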

Ingestion

We generated sample data from json-generator.com.
Now we need to write the index file — an ingestion spec, i.e. a metadata document that Druid uses to ingest the files.

The ingestion spec is a JSON document of the following structure:

The spec is broadly composed of three sections:

{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}

dataSchema: specifies the dimensions and the timestamp attribute of the dataset.
ioConfig: specifies the path to the source data.
tuningConfig: Hadoop tuning configuration.
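The post doesn't show the full index.json, so here is a sketch of what a Hadoop batch ingestion spec for this data could look like. The dataSource name "out" is taken from the task id in the output below; the dimension and timestamp column names (registered, gender, eyeColor) and the input path are assumptions carried over from the hypothetical generator template, not from the original post.

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "out",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "registered", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["gender", "eyeColor"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2016-01-01/2017-01-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/path/to/sample.json" }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Note that the whole spec is wrapped in a task object with "type": "index_hadoop"; the dataSchema/ioConfig/tuningConfig triple shown above lives under its "spec" key.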

POST the ingestion spec to the Overlord (listening on port 8090 by default):
curl -X POST -H 'Content-Type: application/json' -d @index.json localhost:8090/druid/indexer/v1/task

You should get an output like:

{"task":"index_hadoop_out_2016-12-04T17:13:17.220Z"}

Go to http://localhost:8090/console.html to see the status of your job.

Querying

Here we run a sample query.
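The query itself isn't shown in the post; as a sketch, a native timeseries query against the ingested dataSource ("out", per the task id above) that exercises both goals — filtering and aggregation — could look like this. The dimension name "gender" and the interval are assumptions matching the hypothetical spec earlier, not values from the original post.

```json
{
  "queryType": "timeseries",
  "dataSource": "out",
  "granularity": "day",
  "filter": { "type": "selector", "dimension": "gender", "value": "female" },
  "aggregations": [ { "type": "longSum", "name": "rows", "fieldName": "count" } ],
  "intervals": ["2016-01-01/2017-01-01"]
}
```

This counts matching rows per day, restricted by the selector filter to one dimension value.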

We can use curl to POST this query to the Druid Broker (listening on port 8082 by default):
curl -X POST -H 'Content-Type: application/json' -d @query.json localhost:8082/druid/v2/?pretty

The output would be:
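The actual results aren't reproduced here, but a Druid timeseries query returns a JSON array with one entry per granularity bucket, each holding a timestamp and a result object keyed by the aggregator names. The values below are made up purely to illustrate the shape:

```json
[
  { "timestamp": "2016-01-01T00:00:00.000Z", "result": { "rows": 12 } },
  { "timestamp": "2016-01-02T00:00:00.000Z", "result": { "rows": 9 } }
]
```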

About Ganesh
