Querying Druid in SQL with Superset
Druid SQL Interface with Calcite
Problem
- Druid is used extensively on our team and at Oath
- Druid is hard to interact with due to its JSON input format
- Many at Oath are not familiar with how to optimize Druid queries
Why use Druid?
- Able to ingest and serve data in real-time with low latency
- Good for ad-hoc queries
- Good for storing aggregate data
- Scalable to ingest millions of events/sec
Using SQL to bridge the gap
● SQL is the lingua franca of data
● Most at Oath are already familiar with SQL
● SQL is easier to write and more concise than JSON
● All BI tools we use support SQL
SQL vs Druid JSON
Here is a sample SQL query for a given dataset:
SELECT
SUM("store_sales") filter (where "store_state" = 'CA'),
SUM("store_cost") filter (where "store_state" = 'OR')
FROM
"foodmart"
WHERE
"the_month" == 'October'
LIMIT
10
The same query in Druid JSON format is much less readable
SQL vs Druid JSON
{
  "queryType": "groupBy",
  "dataSource": "foodmart",
  "granularity": "all",
  "dimensions": [],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": []
  },
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "or",
        "fields": [
          { "type": "selector", "dimension": "store_state", "value": "CA" },
          { "type": "selector", "dimension": "store_state", "value": "OR" }
        ]
      },
      {
        "type": "not",
        "field": { "type": "selector", "dimension": "the_month", "value": "October" }
      }
    ]
  },
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "store_state", "value": "CA" },
      "aggregator": { "type": "doubleSum", "name": "EXPR$0", "fieldName": "store_sales" }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "store_state", "value": "OR" },
      "aggregator": { "type": "doubleSum", "name": "EXPR$1", "fieldName": "store_cost" }
    }
  ],
  "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"]
}
Pre-existing Solutions
- Druid SQL services
- Hive Druid connection
- Apache Calcite
Druid SQL Services
- Druid has SQL support via Apache Calcite
- Pros:
- Significantly simplifies query JSON
- Already supported in Druid
- Cons:
- Support is experimental
- Doesn’t support DataSketch aggregators
curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json
{
  "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
  "context" : {"sqlTimeZone" : "America/Los_Angeles"}
}
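For illustration, the same call can be made from Python. This is a minimal sketch, assuming a broker reachable at the host/port shown on the slide; the result handling is a placeholder and not part of the original deck.
# Minimal sketch: POST a SQL query to Druid's SQL endpoint shown above.
# The broker host/port and the query text are placeholders for illustration.
import requests

broker_url = "http://BROKER:8082/druid/v2/sql/"
payload = {
    "query": (
        "SELECT COUNT(*) FROM data_source "
        "WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'"
    ),
    "context": {"sqlTimeZone": "America/Los_Angeles"},
}

response = requests.post(broker_url, json=payload)
response.raise_for_status()
print(response.json())  # Druid returns the result rows as a JSON array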
Hive Druid Connection
- Hive also has some level of Druid support via Apache Calcite
- Pros:
- Many BI tools already support Hive
- Cons:
- Lacks support for sketches
Apache Calcite
- Translator between SQL and Druid JSON
- Industry-standard SQL parser
- Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model
- Open source
Our Solution
- Use Apache Calcite directly
- Address the deficiencies of Calcite and contribute back to the open source community
Calcite relational algebra
- Relational logic tree translated from the SQL query
- Each node has a cost based on its context
- Example: SELECT SUM(a) AS c FROM table1 WHERE b=1 ORDER BY c
- Corresponding tree: TableScan on table1 → Filter (b=1) → Project (table1.a -> a) → Aggregate (SUM(a)) → Sort on c
Query Planning
- Apply rules on the relational logic tree
- Transform certain logic subtrees into Druid query nodes
- Example: TableScan → Filter → Project → Aggregate → Sort can be rewritten either as Druid GroupBy query node → Sort, or as a single Druid TopN query node
Optimization
- Use the cost model to estimate the performance of the different transformed logic trees
- The basic idea is to push more computation into Druid
- Example: Druid GroupBy query node (cost = 10) plus Sort (cost = 10) totals 20, while the single Druid TopN query node costs 15, so the TopN plan is chosen
Renderer
- Render the Druid JSON query to be sent out
- If any computation cannot be pushed into the JSON query, it runs locally in Calcite
- Example output for a Druid TopN query node:
{
  "queryType": "topN",
  "dataSource": "foodmart",
  "granularity": "all",
  …
Major Problems
- Did not support post-aggregations
  - e.g. the AVERAGE function
  - Could run out of memory
- Did not support filtered aggregations
  - Could cause Druid to return all rows, which Calcite then filters and processes in memory
- Did not support distinct count aggregators using ThetaSketches
  - Calcite will always try to give the user exact results
  - Distinct count aggregations are not pushed to Druid
Post Aggregation Support
- New rule to merge the post-aggregation node into the Druid query node
- New renderer that can generate a Druid query with post-aggregations (a sketch of such a query follows below)
- Diagram: without the rule, a plan of TableScan → Project → Aggregate leaves the outer Aggregate (the post-aggregation) outside the Druid GroupBy query node; with the new rule it is merged into the Druid GroupBy query node
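To illustrate what a query with post-aggregation pushed into Druid can look like, here is a sketch of a Druid groupBy query as a Python dict, computing an average as a sum divided by a count via an arithmetic post-aggregation. The output names (sum_sales, row_count, avg_sales) are hypothetical examples, not taken from the slides.
# Sketch of a Druid groupBy query whose average is computed as a post-aggregation.
# Output names are hypothetical; store_sales and foodmart come from the earlier example.
avg_query = {
    "queryType": "groupBy",
    "dataSource": "foodmart",
    "granularity": "all",
    "dimensions": [],
    "aggregations": [
        {"type": "doubleSum", "name": "sum_sales", "fieldName": "store_sales"},
        {"type": "count", "name": "row_count"},
    ],
    "postAggregations": [
        {
            "type": "arithmetic",
            "name": "avg_sales",
            "fn": "/",
            "fields": [
                {"type": "fieldAccess", "fieldName": "sum_sales"},
                {"type": "fieldAccess", "fieldName": "row_count"},
            ],
        }
    ],
    "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
}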
Filtered Aggregations Support
- New rule to move the Filter operation from Calcite into Druid
- Optimizations on filters:
  - New rule to extract a common filter into the outer filter
  - New rule to combine filters with logical ORs into the outer filter
- Diagram: a plan of TableScan → Filter1 → Project with two Aggregates each filtered by Filter2 is rewritten so that the common Filter2 is pulled up and combined with Filter1 in the outer filter, while the filtered aggregates are kept; the OR-combining rule is what produced the extra store_state OR filter in the earlier Druid JSON example
Performance
- Avoids unnecessary row scans in Druid
- Greatly reduces the runtime of queries when filters are involved
Why ThetaSketch
- Sketches are a class of streaming, stochastic algorithms
- Trade off accuracy for speed – orders of magnitude faster
- Exact up to configurable thresholds and approximate after
- Mathematically provable error bounds
- Bounded in space
- Set operations – union, intersect, difference
(Sketches logo from http://datasketches.github.io)
ThetaSketch Support
- New rule to translate a distinct count aggregator node into a ThetaSketch node (an example of the resulting aggregator follows below)
- Allow users to configure whether approximate cardinality is allowed
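For reference, a distinct count that is pushed down this way ends up as a thetaSketch aggregator in the Druid query. The sketch below is an illustrative Python dict; the user_id column and output name are hypothetical, and the thetaSketch aggregator requires Druid's datasketches extension.
# Sketch: a Druid timeseries query using the thetaSketch aggregator for an
# approximate distinct count (druid-datasketches extension). Column names are hypothetical.
distinct_users_query = {
    "queryType": "timeseries",
    "dataSource": "foodmart",
    "granularity": "all",
    "aggregations": [
        {"type": "thetaSketch", "name": "distinct_users", "fieldName": "user_id"}
    ],
    "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
}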
Performance
- Reduced the running time of queries with a count distinct aggregator when cardinality estimation is allowed
- Sketch columns can now be utilized
- With post-aggregation support, more operations can be applied
User Interface
- Superset is commonly used with Druid
- Superset SQL Lab is popular for SQL-like databases
(From the Superset documentation: https://superset.incubator.apache.org/)
Superset Calcite Connection
- Superset is a Python application
- A standard Python DB-API driver was created (a hypothetical usage sketch follows below)
- SQL Lab can be used to run ad-hoc queries on Druid
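The deck does not name the driver package, so the sketch below only illustrates the standard Python DB-API 2.0 shape such a driver would expose; the module name calcite_dbapi and the connection parameters are hypothetical placeholders.
# Hypothetical usage sketch of a Python DB-API 2.0 driver in front of Calcite.
# "calcite_dbapi" and the connection arguments are placeholders; the actual
# package used with Superset is not named in the slides.
import calcite_dbapi  # hypothetical module name

conn = calcite_dbapi.connect(host="calcite-host", port=8765)  # placeholder endpoint
try:
    cursor = conn.cursor()
    cursor.execute(
        'SELECT "store_state", SUM("store_sales") '
        'FROM "foodmart" GROUP BY "store_state"'
    )
    for row in cursor.fetchall():
        print(row)
finally:
    conn.close()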
Architecture (diagram): the user issues a query in Superset SQL Lab; Superset talks to Calcite over Calcite JDBC; Calcite performs parsing, planning, and any internal computation; the Druid adapter performs the query against Druid; the output flows back to the user.
Questions

Editor's Notes

  • #4: Druid emerged recently as an open-source data store designed for sub-second queries on real-time and historical data. It is usually used to query event data: it can ingest event data in real time and allows flexible data exploration and aggregation. Right now at Oath, many data scientists want to run ad-hoc queries on user event data, and Druid is their first choice. Next, we will go over how to work with Druid when we need to run an ad-hoc query.
  • #5: In this presentation, we will first go over the reasons why we want a SQL interface on top of Druid, and briefly introduce Druid and other existing solutions. After that, we will focus on the improvements needed for the SQL interface; this part includes our contributions to the open source project. Finally, we will show how we combine the SQL interface with a neat user interface so that a wider range of users can work with Druid.
  • #6: Let's say I am a data scientist and I want to run this SQL on Druid. In this SQL query, we are looking for the sum of store_sales in California and of store_cost in Oregon, excluding the data for October. Since the data is in Druid, we have to write a query that Druid understands: a JSON query sent to Druid in an HTTP request. So what does that JSON look like?
  • #7: After some experience with Druid, one becomes familiar with this format. However, someone who already knows SQL will need time with the Druid documentation, and probably with JSON syntax, to translate SQL into this format. It would be great to have a SQL interface for querying Druid data without losing Druid's performance. Let's explore some SQL interfaces that Druid can work with.
  • #9: The first option is to use the SQL service provided by Druid. It works similarly to a native Druid query: we send an HTTP request with the SQL statement in a JSON object. This feature is powered by an open source tool called Apache Calcite.
  • #10: Need more
  • #11: Obviously, Calcite is the core of the SQL interface on Druid. To develop a SQL interface for Druid, it is necessary to learn about Calcite first. We will briefly introduce Calcite and how it works, then go over the contributions we made to open source Calcite.
  • #13: The most important concept in Calcite is relational algebra. Calcite translates a SQL statement into a tree structure whose nodes represent the logic of the query. For example, the tree shown on the slide is a simple algebra tree for the SQL statement beside it. First, the TableScan pins the query to a certain table. Then the Filter contains the logic of the WHERE clause: the value of column b equals 1. The Project node is less obvious, but in many SQL databases the actual column name may contain a namespace, and the Project node takes care of that kind of translation. The Aggregate node includes the logic of the summation function we used. The last one is the Sort node, which corresponds to the ORDER BY part of the SQL statement.
  • #14: With the relational logic tree, we can now transform the tree into another tree representing equivalent logic. After transformation, the new tree can be translated into a Druid query. The transformation rules allow subtrees with equivalent logic to be transformed into each other. In the example here, the first four nodes can be transformed into a Druid GroupBy query node through the rules we specified. Along the other transformation path, the whole tree can be transformed into a Druid TopN query node that contains equivalent logic. Now the question is how to pick a certain tree as our final result.
  • #15: The final result is determined by the optimizer in Calcite. In Calcite, each node has a cost that quantifies the computational power required to run its logic. Different output trees therefore have different costs, and the result is the final tree with the minimum cost. Back to our example, the cost of the final tree on the top is 10 + 10 = 20, but the other output tree has only one node with a cost of 15. The optimizer will therefore pick the Druid TopN query node as the final result. Now the final job is to render the Druid JSON query from the logic we have in the Druid query node.
  • #16: Now the final job is to render the Druid JSON query from the logic we have in the Druid query node. Basically it is like a JSON writer that generates the query based on the information in the node.
  • #17: The whole idea of Calcite is brilliant, but at the time we tried to use it, it still had missing pieces that affected performance. We therefore decided to contribute to the open source repository to enhance Calcite. These are two major problems we worked on. First, Calcite at that time did not support post-aggregation, so part of the computation in functions like AVERAGE was performed in Calcite instead of Druid. Moving that computation to Druid saves memory and time when running queries. The second is support for filtered aggregations. Without it, Calcite may ask Druid to query all rows and do the filtering on the local Calcite machine, which can cause a huge drop in performance when filtered aggregations are involved.
  • #18: To add post-aggregation support, we added new rules to merge the post-aggregation node into the supported Druid query node, so the final tree can include post-aggregation logic. A new renderer is also needed to render queries with post-aggregators. Now, when possible, the post-aggregation is pushed into the Druid query node.
  • #19: Similar to the post-aggregation support, we added new rules to deal with the filter node right before the aggregate node. The general idea is to move inner filters into the outer filter.