INTRODUCTION TO APACHE CALCITE
JORDAN HALTERMAN
WHAT IS APACHE CALCITE?
What is Apache Calcite?
• A framework for building SQL databases
• Developed over more than ten years
• Written in Java
• Previously known as Optiq
• Previously known as Farrago
• Entered the Apache Incubator in 2014; top-level project since 2015
• Led by Julian Hyde at Hortonworks
Projects using Calcite
• Apache Hive
• Apache Drill
• Apache Flink
• Apache Phoenix
• Apache Samza
• Apache Storm
• Apache everything…
What is Apache Calcite?
• SQL parser
• SQL validation
• Query optimizer
• SQL generator
• Data federator
Stages of query execution
• 01 Parse - Queries are parsed using a JavaCC-generated parser
• 02 Validate - Queries are validated against known database metadata
• 03 Optimize - Logical plans are optimized and converted into physical expressions
• 04 Execute - Physical plans are converted into application-specific executions
COMPONENTS
Components of Calcite
• Catalog - Defines metadata and namespaces
that can be accessed in SQL queries
• SQL parser - Parses valid SQL queries into an
abstract syntax tree (AST)
• SQL validator - Validates abstract syntax trees
against metadata provided by the catalog
• Query optimizer - Converts AST into logical
plans, optimizes logical plans, and converts
logical expressions into physical plans
• SQL generator - Converts physical plans to
SQL
CATALOG
Calcite Catalog
• Defines namespaces that can be accessed in Calcite queries
• Schema
  • A collection of schemas and tables
  • Can be arbitrarily nested
• Table
  • Represents a single data set
  • Fields defined by a RelDataType
• RelDataType
  • Represents fields in a data set
  • Supports all SQL data types, including structs and arrays
RelDataType
• Represents the data type of an object
• Supports all SQL data types, including
structs and arrays
• Similar to Spark’s DataType
Statistic
• Provides table statistics used in optimization
Usage of the Calcite catalog
[Code slide not preserved; its callouts marked the schema, table, data type, and field]
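Since the original code slide did not survive, here is a minimal sketch of the catalog idea: nested schemas that hold tables, and a table whose fields carry type names. It is a toy model, not Calcite's Schema, Table, or RelDataType API; every class name, the `sales.eu.users` path, and the field names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy catalog, NOT Calcite's API: all names here are hypothetical.
final class ToyCatalogDemo {
    public static void main(String[] args) {
        ToySchema root = ToyCatalog.sample();
        ToyTable users = root.resolve(Arrays.asList("sales", "eu", "users"));
        System.out.println(users.fields);
    }
}

final class ToyTable {
    // Field name -> SQL type name, in declaration order (stands in for a row type).
    final Map<String, String> fields = new LinkedHashMap<>();

    ToyTable field(String name, String sqlType) {
        fields.put(name, sqlType);
        return this;
    }
}

final class ToySchema {
    final Map<String, ToySchema> subSchemas = new LinkedHashMap<>();
    final Map<String, ToyTable> tables = new LinkedHashMap<>();

    // Walks a qualified name such as sales.eu.users: every part but the last
    // names a nested schema, the last part names a table.
    ToyTable resolve(List<String> path) {
        ToySchema current = this;
        for (String part : path.subList(0, path.size() - 1)) {
            current = current.subSchemas.get(part);
            if (current == null) {
                throw new IllegalArgumentException("Unknown schema: " + part);
            }
        }
        ToyTable table = current.tables.get(path.get(path.size() - 1));
        if (table == null) {
            throw new IllegalArgumentException("Unknown table: " + path);
        }
        return table;
    }
}

final class ToyCatalog {
    // Builds root -> sales -> eu -> users(id INTEGER, name VARCHAR).
    static ToySchema sample() {
        ToySchema root = new ToySchema();
        ToySchema sales = new ToySchema();
        ToySchema eu = new ToySchema();
        root.subSchemas.put("sales", sales);
        sales.subSchemas.put("eu", eu);
        eu.tables.put("users", new ToyTable().field("id", "INTEGER").field("name", "VARCHAR"));
        return root;
    }
}
```

A validator could use a lookup like `resolve` to check that every table named in a query exists before planning starts.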
SQL PARSER
Calcite SQL parser
• LL(k) parser generated with JavaCC
• Input queries are parsed into an abstract
syntax tree (AST)
• AST nodes are represented in Calcite by
SqlNode
• SqlNode can also be converted back to a
SQL string via the unparse method
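To make the parse and unparse round trip concrete, here is a small hand-written LL(1) parser for a tiny grammar (`SELECT ident (',' ident)* FROM ident`). It is only a sketch: Calcite's parser is generated by JavaCC and handles full SQL, and the `ToySelectParser` and `ToySelect` names are hypothetical stand-ins rather than Calcite classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hand-written LL(1) parser for: SELECT ident (',' ident)* FROM ident
// A toy stand-in for a generated parser; none of these names are Calcite's.
final class ToySelectParser {
    private final List<String> tokens;
    private int pos = 0;

    ToySelectParser(String sql) {
        // Naive tokenizer: pad commas with spaces, then split on whitespace.
        this.tokens = Arrays.asList(sql.trim().replace(",", " , ").split("\\s+"));
    }

    static ToySelect parse(String sql) {
        return new ToySelectParser(sql).select();
    }

    private ToySelect select() {
        expectKeyword("SELECT");
        List<String> columns = new ArrayList<>();
        columns.add(identifier());
        while (peekIs(",")) {   // one token of lookahead decides the branch
            pos++;
            columns.add(identifier());
        }
        expectKeyword("FROM");
        String table = identifier();
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + tokens.get(pos));
        }
        return new ToySelect(columns, table);
    }

    private boolean peekIs(String text) {
        return pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(text);
    }

    private void expectKeyword(String keyword) {
        if (!peekIs(keyword)) {
            throw new IllegalArgumentException("Expected " + keyword + " at token " + pos);
        }
        pos++;
    }

    private String identifier() {
        boolean reserved = peekIs("SELECT") || peekIs("FROM");
        if (pos >= tokens.size() || reserved
                || !tokens.get(pos).matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Expected identifier at token " + pos);
        }
        return tokens.get(pos++);
    }

    public static void main(String[] args) {
        System.out.println(parse("select a,b from t").unparse());
    }
}

// AST node for the parsed statement; unparse() turns it back into SQL text.
final class ToySelect {
    final List<String> columns;
    final String table;

    ToySelect(List<String> columns, String table) {
        this.columns = columns;
        this.table = table;
    }

    String unparse() {
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }
}
```

Note that unparsing normalizes the text (keyword case, spacing) rather than reproducing the input byte for byte.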
JavaCC
• Java Compiler Compiler
• Created in 1996 at Sun Microsystems
• Generates Java code from a domain-
specific language
• ANTLR is the modern alternative used in
projects like Hive and Drill
• JavaCC has sparse documentation
[JavaCC grammar example not preserved; its callouts marked tokens, alternatives (or), function calls, and embedded Java code]
SqlNode
• SqlNode represents an element in an
abstract syntax tree
[AST example not preserved; its callouts marked the select node, identifiers, an operator, and a data type]
• SqlNode’s unparse method converts a
SQL element back into a string
• SqlDialect indicates the capitalization
and quoting rules of specific databases
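As a rough illustration of why unparsing takes a dialect, the sketch below renders the same tree with two different quoting and capitalization policies. It is a toy under stated assumptions: `ToyDialect` models only quote characters and upper-casing, both example dialects are invented for illustration, and Calcite's real SqlDialect covers much more.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Toy AST node whose unparse output depends on the dialect it is given.
// Hypothetical names; this is not Calcite's SqlNode.
final class ToySelectNode {
    final List<String> columns;
    final String table;

    ToySelectNode(List<String> columns, String table) {
        this.columns = columns;
        this.table = table;
    }

    String unparse(ToyDialect dialect) {
        String cols = columns.stream().map(dialect::quote).collect(Collectors.joining(", "));
        return "SELECT " + cols + " FROM " + dialect.quote(table);
    }

    public static void main(String[] args) {
        ToySelectNode node = new ToySelectNode(Arrays.asList("id", "name"), "users");
        System.out.println(node.unparse(ToyDialect.BACKTICK));
        System.out.println(node.unparse(ToyDialect.DOUBLE_QUOTE_UPPER));
    }
}

// Toy dialect: only models identifier quoting and case folding.
final class ToyDialect {
    static final ToyDialect DOUBLE_QUOTE_UPPER = new ToyDialect("\"", "\"", true);
    static final ToyDialect BACKTICK = new ToyDialect("`", "`", false);

    final String open;
    final String close;
    final boolean upperCase;

    ToyDialect(String open, String close, boolean upperCase) {
        this.open = open;
        this.close = close;
        this.upperCase = upperCase;
    }

    String quote(String identifier) {
        String name = upperCase ? identifier.toUpperCase(Locale.ROOT) : identifier;
        // Escape an embedded closing quote by doubling it.
        return open + name.replace(close, close + close) + close;
    }
}
```

Keeping the policy in a separate object means the tree itself stays database-neutral.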
QUERY OPTIMIZER
Query Plans
• Query plans represent the steps necessary
to execute a query
[Plan diagram not preserved; its nodes were a project, a filter, an inner join, and two table scans]
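A query plan of this kind can be pictured as a tree of operators. The sketch below builds one and prints it; the shape (project over filter over inner join over two scans) is an assumption about the lost diagram, and `ToyPlanNode` plus the `orders` and `customers` table names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy plan tree; not Calcite's RelNode.
final class ToyPlanNode {
    final String op;
    final List<ToyPlanNode> inputs;

    ToyPlanNode(String op, ToyPlanNode... inputs) {
        this.op = op;
        this.inputs = Arrays.asList(inputs);
    }

    // Renders the tree top-down, indenting each input under its consumer.
    String explain() {
        StringBuilder out = new StringBuilder();
        explain(0, out);
        return out.toString();
    }

    private void explain(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(op).append('\n');
        for (ToyPlanNode input : inputs) {
            input.explain(depth + 1, out);
        }
    }

    // Steps run bottom-up: inputs come before the operator that consumes them.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        collect(order);
        return order;
    }

    private void collect(List<String> order) {
        for (ToyPlanNode input : inputs) {
            input.collect(order);
        }
        order.add(op);
    }

    static ToyPlanNode sample() {
        return new ToyPlanNode("Project",
            new ToyPlanNode("Filter",
                new ToyPlanNode("InnerJoin",
                    new ToyPlanNode("TableScan(orders)"),
                    new ToyPlanNode("TableScan(customers)"))));
    }

    public static void main(String[] args) {
        System.out.print(sample().explain());
    }
}
```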
Query Optimization
• Optimize logical plan
  • Goal is typically to reduce the amount of data that must be processed early in the plan
• Convert logical plan into a physical plan
  • Physical plan is engine specific and represents the physical execution stages
Query Optimization
• Prune unused fields
• Merge projections
• Convert subqueries to joins
• Reorder joins
• Push down projections
• Push down filters
[Optimization example not preserved; its callouts marked a pushed-down project and a pushed-down filter]
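Filter push-down is the easiest of these rewrites to sketch. The toy below moves a filter beneath an inner join when its predicate refers to only one side, so fewer rows reach the join. It assumes inner joins only (the rewrite is not generally safe for outer joins) and a predicate tagged with the single table it reads; all class names are hypothetical, not Calcite's.

```java
import java.util.HashSet;
import java.util.Set;

// Toy filter push-down for inner joins; not Calcite's rule API.
final class PushDownFilter {
    // Filter(Join(l, r)) becomes Join(Filter(l), r) when the predicate only
    // needs columns from l (and symmetrically for r). Only a filter at the
    // given node is considered; other nodes are returned unchanged.
    static PdNode apply(PdNode node) {
        if (node instanceof PdFilter && ((PdFilter) node).input instanceof PdJoin) {
            PdFilter f = (PdFilter) node;
            PdJoin j = (PdJoin) f.input;
            if (j.left.tables().contains(f.table)) {
                return new PdJoin(apply(new PdFilter(f.table, f.predicate, j.left)), j.right);
            }
            if (j.right.tables().contains(f.table)) {
                return new PdJoin(j.left, apply(new PdFilter(f.table, f.predicate, j.right)));
            }
        }
        return node;
    }

    public static void main(String[] args) {
        PdNode plan = new PdFilter("orders", "orders.total > 100",
            new PdJoin(new PdScan("orders"), new PdScan("customers")));
        System.out.println(apply(plan));
    }
}

abstract class PdNode {
    abstract Set<String> tables();   // tables whose columns this node outputs
}

final class PdScan extends PdNode {
    final String table;
    PdScan(String table) { this.table = table; }
    Set<String> tables() {
        Set<String> s = new HashSet<>();
        s.add(table);
        return s;
    }
    public String toString() { return "Scan(" + table + ")"; }
}

final class PdJoin extends PdNode {
    final PdNode left;
    final PdNode right;
    PdJoin(PdNode left, PdNode right) { this.left = left; this.right = right; }
    Set<String> tables() {
        Set<String> all = new HashSet<>(left.tables());
        all.addAll(right.tables());
        return all;
    }
    public String toString() { return "Join(" + left + ", " + right + ")"; }
}

final class PdFilter extends PdNode {
    final String table;       // the single table the predicate refers to
    final String predicate;
    final PdNode input;
    PdFilter(String table, String predicate, PdNode input) {
        this.table = table;
        this.predicate = predicate;
        this.input = input;
    }
    Set<String> tables() { return input.tables(); }
    public String toString() { return "Filter[" + predicate + "](" + input + ")"; }
}
```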
Key Concepts
• Relational algebra - RelNode
• Row expressions - RexNode
• Traits - RelTrait
• Conventions - Convention
• Rules - RelOptRule
• Planners - RelOptPlanner
• Programs - Program
Relational Algebra
• RelNode represents a relational expression
• Largely equivalent to Spark’s DataFrame
methods
• Logical algebra
• Physical algebra
Relational Algebra
Logical operators, each paired with a physical counterpart (Spark used as the example engine):
• TableScan - SparkTableScan
• Project - SparkProject
• Filter - SparkFilter
• Aggregate - SparkAggregate
• Join - SparkJoin
• Union - SparkUnion
• Intersect - SparkIntersect
• Sort - SparkSort
Row Expressions
• RexNode represents a row-level expression
• Largely equivalent to Spark’s Column
functions
• Projection fields
• Filter condition
• Join condition
• Sort fields
Row Expressions
• Input column ref - RexInputRef
• Literal - RexLiteral
• Struct field access - RexFieldAccess
• Function call - RexCall
• Window expression - RexOver
[Row expression example not preserved; its callouts marked an input ref and a function call]
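A row expression is a small tree evaluated once per row. The sketch below models three of the node kinds (input ref, literal, function call) and evaluates a filter condition against one row. It is a toy limited to integer operands; `ToyRex` and its implementations are hypothetical names, not Calcite's RexNode classes.

```java
import java.util.Arrays;
import java.util.List;

// Toy row expressions over Object[] rows; integer operands only.
final class ToyRexDemo {
    public static void main(String[] args) {
        // Condition: column 1 > 100
        ToyRex condition = new ToyCall(">", new ToyInputRef(1), new ToyLiteral(100));
        System.out.println(condition + " -> " + condition.eval(new Object[] {"alice", 250}));
    }
}

interface ToyRex {
    Object eval(Object[] row);
}

// Reference to an input column by position.
final class ToyInputRef implements ToyRex {
    final int index;
    ToyInputRef(int index) { this.index = index; }
    public Object eval(Object[] row) { return row[index]; }
    public String toString() { return "$" + index; }
}

// Constant value.
final class ToyLiteral implements ToyRex {
    final Object value;
    ToyLiteral(Object value) { this.value = value; }
    public Object eval(Object[] row) { return value; }
    public String toString() { return String.valueOf(value); }
}

// Binary operator applied to two operand expressions.
final class ToyCall implements ToyRex {
    final String op;
    final List<ToyRex> operands;

    ToyCall(String op, ToyRex... operands) {
        this.op = op;
        this.operands = Arrays.asList(operands);
    }

    public Object eval(Object[] row) {
        int left = (Integer) operands.get(0).eval(row);
        int right = (Integer) operands.get(1).eval(row);
        switch (op) {
            case "+": return left + right;
            case ">": return left > right;
            default: throw new UnsupportedOperationException(op);
        }
    }

    public String toString() {
        StringBuilder sb = new StringBuilder(op).append('(');
        for (int i = 0; i < operands.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(operands.get(i));
        }
        return sb.append(')').toString();
    }
}
```

The same tree shape serves as a projection field, a filter condition, or a join condition; only the operator that owns it differs.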
Traits
• Defined by the RelTrait interface
• Represent a trait of a relational expression that does not alter execution
• Traits are used to validate plan output
• Three primary trait types:
  • Convention
  • RelCollation
  • RelDistribution
Conventions
• Convention is a type of RelTrait
• A Convention is associated with a
RelNode interface
• SparkConvention, JdbcConvention,
EnumerableConvention, etc
• Conventions are used to represent a single
data source
• Inputs to a relational expression must be in
the same convention
[Diagram not preserved; its callouts marked a Spark convention, a JDBC convention, and a converter]
Rules
• Rules are used to modify query plans
• Defined by the RelOptRule abstract class
• Two types of rules: converters and
transformers
• Converter rules extend ConverterRule and
convert from one convention to another
• Rules are matched to elements of a query
plan using pattern matching
• onMatch is called for matched rules
• Converter rules applied via convert
Converter Rule
[Code example not preserved; its callouts marked the expression type, input convention, converted convention, and converter function]
Pattern Matching
[Diagram not preserved; it marked some attempts to match a rule against the plan as "no match" and one as "match!"]
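The match-then-rewrite cycle can be sketched with a toy rule whose pattern is a chain of operator kinds. The example rule merges two stacked filters into one. The `matches` and `onMatch` names echo the slides, but the signatures and every class here are hypothetical simplifications, not Calcite's RelOptRule API.

```java
import java.util.Arrays;
import java.util.List;

// Toy single-input plan node; hypothetical, not Calcite's RelNode.
final class PmNode {
    final String kind;     // e.g. "Project", "Filter", "Scan"
    final String detail;
    final PmNode input;    // null for leaves

    PmNode(String kind, String detail, PmNode input) {
        this.kind = kind;
        this.detail = detail;
        this.input = input;
    }

    public String toString() {
        return kind + "[" + detail + "]" + (input == null ? "" : " <- " + input);
    }

    public static void main(String[] args) {
        PmNode plan = new PmNode("Filter", "a > 1",
            new PmNode("Filter", "b < 5", new PmNode("Scan", "t", null)));
        PmRule rule = new PmMergeFilters();
        System.out.println(rule.matches(plan) ? rule.onMatch(plan) : plan);
    }
}

abstract class PmRule {
    // Operator kinds expected from the matched node downward.
    final List<String> pattern;

    PmRule(String... pattern) {
        this.pattern = Arrays.asList(pattern);
    }

    // True when the node and its chain of inputs line up with the pattern.
    boolean matches(PmNode node) {
        PmNode current = node;
        for (String kind : pattern) {
            if (current == null || !current.kind.equals(kind)) {
                return false;
            }
            current = current.input;
        }
        return true;
    }

    // Called only for nodes that matched; returns the replacement subtree.
    abstract PmNode onMatch(PmNode node);
}

// Filter over Filter becomes one Filter with both conditions ANDed.
final class PmMergeFilters extends PmRule {
    PmMergeFilters() {
        super("Filter", "Filter");
    }

    PmNode onMatch(PmNode top) {
        PmNode bottom = top.input;
        return new PmNode("Filter", bottom.detail + " AND " + top.detail, bottom.input);
    }
}
```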
Planners
• Planners implement the RelOptPlanner interface
• Two types of planners:
  • HepPlanner
  • VolcanoPlanner
Heuristic Optimization
• HepPlanner is a heuristic optimizer similar
to Spark’s optimizer
• Applies all matching rules until none can be
applied
• Heuristic optimization is faster than cost-
based optimization
• Risk of infinite recursion if rules make
opposing changes to the plan
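The "apply until nothing matches" loop, and the way opposing rules can keep it from ever finishing, can be shown with a toy planner. This is a sketch under simplifying assumptions: a plan is just a top-to-bottom list of operator names, a rule is a function on that list, and `maxSteps` is a hypothetical guard; it is not how HepPlanner is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy heuristic planner: fires the first rule that changes the plan, then
// starts over, until no rule changes anything.
final class ToyHepPlanner {
    static List<String> optimize(List<String> plan,
                                 List<UnaryOperator<List<String>>> rules,
                                 int maxSteps) {
        List<String> current = plan;
        for (int step = 0; step < maxSteps; step++) {
            boolean changed = false;
            for (UnaryOperator<List<String>> rule : rules) {
                List<String> next = rule.apply(current);
                if (!next.equals(current)) {
                    current = next;
                    changed = true;
                    break;
                }
            }
            if (!changed) {
                return current;   // fixpoint: no rule applies any more
            }
        }
        // Rules that undo each other never reach a fixpoint.
        throw new IllegalStateException("No fixpoint after " + maxSteps + " steps");
    }

    // Rule that swaps the first adjacent (upper, lower) pair it finds.
    static UnaryOperator<List<String>> swap(String upper, String lower) {
        return plan -> {
            for (int i = 0; i + 1 < plan.size(); i++) {
                if (plan.get(i).equals(upper) && plan.get(i + 1).equals(lower)) {
                    List<String> copy = new ArrayList<>(plan);
                    Collections.swap(copy, i, i + 1);
                    return copy;
                }
            }
            return plan;
        };
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("Filter", "Project", "Scan");
        List<UnaryOperator<List<String>>> rules =
            Collections.singletonList(swap("Filter", "Project"));
        System.out.println(optimize(plan, rules, 10));
    }
}
```

With only the push-down rule the loop settles after one rewrite; adding the opposite rule, `swap("Project", "Filter")`, makes the two undo each other until the step limit is hit.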
Cost-based Optimization
• VolcanoPlanner is a cost-based
optimizer
• Applies matching rules iteratively, selecting
the plan with the lowest cost on each
iteration
• Costs are provided by relational expressions
• Not all possible plans can be computed
• Stops optimizing when the cost has not
improved significantly over a set number
of iterations
Cost-based Optimization
• Cost is provided by each RelNode
• Cost is represented by RelOptCost
• Cost typically includes row count, I/O, and
CPU cost
• Cost estimates are relative
• Statistics are used to improve accuracy of
cost estimations
• Calcite provides utilities for computing
various resource-related statistics for use in
cost estimations
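To see how statistics feed a cost comparison, the sketch below costs two equivalent plans from table row counts and a filter selectivity, then picks the cheaper one. Everything in it is an assumption made for illustration: the `ToyCost` fields mirror the row count, CPU, and I/O components named above, but the formulas, the weights, and the hash-join model are invented and are not Calcite's RelOptCost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy cost model; formulas and weights are illustrative assumptions.
final class ToyCostModel {
    static ToyCost scan(double tableRows) {
        return new ToyCost(tableRows, tableRows, tableRows);
    }

    // A filter inspects every input row.
    static ToyCost filter(double inputRows) {
        return new ToyCost(inputRows, inputRows, 0);
    }

    // Assumes a hash join that touches each input row once.
    static ToyCost join(double leftRows, double rightRows) {
        double n = leftRows + rightRows;
        return new ToyCost(n, n, 0);
    }

    // Two equivalent plans for: orders JOIN customers, filtered on orders.
    static Map<String, ToyCost> candidates(double orders, double customers, double selectivity) {
        ToyCost scans = scan(orders).plus(scan(customers));
        Map<String, ToyCost> plans = new LinkedHashMap<>();
        // Join first (assumed to emit one row per order), then filter.
        plans.put("filter after join",
            scans.plus(join(orders, customers)).plus(filter(orders)));
        // Filter first, so the join only sees the surviving fraction.
        plans.put("filter before join",
            scans.plus(filter(orders)).plus(join(orders * selectivity, customers)));
        return plans;
    }

    static String cheapest(Map<String, ToyCost> plans) {
        String best = null;
        for (Map.Entry<String, ToyCost> e : plans.entrySet()) {
            if (best == null || e.getValue().weighted() < plans.get(best).weighted()) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(cheapest(candidates(1_000_000, 1_000, 0.01)));
    }
}

final class ToyCost {
    final double rows;
    final double cpu;
    final double io;

    ToyCost(double rows, double cpu, double io) {
        this.rows = rows;
        this.cpu = cpu;
        this.io = io;
    }

    ToyCost plus(ToyCost other) {
        return new ToyCost(rows + other.rows, cpu + other.cpu, io + other.io);
    }

    // Costs only mean something relative to each other; the weights are arbitrary.
    double weighted() {
        return rows + cpu + 10 * io;
    }
}
```

Better statistics change the inputs (row counts, selectivity), which in turn can change which plan wins.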
PUTTING IT ALL TOGETHER
[Slide content not preserved]

Introduction to Apache Calcite

  • 1. >< INTRODUCTIONTO INTRODUCTION TO APACHE CALCITE APACHE CALCITE JORDAN HALTERMAN 1
  • 3. ><INTRODUCTION TO APACHE CALCITE 3 What is Apache Calcite? • A framework for building SQL databases • Developed over more than ten years • Written in Java • Previously known as Optiq • Previously known as Farrago • Became an Apache project in 2013 • Led by Julian Hyde at Hortonworks
  • 4. ><INTRODUCTION TO APACHE CALCITE 4 Projects using Calcite • Apache Hive • Apache Drill • Apache Flink • Apache Phoenix • Apache Samza • Apache Storm • Apache everything…
  • 5. ><INTRODUCTION TO APACHE CALCITE 5 What is Apache Calcite? • SQL parser • SQL validation • Query optimizer • SQL generator • Data federator
  • 6. ><INTRODUCTION TO APACHE CALCITE Parse Queries are parsed using a JavaCC generated parser Validate Queries are validated against known database metadata Optimize Logical plans are optimized and converted into physical expressions Execute P h y s i c a l p l a n s a r e converted into application- specific executions 01 02 03 04 Stages of query execution 6
  • 8. ><INTRODUCTION TO APACHE CALCITE 8 Components of Calcite • Catalog - Defines metadata and namespaces that can be accessed in SQL queries • SQL parser - Parses valid SQL queries into an abstract syntax tree (AST) • SQL validator - Validates abstract syntax trees against metadata provided by the catalog • Query optimizer - Converts AST into logical plans, optimizes logical plans, and converts logical expressions into physical plans • SQL generator - Converts physical plans to SQL
  • 10. ><INTRODUCTION TO APACHE CALCITE 10 Calcite Catalog • Defines namespaces that can be accessed in Calcite queries • Schema • A collection of schemas and tables • Can be arbitrarily nested • Table • Represents a single data set • Fields defined by a RelDataType • RelDataType • Represents fields in a data set • Supports all SQL data types, including structs and
  • 11. ><INTRODUCTION TO APACHE CALCITE 11 Schema • A collection of schemas and tables • Schemas can be arbitrarily nested
  • 12. ><INTRODUCTION TO APACHE CALCITE 12 Schema • A collection of schemas and tables • Schemas can be arbitrarily nested
  • 13. ><INTRODUCTION TO APACHE CALCITE 13 Table • Represents a single data set • Fields are defined by a RelDataType
  • 14. ><INTRODUCTION TO APACHE CALCITE 14 Table • Represents a single data set • Fields are defined by a RelDataType
  • 15. ><INTRODUCTION TO APACHE CALCITE 15 RelDataType • Represents the data type of an object • Supports all SQL data types, including structs and arrays • Similar to Spark’s DataType
  • 16. ><INTRODUCTION TO APACHE CALCITE 16 RelDataType
  • 17. ><INTRODUCTION TO APACHE CALCITE 17 RelDataType data type enum
  • 18. ><INTRODUCTION TO APACHE CALCITE 18 Statistic • Provide table statistics used in optimization
  • 19. ><INTRODUCTION TO APACHE CALCITE 19 Statistic • Provide table statistics used in optimization
  • 20. ><INTRODUCTION TO APACHE CALCITE 20 Usage of the Calcite catalog
  • 21. ><INTRODUCTION TO APACHE CALCITE 21 Usage of the Calcite catalog schema
  • 22. ><INTRODUCTION TO APACHE CALCITE 22 Usage of the Calcite catalog schema table
  • 23. ><INTRODUCTION TO APACHE CALCITE 23 Usage of the Calcite catalog schema table data type
  • 24. ><INTRODUCTION TO APACHE CALCITE 24 Usage of the Calcite catalog schema table data typedata type field
  • 26. ><INTRODUCTION TO APACHE CALCITE 26 Calcite SQL parser • LL(k) parser written in JavaCC • Input queries are parsed into an abstract syntax tree (AST) • Tokens are represented in Calcite by SqlNode • SqlNode can also be converted back to a SQL string via the unparse method
  • 27. ><INTRODUCTION TO APACHE CALCITE 27 JavaCC • Java Compiler Compiler • Created in 1996 at Sun Microsystems • Generates Java code from a domain- specific language • ANTLR is the modern alternative used in projects like Hive and Drill • JavaCC has sparse documentation
  • 28. ><INTRODUCTION TO APACHE CALCITE 28 JavaCC
  • 29. ><INTRODUCTION TO APACHE CALCITE 29 JavaCC
  • 30. ><INTRODUCTION TO APACHE CALCITE 30 JavaCC tokens
  • 31. ><INTRODUCTION TO APACHE CALCITE 31 JavaCC tokens or
  • 32. ><INTRODUCTION TO APACHE CALCITE 32 JavaCC tokens or function call
  • 33. ><INTRODUCTION TO APACHE CALCITE 33 JavaCC tokens Java code or function call
  • 34. ><INTRODUCTION TO APACHE CALCITE 34 SqlNode • SqlNode represents an element in an abstract syntax tree
  • 35. ><INTRODUCTION TO APACHE CALCITE 35 SqlNode • SqlNode represents an element in an abstract syntax tree select
  • 36. ><INTRODUCTION TO APACHE CALCITE 36 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect
  • 37. ><INTRODUCTION TO APACHE CALCITE 37 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator
  • 38. ><INTRODUCTION TO APACHE CALCITE 38 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier
  • 39. ><INTRODUCTION TO APACHE CALCITE 39 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier data type
  • 40. ><INTRODUCTION TO APACHE CALCITE 40 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier data type identifier
  • 41. ><INTRODUCTION TO APACHE CALCITE 41 SqlNode • SqlNode’s unparse method converts a SQL element back into a string
  • 42. ><INTRODUCTION TO APACHE CALCITE 42 SqlNode • SqlNode’s unparse method converts a SQL element back into a string
  • 43. ><INTRODUCTION TO APACHE CALCITE 43 SqlNode
  • 44. ><INTRODUCTION TO APACHE CALCITE 44 SqlNode • SqlDialect indicates the capitalization and quoting rules of specific databases
  • 45. ><INTRODUCTION TO APACHE CALCITE 45 SqlNode • SqlDialect indicates the capitalization and quoting rules of specific databases
  • 47. ><INTRODUCTION TO APACHE CALCITE 47 Query Plans • Query plans represent the steps necessary to execute a query
  • 48. ><INTRODUCTION TO APACHE CALCITE 48 Query Plans • Query plans represent the steps necessary to execute a query
  • 49. ><INTRODUCTION TO APACHE CALCITE 49 Query Plans • Query plans represent the steps necessary to execute a query table scan table scan
  • 50. ><INTRODUCTION TO APACHE CALCITE 50 Query Plans • Query plans represent the steps necessary to execute a query inner join table scan table scan
  • 51. ><INTRODUCTION TO APACHE CALCITE 51 Query Plans • Query plans represent the steps necessary to execute a query filter inner join table scan table scan
  • 52. ><INTRODUCTION TO APACHE CALCITE 52 Query Plans • Query plans represent the steps necessary to execute a query filter inner join project table scan table scan
  • 53. ><INTRODUCTION TO APACHE CALCITE 53 Query Plans • Query plans represent the steps necessary to execute a query filter inner join project table scan table scan
  • 54. ><INTRODUCTION TO APACHE CALCITE 54 Query Optimization • Optimize logical plan • Goal is typically to try to reduce the amount of data that must be processed early in the plan • Convert logical plan into a physical plan • Physical plan is engine specific and represents the physical execution stages
  • 55. ><INTRODUCTION TO APACHE CALCITE 55 Query Optimization • Prune unused fields • Merge projections • Convert subqueries to joins • Reorder joins • Push down projections • Push down filters
  • 56. ><INTRODUCTION TO APACHE CALCITE 56 Query Optimization
  • 57. ><INTRODUCTION TO APACHE CALCITE 57 Query Optimization
  • 58. ><INTRODUCTION TO APACHE CALCITE 58 Query Optimization
  • 59. ><INTRODUCTION TO APACHE CALCITE 59 Query Optimization push down project
  • 60. ><INTRODUCTION TO APACHE CALCITE 60 Query Optimization push down project push down filter
  • 61. ><INTRODUCTION TO APACHE CALCITE 61 Query Optimization
  • 62. ><INTRODUCTION TO APACHE CALCITE 62 Key Concepts Relational algebra Row expressions Traits Conventions Rules Planners Programs
  • 63. ><INTRODUCTION TO APACHE CALCITE 63 Key Concepts Relational algebra Row expressions Traits Conventions Rules Planners Programs RelNode RexNode RelTrait Convention RelOptRule RelOptPlanner Program
  • 64. ><INTRODUCTION TO APACHE CALCITE 64 Relational Algebra • RelNode represents a relational expression • Largely equivalent to Spark’s DataFrame methods • Logical algebra • Physical algebra
  • 65. ><INTRODUCTION TO APACHE CALCITE 65 Relational Algebra TableScan Project Filter Aggregate Join Union Intersect Sort
  • 66. ><INTRODUCTION TO APACHE CALCITE 66 Relational Algebra TableScan Project Filter Aggregate Join Union Intersect Sort SparkTableScan SparkProject SparkFilter SparkAggregate SparkJoin SparkUnion SparkIntersect SparkSort
  • 67. ><INTRODUCTION TO APACHE CALCITE 67 Row Expressions • RexNode represents a row-level expression • Largely equivalent to Spark’s Column functions • Projection fields • Filter condition • Join condition • Sort fields
  • 68. ><INTRODUCTION TO APACHE CALCITE 68 Row Expressions Input column ref Literal Struct field access Function call Window expression
  • 69. ><INTRODUCTION TO APACHE CALCITE 69 Row Expressions Input column ref Literal Struct field access Function call Window expression RexInputRef RexLiteral RexFieldAccess RexCall RexOver
  • 70-72. Row Expressions (slide text not captured beyond the labels: input ref, function call)
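The row-expression kinds labelled above (input ref, literal, function call) can be modelled as a small expression tree that is evaluated once per row. The sketch below is plain Python and only mirrors the idea behind RexInputRef, RexLiteral, and RexCall; the `InputRef`, `Literal`, and `Call` classes, the `OPS` table, and `evaluate` are hypothetical names, not Calcite's API.

```python
import operator
from dataclasses import dataclass
from typing import List


@dataclass
class InputRef:
    index: int          # position of the column in the input row


@dataclass
class Literal:
    value: object


@dataclass
class Call:
    op: str             # operator name, looked up in OPS
    operands: List[object]


# Assumed operator table for the example; a real system has many more.
OPS = {"+": operator.add, ">": operator.gt, "=": operator.eq}


def evaluate(expr, row):
    """Evaluate a row expression against a single row (a tuple of values)."""
    if isinstance(expr, InputRef):
        return row[expr.index]
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Call):
        args = [evaluate(operand, row) for operand in expr.operands]
        return OPS[expr.op](*args)
    raise TypeError(f"unsupported expression: {expr!r}")


# Filter condition "column 1 > 30": a function call over an input ref and a literal.
condition = Call(">", [InputRef(1), Literal(30)])
rows = [("ann", 41), ("bob", 25)]
kept = [row for row in rows if evaluate(condition, row)]
print(kept)
```

The same tree shape serves as a projection field, a filter condition, a join condition, or a sort key; only the place where it is evaluated changes.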
  • 73. Traits
      • Defined by the RelTrait interface
      • Represent a trait of a relational expression that does not alter execution
      • Traits are used to validate plan output
      • Three primary trait types:
          • Convention
          • RelCollation
          • RelDistribution
  • 74. Conventions
      • Convention is a type of RelTrait
      • A Convention is associated with a RelNode interface
      • Examples: SparkConvention, JdbcConvention, EnumerableConvention, etc.
      • Conventions are used to represent a single data source
      • Inputs to a relational expression must be in the same convention
  • 75-78. Conventions (slide text not captured beyond the labels: Spark convention, JDBC convention, converter)
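The requirement that all inputs to an expression share its convention, and the converter that bridges a Spark subtree and a JDBC subtree, can be sketched as follows. This is a toy model in plain Python: `Node`, `add_converters`, the string convention names, and the generated converter names are all hypothetical and do not reflect how Calcite represents or inserts converters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    convention: str                     # e.g. "SPARK" or "JDBC"
    inputs: List["Node"] = field(default_factory=list)


def add_converters(node):
    """Wrap every input whose convention differs from its parent's in a converter."""
    fixed_inputs = []
    for child in node.inputs:
        child = add_converters(child)
        if child.convention != node.convention:
            # The converter itself is in the parent's convention.
            child = Node(
                f"{child.convention}To{node.convention}Converter",
                node.convention,
                [child],
            )
        fixed_inputs.append(child)
    return Node(node.name, node.convention, fixed_inputs)


# A Spark join whose right input is read through JDBC.
plan = Node("Join", "SPARK", [
    Node("Scan(a)", "SPARK"),
    Node("Scan(b)", "JDBC"),
])
fixed = add_converters(plan)
print([child.name for child in fixed.inputs])
```

After the rewrite, every direct input of the join is in the Spark convention, and the JDBC scan sits underneath the converter node.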
  • 79. Rules
      • Rules are used to modify query plans
      • Defined by the RelOptRule abstract class
      • Two types of rules: converters and transformers
      • Converter rules extend ConverterRule and convert from one convention to another
      • Rules are matched to elements of a query plan using pattern matching
      • onMatch is called for matched rules
      • Converter rules are applied via convert
  • 80-84. Converter Rule (slide text not captured beyond the labels: expression type, input convention, converted convention, converter function)
  • 85-89. Pattern Matching (slide text not captured beyond the labels: no match :-(, match!)
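Matching a rule against each node of a plan, and calling a handler only where the pattern fits, can be sketched like this. The example is a toy in plain Python, not Calcite's operand API: the node classes, the `(type, child_pattern)` tuple encoding, `matches`, `merge_projects`, and `apply_rule` are hypothetical. The rule shown merges a projection sitting directly on another projection, one of the optimizations listed on slide 55, and assumes projections only select columns.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    condition: str
    input: object


@dataclass
class Project:
    fields: List[str]
    input: object


def matches(node, pattern):
    """pattern is (node_type, child_pattern); child_pattern None means 'any'."""
    node_type, child_pattern = pattern
    if not isinstance(node, node_type):
        return False
    if child_pattern is None:
        return True
    child = getattr(node, "input", None)
    return child is not None and matches(child, child_pattern)


# Pattern: a Project whose input is another Project.
MERGE_PROJECTS = (Project, (Project, None))


def merge_projects(outer):
    """Handler called on a match: keep the outer field list, skip the inner node."""
    return Project(outer.fields, outer.input.input)


def apply_rule(node, pattern, on_match, trace):
    """Visit the plan top-down, recording where the pattern does and does not fit."""
    name = type(node).__name__
    if matches(node, pattern):
        trace.append(f"match at {name}")
        node = on_match(node)
    else:
        trace.append(f"no match at {name}")
    if hasattr(node, "input"):
        node = replace(node, input=apply_rule(node.input, pattern, on_match, trace))
    return node


plan = Filter("age > 30",
              Project(["name", "age"],
                      Project(["name", "age", "dept"], Scan("emp"))))
trace = []
result = apply_rule(plan, MERGE_PROJECTS, merge_projects, trace)
print(trace)
print(result)
```

The trace mirrors the slide labels: the rule is tried at every node, fails at most of them, and rewrites the plan only where the whole pattern fits.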
  • 90. Planners
      • Planners implement the RelOptPlanner interface
      • Two types of planners:
          • HepPlanner
          • VolcanoPlanner
  • 91. Heuristic Optimization
      • HepPlanner is a heuristic optimizer similar to Spark’s optimizer
      • Applies all matching rules until none can be applied
      • Heuristic optimization is faster than cost-based optimization
      • Risk of infinite recursion if rules make opposing changes to the plan
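"Apply all matching rules until none can be applied" is a fixpoint loop, and the infinite-recursion risk shows up as soon as two rules undo each other. The sketch below is plain Python over nested tuples, not HepPlanner: the plan encoding, the four rule functions, `rewrite_anywhere`, `run_to_fixpoint`, and the `max_steps` guard are all hypothetical choices made for the example.

```python
def remove_true_filter(plan):
    """("filter", "TRUE", child) -> child"""
    if plan[0] == "filter" and plan[1] == "TRUE":
        return plan[2]
    return None


def merge_filters(plan):
    """Two stacked filters become one filter with an AND condition."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        inner = plan[2]
        return ("filter", f"({plan[1]}) AND ({inner[1]})", inner[2])
    return None


def push_filter_down(plan):
    if plan[0] == "filter" and plan[2][0] == "project":
        proj = plan[2]
        return ("project", proj[1], ("filter", plan[1], proj[2]))
    return None


def pull_filter_up(plan):
    if plan[0] == "project" and plan[2][0] == "filter":
        filt = plan[2]
        return ("filter", filt[1], ("project", plan[1], filt[2]))
    return None


def rewrite_anywhere(plan, rule):
    """Fire the rule at the first node (top-down) where it applies, else None."""
    result = rule(plan)
    if result is not None:
        return result
    if plan[0] != "scan":  # every non-scan node keeps its child in the last slot
        child = rewrite_anywhere(plan[-1], rule)
        if child is not None:
            return plan[:-1] + (child,)
    return None


def run_to_fixpoint(plan, rules, max_steps=100):
    """Apply rules until none fires; fail loudly instead of looping forever."""
    steps = 0
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_plan = rewrite_anywhere(plan, rule)
            if new_plan is not None and new_plan != plan:
                plan, steps, changed = new_plan, steps + 1, True
                if steps >= max_steps:
                    raise RuntimeError("rules did not converge")
                break
    return plan, steps


plan = ("filter", "TRUE",
        ("filter", "a > 1", ("filter", "b < 2", ("scan", "t"))))
final_plan, steps = run_to_fixpoint(plan, [remove_true_filter, merge_filters])
print(final_plan, steps)
```

Running the same loop with `push_filter_down` and `pull_filter_up` together never reaches a fixpoint, because each rule reverses the other; in this sketch the `max_steps` guard turns that into a `RuntimeError`.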
  • 92. Cost-based Optimization
      • VolcanoPlanner is a cost-based optimizer
      • Applies matching rules iteratively, selecting the lowest-cost plan on each iteration
      • Costs are provided by relational expressions
      • Not all possible plans can be computed
      • Stops optimizing when the cost has not improved significantly over a given number of iterations
  • 93. Cost-based Optimization
      • Cost is provided by each RelNode
      • Cost is represented by RelOptCost
      • Cost typically includes row count, I/O, and CPU cost
      • Cost estimates are relative
      • Statistics are used to improve accuracy of cost estimations
      • Calcite provides utilities for computing various resource-related statistics for use in cost estimations
  • 94-98. Cost-based Optimization (slide content not captured in the transcript)
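The idea that each expression reports a cost made of row count, CPU, and I/O, and that the planner keeps the cheapest alternative, can be illustrated with a toy join-ordering problem. This sketch is plain Python and deliberately simplistic: it enumerates every left-deep join order, which is not how VolcanoPlanner searches, and the `Cost` class, the equal weighting in `total`, the row counts, and the join cost formula are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Cost:
    rows: int
    cpu: int
    io: int

    def __add__(self, other):
        return Cost(self.rows + other.rows,
                    self.cpu + other.cpu,
                    self.io + other.io)

    def total(self):
        # Equal weights are an arbitrary assumption; estimates are only relative.
        return self.rows + self.cpu + self.io


# Assumed table statistics (row counts).
ROW_COUNTS = {"orders": 1_000_000, "customers": 10_000, "regions": 10}


def plan_cost(order):
    """Cumulative cost of a left-deep join over the tables in the given order."""
    rows = ROW_COUNTS[order[0]]
    cost = Cost(rows=rows, cpu=0, io=rows)        # a scan reads every row
    for table in order[1:]:
        right = ROW_COUNTS[table]
        scan = Cost(rows=right, cpu=0, io=right)
        out = max(rows, right)                     # assumed join output size
        join = Cost(rows=out, cpu=rows + right, io=0)
        cost = cost + scan + join
        rows = out
    return cost


candidates = list(permutations(ROW_COUNTS))
best = min(candidates, key=lambda order: plan_cost(order).total())
print(best, plan_cost(best))
```

Under these assumptions the cheapest orders join the two small tables first and bring in the large `orders` table last, which is the kind of decision better statistics are meant to support.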
  • 100-109. Putting it all together (slide content not captured in the transcript)