Introduction to Apache Calcite

Jordan Halterman

What is Apache Calcite?
• A framework for building SQL databases
• Developed over more than ten years
• Written in Java
• Previously known as Optiq
• Previously known as Farrago
• Entered the Apache Incubator in 2014; became a top-level project in 2015
• Led by Julian Hyde at Hortonworks

Projects using Calcite
• Apache Hive
• Apache Drill
• Apache Flink
• Apache Phoenix
• Apache Samza
• Apache Storm
• Apache everything…

What is Apache Calcite?
• SQL parser
• SQL validation
• Query optimizer
• SQL generator
• Data federator

Stages of query execution
• 01 Parse - Queries are parsed using a JavaCC-generated parser
• 02 Validate - Queries are validated against known database metadata
• 03 Optimize - Logical plans are optimized and converted into physical expressions
• 04 Execute - Physical plans are converted into application-specific executions

Components of Calcite
• Catalog - Defines metadata and namespaces
that can be accessed in SQL queries
• SQL parser - Parses valid SQL queries into an
abstract syntax tree (AST)
• SQL validator - Validates abstract syntax trees
against metadata provided by the catalog
• Query optimizer - Converts AST into logical
plans, optimizes logical plans, and converts
logical expressions into physical plans
• SQL generator - Converts physical plans to
SQL

Calcite Catalog
• Defines namespaces that can be accessed in Calcite
queries
• Schema
  • A collection of schemas and tables
  • Can be arbitrarily nested
• Table
  • Represents a single data set
  • Fields defined by a RelDataType
• RelDataType
  • Represents fields in a data set
  • Supports all SQL data types, including structs and arrays

Schema
• Schemas can be arbitrarily nested

Table
• Fields are defined by a RelDataType

RelDataType
• Represents the data type of an object
• Supports all SQL data types, including
structs and arrays
• Similar to Spark’s DataType

RelDataType
[Code example slide; callout: data type enum]

Statistic
• Provides table statistics used in optimization

Usage of the Calcite catalog
[Code example slide; callouts: schema, table, data type, field]
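
The catalog structure described above (nested schemas, tables, and fields with data types) can be sketched as a toy model. This is plain Python rather than Calcite's Java API: the names `Schema`, `Table`, `Field`, and `resolve` are illustrative assumptions, and `Table.fields` only stands in for the role a RelDataType plays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Field:
    name: str
    sql_type: str            # e.g. "INTEGER" or "VARCHAR"


@dataclass
class Table:
    name: str
    fields: List[Field]      # stands in for the table's row type


@dataclass
class Schema:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)
    sub_schemas: Dict[str, "Schema"] = field(default_factory=dict)

    def add_table(self, table: Table) -> None:
        self.tables[table.name] = table

    def add_schema(self, schema: "Schema") -> None:
        self.sub_schemas[schema.name] = schema

    def resolve(self, path: List[str]) -> Optional[Table]:
        """Resolve a qualified name such as ["sales", "orders"] to a table."""
        if not path:
            return None
        if len(path) == 1:
            return self.tables.get(path[0])
        child = self.sub_schemas.get(path[0])
        return child.resolve(path[1:]) if child else None


# Build a root schema containing a nested "sales" schema with one table.
root = Schema("root")
sales = Schema("sales")
sales.add_table(Table("orders", [Field("id", "INTEGER"), Field("customer", "VARCHAR")]))
root.add_schema(sales)

orders = root.resolve(["sales", "orders"])
```

Because schemas hold other schemas, a qualified name is resolved one path segment at a time, which is what makes arbitrary nesting possible.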

Calcite SQL parser
• LL(k) parser written in JavaCC
• Input queries are parsed into an abstract
syntax tree (AST)
• Nodes of the AST are represented in Calcite by
SqlNode
• SqlNode can also be converted back to a
SQL string via the unparse method
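
To make the parse and unparse round trip concrete, here is a minimal sketch of a hand-written recursive-descent parser for a tiny SELECT subset. It is not generated by JavaCC and is not Calcite's parser; `tokenize`, `Parser`, `Select`, and `Comparison` are hypothetical names, and the grammar (a column list, one table, one optional comparison) is an assumption chosen to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Numbers, words, and a few operators; whitespace between tokens is skipped.
TOKEN = re.compile(r"\s*(?:(\d+)|(\w+)|(>=|<=|<>|=|<|>|,))")


def tokenize(sql: str) -> List[str]:
    tokens, pos = [], 0
    sql = sql.strip()
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
        tokens.append(m.group(1) or m.group(2) or m.group(3))
        pos = m.end()
    return tokens


@dataclass
class Comparison:
    left: str
    op: str
    right: str

    def unparse(self) -> str:
        return f"{self.left} {self.op} {self.right}"


@dataclass
class Select:
    columns: List[str]
    table: str
    where: Optional[Comparison] = None

    def unparse(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where.unparse()}"
        return sql


class Parser:
    def __init__(self, tokens: List[str]):
        self.tokens, self.pos = tokens, 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, keyword: Optional[str] = None) -> str:
        tok = self.peek()
        if tok is None or (keyword and tok.upper() != keyword):
            raise ValueError(f"expected {keyword or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_select(self) -> Select:
        self.expect("SELECT")
        columns = [self.expect()]
        while self.peek() == ",":          # one token of lookahead
            self.expect()
            columns.append(self.expect())
        self.expect("FROM")
        table = self.expect()
        where = None
        if self.peek() is not None:
            self.expect("WHERE")
            where = Comparison(self.expect(), self.expect(), self.expect())
        if self.peek() is not None:
            raise ValueError(f"unexpected trailing token {self.peek()!r}")
        return Select(columns, table, where)


ast = Parser(tokenize("select id, name from users where id >= 10")).parse_select()
print(ast.unparse())
```

The parser decides what to do next by peeking at a single token, which is the LL(1) special case of the LL(k) strategy; the `unparse` methods walk the tree to regenerate SQL text.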

JavaCC
• Java Compiler Compiler
• Created in 1996 at Sun Microsystems
• Generates Java code from a domain-specific
language
• ANTLR is the modern alternative used in
projects like Hive and Drill
• JavaCC has sparse documentation

JavaCC
[Grammar example slide; callouts: tokens, or, function call, Java code]

SqlNode
• SqlNode represents an element in an
abstract syntax tree

SqlNode
[Code example slides; callouts on a query: select, identifiers, operator, identifier; callouts on a second example: data type, identifier]

SqlNode
• SqlNode’s unparse method converts a
SQL element back into a string

SqlNode
• SqlDialect indicates the capitalization
and quoting rules of specific databases
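
As an illustration of dialect-dependent unparsing, the sketch below renders the same query under two sets of quoting and capitalization rules. The `Dialect` class and `unparse_select` function are hypothetical stand-ins, not Calcite's SqlDialect; the only facts assumed about real databases are that standard SQL quotes identifiers with double quotes and MySQL uses backticks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dialect:
    quote_open: str
    quote_close: str
    upper_case_keywords: bool = True

    def quote(self, identifier: str) -> str:
        # Escape an embedded closing quote by doubling it.
        escaped = identifier.replace(self.quote_close, self.quote_close * 2)
        return f"{self.quote_open}{escaped}{self.quote_close}"

    def keyword(self, word: str) -> str:
        return word.upper() if self.upper_case_keywords else word.lower()


def unparse_select(columns: List[str], table: str, dialect: Dialect) -> str:
    cols = ", ".join(dialect.quote(c) for c in columns)
    return (f"{dialect.keyword('select')} {cols} "
            f"{dialect.keyword('from')} {dialect.quote(table)}")


ANSI = Dialect('"', '"')
MYSQL = Dialect("`", "`")

print(unparse_select(["id", "order date"], "orders", ANSI))
print(unparse_select(["id", "order date"], "orders", MYSQL))
```

Keeping these rules in a separate object means the tree itself stays database-neutral and only the final string rendering varies.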

Query Plans
• Query plans represent the steps necessary
to execute a query

Query Plans
[Plan diagram; nodes: project, filter, inner join, table scan, table scan]
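
A query plan of this shape can be modeled as a small tree of operators. The sketch below is an assumption-laden toy: `PlanNode` and `explain` are made-up names, the table and column names are invented, and the arrangement of project over filter over inner join is one plausible reading of the operators listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    op: str
    detail: str = ""
    inputs: List["PlanNode"] = field(default_factory=list)

    def explain(self, indent: int = 0) -> str:
        """Render the plan as an indented tree, root first."""
        line = "  " * indent + self.op + (f"({self.detail})" if self.detail else "")
        return "\n".join([line] + [i.explain(indent + 1) for i in self.inputs])


plan = PlanNode("Project", "o.id, c.name", [
    PlanNode("Filter", "c.country = 'US'", [
        PlanNode("InnerJoin", "o.customer_id = c.id", [
            PlanNode("TableScan", "orders"),
            PlanNode("TableScan", "customers"),
        ]),
    ]),
])
print(plan.explain())
```

Data flows from the leaves (the table scans) up to the root, so reading the tree bottom-up gives the order in which the steps execute.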

Query Optimization
• Optimize logical plan
  • The goal is typically to reduce, as early in the
  plan as possible, the amount of data that must
  be processed
• Convert logical plan into a physical plan
  • Physical plan is engine specific and
  represents the physical execution stages

Query Optimization
• Prune unused fields
• Merge projections
• Convert subqueries to joins
• Reorder joins
• Push down projections
• Push down filters

Query Optimization
[Plan diagram; callouts: push down project, push down filter]
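
Pushing a filter down can be sketched on a toy plan representation. The code below moves a filter beneath an inner join when a single join input supplies every column the filter reads. All names (`Scan`, `Join`, `Filter`, `push_filter_below_join`) are hypothetical, and the rewrite is only valid under the stated assumption of an inner join.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class Join:                      # assumed to be an inner join
    left: "Node"
    right: "Node"
    condition: str


@dataclass(frozen=True)
class Filter:
    input: "Node"
    condition: str
    used_columns: FrozenSet[str]   # columns the condition reads


Node = Union[Scan, Join, Filter]


def output_columns(node: Node) -> FrozenSet[str]:
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Filter):
        return output_columns(node.input)
    return output_columns(node.left) | output_columns(node.right)


def push_filter_below_join(node: Node) -> Node:
    """Move a filter under an inner join when one side supplies all its columns."""
    if isinstance(node, Filter) and isinstance(node.input, Join):
        join = node.input
        if node.used_columns <= output_columns(join.left):
            return Join(Filter(join.left, node.condition, node.used_columns),
                        join.right, join.condition)
        if node.used_columns <= output_columns(join.right):
            return Join(join.left,
                        Filter(join.right, node.condition, node.used_columns),
                        join.condition)
    return node   # the rewrite does not apply; leave the plan unchanged


orders = Scan("orders", frozenset({"o_id", "o_cust"}))
customers = Scan("customers", frozenset({"c_id", "c_country"}))
plan = Filter(Join(orders, customers, "o_cust = c_id"),
              "c_country = 'US'", frozenset({"c_country"}))
optimized = push_filter_below_join(plan)
```

After the rewrite the join sees only the customers that pass the filter, which is the "reduce data early" goal in miniature.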

Key Concepts
• Relational algebra - RelNode
• Row expressions - RexNode
• Traits - RelTrait
• Conventions - Convention
• Rules - RelOptRule
• Planners - RelOptPlanner
• Programs - Program

Relational Algebra
• RelNode represents a relational expression
• Largely equivalent to Spark’s DataFrame
methods
• Logical algebra
• Physical algebra

Relational Algebra
Logical operator - Spark physical operator
• TableScan - SparkTableScan
• Project - SparkProject
• Filter - SparkFilter
• Aggregate - SparkAggregate
• Join - SparkJoin
• Union - SparkUnion
• Intersect - SparkIntersect
• Sort - SparkSort

Row Expressions
• RexNode represents a row-level expression
• Largely equivalent to Spark’s Column
functions
• Projection fields
• Filter condition
• Join condition
• Sort fields

Row Expressions
• Input column ref - RexInputRef
• Literal - RexLiteral
• Struct field access - RexFieldAccess
• Function call - RexCall
• Window expression - RexOver

Row Expressions
[Code example slide; callouts: input ref, function call]
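
A row expression is a small tree that is evaluated against one row at a time. The sketch below models three of the kinds listed above (input reference, literal, function call) in plain Python. `InputRef`, `Literal`, `Call`, and the `OPERATORS` table are illustrative assumptions and do not mirror the actual RexNode classes.

```python
import operator
from dataclasses import dataclass
from typing import Any, Sequence, Tuple


@dataclass(frozen=True)
class InputRef:
    index: int                    # position of a column in the input row

    def evaluate(self, row: Sequence[Any]) -> Any:
        return row[self.index]


@dataclass(frozen=True)
class Literal:
    value: Any

    def evaluate(self, row: Sequence[Any]) -> Any:
        return self.value


# A tiny, made-up operator table for the sketch.
OPERATORS = {
    ">": operator.gt,
    "=": operator.eq,
    "+": operator.add,
    "AND": lambda a, b: a and b,
}


@dataclass(frozen=True)
class Call:
    op: str
    operands: Tuple[Any, ...]

    def evaluate(self, row: Sequence[Any]) -> Any:
        args = [operand.evaluate(row) for operand in self.operands]
        return OPERATORS[self.op](*args)


# Filter condition: column 1 > 100 AND column 2 = 'US'
condition = Call("AND", (
    Call(">", (InputRef(1), Literal(100))),
    Call("=", (InputRef(2), Literal("US"))),
))

rows = [(1, 250, "US"), (2, 50, "US"), (3, 500, "DE")]
kept = [row for row in rows if condition.evaluate(row)]
```

Referring to columns by position rather than by name is what lets the same expression tree be reused as a projection field, a filter condition, or a join condition.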

Traits
• Defined by the RelTrait interface
• A trait represents a property of a relational
expression that does not alter execution
• Traits are used to validate plan output
• Three primary trait types:
  • Convention
  • RelCollation
  • RelDistribution

Conventions
• Convention is a type of RelTrait
• A Convention is associated with a
RelNode interface
• SparkConvention, JdbcConvention,
EnumerableConvention, etc.
• Conventions are used to represent a single
data source
• Inputs to a relational expression must be in
the same convention

Conventions
[Plan diagram; callouts: Spark convention, JDBC convention, converter]
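
The requirement that inputs share their parent's convention can be illustrated with a toy pass that wraps any mismatched input in a converter node. This is only a sketch of the invariant: the `Rel` class and `add_converters` function are hypothetical, conventions are reduced to plain strings, and in Calcite itself converters are introduced by the planner through converter rules rather than by a standalone pass like this one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rel:
    op: str
    convention: str
    inputs: List["Rel"] = field(default_factory=list)


def add_converters(node: Rel) -> Rel:
    """Wrap any input whose convention differs from its parent's in a converter."""
    fixed = []
    for child in node.inputs:
        child = add_converters(child)
        if child.convention != node.convention:
            child = Rel(f"Converter[{child.convention}->{node.convention}]",
                        node.convention, [child])
        fixed.append(child)
    return Rel(node.op, node.convention, fixed)


# A Spark join reading one Spark input and one JDBC input.
plan = Rel("Join", "SPARK", [
    Rel("TableScan(events)", "SPARK"),
    Rel("TableScan(users)", "JDBC"),
])
converted = add_converters(plan)
```

After the pass, every input of the join reports the Spark convention, and the JDBC scan sits underneath a converter that marks where data crosses between engines.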

Rules
• Rules are used to modify query plans
• Defined by the RelOptRule abstract class
• Two types of rules: converters and
transformers
• Converter rules extend ConverterRule and
convert from one convention to another
• Rules are matched to elements of a query
plan using pattern matching
• onMatch is called for matched rules
• Converter rules applied via convert

Converter Rule
[Code example slide; callouts: expression type, input convention, converted convention, converter function]

Pattern Matching
[Plan diagram; one subtree labeled "match!", another labeled "no match :-("]
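
Matching a rule against a plan amounts to comparing a small pattern tree with each subtree of the plan. The sketch below is a simplified model of that idea: `Operand`, `Rel`, and `find_matches` are made-up names, operators are identified by a string rather than a class, and a pattern with no children is assumed to accept any inputs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rel:
    kind: str
    inputs: List["Rel"] = field(default_factory=list)


@dataclass
class Operand:
    kind: str
    children: List["Operand"] = field(default_factory=list)   # empty = any inputs

    def matches(self, rel: Rel) -> bool:
        if rel.kind != self.kind:
            return False
        if not self.children:
            return True
        return (len(self.children) == len(rel.inputs)
                and all(p.matches(r) for p, r in zip(self.children, rel.inputs)))


def find_matches(pattern: Operand, root: Rel) -> List[Rel]:
    """Return every node in the plan whose subtree matches the pattern."""
    found = [root] if pattern.matches(root) else []
    for child in root.inputs:
        found.extend(find_matches(pattern, child))
    return found


# Pattern: a Filter whose only input is a Join.
pattern = Operand("Filter", [Operand("Join")])

plan = Rel("Project", [
    Rel("Filter", [                       # matches: Filter over Join
        Rel("Join", [
            Rel("Filter", [Rel("Scan")]), # no match: Filter over Scan
            Rel("Scan"),
        ]),
    ]),
])
matches = find_matches(pattern, plan)
```

A planner would then invoke the rule's callback (onMatch in Calcite) once for each matched node.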

Planners
• Planners implement the RelOptPlanner
interface
• Two types of planners:
  • HepPlanner
  • VolcanoPlanner

Heuristic Optimization
• HepPlanner is a heuristic optimizer similar
to Spark’s optimizer
• Applies all matching rules until none can be
applied
• Heuristic optimization is faster than
cost-based optimization
• Risk of infinite recursion if rules make
opposing changes to the plan
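
The "apply rules until none can be applied" loop, and the non-termination risk that comes with it, can be sketched in a few lines. This is a toy fixpoint driver and not HepPlanner: plans are nested tuples such as `("filter", condition, input)`, the rule functions are invented for the example, and the `max_iterations` guard is an assumption added to make the risk visible.

```python
def rewrite(node, rule):
    """Apply `rule` once at every node, bottom-up; return the rewritten tree."""
    if node[0] != "scan":
        node = node[:-1] + (rewrite(node[-1], rule),)   # last item is the input
    result = rule(node)
    return node if result is None else result


def optimize(plan, rules, max_iterations=100):
    """Apply all rules repeatedly until a full pass changes nothing."""
    for _ in range(max_iterations):
        changed = False
        for rule in rules:
            new_plan = rewrite(plan, rule)
            if new_plan != plan:
                plan, changed = new_plan, True
        if not changed:
            return plan          # fixpoint reached
    raise RuntimeError("rules did not converge; they may undo each other")


def drop_true_filter(node):
    # A filter whose condition is always true can be removed.
    if node[0] == "filter" and node[1] == "TRUE":
        return node[2]
    return None


def merge_filters(node):
    # Two stacked filters collapse into one conjunction.
    if node[0] == "filter" and node[2][0] == "filter":
        inner = node[2]
        return ("filter", f"({node[1]}) AND ({inner[1]})", inner[2])
    return None


plan = ("filter", "TRUE", ("filter", "a > 1", ("filter", "b < 5", ("scan", "t"))))
optimized = optimize(plan, [drop_true_filter, merge_filters])
print(optimized)
```

If two rules reverse each other, for example one that pushes a filter below a project and one that pulls it back up, every pass changes the plan and the loop ends only because of the iteration limit.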

Cost-based Optimization
• VolcanoPlanner is a cost-based
optimizer
• Applies matching rules iteratively, selecting
the plan with the lowest cost on each
iteration
• Costs are provided by relational expressions
• Not all possible plans can be computed
• Stops optimization when the cost does not
significantly improve over a given
number of iterations

• Cost is provided by each RelNode
• Cost is represented by RelOptCost
• Cost typically includes row count, I/O, and
CPU cost
• Cost estimates are relative
• Statistics are used to improve accuracy of
cost estimations
• Calcite provides utilities for computing
various resource-related statistics for use in
cost estimations
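
To show how statistics feed a cost comparison, here is a toy cost model that picks the cheapest of several join orders. Everything in it is an assumption made for illustration: the `Cost` class, the row counts, the fixed join selectivity, the nested-loop cost formula, and the exhaustive enumeration of orders (a real cost-based planner explores rule-generated alternatives instead of every permutation).

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True, order=True)
class Cost:
    # Compared field by field: rows first, then cpu, then io.
    rows: float
    cpu: float
    io: float

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(self.rows + other.rows, self.cpu + other.cpu, self.io + other.io)


# Made-up table statistics and an assumed selectivity for every join.
ROW_COUNTS = {"orders": 1_000_000, "customers": 10_000, "countries": 200}
SELECTIVITY = 0.001


def scan_cost(table: str) -> Cost:
    rows = ROW_COUNTS[table]
    return Cost(rows=rows, cpu=rows, io=rows)


def plan_cost(join_order) -> Cost:
    """Cumulative cost of a left-deep chain of nested-loop joins."""
    total = scan_cost(join_order[0])
    current_rows = ROW_COUNTS[join_order[0]]
    for table in join_order[1:]:
        right = scan_cost(table)
        out_rows = current_rows * right.rows * SELECTIVITY
        join = Cost(rows=out_rows, cpu=current_rows * right.rows, io=0)
        total = total + right + join
        current_rows = out_rows
    return total


best = min(permutations(ROW_COUNTS), key=plan_cost)
print(best, plan_cost(best))
```

The absolute numbers mean nothing on their own; only the ordering between candidate plans matters, which is why relative estimates are sufficient. Under these made-up statistics, joining the two small tables first keeps the intermediate result small, so the largest table is joined last.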

Putting it all together
