Tackling Big Data with Hadoop and
Open Source Integration




                         Ciaran Dynes
                         Remy Dubois
Agenda



 1. Talend’s Goal: Democratizing Integration

 2. What is Big Data (integration)?

 3. Big Data for the Masses: Talend’s strategy and vision

© Talend 2011
Our goal
Talend – The Market-Leading Unified Integration Platform

Talend Enterprise: Data Quality, Data Integration, MDM, ESB, BPM
 •  Commercial license
 •  Subscription model

Studio, Repository, Deployment, Execution, Monitoring

Talend Open Studio for Data Quality, Data Integration, MDM, ESB
 •  Open source license
 •  Free of charge
 •  Optional support

Recognized as the open source leader in each of its market
            categories by all industry analysts
Who uses Talend?

 A high adoption rate

  •  20 million downloads
  •  950,000 users
  •  3,500 customers

  •  1 product download every 30 seconds
  •  150 new customers per month
Trying to get from this…

© Talend 2011 – Strictly Private & Confidential
to this…




 Why Talend…

 ONLY Talend generates code that is executed within MapReduce. This
 open approach removes the limitation of a proprietary “engine” to
 provide a truly unique and powerful set of tools for big data.
Big data is…



Hans Rosling – uses big data to analyze world health trends

Key Takeaway #1
transactions, interactions, observations
Big Data = Transactions + Interactions + Observations

[Figure: nested data sources, growing from ERP out to Big Data. Volume scale: mega-, giga-, tera-, petabytes. Horizontal axis: increasing data variety and complexity.]

 •  ERP: purchase detail, purchase record, payment record
 •  CRM: segmentation, offer details, customer touchpoints, support contacts
 •  WEB: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
 •  Big Data: sensors/RFID/devices, user generated content, sentiment, social interactions & feeds, mobile web, spatial & GPS coordinates, user clicks, external demographics, business data feeds, video/audio/images, SMS/MMS

Source: Hortonworks
What is Big Data integration?
Traditional Data Flows

[Diagram: CRM, ERP and Finance feed ETL and Data Quality, producing Normalized Data in a Traditional Data Warehouse. Roles shown: Business Analyst, Business User, Warehouse Administrator, Executives.]

 •  Scheduled daily or weekly, sometimes more frequently.
 •  Volumes rarely exceed terabytes
The new world of big data

[Diagram, built up over four slides: CRM, ERP and Finance feed Big Data, joined in turn by Social Networking, Mobile Devices, Transactions, Network Devices and Sensors.]

Key Takeaway #2

                 Forces us to think differently
But for Talend… big data is…

                …everything that is old is new again!

Data driven business

[Diagram: data, governance, information, decisions and your business, linked by arrows labeled "enables", "supports" and "drives".]

  Information provides value to the business.
  If you can't rely on your information then
  the result can be missed opportunities, or
  higher costs.
      Matthew West and Julian Fowler (1999). Developing High Quality Data Models.
      The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG data driven business

[Diagram: the same picture scaled up: BIG data, governance, BIG information, BIG decisions and BIG business, linked by arrows labeled "enables", "supports" and "drives". The slide repeats the West and Fowler (1999) quotation.]

“Big Data for the Masses”
Goal: Democratize Big Data

Talend Open Studio for Big Data
 •  “Big Data for the Masses”
     •  Improves efficiency of big data job design with a graphical interface
     •  Abstracts and generates code
     •  Runs transforms inside Hadoop
     •  Native support for HDFS, Pig, HBase, Sqoop and Hive
     •  Apache License 2.0
     •  Embedded in Hortonworks Data Platform

…an open source ecosystem
Let us show you…

© Talend 2012
Where to next?

How is big data integration being used?

 Use Cases
 •  Recommendation Engine
 •  Sentiment Analysis
 •  Risk Modeling
 •  Fraud Detection
 •  Marketing Campaign Analysis
 •  Customer Churn Analysis
 •  Social Graph Analysis
 •  Customer Experience Analytics
 •  Network Monitoring
 •  Research and Development

 BUT: what level of data quality (DQ) does your use case require?
Poor Data Quality + Big Data = Big Problems
Poor Data Quality * Big Data = Big Problems^2




           Key Takeaway #3
           In big data…
           poor data quality can be magnified at huge scale

Two methods for inserting data quality into a big data job




 1.  Pipelining: as part of the load process


 2.  Load the cluster, then implement and execute
     a data quality MapReduce job
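
The two methods above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Talend-generated code and not a real Hadoop job: the records, the field names (`id`, `email`) and the helper functions `cleanse`, `load_pipelined` and `load_then_clean` are all hypothetical, and the "map" and "reduce" phases are simulated in memory.

```python
from collections import defaultdict
from typing import Iterable, Optional

# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"id": "1", "email": " Ann@Example.com "},
    {"id": "2", "email": "not-an-email"},
    {"id": "1", "email": "ann@example.com"},
]


def cleanse(record: dict) -> Optional[dict]:
    """Normalize one record; return None if it fails a basic validity check."""
    email = record["email"].strip().lower()
    if "@" not in email:
        return None
    return {"id": record["id"], "email": email}


def load_pipelined(source: Iterable[dict]) -> list:
    """Method 1: apply data quality inline, so only clean rows reach the store."""
    store = []
    for record in source:
        cleaned = cleanse(record)
        if cleaned is not None:
            store.append(cleaned)
    return store


def load_then_clean(source: Iterable[dict]) -> list:
    """Method 2: load everything raw, then run a map/reduce-style DQ pass."""
    store = list(source)  # raw load, no checks applied yet
    # Map phase: emit cleaned records grouped by key, skipping invalid rows.
    grouped = defaultdict(list)
    for record in store:
        cleaned = cleanse(record)
        if cleaned is not None:
            grouped[cleaned["id"]].append(cleaned)
    # Reduce phase: keep one record per key, which also removes duplicates.
    return [records[0] for _, records in sorted(grouped.items())]


pipelined = load_pipelined(RAW)
in_cluster = load_then_clean(RAW)
print(len(pipelined), len(in_cluster))
```

In this sketch the pipelined load checks one record at a time, so both copies of customer 1 survive, while the in-cluster pass sees the whole data set and can also deduplicate by key. That difference is a property of this toy example; a pipelined flow could add its own deduplication step.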




E-T-L
      Extract – Transform – Load

E-DQ-L
      Extract – Improve/Cleanse – Load
Pipelining: data quality with big data

[Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices pass through DQ steps on their way into Big Data.]

 •  Use traditional data quality tools
 •  No new programming, no PhDs
 •  Once and done

Big data alternative: Load and improve within the cluster

[Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices load into Big Data, with the DQ steps running inside the cluster.]

 •  Load first, improve later
 •  Really complex to build, limited tools
 •  Constant on, increments
 •  Insane performance

Big data
[Timeline: 2012, now, Q4, 2013]

Talend Open Studio for Big Data
 •  Packaged within Hortonworks Data Platform
     …Eclipse tools for Hive, HDFS, Pig, Sqoop
     …supports Oozie, HCatalog, Kerberos

 •  Free to download and use under the Apache license
     …democratizing big data through intuitive tools

Thanks for attending
Sessions will resume at 11:25am




Tackling big data with hadoop and open source integration

  • 1. Tackling Big Data with Hadoop and Open Source Integration Ciaran Dynes Remy Dubois
  • 2. Agenda 1. Talend’s Goal: Democratizing Integration 2. What is Big Data (integration)? 3. Big Data for the Masses: Talend’s strategy and vision Ā© Talend 2011 2
  • 4. Talend – The Market Leading Unified Integration Platform Talend Enterprise Data Data MDM ESB BPM Quality Integration ¾  Commercial license ¾  Subscription model Studio Repository Deployment Execution Monitoring ¾  Open source license Talend Open Studio for ¾  Free of charge ¾  Optional support Data Data Quality Integration MDM ESB Recognized as the open source leader in each of its market category by all industry analysts Ā© Talend 2011 4
  • 5. Who uses Talend? A high adoption rate § 20 million downloads § 950,000 users § 3,500 customers 1 product download 150 new customers every 30 seconds per month Ā© Talend 2011 5
  • 6. Trying to get from this… Ā© Talend 2011 – Stri2y Private & Confidential Ā© Talend 2011 6
  • 7. to this… Why Talend… ONLY Talend generates code that is executed within map reduce. This open approach removes the limitation of a proprietary ā€œengineā€ to provide a truly unique and powerful set of tools for big data.
  • 8. Big data is…. Hans Rosling – uses big data to analyze world health trends Key Takeaway #1 transactions, interactions, observations Ā© Talend 2011 – Stri2y Private & Confidential Ā© Talend 2011 8
  • 9. Big Data = Transactions + Interactions + Observations. Diagram (source: Hortonworks) showing ERP, CRM, WEB and Big Data. Scale: Mega-, Giga-, Tera-, Petabytes. Axis: Increasing Data Variety and Complexity. Labels shown: Sensors/RFID/Devices, User Generated Content, Sentiment, Social Interactions & Feeds, Mobile Web, Spatial & GPS coordinates, User Clicks, External Demographics, Web logs, Business Data Feeds, Offer history, A/B testing, Video, Audio, Images, Dynamic pricing, SMS/MMS, Segmentation, Affiliate Networks, Search Marketing, Offer details, Purchase detail, Customer Touchpoints, Behavioral Targeting, Purchase record, Support Contacts, Dynamic Funnels, Payment record.
  • 10. What is Big Data integration?
  • 11. Traditional Data Flows. Diagram: CRM, ERP and Finance sources pass through ETL and Data Quality as normalized data into a traditional data warehouse, which serves the Business Analyst, Business User, Warehouse Administrator and Executives. Scheduled daily or weekly, sometimes more frequently. Volumes rarely exceed terabytes.
  • 12. The new world of big data: Social Networking, CRM, ERP, Finance, Big Data.
  • 13. The new world of big data: Social Networking, CRM, Mobile Devices, ERP, Finance, Big Data.
  • 14. The new world of big data: Social Networking, CRM, Mobile Devices, ERP, Transactions, Finance, Big Data.
  • 15. The new world of big data: Social Networking, CRM, Mobile Devices, ERP, Transactions, Finance, Network Devices, Sensors, Big Data.
  • 16. Key Takeaway #2: Forces us to think differently.
  • 17. But for Talend… Big data is… everything that is old, is new again!
  • 18. Data driven business. Diagram labels: data governance, information, decisions, your business; relationships: enables, supports, drives. Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs. Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
  • 19. BIG data driven business. Diagram labels: BIG data governance, BIG information, BIG decisions, BIG business; relationships: enables, supports, drives. Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs. Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
  • 20. “Big Data for the Masses”
  • 21. Goal: Democratize Big Data. Talend Open Studio for Big Data: “Big Data for the Masses”; improves efficiency of big data job design with a graphical interface; abstracts and generates code; run transforms inside Hadoop Pig; native support for HDFS, Pig, HBase, Sqoop and Hive; Apache License 2.0; embedded in Hortonworks Data Platform. …an open source ecosystem
  • 22. Let us show you… © Talend 2012
  • 23. Where to next?
  • 24. How is big data integration being used? Use cases: Recommendation Engine; Sentiment Analysis; Risk Modeling; Fraud Detection; Marketing Campaign Analysis; Customer Churn Analysis; Social Graph Analysis; Customer Experience Analytics; Network Monitoring; Research and Development. BUT: to what level is data quality (DQ) required for your use case?
  • 25. Poor Data Quality + Big Data = Big Problems. Poor Data Quality * Big Data = Big Problems^2. Key Takeaway #3: In big data… poor data quality can be magnified at huge scale.
  • 26. Two methods for inserting data quality into a big data job: 1. Pipelining: as part of the load process. 2. Load the cluster, then implement and execute a data quality MapReduce job.
  • 27. E-T-L: Extract – Transform – Load
  • 28. E-DQ-L: Extract – Improve/Cleanse – Load
  • 29. Pipelining: data quality with big data. Diagram: CRM, ERP, Finance, Social Networking, Mobile Devices; DQ; Big Data. Use traditional data quality tools. No new programming, no PhDs. Once and done.
  • 30. Big data alternative: Load and improve within the cluster. Diagram: CRM, ERP, Finance, Social Networking, Mobile Devices; Big Data; DQ. Load first, improve later. Really complex to build, limited tools. Constant on, increments. Insane performance.
  • 31. Big data (timeline labels: 2012, now, Q4, 2013). Talend Open Studio for Big Data: packaged within Hortonworks Data Platform (…Eclipse tools for Hive, HDFS, Pig, Sqoop; …supports Oozie, HCatalog, Kerberos); free to download and use under the Apache license (…democratizing big data through intuitive tools).
  • 33. Sessions will resume at 11:25am