MAKING BIG DATA COME ALIVE
The key to unlocking the Value in the Internet of Things?
Managing the Data!
For Big Data the Key is Variety!
4/25/2016© 2015 Think Big, a Teradata Company
Definition: Datasets so complex and large that they are
awkward to work with using standard tools and techniques
Data types: Location, Social, Images, Weblogs, Videos, Text, Audio, Sensor
Size is not what is most important; it’s variety
Example Use Cases
• Predictive Maintenance
• Search and view details of an issue on the fly
• Identify critical alerts
• Root cause analysis
• Understanding usage
• And many more!
Changing Technology Landscape
Data Lake Architecture
© 2015 Teradata
[Diagram] Stages: Acquisition, Preparation, Access
• Sources: sensors, email, social, telemetry, mobile, tabular data, machine logs
• Platform functions: message queues, feeds, streams, cleansing, aggregations, governance, search, access, experiments
• Foundation: distributed storage; security, metadata/lineage, administration
• Analytic tools & apps: math and stats, data mining, business intelligence, applications, languages, marketing
• Users: marketing, executives, operational systems, frontline workers, customers, partners, engineers, data scientists, business analysts
Reference Information Architecture
[Diagram] Stages: Acquisition, Preparation, Publishing. Legend: “New with Big Data”
• Sources: ERP, SCM, CRM, images, audio and video, machine logs, text, web and social
• Data layers: Secured Landing; Validation & Key Resolution; Common Keys; Derived Values, Sensitive Data Protection; Common Summaries; Optimized Structures; Shared Views & Obfuscation; User Defined Data Sets
• Other functions shown: Security, Workload Management; Search; Profiling, Masking, Obfuscation
• Analytic tools & apps: math and stats, data mining, business intelligence, applications, marketing
• Users: business analysts, data scientists, data modelers, IT
How is Data Management Changing?
• Schema on Read?
– Yes… as step one
– But data still has underlying structure
– It’s more like agile modeling: reflect as much structure as needed
• Loosely coupled schemas lose platform guarantees but gain application flexibility
• Data Modeling isn’t dead!
• Metadata is more important than ever
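A minimal sketch of the "schema on read as step one" idea: records are stored as they arrived, and each reader declares only as much structure as its analysis needs. The field names, the `READ_SCHEMA` mapping, and the `_unmodeled` key are illustrative assumptions, not part of the original deck.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"device_id": "d1", "ts": "2015-01-01T13:16:11", "temp_c": 41.5, "fw": "1.2"}',
    '{"device_id": "d2", "ts": "2015-01-01T13:16:14", "temp_c": "39", "extra": {"TstA": 1}}',
]

# Read-time schema (an assumption for this sketch): field name -> converter.
# Only the structure this analysis needs is declared.
READ_SCHEMA = {"device_id": str, "ts": str, "temp_c": float}


def read_with_schema(line, schema):
    """Parse one raw line and project it onto the reader's schema."""
    raw = json.loads(line)
    row = {name: cast(raw[name]) if name in raw else None
           for name, cast in schema.items()}
    # Fields the schema does not model are kept, not dropped, so a later
    # reader can add structure without reloading the data.
    row["_unmodeled"] = {k: v for k, v in raw.items() if k not in schema}
    return row


ROWS = [read_with_schema(line, READ_SCHEMA) for line in RAW_LINES]
```

The trade-off from the slide shows up directly: nothing stops a writer from sending `temp_c` as a string, so the reader has to coerce it, but a new field such as `fw` needs no platform change.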
Changes in Logical Modeling
• JSON-like structures
– Complex collections of relations, arrays, and maps of items
• Graphs
– Storing complex relationships that change dynamically rather than staying static
• Binary/CLOB/specialized data
– Ability to execute specialized programs to interpret and process the data
Patterns
Important New Patterns
• Denormalized Fact
• Profile
• Event History
• Timeline
• Network
• Distributed Sources
• Late Data
• Deep Aggregates
• Recovery
• Multiple Active Clusters
Event History Pattern
Event id | Actor id | Time | Event col’s | Dim id’s | Dim col’s | Ext. Data
123 | uid1 | 1/1/15 13:16:11 | … | … | … | { "TstA" : 1 … }
456 | uid2 | 1/1/15 13:16:14 | … | … | … | { "TstB" : 1 … }
• Fact table about common events, to allow, e.g., analytics in context
– E.g., wearable device, telematics
• Stored in columnar format (e.g., Parquet, ORCFile)
• Join to the as-was value of slowly changing dimensions (the value in effect at event time)
• Often has an “extension” column of unparsed or unmodeled JSON-like data
• Partitioned by event-time buckets, perhaps also by other dimension(s)
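The as-was join and time-bucket partitioning of the Event History pattern can be sketched as below. This is an illustration under assumptions: the firmware dimension, the hourly bucket size, and all names (`FIRMWARE_HISTORY`, `as_was`, `partition_key`) are hypothetical, and a real deployment would do this in a columnar engine rather than in plain Python.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical slowly changing dimension: actor -> [(valid_from, firmware)], sorted by time.
FIRMWARE_HISTORY = {
    "uid1": [(datetime(2014, 12, 1), "1.0"), (datetime(2015, 1, 1, 13, 0), "1.1")],
    "uid2": [(datetime(2014, 11, 15), "1.0")],
}

# Hypothetical events; "ext" is the unmodeled JSON-like extension column.
EVENTS = [
    {"event_id": 123, "actor_id": "uid1", "time": datetime(2015, 1, 1, 13, 16, 11), "ext": {"TstA": 1}},
    {"event_id": 456, "actor_id": "uid2", "time": datetime(2015, 1, 1, 13, 16, 14), "ext": {"TstB": 1}},
    {"event_id": 789, "actor_id": "uid1", "time": datetime(2014, 12, 31, 9, 0, 0), "ext": {}},
]


def as_was(history, when):
    """Return the dimension value in effect at `when`, or None if none applied yet."""
    starts = [valid_from for valid_from, _ in history]
    i = bisect_right(starts, when)
    return history[i - 1][1] if i else None


def partition_key(event):
    """Bucket by hour of event time (the bucket size is an assumption)."""
    return event["time"].strftime("%Y-%m-%d-%H")


def denormalize(event):
    """Attach the as-was dimension value and the partition to one event row."""
    row = dict(event)
    row["firmware_as_was"] = as_was(FIRMWARE_HISTORY.get(event["actor_id"], []), event["time"])
    row["partition"] = partition_key(event)
    return row


FACT_ROWS = [denormalize(e) for e in EVENTS]
```

Resolving the dimension at write time means later analysis sees the firmware that was actually running when the event happened, not whatever the device runs today.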
Timeline Pattern
Actor id | Segments | Ev1:id | Ev1:fact | Ev1:dims | Ev1:ext | Ev2:id | …
uid1 | [1, 3, 7] | 123 | 1/1/15 13:16:11 | … | { "TstA" : 1 … } | 789 | …
uid2 | [2, 3] | 456 | 1/1/15 13:16:14 | … | { "TstB" : 1 … } | 0ab | …
• Pivot on event history: table of actors with events over time
– Device history, usage in consumer journey
– Enable support/analysis on specific items, long-lived analysis
• May have hierarchy of actors (e.g., cluster, device, component)
• May be an array of events, many columns, or subsorted (cluster key)
• Also stored in columnar format, may be partitioned
• May be updated in near real-time AND batch
• Often holds cached algorithm values (combined with Profile)
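One way to picture the Timeline pattern is as a pivot of event-history rows into one row per actor, built in batch and then updated as events arrive. The sketch below uses the "array of events" variant; the sample rows, `build_timelines`, `apply_event`, and the cached `event_count` value are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event-history rows; timestamps are ISO-8601 strings.
EVENT_HISTORY = [
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14", "ext": {"TstB": 1}},
    {"event_id": "789", "actor_id": "uid1", "time": "2015-01-02T08:00:00", "ext": {}},
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11", "ext": {"TstA": 1}},
]


def build_timelines(events):
    """Batch path: pivot events into one row per actor with a time-ordered array."""
    by_actor = defaultdict(list)
    for event in events:
        by_actor[event["actor_id"]].append(event)
    timelines = {}
    for actor_id, actor_events in by_actor.items():
        # ISO-8601 strings in one format sort chronologically.
        actor_events.sort(key=lambda e: e["time"])
        timelines[actor_id] = {
            "actor_id": actor_id,
            "events": actor_events,
            # Cached derived value stored next to the timeline.
            "event_count": len(actor_events),
        }
    return timelines


def apply_event(timelines, event):
    """Near-real-time path: merge one new event into its actor's row."""
    row = timelines.setdefault(
        event["actor_id"],
        {"actor_id": event["actor_id"], "events": [], "event_count": 0},
    )
    row["events"].append(event)
    row["events"].sort(key=lambda e: e["time"])
    row["event_count"] = len(row["events"])


TIMELINES = build_timelines(EVENT_HISTORY)
apply_event(TIMELINES, {"event_id": "555", "actor_id": "uid2",
                        "time": "2015-01-01T09:00:00", "ext": {}})
```

Because both paths produce the same row shape, a batch rebuild can periodically replace rows that the streaming path has been patching.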
Network
• Ongoing status of configuration
– Parts in assembly
– Related items (versions)
– Peer groups
• For physical configuration and/or software components
• Maintain links in graph structure
– May be current or historical
• Use links to pull full context from Event History or Timeline
• Search -> simple query -> complex analytics
– E.g., transitive closure, impact analysis
• Technologies
– BlazeDB, TitanDB, Neo4j
– Spark GraphX & GraphFrames, Giraph
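The transitive closure and impact analysis mentioned for the Network pattern reduce to graph reachability. Below is a small in-memory sketch using breadth-first search; the part names and the `CONTAINS` links are made up, and a production system would more likely run this in one of the graph technologies listed.

```python
from collections import deque

# Hypothetical "contains" links: parent item -> parts assembled into it.
CONTAINS = {
    "cluster-1": ["device-A", "device-B"],
    "device-A": ["head-7", "controller-3"],
    "device-B": ["controller-3"],
    "controller-3": ["firmware-1.1"],
}


def reachable(graph, start):
    """All nodes reachable from `start` by following links (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(graph):
    """Reverse every link, turning "contains" into "is used in"."""
    inverted = {}
    for parent, children in graph.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


USED_IN = invert(CONTAINS)
# Impact analysis: everything that directly or indirectly includes the firmware.
IMPACTED = reachable(USED_IN, "firmware-1.1")
# Transitive closure downward: every part inside device-A.
PARTS_OF_A = reachable(CONTAINS, "device-A")
```

The ids returned here are the keys one would then use to pull full context from the Event History or Timeline tables.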
Late Data
• Delays from intermittent connectivity, upstream failures
• Lineage tracking is critical
• Watermarks to identify when sufficient data has arrived (based on statistics, upstream)
• May trigger early, on time & late
• Report on how much data has arrived late
[Figure: Zipfian Distribution]
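A toy illustration of watermark-driven firing for late data. It assumes a simple heuristic watermark (newest event time seen minus a fixed delay) rather than one derived from statistics or upstream signals, and it implements only on-time and late firings, not early ones. `WINDOW`, `ALLOWED_DELAY`, and the `WindowedCounter` class are hypothetical.

```python
from collections import defaultdict

WINDOW = 60          # window length in seconds (assumed)
ALLOWED_DELAY = 30   # how far the watermark trails the newest event time (assumed)


class WindowedCounter:
    """Counts events per event-time window and records on-time and late firings."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start -> events counted so far
        self.closed = set()             # windows that already fired on time
        self.max_event_time = 0
        self.late_events = 0
        self.firings = []               # (kind, window_start, count)

    def watermark(self):
        return self.max_event_time - ALLOWED_DELAY

    def add(self, event_time):
        start = event_time - event_time % WINDOW
        self.counts[start] += 1
        if start in self.closed:
            # The window was already reported: count the straggler and re-emit.
            self.late_events += 1
            self.firings.append(("late", start, self.counts[start]))
        self.max_event_time = max(self.max_event_time, event_time)
        # Fire every open window whose end the watermark has now passed.
        for w in sorted(self.counts):
            if w not in self.closed and w + WINDOW <= self.watermark():
                self.closed.add(w)
                self.firings.append(("on_time", w, self.counts[w]))


COUNTER = WindowedCounter()
ARRIVALS = [5, 20, 70, 95, 40, 130, 155]  # event times in arrival order; 40 arrives late
for t in ARRIVALS:
    COUNTER.add(t)
LATE_FRACTION = COUNTER.late_events / len(ARRIVALS)
```

`LATE_FRACTION` is the kind of figure the last bullet asks to report: if it grows, the delay behind the watermark is probably too short for the actual connectivity.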
Case Study
Case Study: Overview
• Global manufacturer of storage devices: hard drives, SSDs, object storage
• Produces hundreds of millions of devices annually
• Each device contains multiple complex components
– Manufacturing sites are geographically dispersed
– Some components are sourced from suppliers
– Each device generates ~100-1,000 MB of data during its lifecycle
Confidential
Business Challenges
• Need to speed cycle time for new product development
• Customers demanding faster Failure Analysis
• Engineers wasting time playing “Where’s Waldo” with the data
Technical Challenges
• Difficulty storing & exposing binary and other data types
• Current DWs unable to keep pace with the volume
• No platform for large-scale analytics
• Data silos across manufacturing facilities
Goal: Expose the entire “DNA” of the device, from development and manufacturing to reliability testing and the “living behavior” of the device, to increase operational efficiency and quality
-- Chief Information Officer
Platform Overview
[Diagram]
• Data sources: Site 1, Site 2, Site 3, Site 4, Final Assembly, Customer Data, Supplier, Shop Floor Data, Shipment Data, ...
• Big Data Platform: End-to-End Integrated Data; Parallelized batch analytics; App-Specific Views; New High-Value Parameters
• Flow labels: raw extracts; enriched data
• Consumers: Ad hoc Analysis, Defect Pattern Recognition, Enterprise DW, Batch Analytics
• Applications: End-to-End Traceability, Tester Failure Analytics, Failure Analysis, Customer data lookup, ...
Use Case 1: Binary Data…with daily decoding changes
• Large volumes of binary data:
– Must be retained for 5 years for warranty reasons, leading to petabytes of binary objects
• Schema on read:
– Development/Process Engineering teams change the manufacturing/test data very frequently; thus, the decoding of the binary data changes very frequently.
– It is very difficult to keep pace with these changes with a traditional RDBMS, often leading to time-consuming data purging and reloading
[Figure: Parsing]
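A sketch of how schema on read can absorb frequent decoding changes for binary data: the raw blobs stay untouched, and only a table of decoders evolves. The one-byte version prefix, both record layouts, and the field names are assumptions made for illustration; the case study does not describe its actual formats.

```python
import struct

# Hypothetical record layouts keyed by a one-byte format version that
# prefixes each stored blob. "<" means little-endian with no padding.
DECODERS = {
    1: (struct.Struct("<HI"), ("head_id", "error_count")),
    2: (struct.Struct("<HIf"), ("head_id", "error_count", "temp_c")),
}


def decode(blob):
    """Decode one raw blob at read time using the layout for its version."""
    version = blob[0]
    layout, names = DECODERS[version]
    values = layout.unpack(blob[1:1 + layout.size])
    return dict(zip(names, values), version=version)


# Sample blobs as they might sit in storage; they are never rewritten.
BLOB_V1 = bytes([1]) + struct.pack("<HI", 7, 42)
BLOB_V2 = bytes([2]) + struct.pack("<HIf", 7, 3, 36.5)

DECODED = [decode(BLOB_V1), decode(BLOB_V2)]
```

When engineering changes the test output again, the change is one new entry in `DECODERS`; years of stored objects need no purge and reload, which is the pain point the slide attributes to the RDBMS approach.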
Use Case 2: Wide Structures (Timeline)
Tens of thousands of parameters are collected over the course of 6-8 months for a single device. A wide, denormalized structure reduces the complexity of end-user analysis.
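To make the "wide, denormalized structure" concrete, the sketch below pivots long-format measurements into one wide row per device. The parameter names and values are invented, and a real table would hold tens of thousands of columns in a columnar store rather than a Python dict.

```python
# Hypothetical long-format measurements: (device id, parameter name, value).
MEASUREMENTS = [
    ("dev-1", "seek_time_ms", 8.5),
    ("dev-1", "spindle_rpm", 7200),
    ("dev-2", "seek_time_ms", 9.1),
]


def to_wide(measurements):
    """Pivot (device, parameter, value) triples into one wide row per device."""
    wide = {}
    for device_id, parameter, value in measurements:
        wide.setdefault(device_id, {"device_id": device_id})[parameter] = value
    return wide


WIDE = to_wide(MEASUREMENTS)

# A per-device question becomes a single row lookup instead of a join per parameter.
SLOW_SEEKERS = [row["device_id"] for row in WIDE.values()
                if row.get("seek_time_ms", 0) > 9.0]
```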
Use Case 3: “Un-Paralleled” Parallel Analysis
“Solve new problems: exposing previously “untapped” data sources at a scale that allowed for identification of the patterns causing the issues, e.g., scanning 380 billion test points for 8 million products. Several irregular distributions were found, which allowed the team to identify a code-level bug that was causing the failures (and therefore scrapped drives).”
Conclusions
• IoT is about blending data
• Data management patterns & practices are foundational
• They lead to effective analytics
• Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com
• Big Data Strategy & Roadmap: How can my organization get value from Big Data?
• Data Lake Implementation: How do we build a best-practices Big Data environment that will meet our needs?
• Analytics & Data Science: How do we reap value from our Big Data investment?
• Training & Managed Services: How do we keep our people and environment operating at a high level?
Hadoop, Spark Solutions since 2010.
We’re Hiring
Customer Satisfaction Index Analytic Solution Announcement
• Incorporate all data from all touch points to understand customers’ true behavior
• Leverage multi-genre advanced analytics techniques to generate behavior-based insights
• Available NOW


The key to unlocking the Value in the IoT? Managing the Data!

  • 1. MAKING BIG DATA COME ALIVE The key to unlocking the Value in the Internet of Things? Managing the Data!
  • 2. 2 For Big Data the Key is Variety! 4/25/2016© 2015 Think Big, a Teradata Company Definition: Datasets so complex and large that they are awkward to work with using standard tools and techniques Location Social Images Weblogs Videos Text Audio Sensor Size is not what is most important; it’s variety
  • 3. 3 Example Use Cases • Predictive Maintenance • Search and view detail on issue on the fly • Identify critical alerts • Root cause analysis • Understanding usage • And many more!
  • 5. 5 © 2015 Teradata AccessPreparationAcquisition Data Lake Architecture Math and Stats Data Mining Business Intelligence Applications Languages Marketing ANALYTIC TOOLS & APPS USERS Marketing Executives Operational Systems Frontline Workers Customers Partners Engineers Data Scientists Business Analysts Streams SearchAggregations Security, Metadata/Lineage, Administration Distributed Storage Msg. queues Cleansing Access ExperimentsGovernanceFeeds SOURCES Sensors email Social Telemetry Mobile Tabular Data Machine logs C
  • 6. 6 © 2015 Teradata REFERENCE INFORMATION ARCHITECTURE New with Big Data Security, Workload ManagementPublishingPreparation SecuredLanding Acquisition SharedViews&Obfuscation OptimizedStructures CommonKeys DerivedValues,SensitiveDataProtection CommonSummaries UserDefinedDataSets Validation&KeyResolution ERP SCM CRM Images Audio and Video Machine Logs Text Web and Social SOURCES Business Analysts Math and Stats Data Mining Business Intelligence Applications Marketing ANALYTIC TOOLS & APPS Search Profiling,Masking,Obfuscation Data Scientists Business Analysts Data Modelers IT
  • 7. 7 How is Data Management Changing? • Schema on Read? – Yes… as step one – But data still has underlying structure – It’s more like agile modeling – reflect as much structure as needed • Loosely coupled schemas loses platform guarantees but gains more application flexibility • Data Modeling isn’t dead! • Metadata is more important than ever 4/25/2016© 2015 Think Big, a Teradata Company
  • 8. 8 Changes in Logical Modeling • JSON-like structures – Complex collections of relations, arrays, map of items • Graphs – Storing complex, dynamically changing not static relationships • Binary/CLOB/specialized data – Ability to execute specialized programs to interpret and process 4/25/2016© 2015 Think Big, a Teradata Company
  • 10. 10 Important New Patterns • Denormalized Fact • Profile • Event History • Timeline • Network • Distributed Sources • Late Data • Deep Aggregates • Recovery • Multiple Active Clusters 4/25/2016© 2015 Think Big, a Teradata Company
  • 11. 11 Event id Actor id Time Event col’s Dim id’s Dim col’s Ext. Data 123 uid1 1/1/15 13:16:11 … … … { “TstA” : 1 …} 456 uid2 1/1/15 13:16:14 … … … { “TstB” : 1 …} • Fact table about common events to allow e.g., analytics in context – E.g., wearable device, telematics • Stored in columnar format (e.g., Parquet, ORCfile) • Join as was value of slowly changing dimensions • Often “extension” column of unparsed/not modeled JSON-like data • Partitioned by event time buckets, perhaps also by other dimension(s) Event History Pattern 4/25/2016© 2015 Think Big, a Teradata Company
  • 12. 12 Actor id Segment s Ev1:id Ev1:fact Ev1:dims Ev1:ext Ev2:id … uid1 [1, 3, 7] 123 1/1/15 13:16:11 … { “TstA” : 1 …} 789 … uid2 [2, 3] 456 1/1/15 13:16:14 … { “TstB” : 1 …} 0ab … • Pivot on event history: table of actors with events over time – Device history, usage in consumer journey – Enable support/analysis on specific items, long-lived analysis • May have hierarchy of actors (e.g., cluster, device, component) • May be array of events, many columns or subsorted (cluster key) • Also stored in columnar format, may be partitioned • May be updated in near real-time AND batch • Often holds cached algorithm values (combined Profile) Timeline Pattern 4/25/2016© 2015 Think Big, a Teradata Company
  • 13. 13 • Ongoing status of configuration – Parts in assembly – Related items (versions) – Peer groups • For physical configuration and/or software components • Maintain links in graph structure – May be current or historical • Use links to pull full context from Event History or Timeline • Search -> simple query -> complex analytics – E.g., transitive closure, impact analysis • Technologies – BlazeDB, TitanDB, Neo4j – Spark GraphX & GraphFrame, Giraph Network 4/25/2016© 2015 Think Big, a Teradata Company
  • 14. 14 Late Data • Delays from intermittent connectivity, upstream failures • Lineage tracking is critical • Watermarks to identify when sufficient data has arrived (based on statistics, upstream) • May trigger early, on time & late • Report on how much data has arrived late 4/25/2016© 2015 Think Big, a Teradata Company Zipfian Distribution
  • 16. 16 • Global manufacturer of storage devices: hard-drives, SSDs, object storage • Produces 100’s of millions of devices annually • Each device contains multiple complex components – Manufacturing sites are geographically dispersed – Some components are sourced from suppliers – Each device generates ~100-1000MB of data during its lifecycle Case Study: Overview Confidential
  • 17. 17 Business Challenges Need to speed cycle time for new product development Customer’s demanding faster Failure Analysis Engineer’s wasting time playing “where’s Waldo” with the data Confidential
  • 18. 18 Technical Challenges Difficulty storing & exposing binary and other data types Current DW’s Unable to Keep Pace with the Volume No platform for large-scale analytics Data silos across manufacturing facilities Confidential
  • 19. Goal: Expose the entire “DNA” of the device, from development and manufacturing to reliability testing and the “living behavior” of the device, to increase operational efficiency and quality -- Chief Information Officer
  • 20. Platform Overview
      [Diagram labels, grouped as they appear on the slide]
      • Data Sources: Site 1, Site 2, Site 3, Site 4, Final Assembly, Customer Data, Supplier, Shop Floor Data, Shipment Data, ...
      • Big Data Platform: End-to-End Integrated Data, Parallelized batch analytics, App-Specific Views, New High-Value Parameters
      • Consumers: Ad hoc Analysis, Defect Pattern Recognition, Enterprise DW, Batch Analytics
      • Applications: End-to-End Traceability, Tester Failure Analytics, Failure Analysis, Customer data lookup, ...
      • Flow labels: raw extracts, Enriched data
  • 21. Use Case 1: Binary Data…with daily decoding changes
      • Large volumes of binary data:
          – Must be retained for 5 years for warranty reasons, leading to petabytes of binary objects
      • Schema on read:
          – Development/Process Engineering teams change the manufacturing/test data very frequently; thus, the decoding of the binary data changes very frequently.
          – It is very difficult to keep pace with these changes with a traditional RDBMS, often leading to time-consuming data purging and reloading
      [Diagram label: Parsing]
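The schema-on-read idea here can be illustrated with a small sketch: the raw bytes are stored once, and a versioned decoder is applied only at read time, so a decoding change means adding a layout rather than purging and reloading data. The field names, layouts, and `decode` helper below are hypothetical and are not the manufacturer's actual formats.

```python
import struct

# Hypothetical record layouts keyed by decoder version: (struct format, field names).
# "<" means little-endian with no padding; I = uint32, H = uint16, f = float32.
LAYOUTS = {
    1: ("<IH", ("serial", "temp_c")),
    2: ("<IHf", ("serial", "temp_c", "vibration")),
}


def decode(raw: bytes, version: int) -> dict:
    """Decode stored bytes at read time using the layout for the given version."""
    fmt, names = LAYOUTS[version]
    size = struct.calcsize(fmt)
    return dict(zip(names, struct.unpack(fmt, raw[:size])))


# The stored object never changes; only the decoder applied on read does.
raw = struct.pack("<IHf", 1001, 42, 0.5)
old_view = decode(raw, 1)
new_view = decode(raw, 2)
```

When engineering changes the test data format, only `LAYOUTS` gains an entry; objects already written stay untouched and can be re-read under either version.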
  • 22. Use Case 2: Wide Structures (Timeline)
      Tens of thousands of parameters are collected over the course of 6-8 months for a single device; a wide, de-normalized structure reduces the complexity for end-user analysis.
  • 23. Use Case 3: “Un-Paralleled” Parallel Analysis
      “Solve new problems - exposing previously “untapped” data sources at a scale that allowed for identification of patterns causing the issues, e.g., scan 380 billion test points for 8 million products. Several irregular distributions were found, which allowed the team to identify a code-level bug that was causing the failures (and therefore scrapped drives).”
  • 25. Conclusions
      • IoT is about blending data
      • Data management patterns & practices are foundational
      • Lead to effective analytics
      • Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com
  • 26. Think Big services
      • Big Data Strategy & Roadmap: How can my organization get value from big data?
      • Data Lake Implementation: How do we build a best-practices big data environment that will meet our needs?
      • Analytics & Data Science: How do we reap value from our big data investment?
      • Training & Managed Services: How do we keep our people and environment operating at a high level?
      Hadoop, Spark Solutions since 2010. We’re Hiring
  • 27. Customer Satisfaction Index Analytic Solution Announcement
      • Incorporate all data from all touch points to understand customers’ true behavior
      • Leverage multi-genre advanced analytics techniques to generate behavior-based insights
      • Available NOW

Editor's Notes

  • #6: This illustrates the fundamental processing in iconic form.
  • #7: This slide shows how security and governance would be managed across the phases of ingest, data preparation, and publishing. It is the heart of “Goldilocks governance”. The slide builds in stages. It starts with Data Scientists, who need early access. Data Modelers come next; they expect much of the data science work to have already occurred. It then jumps up to Business Analysts. These are the three major user groups. Business Analysts need access to materialized models and common summaries, but will mostly access data through view-heavy interfaces that enforce security and help with further materialization, or at least representation (views), of the specific models. IT comes last and needs access to everything.
  • #18: Images courtesy of: http://hybridclaims.com
  • #19: Images courtesy of: http://neaglobal.com http://eweek.com
  • #20: Increase customer satisfaction: commitment to quality; improved customer service and access to data (internally & externally). Increase operational efficiency: improved yield & time-to-market. Both come from having end-to-end visibility into every test, every diagnostic, and all info from all components of a product. Enable the business to extract new insights that were never possible before.
  • #27: This slide represents Think Big’s end-to-end big data services portfolio. We have services that span from big data strategy and roadmap all the way to training and managed services. Today we’re going to talk about data lake implementation best practices.