Data Infrastructure at LinkedIn
Jun Rao and Sam Shah
LinkedIn Confidential ©2013 All Rights Reserved
Outline
1. LinkedIn introduction
2. Online/nearline infrastructure overview
3. Infrastructure for data mining
4. Conclusion
The World's Largest Professional Network
• 200M+ members worldwide
• 2 new members per second
• 100M+ monthly unique visitors
• 2M+ company pages
Connecting Talent ↔ Opportunity. At scale…
Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s
People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs
LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins
The Big-Data Feedback Loop
[Diagram: feedback loop linking Member, Product, Data, Science, and Infrastructure, with edges labeled Value, Insights, Scale, Engagement, Virality, Signals, Refinement, and Analytics]
LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
Infrastructure: latency and freshness requirements, with example products
Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills
Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages
Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
LinkedIn Data Infrastructure: Sample Stack
Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.
Streaming Transactions
Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions
Databus at LinkedIn
[Diagram: changes captured from the source DB flow through a Relay (event window) as on-line changes to clients, each running the Databus client library on behalf of consumers 1 to n; a Bootstrap component with its own DB supplies a consistent snapshot at U]
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
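In-order, at-least-once delivery implies that a consumer can see the same change twice, for example after a reconnect. The following is a minimal sketch of how a consumer might stay correct under redelivery, assuming every change event carries a strictly increasing sequence number and that there is one event per sequence number. The names `ChangeEvent`, `scn`, and `IdempotentConsumer` are hypothetical and are not the Databus client library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # hypothetical sequence number, increasing in commit order
    key: str
    value: str


class IdempotentConsumer:
    """Applies change events in order and ignores redelivered ones."""

    def __init__(self):
        self.last_applied_scn = -1  # checkpoint: highest sequence number applied
        self.state = {}

    def on_event(self, event):
        # At-least-once delivery: anything at or below the checkpoint is a replay.
        if event.scn <= self.last_applied_scn:
            return False
        self.state[event.key] = event.value
        self.last_applied_scn = event.scn
        return True


consumer = IdempotentConsumer()
stream = [
    ChangeEvent(1, "member:1213", "headline=v1"),
    ChangeEvent(2, "member:1214", "headline=v1"),
    ChangeEvent(2, "member:1214", "headline=v1"),  # redelivered after a reconnect
    ChangeEvent(3, "member:1213", "headline=v2"),
]
applied = [consumer.on_event(event) for event in stream]
```

Because delivery is in order, a single checkpoint is enough to recognise replays; without the ordering guarantee the consumer would need to track every sequence number it had seen.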
Scaling Core Databases
[Diagram: core database scaled out with three read-only (RO) replicas]
Voldemort: Highly-Available Distributed KV Store
Voldemort: Architecture
• Pluggable components
• Tunable consistency / availability
• Key/value model, server side "views"
• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+
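The "tunable consistency / availability" point can be illustrated with the quorum arithmetic common to Dynamo-style stores. This is a sketch under assumptions, not Voldemort code: `n` is the number of replicas per key, `r` the replicas that must answer a read, and `w` the replicas that must acknowledge a write; the function name is hypothetical.

```python
def quorum_properties(n, r, w):
    """Summarise the trade-off for n replicas, r required reads, w required writes."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return {
        # If r + w > n, every read set shares at least one replica with every write set.
        "read_overlaps_latest_write": r + w > n,
        # Replicas that may be down while the operation still succeeds.
        "read_failures_tolerated": n - r,
        "write_failures_tolerated": n - w,
    }


balanced = quorum_properties(n=3, r=2, w=2)
availability_first = quorum_properties(n=3, r=1, w=1)
```

With `n=3, r=2, w=2` reads overlap the latest write and each operation tolerates one failed replica; with `r=1, w=1` two replicas can be down, but a read may miss the most recent write.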
Streaming Non-transactional Events
[Diagram: event streams feeding offline and nearline processing]
Kafka: High-Volume Low-Latency Messaging System
Kafka Architecture
[Diagram: producers publish to, and consumers read from, four brokers (Broker 1 to Broker 4) coordinated through Zookeeper; each topic is split into partitions (topic1-part1, topic1-part2, topic2-part1, topic2-part2), and each partition appears on more than one broker]
Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication
Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
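The scale-out idea behind these features can be sketched with an in-memory model: a topic is a set of append-only partitions, a key decides the partition, and a consumer resumes from a per-partition offset. This is an illustrative sketch only; `PartitionedTopic` and its methods are hypothetical names, not the Kafka client API, and it leaves out persistence and replication.

```python
import zlib


class PartitionedTopic:
    """Toy model of a topic split into append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash keeps all messages for one key in one partition, in order.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, message):
        partition = self.partition_for(key)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        return self.partitions[partition][offset:offset + max_messages]


topic = PartitionedTopic("page_views", num_partitions=2)
positions = [topic.append(f"member-{i % 3}", f"view-{i}") for i in range(6)]

# A consumer only has to remember one offset per partition to resume reading.
offsets = {partition: 0 for partition in range(2)}
consumed = []
for partition in offsets:
    batch = topic.read(partition, offsets[partition])
    consumed.extend(batch)
    offsets[partition] += len(batch)
```

Adding partitions spreads both writes and reads across more machines, which is one way to read the "scale-out architecture" and "automatic load balancing" bullets.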
Filling in the Data Store Gap
[Diagram label: Text Search]
Espresso: Indexed Timeline-Consistent Distributed Data Store
Application View
Hierarchical data model
Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters
Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
Espresso: System Components
• Partitioning/replication
• Timeline consistency
• Change propagation
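The resource-level operations listed in the application view (conditional updates, partial updates, atomic counters) can be sketched with a small in-memory store. This is a sketch under assumptions rather than the Espresso API: the class `ResourceStore`, the exception `VersionConflict`, the version counter, and the resource path `/members/1213` are all hypothetical.

```python
import threading


class VersionConflict(Exception):
    """Raised when a conditional update sees a newer version than expected."""


class ResourceStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # key -> (version, document dict)

    def get(self, key):
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            return version, dict(doc)

    def partial_update(self, key, fields, expected_version=None):
        """Merge fields into the document; fail if expected_version is stale."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            if expected_version is not None and expected_version != version:
                raise VersionConflict(f"expected {expected_version}, found {version}")
            self._docs[key] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, key, field, delta=1):
        """Read-modify-write of one numeric field under a single lock."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            new_value = doc.get(field, 0) + delta
            self._docs[key] = (version + 1, {**doc, field: new_value})
            return new_value


store = ResourceStore()
v1 = store.partial_update("/members/1213", {"headline": "Engineer"})
v2 = store.partial_update("/members/1213", {"locale": "en_US"}, expected_version=v1)
try:
    store.partial_update("/members/1213", {"headline": "Manager"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True  # v1 is stale because the second update already bumped the version
views = store.increment("/members/1213", "profile_views")
```

The conditional update gives callers optimistic concurrency control: a writer that read an old version is rejected instead of silently overwriting newer data.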
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
Infrastructure challenges in large-scale data mining
Putting it together
Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
LinkedIn circa 2010
O(n²) data integration complexity
Infrastructure fragility
• Can't get all data
• Hard to operate
• Multi-hour delay
• Labor intensive
• Slow
• Does it work?
Process fragility
• Labor intensive
• One man's cleaning…
[Diagram: FE, MT, BE, and DT tiers; FE Dev, BE Dev, and ETL Team; ETL into DW/Hadoop]
Data model
{
tracking_code=null,
session_id=42,
tracking_time=Tue Jul 31 07:27:25 PDT 2010,
error_key=null,
locale=en_us,
browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3,
member_id=1213,
page_key=profile,
tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|,
error_id=null,
page_type=FULL_PAGE,
request_path=view
...
}
Data model (cont'd)
{
article_id=5560874437395353942,
title=Five Good Reasons to Hire the Unemployed,
language=en_US,
article_source=bit.ly,
url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,
...
}
Problems
1 Data integration across systems
2 Fragile infrastructure
3 Lack of proper data models (ad-hoc)
LinkedIn 2013
O(n) data integration
Publish/subscribe commit log
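The move from O(n²) to O(n) integration can be made concrete with a little arithmetic. The sketch below assumes the worst case for the 2010 picture, where every system may need a dedicated pipeline to every other system, and assumes one integration per system with the central log; both function names are hypothetical.

```python
def point_to_point_pipelines(num_systems):
    """Directed pipelines needed if every system feeds every other system."""
    return num_systems * (num_systems - 1)


def central_log_integrations(num_systems):
    """Integrations needed if every system only talks to the shared log."""
    return num_systems


for n in (5, 10, 20):
    print(n, point_to_point_pipelines(n), central_log_integrations(n))
```

With 10 systems this is 90 pipelines against 10 integrations, and doubling the number of systems roughly quadruples the first figure while only doubling the second.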
Data model
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Data model
1 Education
2 Push data cleanliness upstream
3 O(1) ETL
4 Evidence-based correctness
Data model
• DDL for data definition and schema
• Central versioned registry of all schemas
• Schema review
• Programmatic compatibility model
– Schema changes handled transparently
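A programmatic compatibility model can be sketched as a function that compares two schema versions before a change is accepted. The rules and the dictionary representation below are assumptions made for illustration (the slides do not specify them): a new field must carry a default so that old records can still be read, and an existing field may not change type.

```python
def compatibility_problems(old_schema, new_schema):
    """Return reasons why data written with old_schema cannot be read with new_schema."""
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            # Old records lack this field, so a reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type from "
                f"{old_schema[name]['type']} to {spec['type']}"
            )
    return problems


v1 = {"member_id": {"type": "long"}, "page_key": {"type": "string"}}
v2 = {**v1, "locale": {"type": "string", "default": "en_US"}}
v3 = {
    "member_id": {"type": "string"},
    "page_key": {"type": "string"},
    "session_id": {"type": "long"},
}

ok = compatibility_problems(v1, v2)
rejected = compatibility_problems(v1, v3)
```

A check of this kind could run when a schema is checked in, so that an incompatible change is caught at review time instead of breaking downstream loads.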
Workflow
1 Check in schema
2 Code review
3 Ship
Seamless data load into downstream systems
Audit trail
Result: complete, verified copy of all data available
Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
Egress
store DATA into 'kafka://…' using Stream();
Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
Workflows
[Diagram, built up over four slides: Job A, Job B, and Job C; then Push to Production; then Job X; then Push to QA]
Real workflows are complicated
Workflow management: Azkaban
• Dependency management
• Diverse job types (Pig, Hive, Java, . . . )
• Scheduling
• Monitoring
• Configuration
• Retry/restart on failure
• Resource locking
• Log collection
• Historical information
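Two of the features above, dependency management and retry on failure, can be sketched with the Python standard library. This is not Azkaban's job definition format or API: the job names reuse the labels from the workflow slides, while the dependency edges, `run_workflow`, and `flaky_runner` are hypothetical.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (edges are assumed for illustration)
workflow = {
    "Job A": set(),
    "Job B": {"Job A"},
    "Job C": {"Job A"},
    "Push to Production": {"Job B", "Job C"},
}


def run_workflow(workflow, run_job, max_attempts=2):
    """Run jobs in dependency order, retrying each failed job up to max_attempts."""
    completed = []
    for job in TopologicalSorter(workflow).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                run_job(job)
                completed.append(job)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up and fail the whole workflow
    return completed


attempts = {}


def flaky_runner(job):
    # Stand-in for launching a real job; "Job B" fails on its first attempt.
    attempts[job] = attempts.get(job, 0) + 1
    if job == "Job B" and attempts[job] == 1:
        raise RuntimeError("transient failure")


order = run_workflow(workflow, flaky_runner)
```

A production scheduler would also run independent jobs such as Job B and Job C in parallel and record each attempt; the sketch runs them sequentially to keep the ordering logic visible.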
Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
Model of computation
Graphs
Distributed learning
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
Near-line processing
LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) efforts against generic (leverage-able) platforms (*)
3. Balance open-source products with home-grown platforms (**)
4. Data model and end-to-end integration are key (*)
Learning more
data.linkedin.com

Editor's Notes

  • #9: Transition needs to be good. Products => data infrastructure requirements in previous slide. Not all products make the same latency and freshness demands on our data infrastructure. The way we bucketize this is… News and recommendations show up in both nearline and offline.
  • #40: Not part of Kafka
  • #52: Others: Oozie
  • #57: Data integration is hard. Having sane and same metadata across systems. Have a schema which works across the 3 phases. Want rich, evolving schemas, and push as much of the data cleaning as possible to the source and upstream, so that near-line and off-line systems benefit. Sessionization logic is in the WH, which makes it hard for near-line systems to use. Extensible system where changing a schema in one phase does not break downstream systems. Don't build over-specialized systems, e.g. a monitoring system for PYMK: build Azkaban.