LinkedIn Segmentation & Targeting Platform: A Big Data Application
Hadoop Summit, June 2013
Hien Luu, Sid Anand
©2013 LinkedIn Corporation. All Rights Reserved.
About Us
Hien Luu Sid Anand
Our mission
Connect the world’s professionals to make
them more productive and successful
Over 200M members and counting
LinkedIn Members (Millions), by year: 2004: 2, 2005: 4, 2006: 8, 2007: 17, 2008: 32, 2009: 55, 2010: 90, 2011: 145, 2012: 200+
The world’s largest professional network
Growing at more than 2 members/sec
Source :
http://press.linkedin.com/about
• >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• >2.9M Company Pages
• >5.7B professional searches in 2012
• 19 languages
• >30M students and NCGs, the fastest growing demographic
The world’s largest professional network
Over 64% of members are now international
Source :
http://press.linkedin.com/about
Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world!
• As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around
the world
Source :
http://press.linkedin.com/about
Agenda
• Company Overview
• Big Data @ LinkedIn
• The Segmentation & Targeting Problem
• Solution : LinkedIn Segmentation & Targeting Platform
• Q & A
Big Data @ LinkedIn
LinkedIn : Big Data Story
Our Big Data Story depends on Infrastructure!
• On-line Data Infrastructure
• Near-line Data Infrastructure
• Offline Data Infrastructure
[Diagram: three tiers, On-line, Near-line, and Off-line, with labels Oracle or Espresso, Updates, Web Serving, Data Streams, and Teradata]
Big Data Story : On-line Data
On-line Data Infrastructure
• Supports typical OLTP requirements
• Highly concurrent R/W access
• Transactional guarantees
• Back-up & Recovery
• Supports a central LinkedIn Data Principle!
• “All data everywhere”
• All OLTP databases need to provide a time-line consistent change stream
• For this, we developed and open-sourced Databus!
Big Data Story : On-line Data
[Diagram: updates are written to Oracle or Espresso; data change events flow to the Search Index, Graph Index, Read Replicas, and Standardization]
A user updates the company, title, & school on his profile. He also accepts a connection.
The write is made to an Oracle or Espresso master, and Databus replicates it:
• the profile change is applied to the Standardization service
  → E.g. the many forms of IBM are canonicalized for search-friendliness
• … and to the Search Index
  → Recruiters can find you immediately by new keywords
• the connection change is applied to the Graph Index service
  → The user can now start receiving feed updates from his new connections
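The fan-out described above can be sketched in a few lines of Python. This is not the Databus API; the ChangeEvent and Consumer classes, the scn ordering key, and the consumer names are assumptions made for illustration. It only shows one commit-ordered stream being applied to several downstream services.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    scn: int          # hypothetical commit sequence number used for ordering
    kind: str         # "profile" or "connection"
    member_id: int
    payload: dict


@dataclass
class Consumer:
    name: str
    kinds: set                                   # event kinds this service cares about
    applied: list = field(default_factory=list)  # scn of every event it applied

    def on_event(self, event):
        if event.kind in self.kinds:
            self.applied.append(event.scn)


def replay(events, consumers):
    """Deliver every event to every consumer in commit order."""
    for event in sorted(events, key=lambda e: e.scn):
        for consumer in consumers:
            consumer.on_event(event)


# Events arrive out of order here on purpose; replay() restores commit order.
events = [
    ChangeEvent(2, "connection", 42, {"connected_to": 7}),
    ChangeEvent(1, "profile", 42, {"company": "IBM", "title": "Engineer"}),
]
consumers = [
    Consumer("standardization", {"profile"}),
    Consumer("search_index", {"profile"}),
    Consumer("graph_index", {"connection"}),
]
replay(events, consumers)
```

Each consumer sees the same ordered stream and picks out only the changes it needs, which is the property the "time-line consistent change stream" bullet asks for.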
Big Data Story : On-line Data
Databus streams also update Hadoop!
Big Data Story : Near-line & Off-line Data
2 Main Sources of Data @ LinkedIn
• User-provided data
• e.g. Member Profile data (e.g. employment, education history, endorsements)
• Tracking data via web site instrumentation
• e.g. pages viewed, email opened/sent, social gestures : posts/likes/shares
[Diagram labels: Oracle or Espresso, Updates, Databus, Web Servers, Teradata]
The Segmentation & Targeting Problem
Segmentation & Targeting
Segmentation & Targeting Attribute types
Bhaskar Ghosh
Segmentation & Targeting
Step 1 : Take some information about users
Member ID | Join Date  | Country | Responded to Promotion X1
1         | 01/01/2013 | FR      | F
2         | 01/02/2013 | BE      | F
3         | 01/03/2013 | FR      | F
4         | 02/01/2013 | FR      | T
Step 2 : Provide some targeting criteria for a new promotion
Pick members where
• Join Date between("01/01/2013", "01/31/2013") and
• Country="FR" and
• Responded to Promotion X1="F"
→ Members 1 & 3
Step 3 : Target them for a different email campaign (promotion_X2)
In these terms: the columns in Step 1 are attributes, the criteria in Step 2 are a segment definition, and the members picked (1 & 3) are a segment.
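The three steps can be written out as a small Python sketch. The field names (join_date, country, responded_x1) are hypothetical stand-ins for the slide's column headings, and the dates are read as MM/DD/YYYY, which the 01/31/2013 bound implies.

```python
from datetime import date

# Step 1: attributes, one row per member (values from the table on the slide).
members = [
    {"member_id": 1, "join_date": date(2013, 1, 1), "country": "FR", "responded_x1": False},
    {"member_id": 2, "join_date": date(2013, 1, 2), "country": "BE", "responded_x1": False},
    {"member_id": 3, "join_date": date(2013, 1, 3), "country": "FR", "responded_x1": False},
    {"member_id": 4, "join_date": date(2013, 2, 1), "country": "FR", "responded_x1": True},
]


def segment_definition(member):
    """Step 2: the targeting criteria, expressed as a predicate over attributes."""
    return (
        date(2013, 1, 1) <= member["join_date"] <= date(2013, 1, 31)
        and member["country"] == "FR"
        and not member["responded_x1"]
    )


# Step 3: the segment is the list of members to target with promotion_X2.
segment = [m["member_id"] for m in members if segment_definition(m)]
print(segment)  # [1, 3]
```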
Segmentation & Targeting
Problem Definition
• The business wants to launch new campaigns often
• The business wants to specify targeting criteria (segment
definitions) using an arbitrary set of attributes
• The attributes often need to be computed to fulfill the targeting
criteria
• This data resides on Hadoop or Teradata (TD)
• The business is most comfortable with SQL-like languages
Segmentation & Targeting Solution
Segmentation & Targeting
• Attribute Computation Engine
• Attribute Serving Engine
Segmentation & Targeting
Attribute Computation Engine
• Self-service
• Support various data sources
• Attribute consolidation
• Attribute availability
Segmentation & Targeting
Attribute computation
[Figure: source data sets at PB, TB, and TB scale; output of ~225M rows by ~240 attributes]
LinkedIn Segmentation & Targeting Platform
Attribute Portal Web Application
Attribute & Definition Metadata
LinkedIn Segmentation & Targeting Platform
[Diagram: the TD Executor, Hive Executor, and Pig Executor each reach the Attribute & Definition Metadata over REST]
LinkedIn Segmentation & Targeting Platform
[Diagram: the M/R Stitcher combines /path/dataset1, /path/dataset2, /path/dataset3, and /path/dataset4 into /path/lnkd_big_table, followed by the Data Loader]
Attribute consolidation & availability
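A rough picture of what a stitcher does, as an in-process Python sketch. The real component is a MapReduce job whose details are not given in the slides; the attribute names and the map/shuffle/reduce helper functions below are assumptions. The idea is a reduce-side join: every per-definition output is keyed by member ID, and the reducer merges all partial rows for one member into a single wide row.

```python
from collections import defaultdict


def map_phase(datasets):
    """Emit (member_id, partial_row) pairs from every per-definition output."""
    for rows in datasets.values():
        for row in rows:
            partial = {k: v for k, v in row.items() if k != "member_id"}
            yield row["member_id"], partial


def shuffle(pairs):
    """Group partial rows by member ID, as the M/R framework would."""
    grouped = defaultdict(list)
    for member_id, partial in pairs:
        grouped[member_id].append(partial)
    return grouped


def reduce_phase(grouped):
    """Merge all partial rows for a member into one wide row."""
    big_table = {}
    for member_id, partials in grouped.items():
        wide_row = {}
        for partial in partials:
            wide_row.update(partial)
        big_table[member_id] = wide_row
    return big_table


# Hypothetical outputs of two attribute definitions.
datasets = {
    "/path/dataset1": [
        {"member_id": 1, "country": "FR"},
        {"member_id": 2, "country": "BE"},
    ],
    "/path/dataset2": [
        {"member_id": 1, "is_recruiter": True},
    ],
}
big_table = reduce_phase(shuffle(map_phase(datasets)))
```

A member missing from one dataset simply ends up with fewer columns, which matches the idea that a failed attribute definition only makes its own attributes unavailable.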
LinkedIn Segmentation & Targeting Platform
LinkedIn big table, the most sought-after data
• Segmentation
• Propensity Model
• Ad hoc analysis
Segmentation & Targeting
Attribute Serving Engine
• Self-service
• Attribute predicate expression
• Build segments
• Build lists
Segmentation & Targeting
Serving Engine
[Figure: count, filter, sum, and complex expressions evaluated over the LinkedIn big table (~225M rows, ~240 attributes)]
LinkedIn Segmentation & Targeting Platform
[Diagram: the M/R Indexer reads the LinkedIn big table and the Attribute & Definition Metadata and builds multiple inverted indexes]
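To make the inverted index concrete, here is a minimal Python sketch. The slides name Lucene as the query layer; this sketch swaps in plain dictionaries and sets, and the attribute names are hypothetical. Each (attribute, value) pair maps to the set of member IDs that have it, so a filter becomes a set lookup instead of a table scan.

```python
from collections import defaultdict


def build_inverted_index(big_table):
    """Map every (attribute, value) pair to the member IDs that carry it."""
    index = defaultdict(set)
    for member_id, attributes in big_table.items():
        for name, value in attributes.items():
            index[(name, value)].add(member_id)
    return index


# A tiny stand-in for the big table: one wide row per member.
big_table = {
    1: {"country": "FR", "responded_x1": False},
    2: {"country": "BE", "responded_x1": False},
    3: {"country": "FR", "responded_x1": False},
    4: {"country": "FR", "responded_x1": True},
}
index = build_inverted_index(big_table)

# Country="FR" and Responded to Promotion X1="F" is a set intersection.
french_non_responders = index[("country", "FR")] & index[("responded_x1", False)]
```

Counting a segment is then just the size of the resulting set, which is one reason this layout suits count and filter queries over a wide table.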
LinkedIn Segmentation & Targeting Platform
Who are the North American recruiters that don’t work for a competitor?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the job seekers?
LinkedIn Segmentation & Targeting Platform
[Diagram: JSON Predicate Expression → JSON Lucene Query Parser → Inverted Indexes → Segment & List]
LinkedIn Segmentation & Targeting Platform
Complex tree-like attribute predicate expressions
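One way to picture a tree-like predicate expression is the Python sketch below. The JSON shape ("op", "args", "arg", "attr", "eq") is invented for illustration and is not the platform's actual format, and set operations stand in for the Lucene query that the real parser would produce. The example encodes "North American recruiters that don't work for a competitor".

```python
import json

# Hypothetical JSON predicate: AND of two leaves and a NOT subtree.
PREDICATE_JSON = """
{"op": "and", "args": [
    {"attr": "region", "eq": "NA"},
    {"attr": "is_recruiter", "eq": true},
    {"op": "not", "arg": {"attr": "works_for_competitor", "eq": true}}
]}
"""


def evaluate(node, index, universe):
    """Recursively evaluate a predicate tree to a set of member IDs."""
    if "attr" in node:                       # leaf: attribute equals value
        return index.get((node["attr"], node["eq"]), set())
    if node["op"] == "not":
        return universe - evaluate(node["arg"], index, universe)
    results = [evaluate(child, index, universe) for child in node["args"]]
    if node["op"] == "and":
        return set.intersection(*results) if results else set(universe)
    if node["op"] == "or":
        return set().union(*results)
    raise ValueError(f"unknown operator: {node['op']}")


# Tiny inverted index: (attribute, value) -> member IDs.
index = {
    ("region", "NA"): {1, 2, 4},
    ("region", "EU"): {3},
    ("is_recruiter", True): {1, 2, 3},
    ("works_for_competitor", True): {2},
}
universe = {1, 2, 3, 4}

segment = evaluate(json.loads(PREDICATE_JSON), index, universe)
```

Because every node returns a set, subtrees can be nested to any depth, which is what makes the expressions tree-like.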
LinkedIn Segmentation & Targeting Platform
A marketing campaign is represented by a list
Conclusion
Move at business speed and scale at LinkedIn scale
• Segmentation & Targeting Platform
  – Self-service
  – Multiple data sources & massive data volume
  – Support complex expression evaluation in seconds
  – Attribute availability at business speed
Engineering Team
• Jessica Ho
• Swetha Karthik
• Raj Rangaswamy
• Tony Tong
• Ajinkya Harkare
• Hien Luu
• Sid Anand
Questions?
More info: data.linkedin.com


Editor's Notes

  • #6: We’re making great strides toward our mission: LinkedIn has over 225 million members, and we’re now adding more than two members per second. This is the fastest rate of absolute member growth in the company’s history. Sixty-four percent of LinkedIn members are currently located outside of the United States. LinkedIn counts executives from all 2012 Fortune 500 companies as members; its corporate talent solutions are used by 88 of the Fortune 100 companies. More than 2.9 million companies have LinkedIn Company Pages. LinkedIn members did over 5.7 billion professionally-oriented searches on the platform in 2012. [See http://press.linkedin.com/about for a complete list of LinkedIn facts and stats]
  • #16: Email campaign & ad targeting
      – Acquire new paid customers
      – Retain and engage existing customers
      – Promote new products
      – Training and other important announcements
      * Talk about the speed of changing segmentation and targeting criteria
  • #17: Professional identity; social data; behavioral
  • #22: Given the business problem that Sid outlined, the solution we came up with has two parts. The first part computes attributes based on the attribute definitions. The second part serves the attribute values to define segments, effectively performing user segmentation.
  • #23: The attribute computation engine needs to support these 4 high-level requirements:
      – Self-service: there needs to be an easy way for someone on the business team to express the computational logic that computes a set of attributes for the needs of their marketing campaigns. The engine takes care of the complexity of executing that logic: when and how to run it, and where to store the result.
      – Support various data sources: data lives in multiple places (TD and Hadoop), and we need to support both. Fortunately, SQL and HiveQL are very similar.
      – Attribute consolidation: once all the attributes are computed, they need to be consolidated into a single dataset that is easy for everyone to consume and analyze.
      – Data availability: register the dataset with Hive and copy the data onto the TD system for business folks to consume.
  • #24: At a high level, the attribute computation engine needs to be able to compute attributes that come from different data sets, and some of these data sets are huge. The output of the computation engine is one big table: 225M rows, one for each member, and ~240 columns, one for each attribute. Behavioral data: site engagement, OL transactions, searches, comments, discussions, … Social data: connections, follows, endorsements. Demographic data (this comes from the member profile): location, gender, title, function, seniority, education.
  • #25: Self-service way to manage attributes. A web application that a member of the marketing operations or business analyst team can use to express the computation logic in the form of a SQL SELECT statement; we call that an attribute definition. The SQL statement is either a Teradata SQL statement or a Hive QL statement. The web application validates the SQL statements and extracts the attribute names and their types, which are useful for various purposes. The metadata about the attribute definitions and attributes is captured in a MySQL database. For Hive QL queries, we support Hive hints as well as general tuning parameters like split size. Once an attribute definition passes the validation step, it goes through an approval process, which is designed to make sure there are no duplicate attributes and that the query is properly tuned. One of the benefits of this attribute portal is that it centralizes attribute definitions and makes it easy to discover attributes, the logic behind them, and their data sources. When someone starts working on a marketing campaign, they first identify the targeting criteria based on the goals of the campaign; from the set of targeting criteria, they identify the member attributes they need.
  • #26: Attribute computing workhorse. These executors are scheduled to run on a regular basis. They contact the attribute definition metadata repository to retrieve which attribute definitions to execute, and they execute the queries in parallel using APIs. TD executor: executes using JDBC and stores results in temporary tables. We use an in-house library called LASSEN, an M/R library that leverages the MapReduce framework to quickly and efficiently download the data to HDFS. Hive executor: programmatically executes the Hive queries. One of the classes in Hive is not thread-safe, so we can’t execute Hive QL statements in parallel using multiple threads; we use multiple Hive executors instead. Pig executor: executes Pig script files and has the ability to rerun only the failed scripts. Interesting runtime details: we have all kinds of queries, simple ones and complex ones. The complex ones may take hours to complete. However, we don’t want a query that takes 5 or 6 hours; that would delay the attribute computing phase for all the queries. Our system has a built-in mechanism to kill a long-running query that exceeds a certain amount of time. What about failed queries? Even though we validate them at attribute definition submission time, some of them will fail at runtime for various reasons. Our system is built to be resilient against these failed queries: only the attributes of the failed queries will not be available. Our system collects accounting information about each of the queries, so we know how many queries completed successfully, how many failed, and how long each took. The output of each attribute definition is stored in a separate folder, so if we have 50 attribute definitions, the attribute values are scattered across 50 folders.
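The runtime behavior described in the #26 notes (parallel execution, a time limit on long-running queries, resilience to failed queries, and per-query accounting) can be sketched roughly as follows. This is a minimal illustration in Python, not the platform's executor code: `run_definitions`, `_run_one`, and the idea of passing each query as a zero-argument callable are all assumptions made for the example. Python threads cannot be killed, so the sketch only stops waiting for a slow query; a real executor would also cancel the query on the backend.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def _run_one(fn):
    """Run one (hypothetical) query callable and record its outcome and duration."""
    t0 = time.monotonic()
    try:
        outcome = {"status": "ok", "rows": fn()}
    except Exception as exc:  # one failed query must not affect the others
        outcome = {"status": "failed", "error": str(exc)}
    outcome["seconds"] = time.monotonic() - t0
    return outcome


def run_definitions(definitions, timeout_s):
    """Run every attribute definition in parallel and return accounting info.

    `definitions` maps a definition name to a zero-argument callable that
    stands in for executing its query.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(definitions)))
    start = time.monotonic()
    futures = {name: pool.submit(_run_one, fn) for name, fn in definitions.items()}
    report = {}
    for name, future in futures.items():
        # Every query gets the same deadline, measured from submission.
        remaining = max(0.0, start + timeout_s - time.monotonic())
        try:
            report[name] = future.result(timeout=remaining)
        except FutureTimeout:
            # Assumption: a real system would cancel the backend query here.
            report[name] = {"status": "timed_out", "seconds": timeout_s}
    pool.shutdown(wait=False)
    return report
```

Only the attributes of failed or timed-out definitions are missing from the report; everything else is still available to the next stage.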
  • #27: Once the executors have finished executing and materializing the attributes, the job of the stitcher is to combine all these attributes together into a single data set, which I call the LinkedIn big table. It is a MapReduce job, and it acts as a gateway that performs some validations, e.g. the member id must not be less than 0, or certain values can’t be longer than a certain length. The output of the stitcher is a single data set in Avro format that contains one record for every single LinkedIn member. This output is also registered in Hive for data scientists to consume. To make the LinkedIn big table available for business analysts to generate more insights and do further analysis, this same data set is copied onto TD via the Data Loader component. The process of executing these attribute definitions (select statements), stitching the attributes together into a single dataset, and loading the data onto TD takes about 5 to 6 hours. Not all attributes need to be refreshed daily, so we have a concept of partial refresh and full refresh. Partial refresh: only the needed subset of attribute definitions is executed, and it takes much less time, 2-3 hours vs. 5 to 6 hours.
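The stitching step in the #27 notes amounts to a join on member id with a validation gate in front of it. The sketch below shows that idea on small in-memory Python dictionaries; the real stitcher is a MapReduce job over Avro data, and the names `stitch` and `find_problem`, the `member_id` key, and the 32-character limit are assumptions chosen for illustration. It also only emits members that appear in at least one input, whereas the notes describe one record for every member.

```python
def find_problem(row, max_len):
    """Return a reason string if the row fails validation, else None."""
    member_id = row.get("member_id")
    if not isinstance(member_id, int) or member_id < 0:
        return "invalid member id"
    for name, value in row.items():
        if isinstance(value, str) and len(value) > max_len:
            return f"value too long: {name}"
    return None


def stitch(outputs, max_len=32):
    """Merge per-definition outputs into one record per member.

    `outputs` maps a definition name to its list of row dicts, standing in
    for the separate output folders written by the executors.
    """
    table, rejected = {}, []
    for definition, rows in outputs.items():
        for row in rows:
            problem = find_problem(row, max_len)
            if problem:
                rejected.append((definition, problem))
                continue
            record = table.setdefault(row["member_id"], {"member_id": row["member_id"]})
            record.update(row)
    return table, rejected
```

Keeping the rejected rows alongside the stitched table makes it possible to report which definitions produced invalid data.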
  • #28: LinkedIn big table: 200GB. The LinkedIn big table is used for multiple purposes. Propensity model: a ranking model where each member is assigned a score that indicates how likely the member is to belong to a certain class of members or to take an action, e.g. being a job seeker, or how likely someone is to upgrade to a paid subscription. Business analysts and data scientists: for their own analysis. The most sought-after data: a very rich data set that contains all kinds of interesting attributes about our members. Because the heavy lifting has been done and the data is available in a single place, others don’t have to hunt down data sets.
  • #29: Self-service: a web application for business analysts and the marketing team to use, including someone who is not familiar with SQL, with a UI that supports drag and drop. An attribute predicate expression is basically a boolean expression that evaluates to true or false by comparing an attribute value to an expected value, for example, whether the country is United States or whether a member has more than 30 connections. In order to build segments, we need a way of expressing attribute predicates, e.g. country is Canada or United States, and of saving this expression and evaluating it at a later point. Building segments: combining various attribute predicates into a segment. Building lists: combining segments together to target a certain member population for a marketing campaign.
  • #30: Based on the requirements I talked about in the previous slide, the serving engine needs to support the following features/operations. Count: how many members meet certain criteria. Filter: the members that meet certain criteria. Sum: each member is assigned a lifetime value for a particular product, so we need the ability to compute the total dollar amount of a segment based on the members that meet the defined criteria. Complex nested expressions with support for conjunction (and) and disjunction (or). The core problem that the serving engine needs to solve is to support arbitrary predicate expressions against any of the attributes and return the result in a reasonable amount of time. We basically think of this as an information retrieval problem, and we found Lucene to be pretty good at this kind of problem, so we leverage it to support those arbitrary predicate expressions.
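The operations listed in the #30 notes (count, filter, sum, and nested and/or expressions) can be illustrated with a small Python sketch. It uses a plain linear scan over in-memory records rather than a Lucene index, so it shows the semantics only, not the performance characteristics of the serving engine. The helper names (`attr`, `all_of`, `any_of`, `total_value`) and the `lifetime_value` attribute name are assumptions for the example.

```python
def attr(name, test):
    """Predicate on one attribute; members missing the attribute do not match."""
    return lambda member: name in member and test(member[name])


def all_of(*predicates):
    """Conjunction (and) of predicates."""
    return lambda member: all(p(member) for p in predicates)


def any_of(*predicates):
    """Disjunction (or) of predicates."""
    return lambda member: any(p(member) for p in predicates)


def filter_members(members, predicate):
    return [m for m in members if predicate(m)]


def count(members, predicate):
    return len(filter_members(members, predicate))


def total_value(members, predicate, value_attr="lifetime_value"):
    """Sum a per-member value over the members that match the predicate."""
    return sum(m.get(value_attr, 0) for m in filter_members(members, predicate))


members = [
    {"member_id": 1, "country": "us", "connections": 120, "lifetime_value": 40.0},
    {"member_id": 2, "country": "ca", "connections": 12, "lifetime_value": 25.0},
    {"member_id": 3, "country": "fr", "connections": 300, "lifetime_value": 10.0},
]
north_america = any_of(attr("country", lambda c: c == "us"),
                       attr("country", lambda c: c == "ca"))
segment = all_of(north_america, attr("connections", lambda n: n > 30))
```

Here `count(members, segment)` answers "how many members meet the criteria", and `total_value(members, north_america)` gives the dollar total for a segment.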
  • #31: MapReduce application. Consumes data in Avro format and creates Lucene indexes. Uses a custom Writable to wrap a Lucene document. Each Lucene document contains all the 240+ attributes for a member. Uses a custom OutputFormat to build a Lucene index segment, which is stored on the local disk of the reducer task and copied onto HDFS at the end of the reduce task. LinkedIn big table: 200GB. Index: 175GB. * # of map and reduce tasks
  • #32: The first one requires only one attribute: job seeker status. The second requires two attributes: talent solution prospects and the country they work in. The third one would need 3 attributes: whether a member is a recruiter, the country that member works in, and whether the company they work for is considered a competitor of LinkedIn.
  • #33: JSON predicate expression: we use JSON to define the format of the predicate expression. JSON is well suited for this purpose: it supports nested data structures, is fairly flexible, and is easy to parse. It supports different data types, and for each data type, certain operators are supported. A JSON predicate expression consists of an attribute name, data type, operator, and one or more values. The JSON predicate expression is the contract between the browser and the server. The predicate expression is stored in MySQL and evaluated at run time.
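As an illustration of the #33 notes, here is one way a stored JSON predicate expression could be parsed and evaluated against a single member record in Python. The notes only say that an expression carries an attribute name, data type, operator, and one or more values; the concrete field names (`attribute`, `type`, `operator`, `values`), the `and`/`or` nesting keys, and the operator names below are assumptions, not the platform's actual schema.

```python
import json

# Hypothetical operator table, keyed by (data type, operator name).
OPERATORS = {
    ("string", "in"): lambda actual, values: actual in values,
    ("string", "not_in"): lambda actual, values: actual not in values,
    ("number", "gt"): lambda actual, values: actual > values[0],
    ("number", "lt"): lambda actual, values: actual < values[0],
    ("boolean", "eq"): lambda actual, values: actual == values[0],
}


def evaluate(expr, member):
    """Return True if `member` (a dict of attribute values) satisfies `expr`."""
    if "and" in expr:
        return all(evaluate(child, member) for child in expr["and"])
    if "or" in expr:
        return any(evaluate(child, member) for child in expr["or"])
    op = OPERATORS[(expr["type"], expr["operator"])]
    actual = member.get(expr["attribute"])
    # In this sketch, a member with no value for the attribute never matches.
    return actual is not None and op(actual, expr["values"])


# The expression as it might be stored as text and read back at run time.
stored = json.dumps({
    "and": [
        {"attribute": "country", "type": "string", "operator": "in",
         "values": ["us", "ca"]},
        {"attribute": "connections", "type": "number", "operator": "gt",
         "values": [30]},
    ]
})
expr = json.loads(stored)
```

Keying the operator table by data type mirrors the point that each data type supports only certain operators: an unsupported combination raises a `KeyError` instead of silently evaluating.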
  • #34: Web application with a UI for defining segments and lists. Segment builder: drag arbitrary attributes and build predicate expressions. With the click of a button, the marketing team can get a sense of how many members meet the criteria defined in the segment. This gives them a chance to change the criteria to increase or decrease the count. Segments are meant as building blocks.
  • #35: Segments are building blocks and are reusable. Each marketing campaign is represented by a list, which is a collection of segments; each segment can be one of two types. Inclusions: include members that meet the defined criteria of each of the selected segments. Net count and raw count. Exclusions: exclude those members.
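The inclusion/exclusion logic in the #35 notes maps naturally onto set operations. The Python sketch below treats each segment as a set of member ids. The notes do not define "raw count" and "net count", so the interpretation used here is an assumption: raw count is the sum of the included segment sizes (members in several segments are counted more than once), and net count is the number of distinct members left after removing exclusions. `build_list` is a hypothetical name.

```python
def build_list(inclusions, exclusions=()):
    """Combine segments (sets of member ids) into a campaign list.

    Assumed definitions, not taken from the source:
      raw_count: sum of the included segment sizes, duplicates counted
      net_count: distinct included members minus excluded members
    """
    raw_count = sum(len(segment) for segment in inclusions)
    included = set().union(*inclusions)
    excluded = set().union(*exclusions)
    targeted = included - excluded
    return {"members": targeted, "raw_count": raw_count, "net_count": len(targeted)}


job_seekers = {1, 2, 3}
recruiters = {3, 4}
competitor_employees = {4}
campaign = build_list([job_seekers, recruiters], [competitor_employees])
```

Because segments are plain sets, the same segment can be reused as an inclusion in one list and an exclusion in another.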
  • #36: One of the things we are working on is improving the turnaround time for attributes: from the time an attribute is defined to the time it is available for building segments.
  • #37: * Give a shout-out to the engineering team that works on this platform