Infrastructure Around Hadoop

Hadoop Summit 2012

Backups, failover, configuration and monitoring

Terran Melconian, Edmund MacKenty

tripadvisor.com/careers

What TripAdvisor Does

•  World's largest travel site and community
•  Trip planning, user reviews
•  >50 million unique monthly visitors, 30 countries*
•  >60 million reviews and opinions*
•  Run like a startup: 30+ teams all doing their own thing
•  Heavy use of open-source projects
•  Speed Wins!

* source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012

What the Warehouse Team Does

•  Retain and aggregate historic site activity data
•  Make data available throughout the company
•  Hits, reviews, forums, contacts, locations, businesses, etc.
•  ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2)
•  Used by ~12 analytics teams, heavy use of Hive
•  Some jobs must run every day (e.g. ETL, aggregations)
•  Systems are very open, we trust our users (usually)
•  3 people, fairly new to Hadoop/Hive

Why Hadoop at TripAdvisor

•  Hadoop is how we scale analysis past the limits of one machine
–  Some daily jobs taking nearly 24 hours, and we're still growing quickly

•  Our old RDBMS data warehouse could barely keep up with data
ingestion, even running on expensive hardware with a SAN
–  We obtained 20x improvement in wall clock time

•  Reprocess unaggregated historical data as definitions change
–  Before, impossible except for a small sample
–  Now, reprocess years of data at the finest level in a few days

•  Efficient platform for many kinds of statistics
–  Representative example: five-hour RDBMS job went to 25 minutes

HA NameNode: DRBD, Corosync and Pacemaker

•  Namenode and JobTracker run on “master” node
•  Datanode and TaskTracker run on “slave” nodes
•  Automatic fail-over of all master-node services to a passive node
•  Provision two identical systems
•  Set up virtual Master IP address to be failed over
•  Secondary namenode on passive node, if available
•  Monitor and automatically restart failed services

DRBD/Corosync Configuration

•  DRBD: replicates namenode image, Hive metadata, Oozie job data
–  Create two identical storage devices (we used RAID 1)
–  Connect the master nodes with a cross-over ethernet cable
–  Configure DRBD to use the cross-over and storage devices
–  Use drbdadm to create the replicated device
–  Create a filesystem on /dev/drbd0 with mkfs
–  Use cat /proc/drbd to see the state of the device
–  Once created, use /etc/init.d/drbd to manage it

•  Corosync: messaging between active-passive masters
–  Configure Corosync to also use the cross-over ethernet cable
–  Corosync will start Pacemaker for you
–  Use /etc/init.d/corosync to manage it, and Pacemaker
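
A monitoring script can read /proc/drbd instead of a person running cat by hand. The following is a minimal Python sketch, assuming the DRBD 8.3-style status line layout (cs: connection state, ro: roles, ds: disk states); the sample text and the function names are illustrative assumptions, not part of our tooling.

```python
import re

# Assumed sample of DRBD 8.3-style /proc/drbd output; real output varies by version.
SAMPLE = (
    "version: 8.3.11 (api:88/proto:86-96)\n"
    " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----\n"
    "    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0\n"
)

# One status line per device: "<minor>: cs:<state> ro:<local>/<peer> ds:<local>/<peer>"
STATUS_RE = re.compile(
    r"^[ \t]*(?P<minor>\d+):[ \t]+cs:(?P<cs>\S+)[ \t]+ro:(?P<ro>\S+)[ \t]+ds:(?P<ds>\S+)",
    re.MULTILINE,
)


def parse_drbd_status(text):
    """Return one dict per DRBD device found in /proc/drbd-style text."""
    devices = []
    for match in STATUS_RE.finditer(text):
        local_role, _, peer_role = match.group("ro").partition("/")
        local_disk, _, peer_disk = match.group("ds").partition("/")
        devices.append({
            "minor": int(match.group("minor")),
            "connection": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return devices


def is_healthy(device):
    """Healthy here means connected with both copies up to date (our assumption)."""
    return (device["connection"] == "Connected"
            and device["local_disk"] == "UpToDate"
            and device["peer_disk"] == "UpToDate")


if __name__ == "__main__":
    for dev in parse_drbd_status(SAMPLE):
        state = "healthy" if is_healthy(dev) else "DEGRADED"
        print(dev["minor"], dev["local_role"], state)
```

On a real master node the text would come from open("/proc/drbd").read() rather than SAMPLE.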

Pacemaker Configuration

•  Define each resource you want to manage:
–  DRBD device, master IP address, ethernet connectivity checks,
Hadoop namenode and jobtracker, Hive thrift server, MySQL for Hive
metadata, Oozie for workflow coordination

•  Set monitoring intervals for each resource
•  Define resource co-location dependencies
•  Define resource ordering dependencies
•  Pacemaker restarts failed services, e.g. the Hive Thrift server
•  Use crm tool to manage nodes and resources
•  Test with a manual fail-over:
–  Migrate the namenode resource to the passive master
–  Use crm status to watch all resources move over

Monitoring: Ganglia and Nagios, Job Tracking

•  Visibility into cluster operations
•  Monitor hardware states and resource usage
•  Notify on specific boundary or failure conditions
•  Track MapReduce jobs and Hive tables
•  Identify immediate problems
•  Show trends over time to predict future needs

Ganglia

•  Standard monitoring of CPU, Memory, Disk usage, etc.
•  Perl script parses Hadoop metrics, sends using gmetric(1)
•  ~50 Hadoop metrics, ~30 system metrics
•  Graphs for entire cluster and individual nodes
•  Example: Two jobs with different resource profiles
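
The metrics-to-gmetric step can be sketched as follows. Our script is written in Perl; this Python version is only an illustration, and it assumes metrics arrive as lines of the form "record: key=value, key=value" (the sample line and the function names are hypothetical). It builds gmetric command lines using the --name, --value and --type options and leaves running them to the caller.

```python
# Hypothetical sample line; the real metrics file layout depends on your
# hadoop-metrics.properties settings.
SAMPLE_LINE = "dfs.namenode: hostName=master01, sessionId=, FilesCreated=12, BlocksCorrupted=0"


def parse_metrics_line(line):
    """Split 'record: k=v, k=v' into (record, {k: v})."""
    record, _, rest = line.partition(": ")
    fields = {}
    for pair in rest.split(", "):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return record.strip(), fields


def gmetric_commands(line):
    """Build one gmetric argument list per numeric field in the line."""
    record, fields = parse_metrics_line(line)
    commands = []
    for key, value in sorted(fields.items()):
        try:
            float(value)
        except ValueError:
            continue  # skip non-numeric tags such as hostName or an empty sessionId
        commands.append([
            "gmetric",
            "--name", record + "." + key,
            "--value", value,
            "--type", "double",
        ])
    return commands


if __name__ == "__main__":
    for command in gmetric_commands(SAMPLE_LINE):
        # A real script would pass each list to subprocess.call(); here we only print.
        print(" ".join(command))
```

Sending every value as a double keeps the sketch short; a production script would likely pick a type and units per metric.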

Nagios

•  Our primary notification system
•  About 80 checks, ~25 are our own. Examples:
–  check_hdp_connectivity: can master talk to all its slaves?
–  check_hdp_data_nodes: are all configured slave datanodes running?
–  check_hdp_max_mr_settings: does jobtracker have resources we expect?
–  check_hadoop_master_logfiles: are logs being written to?
–  check_hive_server: is it up?

•  Some warnings:
–  Do not let Nagios run hadoop fsck (check_hdp_hdfs)
–  LDAP failure causes email cascade
–  High loads can cause timeouts, which cause notifications
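
A custom check such as check_hdp_data_nodes boils down to comparing two lists and mapping the result onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is in Python and is not our actual check; the thresholds, host names and function names are assumptions chosen for illustration.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_data_nodes(configured, live, warn_missing=1, crit_missing=3):
    """Compare configured slave hosts with live datanodes.

    Returns (exit_code, status_line). Thresholds are illustrative assumptions.
    """
    if not configured:
        return UNKNOWN, "UNKNOWN: no slave nodes configured"
    missing = sorted(set(configured) - set(live))
    summary = "%d of %d datanodes running" % (len(configured) - len(missing), len(configured))
    if len(missing) >= crit_missing:
        return CRITICAL, "CRITICAL: %s; missing: %s" % (summary, ", ".join(missing))
    if len(missing) >= warn_missing:
        return WARNING, "WARNING: %s; missing: %s" % (summary, ", ".join(missing))
    return OK, "OK: " + summary


def run(configured, live):
    """Print the status line and exit the way Nagios expects a plugin to."""
    code, message = check_data_nodes(configured, live)
    print(message)
    sys.exit(code)
```

How the two lists are obtained (for example from the slaves file and from a datanode report) is left out on purpose; keep that part cheap, since high load already causes check timeouts.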

Job Tracking

•  Perl script invoked frequently by cron
•  Parses jobtracker log entries since last run
•  Records data on each job in PostgreSQL DB:
–  Job ID, user, submitting IP and time, status
–  Cluster ID, queue, Hive query
–  start/stop times for job and first mapper and reducer
–  Mapper and reducer counts, max memory, slots, splits

•  CGI script to do queries:
–  Running jobs, failed jobs, MapReduce capacity usage
–  Job resource usage by status, queue, user

•  Helps post-mortem of problems
•  Used to predict trends, future resource needs
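
The "parse what is new, then record it" loop can be sketched as below. This is a Python illustration rather than our Perl script: it assumes log records made of KEY="value" pairs with keys such as JOBID, USER, SUBMIT_TIME and JOB_STATUS, and it uses an in-memory SQLite database as a stand-in for PostgreSQL. The table layout and function names are hypothetical.

```python
import re
import sqlite3

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumed mapping from log keys to columns of a hypothetical jobs table.
COLUMNS = {"USER": "username", "SUBMIT_TIME": "submit_time", "JOB_STATUS": "status"}


def read_new_lines(path, offset):
    """Return (lines written after byte offset, new offset) for the next cron run."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    return data.decode("utf-8", "replace").splitlines(), offset + len(data)


def parse_record(line):
    """Split 'Job KEY="v" KEY="v"' into (record type, {KEY: v})."""
    kind, _, rest = line.partition(" ")
    return kind, dict(PAIR_RE.findall(rest))


def record_jobs(conn, lines):
    """Insert or update one row per job from the given log lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, username TEXT, submit_time TEXT, status TEXT)"
    )
    for line in lines:
        kind, fields = parse_record(line)
        if kind != "Job" or "JOBID" not in fields:
            continue  # ignore task-level and malformed records
        conn.execute("INSERT OR IGNORE INTO jobs (job_id) VALUES (?)", (fields["JOBID"],))
        for key, column in COLUMNS.items():
            if key in fields:
                # Column names come from the fixed COLUMNS dict, never from the log.
                conn.execute(
                    "UPDATE jobs SET %s = ? WHERE job_id = ?" % column,
                    (fields[key], fields["JOBID"]),
                )
    conn.commit()
```

Storing the returned offset between runs (for example in a small state file) is what lets a frequent cron job process each log entry only once.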

Other cron scripts we run

•  Check_load:
–  Dumps Java stack trace when load is too high
–  Emails list of top processes so we can see what was wrong

•  Master nodes:
–  Compresses Hadoop/Hive logs more than 30 days old
–  Removes logs more than 120 days old (we keep 10+ GB)
–  Check_hdfs: Runs hadoop fsck to see if HDFS is “healthy”
–  Backs up the current namenode fsimage

•  Slave Nodes:
–  Check_disks: Removes read-only disks from datanode configuration
–  Check_load: Kills some tasks and notifies us when load is too high

•  Refresh production data to development cluster
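
The log-aging policy on the master nodes (compress after 30 days, remove after 120) is easy to get subtly wrong, so here is a minimal Python sketch of one way to do it. It is not our cron script: the function name is hypothetical, and it assumes age is judged by file modification time and that all logs sit in one flat directory.

```python
import gzip
import os
import shutil
import time


def age_logs(log_dir, now=None, compress_after_days=30, remove_after_days=120):
    """Compress old logs and delete very old ones; return the actions taken."""
    now = time.time() if now is None else now
    actions = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        age_days = (now - os.path.getmtime(path)) / 86400.0
        if age_days > remove_after_days:
            os.remove(path)
            actions.append(("removed", name))
        elif age_days > compress_after_days and not name.endswith(".gz"):
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original mtime so the 120-day clock is not reset by compression.
            shutil.copystat(path, path + ".gz")
            os.remove(path)
            actions.append(("compressed", name))
    return actions
```

Copying the timestamps onto the compressed file is the design choice worth noting: without it, every compressed log would look new and would survive roughly 150 days instead of 120.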

Configuration Management

•  Seems like extra work at first, but essential as you grow.
•  Not Hadoop-specific: manage OS packages, Nagios and Ganglia
scripts, cron jobs, svn, SSH keys, NFS mounts, jars
–  Consistent UID/GIDs critical with DRBD
–  We replace some jars from the RPMs with local fixes
–  Templated configuration files are very convenient; ERB works well
–  SSH keys made consistent across nodes, masters share host key

•  Use SVN as file delivery mechanism: checkout on each box
•  We chose Puppet as a tool
–  Gets the job done
–  Lacks flexibility in inheritance to specialize defaults per-machine
–  Some aspects of operation are hard to debug

Backup: HDFS and Hive DDL

•  Objectives:
–  Provide safety against total HDFS failure due to software bugs or
machine room environmental incident
–  Protect against user error in dropping or overwriting tables
–  Restore data to another cluster

•  Assumptions
–  Repeating one day of processing is acceptable when restoring

•  Components
–  Incremental HDFS backup
–  Hive DDL backup

•  Runs on separate backup server with storage (NexSan)
–  Pull process driven by processes on backup server

Backup HDFS

•  Open-source Java app
•  Requires customization to your environment
•  Traverses HDFS directory tree
•  Copies out files modified after a given date
•  Doesn't copy very new directories
–  Needed a way to avoid copying files being written at time of backup
–  HDFS has no snapshots

•  Ignores specified directories
•  Generates restore shell scripts to set owners, perms
•  Verification tool checks file sizes and checksums
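
The selection rules above (modified after a given date, skip very new directories, skip ignored directories) can be expressed as a small filter. The backup tool itself is a Java app; the Python sketch below only illustrates the rules over an already-listed directory tree, and the Entry type, parameter names and thresholds are assumptions.

```python
from collections import namedtuple

# Hypothetical listing record: HDFS path, directory flag, modification time (seconds).
Entry = namedtuple("Entry", "path is_dir mtime")


def _under(path, roots):
    """True if path equals one of the roots or lies beneath it."""
    return any(path == root or path.startswith(root.rstrip("/") + "/") for root in roots)


def select_for_backup(entries, since, now, min_dir_age, ignored=()):
    """Return paths of files to copy in this incremental run.

    - only files modified after `since`
    - nothing under directories modified less than `min_dir_age` seconds ago,
      since they may still be receiving output
    - nothing under explicitly ignored directories
    """
    fresh_dirs = [e.path for e in entries if e.is_dir and now - e.mtime < min_dir_age]
    skip = list(ignored) + fresh_dirs
    return [
        e.path
        for e in entries
        if not e.is_dir and e.mtime > since and not _under(e.path, skip)
    ]
```

Skipping young directories is a heuristic substitute for snapshots: files in them are simply picked up by a later run once the directory has stopped changing.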

Backup Hive DDL

•  Open-source Java app uses Thrift server
•  Iterates over all tables and views
•  Constructs DDL statements from Hive metadata
•  Ignores specific tables
•  Generates Hive command script
–  Recreates all tables, adds all partitions back one at a time

•  Also used to move our Hive metadata to MySQL
•  To restore a full cluster:
–  Copy files back with copyFromLocal
–  Run perm/owner scripts
–  Reapply Hive DDL
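
The generated Hive command script has a simple shape: one CREATE statement per table, followed by one ALTER TABLE ... ADD PARTITION statement per partition. The Python sketch below shows that shape only. The real tool is a Java app that reads metadata through the Thrift server; here the table descriptions are plain dicts, all names are hypothetical, and partition values are assumed to be strings.

```python
def quote(value):
    """Quote a partition value as a HiveQL string literal."""
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"


def add_partition_statements(table, partitions):
    """One ALTER TABLE statement per partition, in the order given."""
    statements = []
    for spec in partitions:
        clause = ", ".join("%s=%s" % (key, quote(val)) for key, val in spec.items())
        statements.append("ALTER TABLE %s ADD PARTITION (%s);" % (table, clause))
    return statements


def restore_script(tables):
    """Build the full restore script from a list of table descriptions."""
    lines = []
    for table in tables:
        lines.append(table["create_ddl"].rstrip().rstrip(";") + ";")
        lines.extend(add_partition_statements(table["name"], table["partitions"]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [{
        "name": "hits",
        "create_ddl": "CREATE TABLE hits (url STRING) PARTITIONED BY (ds STRING)",
        "partitions": [{"ds": "2012-06-01"}, {"ds": "2012-06-02"}],
    }]
    print(restore_script(example), end="")
```

Adding partitions one statement at a time is slow for tables with many partitions, but a failure then points at a single partition rather than at the whole table.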

Other Things To Potentially Back Up

•  Back up the namenode metadata
–  We do this once every 4 hours
–  This is in addition to mirroring on four physical drives

•  Our job tracking database
•  No general backups of root or local FS on machines
–  Recreate machines with Puppet or other configuration management
tool instead

•  Oozie job database
–  We do NOT back this up
–  Tightly coupled with HDFS state and restore would be problematic
–  The recovery procedure is to rebuild and reinstall coordinators

Oozie: Why

•  Drawback: several times slower to write than cronjobs, while also
less expressive
•  Advantage: Ability to cleanly depend on input data
–  With cron, you would have to poll for stamps

•  Advantage: Clean and consistent metadata
–  See what ran, what failed, what is still waiting and why
–  Easily retry things which failed – good luck doing that with cron
–  Output datasets are deleted on rerun so ordering is preserved

Oozie: How

•  Establish consistent local practices for completion stamps, job
naming, owners, and source code locations
•  Enforce that all jobs must be idempotent
•  Create scripts/makefiles/build.xml to rebuild and reinstall jobs
after changes in their dependencies
•  Bypass the Oozie GUI
–  The CLI is a more capable tool
–  Go straight to the Oozie backing DB and issue SQL queries

•  Rerun coordinator actions, not workflows
•  Don't ever use Derby – we experienced massive corruption

Experiences and Expectations

•  Hadoop is not mature from a reliability and stability point of view
–  It will probably get there in a few more years

•  Cluster outages are common events, not outliers
–  Must bounce key services to pick up basic configuration changes such
as adding a new queue
–  As you scale up, you will encounter new classes of problems
–  Example: kernel deadlocks during heavy disk IO

•  You must design for failure and have a robust mechanism to
cleanly and easily resume execution once the cluster is back up.
•  Important jobs must be isolated from developers
–  Each cluster should contain ONE tier of jobs, grouped by SLA, release
process, and time-to-recovery requirements

Attributes of Robust Jobs

•  Idempotent and resumable regardless of when/how terminated
•  Has an external framework for recording success/failure, timing,
and amount of data processed
•  Knows what input data it needs and waits for it to be ready
•  Has mechanism for reprocessing if the input data is restated
•  Checked into source control
•  Testable in an expendable cluster before release
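
Several of these attributes (idempotent, resumable, waits for its input) can be obtained with completion stamps. The Python sketch below illustrates the pattern on a local filesystem; the stamp file name, the function name and the job callable are hypothetical, and on a cluster the same checks would be made against HDFS paths.

```python
import os
import shutil

STAMP = "_DONE"  # hypothetical completion-stamp file name


def run_idempotent(input_dirs, output_dir, job):
    """Run job(input_dirs, output_dir) at most once per completed output.

    Returns "skipped" if the output is already complete, "waiting" if any
    input lacks its stamp, and "ran" after a successful run.
    """
    if os.path.exists(os.path.join(output_dir, STAMP)):
        return "skipped"
    if not all(os.path.exists(os.path.join(d, STAMP)) for d in input_dirs):
        return "waiting"
    # Discard partial output from an earlier attempt so a rerun starts clean.
    if os.path.isdir(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)
    job(input_dirs, output_dir)
    # The stamp is written last, so it exists only if the job finished.
    with open(os.path.join(output_dir, STAMP), "w"):
        pass
    return "ran"
```

Because the stamp is written only after the job returns, a job killed at any point leaves no stamp, and the next invocation deletes the partial output and starts over.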

Benchmarks

•  How to evaluate hardware/network changes or map/reduce slot
tuning?
–  Key insight: For the same job, the same task always does the same
work
–  Rerun job and compare execution of the same task across machines
Machine   Tasks  Comps  Relative Perf (larger is better)
~~~~~~~~  ~~~~~  ~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
type1_1      82     37  0.99 ====================
type1_2      91     76  0.98 ====================
type1_3      92     35  1.01 ====================
type1_4      88     85  1.06 =====================
type2_1      71     26  1.30 ==========================
type3_1      92     80  0.68 ==============
type4_1      78     42  1.19 ========================
type4_2      78     45  1.29 ==========================
type4_3      75     75  1.19 ========================

remote      546    534  0.97 ===================
local       378     69  1.05 =====================
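
One way to turn "the same task always does the same work" into numbers is sketched below in Python. This is not the script that produced the table above. It assumes two runs of the same job, each mapping a task ID to the machine it ran on and its duration, counts a comparison whenever a task ran on different machines in the two runs, and scores each machine by the geometric mean of "other machine's time / this machine's time". The scoring formula and all names are assumptions.

```python
import math
from collections import defaultdict


def relative_performance(run_a, run_b):
    """Compare machines using tasks that ran on different machines in two runs.

    run_a and run_b map task_id -> (machine, seconds).
    Returns {machine: (comparisons, relative_perf)}; larger is better.
    """
    log_ratios = defaultdict(list)
    for task, (machine_a, secs_a) in run_a.items():
        if task not in run_b:
            continue
        machine_b, secs_b = run_b[task]
        if machine_a == machine_b:
            continue  # same machine both times: nothing to compare
        # A machine that finishes the same work faster gets a ratio above 1.
        log_ratios[machine_a].append(math.log(secs_b / secs_a))
        log_ratios[machine_b].append(math.log(secs_a / secs_b))
    return {
        machine: (len(values), math.exp(sum(values) / len(values)))
        for machine, values in log_ratios.items()
    }


if __name__ == "__main__":
    first = {"t1": ("type1_1", 10.0), "t2": ("type3_1", 20.0)}
    second = {"t1": ("type3_1", 20.0), "t2": ("type1_1", 10.0)}
    for machine, (comps, perf) in sorted(relative_performance(first, second).items()):
        print("%-8s %3d %5.2f %s" % (machine, comps, perf, "=" * int(round(perf * 20))))
```

A geometric mean is used so that one task being twice as fast and another twice as slow cancel out to 1.0.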

Features You Should Use

•  Fair Scheduler
•  refreshNodes, refreshQueues
•  Hadoop metrics
•  Namenode audit logging (disabled by default in 0.20)
•  Exclude files to decommission slave nodes

Staffing

•  We're living proof that you can hire some engineers with good
fundamentals but no specialized experience and throw them in
the deep end (it's the TA way)
•  Skills to hire for:
–  Operations and Linux experience
–  General service troubleshooting
–  Scripting
–  Java
–  SQL (even if not using Hive)

•  Managing clusters which are growing 2x - 4x per year takes 1-2
people working full time just to run in place

Open Questions

•  Resuming of jobs on jobtracker restart
•  Reloading of configurations without a restart
•  Robust response to cluster OOM conditions
•  Disabling job submission while allowing existing jobs to finish

•  Please tell us if you have the answers!

Appendix

This is for you to read later, after downloading the presentation.

Downloads

https://github.com/TAwarehouse/

DRBD Configuration
global {
    usage-count no;
    minor-count 1;
}
common {
    protocol C;
    syncer { rate 90M; }
}
resource internal {
    startup {
        wfc-timeout 600;
        degr-wfc-timeout 60;
    }
    disk {
        on-io-error detach;
    }
    net {
        # timeout 60;
        # connect-int 10;
        # ping-int 10;
        # max-buffers 2048;
        # max-epoch-size 2048;
    }
    on master01.tripadvisor.com {
        device /dev/drbd0;
        disk /dev/sda3;
        address 10.0.0.1:7789;
        flexible-meta-disk internal;
    }
    on master02.tripadvisor.com {
        device /dev/drbd0;
        disk /dev/sda3;
        address 10.0.0.2:7789;
        flexible-meta-disk internal;
    }
}

Corosync Configuration
compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        mcastaddr: 239.0.0.11
        mcastport: 5415
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
aisexec {
    user: root
    group: root
}
service {
    name: pacemaker
    ver: 0
}

Pacemaker Configuration

node master01.tripadvisor.com attributes standby="off"
node master02.tripadvisor.com attributes standby="off"
property $id="cib-bootstrap-options" stonith-enabled="false" no-quorum-policy="ignore" \
    expected-quorum-votes="2" dc-version="1.0.12-unknown" cluster-infrastructure="openais" \
    last-lrm-refresh="1337718104"
rsc_defaults $id="rsc-options" resource-stickiness="100"
primitive DataStore ocf:linbit:drbd params drbd_resource="internal" \
    op start interval="0" timeout="240s" op stop interval="0" timeout="100s"
primitive fs_DataStore ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/data/internal" fstype="ext3" \
    op monitor interval="60s" timeout="40s" op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
ms Cluster DataStore \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation fs-with-drbd inf: fs_DataStore Cluster:Master
order drdb-fs inf: Cluster:promote fs_DataStore:start
primitive MasterIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.236.10" nic="bond0" op monitor interval="30s"
colocation ip-with-drbd inf: MasterIP Cluster:Master
order fs-ip inf: fs_DataStore MasterIP
primitive NameNode lsb:hadoop-0.20-namenode op monitor interval="30s" meta target-role="Started"
colocation namenode-with-fs inf: NameNode fs_DataStore
order ip-namenode inf: MasterIP NameNode
primitive JobTracker lsb:hadoop-0.20-jobtracker op monitor interval="30s" meta target-role="Started"
colocation jobtracker-with-fs inf: JobTracker fs_DataStore
order namenode-jobtracker inf: NameNode JobTracker

Pacemaker Configuration (cont.)
primitive SecondaryNameNode lsb:hadoop-0.20-secondarynamenode \
    op monitor interval="30s" meta target-role="Started"
colocation secondarynamenode-not-with-ip -inf: SecondaryNameNode MasterIP
order jobtracker-secnamenode inf: JobTracker SecondaryNameNode
primitive Mysql ocf:heartbeat:mysql \
    params datadir="/data/internal/mysql" socket="/data/internal/mysql/mysql.sock" \
    binary="/usr/bin/mysqld_safe" \
    op monitor interval="30s" timeout="30s" \
    op start interval="0" timeout="120s" op stop interval="0" timeout="120s" \
    meta target-role="Started"
colocation mysql-with-fs inf: Mysql fs_DataStore
order ip-mysql inf: MasterIP Mysql
primitive HiveThrift lsb:hive-thrift
colocation hivethrift-with-ip inf: HiveThrift MasterIP
order jobtracker-hivethrift inf: JobTracker HiveThrift
order mysql-hivethrift inf: Mysql HiveThrift
primitive Oozie lsb:oozie
colocation oozie-with-fs inf: Oozie MasterIP
order jobtracker-oozie inf: JobTracker Oozie
primitive PingNodes ocf:pacemaker:ping \
    params host_list="192.168.236.1 192.168.236.2 192.168.236.5" multiplier="100" \
    op start interval="0" timeout="60s" op monitor interval="30s" timeout="60s"
clone PingClone PingNodes meta interleave="true"
location ping-with-ip MasterIP \
    rule $id="ping-with-ip-rule" pingd: defined pingd
location prefer-master01.tripadvisor.com MasterIP \
    rule $id="prefer-master01.tripadvisor.com-rule" 50: #uname eq master01.tripadvisor.com
order ip-ping inf: MasterIP PingClone

Nagios Checks

check_apt check_breeze check_by_ssh check_checkup_metric
check_clamd check_cluster check_cronjobs check_crontabs
check_dhcp check_dig check_disk check_disk_smb
check_disk_writable check_dns check_dummy check_fbrs
check_file_age check_files_age check_filesystems check_flexlm
check_ftp check_gc check_hadoop_master_logfiles
check_hdp_connectivity check_hdp_data_nodes check_hdp_hdfs
check_hdp_max_mr_settings check_hive check_hive_nsc
check_hive_server check_http check_icmp check_ide_smart
check_ifoperstatus check_ifstatus check_imap check_ircd
check_jabber check_load check_local_mail check_log
check_log_updated check_mailq check_memcached check_minerva
check_mrtg check_mrtgtraf check_mysql_repl check_nagios
check_nntp check_nntps check_nrpe check_nt
check_ntp check_ntp_peer check_ntp_time check_nwstat
check_oracle check_overcr check_ping check_pop
check_proc_filehandles check_procs check_real check_rpc
check_sensors check_simap check_smtp check_spop
check_ssh check_ssmtp check_swap check_swapping
check_sys_filehandles check_ta_services check_tcp check_time
check_udp check_ups check_users check_wave
check_writeable_tmp

Example Oozie Query
SELECT
    a.todaystatus as today,
    a.yesterdaystatus as yday,
    j.status as parent,
    j.app_name,
    a.last_modified_time,
    a.nominal_time,
    a.id
FROM (
    SELECT
        t.status as todaystatus,
        y.status as yesterdaystatus,
        COALESCE(t.id, y.id) AS id,
        y.job_id,
        COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
        COALESCE(t.last_modified_time, y.last_modified_time) AS last_modified_time
    FROM (SELECT *
          FROM COORD_ACTIONS
          WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
    RIGHT OUTER JOIN (SELECT *
                      FROM COORD_ACTIONS
                      WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
        ON (t.job_id=y.job_id)
    WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
        -- If they're WAITING today, then make sure yesterday ran OK.
        OR (t.status = 'WAITING' and y.status <> 'SUCCEEDED')
    UNION DISTINCT
    -- Dummy record to force the table to exist even when empty, since MySQL
    -- otherwise emits nothing if data is not returned.
    SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
) a
LEFT OUTER JOIN COORD_JOBS j
    ON a.job_id=j.id
WHERE j.status = 'RUNNING' OR j.status IS NULL
;
