
Building distributed systems using Helix

http://helix.incubator.apache.org
Apache Incubation, Oct 2012
@apachehelix

Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
Outline
•  Introduction
•  Architecture
•  How to use Helix
•  Tools
•  Helix usage
Examples of distributed data systems
Lifecycle

Single Node
•  Partitioning
•  Discovery
•  Co-location

Multi node
•  Replication

Fault tolerance
•  Fault detection
•  Recovery

Cluster Expansion
•  Throttle data movement
•  Re-distribution
Typical Architecture

[Diagram: App instances running on each Node, coordinated by a cluster manager over the network]
   
Distributed search service

[Diagram: index shards P.1-P.6, each with a replica, spread across Node 1, Node 2, Node 3]

Partition management
•  Multiple replicas
•  Even distribution
•  Rack-aware placement

Fault tolerance
•  Fault detection
•  Auto-create replicas
•  Controlled creation of replicas

Elasticity
•  Re-distribute partitions
•  Minimize movement
•  Throttle data movement
Distributed data store

[Diagram: partitions P.1-P.12, each with one MASTER and one SLAVE replica, spread across Node 1, Node 2, Node 3]

Partition management
•  Multiple replicas
•  1 designated master
•  Even distribution

Fault tolerance
•  Fault detection
•  Promote slave to master
•  Even distribution
•  No SPOF

Elasticity
•  Minimize downtime
•  Minimize data movement
•  Throttle data movement
Message consumer group
•  Similar to Message Groups in ActiveMQ
  –  guaranteed ordering of the processing of related messages across a single queue
  –  load balancing of the processing of messages across multiple consumers
  –  high availability / auto-failover to other consumers if a JVM goes down
•  Applicable to many pub/sub messaging systems like Kafka, RabbitMQ, etc.
Message consumer group

[Diagram: three consumer-group scenarios — ASSIGNMENT, SCALING, FAULT TOLERANCE]
Zookeeper provides low-level primitives. We need high-level primitives.

Zookeeper
•  File system
•  Lock
•  Ephemeral

Application
•  Node
•  Partition
•  Replica
•  State
•  Transition

[Diagram: the Application sits on a Framework, which sits on a Consensus System (Zookeeper)]
Outline
•  Introduction
•  Architecture
•  How to use Helix
•  Tools
•  Helix usage
Terminologies

Node         A single machine
Cluster      Set of nodes
Resource     A logical entity, e.g. database, index, task
Partition    Subset of the resource
Replica      Copy of a partition
State        Status of a partition replica, e.g. Master, Slave
Transition   Action that lets a replica change state, e.g. Slave -> Master
Core concept

State Machine
•  States: Offline, Slave, Master
•  Transitions: O->S, S->O, S->M, M->S

Constraints
•  States: M=1, S=2
•  Transitions: concurrent(O->S) < 5

Objectives
•  Partition placement
•  Failure semantics

State diagram: O <-> S <-> M, with COUNT=1 on Master and COUNT=2 on Slave
per partition, a bound on concurrent transitions (t1 <= 5), and placement
objectives that spread slave and master replicas evenly across the nodes
(see the math below).
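Spelled out (transcribing the slide's notation, where S(n_j) and M(n_j) count the slave and master replicas hosted on node n_j), the placement objectives are:

\[
\min \; \max_{n_j \in N} S(n_j) \quad \text{s.t. } \mathrm{COUNT}(S) = 2 \text{ per partition}
\]
\[
\min \; \max_{n_j \in N} M(n_j) \quad \text{s.t. } \mathrm{COUNT}(M) = 1 \text{ per partition}
\]

That is, no single node should carry disproportionately many slave or master replicas, subject to the per-partition replica counts.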
Helix solution

Message consumer group
•  States: Offline, Online
•  Transitions: Offline -> Online (start consumption), Online -> Offline (stop consumption)
•  Constraints: MAX=1 per partition, MAX per node=5

Distributed search
•  Same Offline/Online model, with MAX=3 (number of replicas)

A builder sketch of the consumer-group model follows.
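As a sketch, using the StateModelDefinition.Builder API shown later in the deck (state names and the wrapper class are illustrative), the consumer-group model might be declared like this:

import org.apache.helix.model.StateModelDefinition;

public class OnlineOfflineModel {
  public static StateModelDefinition build() {
    StateModelDefinition.Builder builder =
        new StateModelDefinition.Builder("OnlineOffline");
    builder.addState("ONLINE", 1);
    builder.addState("OFFLINE", 2);
    builder.initialState("OFFLINE");
    builder.addTransition("OFFLINE", "ONLINE"); // start consumption
    builder.addTransition("ONLINE", "OFFLINE"); // stop consumption
    // Each queue (partition) is consumed in at most one place.
    builder.upperBound("ONLINE", 1);
    return builder.build();
  }
}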
IDEALSTATE

Configuration          Constraints
•  3 nodes             •  1 Master
•  3 partitions        •  1 Slave
•  2 replicas          •  Even distribution
•  StateMachine

Replica placement and state per partition:

          P1     P2     P3
Master    N1:M   N2:M   N3:M
Slave     N2:S   N3:S   N1:S
CURRENT STATE

N1:  P1:OFFLINE, P3:OFFLINE
N2:  P2:MASTER, P1:MASTER
N3:  P3:MASTER, P2:SLAVE
EXTERNAL VIEW

       P1     P2     P3
       N1:O   N2:M   N3:M
       N2:M   N3:S   N1:O
Helix Based System Roles

[Diagram: the CONTROLLER reads the IDEAL STATE and each node's CURRENT STATE
and sends COMMANDs (transitions) to PARTICIPANT nodes, which send RESPONSEs back.
SPECTATORs watch the cluster for partition routing logic. Partitions P.1-P.12,
each with a master and a slave replica, are spread across Node 1, Node 2, Node 3.]
Logical deployment
Outline
•  Introduction
•  Architecture
•  How to use Helix
•  Tools
•  Helix usage
Helix based solution

1.  Define
2.  Configure
3.  Run
Define: State model definition
•  States
  –  All possible states
  –  Priority
•  Transitions
  –  Legal transitions
  –  Priority
•  Applicable to each partition of a resource

e.g. MasterSlave: O <-> S <-> M
Define: state model

StateModelDefinition.Builder builder =
    new StateModelDefinition.Builder("MASTERSLAVE");

// Add states and their rank to indicate priority.
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// Set the initial state when the node starts.
builder.initialState(OFFLINE);

// Add transitions between the states.
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Define: constraints

Constraints can be placed on states and transitions at each scope:

            State      Transition
Partition   Y          Y
Resource    -          Y
Node        Y          Y
Cluster     -          Y

Example (MasterSlave, partition scope):

            State      Transition
Partition   M=1, S=2   -

(In the state diagram: COUNT=1 on Master, COUNT=2 on Slave.)
Define: constraints

// Static constraint: at most one MASTER replica per partition.
builder.upperBound(MASTER, 1);

// Dynamic constraint: the SLAVE count follows the replication
// factor "R" configured for the resource.
builder.dynamicUpperBound(SLAVE, "R");

// Unconstrained: any number of OFFLINE replicas.
builder.upperBound(OFFLINE, -1);
26  
Define: participant plug-in code
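The slide's code is not reproduced in this extraction. A minimal sketch of a participant plug-in for the MasterSlave model, assuming the org.apache.helix.participant APIs (callback names follow Helix's onBecome<To>From<From> convention; the bodies are illustrative):

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Callbacks Helix invokes on this node when the controller orders
// a state transition for one of its partition replicas.
@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
public class MasterSlaveStateModel extends StateModel {

  @Transition(from = "OFFLINE", to = "SLAVE")
  public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
    // e.g. open the partition's store and start replicating from the master
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
    // e.g. start serving writes for this partition
  }

  @Transition(from = "MASTER", to = "SLAVE")
  public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
    // e.g. stop serving writes, resume replication
  }

  @Transition(from = "SLAVE", to = "OFFLINE")
  public void onBecomeOfflineFromSlave(Message message, NotificationContext context) {
    // e.g. close the partition's store
  }
}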
Step 2: configure
helix-admin --zkSvr <zkAddress>

CREATE CLUSTER
--addCluster <clusterName>

ADD NODE
--addNode <clusterName instanceId(host:port)>

CONFIGURE RESOURCE
--addResource <clusterName resourceName partitions statemodel>

REBALANCE => SET IDEALSTATE
--rebalance <clusterName resourceName replicas>
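For the running example (3 partitions, 2 replicas, MasterSlave), the sequence might look like this; the cluster, node, and resource names are illustrative:

helix-admin --zkSvr localhost:2181 --addCluster MyCluster
helix-admin --zkSvr localhost:2181 --addNode MyCluster localhost:12913
helix-admin --zkSvr localhost:2181 --addNode MyCluster localhost:12914
helix-admin --zkSvr localhost:2181 --addNode MyCluster localhost:12915
helix-admin --zkSvr localhost:2181 --addResource MyCluster myDB 3 MasterSlave
helix-admin --zkSvr localhost:2181 --rebalance MyCluster myDB 2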
zookeeper view
IDEALSTATE
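The znode screenshot is not reproduced here. For the resource configured above, the idealstate record would look roughly like the following sketch; field names follow Helix's ZNRecord layout, but instance names are illustrative and the exact fields vary by version and execution mode:

{
  "id" : "myDB",
  "simpleFields" : {
    "NUM_PARTITIONS" : "3",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave"
  },
  "mapFields" : {
    "myDB_0" : { "localhost_12913" : "MASTER", "localhost_12914" : "SLAVE" },
    "myDB_1" : { "localhost_12914" : "MASTER", "localhost_12915" : "SLAVE" },
    "myDB_2" : { "localhost_12915" : "MASTER", "localhost_12913" : "SLAVE" }
  },
  "listFields" : {}
}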
Step 3: Run

START CONTROLLER
run-helix-controller --zkSvr localhost:2181 --cluster MyCluster

START PARTICIPANT
(from code; see the sketch below)
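The participant side is started from code rather than a script. A minimal sketch, assuming the state model class from the "Define" step; factory signatures vary slightly across Helix versions, and the launcher class is illustrative:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.participant.statemachine.StateModelFactory;

public class ParticipantLauncher {
  // Hands Helix one state model instance per assigned partition.
  static class MasterSlaveStateModelFactory
      extends StateModelFactory<MasterSlaveStateModel> {
    @Override
    public MasterSlaveStateModel createNewStateModel(String partitionName) {
      return new MasterSlaveStateModel();
    }
  }

  public static void main(String[] args) throws Exception {
    // Connect to the cluster as a participant.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "localhost_12913", InstanceType.PARTICIPANT, "localhost:2181");

    // Register the plug-in state model so the controller can drive transitions.
    manager.getStateMachineEngine().registerStateModelFactory(
        "MasterSlave", new MasterSlaveStateModelFactory());

    manager.connect(); // from here on, Helix invokes the transition callbacks
  }
}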
zookeeper view
Znode content

[Screenshots: CURRENT STATE and EXTERNAL VIEW znodes]
Spectator plug-in code
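The spectator slide was also a screenshot. A rough sketch of the routing side, assuming Helix's RoutingTableProvider; the cluster, resource, and partition names are illustrative:

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class RoutingClient {
  public static void main(String[] args) throws Exception {
    // Connect as a spectator: watch the cluster, never host partitions.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
    manager.connect();

    // Keeps an up-to-date copy of the external view.
    RoutingTableProvider routingTable = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTable);

    // Route a write for partition myDB_0 to its current MASTER.
    List<InstanceConfig> masters =
        routingTable.getInstances("myDB", "myDB_0", "MASTER");
    if (!masters.isEmpty()) {
      System.out.println("send writes to " + masters.get(0).getHostName());
    }
  }
}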
Helix Execution modes
IDEALSTATE

Configuration          Constraints
•  3 nodes             •  1 Master
•  3 partitions        •  1 Slave
•  2 replicas          •  Even distribution
•  StateMachine

Replica placement and state per partition:

          P1     P2     P3
Master    N1:M   N2:M   N3:M
Slave     N2:S   N3:S   N1:S
Execution modes
•  Who controls what

                     AUTO REBALANCE   AUTO    CUSTOM
Replica placement    Helix            App     App
Replica state        Helix            Helix   App
Auto rebalance vs. Auto

[Two idealstate znode listings side by side: AUTO REBALANCE vs. AUTO]
In action

Initial placement (MasterSlave, p=3 r=2 N=3), identical in both modes:

Node 1   Node 2   Node 3
P1:M     P2:M     P3:M
P2:S     P3:S     P1:S

On failure of Node 1:

Auto rebalance: auto-create replicas and assign states
Node 1   Node 2   Node 3
P1:O     P2:M     P3:M
P2:O     P3:S     P1:S
         P1:M     P2:S

Auto: only change states to satisfy constraints
Node 1   Node 2   Node 3
P1:M     P2:M     P3:M
P2:S     P3:S     P1:M
(Node 1's replicas are lost; Node 3's P1 slave is promoted to master.)
Custom mode: example
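In custom mode the application computes the full partition-to-{instance: state} map and writes it into the idealstate itself. A rough sketch, assuming ZKHelixAdmin and the ZNRecord map-field layout (names are illustrative, and admin APIs vary slightly across versions):

import java.util.HashMap;
import java.util.Map;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class CustomModeExample {
  public static void main(String[] args) {
    ZKHelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Fetch the current idealstate and overwrite one partition's placement.
    IdealState idealState = admin.getResourceIdealState("MyCluster", "myDB");
    Map<String, String> replicaMap = new HashMap<String, String>();
    replicaMap.put("localhost_12913", "MASTER");
    replicaMap.put("localhost_12914", "SLAVE");
    idealState.getRecord().setMapField("myDB_0", replicaMap);

    // Helix computes and fires only the transitions needed to get there.
    admin.setResourceIdealState("MyCluster", "myDB", idealState);
  }
}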
Custom mode: handling failure
•  Custom code invoker
•  Code that lives on all nodes, but is active in only one place
•  Invoked when a node joins/leaves the cluster
•  Computes a new idealstate
•  Helix controller fires the transitions without violating constraints

Example: P1's master moves from N1 to N2.

       P1     P2     P3              P1     P2     P3
       N1:M   N2:M   N3:M    =>      N1:S   N2:M   N3:M
       N2:S   N3:S   N1:S            N2:M   N3:S   N1:S

Transitions: (1) N1: M -> S, (2) N2: S -> M.
Running 1 and 2 in parallel would violate the single-master constraint,
so Helix sends 2 after 1 is finished.
Outline
•  Introduction
•  Architecture
•  How to use Helix
•  Tools
•  Helix usage
Tools
•  Chaos monkey
•  Data driven testing and debugging
•  Rolling upgrade
•  On-demand task scheduling and intra-cluster messaging
•  Health monitoring and alerts
Data driven testing
•  Instrument: Zookeeper, controller, and participant logs
•  Simulate: Chaos monkey
•  Analyze: check invariants, e.g.
  •  state transition constraints are respected
  •  state count constraints are respected
•  Debugging made easy: reproduce the exact sequence of events
Structured Log File - sample

timestamp      partition   instanceName       sessionId                            state
1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236426  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236530  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236530  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1323312236561  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236685  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1323312236685  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236685  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236719  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1323312236719  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1323312236719  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236814  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE

No more than R=2 slaves

Time    State    Number of Slaves  Instance
42632   OFFLINE  0                 10.117.58.247_12918
42796   SLAVE    1                 10.117.58.247_12918
43124   OFFLINE  1                 10.202.187.155_12918
43131   OFFLINE  1                 10.220.225.153_12918
43275   SLAVE    2                 10.220.225.153_12918
43323   SLAVE    3                 10.202.187.155_12918
85795   MASTER   2                 10.220.225.153_12918
How long was it out of whack?

Number of Slaves   Time        Percentage
0                  1082319     0.5
1                  35578388    16.46
2                  179417802   82.99
3                  118863      0.05

83% of the time, there were 2 slaves to a partition.

Number of Masters  Time        Percentage
0                  15490456    7.16
1                  200706916   92.84

93% of the time, there was 1 master to a partition.
Invariant 2: State Transitions

FROM      TO        COUNT
MASTER    SLAVE     55
OFFLINE   DROPPED   0
OFFLINE   SLAVE     298
SLAVE     MASTER    155
SLAVE     OFFLINE   0
Outline
•  Introduction
•  Architecture
•  How to use Helix
•  Tools
•  Helix usage
Helix usage at LinkedIn

Espresso
In flight
•  Apache S4
  –  Partitioning, co-location
  –  Dynamic cluster expansion
•  Archiva
  –  Partitioned replicated file store
  –  Rsync based replication
•  Others in evaluation
  –  Bigtop
Auto scaling software deployment tool
•  States
  •  Download, Configure, Start
  •  Active, Standby
•  Constraint for each state
  •  Download < 100
  •  Active = 1000
  •  Standby = 100

State chain: Offline -> Download (<100) -> Configure -> Start -> Active (1000) / Standby (100)
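A sketch of how this model could be declared with the same builder API; the state names are illustrative, and whether the counts are enforced per partition or cluster-wide depends on how the constraints are scoped:

import org.apache.helix.model.StateModelDefinition;

public class DeploymentModel {
  public static StateModelDefinition build() {
    StateModelDefinition.Builder builder =
        new StateModelDefinition.Builder("DeploymentStateModel");
    builder.addState("ACTIVE", 1);
    builder.addState("STANDBY", 2);
    builder.addState("START", 3);
    builder.addState("CONFIGURE", 4);
    builder.addState("DOWNLOAD", 5);
    builder.addState("OFFLINE");
    builder.initialState("OFFLINE");
    // Linear bring-up chain from the slide.
    builder.addTransition("OFFLINE", "DOWNLOAD");
    builder.addTransition("DOWNLOAD", "CONFIGURE");
    builder.addTransition("CONFIGURE", "START");
    builder.addTransition("START", "ACTIVE");
    builder.addTransition("START", "STANDBY");
    builder.addTransition("ACTIVE", "STANDBY");
    builder.addTransition("STANDBY", "ACTIVE");
    // Bounds from the slide: <100 downloading, 1000 active, 100 standby.
    builder.upperBound("DOWNLOAD", 100);
    builder.upperBound("ACTIVE", 1000);
    builder.upperBound("STANDBY", 100);
    return builder.build();
  }
}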
Summary
•  Helix: a generic framework for building distributed systems
•  Modifying/enhancing system behavior is easy
  –  Abstraction and modularity are key
•  Simple programming model: declarative state machine
Roadmap

•  Features
  •  Span multiple data centers
  •  Automatic load balancing
  •  Distributed health monitoring
  •  YARN generic application master for real-time apps
  •  Stand-alone Helix agent

website   http://helix.incubator.apache.org
user      user@helix.incubator.apache.org
dev       dev@helix.incubator.apache.org
twitter   @apachehelix, @kishoreg1980
