So in this activity, we're going to see failover in action.
00:10.320 --> 00:14.370
We're actually going to see firsthand what happens when you add a new node to Elasticsearch.
00:14.580 --> 00:16.560
And what happens when a node goes down?
00:16.890 --> 00:21.180
And we can see how Elasticsearch can resiliently react to both situations.
00:23.950 --> 00:26.680
Now, back when I worked at Amazon, we called this a game day.
00:27.190 --> 00:31.930
It's actually a good idea once in a while to simulate an actual failure under controlled conditions,
00:32.140 --> 00:36.130
just to make sure that your system can actually handle it and you know how to handle it as an engineer
00:36.130 --> 00:38.290
or an administrator when it happens unexpectedly.
00:39.220 --> 00:43.660
So what we're going to do in this activity is set up three Elasticsearch nodes running on our single virtual
00:43.660 --> 00:44.200
machine.
00:44.620 --> 00:48.910
It's an unusual thing to do, but it will allow us to actually simulate running three different nodes
00:48.910 --> 00:51.820
on a cluster within our one virtual machine host here.
00:52.660 --> 00:56.980
As we add the second node to our cluster, we can observe how Elasticsearch will automatically expand
00:56.980 --> 01:01.810
itself into that new node and spread out the shards automatically without you having to do anything.
01:02.500 --> 01:03.880
Then things will get interesting.
01:04.300 --> 01:08.800
We'll actually stop our original node, the first one that we started with, and we'll observe everything
01:08.800 --> 01:10.870
moving over to the remaining nodes automatically.
01:11.440 --> 01:16.390
Again, it all just happens as if by magic, and then we'll spin up that first node again to observe
01:16.390 --> 01:19.000
everything going back to normal automatically as well.
01:19.930 --> 01:23.950
So although there's nothing you really have to do in response to any of these events, there is value
01:23.950 --> 01:28.810
to actually seeing it happen firsthand. It gives you some confidence in your cluster and its
01:28.810 --> 01:32.680
operations, and it also lets you know what to expect when something like this happens.
01:33.400 --> 01:37.000
If what you're seeing is not what you expect, then you know you actually have to do something.
01:37.510 --> 01:39.100
So let's start playing with failover.
01:41.110 --> 01:43.300
So in this video, I don't want you to follow along.
01:43.300 --> 01:44.080
Just watch.
01:44.260 --> 01:48.850
The reason is that you need at least three nodes to constitute a cluster that can handle failover in practice.
01:49.240 --> 01:53.410
And it used to be, in Elasticsearch seven, that you could fake it out and actually run three nodes on
01:53.410 --> 01:54.010
one machine.
01:54.010 --> 01:57.760
But the trick that I used for that no longer exists in Elasticsearch eight.
01:58.210 --> 02:00.190
However, the process of failover is the same.
02:00.190 --> 02:03.490
So what I'm going to show you is still the same mechanism in Elasticsearch eight.
02:03.880 --> 02:08.140
So you will learn by watching this and seeing what happens, but unfortunately you won't be able to
02:08.140 --> 02:08.710
follow along hands-on.
02:09.730 --> 02:14.140
Just to catch you up on what I've done so far in this Elasticsearch seven instance, however, I've
02:14.140 --> 02:16.420
set up three nodes running virtually on this machine.
02:16.450 --> 02:21.640
Node one, node two, and node three, and they are set up as three different services called elasticsearch.service,
02:22.090 --> 02:23.330
elasticsearch-node-2,
02:23.380 --> 02:25.780
and elasticsearch-node-3.
02:26.110 --> 02:31.030
And to get you caught up, I basically stopped all three of those and I'm just starting up the original
02:31.030 --> 02:35.860
one, elasticsearch.service, right now to spin up that first node in our cluster.
02:38.750 --> 02:39.080
All right.
02:39.090 --> 02:39.920
That's chugging away.
02:40.310 --> 02:47.060
Let's also start up node two while we're at it, and we'll just say start elasticsearch-node-2.
02:49.070 --> 02:51.560
So now we have two nodes of Elasticsearch spinning up together.
02:51.590 --> 02:58.640
Let's watch the status of our cluster while this happens. We can do that with curl -X GET 127.0.0.1:9200/_cluster/health?pretty.
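That health check can also be scripted instead of typed by hand. Here is a minimal Python sketch, assuming the same local cluster on port 9200 that the video uses; the network call is only shown in a comment since it needs a live node, and the summary helper just reads the fields this demo keeps watching:

```python
import json
from urllib.request import urlopen

def fetch_health(host: str = "127.0.0.1:9200") -> dict:
    """Equivalent of: curl -X GET 127.0.0.1:9200/_cluster/health?pretty"""
    with urlopen(f"http://{host}/_cluster/health") as resp:
        return json.load(resp)

def health_summary(health: dict) -> str:
    """Condense a health response into the fields watched in this demo."""
    return (f"status={health['status']} "
            f"nodes={health['number_of_nodes']} "
            f"active_shards={health['active_shards']}")

# Against a live node you'd run: print(health_summary(fetch_health()))
```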
03:07.050 --> 03:09.820
And it hasn't even spun up yet on the original host here.
03:09.900 --> 03:14.700
So remember, 9200 is node one, and we do need to give it a few minutes to start up.
03:14.700 --> 03:18.840
So I'm going to give it a few minutes and come back when that's up and running.
03:19.920 --> 03:20.800
Let's try it again.
03:20.820 --> 03:21.790
See if we're lucky yet.
03:21.810 --> 03:22.470
No, not yet.
03:22.770 --> 03:26.640
I mean, I could be tailing the logs here to try to get a better idea of when it's done starting up,
03:26.640 --> 03:31.560
but kind of a quicker way is just to run the top command and you can see it is chugging away.
03:31.560 --> 03:37.080
It's pegging our CPU at 50% on each of our two virtual processors here for each of the two nodes it was
03:37.080 --> 03:37.770
trying to spin up.
03:37.770 --> 03:42.660
So, you know, kind of a quick-and-dirty way to figure out when things might be ready to go is
03:42.660 --> 03:48.390
when those CPU percentages for those two Java processes associated with Elasticsearch come down from
03:48.390 --> 03:49.290
50% apiece.
03:49.290 --> 03:52.800
That will tell me that it's done doing all of its initial work to get things started up.
03:52.800 --> 03:55.020
So I'm just going to sit here until that drops.
03:55.590 --> 03:55.830
All right.
03:55.830 --> 03:57.390
Looks like we've dropped down a little bit.
03:57.630 --> 04:00.480
It's still pretty heavy on the CPU, but I think it's starting to catch up.
04:00.480 --> 04:04.920
So I'm going to hit queue to get out of top and let's try that command again.
04:05.940 --> 04:07.440
Hey, we got data here.
04:07.440 --> 04:08.070
So cool.
04:08.070 --> 04:12.720
It did in fact spin up and look, our status is actually green now, and it's reporting number of nodes:
04:12.720 --> 04:16.560
two, number of data nodes: two, and active primary shards too. Cool.
04:16.950 --> 04:21.570
So this means that our two nodes are actually up and running on our one virtual machine here.
04:21.900 --> 04:24.540
And for the first time, our status is green.
04:24.540 --> 04:29.250
This is the first time our cluster has kind of been happy with itself because it actually has a backup
04:29.250 --> 04:31.380
copy of all of its shards across those two nodes.
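The shard math behind that green status is simple: the cluster holds every primary shard plus one replica copy of it, and it only goes green once those replicas can live on a different node. A quick sketch of the arithmetic (the 20 primaries and 1 replica per primary are the numbers this cluster reports later in the video):

```python
def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Every primary shard plus its replica copies."""
    return primaries * (1 + replicas_per_primary)

# With 20 primary shards and 1 replica each, the cluster holds 40 shard
# copies, which is why health reports 40 active shards once two nodes are up.
print(total_shard_copies(20, 1))
```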
04:31.680 --> 04:35.520
Now, it is sort of a false sense of security because they're all running on the same virtual machine.
04:35.520 --> 04:40.080
And if this machine got hit by a meteor or something, it would be game over for our data still.
04:40.080 --> 04:44.670
But Elasticsearch has been tricked into believing that we have some redundancy going on here.
04:44.670 --> 04:50.460
At least, however, you need at least three nodes for it to be able to elect a new master if the master
04:50.460 --> 04:51.240
node goes down.
04:51.240 --> 04:53.040
So it's not really green.
04:53.040 --> 04:53.730
It's green-ish.
04:53.730 --> 04:55.920
It really wants one more node to be totally happy.
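That "green-ish" caveat comes down to master election needing a quorum: a strict majority of master-eligible nodes has to survive. This is a sketch of the majority rule, not Elasticsearch's actual election code, but it shows why two nodes aren't enough while three are:

```python
def master_quorum(master_eligible_nodes: int) -> int:
    """Smallest strict majority of master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

def survives_one_loss(master_eligible_nodes: int) -> bool:
    """Can the cluster still elect a master after losing one node?"""
    return master_eligible_nodes - 1 >= master_quorum(master_eligible_nodes)

# Two nodes need both alive to form a majority, so losing one is fatal;
# three nodes can lose one and still elect a master.
print(survives_one_loss(2), survives_one_loss(3))  # False True
```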
04:55.920 --> 05:02.700
So let's go ahead and spin up that third node. To do that, we'll say sudo /bin/systemctl
05:03.270 --> 05:06.750
start elasticsearch-node-3.
05:09.210 --> 05:13.320
And now we can actually run our health query again to see how that's doing.
05:13.620 --> 05:14.040
Whoops.
05:14.850 --> 05:15.330
There we go.
05:16.320 --> 05:19.890
And right now, it's still saying we only have two nodes because the third one hasn't spun up yet.
05:20.850 --> 05:24.920
Interestingly, we can also query port 9201 now for health as well.
05:24.930 --> 05:25.470
That's our node two.
05:26.100 --> 05:27.510
And that should also be responding.
05:27.960 --> 05:28.590
And it is.
05:29.520 --> 05:32.250
So let's just wait for that to actually find our third node as
05:32.250 --> 05:34.350
it spins up, and balance things across it.
05:35.430 --> 05:41.160
Note that we have 40 active shards right now, 20 active primary shards, and two data nodes being
05:41.160 --> 05:46.500
reported here right now. Unassigned and initializing shards are both at zero because it hasn't realized
05:46.500 --> 05:50.130
that that third system has spun up yet and it needs to rebalance across it.
05:50.670 --> 05:54.270
Again, I'm going to run top just to get a hint as to when this might be ready to go.
05:55.290 --> 05:59.280
And you can see that the node that's spinning up is consuming most of our CPU time right now.
06:01.190 --> 06:03.020
But it looks like it's starting to balance out a little bit.
06:03.020 --> 06:04.670
So let's go back and try again.
06:07.370 --> 06:08.210
And not quite yet.
06:09.650 --> 06:09.950
All right.
06:09.950 --> 06:11.500
Looks like it finally picked up that new node.
06:11.510 --> 06:16.850
So now we have number of nodes: three, and it is saying it is relocating two shards at the moment.
06:16.850 --> 06:19.730
Let's just keep watching this. Number of pending tasks:
06:19.730 --> 06:20.900
One, relocating shards.
06:20.900 --> 06:25.520
Two, so you can see it's in the process of actually redistributing all of that data across the third
06:25.520 --> 06:27.020
node that it's just discovered on its own.
06:27.200 --> 06:30.710
All we did was start that up and automatically it's moving data across it.
06:30.710 --> 06:31.610
So that's pretty cool.
06:32.540 --> 06:35.900
Let's just keep running this health check until it's done moving stuff.
06:37.610 --> 06:38.630
Looks like it might be done.
06:44.740 --> 06:50.440
So we're still in green status and we now have three nodes, and it looks like we have nothing relocating
06:50.440 --> 06:52.390
or initializing or being unassigned at the moment.
06:52.390 --> 06:54.610
So I think it's pretty much caught up.
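Rather than rerunning the curl command by hand, the "is it done rebalancing" check can be expressed as a predicate over the health response. A minimal sketch under the same local-cluster assumption; the polling loop is only defined, not run, since it needs a live cluster:

```python
import json
import time
from urllib.request import urlopen

def is_settled(health: dict) -> bool:
    """Steady state: green, with no shards moving, starting, or unassigned."""
    return (health.get("status") == "green"
            and health.get("relocating_shards", 0) == 0
            and health.get("initializing_shards", 0) == 0
            and health.get("unassigned_shards", 0) == 0)

def wait_for_green(host: str = "127.0.0.1:9200", interval: float = 2.0) -> None:
    """Poll _cluster/health on a live cluster until rebalancing finishes."""
    while True:
        with urlopen(f"http://{host}/_cluster/health") as resp:
            if is_settled(json.load(resp)):
                return
        time.sleep(interval)
```

Against a running cluster you would simply call `wait_for_green()` after starting or stopping a node.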
06:54.880 --> 06:55.360
That's good.
06:56.350 --> 06:58.870
All right, so now let's have some fun and simulate some failures.
06:59.500 --> 07:06.250
So let's go ahead and just kill our original master node, that is node one, and see how it responds.
07:06.550 --> 07:07.480
So I'm going to say.
07:08.600 --> 07:13.700
sudo /bin/systemctl stop elasticsearch.service.
07:13.700 --> 07:15.320
That's our original service that we spun up.
07:15.320 --> 07:18.170
That was our original master for our cluster.
07:18.680 --> 07:21.320
And you would think that terrible things would happen when you do this.
07:24.020 --> 07:24.350
All right.
07:24.350 --> 07:24.830
It's gone.
07:24.830 --> 07:27.320
So let's go ahead and repeat that health query again.
07:27.320 --> 07:30.500
I'm hitting 9201 for the health query, which is one of the surviving nodes: node two.
07:32.210 --> 07:36.110
And you can see we're in yellow status right now because we're down to two nodes now and we haven't
07:36.110 --> 07:38.300
actually finished rebalancing everything yet.
07:38.750 --> 07:40.670
But you can see that it's actually working on it, right?
07:40.670 --> 07:46.400
So we have 14 unassigned shards that it needs to move around, 26 out of 40 active shards that we had
07:46.400 --> 07:46.970
originally.
07:47.300 --> 07:50.060
Let's just keep doing an update here and we can kind of see that progress.
07:50.900 --> 07:53.150
We now have 14 delayed unassigned shards.
07:56.590 --> 08:00.280
And it's just going to keep chugging away on this until it actually gets everything rebalanced.
08:01.390 --> 08:03.850
Given enough time, our status should go back to green.
08:05.020 --> 08:08.200
While it's still working though, let's see what happens if we actually try to issue a search query
08:08.200 --> 08:09.910
while it's in this state of rebalancing.
08:09.940 --> 08:15.280
So let's say curl -X GET 127.0.0.1:9201.
08:15.700 --> 08:19.600
Again, you know, in the real world you'd want to round-robin your requests across different hosts anyway.
08:19.600 --> 08:24.430
So in this case we're hitting one of the surviving nodes in our cluster, and let's hit the Shakespeare
08:24.430 --> 08:32.020
index and do a _search with ?pretty just to get a dump of what's in there. And it's still responding,
08:32.020 --> 08:32.230
right?
08:32.230 --> 08:33.040
So that's pretty cool.
08:33.040 --> 08:37.840
Even though we, you know, took our original master node out behind the woodshed and shot it in
08:37.840 --> 08:40.600
the head, we're still responding to search queries.
08:40.870 --> 08:42.670
It can still keep on going.
08:42.670 --> 08:47.260
It's like nothing happened as far as the application's concerned, except that port 9200 is no longer
08:47.260 --> 08:47.830
responding.
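That round-robin idea can be sketched as a tiny client that tries each node in turn and skips dead ones, which is exactly how a request would survive the master we just killed. The host list is an assumption matching this demo's three local ports, and `search` is only defined here, not run, since it needs live nodes:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical host list matching this demo's three local nodes.
HOSTS = ["127.0.0.1:9200", "127.0.0.1:9201", "127.0.0.1:9202"]

def failover_order(hosts: list, start: int) -> list:
    """Hosts in the order we'd try them, starting at `start` and wrapping."""
    return [hosts[(start + i) % len(hosts)] for i in range(len(hosts))]

def search(path: str, attempt: int = 0) -> bytes:
    """Try each node in turn; a downed node (like our killed master) is skipped."""
    for host in failover_order(HOSTS, attempt):
        try:
            with urlopen(f"http://{host}{path}", timeout=5) as resp:
                return resp.read()
        except URLError:
            continue  # node is down; fall through to the next one
    raise RuntimeError("no Elasticsearch node responded")

# Against live nodes: search("/shakespeare/_search?pretty")
```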
08:48.560 --> 08:50.680
Let's go ahead and check on the status again of the cluster health.
08:51.520 --> 08:54.940
And we are back to green status now, back to 40 active shards.
08:54.940 --> 08:59.740
So it's successfully rebalanced everything. With that final line, the active shards percent number,
09:00.170 --> 09:01.090
now set to 100,
09:01.090 --> 09:05.620
you know that things are in a steady state at this point, so that's kind of awesome.
09:05.620 --> 09:11.710
We destroyed our original master node and saw Elasticsearch adapt to that, dynamically elect a new master node
09:12.040 --> 09:13.780
and just keep on serving requests.
09:13.780 --> 09:14.590
It's as if nothing happened.
09:14.980 --> 09:17.980
That gives us plenty of time to go figure out what actually went wrong on node one,
09:18.460 --> 09:20.530
repair it, and bring that node back into service.
09:20.950 --> 09:22.210
And let's imagine we did that.
09:22.210 --> 09:26.500
So let's say, okay, we figured out that, you know, node one had a bad hard drive.
09:26.500 --> 09:28.600
We went and replaced it and recovered all the data somehow.
09:28.600 --> 09:30.460
And now I'm ready to bring that back into service.
09:30.820 --> 09:33.550
So let's go ahead and start that node back up and see what happens.
09:34.090 --> 09:39.820
sudo /bin/systemctl start elasticsearch.service.
09:41.710 --> 09:47.080
So that's going to spin the original master node back up, and now let's hit our health query again.
09:48.400 --> 09:52.690
Hasn't seen it yet because it does take a few minutes for that node to start responding to traffic.
09:54.250 --> 09:56.380
So I'm just going to pause and come back when that's happened.
09:58.920 --> 10:03.510
And very quickly, actually, it picked up that third node and actually moved everything over to it
10:03.510 --> 10:03.980
as well.
10:03.990 --> 10:09.480
So we are back in green status with the number of nodes at three, and the active shards percent
10:09.480 --> 10:11.100
number is back at 100.
10:11.100 --> 10:12.810
So we have recovered.
10:13.050 --> 10:18.030
Now, if you are following along with this yourself, sometimes this can be a little bit unreliable
10:18.060 --> 10:22.470
because of the environment that we've set it up in, being a little bit convoluted and fabricated.
10:23.070 --> 10:28.350
But if you do run into trouble seeing that new node spin back up, you can just stop and start nodes
10:28.350 --> 10:29.220
two and three again.
10:29.220 --> 10:31.140
And if you try it again, it should work just fine.
10:32.010 --> 10:36.990
I am told that in production this works totally reliably, but you know, if you do run into trouble,
10:36.990 --> 10:39.360
just try restarting all the services and you should be okay.
10:39.780 --> 10:40.500
But yeah, there you have it.
10:40.500 --> 10:45.900
Failover in action, and recovery as well, when we actually put a failed host back into service.
10:46.350 --> 10:47.220
So there it is.