
WEBVTT

00:07.030 --> 00:12.280


Elasticsearch is a complex piece of software by itself, but complexity is further
increased when you

00:12.280 --> 00:14.410


spin up multiple instances to form a cluster.

00:15.280 --> 00:18.040


This complexity comes with the risk of things going wrong.

00:18.730 --> 00:22.660


In this lecture, we're going to explore some common issues that you're likely to
encounter on your

00:22.660 --> 00:23.770


Elasticsearch journey.

00:24.550 --> 00:27.550


There are plenty more potential issues than we can squeeze into this lesson.

00:27.550 --> 00:33.820


So let's focus on the most prevalent ones, mainly related to node setup, cluster
formation, and

00:33.820 --> 00:34.690


the cluster state.

00:38.440 --> 00:44.290


Potential Elasticsearch issues can be categorized according to the Elasticsearch
lifecycle. The first category is node

00:44.290 --> 00:44.740


setup.

00:45.790 --> 00:49.060


Potential issues include the installation and initial startup.

00:49.570 --> 00:53.710


The issues can differ significantly depending on how you run your cluster, like
whether it's a local

00:53.710 --> 00:56.920


installation, running on containers, or a cloud service, etc.

00:57.880 --> 01:02.860


In this lesson, we'll follow the process of a local setup and focus specifically on
bootstrap checks,

01:02.860 --> 01:04.960


which are very important when starting a node up.

01:06.130 --> 01:07.750


Discovery and cluster formation.

01:08.650 --> 01:13.060


This category covers issues related to the discovery process when the nodes need to
communicate with
01:13.060 --> 01:15.220
each other to establish a cluster relationship.

01:16.000 --> 01:21.010


This may involve problems during the initial bootstrapping of the cluster, nodes
not joining the cluster,

01:21.010 --> 01:22.750


and problems with master elections.

01:24.090 --> 01:25.650


Indexing data and sharding.

01:26.400 --> 01:29.040


This includes issues related to index settings and mapping.

01:29.220 --> 01:33.540


But as this is covered in other lectures, we'll just touch upon how sharding issues
are reflected in

01:33.540 --> 01:34.350


the cluster state.

01:35.580 --> 01:41.070


Searching, being the ultimate step of the setup journey, can raise issues
related to queries that

01:41.070 --> 01:44.460


return less relevant results or issues related to search performance.

01:44.910 --> 01:47.160


This topic is covered in another lecture in this course.

01:50.550 --> 01:54.990


Now that we have some initial background of potential issues with Elasticsearch,
let's go one by one.

01:54.990 --> 01:59.550


Using a practical approach will expose the pitfalls and show how to overcome them.

02:01.180 --> 02:06.280


So before we start messing up our cluster to simulate real world issues, let's back
up our existing

02:06.280 --> 02:06.850


indices.

02:07.210 --> 02:08.320


This will have two benefits.

02:08.470 --> 02:12.580


After we're done, we can get back to where we left off and just continue on in the
course, and we'll

02:12.580 --> 02:16.510


better understand the importance of backing up to prevent data loss while
troubleshooting.
02:17.020 --> 02:18.700
First, we need to set up our repository.

02:18.850 --> 02:23.350


So let's open up our Elasticsearch YAML file using your favorite editor.

02:23.800 --> 02:24.270


I like nano.

02:24.280 --> 02:25.780


Nano, let's see... etc.

02:26.080 --> 02:27.040


Elasticsearch.

02:28.150 --> 02:29.260


Elasticsearch.

02:29.710 --> 02:30.310


Dot yml.

02:35.870 --> 02:39.290


And we want to make sure we have a registered repository path on our machine.

02:39.300 --> 02:41.270


So we're looking for the path.repo setting.

02:42.650 --> 02:43.630


I don't think there's one in here.

02:45.990 --> 02:47.190


Let's go ahead and add one then.

02:52.510 --> 02:53.440


path.repo.

02:55.430 --> 02:56.090


Square bracket.

02:56.420 --> 03:00.590


/home/student/backups. That should do the job.
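For reference, the line being added to elasticsearch.yml would look roughly like this (the path is the one used in the lecture; adjust it to your own machine):

```bash
# Add the snapshot repository path to /etc/elasticsearch/elasticsearch.yml
# path.repo: ["/home/student/backups"]
sudo nano /etc/elasticsearch/elasticsearch.yml
```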

03:01.550 --> 03:01.910


All right.

03:02.210 --> 03:04.310


Control-O, Enter, Control-X.

03:04.430 --> 03:05.180


So that saved.

03:06.290 --> 03:10.340


And we might want to save a copy of this config file now as well so we can get back
to it at the end

03:10.340 --> 03:10.880


of the lesson.

03:10.880 --> 03:20.090


So let's make a copy; we'll say sudo cp, that's /etc/elasticsearch/elasticsearch.yml,
and

03:20.090 --> 03:25.460


let's just copy that into our home directory and that way we can just copy that
back when we're done

03:25.460 --> 03:27.440


if we need to restore any of those settings later on.
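The copy command being typed here is roughly:

```bash
# Keep a copy of the current config so we can restore it at the end of the lesson.
sudo cp /etc/elasticsearch/elasticsearch.yml ~/
```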

03:28.370 --> 03:28.840


Okay.

03:28.850 --> 03:33.020


So we need to make sure that the directory exists that we're going to be storing
that repository into

03:33.050 --> 03:34.780


and that Elasticsearch can write into it.

03:34.790 --> 03:39.320


So let's say mkdir dash p /home/student/backups.

03:42.110 --> 03:48.020


And we'll change the group on that to elasticsearch, like so: sudo chgrp
elasticsearch.

03:48.770 --> 03:49.970


/home/student/backups.

03:52.940 --> 03:54.620


And finally, make it writable.

03:55.040 --> 03:59.420


sudo chmod g plus w /home/student/backups.
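Taken together, the three commands from this step are roughly:

```bash
# Create the backup directory and make it writable by the elasticsearch group (example path).
sudo mkdir -p /home/student/backups
sudo chgrp elasticsearch /home/student/backups
sudo chmod g+w /home/student/backups
```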

04:02.580 --> 04:06.170


And we need to restart Elasticsearch to pick up that configuration change we made.

04:06.180 --> 04:13.530


So let's say sudo systemctl stop elasticsearch.service.

04:19.250 --> 04:20.330


And we'll restart it.
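The restart amounts to:

```bash
# Restart the node so it picks up the new path.repo setting.
sudo systemctl stop elasticsearch.service
sudo systemctl start elasticsearch.service
```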

04:24.650 --> 04:24.980


Okay.

04:24.980 --> 04:30.590


So now we can register the new repository to Elasticsearch at the path we
configured, with a curl request.

04:31.110 --> 04:35.870


Put localhost:9200.

04:36.440 --> 04:40.940


Underscore snapshot slash backup dash repo backslash.

04:43.240 --> 04:43.670


Data.

04:43.750 --> 04:44.290


Raw.

04:45.470 --> 04:59.240


Quote, curly brace; the type will be filesystem and the settings will have a location
of /home/student/backups/

05:00.140 --> 05:05.570


backup-repo. Close everything out, and it looks like it took.
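Written out, the repository registration request looks roughly like this (the Content-Type header is added because Elasticsearch expects it for JSON bodies):

```bash
# Register a filesystem ("fs") snapshot repository at the configured path.
curl --request PUT 'localhost:9200/_snapshot/backup-repo' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "type": "fs",
    "settings": { "location": "/home/student/backups/backup-repo" }
  }'
```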

05:06.080 --> 05:10.430


Now we can initiate the snapshot process to do the backup with a curl request.

05:10.700 --> 05:14.540


Put localhost:9200 slash underscore snapshot.

05:17.080 --> 05:21.780


backup-repo, and we'll call it snapshot-1.
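The snapshot request itself is simply:

```bash
# Snapshot the cluster's indices into the backup-repo repository.
curl --request PUT 'localhost:9200/_snapshot/backup-repo/snapshot-1'
```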

05:24.440 --> 05:25.460


So it looks like that worked.

05:25.610 --> 05:29.910


We can check the status of that with a simple GET request, with curl.

05:30.620 --> 05:33.980


Get localhost:9200 slash underscore snapshot.

05:35.060 --> 05:35.900


Back up repo.

05:37.190 --> 05:38.030


Snapshot one.

05:39.800 --> 05:40.670


And we'll make it pretty.
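That status check is roughly:

```bash
# Check the snapshot; the "state" field should read SUCCESS once it finishes.
curl 'localhost:9200/_snapshot/backup-repo/snapshot-1?pretty'
```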

05:43.640 --> 05:44.420


Looks like it worked.

05:44.510 --> 05:48.830


It says the state was SUCCESS.

05:49.070 --> 05:49.580


All right, cool.

05:50.090 --> 05:50.600


Very good.

05:50.630 --> 05:54.230


Now that we have our data backed up, we can now proceed to nuke our cluster.

05:54.770 --> 05:55.730


So let's get started.

05:56.360 --> 05:58.130


Well, let's recap on the basics about logs.
05:58.400 --> 06:00.710
So we'll start by looking at the Elasticsearch logs.

06:01.160 --> 06:04.470


Their location will depend on the path.logs setting in your elasticsearch

06:04.670 --> 06:05.240


dot yml.

06:05.540 --> 06:10.640


By default they are found in /var/log/elasticsearch/, in a file named whatever your
cluster name is, dot log.

06:11.390 --> 06:15.680


So basic tailing commands come in handy to monitor the logs in real time.

06:15.680 --> 06:17.870


And so, say we want to keep an eye on these logs off to the side.

06:18.230 --> 06:23.030


I'm actually going to start a different terminal window here, so let's go ahead and
start a new Telnet

06:23.030 --> 06:23.690


client here.

06:31.940 --> 06:34.470


It would help if I typed in my password correctly.

06:34.490 --> 06:34.940


There we go.

06:36.320 --> 06:36.710


All right.

06:36.950 --> 06:38.870


And let's see where those logs live.

06:39.140 --> 06:41.690


Those are going to be in /var/log/elasticsearch.

06:44.250 --> 06:47.160


So our account has insufficient rights to actually read these logs.

06:47.250 --> 06:49.260


Now there are various options to solve this.

06:49.410 --> 06:54.450


For example, a valid group assignment for your Linux user; or one generally simpler
approach is to give

06:54.450 --> 06:57.570


the user sudo permission to run a shell as the elasticsearch user.

06:58.200 --> 07:02.130


We can do this by editing the sudoers file using visudo as the root user.

07:02.280 --> 07:04.290


So let's just say sudo visudo.

07:11.710 --> 07:13.420


And we will add the following line.

07:16.410 --> 07:17.460


We'll enter it at the bottom here.

07:19.170 --> 07:21.060


How about username?

07:21.840 --> 07:24.030


All equals parentheses.

07:24.090 --> 07:26.270


Elasticsearch parentheses.

07:27.120 --> 07:28.500


No password.

07:29.640 --> 07:31.710


ALL. That should do it.
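The sudoers entry being added looks roughly like this (the username is an example; use your own account):

```bash
sudo visudo
# Then append at the bottom:
# student ALL=(elasticsearch) NOPASSWD: ALL
```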

07:31.890 --> 07:32.820


So Control-O.

07:33.330 --> 07:34.110


Control X.

07:35.530 --> 07:39.250


So after we've done that, we can run the following command to launch a new shell as
the Elasticsearch

07:39.250 --> 07:42.430


user: sudo, dash s, dash u, elasticsearch.
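That spoken command most likely corresponds to:

```bash
# Launch a new shell running as the elasticsearch user.
sudo -s -u elasticsearch
```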

07:44.440 --> 07:44.810


Cool.

07:45.220 --> 07:46.360


So now we should have the permissions.

07:46.360 --> 07:48.970


We need to actually look at these logs, so let's try that again.

07:49.780 --> 07:50.740


cd /var

07:50.770 --> 07:52.000


/log/elasticsearch.

07:53.080 --> 07:53.750


That's better.

07:54.250 --> 07:59.080


And now we can do things like tail dash n 100 to look at the last 100 lines in
this log file.

07:59.560 --> 08:03.790


And our log file is actually elasticsearch.log, because we haven't changed the
cluster name.

08:04.420 --> 08:05.110


And there you have it.
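For example:

```bash
# Show the last 100 lines of the cluster log (default cluster name is "elasticsearch").
tail -n 100 /var/log/elasticsearch/elasticsearch.log
```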

08:05.770 --> 08:08.260


Or sometimes you just want to look for error messages, right?

08:08.260 --> 08:15.550


So for example, we could look at the last 500 log lines and pipe that into grep for
error and that

08:15.550 --> 08:16.670


would just show us any errors.
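Something along these lines:

```bash
# Filter the last 500 log lines down to error messages.
tail -n 500 /var/log/elasticsearch/elasticsearch.log | grep -i error
```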

08:17.140 --> 08:20.170


Fortunately, we don't have any because our cluster is healthy, so that's cool.

08:20.680 --> 08:25.690


And sometimes it can also be useful to grab a few surrounding log lines with the
context parameter because

08:25.690 --> 08:28.990


the messages and stack traces can be multi-line sometimes.

08:28.990 --> 08:34.870


So we could say, for example, cat elasticsearch.log, pipe to grep bootstrap.

08:36.260 --> 08:36.560


Dash.

08:36.560 --> 08:40.610


Dash context equals three to get the three surrounding lines for each hit there.

08:41.390 --> 08:45.530


So for example, here we have a bootstrap hit and the three lines before and after
it as well.
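That command is roughly:

```bash
# Print each "bootstrap" hit with three lines of context before and after it.
cat /var/log/elasticsearch/elasticsearch.log | grep --context=3 bootstrap
```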

08:46.040 --> 08:48.350


So those are some useful tricks for looking at the logs.

08:48.710 --> 08:49.220


All right.

08:49.220 --> 08:51.440


So let's start talking about bootstrap checks.

08:51.440 --> 08:53.150


We'll go back to our primary terminal here.

08:53.900 --> 08:58.580


Bootstrap checks are pre-flight validations performed during a node start, which
ensure that your node

08:58.580 --> 09:00.320


can reasonably perform its functions.

09:00.830 --> 09:03.950


There are two modes which determine the execution of bootstrap checks.

09:04.580 --> 09:10.190


Development mode is when you bind your node only to a loopback address localhost or
with an explicit

09:10.190 --> 09:12.860


discovery type of single dash node.

09:13.400 --> 09:18.050


No bootstrap checks are performed in development mode. Production mode, then, is
when you bind

09:18.050 --> 09:24.440


your node to a non-loopback address like 0.0.0.0, thus making it reachable by other
nodes.

09:24.800 --> 09:26.960


This is the mode where bootstrap checks are executed.

09:27.620 --> 09:31.520


Let's see them in action because when the checks don't pass, it can become tedious
work to find out

09:31.520 --> 09:32.300


what's going on.

09:33.680 --> 09:38.450


So one of the first system settings recommended by elastic is to disable heap
swapping.

09:39.110 --> 09:43.670


This makes sense because Elasticsearch is highly memory intensive and you don't
want your in-memory

09:43.670 --> 09:44.510


data swapped out to disk.

09:45.110 --> 09:46.370


There are two options for this.

09:46.670 --> 09:49.940


One is to remove swap files entirely or minimize swappiness.

09:50.540 --> 09:54.080


This is the preferred option but requires considerable intervention as the root
user.

09:54.650 --> 09:59.150


Or we can add the bootstrap.memory_lock parameter in elasticsearch.yml.

09:59.570 --> 10:01.080


So let's try that second option.

10:01.160 --> 10:09.470


Let's go ahead and open our main configuration file with sudo nano
/etc/elasticsearch/
10:09.650 --> 10:11.930
elasticsearch.yml.

10:13.780 --> 10:16.330


And we'll go ahead and find the bootstrap.memory_lock setting.

10:19.490 --> 10:21.740


And uncomment that so it's set to true.
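The setting being enabled is:

```bash
sudo nano /etc/elasticsearch/elasticsearch.yml
# Uncomment so it reads:
# bootstrap.memory_lock: true
```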

10:23.840 --> 10:26.960


Write that out and quit and let's go ahead and restart our service.

10:27.200 --> 10:30.830


So sudo systemctl stop elasticsearch.service.

10:32.880 --> 10:33.930


And let's restart it.

10:37.760 --> 10:42.200


And after a short wait, we should see some indication of what's happening.

10:44.650 --> 10:45.070


All right.

10:45.070 --> 10:47.800


So, yeah, we actually got an error as a result of doing that.

10:47.800 --> 10:50.110


So let's check our logs and find out what happened.

10:50.650 --> 10:54.010


So let's go spelunking through here and see what went wrong.

10:54.580 --> 10:57.970


Just got to hit the up arrow here to do a fresh tail of my log.

11:00.280 --> 11:01.360


And there we have it.

11:01.360 --> 11:06.100


So there's our error and it says bootstrap checks failed memory locking requested
for Elasticsearch

11:06.100 --> 11:06.520


process.

11:06.520 --> 11:07.930


But memory is not locked.

11:08.560 --> 11:10.160


But didn't we just lock it before?

11:10.840 --> 11:11.650


Well, not really.

11:11.650 --> 11:15.370


We just requested the lock, but it didn't actually get locked, so we hit the memory
lock
11:15.370 --> 11:16.360
bootstrap check here.

11:17.140 --> 11:21.490


Now, the easy way to fix this in our case is to allow locking in an override to
our systemd unit

11:21.490 --> 11:22.480


file like this.

11:22.810 --> 11:24.580


So let's go back to our other window here.

11:25.270 --> 11:27.430


Sudo systemctl.

11:28.030 --> 11:30.670


Edit elasticsearch dot service.

11:32.980 --> 11:36.880


And we're going to put in the following config parameter here, under Service:

11:39.640 --> 11:43.120


LimitMEMLOCK equals infinity.
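The override being created is roughly:

```bash
# Allow the elasticsearch process to lock memory via a systemd override.
sudo systemctl edit elasticsearch.service
# Then add:
# [Service]
# LimitMEMLOCK=infinity
```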

11:47.640 --> 11:48.120


All right.

11:48.270 --> 11:49.980


And let's try spinning that up again.

11:57.460 --> 11:59.170


And this time it should be okay.

12:03.170 --> 12:03.500


All right.

12:03.500 --> 12:04.400


Looks like success.

12:05.930 --> 12:06.380


Okay.

12:06.650 --> 12:08.390


So let's talk about heap settings next.

12:08.780 --> 12:13.310


Now, if you start playing with the JVM settings in the jvm.options file, which
you will likely

12:13.310 --> 12:17.450


need to do, because by default these settings are set too low for actual production
usage.

12:17.960 --> 12:20.480


You may face a similar problem as we just did.

12:21.320 --> 12:21.980


So how is that?
12:22.310 --> 12:27.170
Well, by setting the initial heap size lower than the max size, which is actually
quite usual in the

12:27.170 --> 12:27.890


world of Java.

12:28.490 --> 12:32.240


Let's open up that option file and lower the initial heap size to see what's going
to happen.

12:32.780 --> 12:38.540


So sudo nano /etc/elasticsearch/jvm.options.

12:38.840 --> 12:40.430


jvm.options.

12:43.740 --> 12:46.980


And let's go ahead and change these memory settings here.

12:49.180 --> 12:50.500


I'm going to comment out the original ones,

12:50.500 --> 12:59.380


so I can go back to them later, and we'll set some new ones: Xms 500 megabytes
and Xmx one

12:59.380 --> 12:59.890


gigabyte.
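The deliberately mismatched heap settings look like this:

```bash
sudo nano /etc/elasticsearch/jvm.options
# Comment out the originals and add (mismatched on purpose to trigger the check):
# -Xms500m
# -Xmx1g
```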

13:01.540 --> 13:01.810


All right.

13:01.810 --> 13:03.340


So we've lowered the initial heap size.

13:03.700 --> 13:05.110


Let's go ahead and save this setting.

13:07.470 --> 13:09.150


And we'll restart our service again.

13:11.920 --> 13:12.420


Stop it.

13:13.790 --> 13:14.630


And I'll start it.

13:17.170 --> 13:18.970


And we'll see what happens as it spins up.

13:23.220 --> 13:24.360


Well, looks like we had an error.

13:24.370 --> 13:26.800


So let's go back to our logs and see what's going on.

13:26.850 --> 13:30.750


So back to the other window, and I'll hit the up arrow just to tail the last hundred
lines again.

13:32.440 --> 13:32.920


All right.

13:32.930 --> 13:33.760


Well, there we have it.

13:34.060 --> 13:34.960


Error: bootstrap

13:34.960 --> 13:35.560


node validation

13:35.560 --> 13:37.060


exception, bootstrap checks failed.

13:37.060 --> 13:39.820


Initial heap size not equal to maximum heap size.

13:40.330 --> 13:42.910


So that's telling us pretty explicitly what the problem was there.

13:43.690 --> 13:47.800


Now, generally speaking, this problem is also related to memory locking, where the
need to increase

13:47.800 --> 13:51.400


the heap size during program operations may have undesired consequences.

13:52.120 --> 13:56.770


So remember to set those numbers to equal values and for the actual values, follow
the recommendations

13:56.770 --> 14:01.960


by elastic, which in short is lower than 32 gigabytes and up to half of the
available RAM memory.

14:02.080 --> 14:04.150


Let's go ahead and change those back before we forget.

14:08.670 --> 14:09.180


Yeah.

14:09.510 --> 14:15.270


So we'll set that back to one gig for both and I'll just use Control K to get rid
of those lines and

14:15.270 --> 14:18.600


Control-O to save and Control-X, so we should be back to the original options.

14:19.110 --> 14:20.970


Let's try starting it up again, just to be sure.

14:25.460 --> 14:25.790


All right.

14:25.790 --> 14:27.110


That time it started successfully.

14:28.190 --> 14:31.370


So let's talk about some other system checks you may want to perform when things go
wrong.

14:31.610 --> 14:36.290


There are many other bootstrap checks on the runtime platform and its settings,
including a file descriptors

14:36.290 --> 14:41.750


check, a maximum number of threads check, a maximum size virtual memory check, and
many others.

14:42.410 --> 14:46.280


You should definitely browse through their descriptions in the docs, but as we're
running the official

14:46.280 --> 14:51.170


Debian distribution that comes with a predefined systemd unit file, most of these
issues are resolved

14:51.170 --> 14:52.850


for us in the unit file, among others.

14:53.270 --> 14:56.630


We can check that unit file to see the individual parameters that get configured.

14:56.900 --> 14:58.760


Let's take a look at that unit file to see what's in it.

14:59.040 --> 15:02.630


We can say sudo cat /usr/lib/systemd/

15:02.630 --> 15:05.990


system/elasticsearch.service.
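That is:

```bash
# Review the packaged systemd unit file and its preconfigured limits.
sudo cat /usr/lib/systemd/system/elasticsearch.service
```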

15:07.900 --> 15:08.200


All right.

15:09.010 --> 15:12.250


So just take a look at the different things that you have at your disposal here.

15:12.400 --> 15:16.390


All sorts of things that could go wrong, but by default, they should be okay in our
installation.

15:18.160 --> 15:22.480


So just remember that if you run the Elasticsearch binary on your own, you will
need to take care of

15:22.480 --> 15:23.410


these settings as well.

15:24.730 --> 15:28.270


Now, the last check we'll run is the one that will carry us nicely to the next
section of the lesson

15:28.270 --> 15:29.270


dealing with clustering.

15:29.290 --> 15:33.790


But before we dive in, let's see what configuration parameters Elasticsearch checks
during

15:33.790 --> 15:36.550


its startup with the discovery configuration check.

15:37.270 --> 15:41.740


There are three key parameters which govern the cluster formation and discovery
process.

15:41.980 --> 15:43.870


Let's pull up our YAML file to take a look.

15:44.710 --> 15:51.080


Sudo nano /etc/elasticsearch/elasticsearch.yml.

15:52.690 --> 15:52.960


All right.

15:52.960 --> 15:57.550


So one is discovery.seed_hosts; it should be down here.

16:00.680 --> 16:00.930


Yep.

16:01.610 --> 16:06.200


Now, this is a list of ideally all the master eligible nodes in the cluster that we
want to join and

16:06.200 --> 16:07.820


draw the latest cluster state from.

16:08.330 --> 16:12.850


Now there's also a discovery.seed_providers setting that you could set here as well,

16:12.860 --> 16:16.760


and that would allow you to provide the seed hosts lists in the form of a file that
gets reloaded on

16:16.760 --> 16:20.840


any change instead of specifying it within the configuration file itself.

16:21.500 --> 16:24.530


Also, let's look at the cluster.initial_master_nodes setting here.

16:25.070 --> 16:29.540


This is a list of the node names, not hostnames for the very first master
elections.

16:30.230 --> 16:34.190


Until all of these join and vote, the cluster setup won't be completed.
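For reference, the cluster formation settings in elasticsearch.yml look roughly like this (the values here are purely illustrative):

```bash
sudo nano /etc/elasticsearch/elasticsearch.yml
# discovery.seed_hosts: ["host1:9300", "host2:9300"]
# discovery.seed_providers: file
# cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
```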

16:35.330 --> 16:39.470


But what if you don't want to form any cluster, but rather just want to run in a
small single node
16:39.470 --> 16:39.830
setup?

16:40.070 --> 16:43.130


Well, you might think you could just eliminate these settings in the YAML file.

16:44.150 --> 16:44.470


Right.

16:45.200 --> 16:46.040


But no, that won't work.

16:46.130 --> 16:50.180


After starting up, you would hit another bootstrap error, since at least one of
those parameters needs

16:50.180 --> 16:52.490


to be set to pass a bootstrap check.

16:52.970 --> 16:56.390


So we're going to go ahead and put those back because you can't actually get away
with that.

16:56.930 --> 17:00.680


So let's see why this is and dive deeper into troubleshooting the discovery
process.

17:01.310 --> 17:02.540


First, I'll exit out of here.

17:04.520 --> 17:06.560


And let's shut down our cluster before we forget.

17:12.900 --> 17:14.040


Just stop the service.

17:15.030 --> 17:15.420


All right.

17:16.470 --> 17:20.850


So after we've successfully passed the bootstrap checks and started up our node for
the first time,

17:20.850 --> 17:23.910


the next phase in its lifecycle is the discovery process.

17:24.540 --> 17:28.440


Now, to simulate the formation of a brand new cluster, we're going to need a clean
node.

17:28.710 --> 17:33.420


So we need to remove all the data of the node and thus lose all previous cluster
state information.

17:33.450 --> 17:35.580


That's why we backed everything up to a snapshot earlier.

17:36.120 --> 17:39.390


Now, remember, this is really just an experiment; in a real production setup,

17:39.690 --> 17:42.090


there would be very few reasons to do this.

17:42.480 --> 17:45.870


I'm going to go to this other window here where I'm logged in as the elasticsearch
user, because I'm

17:46.290 --> 17:48.090


going to need its permissions to do this stuff.

17:48.750 --> 17:52.500


rm -rf /var/lib/elasticsearch

17:53.980 --> 17:54.750


slash star.
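Run from the shell we opened as the elasticsearch user, that is (lab environment only; this destroys the node's data):

```bash
rm -rf /var/lib/elasticsearch/*
```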

17:56.100 --> 17:56.430


All right.

17:56.430 --> 17:58.320


We blew away our entire node there.

17:58.950 --> 18:04.020


So now let's imagine a situation where we already had a cluster and we just want
the node to join in.

18:04.650 --> 18:10.110


So we need to make sure the cluster name is correct and linked to some seed host
either by IP or hostname

18:10.110 --> 18:10.530


and port.

18:11.400 --> 18:14.160


So let's go ahead and open up our YAML file.

18:15.330 --> 18:17.520


We'll use vim because that's what's installed under this account.

18:18.030 --> 18:22.380


That's /etc/elasticsearch/elasticsearch.yml.

18:24.600 --> 18:26.910


All right, so we need to make sure that we have a cluster name.

18:28.860 --> 18:29.070


Hit

18:29.070 --> 18:31.140


I to go into insert mode, and now I can edit it.

18:32.740 --> 18:36.370


We'll change my-application to lecture-cluster.

18:36.730 --> 18:37.720


It would help if I typed it right.
18:40.190 --> 18:42.380
And we need to set our discovery.seed_hosts.

18:45.460 --> 18:46.010


Do.

18:52.170 --> 19:00.240


There they are, and we'll change that to 127.0.0.1:9301. Now this is just a
demonstration, so we're

19:00.240 --> 19:01.320


using a loopback address.

19:01.350 --> 19:06.420


Normally you put a hostname or an IP here and the actual transport port of one or
more of your nodes

19:06.420 --> 19:07.080


in the cluster.

19:08.820 --> 19:12.360


And just to force the failure that we're interested in, I'm going to comment out
this line for the

19:12.360 --> 19:15.600


initial master nodes, and that way it's not going to be able to reach the master.

19:15.630 --> 19:17.310


We'll see what happens when we hit that failure.

19:18.240 --> 19:21.300


Let's go ahead and hit escape colon WQ.

19:21.630 --> 19:23.280


Exclamation point to write and quit.

19:24.090 --> 19:25.590


And now let's start up our service.

19:35.260 --> 19:35.560


All right.

19:35.560 --> 19:36.870


It looks like it started successfully.

19:36.880 --> 19:39.850


Let's check our root endpoint to see if it really is running.

19:39.850 --> 19:40.300


Okay.

19:40.450 --> 19:43.120


Curl localhost:9200.
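That is:

```bash
# Hit the root endpoint; the cluster UUID shows as "_na_" until a cluster has formed.
curl localhost:9200
```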

19:46.550 --> 19:47.330


All right.

19:47.600 --> 19:54.110


So we did get a nice response with various details here, but something is missing:
the cluster UUID.

19:55.010 --> 19:57.200


This means that our cluster is not actually formed.

19:57.360 --> 20:02.090


And we can confirm this by checking the cluster state with the cluster health API.

20:02.670 --> 20:03.590


Let's say curl.

20:04.340 --> 20:07.070


localhost:9200 slash underscore.

20:07.070 --> 20:08.330


Cluster slash health.
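That is:

```bash
# This will hang for about 30 seconds and then fail with master_not_discovered_exception.
curl localhost:9200/_cluster/health
```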

20:12.240 --> 20:15.090


And after about 30 seconds of waiting, we'll get an exception.

20:19.230 --> 20:20.340


Indeed we did: master

20:20.340 --> 20:21.450


not discovered exception.

20:21.870 --> 20:26.760


All right, let's tail our logs and see that the node didn't discover any master and
will continue the

20:26.760 --> 20:27.740


discovery process.

20:27.750 --> 20:30.780


So let's check our logs and see what happened.

20:31.560 --> 20:35.850


Let's look at the past 500 lines here, and now that's lecture cluster

20:35.850 --> 20:37.920


dot log this time, because we changed the cluster name.

20:40.220 --> 20:42.620


That's the relevant message here, master not discovered.

20:42.770 --> 20:46.880


This node has not previously joined a bootstrapped cluster, and cluster.initial_master
nodes is empty on

20:46.880 --> 20:49.280


this node so it's going to continue.

20:49.280 --> 20:53.030


discovery on 127.0.0.1:9301 from the hosts providers.

20:53.660 --> 20:58.970


But yeah, that's basically telling us that we had a problem actually electing a
master because we didn't
20:58.970 --> 21:01.220
list any master nodes and it couldn't find any. Makes sense.

21:01.220 --> 21:01.460


Right?

21:02.750 --> 21:07.070


So these issues are going to be very similar when forming a new cluster and we can
simulate that in

21:07.070 --> 21:09.560


our environment with the cluster.initial_master_nodes setting.

21:09.830 --> 21:12.290


So again, let's make sure there's no previous data on our node.

21:12.680 --> 21:13.970


We'll go ahead and blow that away.

21:14.360 --> 21:15.980


Let's stop the service before we forget, huh?

21:16.610 --> 21:18.590


So back to this other side here.

21:20.700 --> 21:21.540


Stop service.

21:22.560 --> 21:23.670


Now we're going to blow away.

21:24.150 --> 21:25.530


/var/lib/elasticsearch again.

21:26.800 --> 21:27.370


Like so.

21:29.690 --> 21:30.080


All right.

21:30.080 --> 21:32.720


And now we can edit our YAML file again.

21:37.490 --> 21:41.240


And now we're going to go back to make sure our cluster name is still a lecture
cluster.

21:41.240 --> 21:43.010


And now we're going to set our initial master nodes.

21:45.560 --> 21:47.660


So it was complaining before that we had an empty list there.

21:47.660 --> 21:54.800


So let's give it a list. I'll go to insert mode, and now we can edit this line,
uncomment it, and

21:54.800 --> 21:59.120


we'll set it to the list of node-1, node-2, and node-3.
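So the relevant lines in elasticsearch.yml now look roughly like this (node-2 and node-3 are deliberately fictitious):

```bash
# cluster.name: lecture-cluster
# discovery.seed_hosts: ["127.0.0.1:9301"]
# cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
```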
22:04.580 --> 22:06.080
So let's go ahead and hit escape.

22:06.320 --> 22:11.720


Colon WQ exclamation point to write and quit, and we'll restart the node again.

22:13.340 --> 22:14.240


Start the service.

22:16.120 --> 22:17.460


And see what happens this time.

22:20.290 --> 22:20.610


All right.

22:20.620 --> 22:22.030


Looks like it went okay.

22:22.030 --> 22:23.800


But again, let's check and make sure.

22:23.830 --> 22:24.490


Let's hit the root

22:24.490 --> 22:24.940


endpoint.

22:27.750 --> 22:29.880


Still we have no cluster UUID.

22:29.910 --> 22:32.430


So we didn't actually form a cluster; that failed.

22:32.460 --> 22:33.750


And if we do a health check again.

22:37.560 --> 22:39.330


We'll have to wait 30 seconds for that to time out.

22:41.460 --> 22:41.820


All right.

22:41.820 --> 22:42.630


Same deal: master

22:42.630 --> 22:43.320


not discovered.

22:43.380 --> 22:45.690


Let's check the logs again to see what happened this time.

22:46.320 --> 22:48.540


So we'll just tail those last 500 lines again.

22:49.640 --> 22:54.200


And we're going to look for something about discovering master eligible nodes.
22:58.080 --> 22:59.490
Probably should have grepped for WARN, huh?

23:05.380 --> 23:06.160


This looks interesting.

23:07.480 --> 23:07.900


All right.

23:07.900 --> 23:09.640


Master not discovered yet.

23:10.660 --> 23:12.370


This node must discover master-eligible nodes

23:12.370 --> 23:14.660


node-1, node-2, and node-3, to bootstrap a cluster.

23:14.680 --> 23:15.850


We only discovered node one.

23:16.630 --> 23:19.270


So, yeah, you can't just specify nodes that don't exist there.

23:20.260 --> 23:24.310


All right, so we have performed some experiments here, so you'll need to use your
imagination to complete

23:24.310 --> 23:24.880


the picture.

23:25.000 --> 23:29.080


Now, in a real production scenario, there are many reasons why this problem often
appears.

23:29.650 --> 23:34.030


Since we're dealing with a distributed system, many external factors such as
network communication

23:34.030 --> 23:36.940


come into play and may cause the nodes to be unable to reach each other.

23:37.000 --> 23:40.640


So the problem might not just be that I listed a bunch of fictitious hosts there.

23:40.780 --> 23:44.860


It might be that those are valid hosts, but they can't be reached for some reason.
To resolve these

23:44.860 --> 23:45.280


issues,

23:45.370 --> 23:46.870


you need to triple-check all your settings.

23:47.230 --> 23:49.420


So again, let's go back into vim.

23:50.440 --> 23:54.190


We need to make sure of the cluster name, so all the nodes are joining or forming the
right cluster.

23:54.910 --> 24:00.520


The node names: a mistype in the node names can cause invalidity for the master
elections. And the seed

24:00.520 --> 24:04.120


hostnames, IPs, and ports down here somewhere.

24:05.970 --> 24:10.570


Got to make sure those all have valid seed hosts linked and that the ports are
actually the configured

24:10.570 --> 24:10.960


ones.

24:11.710 --> 24:14.680


We need to check connectivity between the nodes and the firewall settings.

24:14.800 --> 24:19.600


So use telnet or similar tools to inspect your network and make sure it's open for
communication between

24:19.600 --> 24:22.390


the nodes, the transport layer and the ports especially.

24:23.170 --> 24:24.250


Also check SSL

24:24.250 --> 24:29.080


and TLS. Communication encryption is a vast topic and we're not going to touch that
here, but it's a

24:29.080 --> 24:33.790


usual source of troubles: invalid certificates and untrusted certificate
authorities and things like

24:33.790 --> 24:34.090


that.

24:34.840 --> 24:38.320


Also be aware that there are special requirements on the certs when encrypting node-

24:38.320 --> 24:39.370


to-node communication.

24:40.900 --> 24:44.800


All right, the last thing we're going to explore is the relationship between the
shard allocation and

24:44.800 --> 24:47.440


cluster state as these two things are tightly related.

24:48.010 --> 24:52.240


But first, we need to change the elasticsearch.yml configuration to let our
node successfully

24:52.240 --> 24:53.470


form a single node cluster.

24:53.950 --> 24:59.770


So back in our configuration file here, let's just set the initial master as the
node itself and start

24:59.770 --> 25:00.340


the service.

25:01.580 --> 25:03.170


So let's take out node-2 and node-

25:03.170 --> 25:05.300


3, and just hit I to go to insert mode.

25:06.410 --> 25:07.910


Forgot I was in vim there for a second.

25:09.420 --> 25:09.890


Escape.

25:10.140 --> 25:12.060


Colon WQ exclamation point.

25:12.270 --> 25:12.980


We wrote that out.

25:12.990 --> 25:15.600


So now let's restart our service yet again.

25:17.540 --> 25:17.990


Stop it.

25:19.590 --> 25:20.040


Start it.

25:22.680 --> 25:23.190


All right.

25:23.430 --> 25:25.710


And again, we'll query the cluster health API.

25:25.810 --> 25:26.610


Let's see what happened.

25:30.150 --> 25:32.590


So we can see the cluster status is, in fact, green.

25:32.640 --> 25:33.210


That's good.

25:33.930 --> 25:35.520


So what does cluster status mean?

25:35.730 --> 25:39.180


Well, it actually reflects the worst state of any of the indices that we have in
our cluster.

25:39.900 --> 25:41.220


The different options include red.

25:41.730 --> 25:44.850


That means one or more shards of the index is not assigned in the cluster.

25:45.360 --> 25:49.920


This can be caused by various issues at the cluster level, like disjoint nodes or
problems with disks

25:49.920 --> 25:50.670


and things like that.

25:51.450 --> 25:56.460


Generally, the red status marks very serious issues, so be prepared for some
potential data loss.

25:57.150 --> 25:58.200


It could also be yellow.

25:58.230 --> 26:00.750


In that case, the primary data are not yet impacted.

26:01.080 --> 26:04.500


All the primary shards are okay, but some replica shards are not assigned.

26:05.130 --> 26:09.540


Like, for example, replicas won't be allocated on the same node as the primary
shard by design.

26:10.290 --> 26:15.420


This status marks a risk of losing data and green means all shards are well
allocated.

26:15.840 --> 26:20.160


However, it doesn't mean that the data is safely replicated; a single-node
cluster with a

26:20.160 --> 26:22.560


single-shard index would be green as well.

26:23.280 --> 26:26.670


So now let's create an index with one primary shard and one replica.

26:27.510 --> 26:30.210


We'll do that with a curl request.

26:30.450 --> 26:42.000


Put localhost:9200 slash test (we'll call the index test), backslash, dash dash data
raw, curly, with the

26:42.000 --> 26:42.900


following settings.

26:43.620 --> 26:55.590


Curly bracket: the number of shards will be set to one and the number of
replicas will be set to

26:55.590 --> 26:56.220


one as well.
26:56.820 --> 26:57.630
Close everything out.
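Put together, that request is roughly:

```bash
# Create a "test" index with one primary shard and one replica.
curl --request PUT 'localhost:9200/test' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }'
```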

26:58.350 --> 27:03.570


All right, so suddenly our cluster will turn yellow because our worst performing
index, the only one

27:03.570 --> 27:05.040


we have, is also yellow.

27:05.310 --> 27:06.900


Let's check our health again.

27:08.370 --> 27:08.630


Yep.

27:08.640 --> 27:09.270


Now we're yellow.

27:10.350 --> 27:15.510


Now you can also check the shards assignment with the Cat Shards API and see what's
going on there.

27:15.540 --> 27:20.550


So let's say curl localhost 9200 slash underscore cat slash shards.

27:20.910 --> 27:23.620


Question mark v. Aha.
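That is:

```bash
# List shards; the replica shows as UNASSIGNED on a single-node cluster.
curl 'localhost:9200/_cat/shards?v'
```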

27:24.660 --> 27:27.180


So we can see that we have unassigned shards here.

27:28.470 --> 27:32.190


Or if you want more descriptive information, you can use the cluster allocation

27:32.190 --> 27:36.750


Explain API, which provides an explanation as to why the individual shards were not
allocated.

27:36.960 --> 27:43.680


To do that we'll say curl localhost:9200, underscore cluster, allocation.

27:44.660 --> 27:46.260


Explain pretty.
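That is:

```bash
# Ask Elasticsearch to explain why a shard is unassigned.
curl 'localhost:9200/_cluster/allocation/explain?pretty'
```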

27:49.390 --> 27:52.810


And that tells you very explicitly what's going on in our case, as I mentioned
before.

27:53.140 --> 27:57.670


The reason is that allocation of the replica to the same node is disallowed, since it

27:57.670 --> 28:01.090


makes no sense from a resiliency perspective; you wouldn't have a replica on the
same node.

28:01.090 --> 28:01.930


That's just silly.

28:02.740 --> 28:03.790


So how would you resolve this?

28:03.820 --> 28:04.930


Well, we have two options.

28:05.350 --> 28:08.620


One would be to remove the replica shard, which is not a real solution.

28:08.620 --> 28:10.870


But if you just need the green status, it will work out.

28:11.440 --> 28:14.530


Or you could add another node on which the shards could be reallocated.

28:14.830 --> 28:16.210


So let's take that second route.

28:18.120 --> 28:23.280


So to simulate the following failures, I actually have two different nodes running
on the same host

28:23.280 --> 28:27.510


here, and setting that up is kind of involved and we're going to do that later in
the course as we

28:27.510 --> 28:28.410


go into failover.

28:28.830 --> 28:32.490


So for now, I just want you to watch and not actually try to follow on yourself.

28:32.670 --> 28:34.350


So I've already done some of the grunt work here.

28:34.650 --> 28:39.750


Basically, you need to set up a separate systemd unit file for the second node
and a separate configuration

28:39.750 --> 28:40.800


and stuff like that.

28:40.800 --> 28:43.480


So just watch from this point on.

28:43.500 --> 28:43.800


Okay.

28:44.640 --> 28:48.570


So anyway, let's start by reviewing the main configuration file of that second node
that I've already

28:48.570 --> 28:52.560


set up, and we'll ensure that it will join the same cluster as our existing node.

28:52.560 --> 28:59.130


So let's say sudo nano /etc/elasticsearch-node-2, which is where I put that.
29:05.690 --> 29:07.460
All right, so we have the same cluster name.

29:08.030 --> 29:10.130


We're calling our node here Node two.

29:10.790 --> 29:14.540


And we can see that our seed hosts is set to a loopback address, hopefully.

29:17.870 --> 29:18.090


Yep.

29:18.470 --> 29:21.950


And we can see that our master nodes consist of node-1 and node-2.

29:23.110 --> 29:29.770


Let's go ahead and exit out of here and start that second node: sudo systemctl
start elasticsearch

29:30.460 --> 29:32.940


dash node-2 dot service.
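Roughly (the unit name assumes the second node's service was installed as elasticsearch-node-2):

```bash
sudo systemctl start elasticsearch-node-2.service
```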

29:35.310 --> 29:35.640


Okay.

29:35.640 --> 29:38.790


So at this point I have started up a second node on the same VM.

29:38.790 --> 29:42.960


Again, there's quite a bit of configuration behind making that happen, so just
watch for this part

29:42.960 --> 29:43.200


of it.

29:43.860 --> 29:47.220


So now that we have a second node spun up, we should be back in a green status.

29:47.220 --> 29:48.120


So let's check.

29:48.240 --> 29:50.010


Let's say curl dash dash silent.

29:50.970 --> 29:51.780


Local host.

29:53.260 --> 29:56.080


9200 slash underscore cluster slash.

29:56.080 --> 29:56.500


Health.

29:57.190 --> 29:57.580


Pretty.

29:59.190 --> 30:00.780


And we'll just grep for the status line.
30:02.380 --> 30:03.610
And our status is green.
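That check is:

```bash
# Confirm the second node brought the cluster back to green.
curl --silent 'localhost:9200/_cluster/health?pretty' | grep status
```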

30:03.640 --> 30:04.000


Great.

30:04.840 --> 30:08.740


Okay, so we've resolved our issue and the replica shards were automatically
reallocated.

30:08.740 --> 30:09.160


Perfect.

30:10.120 --> 30:11.590


So let's continue with this example.

30:11.590 --> 30:16.660


To simulate the red cluster state, let's start by removing the index and creating
it again, but this

30:16.660 --> 30:19.390


time with only two primary shards and no replica.

30:19.390 --> 30:21.610


And we'll quickly see why this is a bad idea.

30:21.970 --> 30:31.200


So first of all, delete the one that we have with curl dash dash request DELETE
localhost:9200 slash

30:31.210 --> 30:40.750


test, and I will recreate it with a curl request: PUT localhost:9200 slash test,
backslash.

30:42.720 --> 30:52.380


And dash dash data dash raw, quote, curly, settings: we'll set the number of shards

30:54.510 --> 31:03.660


to one... to two, rather, because we have two nodes to work with, and the number
of replicas to

31:03.660 --> 31:04.140


zero.
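Together, that's roughly:

```bash
# Delete the test index, then recreate it with two primaries and no replicas (risky on purpose).
curl --request DELETE 'localhost:9200/test'
curl --request PUT 'localhost:9200/test' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "settings": { "number_of_shards": 2, "number_of_replicas": 0 } }'
```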

31:05.160 --> 31:06.750


So this seems like a pretty bad idea.

31:06.780 --> 31:09.900


You know, we have our shard split across two nodes, but no backups anywhere.

31:11.720 --> 31:13.100


All right, but so far, so good.

31:13.130 --> 31:14.840


You know, it's at least storing it.

31:15.170 --> 31:17.750


Let's check the shard assignment to see what's actually going on here.
31:17.960 --> 31:24.980
Curl localhost:9200, underscore cat slash shards, verbose.

31:26.420 --> 31:26.750


Okay.

31:27.110 --> 31:31.280


So we can see that each primary shard is on a different node, which follows the
standard allocation

31:31.280 --> 31:33.860


rules set at the cluster level and at the index level.

31:34.340 --> 31:35.930


And you likely know where we're heading.

31:36.650 --> 31:41.120


So imagine the situation where some network issue emerges and your cluster splits
up, resulting in

31:41.120 --> 31:46.070


disabled node communication, or even worse, some disk malfunctions leading to the
improper functioning

31:46.070 --> 31:46.550


of a node.

31:47.240 --> 31:50.090


Now, the easiest way to simulate this is to just stop one of our nodes.

31:50.360 --> 31:51.680


So let's go ahead and kill node

31:51.680 --> 31:56.480


2 with a sudo /bin/systemctl

31:57.590 --> 31:59.750


stop elasticsearch dash node

31:59.750 --> 32:00.980


2 dot service.
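That is (unit name as assumed above):

```bash
# Stop the second node to simulate a failure.
sudo /bin/systemctl stop elasticsearch-node-2.service
```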

32:03.020 --> 32:04.010


And down it goes.

32:04.430 --> 32:07.230


So now if we check our status again...

32:10.570 --> 32:11.860


We are now in red status.

32:12.700 --> 32:13.390


That's a bad thing.

32:13.840 --> 32:17.110


So now let's check the explain API to learn more about what's going on.

32:17.230 --> 32:19.270


Curl local host.

32:19.990 --> 32:24.040


9200 slash underscore cluster slash allocation.

32:25.640 --> 32:27.080


Explain pretty.

32:29.940 --> 32:30.500


All right.

32:30.510 --> 32:35.220


So we cannot allocate it because a previous copy of the primary shard existed but
can no longer be found

32:35.220 --> 32:36.210


on the nodes in the cluster.

32:36.390 --> 32:37.630


Well, that tells you what's going on.

32:37.650 --> 32:38.610


It's pretty well described.

32:39.030 --> 32:40.680


A node left as we have turned it off.

32:41.040 --> 32:46.320


But in the real world that has various potential causes and no valid shard copy can
be found in the

32:46.320 --> 32:48.780


cluster, in which case we're missing data.

32:49.440 --> 32:54.060


Unfortunately, there's no easy solution to this scenario, as we do not have any
replicas and there's

32:54.060 --> 32:55.500


no way we could remake our data.

32:56.520 --> 33:00.450


So firstly, if you are dealing with some network problems, try to thoroughly
inspect what could go

33:00.450 --> 33:06.210


wrong like a misconfiguration of firewalls and inspect it as a priority, since data
cannot consistently

33:06.210 --> 33:07.380


be indexed in this state.

33:08.280 --> 33:12.690


Now, depending on the document routing, many indexing requests can be pointed
toward the missing shard

33:12.690 --> 33:13.770


and end up timing out.
33:14.460 --> 33:17.050
For example, let's try to insert a document and see what happens.

33:17.070 --> 33:18.200


Curl request.

33:19.560 --> 33:29.190


Post localhost:9200 slash test slash underscore doc, data raw, and we'll just say a
message.

33:31.360 --> 33:31.930


It's data.
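The indexing attempt is roughly (the document body is just an example):

```bash
# This request will hang and eventually time out because its target shard is missing.
curl --request POST 'localhost:9200/test/_doc' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "message": "its data" }'
```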

33:35.720 --> 33:37.310


And this should lead to an exception.

33:39.600 --> 33:42.780


And after about 30 seconds or so, it finally timed out on me.

33:43.560 --> 33:48.480


Now, secondly, if no possible solution was found, the only option left to get the
index to work properly

33:48.480 --> 33:49.920


may be to allocate a new shard.

33:50.400 --> 33:54.990


But be aware that even if the lost node comes back afterwards, the new shard
will just overwrite

33:54.990 --> 33:56.700


it because it is in a newer state.

33:57.480 --> 34:00.720


Now we can allocate a new shard with the cluster reroute API.

34:00.840 --> 34:05.700


So here we will allocate one for the test index on node-1, which operates
correctly.

34:06.210 --> 34:08.580


Note that we have to explicitly accept data loss.

34:08.700 --> 34:16.770


So curl request post local host 9200 slash underscore cluster slash reroute.

34:18.210 --> 34:19.620


And we want pretty results.

34:20.770 --> 34:22.990


Backslash, dash dash data raw.

34:24.330 --> 34:26.580


Quote, curly, commands.

34:27.850 --> 34:29.170


Actually, there's going to be a square bracket.

34:29.170 --> 34:29.830


We have a list of them.

34:31.610 --> 34:32.840


And curly brackets.

34:35.370 --> 34:38.400


Allocate empty primary.

34:41.990 --> 34:43.040


Index test.

34:44.650 --> 34:45.190


Shard.

34:45.820 --> 34:52.990


This one; node is node-1, and accept data loss

34:54.950 --> 34:55.610


will be true.

34:58.230 --> 34:59.340


Close everything out.

35:01.240 --> 35:01.930


I think that's right.
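Assembled, the reroute command looks roughly like this (the shard number is whichever one the explain API reported as unassigned):

```bash
# Force-allocate an empty primary on the surviving node; accept_data_loss acknowledges
# that the shard's previous contents are gone for good.
curl --request POST 'localhost:9200/_cluster/reroute?pretty' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "commands": [
      { "allocate_empty_primary": {
          "index": "test", "shard": 1, "node": "node-1", "accept_data_loss": true } }
    ]
  }'
```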

35:05.310 --> 35:05.730


All right.

35:05.940 --> 35:08.880


And afterwards, we should no longer experience timeouts during indexing.

35:10.170 --> 35:11.130


All right, so we're done.

35:11.130 --> 35:15.570


But we just need to restore everything from our backup now, because we did do some
pretty invasive

35:15.570 --> 35:16.570


stuff to our index here.

35:16.590 --> 35:20.580


So we're back at the point where you should be following along if you were
following along earlier.

35:20.610 --> 35:20.880


Okay.

35:20.880 --> 35:24.780


We need to restore from that back up and make sure we're not left with any
lingering issues that we

35:24.780 --> 35:25.590


might have introduced.

35:25.800 --> 35:29.520


So we're going to restore all of our original indices that we backed up earlier.
35:29.940 --> 35:31.860
Before we can do that, we need to do some cleaning up.

35:32.490 --> 35:36.390


So first, we need to make sure that the repository path is registered again in the
elasticsearch.yml,

35:36.990 --> 35:39.120


as we've made some changes to it during the exercise.

35:39.750 --> 35:43.380


So let's go ahead and reference our stored config file that we squirreled away at
the start of the lesson,

35:44.040 --> 35:45.090


and we'll put that back.

35:47.010 --> 35:48.960


So let's see, go back to our home directory.

35:48.960 --> 35:49.860


I think that's where we put it.

35:50.760 --> 35:50.970


Yep.

35:50.970 --> 35:52.320


There's elasticsearch.yml.

35:52.320 --> 35:54.180


So let's go ahead and move that back into position.

35:54.270 --> 35:55.530


Sudo mv elasticsearch

35:55.530 --> 35:58.590


dot yml to /etc/elasticsearch.
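That is:

```bash
# Put the saved config back in place.
sudo mv ~/elasticsearch.yml /etc/elasticsearch/elasticsearch.yml
```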

36:01.080 --> 36:03.480


All right, let's double check that it's there and looks correct.

36:07.440 --> 36:09.860


Well, I'm the wrong user; sudo,

36:09.900 --> 36:13.320


let's see, and we'll just go ahead and edit it directly at /etc/

36:13.830 --> 36:14.700


Elasticsearch.

36:15.180 --> 36:18.210


elasticsearch dot

36:18.240 --> 36:18.510


yml.

36:19.800 --> 36:20.790


Make sure things look normal.
36:21.480 --> 36:21.690
All right.

36:21.690 --> 36:22.920


Things are back to how we started.

36:23.190 --> 36:24.300


We have node one.

36:25.400 --> 36:28.370


We still have the path to repo set to home student backups.

36:28.370 --> 36:32.930


That's very important so we can restore that backup. Memory lock is commented out
again.

36:33.860 --> 36:36.800


Everything looks like it's back to default settings, so that's good.

36:37.730 --> 36:38.020


All right.

36:38.030 --> 36:38.420


Looks good.

36:38.420 --> 36:38.870


Looks good.

36:39.260 --> 36:43.430


Now, we do need to make sure that Elasticsearch has permission to read that
configuration file we just

36:43.430 --> 36:44.150


restored first.

36:44.150 --> 36:51.590


So let's go to the /etc folder and do a sudo ls, that's

36:51.610 --> 36:53.960


dash la, elasticsearch.

36:55.640 --> 36:56.310


See what we have.

36:56.820 --> 36:57.070


Yeah.

36:57.090 --> 36:58.980


So we can see that it's owned by the root group.

36:59.010 --> 36:59.820


We need to change that.

36:59.970 --> 37:01.910


So, sudo chgrp.

37:02.250 --> 37:03.150


Elasticsearch.
37:04.650 --> 37:05.780
Elasticsearch.

37:09.670 --> 37:10.810


Elasticsearch.

37:12.400 --> 37:13.210


Dot yml.

37:15.130 --> 37:15.970


Check that again.

37:16.480 --> 37:17.290


All right, that looks better.

37:17.290 --> 37:19.510


So now we should be able to restart our main node.

37:26.660 --> 37:27.200


Like so.

37:31.050 --> 37:31.320


Right.

37:32.250 --> 37:36.390


So now we can reregister our repository again to make sure it's ready to provide
the backup data.

37:36.780 --> 37:47.130


Curl request PUT localhost:9200 slash underscore snapshot slash backup
dash repo, backslash

37:47.790 --> 37:48.420


data raw.

37:51.080 --> 37:52.640


Type is filesystem.

37:54.150 --> 37:54.870


Settings.

37:58.170 --> 37:58.830


Location.

38:00.120 --> 38:02.640


/home/student/backups.

38:03.060 --> 38:03.780


Backup dash.

38:03.780 --> 38:04.170


Repo.

38:07.900 --> 38:08.350


All right.

38:09.010 --> 38:13.390


And we can check the available snapshots in the repository with a simple cat
request to our backup
38:13.390 --> 38:13.780
repo.

38:13.780 --> 38:16.630


And we should see our snapshot one waiting to be restored.

38:17.260 --> 38:18.310


Curl Local Host.

38:18.340 --> 38:21.370


Colon 9200, underscore cat, snapshots.

38:22.600 --> 38:23.320


Slash backup.

38:23.320 --> 38:23.920


Dash Repo.

38:26.060 --> 38:26.360


All right.

38:27.290 --> 38:28.190


It's a success.

38:28.580 --> 38:32.600


Now, to prevent any writes during the restore process, we need to make sure that
all of our indices

38:32.600 --> 38:33.170


are closed.

38:33.380 --> 38:38.300


However, as of Elasticsearch 8, they've made it not that easy.

38:38.330 --> 38:43.040


You actually need to disable a safety feature that prevents you from closing all of
your indices at

38:43.040 --> 38:43.730


once.

38:44.240 --> 38:48.050


So let's fire up our editor and edit our elasticsearch.yml file.

38:48.440 --> 38:54.380


And we're looking for the setting action.destructive_requires_name; uncomment

38:54.380 --> 38:56.420


that and set it to false.
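The setting is:

```bash
sudo nano /etc/elasticsearch/elasticsearch.yml
# Uncomment and set:
# action.destructive_requires_name: false
```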

39:04.890 --> 39:10.320


And after that, we're going to stop and restart the service again to pick that
change up.

39:25.200 --> 39:34.050


Curl request POST localhost:9200 slash underscore all slash underscore close.
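That is:

```bash
# Close all indices so nothing writes to them during the restore.
curl --request POST 'localhost:9200/_all/_close'
```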
39:36.330 --> 39:39.240
And finally we can restore our backup with a curl request.

39:40.050 --> 39:43.710


Post localhost:9200, underscore snapshot.

39:45.240 --> 39:45.780


Backup.

39:46.290 --> 39:46.620


Dash.

39:46.620 --> 39:48.930


Repo slash snapshot.

39:48.960 --> 39:49.650


Dash one.

39:51.860 --> 39:53.390


Slash underscore restore.
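Put together:

```bash
# Restore everything from the snapshot taken at the start of the lesson.
curl --request POST 'localhost:9200/_snapshot/backup-repo/snapshot-1/_restore'
```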

39:56.540 --> 39:58.430


And it took. So after a few seconds,

39:58.430 --> 40:01.490


if we check our indices, we should see all the original data back in place.

40:02.210 --> 40:05.930


Curl localhost:9200 slash underscore cat slash indices.

40:08.510 --> 40:10.880


And there's our original Shakespeare index, for example.

40:10.880 --> 40:11.930


So, yeah.

40:12.080 --> 40:12.950


Things have been restored.

40:13.460 --> 40:13.760


Great.

40:13.760 --> 40:18.260


So now that you're armed with foundational knowledge and various commands on
troubleshooting your Elasticsearch

40:18.260 --> 40:22.940


cluster, the last piece of advice is to stay positive even when things are not
working out.

40:23.480 --> 40:26.690


It's part and parcel of being an Elasticsearch engineer.
