
In this first lecture of our section on Elasticsearch (ES), we are going to start with a
little bit of theory. And what do I mean by theory? We are talking about the underlying
concepts of ES.

After all, ES is a database, but it’s a little bit different from the databases you’ve
probably dealt with before. So we’re going to talk about ES in comparison to a relational
database, the kind you would find in something like Microsoft SQL Server, MySQL,
or something of that nature.

And we are going to talk about the concepts of documents, types, and indexes.

If you’re not familiar with those terms don’t worry! You will be by the end of
these lessons.

We’ll talk a little bit of theory in this lesson and then we’ll get right on into the
practical application and practical usage of ES.

You’re probably familiar with relational databases.

If you’ve used MySQL or Microsoft SQL Server, you’ve used a relational database.

A relational database is very easy to understand because it’s very similar to a simple
Excel spreadsheet.

The database is our high-level construct, and a table within that database is kind of
like a spreadsheet.

You can think of an Excel document that has multiple worksheets as a database, in
some ways, with each worksheet playing the role of a table.

This is a loose way of referring to this, but [we have some] similar concepts here.

So on the screen here I have a very simple database shown.

It’s a database with a single table and it’s a players’ table, showing in this case
NBA players.

Now within this we have the concepts of columns and rows.

We have a player ID column, a last name column, a first name column, and so on.
That’s the type of data that exists in the table, and we have the rows, which are the
actual data itself.

The first row is player ID 101, for LeBron James, a small forward, number 23, who
plays for Cleveland. So that is our first row of data.

And of course in this case we have four players, so we have four rows of data
conforming to the structure of this table, where the columns define the types of data.

And we could have all kinds of tables in this database. If it wasn’t just NBA
players, let’s say it was an athletes’ database, we could have an NBA players’
table, an NFL players’ table, and so on.

So there’s a lot of ways this can be expanded.

Now this is the simplest form of relational database, but it can also look like this.

This is the same data, but it is broken up into multiple tables.

And in this case along with this players table we have a Teams Table and a
Positions Table.

And this allows us to maintain relationships between certain entities, makes our
database more expandable, and makes it easier for the applications accessing this
data to manipulate it.

Now if you’ve done any database programming or any web application programming,
you’ve probably dealt with these sorts of things. You see here that instead of the
position of each player being listed in the players’ table, we have a Position ID. The
Position ID relates to the positions table, where that field’s value is available, so we
can link them up.

So you see for LeBron James his Position ID is 3, and if we look at the positions
table, Position ID 3 matches up to the Small Forward (SF) position.

So we know that that data exists that way.
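
If you want to see that lookup in action, here is a minimal sketch using Python’s built-in
sqlite3 module. The table and column names are my own assumptions based on the slide
description, not anything ES-specific, and only the LeBron James row from the lecture
is loaded.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE players (
        player_id   INTEGER PRIMARY KEY,
        last_name   TEXT,
        first_name  TEXT,
        position_id INTEGER REFERENCES positions(position_id)
    );
    INSERT INTO positions VALUES (3, 'Small Forward');
    INSERT INTO players VALUES (101, 'James', 'LeBron', 3);
""")

# Follow the Position ID from the players table into the positions table.
row = conn.execute("""
    SELECT p.first_name, p.last_name, pos.name
    FROM players AS p
    JOIN positions AS pos ON pos.position_id = p.position_id
    WHERE p.player_id = 101
""").fetchone()

print(row)
```

The players table only stores the number 3, and the join is what turns it back into
the position name.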

This is done for a lot of reasons: it makes it easier to reference data and insert data,
and it also decreases the size of your database.

So instead of having SF in every Small Forward’s position in the Players’ Table,
you just have the number 3.

In this case we’re only saving one character, one byte, but it makes a lot more sense
when we look at the Teams Table. Those team names are a lot longer.

So if we have a database with a million athletes in it, everyone to ever play in the NBA,
then instead of having Minnesota for everyone who played for Minnesota, we
actually have just the number 1. That’s more efficient on space, which is
certainly a concern when you think about larger databases.

So this is another way of viewing the same data, but in a more relational way,
as you would see it in a relational database, which again you may be somewhat
familiar with.

This is what most people are familiar with when they think of databases or ways of
storing information.

Now, ES is not a relational database, and that’s important.

I showed relational databases because I want to discuss the comparison between
a relational database and ES.

A relational database has a hierarchy of structure where you have a database at the
top layer, a table below that, and then you have rows and columns.

ES, on the other hand, has an index. So an index is equivalent to a database. And
then beneath that it has types. Types are somewhat equivalent to a table, and
then below that you have documents and fields, which are similar to rows and
columns.

I don’t want to say this is a perfect comparison, because if it were, these things
would probably not be so different.

But they are very different. So you can think of it loosely in this structure, where
you have a similar hierarchy (index, type, document, field), but it’s going to work a
little bit differently than a relational database.

So for the next several slides and the next few minutes I’m going to go through
these different concepts with you.

I’m going to talk about indexes, types, documents, and fields as a unique way of
storing information in the database, that’s quite different than a relational database.

The first concept I want to talk about is the index.

And as I mentioned before, an index is analogous to a database. That is the highest
level of structure we’re dealing with within the database.

So just as MySQL can host multiple databases, ES can host multiple indexes for
all sorts of purposes.

Now the key thing to keep in mind here is that ES is really all about search, as
you can read in the name.

ES is designed for people and its users to be able to search through data incredibly
quickly.

And all the design decisions for ES are made with that in mind. This is
accomplished in a number of ways.

One of those ways is the fact that when data comes into ES, it’s a little bit
slower to process and get indexed. The process of putting data into an index is
called ‘indexing the data’, and that process is a little slower.

You’re paying a performance cost at the front end, at data ingestion, so that
when you go to search for it, it searches much faster.

So what’s happening in this process when you’re indexing is you’re creating the
structure of an index.

ES is built on Lucene, which uses inverted indexes.

And you can think of an index like a book index, right? We all think of it like
that, and here’s a picture of the index from my book, Practical Packet Analysis,
third edition.

Now, within the index, let’s say I want to learn more about application
baselines.

We’ll look at the index, and within the A section we’ll look over at the
right-hand column, go a couple of notches down, and we’ll see that “application
baselines” are discussed on pages 244 and 245.

Now, that’s important because then I can just flip to page 244 and start reading
the content I want to review.

If a book didn’t have an index, the only way for me to find this data is to start with
page 1 and start skimming through pages looking at the section headers and
quickly looking to try to identify the keywords that might lead me to find
“application baselines”.

And in this case it’s on page 244, and if I had to go through 244 pages to find what
I’m looking for, chances are I would have tossed the book in the trash long
before then.

So indexes are incredibly important in books, and similarly they’re important
for data that you want to search through, so that our search is
incredibly fast and we get the data we want in a timely manner.

Now with regard to how an inverted index works, I want to show you a quick
example. We’ve got two sentences here.

The first is “The quick brown fox jumped over the lazy dog”. The second is
“Quick brown foxes leap over lazy dogs in the summer”.

Each of these pieces of data would be considered a document. So we have this
concept of documents in ES, which we’ll talk about more in a few minutes. We
have Document 1 and Document 2.

Now when we index these things, what ES is going to do is break them up into
terms.

And I’ve added just a few of those terms here, not all of them, for the sake of
fitting them on the slide: ‘Quick’ with capital Q, ‘The’ with a capital T, ‘brown’,
‘dog’, and ‘dogs’.

So in this case you can see that only document 2 has ‘Quick’ in it with a capital Q,
only document 1 has ‘The’ with a capital T, and both documents have the word
‘brown’ in all lower case.

So if I search for the word ‘brown’ ES is going to look at its index first and it’s
going to see that the word ‘brown’ appears in document 1 and document 2 and it’s
going to provide both of those results to me.

And I’m going to be able to search quickly and get that result quickly, because I’m
searching the index rather than searching the raw data.
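
To make that concrete, here is a small sketch of an inverted index in plain Python. This
is only an illustration of the idea under simple assumptions (splitting on whitespace,
no normalization); it is not how ES or Lucene actually implement it, and the function
names are my own.

```python
from collections import defaultdict

# The two example sentences from the slide, keyed by document number.
documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in the summer",
}


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenizer, case preserved
            index[term].add(doc_id)
    return index


index = build_inverted_index(documents)


def search(term):
    """Look the term up in the index instead of scanning the raw text."""
    return sorted(index.get(term, set()))


print(search("brown"))  # [1, 2]
print(search("Quick"))  # [2]
print(search("The"))    # [1]
```

The search never touches the sentences themselves; it is a single lookup in the index,
which is the whole point.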

Then one more core thing about ES and how it indexes is that it can actually index
a little bit smarter than you might think.

Consider this table! Here we have ‘brown’ again, which appears in both documents
in all lower case.

But look at the second row, look at the word ‘quick’. In the first example, ‘Quick’
appeared only in document 2, where it was capitalized, and in document 1 it
was all lower case.

But here we see it’s noted in the index as being part of documents 1 and 2.

What’s happened here is that a specific filter was applied in ES that said: hey! I’m
going to index this based upon all lower case.

So even though in the second document the first letter of ‘Quick’ is uppercase,
it’s going to treat that the same as ‘quick’ in the first document.

So if I do a search for the word ‘quick’, I’m actually going to get both
document 1 and document 2 back.

The same thing is happening for the word ‘fox’. ‘Fox’ appears just as it is in
document 1, but in document 2 it’s actually ‘foxes’ (it’s plural).

But what ES is saying is, someone who searches for ‘fox’ probably wants anything
that matches ‘foxes’ too.
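
Here is a rough sketch of what that kind of filtering does to the index. The `normalize`
function is a hypothetical stand-in I made up for illustration: it lowercases each term
and strips a plural ending in a very crude way. Real ES analyzers are configurable and
far more careful than this.

```python
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in the summer",
}


def normalize(term):
    """Lowercase the term and crudely strip a plural ending (illustration only)."""
    term = term.lower()
    if term.endswith("es") and len(term) > 4:
        return term[:-2]
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


# Build the index from normalized terms rather than the raw words.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[normalize(term)].add(doc_id)


def search(term):
    """Normalize the query the same way the indexed terms were normalized."""
    return sorted(index.get(normalize(term), set()))


print(search("quick"))  # [1, 2]
print(search("fox"))    # [1, 2]
print(search("Dogs"))   # [1, 2]
```

Notice that the query goes through the same normalization as the data. That is also
exactly why a search can no longer tell ‘Quick’ from ‘quick’.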

And in this case it would match both of those and attach a relevancy score to each.

So whereas searching for ‘fox’ and matching ‘fox’ is going to have
a hundred percent relevancy, searching for ‘fox’ and matching ‘foxes’ may
have a lower relevancy, maybe 80 percent or something like that.

So when ES provides search results back to you, it’s providing things that are
perfectly relevant and things that are only somewhat relevant, and often
sorting those by relevancy score.
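
As a toy illustration of that sorting, the sketch below gives an exact word match a score
of 1.0 and a match that only works after crude plural stripping a score of 0.8. Those
two numbers are just the illustrative percentages from a moment ago, and the `stem`
and `score` functions are hypothetical. This is not how ES computes relevancy; its
real scoring takes much more into account.

```python
documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in the summer",
}


def stem(term):
    """Crude plural stripping, for illustration only."""
    if term.endswith("es") and len(term) > 4:
        return term[:-2]
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


def score(query, text):
    """Toy score: 1.0 for an exact word match, 0.8 for a stem-only match."""
    query = query.lower()
    best = 0.0
    for word in text.lower().split():
        if word == query:
            best = max(best, 1.0)
        elif stem(word) == stem(query):
            best = max(best, 0.8)
    return best


def ranked_search(query):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(ranked_search("fox"))  # [(1, 1.0), (2, 0.8)]
```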

We don’t always think about relevance in search, but relevancy is certainly an
important facet of it.

So that’s one way in which ES can help us with our search.

Now that’s a benefit in some ways, and in some ways it can hurt us, especially
with things like the capitalization issue.

A lot of the time, with the indicators we are going to be using in DFIR or in NSM,
we’re very concerned about capitalization, because they appear capitalized or not
capitalized based upon particular facets, and we only care about certain
combinations.

So we have to be careful about how that’s used, and of course we’ll go through
several examples of when and how to configure that, both when indexing data and
when searching data.

But just to reiterate: indexes are the highest-level constructs you’re going to deal
with, we can have multiple indexes, which are equivalent to databases to some
degree, and we’ll go through the process of creating, modifying, and deleting
indexes when we get on over to the hands-on work.

The next important construct of ES that I want to talk about is actually the
lowest-level construct, and that is a DOCUMENT.

And a document, you can think of it kind of like a database record, if you’re
thinking about it in terms of a relational database.

And documents are stored and represented in JSON. JSON is great because it’s
lightweight, it’s really easily parsable, and it’s language independent.

JavaScript is in the name, JavaScript Object Notation, but you don’t
actually have to have JavaScript to interact with it.

It can be interacted with by any tool because it’s a nicely structured format, and
generally I think it’s a lot more preferable than something like XML, which a lot of
folks have used in the past.

So JSON’s really nicely structured. I have an example of that here, and really
JSON may look intimidating at first, but there are really only a couple of rules to
follow that can help you understand the structure.

Namely, the first is that data is in ‘name value’ pairs, as you see here at the top.
We have field, value, field, value, so we have this concept of a field name and then
a value that goes along with it.

Data is separated by commas. So you see we have field, value, comma, and then
another field, value, comma, so we’re separating those groupings with commas.

Curly braces hold objects. So in this case we have one object here at a high level
that holds all this data. You see where the curly braces start and end.

And then we have square brackets that hold arrays. We have the array name, a
square bracket, and within that we have another object.

So there’s an object inside of this array and you could of course have multiple
objects inside of that array.

And that’s really it. With me going through it quickly, that may be a little
intimidating. But that’s really it. There are only those few rules, and if you
remember those few things, data is in ‘name value’ pairs, data is separated by
commas, curly braces hold objects, and square brackets hold arrays, everything
works itself out.

We’ve got this made nice and pretty, where we’ve got it tabbed out well. You
don’t actually have to do that. We’re going to do that here because it’s
good practice and it makes it a lot more readable.

But that’s really the example of how JSON works and how documents or data are
represented in ES.
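
If you want to check those rules for yourself, here is a short sketch using Python’s
built-in json module. The field and array names are placeholders I made up to mirror
the generic slide, not real data. It shows that objects come back as dictionaries, arrays
come back as lists, and that the tabbed-out version and a compact one-line version
carry exactly the same data.

```python
import json

# Tabbed-out JSON: name/value pairs, commas, one object, one array.
pretty = """
{
    "field1": "value1",
    "field2": "value2",
    "array_name": [
        {
            "field3": "value3"
        }
    ]
}
"""

# The same data with no indentation at all.
compact = '{"field1":"value1","field2":"value2","array_name":[{"field3":"value3"}]}'

doc = json.loads(pretty)

print(type(doc).__name__)                # dict: curly braces hold an object
print(type(doc["array_name"]).__name__)  # list: square brackets hold an array
print(doc["array_name"][0]["field3"])    # value3
print(json.loads(compact) == doc)        # True: whitespace does not matter
```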

Let’s now look at a more specific example with real data. Here’s one
that’s based upon the database example we looked at earlier where we were talking
about athletes.

In this case I have two athletes, so I have two separate documents. They’re on the
screen as two separate documents, which I’ve represented with two
separate JSON objects.

The first one begins with the curly brace and several name value pairs. So we have
last name ‘James’, first name ‘LeBron’, position ‘Small Forward’, number ‘23’,
and team ‘Cleveland’.

So we have all that same data we had represented earlier, but it’s represented as a
single JSON object between these curly braces. And of course we have commas
separating each Name Value pair.

Now below that we have another document, another JSON object here. In this
object we have all the same data, but I’ve added a little bit of additional data too.

So I put current team there, and notice the current team for him is retired, because
Shaquille O'Neal is retired. And then we have an array, and the array is called past
teams, and within that array I have another object that contains the teams
Shaquille O'Neal played for previously.

So we have team 1 is Orlando, team 2 is LA, and team 3 is Miami. Shaquille
O'Neal played for more teams than that, but I wanted to keep this short and
sweet, so I just put those few in there.
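
Here is roughly what those two documents look like if you build them in Python and
print them as JSON. The exact key names are my assumption, since the slide isn’t
reproduced here, and for the second document I’ve only included the fields I
mentioned out loud.

```python
import json

# First document: five simple name/value pairs.
lebron = {
    "last_name": "James",
    "first_name": "LeBron",
    "position": "Small Forward",
    "number": 23,
    "team": "Cleveland",
}

# Second document: adds a current team and an array holding one object.
shaq = {
    "last_name": "O'Neal",
    "first_name": "Shaquille",
    "current_team": "Retired",
    "past_teams": [
        {"team_1": "Orlando", "team_2": "LA", "team_3": "Miami"}
    ],
}

# Each dictionary serializes to one JSON object, which is one document.
for document in (lebron, shaq):
    print(json.dumps(document, indent=2))
```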

So this is an example of JSON actually in practice.

JSON’s incredibly important. A lot of the things we do, especially as we’re learning
and interacting with ES, will be based on JSON documents. So you’re going to get
plenty of opportunity to practice this and get comfortable creating JSON data and
formatting and structuring it well.

So don’t worry if it doesn’t click perfectly right now, but spend at least a little time
looking at some examples of JSON. We have some here and there’s certainly going
to be more in the course.

Maybe just Google JSON and look at a few different examples and think about
how you might create your own structured data within it.

Think about baseball players. We’re looking at basketball players here.

How would you create a record for a baseball player using JSON data? What
information would be important to have in there? How would you structure it?
What would go into an array versus what would exist just in a simple Name Value
pair?

Think about those things and think a little bit about JSON, because it’s going to be
incredibly important as we move forward.

But even if it doesn’t make full sense to you now, don’t worry! It will as we go
through some of these hands-on exercises.

So we’ve discussed indexes, types, documents, and fields. Now let’s put the
whole picture together and walk through an example we’ve already touched on a
little bit earlier.

Now here’s the example. You see the JSON on the right, and we see we’re working
again within an athletes index, and we have a couple of different types.

So our index is athletes, which means we’re going to be storing data about all
sorts of athletes across many different sports.

We’ll start with the type of basketball. We want to start indexing basketball
players, so we define the ‘Basketball’ type. From there, the documents that we
actually put into that index and type are the athletes themselves.

So we’re talking basketball, so LeBron James, Steph Curry, Karl-Anthony Towns
(actual basketball players), each one is going to exist as a document with the
structure we have here.

And those documents each contain an object, which has several fields and values,
including last name, first name, position, number, and team.

So we have the LeBron James example over on the right. Now, we also want to
index, or add to our data source, football players.

So in this case we’ll define another type for football players, and we’ll index
multiple documents, one for each athlete. So you see on the right we have a
document for Dak Prescott, quarterback for the Dallas Cowboys, number 4, who
plays on offense. So we have that example there.
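
To picture the whole hierarchy at once, here is a sketch that models it with plain
Python dictionaries and lists: one index, two types, and documents made of fields.
This is only an in-memory picture of the structure I’ve been describing. It does not
call ES at all, the key names and the `find` helper are my own assumptions, and
newer Elasticsearch releases have removed types, so treat it as a model of this lesson
rather than of a current API.

```python
# Index -> type -> list of documents -> fields, mirroring the slide.
athletes_index = {
    "basketball": [
        {
            "last_name": "James",
            "first_name": "LeBron",
            "position": "Small Forward",
            "number": 23,
            "team": "Cleveland",
        },
    ],
    "football": [
        {
            "last_name": "Prescott",
            "first_name": "Dak",
            "position": "Quarterback",
            "number": 4,
            "team": "Dallas Cowboys",
            "side": "Offense",
        },
    ],
}


def find(index, doc_type, field, value):
    """Return the documents of one type whose field equals the given value."""
    return [doc for doc in index.get(doc_type, []) if doc.get(field) == value]


print(sorted(athletes_index))  # ['basketball', 'football']
print(find(athletes_index, "football", "position", "Quarterback")[0]["last_name"])
```

Adding hockey or soccer players would just mean adding another key alongside
basketball and football.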

Now, those are the only two types we have listed. We could certainly have more of
those. We could have a type for hockey players, for soccer players, it’s very
extendable in that regard.

And of course when we design our data structure we want that to be extendable.

ES can trick you a bit, because you’re allowed to put data into it without a lot of
structure and it’ll just kind of do its own thing, but if you want to be able to quickly
search and iterate and program around your data to do cool things, you really need
to have it well defined.

A lot of people nowadays are really interested in using NoSQL-type databases like
MongoDB, because you don’t have to know your data as well. You can kind of cram
arbitrary data into them.

You can do that in ES as well, but it’s certainly not something I would recommend. A
lot of those databases tend to be excuses for people not to understand their data,
and as DFIR and NSM analysts, knowing your data is one of the most important
things you have in your repertoire of tools.

So we really want to know our data well, and we’re going to focus in this course
on defining it and structuring it really well as we go through that part of the course.

So let’s summarize what we’ve learned in this lesson:

First of all, ES is built for fast search and relies on indexes to facilitate that.

The ES data model is very different from relational databases and uses an inverted
index to store and reference its information, so that we can have this desirable goal
of fast search and also take advantage of things like relevancy in our search, where
we get documents that are somewhat related to our search but maybe not
matching exactly.

The main components of the ES data structure are indexes, types, documents, and
names/values. We talked about all of those and went through a few different
examples.

And documents in ES rely on JSON-structured data. We looked at a lot of JSON
here, and we’re certainly going to look at a lot more JSON as we go forward.

So hopefully this makes sense as an overarching discussion of the components of
ES and how it works.

If you want to go a little bit further you can look at the ES documentation which
covers a lot of the same topics here but in much greater depth.

We’re not going to make you a database expert in this course, and we don’t expect
you to be one. This is a practical usage course, so we didn’t get into a lot of detailed
depth. There’s a lot of computer science talk you can get into, discussing inverted
indexes and database structures and things like that.

We didn’t do all that here, but if you want to get into that, the ES documentation is
a great place to start.

So we’re going to let these concepts lie here, and we’re going to go ahead next
into the practical usage of some of these things by actually indexing data into
ES and starting to manipulate it.
