
Class 4 – Indexing and Aggregation
AGENDA
• Creating different kinds of indexes
• Creating geospatial indexes
• Listing the indexes
• Modifying the indexes
• Dropping the indexes
• Updating Multiple Documents
• Aggregation tools in MongoDB
• MapReduce concept in MongoDB
Your disk has your data files and your journal files.
When you start up mongod, it maps your data files to a shared view.
Basically, the operating system says: "Okay, your data file is 2,000 bytes on disk. I'll
map that to memory addresses 1,000,000-1,002,000. So, if you read the memory at
address 1,000,042, you'll be getting the 42nd byte of the file."

The data won't necessarily be loaded until you actually access that memory.
This memory is still backed by the file: if you make changes in memory, the
operating system will flush these changes to the underlying file.

This is basically how mongod works without journaling: it asks the operating system
to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private
view. Incidentally, this is why enabling journaling doubles the amount of virtual
memory mongod uses.
The private view is not connected to the data file, so the operating system cannot
flush any changes from the private view to disk.

Now, when you do a write, mongod writes it to the private view.

mongod will then write this change to the journal file, creating a little
description of which bytes in which file changed.

MongoDB writes to the journal files roughly every 100 milliseconds.


The journal appends each change description it gets.
At this point, the write is safe: if mongod crashes, the journal can replay the change,
even though it hasn't made it to the data file yet.

The journal will then replay this change on the shared view.
Finally, at a glacial speed compared to everything else, the shared view will be flushed
to disk. By default, mongod requests that the OS do this every 60 seconds.

The last step is that mongod remaps the private view to the shared view.
This prevents the private view from getting too "dirty" (accumulating too many
changes relative to the shared view it was mapped from).
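The sequence above can be sketched as a toy write-ahead log in plain JavaScript (all names are invented for illustration; this is not MongoDB's actual implementation):

```javascript
// Toy model of the journaled write path: private view -> journal -> shared view.
const privateView = { balance: 100 };  // writes land here first
const sharedView = { balance: 100 };   // memory-mapped to the data file
const journal = [];                    // append-only change descriptions

function write(field, value) {
  privateView[field] = value;          // 1. write to the private view
  journal.push({ field, value });      // 2. describe the change in the journal
}

// The journal is later replayed onto the shared view, which the OS
// eventually flushes to the data file on disk.
function replayJournal(view, log) {
  for (const entry of log) view[entry.field] = entry.value;
}

write('balance', 250);
// Even if mongod crashed at this point, replaying the journal on startup
// would recover the write:
replayJournal(sharedView, journal);
```

The key property: the write is durable as soon as its description is in the journal, long before the (slow) flush of the shared view to disk.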
WiredTiger:
For the data files, WiredTiger creates checkpoints (i.e. writes the snapshot data to
disk) at intervals of 60 seconds or 2 gigabytes of journal data, whichever occurs first.

• Because MongoDB uses a journal file size limit of 100 MB, WiredTiger creates a new
journal file approximately every 100 MB of data. When WiredTiger creates a new
journal file, WiredTiger syncs the previous journal file.

• If the write operation includes a write concern of j:true, WiredTiger forces a sync on
commit of that operation as well as anything that has happened before.
MongoDB
Introduction to Indexing
 Indexes support the efficient execution of queries in MongoDB.

 Without indexes MongoDB must perform a collection scan, i.e. scan every document
in a collection

 If an appropriate index exists for a query, MongoDB can use the index to limit the
number of documents it must inspect

 Indexes provide high performance read operations for frequently used queries

 Indexes are special data structures that store a small portion of the collection’s
data set in an easy-to-traverse form

 Fundamentally, indexes in MongoDB are similar to indexes in other database
systems
Different Index Types
 MongoDB provides a number of different index types to support specific types of
data and queries.

 Default _id

 Single Field

 Compound Index

 Multikey Index

 Geospatial Index

 Text Indexes

 Hashed Indexes
Index Properties

 Unique Indexes

 Sparse Indexes

 TTL indexes
Single Field Index
 MongoDB provides complete support for indexes on any field in a collection of
documents

 By default, all collections have an index on the _id field

 Applications and users may add additional indexes to support important queries and
operations.

 The following command creates an index on the name field for the users collection

db.users.createIndex( { "name" : 1 } )
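A minimal sketch of why an index limits the documents inspected, using a plain JavaScript Map as a stand-in for the { name: 1 } index (collection and field names are examples):

```javascript
// Sample collection.
const users = [
  { _id: 1, name: 'Alice' }, { _id: 2, name: 'Bob' },
  { _id: 3, name: 'Carol' }, { _id: 4, name: 'Bob' },
];

// Collection scan: every document must be inspected.
function collectionScan(coll, name) {
  let inspected = 0;
  const out = coll.filter(d => (inspected++, d.name === name));
  return { out, inspected };
}

// A { name: 1 } index, modeled as a value -> [_id] map built once up front.
const nameIndex = new Map();
for (const d of users) {
  if (!nameIndex.has(d.name)) nameIndex.set(d.name, []);
  nameIndex.get(d.name).push(d._id);
}

// Indexed find: only the documents referenced by the index entry are touched.
function indexedFind(coll, name) {
  const ids = nameIndex.get(name) || [];
  return { out: ids.map(id => coll.find(d => d._id === id)), inspected: ids.length };
}
```

For { name: "Bob" }, the scan inspects all 4 documents while the indexed lookup inspects only the 2 matches.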
Index on embedded field and document
 MongoDB provides complete support for indexes on any field in a collection of
documents
 You can create indexes on fields within embedded documents, just as you can index
top-level fields in documents.
{ "_id" : 3, "item" : "Book", "available" : true, "soldQty" : 144821, "category" : "NoSQL",
"details" : { "ISDN" : "1234", "publisher" : "XYZ Company" }, "onlineSale" : true
}
 To create an index on the ISDN field:
db.items.createIndex( { "details.ISDN": 1 } )
 To create an index on the embedded document "details":
db.items.createIndex( { "details": 1 } )
Compound Indexes
 MongoDB supports compound indexes, where a single index structure holds
references to multiple fields within a collection’s documents.

 Compound indexes can support queries that match on multiple fields.


db.products.createIndex( { "item": 1, "stock": 1 } )
 The order of the fields in a compound index is very important.
 The index sorts documents first by the values of the item field and, within each
value of the item field, by values of the stock field
 MongoDB imposes a limit of 31 fields for any compound index
Index Prefixes
 Index prefixes are the beginning subsets of indexed fields. For example, consider the
following compound index:
{ "item": 1, "available": 1, "soldQty": 1 }
 The index has the following index prefixes:
{ item: 1 }
{ item: 1, available: 1 }
{ item: 1, available: 1, soldQty: 1 }
 For a compound index, MongoDB can use the index to support queries on the index
prefixes
 the item field,
 the item field and the available field,
 the item field and the available field and the soldQty field
 A query on the item and soldQty fields can also use the index (through the item
prefix); however, the index would not be as efficient in supporting that query as
would be an index on only item and soldQty
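The prefix rule can be sketched in plain JavaScript. This is a simplified model: it checks only exact prefix matches, while MongoDB can additionally fall back to a shorter prefix (such as item alone for a query on item and soldQty), just less efficiently:

```javascript
// Compound index key pattern, in order (field names are examples).
const indexKeys = ['item', 'available', 'soldQty'];

// All beginning subsets of the indexed fields.
function indexPrefixes(keys) {
  return keys.map((_, i) => keys.slice(0, i + 1));
}

// A query is fully covered by a prefix when its fields are exactly some
// prefix of the index (order inside the query document does not matter).
function matchesAPrefix(keys, queryFields) {
  return indexPrefixes(keys).some(p =>
    p.length === queryFields.length && p.every(k => queryFields.includes(k)));
}
```

So { item } and { item, available } match prefixes, { available } alone does not, and { item, soldQty } is not a prefix (MongoDB would fall back to the item prefix for it).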
Sort Operation
 In MongoDB, sort operations can obtain the sort order by retrieving documents
based on the ordering in an index
 Sort operations that use an index often have better performance than those that do
not use an index
 If the query planner cannot obtain the sort order from an index, it will sort the
results in memory
 Sort operations that do not use an index will abort when they use 32 megabytes of
memory.

Sort on Multiple Fields


 We can specify a sort on all the keys of the index or on a subset
 The sort keys must be listed in the same order as they appear in the index

For example, an index key pattern { a: 1, b: 1 } can support a sort on { a: 1, b: 1 } but not
on { b: 1, a: 1 }
Memory allocation for Indexes
 For the fastest processing, ensure that your indexes fit entirely in RAM so that the
system can avoid reading the index from disk.

 To check the size of your indexes, use the db.collection.totalIndexSize()

 If you have and use multiple collections, you must consider the size of all indexes on
all collections
Multikey Indexes
 To index a field that holds an array value, MongoDB creates an index key for each
element in the array
 These multikey indexes support efficient queries against array fields
 Multikey indexes can be constructed over arrays that hold both scalar values (e.g.
strings, numbers) and nested documents.
db.coll.createIndex( { <field>: < 1 or -1 > } )
 MongoDB automatically creates a multikey index if any indexed field is an array; you
do not need to explicitly specify the multikey type
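A toy illustration of the index keys a multikey index would produce: one index entry per array element, all pointing back at the same document (field names are examples):

```javascript
// Emit the index entries for one document under a { <field>: 1 } index.
// If the field holds an array, each element gets its own entry.
function indexEntriesFor(doc, field) {
  const v = doc[field];
  const values = Array.isArray(v) ? v : [v];   // scalar -> single entry
  return values.map(value => ({ value, _id: doc._id }));
}

const entries = indexEntriesFor({ _id: 1, tags: ['red', 'blank', 'plain'] }, 'tags');
// -> three index entries, all referencing _id 1
```

This is why a query like { tags: "blank" } can be answered efficiently: the index entry for "blank" leads straight to the document.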
Compound Multikey Indexes
 For a compound multikey index, each indexed document can have at most one
indexed field whose value is an array

 As such, you cannot create a compound multikey index if more than one
to-be-indexed field of a document is an array.
For example, consider a collection that contains the following document:
{ _id: 1, product_id: [ 1, 2 ], retail_id: [ 100, 200 ], category: "AB - both arrays" }

 You cannot create a compound multikey index { product_id: 1, retail_id: 1 } on the
collection since both the product_id and retail_id fields are arrays.

 The index of a shard key cannot be a multikey index.

 Hashed indexes cannot be multikey.


Hashed Indexes
 Hashed indexes maintain entries with hashes of the values of the indexed field.

 The hashing function collapses embedded documents and computes the hash for
the entire value but does not support multi-key (i.e. arrays) indexes

 MongoDB can use the hashed index to support equality queries

 Hashed indexes do not support range queries

 Hashed indexes support sharding a collection using a hashed shard key

 Using a hashed shard key to shard a collection ensures a more even distribution of
data.
db.items.createIndex( { item: "hashed" } )
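A toy sketch of why hashed indexes support equality but not range queries (the hash function here is illustrative, not MongoDB's):

```javascript
// Illustrative djb2-style hash of a value's JSON representation.
function toyHash(v) {
  const s = JSON.stringify(v);
  let h = 5381;
  for (let i = 0; i < s.length; i++) h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  return h;
}

// The hashed index stores hash(value) -> _id, not the value itself.
const docs = [{ _id: 1, item: 'ab' }, { _id: 2, item: 'cd' }];
const hashedIndex = new Map(docs.map(d => [toyHash(d.item), d._id]));

// Equality query: hash the query value and look it up directly.
const match = hashedIndex.get(toyHash('cd'));

// A range query such as { item: { $gt: 'ab' } } cannot use this index:
// hashing scrambles ordering, so adjacent values get scattered hashes.
```

Deterministic hashing is also what makes hashed shard keys work: the same value always routes to the same chunk, while the scattering evens out the distribution.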
Time to Live (TTL) Indexes
 TTL indexes are special single-field indexes that MongoDB can use to automatically
remove documents from a collection after a certain amount of time.
 It is useful for certain types of data such as machine generated event data, logs, and
session information that only need to persist in a database for a finite amount of
time.

 Use the db.collection.createIndex() method with the expireAfterSeconds option on a
field whose value is either a date or an array that contains date values
db.eventlog.createIndex( { "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 } )

 If the indexed field in a document is not a date or an array that holds a date value(s),
the document will not expire.

 On replica sets, the TTL background thread only deletes documents on the primary.
Secondary members replicate deletion operations from the primary.

 TTL indexes are single-field indexes. Compound indexes do not support TTL
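The TTL background pass can be sketched as a simple filter (illustrative names; in MongoDB the deletion runs server-side on the primary):

```javascript
// Keep only documents whose indexed date is newer than expireAfterSeconds.
// Documents whose field is not a date never expire.
function ttlPass(coll, field, expireAfterSeconds, now = Date.now()) {
  return coll.filter(doc => {
    const v = doc[field];
    if (!(v instanceof Date)) return true;                // not a date: keep
    return now - v.getTime() < expireAfterSeconds * 1000; // expired: drop
  });
}

const now = Date.now();
const eventlog = [
  { _id: 1, lastModifiedDate: new Date(now - 7200 * 1000) }, // 2 hours old
  { _id: 2, lastModifiedDate: new Date(now - 60 * 1000) },   // 1 minute old
  { _id: 3, lastModifiedDate: 'not a date' },                // never expires
];
const kept = ttlPass(eventlog, 'lastModifiedDate', 3600, now);
```

With expireAfterSeconds of 3600, the two-hour-old document is removed and the other two remain.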
Unique Indexes
 A unique index causes MongoDB to reject all documents that contain a duplicate
value for the indexed field
db.items.createIndex( { "item": 1 }, { unique: true } )

 If you use the unique constraint on a compound index, then MongoDB will enforce
uniqueness on the combination of values rather than the individual value for any or
all values of the key.

 If a document does not have a value for the indexed field in a unique index, the
index will store a null value for this document

 Because of the unique constraint, MongoDB will only permit one document that
lacks the indexed field

 You can combine the unique constraint with the sparse index to filter these null
values from the unique index and avoid the error
Sparse Indexes
 Sparse indexes only contain entries for documents that have the indexed field, even
if the index field contains a null value.
 The index skips over any document that is missing the indexed field.
 The index is “sparse” because it does not include all documents of a collection.
db.addresses.createIndex( { "xmpp_id": 1 }, { sparse: true } )
 If a sparse index would result in an incomplete result set for queries and sort
operations, MongoDB will not use that index unless a hint() explicitly specifies the
index.
db.items.find().sort( {soldQty: -1 } ).hint( {soldQty: 1 } )
 2d (geospatial) and text indexes are sparse by default

 Sparse and unique properties: an index that is both sparse and unique prevents the
collection from having documents with duplicate values for a field but allows
multiple documents that omit the key.
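A minimal sketch of the sparse-plus-unique behavior (the duplicate-key error text is only an approximation of MongoDB's E11000 message):

```javascript
// Build a sparse unique index over `field`: skip documents that omit the
// field entirely, reject any duplicate value among documents that have it.
function buildSparseUnique(coll, field) {
  const seen = new Map();
  for (const doc of coll) {
    if (!(field in doc)) continue;            // sparse: missing field is skipped
    if (seen.has(doc[field]))                 // unique: duplicates are rejected
      throw new Error(`E11000 duplicate key: ${doc[field]}`);
    seen.set(doc[field], doc._id);
  }
  return seen;
}

// Two documents omit the key entirely: allowed, index holds one entry.
const index = buildSparseUnique([{ _id: 1, x: 'a' }, { _id: 2 }, { _id: 3 }], 'x');
```

A non-sparse unique index would instead store null for the documents missing the field, so only one such document would be permitted.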
Text Indexes
 MongoDB provides text indexes to support text search of string content in
documents of a collection
 Text indexes can include any field whose value is a string or an array of string
elements
 To perform queries that access the text index, use the $text query operator.
 To index a field that contains a string or an array of string elements, include the field
and specify the string literal "text" in the index document
db.customer_info.createIndex( { "customer_name": "text" } )

 To perform the text search:
db.customer_info.find( { customer_id: 0, $text: { $search: "John" } } )
 A collection can have at most one text index.
Multi Languages Text Search
Supported Languages and Stop Words
 MongoDB supports text search for various languages. Text indexes drop language-specific stop
words (e.g. in English, “the”, “an”, “a”, “and”, etc.) and use simple language-specific suffix
stemming
 If you specify a language value of "none", then the text index uses simple tokenization with no
list of stop words and no stemming
db.quotes.createIndex( { content: "text" }, { default_language: "spanish" } )

 Text search languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian,
Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
 A compound index can include a text index key in combination with ascending/descending
index keys.
 A compound text index cannot include any other special index types, such as multi-key or
geospatial index fields.
 If the compound text index includes keys preceding the text index key, to perform a $text
search, the query predicate must include equality match conditions on the preceding keys
Index Creation
 MongoDB provides several options that only affect the creation of the index
 Specify these options in a document as the second argument to the
db.collection.createIndex() method
 By default, creating an index blocks all other operations on a database
 The database that holds the collection is unavailable for read or write operations
until the index build completes.
 For potentially long-running index builds, consider creating the index in the
background
db.people.createIndex( { zipcode: 1}, {background: true} )
 By default, background is false for building MongoDB indexes.
 You can combine the background option with other options, as in the following:
db.people.createIndex( { zipcode: 1}, {background: true, sparse: true } )
Background Index Creation
 If MongoDB is building an index in the background, you cannot perform other
administrative operations involving that collection, including running repairDatabase,
dropping the collection (i.e. db.collection.drop()), and running compact
 The background index operation uses an incremental approach that is slower than
the normal “foreground” build
 If the index is larger than the available RAM, then the incremental process can take
much longer than the foreground build
 Always build indexes in production instances using separate application code, during
designated maintenance windows.
Index Creation on Replica Set
 Background index operations on replica set secondaries begin after the primary
completes building the index
 If MongoDB builds an index in the background on the primary, the secondaries will
then build that index in the background.
 The amount of time required to build the index on a secondary must be within the
window of the oplog, so that the secondary can catch up with the primary
 The default name for an index is the concatenation of the indexed keys and each
key’s direction in the index, 1 or -1.
 Optionally, you can specify a name for an index instead of using the default name.
db.products.createIndex( { item: 1, quantity: -1 } , { name: "inventory" } )
Removing Index

 To remove an index from a collection use the dropIndex() method


db.accounts.dropIndex( { "tax-id": 1 } )

 You can also use the db.collection.dropIndexes() to remove all indexes, except for
the _id index
db.items.dropIndexes()
Modifying an Index

To modify an existing index, you need to drop and recreate the index.

Step 1: Drop the index.


 To modify the index, you must drop the index first.
db.orders.dropIndex({ "cust_id" : 1, "ord_date" : -1, "items" : 1 })

Step 2: Recreate the index


 Recreate the index with the modified specification (here, items changes from
ascending to descending).
db.orders.createIndex({ "cust_id" : 1, "ord_date" : -1, "items" : -1 })
Rebuilding Indexes

 If you need to rebuild indexes for a collection you can use the db.collection.reIndex()
method to rebuild all indexes on a collection in a single operation. This operation
drops all indexes, including the _id index, and then rebuilds all indexes.
db.items.reIndex()
 To see the status of an indexing process, you can use db.currentOp()
 You can terminate both background index builds and foreground index builds. To
terminate an ongoing index build, use the db.killOp() method
 You cannot terminate a replicated index build on secondary members of a replica
set
Listing Indexes

 List all Indexes on a Collection: To return a list of all indexes on a collection

 Use the db.collection.getIndexes() method or a similar method for your driver.

 List all Indexes for a Database : To list all indexes on all collections in a database,
you can use the following operation in the mongo shell:

 db.getCollectionNames().forEach(function(collection) {
indexes = db[collection].getIndexes();
print("Indexes for " + collection + ":");
printjson(indexes); });
Measure Index Usage

 Query performance is a good general indicator of index use

 MongoDB provides a number of tools that allow you to study query operations and
observe index use for your database

Return Query Plan with explain():

 Use db.collection.explain() or the cursor.explain() method in executionStats mode to
return statistics about the query process, including the index used, the number of
documents scanned, and the time the query takes to process in milliseconds.

 Use db.collection.explain() or the cursor.explain() method in allPlansExecution mode
to view partial execution statistics collected during plan selection.
Control Index Usage
Control Index Use with hint() :
 To force MongoDB to use a particular index for a db.collection.find() operation,
append the hint() method:
db.people.find({ name: "John Morrison", zipcode: { $gt: "63000" } }).hint( { zipcode: 1 } )

 To view the execution statistics for a specific index, append to the
db.collection.find() the hint() method followed by cursor.explain(), e.g.

db.people.find({ name: "John Morrison", zipcode: { $gt: "63000" } }).hint( {
zipcode: 1 } ).explain("executionStats")

 Specify the $natural operator to the hint() method to prevent MongoDB from using
any index:
db.people.find({ name: "John Morrison", zipcode: { $gt: "63000" } }).hint( { $natural: 1 } )
Indexes Metrics
 MongoDB provides a number of metrics of index use and operation that you may
want to consider when analyzing index use for your database:

serverStatus:
– scanned
– scanAndOrder

collStats:
– totalIndexSize
– indexSizes

dbStats:
– dbStats.indexes
– dbStats.indexSize
Geospatial Indexes
 MongoDB provides a special type of index for coordinate plane queries, called a
geospatial index
 To create a geospatial index for GeoJSON-formatted data, use the
db.collection.createIndex() method to create a 2dsphere index.
 In the index specification document for the db.collection.createIndex() method,
specify the location field as the index key and specify the string literal "2dsphere" as
the value:
 db.collection.createIndex( { <location field> : "2dsphere" } )
 A compound index can include a 2dsphere index key in combination with
non-geospatial index keys
Geospatial Query Operators

 Inclusion: MongoDB can query for locations contained entirely within a specified
polygon. Inclusion queries use the $geoWithin operator.

 Intersection : MongoDB can query for locations that intersect with a specified
geometry. These queries apply only to data on a spherical surface. These queries
use the $geoIntersects operator.

 Proximity: MongoDB can query for the points nearest to another point. Proximity
queries use the $near operator.
$geoWithin Operator
 The $geoWithin operator queries for location data found within a GeoJSON polygon.
Your location data must be stored in GeoJSON format. Use the following syntax:

db.<collection>.find( { <location field> : { $geoWithin : { $geometry : { type :
"Polygon", coordinates : [ <coordinates> ] } } } } )

 The following example selects all points and shapes that exist entirely within a
GeoJSON polygon:
db.places.find( { loc :{ $geoWithin :{ $geometry :{ type : "Polygon" ,coordinates : [ [[ 0
, 0 ] ,[ 3 , 6 ] ,[ 6 , 1 ] ,[ 0 , 0 ]] ]} } } } )
Proximity to GeoJSON Point
 Proximity queries return the points closest to the defined point and sorts the results
by distance. A proximity query on GeoJSON data requires a 2dsphere index
 To query for proximity to a GeoJSON point, use either the $near operator or
geoNear command. Distance is in meters.
db.<collection>.find( { <location field> :{ $near :{ $geometry :{ type :
"Point" ,coordinates : [ <longitude> , <latitude> ] } ,$maxDistance : <distance in
meters>} } } )
 The geoNear command uses the following syntax:
db.runCommand( { geoNear : <collection> ,near : { type : "Point" ,coordinates:
[ <longitude>, <latitude> ] } ,spherical : true } )
 To select all grid coordinates in a “spherical cap” on a sphere, use $geoWithin with
the $centerSphere operator.
 db.<collection>.find( { <location field> : { $geoWithin : { $centerSphere : [ [ <x>, <y> ],
<radius> ] } } } )
Aggregation
 Aggregations are operations that process data records and return computed
results
 MongoDB provides a rich set of aggregation operations that examine and perform
calculations on the data sets.
 Like queries, aggregation operations in MongoDB use collections of documents as
an input and return results in the form of one or more documents
 Aggregation Pipelines: an aggregation framework modeled on the concept of
data processing pipelines. Documents enter a multi-stage pipeline that transforms
the documents into an aggregated result.
 The most basic pipeline stages provide filters that operate like queries and
document transformations that modify the form of the output document
 In addition, pipeline stages can use operators for tasks such as calculating the
average or concatenating a string
Aggregation (pipeline diagram)
Pipeline Operator and Indexes

 Early Filtering: If your aggregation operation requires only a subset of the data in a
collection, use the $match, $limit, and $skip stages to restrict the documents that
enter at the beginning of the pipeline.

 When placed at the beginning of a pipeline, $match operations use suitable indexes
to scan only the matching documents in a collection

 Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline
is logically equivalent to a single query with a sort and can use an index. When
possible, place $match operators at the beginning of the pipeline
Aggregate Pipeline Stages

 Pipeline stages appear in an array. Documents pass through the stages in sequence.
All except the $out and $geoNear stages can appear multiple times in a pipeline.
db.collection.aggregate( [ { <stage> }, ... ] )
 $project: Used to select some specific fields from a collection.

 $match: This is a filtering operation and thus can reduce the number of
documents that are given as input to the next stage.

 $group: This does the actual aggregation as discussed above.

 $sort: Sorts the documents.


Aggregate Pipeline Stages

 $skip: Makes it possible to skip forward in the list of documents by a given
number of documents.
 $limit: Limits the number of documents to look at, starting from the current
position.
 $unwind: Used to unwind documents that contain arrays. Array data is effectively
pre-joined; this stage undoes that so that we have individual documents again.
Thus this stage increases the number of documents for the next stage
Aggregation Pipeline Examples

 The following aggregation operation returns all states with total population greater
than 10 million:
db.zipcodes.aggregate( [{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }] )

 The following aggregation operation returns user names sorted by the month they
joined. This kind of aggregation could help generate membership renewal notices.

db.users.aggregate([{ $project : { month_joined : { $month : "$joined" }, name :
"$_id", _id : 0 } }, { $sort : { month_joined : 1 } }])
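The zipcodes pipeline above can be sketched in plain JavaScript, with a Map standing in for the $group stage and a filter for $match (the sample data is invented for illustration):

```javascript
// Sample zipcode documents (illustrative data).
const zipcodes = [
  { state: 'NY', pop: 8000000 },
  { state: 'NY', pop: 4000000 },
  { state: 'VT', pop: 600000 },
];

// Stage 1 — $group: { _id: "$state", totalPop: { $sum: "$pop" } }
const groups = new Map();
for (const doc of zipcodes)
  groups.set(doc.state, (groups.get(doc.state) || 0) + doc.pop);

// Stage 2 — $match: { totalPop: { $gte: 10*1000*1000 } }
const result = [...groups]
  .map(([state, totalPop]) => ({ _id: state, totalPop }))
  .filter(g => g.totalPop >= 10 * 1000 * 1000);
```

Only NY survives the $match here, with a totalPop of 12,000,000. Note that because $match follows $group in this pipeline, it filters computed totals and cannot use an index, unlike a $match placed at the start.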
Map Reduce in MongoDB

 Map-reduce is a data processing paradigm for condensing large volumes of data into
useful aggregated results
 For map-reduce operations, MongoDB provides the mapReduce database
command.
 In this map-reduce operation, MongoDB applies the map phase to each input
document (i.e. the documents in the collection that match the query condition).
 The map function emits key-value pairs.
 For those keys that have multiple values, MongoDB applies the reduce phase, which
collects and condenses the aggregated data
 MongoDB then stores the results in a collection. Optionally, the output of the
reduce function may pass through a finalize function to further condense or process
the results of the aggregation
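The map and reduce phases can be sketched in plain JavaScript (an illustrative order-totals example; in MongoDB the JavaScript functions run inside the mongod process):

```javascript
// Illustrative input collection.
const orders = [
  { cust_id: 'A', amount: 10 },
  { cust_id: 'B', amount: 5 },
  { cust_id: 'A', amount: 20 },
];

// Map: emit a (key, value) pair per document.
const mapFn = doc => [[doc.cust_id, doc.amount]];

// Reduce: condense the values emitted for one key.
const reduceFn = (key, values) => values.reduce((a, b) => a + b, 0);

function mapReduce(coll, map, reduce) {
  const emitted = new Map();                  // key -> [values]
  for (const doc of coll)
    for (const [k, v] of map(doc))
      emitted.set(k, (emitted.get(k) || []).concat(v));
  const out = {};
  for (const [k, values] of emitted)          // reduce only multi-value keys,
    out[k] = values.length > 1 ? reduce(k, values) : values[0];
  return out;
}

const totals = mapReduce(orders, mapFn, reduceFn); // { A: 30, B: 5 }
```

As in MongoDB, reduce is only applied to keys with multiple emitted values, which is why a real reduce function must be idempotent and associative.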
Map Reduce in MongoDB

 All map-reduce functions in MongoDB are JavaScript and run within the mongod
process

 Map-reduce operations take the documents of a single collection as the input and
can perform any arbitrary sorting and limiting before beginning the map stage

 MapReduce can return the results of a map-reduce operation as a document, or
may write the results to collections. The input and the output collections may be
sharded.
Map Reduce in MongoDB (diagram slide)
Aggregate Operations
MongoDB provides a number of special-purpose operations that perform specific
aggregations on a set of data.
These operations provide straightforward semantics for common data processing
options.
Count: MongoDB can return a count of the number of documents that match a query
The following operation would count all documents in the customer_info collection
db.customer_info.count()
Distinct: The distinct operation takes a number of documents that match a query and
returns all of the unique values for a field in the matching documents.
Consider the following operation which returns the distinct values of the field
customer_name:
db.customer_info.distinct( "customer_name" )
Aggregate Operations

Group: The group operation takes a number of documents that match a query, and
then collects groups of documents based on the value of a field or fields. It returns an
array of documents with computed results for each group of documents
Group does not support data in sharded collections.
In addition, the results of the group operation must be no larger than 16 megabytes.
Consider the following group operation which groups documents by the field a,
where a is less than 3, and sums the field count for each group:
db.records.group( {
  key: { a: 1 },
  cond: { a: { $lt: 3 } },
  reduce: function(cur, result) { result.count += cur.count },
  initial: { count: 0 }
} )
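The group() call above can be sketched in plain JavaScript (a simplified model of the key/cond/reduce/initial options; here cond and key are plain functions and arrays rather than query documents):

```javascript
// Illustrative input collection.
const records = [
  { a: 1, count: 4 }, { a: 2, count: 6 }, { a: 1, count: 3 }, { a: 5, count: 9 },
];

// Filter by cond, then fold each key group through reduce, starting from
// a fresh copy of initial per group.
function group(coll, { key, cond, reduce, initial }) {
  const out = new Map();
  for (const doc of coll) {
    if (!cond(doc)) continue;                    // cond: { a: { $lt: 3 } }
    const k = JSON.stringify(key.map(f => doc[f]));
    if (!out.has(k)) {
      const base = { ...initial };
      for (const f of key) base[f] = doc[f];     // carry the key fields
      out.set(k, base);
    }
    reduce(doc, out.get(k));                     // mutates the result doc
  }
  return [...out.values()];
}

const results = group(records, {
  key: ['a'],
  cond: doc => doc.a < 3,
  reduce: (cur, result) => { result.count += cur.count; },
  initial: { count: 0 },
});
// -> [{ a: 1, count: 7 }, { a: 2, count: 6 }]
```

The document with a: 5 is excluded by cond, and the two a: 1 documents are summed into one group, mirroring what the shell command returns.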
RECAP

Indexing and Aggregation in MongoDB
