Aggregation
AGENDA
• Creating different kinds of indexes
• Creating geospatial indexes
• Listing the indexes
• Modifying the indexes
• Dropping the indexes
• Updating Multiple Documents
• Aggregation tools in MongoDB
• MapReduce concept in MongoDB
Your disk has your data files and your journal files.
When you start up mongod, it maps your data files to a shared view.
Basically, the operating system says: "Okay, your data file is 2,000 bytes on disk. I'll
map that to memory addresses 1,000,000-1,002,000.
So, if you read the memory at memory address 1,000,042, you'll be getting byte 42
of the file."
(Also, the data won't necessarily be loaded until you actually access that memory.)
This memory is still backed by the file: if you make changes in memory, the
operating system will flush these changes to the underlying file.
mongod also maps a second, private view; your writes go to the private view, and
the changes are first written to the journal.
The journal will then replay these changes on the shared view.
Finally, at a glacial speed compared to everything else, the shared view will be flushed
to disk.
Periodically, mongod remaps the private view from the shared view. This prevents
the private view from getting too "dirty" (having too many changes
from the shared view it was mapped from).
WiredTiger:
For the data files, MongoDB creates checkpoints (i.e., writes the snapshot data to disk)
at intervals of 60 seconds or after 2 GB of journal data has been written, whichever
occurs first.
• Because MongoDB uses a journal file size limit of 100 MB, WiredTiger creates a new
journal file approximately every 100 MB of data. When WiredTiger creates a new
journal file, WiredTiger syncs the previous journal file.
Without indexes MongoDB must perform a collection scan, i.e. scan every document
in a collection
If an appropriate index exists for a query, MongoDB can use the index to limit the
number of documents it must inspect
Indexes provide high performance read operations for frequently used queries
Indexes are special data structures that store a small portion of the collection's
data set in an easy-to-traverse form
Index Types
• Default _id
• Single Field
• Compound Index
• Multikey Index
• Geospatial Index
• Text Indexes
• Hashed Indexes
MongoDB
Index Properties
• Unique Indexes
• Sparse Indexes
• TTL Indexes
Single Field Index
MongoDB provides complete support for indexes on any field in a collection of
documents
Applications and users may add additional indexes to support important queries and
operations.
The following command creates an index on the name field for the users collection
db.users.createIndex( { "name": 1 } )
Index on embedded field and document
MongoDB provides complete support for indexes on any field in a collection of
documents
You can create indexes on fields within embedded documents, just as you can index
top-level fields in documents.
{ "_id" : 3, "item" : "Book", "available" : true, "soldQty" : 144821, "category" : "NoSQL",
"details" : { "ISDN" : "1234", "publisher" : "XYZ Company" }, "onlineSale" : true
}
To create an index on the ISDN field:
db.items.createIndex( { "details.ISDN": 1 } )
To create an index on the entire embedded document "details":
db.items.createIndex( { "details": 1 } )
Compound Indexes
MongoDB supports compound indexes, where a single index structure holds
references to multiple fields within a collection's documents.
The order of the fields matters. For example, an index key pattern { a: 1, b: 1 } can
support a sort on { a: 1, b: 1 } but not on { b: 1, a: 1 }
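Why the field order matters can be sketched in plain JavaScript (runnable in Node.js; this only illustrates the idea, it is not how the server stores keys): a compound index on { a: 1, b: 1 } keeps its entries sorted by a first, then b, so a sort on (a, b) can simply read the index in order.

```javascript
// Sketch: entries of a compound index on { a: 1, b: 1 } are ordered
// by a, then by b. A sort on { a: 1, b: 1 } matches this order;
// a sort on { b: 1, a: 1 } does not.
const docs = [
  { a: 2, b: 1 }, { a: 1, b: 2 }, { a: 1, b: 1 }, { a: 2, b: 0 },
];

// Comparator matching the { a: 1, b: 1 } index key order.
const byAB = (x, y) => (x.a - y.a) || (x.b - y.b);
const indexOrder = [...docs].sort(byAB);
// indexOrder: {a:1,b:1}, {a:1,b:2}, {a:2,b:0}, {a:2,b:1}
```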
Memory allocation for Indexes
For the fastest processing, ensure that your indexes fit entirely in RAM so that the
system can avoid reading the index from disk.
If you have and use multiple collections, you must consider the combined size of all
indexes on all collections
Multikey Indexes
To index a field that holds an array value, MongoDB creates an index key for each
element in the array
These multikey indexes support efficient queries against array fields
Multikey indexes can be constructed over arrays that hold both scalar values (e.g.
strings, numbers) and nested documents.
db.coll.createIndex( { <field>: < 1 or -1 > } )
MongoDB automatically creates a multikey index if any indexed field is an array; you
do not need to explicitly specify the multikey type
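Conceptually, a multikey index emits one index entry per array element, so a query on any single element can find the document. A plain-JavaScript sketch (names like `indexEntries` are illustrative, not MongoDB internals):

```javascript
// Sketch: how a multikey index conceptually expands an array-valued
// field into one index entry per element.
function indexEntries(doc, field) {
  const value = doc[field];
  const values = Array.isArray(value) ? value : [value];
  // One { key, docId } entry per element, as a multikey index would store.
  return values.map(v => ({ key: v, docId: doc._id }));
}

const doc = { _id: 1, tags: ["mongodb", "index", "multikey"] };
const entries = indexEntries(doc, "tags");
// entries: one entry each for "mongodb", "index", and "multikey",
// all pointing back to document _id 1
```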
Compound Multikey Indexes
For a compound multikey index, each indexed document can have at most one
indexed field whose value is an array
As such, you cannot create a compound multikey index if more than one to-be-
indexed field of a document is an array.
For example, consider a collection that contains the following document:
{ _id: 1, product_id: [ 1, 2 ], retail_id: [ 100, 200 ], category: "AB - both arrays" }
You cannot create a compound multikey index { product_id: 1, retail_id: 1 } on this
collection, because both the product_id and retail_id fields are arrays.
Hashed Indexes
The hashing function collapses embedded documents and computes the hash for
the entire value, but hashed indexes do not support multikey (i.e., array) fields.
Using a hashed shard key to shard a collection ensures a more even distribution of
data.
db.items.createIndex( { item: "hashed" } )
Time to Live (TTL) Indexes
TTL indexes are special single-field indexes that MongoDB can use to automatically
remove documents from a collection after a certain amount of time.
It is useful for certain types of data such as machine generated event data, logs, and
session information that only need to persist in a database for a finite amount of
time.
If the indexed field in a document is neither a date nor an array that holds a date
value, the document will not expire.
On replica sets, the TTL background thread only deletes documents on the primary.
Secondary members replicate deletion operations from the primary.
TTL indexes are single-field indexes; compound indexes do not support the TTL
property.
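A TTL index is created with the expireAfterSeconds option, e.g. db.eventlog.createIndex( { "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 } ). The expiry rule the background thread applies can be sketched in plain JavaScript (this mimics the behavior; it is not the server's implementation):

```javascript
// Sketch of the TTL rule: a document expires once
// indexedDate + expireAfterSeconds is in the past. Documents whose
// indexed field is not a date never expire.
function expired(doc, field, expireAfterSeconds, now) {
  const v = doc[field];
  if (!(v instanceof Date)) return false; // non-date fields never expire
  return v.getTime() + expireAfterSeconds * 1000 <= now.getTime();
}

const now = new Date("2024-01-01T01:00:00Z");
const docs = [
  { _id: 1, createdAt: new Date("2024-01-01T00:00:00Z") }, // 1 hour old
  { _id: 2, createdAt: new Date("2024-01-01T00:59:00Z") }, // 1 minute old
  { _id: 3, createdAt: "not a date" },                     // never expires
];
// With expireAfterSeconds = 1800 (30 minutes), only _id 1 is removed.
const surviving = docs.filter(d => !expired(d, "createdAt", 1800, now));
```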
Unique Indexes
A unique index causes MongoDB to reject all documents that contain a duplicate
value for the indexed field
db.items.createIndex( { "item": 1 }, { unique: true } )
If you use the unique constraint on a compound index, then MongoDB will enforce
uniqueness on the combination of values rather than the individual value for any or
all values of the key.
If a document does not have a value for the indexed field in a unique index, the
index will store a null value for this document
Because of the unique constraint, MongoDB will only permit one document that
lacks the indexed field
You can combine the unique constraint with the sparse index to filter these null
values from the unique index and avoid the error
Sparse Indexes
Sparse indexes only contain entries for documents that have the indexed field, even
if the index field contains a null value.
The index skips over any document that is missing the indexed field.
The index is “sparse” because it does not include all documents of a collection.
db.addresses.createIndex( { "xmpp_id": 1 }, { sparse: true } )
If a sparse index would result in an incomplete result set for queries and sort
operations, MongoDB will not use that index unless a hint() explicitly specifies the
index.
db.items.find().sort( {soldQty: -1 } ).hint( {soldQty: 1 } )
2d (geospatial) and text indexes are sparse by default.
Sparse and Unique Properties
An index that is both sparse and unique prevents the collection from having
documents with duplicate values for a field, but allows multiple documents that
omit the key.
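The combined behavior can be sketched in plain JavaScript (illustrative only; the function name and error text mimic, not reproduce, the server's duplicate-key error):

```javascript
// Sketch: a sparse unique index rejects duplicate values for the
// indexed field but skips documents that omit the field entirely.
function buildSparseUniqueIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    if (!(field in doc)) continue;           // sparse: skip missing field
    if (index.has(doc[field])) {             // unique: reject duplicates
      throw new Error("duplicate key: " + doc[field]);
    }
    index.set(doc[field], doc._id);
  }
  return index;
}

// Two documents omit xmpp_id and are both allowed.
const ok = buildSparseUniqueIndex(
  [{ _id: 1, xmpp_id: "a" }, { _id: 2 }, { _id: 3 }], "xmpp_id");
```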
Text Indexes
MongoDB provides text indexes to support text search of string content in
documents of a collection
Text indexes can include any field whose value is a string or an array of string
elements
To perform queries that access the text index, use the $text query operator.
To index a field that contains a string or an array of string elements, include the field
and specify the string literal "text" in the index document
db.customer_info.createIndex( { "customer_name": "text" } )
Dropping Indexes
To remove a single index, use the db.collection.dropIndex() method.
You can also use db.collection.dropIndexes() to remove all indexes, except for
the _id index:
db.items.dropIndexes()
Modifying an Index
To modify an existing index, you need to drop and recreate the index.
If you need to rebuild indexes for a collection you can use the db.collection.reIndex()
method to rebuild all indexes on a collection in a single operation.
This operation drops all indexes, including the _id index, and then rebuilds all
indexes.
db.items.reIndex()
To see the status of an indexing process, you can use the db.currentOp() method.
You can terminate both background index builds and foreground index builds.
To terminate an ongoing index build, use the db.killOp() method.
You cannot terminate a replicated index build on secondary members of a replica
set.
Listing Indexes
List all Indexes for a Database : To list all indexes on all collections in a database,
you can use the following operation in the mongo shell:
db.getCollectionNames().forEach(function(collection) {
   var indexes = db[collection].getIndexes();
   print("Indexes for " + collection + ":");
   printjson(indexes);
});
Measure Index Usage
MongoDB provides a number of tools, such as cursor.explain(), that allow you to
study query operations and observe index use for your database.
Specify the $natural operator to the hint() method to prevent MongoDB from using
any index:
db.people.find( { name: "John Morrison", zipcode: { $gt: "63000" } } ).hint( { $natural: 1 } )
Indexes Metrics
MongoDB provides a number of metrics of index use and operation that you may
want to consider when analyzing index use for your database:
serverStatus:
– scanned
– scanAndOrder
collStats:
– totalIndexSize
– indexSizes
dbStats:
– dbStats.indexes
– dbStats.indexSize
Geospatial Indexes
MongoDB provides a special type of index for coordinate plane queries, called a
geospatial index
To create a geospatial index for GeoJSON-formatted data, use the
db.collection.createIndex() method to create a 2dsphere index.
In the index specification document for the db.collection.createIndex() method,
specify the location field as the index key and specify the string literal "2dsphere" as
the value:
db.collection.createIndex( { <location field> : "2dsphere" } )
A compound index can include a 2dsphere index key in combination with non-
geospatial index keys
Geospatial Query Operators
Inclusion: MongoDB can query for locations contained entirely within a specified
polygon. Inclusion queries use the $geoWithin operator.
Intersection: MongoDB can query for locations that intersect with a specified
geometry. These queries apply only to data on a spherical surface. These queries
use the $geoIntersects operator.
Proximity: MongoDB can query for the points nearest to another point. Proximity
queries use the $near operator.
$geoWithin Operator
The $geoWithin operator queries for location data found within a GeoJSON polygon.
Your location data must be stored in GeoJSON format.
The following example selects all points and shapes that exist entirely within a
GeoJSON polygon:
db.places.find( { loc: { $geoWithin: { $geometry: { type: "Polygon",
coordinates: [ [ [ 0, 0 ], [ 3, 6 ], [ 6, 1 ], [ 0, 0 ] ] ] } } } } )
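Under the hood, a $geoWithin polygon query reduces to point-in-polygon tests. A minimal ray-casting sketch in plain JavaScript, using the same polygon as the query above (planar geometry only; a real 2dsphere index does this spherically):

```javascript
// Ray-casting point-in-polygon test (planar sketch, not the server's
// spherical implementation). Counts how many polygon edges a
// horizontal ray from the point crosses; an odd count means inside.
function pointInPolygon([px, py], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    if ((yi > py) !== (yj > py) &&
        px < ((xj - xi) * (py - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Same ring as the $geoWithin example: closed at [0, 0].
const ring = [[0, 0], [3, 6], [6, 1], [0, 0]];
const insidePoint = pointInPolygon([2, 2], ring);    // true
const outsidePoint = pointInPolygon([10, 10], ring); // false
```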
Proximity to GeoJSON Point
Proximity queries return the points closest to the defined point and sort the results
by distance. A proximity query on GeoJSON data requires a 2dsphere index.
To query for proximity to a GeoJSON point, use either the $near operator or the
geoNear command. Distance is in meters.
db.<collection>.find( { <location field>: { $near: { $geometry: { type: "Point",
coordinates: [ <longitude>, <latitude> ] }, $maxDistance: <distance in meters> } } } )
The geoNear command uses the following syntax:
db.runCommand( { geoNear: <collection>, near: { type: "Point",
coordinates: [ <longitude>, <latitude> ] }, spherical: true } )
To select all grid coordinates in a "spherical cap" on a sphere, use $geoWithin with
the $centerSphere operator, which takes a center point and a radius in radians:
db.<collection>.find( { <location field>: { $geoWithin: { $centerSphere:
[ [ <x>, <y> ], <radius> ] } } } )
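The distance metric behind a proximity query is great-circle distance. A haversine sketch in plain JavaScript shows how candidate points would be ordered before $maxDistance is applied (illustrative only; the server's spherical math is more involved):

```javascript
// Haversine great-circle distance in meters between two
// [longitude, latitude] points.
const EARTH_RADIUS_M = 6371000;

function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = deg => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Sort candidate points by distance from a query point, as a
// proximity query does before returning results.
const query = [0, 0];
const points = [[2, 0], [1, 0], [3, 0]];
const nearest = [...points].sort(
  (p, q) => haversineMeters(query, p) - haversineMeters(query, q));
// nearest[0] is [1, 0], roughly 111 km from the query point
```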
Aggregation
Aggregations are operations that process data records and return computed
results
MongoDB provides a rich set of aggregation operations that examine and perform
calculations on the data sets.
Like queries, aggregation operations in MongoDB use collections of documents as
an input and return results in the form of one or more documents
Aggregation Pipelines: an aggregation framework modeled on the concept of
data processing pipelines. Documents enter a multi-stage pipeline that transforms
the documents into an aggregated result.
The most basic pipeline stages provide filters that operate like queries and
document transformations that modify the form of the output document
In addition, pipeline stages can use operators for tasks such as calculating the
average or concatenating a string
Pipeline Operator and Indexes
Early Filtering: If your aggregation operation requires only a subset of the data in a
collection, use the $match, $limit, and $skip stages to restrict the documents that
enter at the beginning of the pipeline.
When placed at the beginning of a pipeline, $match operations use suitable indexes
to scan only the matching documents in a collection
Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline
is logically equivalent to a single query with a sort and can use an index. When
possible, place $match operators at the beginning of the pipeline
Aggregate Pipeline Stages
Pipeline stages appear in an array. Documents pass through the stages in sequence.
All except the $out and $geoNear stages can appear multiple times in a pipeline.
db.collection.aggregate( [ { <stage> }, ... ] )
$project: Selects specific fields from a collection, and can reshape documents or
compute new fields.
$match: A filtering operation; it can reduce the number of documents that are
given as input to the next stage.
$skip: Skips forward in the list of documents by a given number of documents.
$limit: Limits the number of documents to look at, starting from the current
position.
$unwind: Unwinds documents that contain arrays. When using an array, the data
is in effect pre-joined; $unwind undoes this so that each array element becomes
an individual document again. This stage therefore increases the number of
documents for the next stage.
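The stages above can be sketched with plain-JavaScript array operations (a conceptual model with made-up sample documents, not the server's implementation):

```javascript
// Sketch: $match, $project, and $unwind as plain array operations.
const orders = [
  { item: "book", tags: ["new", "sale"], qty: 5 },
  { item: "pen",  tags: ["new"],         qty: 50 },
  { item: "ink",  tags: [],              qty: 3 },
];

// $match: { qty: { $gte: 5 } } — filters documents.
const matched = orders.filter(o => o.qty >= 5);

// $project: { item: 1, tags: 1 } — keeps only some fields.
const projected = matched.map(o => ({ item: o.item, tags: o.tags }));

// $unwind: "$tags" — one output document per array element.
const unwound = projected.flatMap(o =>
  o.tags.map(tag => ({ item: o.item, tags: tag })));
// unwound: book/new, book/sale, pen/new — 3 documents from 2
```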
Aggregation Pipeline Examples
The following aggregation operation returns all states with total population greater
than 10 million:
db.zipcodes.aggregate( [
   { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
   { $match: { totalPop: { $gte: 10*1000*1000 } } }
] )
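Run against a small illustrative data set, the same two stages can be sketched in plain JavaScript (the zipcode documents here are made up):

```javascript
// Sketch of the $group/$match example above with sample data.
const zipcodes = [
  { state: "NY", pop: 8000000 },
  { state: "NY", pop: 4000000 },
  { state: "VT", pop: 600000 },
];

// $group: { _id: "$state", totalPop: { $sum: "$pop" } }
const totals = {};
for (const z of zipcodes) {
  totals[z.state] = (totals[z.state] || 0) + z.pop;
}

// $match: { totalPop: { $gte: 10*1000*1000 } }
const bigStates = Object.entries(totals)
  .filter(([, totalPop]) => totalPop >= 10 * 1000 * 1000)
  .map(([_id, totalPop]) => ({ _id, totalPop }));
// bigStates: only NY, with totalPop 12000000
```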
The following aggregation operation returns user names sorted by the month they
joined. This kind of aggregation could help generate membership renewal notices.
Map-Reduce
Map-reduce is a data processing paradigm for condensing large volumes of data into
useful aggregated results.
For map-reduce operations, MongoDB provides the mapReduce database
command.
In a map-reduce operation, MongoDB applies the map phase to each input
document (i.e., the documents in the collection that match the query condition).
The map function emits key-value pairs.
For those keys that have multiple values, MongoDB applies the reduce phase, which
collects and condenses the aggregated data
MongoDB then stores the results in a collection. Optionally, the output of the
reduce function may pass through a finalize function to further condense or process
the results of the aggregation
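The map and reduce phases described above can be sketched in plain JavaScript (a conceptual model; MongoDB runs user-supplied JavaScript map/reduce functions inside mongod, but this helper and its data are illustrative):

```javascript
// Sketch of map-reduce semantics: map emits key-value pairs, values
// are grouped per key, and reduce condenses keys with multiple values.
function mapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of docs) {
    // Map phase: collect every emitted [key, value] pair.
    for (const [key, value] of mapFn(doc)) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  const out = {};
  for (const [key, values] of groups) {
    // Reduce phase runs only for keys with multiple values.
    out[key] = values.length === 1 ? values[0] : reduceFn(key, values);
  }
  return out;
}

const orders = [
  { cust: "A", amount: 10 }, { cust: "B", amount: 5 }, { cust: "A", amount: 7 },
];
const result = mapReduce(
  orders,
  doc => [[doc.cust, doc.amount]],                  // emit(cust, amount)
  (key, values) => values.reduce((s, v) => s + v)); // sum per customer
// result: { A: 17, B: 5 }
```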
Map Reduce in MongoDB
All map-reduce functions in MongoDB are JavaScript and run within the mongod
process
Map-reduce operations take the documents of a single collection as the input and
can perform any arbitrary sorting and limiting before beginning the map stage
Group: The group operation takes a number of documents that match a query, and
then collects groups of documents based on the value of a field or fields. It returns an
array of documents with computed results for each group of documents
Group does not support data in sharded collections.
In addition, the results of the group operation must be no larger than 16 megabytes.
Consider the following group operation, which groups documents by the field a,
where a is less than 3, and sums the field count for each group:
db.records.group( {
   key: { a: 1 },
   cond: { a: { $lt: 3 } },
   reduce: function(cur, result) { result.count += cur.count; },
   initial: { count: 0 }
} )
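The same operation can be sketched in plain JavaScript, which makes the roles of key, cond, reduce, and initial explicit (sample records are made up):

```javascript
// Sketch of the group() call: filter on cond, group by key, then fold
// each group with reduce starting from initial.
const records = [
  { a: 1, count: 4 }, { a: 1, count: 6 }, { a: 2, count: 1 }, { a: 3, count: 9 },
];

const groups = new Map();
for (const rec of records.filter(r => r.a < 3)) {  // cond: { a: { $lt: 3 } }
  if (!groups.has(rec.a)) {
    groups.set(rec.a, { a: rec.a, count: 0 });     // initial: { count: 0 }
  }
  groups.get(rec.a).count += rec.count;            // reduce
}
const grouped = [...groups.values()];
// grouped: [ { a: 1, count: 10 }, { a: 2, count: 1 } ]
```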
RECAP