March 2016
Contents
Executive Summary
Introduction
Audience
Summary
References
This whitepaper also discusses the benefits of deploying Pure Storage FlashArray in a MongoDB
environment.
Introduction
MongoDB is a high-performance, scalable, open source, schema-free, document-oriented database
designed for a broad array of modern applications and cloud environments. MongoDB is the industry's
fastest growing database, used by organizations of all kinds to power online, operational applications
where high throughput, low latency and high availability are paramount requirements.
Modern applications demand rich, dynamic data structures, easy scaling, better performance and
low TCO as customer and business requirements change rapidly. MongoDB's dynamic data structure,
its ability to index and query data, and auto-sharding make it a strong tool that adapts well to change.
It greatly reduces complexity compared to a traditional RDBMS. On top of this, the Internet of
Things (IoT) and Big Data place new demands on the database, such as scalability, flexibility,
analytics and a unified view, which MongoDB meets by virtue of its design philosophy and implementation.
MongoDB's Nexus Architecture blends the best of relational and NoSQL technologies.
Pure Storage FlashArray is the perfect companion for MongoDB when the working set no longer
fits in memory and spills over to the storage subsystem, which must still deliver low latency and
consistently fast response times.
Audience
The target audience for this document includes storage administrators, MongoDB administrators, data
architects, engineers and partners who want to deploy Pure Storage FlashArray in a MongoDB
environment.
Built on 100% consumer-grade MLC flash, Pure Storage FlashArray delivers all-flash enterprise storage
that is 10X faster, more space and power efficient, more reliable, and infinitely simpler, and yet typically
costs less than traditional performance disk arrays.
Figure 1. FlashArray//m
Accelerating Databases and Applications: Speed transactions by 10x with consistent low latency, enable
online data analytics across wide datasets, and mix production, analytics, dev/test, and backup workloads
without fear.
Virtualizing and Consolidating Workloads: Easily accommodate the most IO-hungry Tier 1 workloads,
increase consolidation rates (thereby reducing servers), simplify VI administration, and accelerate
common administrative tasks.
Delivering the Ultimate Virtual Desktop Experience: Support demanding users with better performance
than physical desktops, scale without disruption from pilot to thousands of users, and experience all-flash
performance for under $100/desktop.
Protecting and Recovering Vital Data Assets: Provide always-on protection for business-critical data,
maintain performance even under failure conditions, and recover instantly with FlashRecover.
Pure Storage FlashArray sets the benchmark for all-flash enterprise storage arrays. It delivers:
Consistent Performance: FlashArray delivers consistent <1ms average latency. Performance is optimized
for real-world application workloads dominated by I/O sizes of 32K or larger, rather than 4K/8K hero
benchmarks. Full performance is maintained even under failures and updates.
Less Cost than Disk: Inline deduplication and compression deliver 5–10x space savings across a broad
set of I/O workloads, including databases, virtual machines and virtual desktop infrastructure.
Disaster Recovery Built-In: FlashArray offers native, fully integrated, data-reduction-optimized backup and
disaster recovery at no additional cost. Set up disaster recovery with policy-based automation within
minutes, and recover instantly from local, space-efficient snapshots or remote replicas.
Simplicity Built-In: FlashArray offers game-changing management simplicity that makes storage
installation, configuration, provisioning and migration a snap. No more managing performance, RAID, tiers
or caching. Achieve optimal application performance without any tuning at any layer. Manage the
FlashArray the way you like it: web-based GUI, CLI, VMware vCenter, REST API, or OpenStack.
MongoDB is designed to combine the critical capabilities of relational databases with the innovations of
NoSQL technologies. Relational databases have been around for many years and offer key features that
are still relied upon:
Expressive Query Language. Users should be able to access and manage their data with powerful query,
projection, aggregation and update operators, to support both operational and analytical applications.
Secondary Indexes. Indexes still play a key role in providing efficient, fast access to data for both
reads and writes.
Strong Consistency. Applications should be able to immediately read what has been written to the
database.
While modern applications still use the features above, they also have requirements that relational
databases do not address, and these have driven the development of NoSQL databases, which
provide:
Flexible Data Model. NoSQL databases flourished to address a key requirement of modern applications:
a flexible data model. All NoSQL databases offer a flexible data model, making it easy to store and group
data of any structure and allowing dynamic changes to the schema without downtime.
Elastic Scalability. NoSQL databases were all built with a focus on scalability, so they all include some
form of sharding or partitioning, allowing the database to scale out on commodity hardware deployed
on-premises or in the cloud.
High Performance. NoSQL databases are designed to deliver extreme performance, measured in terms of
both throughput and latency, at any scale.
With MongoDB, organizations can address diverse application needs, hardware resources, and
deployment designs with a single database technology. Through a flexible storage architecture,
MongoDB can be extended with new capabilities and configured for optimal use of specific
hardware architectures. MongoDB allows users to mix and match multiple storage engines within a single
deployment.
MongoDB 3.2 ships with four supported storage engines, all of which can coexist within a single
MongoDB replica set. This makes it easy to evaluate and migrate between them, and to optimize for
specific application requirements.
• The default WiredTiger storage engine, offering document-level concurrency control and native
compression.
• The Encrypted storage engine, protecting highly sensitive data without the performance or
management overhead of separate filesystem encryption.
• The In-Memory storage engine, delivering extreme performance coupled with real-time
analytics for the most demanding, latency-sensitive applications.
• The MMAPv1 engine, an improved version of the storage engine used in pre-3.x MongoDB
releases.
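The storage engine is selected at mongod startup. A minimal sketch of a configuration fragment choosing WiredTiger (the dbPath is illustrative; the file is written locally here rather than to /etc):

```shell
# Hypothetical mongod.conf fragment selecting the WiredTiger engine;
# dbPath is illustrative.
cat > mongod.conf <<'EOF'
storage:
  dbPath: /p02/data
  engine: wiredTiger
EOF
# Start mongod with it (not run here):
# mongod --config mongod.conf
```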
Solution Overview
The standard MongoDB deployment puts together individual sets, or pods, of compute, memory and
storage resources, which appears cost effective. Hence this configuration is widely adopted at cloud
providers like AWS, and it works well at small scale. Companies not using IaaS prefer direct-attached
storage (DAS) as a cost-effective alternative. It is also not uncommon to see server PCIe flash cards,
which deliver superior performance but quickly run out of capacity and are operationally expensive to
deploy and maintain.
The challenge with these pod models arises when companies look to scale. Consistent performance
becomes paramount, and neither DAS nor PCIe flash cards can provide the required level of
efficiency, capacity, availability and sustained performance.
Pure Storage FlashArray is the ideal solution to bridge the gap and support the high-availability, low-latency
requirements of MongoDB deployments. We are now seeing many of our customers consolidating their
MongoDB databases onto Pure Storage FlashArray, not only for the reasons stated above but to
take advantage of the features Pure Storage FlashArray offers: simplicity and ease of
management, snapshots and database cloning, industry-leading data reduction, high availability and
resiliency, and consistent, scalable performance.
Scalability and performance tests were performed using mongoperf and YCSB tools.
Mongoperf
Mongoperf is a utility that checks disk I/O performance independently of MongoDB and is very useful
for simulating MongoDB's I/O operations. It times random disk I/O activity and presents the
results. To validate I/O performance without the filesystem cache coming into play, we ran
the tests with the mmf (memory-mapped files) option set to false, which enables direct I/O access. Mongoperf
accepts various configuration fields to vary the workload. We used a block size (mongoperf.recSizeKB)
of 8KB throughout the tests; the default block size is 4KB.
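As a concrete illustration, mongoperf reads its configuration as a JSON document on stdin. A sketch matching the settings described above (the thread count and file size are illustrative, not the paper's exact values):

```shell
# mongoperf configuration: 8KB records, direct I/O (mmf: false),
# reads and writes enabled. nThreads and fileSizeMB are illustrative.
cat > mongoperf.json <<'EOF'
{ nThreads: 32, fileSizeMB: 10000, r: true, w: true, mmf: false, recSizeKB: 8 }
EOF
# Run it (not executed here):
# mongoperf < mongoperf.json
```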
YCSB
The Yahoo! Cloud Serving Benchmark (YCSB) has become the standard benchmarking tool for
evaluating and testing NoSQL database systems. The original YCSB benchmark was developed by the
Yahoo! Research division, which released it in 2010 with the stated goal of "facilitating performance
comparisons of the new generation of cloud data serving systems".
The YCSB test consists of loading the dataset into the MongoDB databases and executing multiple
workloads of various read/write ratios.
All these tests were run with mmf (memory-mapped files) set to false to validate the direct I/O
configuration, the worst-case scenario in which I/O requests are not fulfilled from RAM and spill
over to the storage subsystem.
The Pure Storage LUNs were mounted on the host using the XFS filesystem, and no LVM was used.
Mongoperf creates its test file on every run, at the file size provided. The following picture illustrates
the performance metrics from the Pure Storage FlashArray//m50 while the file was created for the
following mongoperf parameters on a single node.
The test showed that the host was able to get 1.7 GB/s of bandwidth from Pure Storage at an average IO
size of 504KB.
Graph: mongoperf throughput (ops/sec) vs. threads (4–256) for 1, 2 and 4 nodes.
The test results showed steady scalability across the node counts. In the 4-node scenario, throughput
climbed steadily up to 128 threads and stayed flat or degraded beyond that.
Graph: mongoperf latency (ms) vs. threads (4–256) for 1, 2 and 4 nodes.
Alternatively, the throughput metrics (ops/sec) were plotted against the total threads when run with 1, 2
and 4 nodes. The total threads marked on the x-axis are the sum of all threads across the nodes for that run.
For example, 320 total threads is equivalent to 2 nodes running 160 threads each or 4 nodes running 80
threads each. This graph is useful for finding the combination that yields the maximum
throughput across all nodes.
Graph: mongoperf throughput (ops/sec) vs. total threads (16–512) across 1, 2 and 4 nodes.
In the thread scalability tests, we found the highest throughput of 221,520 ops/sec with 4
nodes each running 128 threads, a total of 512 threads. As mentioned earlier, the latency during this test
was still under 1ms.
In comparison to 100% writes, the mixed workload exhibited higher throughput across all runs, as the
writes were reduced by 50% and replaced with reads. Reads are less expensive than writes on an
all-flash array, and this is reflected directly in the latency and throughput numbers.
Like the all-writes workload, the mixed workload also exhibited good scalability across 1, 2 and 4
nodes. In the mixed workload, we achieved 254,036 MongoDB database operations per second with
latency consistently under 0.25ms on 4 nodes. With a single node, performance increased steadily up to
128 threads and then degraded. With 4 nodes, there were positive improvements up to 256
threads.
In contrast to the all-writes workload, the mixed workload displayed much better latency across all runs.
The maximum read latency was 0.11ms and the maximum write latency was 0.25ms, thanks to the
architectural design of the Pure Storage FlashArray//m.
Graph: read and write latency (ms) vs. threads (4–320) for 1, 2 and 4 nodes (separate read and write series per node count).
The thread scalability tests don't paint the complete picture compared to node scalability: the maximum
throughput of 254,036 ops/sec was achieved with 256 threads on each of 4 nodes, equivalent to 1,024
total threads, which was not part of the thread-scalability test cases. This graph is a different
representation of the scalability numbers in terms of total threads.
From the thread scalability tests, we can deduce that higher thread and node counts yield higher
numbers until either the compute or the I/O resources are saturated. These tests give a
view of the performance that can be expected with varying threads and nodes, helping identify the right
settings for the desired workload.
Graph: mixed-workload throughput (ops/sec) vs. total threads for 1, 2 and 4 nodes.
Graph 7 compares the two workloads, all-writes vs. read/write, when run across 4 nodes.
As mentioned earlier, the read/write workload exhibits higher throughput at lower latency than the
all-writes workload, as the write operations are reduced by 50%. Reads are comparatively cheaper than
writes on an all-flash array, which is evident from the results shown above.
Graph 7: throughput (ops/sec) and latency (ms) vs. threads (8–256) for the 50% read/write and 100% write workloads on 4 nodes.
All the tests revealed that the highest throughput was accomplished with the higher node counts,
validating the scalability that is so critical for NoSQL databases like MongoDB. These tests also
validated the Pure Storage FlashArray//m series' architecture, which sustains low latency and high
performance for mixed workloads.
Test Summary
• Latency was under 1 ms for all tests on the FlashArray.
• The read/write workload achieved higher throughput than the 100% write workload.
• As threads increase, the host becomes the bottleneck rather than the storage.
Note
• The test environment used the iSCSI protocol for connectivity. Higher performance can be obtained
with the FC protocol, which Pure Storage FlashArray also supports.
• Write performance on EXT4 was very poor; XFS is overall better for MongoDB.
We used the WiredTiger storage engine in our tests with journaling enabled. We also compared the
performance of the YCSB workloads against MMAPv1 storage engine.
The YCSB benchmark is a two-step process: the first step loads the data, and the second step runs
the actual workload.
For the tests, we used 50 million documents and 5 million operations. Each document comprised 10 fields
of 100 bytes plus a key (a 1 KB record). Records were selected using a Zipfian distribution. Results reflect
the optimal number of threads, determined by increasing the thread count until the 95th and 99th
percentile latencies began to increase with no corresponding gain in throughput.
We performed the following three workloads with varying thread counts to find the right combination for
higher throughput at lower latency.
We ran the tests with journaling turned on (the default behavior) and off (throughput-optimized, no write
acknowledgement).
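The two-step load/run process above can be sketched with the standard YCSB command-line interface. The host name, database name and thread count below are illustrative, not the exact test harness used:

```shell
# Hypothetical YCSB invocation mirroring the test parameters described
# above: load 50M records, then run 5M operations of Workload A.
./bin/ycsb load mongodb -s -P workloads/workloada \
    -p mongodb.url=mongodb://mongodb1:27017/ycsb \
    -p recordcount=50000000 -threads 16
./bin/ycsb run mongodb -s -P workloads/workloada \
    -p mongodb.url=mongodb://mongodb1:27017/ycsb \
    -p operationcount=5000000 -threads 16
```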
Throughput
Journaled vs. no-journal tests were performed to compare the best possible throughput (throughput-
optimized with no journal, but risking data loss on failure) against the balanced settings (good
throughput with minimal data loss on failure). Only the workloads with updates (A and B) were tested;
Workload C was not included in this test.
The test results were surprising: the performance difference for Workload A (50% read, 50% write)
was 7%, whereas for Workload B (95% read, 5% write) it was 1%. Workload B has only 5% writes, but
even in the 50%-write case (Workload A), a 7% degradation is not very significant. In fact, most users
will find the tradeoff worthwhile, favoring the default configuration over settings that provide an
additional 7% throughput at the cost of data availability.
Graph: throughput (ops/sec), journaled vs. no-journal, for Workloads A and B.
Latency
Latency of operations matters as much as throughput. Since the average latency is not
necessarily the best metric, YCSB reports latency at the 95th and 99th percentiles, the values below
which 95% or 99% of all observed latencies fall. As expected, the read latencies were lower than the
write latencies, and the read latencies stayed under 0.5ms.
Graph: read latency (ms) at the 95th and 99th percentiles for Workloads A, B and C.
Graph: write latency (ms) for Workloads A and B.
The tests revealed that the write latencies also stayed around or below 0.5ms for both workloads.
This is clearly an advantage of the Pure Storage FlashArray, which could handily absorb the
writes that MongoDB pushes out periodically.
Based on our testing, we found the optimal thread setting for YCSB to be 16 threads, where latency
stayed consistently below 1ms and throughput was highest. Beyond 16 threads, latency
increased gradually while throughput stayed flat. To further increase throughput through
scaling, use sharding.
Graph: YCSB throughput (ops/sec) vs. threads (2–64).
WiredTiger syncs the buffered log records to disk according to the following intervals or conditions:
every 50 milliseconds; when a write operation requests journal acknowledgement (write concern j:true);
and when a new journal file is created, approximately every 100 MB of journal data.
Note: Any writes still in the buffers in between can be lost if mongod crashes before the next flush.
With Journal enabled, Pure Storage FlashRecover Snapshots can be used to take crash-consistent, point
in time snapshots of the MongoDB database at the storage layer.
The snapshot can be taken through the Pure Storage GUI, CLI or REST-based APIs.
We will cover snapshotting and cloning the database through the CLI.
1. Take a snapshot of the volume hosting the MongoDB database and journal. If the journal
is on a different volume, it is recommended to create a Protection Group containing both the datafile
and journal volumes.
In the example below, we list the details of the volume and take the snapshot of the
volume, which displays the snapshot name required for Step 2.
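A hedged sketch of Step 1 using the Purity CLI; the volume name and snapshot suffix are illustrative:

```shell
# List the volume, then snapshot it. 'purevol snap' prints the snapshot
# name (fs_mdb1.clone1 in this sketch) needed for Step 2.
purevol list fs_mdb1
purevol snap --suffix clone1 fs_mdb1
```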
2. Instantiate the snapshot by copying it to a volume that will be connected to the destination host.
In the following example, we instantiate the snapshot taken in Step 1 to a new destination
volume named fs_mdb1_dev.
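A sketch of Step 2 with the Purity CLI (names are illustrative):

```shell
# Copy the Step 1 snapshot to a new volume for the destination host.
purevol copy fs_mdb1.clone1 fs_mdb1_dev
```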
3. Connect the copied volume to the destination host. In the example below, we connect the
volume (fs_mdb1_dev) to the destination host mongodb2.
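A sketch of this connect step with the Purity CLI (host and volume names are illustrative):

```shell
# Expose the copied volume to the destination host.
purevol connect --host mongodb2 fs_mdb1_dev
```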
4. Scan for the volume attached to the destination host using the following Linux command
and mount it. Note: If you are using an alias with the device multipathing configuration, update
the /etc/multipath.conf file with the new volume information and the alias. Otherwise, the
device mapper name will show the WWID. In the example below, we have updated the
/etc/multipath.conf file with the alias name fs_mdb1_dev.
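A sketch of the corresponding /etc/multipath.conf alias entry; the WWID placeholder must be replaced with the value reported by `multipath -ll`, and the fragment is written to a local file here purely for illustration:

```shell
# Hypothetical multipath alias fragment; in practice this goes inside
# /etc/multipath.conf, followed by a multipathd reload.
cat > multipath-alias.conf <<'EOF'
multipaths {
    multipath {
        wwid  <volume WWID from multipath -ll>
        alias fs_mdb1_dev
    }
}
EOF
```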
root # rescan-scsi-bus.sh
We recommend the XFS filesystem for MongoDB databases (for both the MMAPv1 and WiredTiger
storage engines) over other filesystems. Based on our internal testing, we have seen better performance
with XFS than with EXT4.
Disable access-time updates through a mount option. Most file systems maintain metadata recording the
last time each file was accessed. This may be useful for some applications, but for a database it means
the file system issues a write every time the database accesses a page, which negatively impacts
performance.
Use the noatime mount option to disable access-time updates.
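For example, a persistent mount entry might look like the following; the device alias and mount point are illustrative, and the line is written to a local file here rather than to /etc/fstab:

```shell
# Hypothetical /etc/fstab entry mounting the XFS volume with noatime.
cat > fstab-entry.txt <<'EOF'
/dev/mapper/fs_mdb1_dev  /p02  xfs  noatime  0 0
EOF
```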
Read-Ahead settings
Set the readahead block size to 32 (16KB) or the size of most documents, whichever is larger.
If the readahead size is much larger than the data requested, larger blocks are read from
disk, which is not ideal for performance or memory usage.
Use the blockdev --setra <value> command to set the readahead value.
MongoDB workloads perform poorly with transparent huge pages (THP) because they tend to have
sparse rather than contiguous memory access patterns. It is a best practice to disable THP for MongoDB
by running the following commands.
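The standard commands, per MongoDB's production notes, are run as root; on RHEL 7 they can be persisted via an init script or tuned profile:

```shell
# Disable transparent huge pages until the next reboot. Requires root.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```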
TRIM/UNMAP
When files are deleted in Unix, the operating system marks the blocks used by those files as free in
the file system index, but it does not inform the SSD. This means the SSD cannot reclaim that space
until the OS informs it of the action. TRIM is the command the OS can send to inform the SSD of all
the blocks that are free in the file system.
TRIM is enabled at the file system level using the mount option named “discard”. This informs the SSD in
real time of the blocks freed when a file is deleted, which allows the SSD to reclaim the space in the
background.
The MMAPv1 storage engine of MongoDB uses memory mapped files to map the data files to a region of
virtual memory. By using memory mapped files, MongoDB can treat the contents of its data files as if they
were in memory. Due to the architectural implementation of MMAPv1, the files would be saved in chunks
of 2GB in size. Hence a multi-terabyte MongoDB database using MMAPv1 storage engine can have
numerous memory mapped files.
We recommend not using the discard mount option to achieve TRIM; instead, run the “fstrim”
command on demand or on a schedule. Invoking fstrim once a day or week should be more than
sufficient to reclaim the freed-up space without incurring any performance overhead.
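For example (the mount point is illustrative):

```shell
# Reclaim freed blocks on demand; -v reports the bytes trimmed.
fstrim -v /p02
# Or schedule it weekly via cron, e.g. Sundays at 2am:
# 0 2 * * 0 /usr/sbin/fstrim /p02
```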
[root@mongodb1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fs_m50_mdb1_mperf1 1.0T 529G 495G 52% /p02
Database Cloning
To use Pure Storage FlashRecover Snapshots to take snapshots of the MongoDB database, place the
database files and journal on the same volume. As performance on Pure Storage does not depend on
volume/LUN count, it is perfectly fine to place both the data files and journals on the same volume.
If multiple MongoDB databases are hosted on the same host and are candidates for
database cloning, place each database on its own volume/LUN.
Enable journaling for MongoDB databases. To get a consistent copy of the database, journaling must
be enabled; without it, there is no guarantee that the snapshot will be consistent or valid.
A single chassis with 4 identical Intel CPU-based SuperMicro servers (product name SYS-F618R2-RTPT+)
was used for the MongoDB scalability and consistent-performance testing. Red Hat Enterprise Linux 7.2
was installed on the local disks.
Component Description
FlashArray Configuration
The FlashArray//m50 configuration comprised two active/active controllers, and the base chassis
included 20TB of raw SSD storage for a total of 11.17 TB usable. 4 x 10GbE network interfaces
from each controller were connected to dual Cisco 9K switches in a highly redundant configuration.
Note: No special configuration or settings were made on the array; the FlashArray has no
performance knobs to tune.
Component      Description
FlashArray     //m50
Physical       3U; 5.12” x 18.94” x 29.72” FlashArray//m chassis
Connectivity
The servers were connected to the storage array over the iSCSI protocol through Cisco 9K switches. To
improve bandwidth and provide redundancy to the server, the two 10G interfaces were bonded
together using load-balancing (round-robin) mode. See Appendix A for the bonding settings that were
used on our Linux hosts.
$cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NAME=bond0
IPV6INIT=no
IPADDR=192.168.1.71
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 mode=0"
NM_CONTROLLED=no
$cat /etc/sysconfig/network-scripts/ifcfg-ens7f0
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV6INIT=no
NAME=bond0-slave0
DEVICE=ens7f0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no
$cat /etc/sysconfig/network-scripts/ifcfg-ens7f1
TYPE=Ethernet
BOOTPROTO=none
IPV6INIT=no
#NAME=ens7f1
NAME=bond0-slave1
DEVICE=ens7f1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no
Twitter: @purelydb
T: 650-290-6088
F: 650-625-9667
Sales: sales@purestorage.com
Support: support@purestorage.com
Media: pr@purestorage.com
General: info@purestorage.com