DOI 10.3233/WEB-160344
IOS Press
Abstract. Twitter has grown significantly in the past several years and provides a new vector for data collection, offering individual users and companies valuable insights. Collecting and analyzing all of this data efficiently presents a technical challenge. Traditional relational databases have not been able to provide acceptable response times for this new problem, and focus has started shifting to newer technologies such as NoSQL databases. In this paper, we try to answer the following question: "If I want to store and access millions of tweets for data analysis, which database system should I choose?" We selected four popular SQL and NoSQL database systems and tested them on Twitter datasets varying from one million to fifty million tweets. Each workload test involves running a core set of data operation commands. The experimental results are promising and provide guidelines for choosing the most efficient database system based on different user requirements.
2405-6456/16/$35.00 © 2016 – IOS Press and the authors. All rights reserved
276 F. Leung and B. Zhou / Performance evaluation of Twitter datasets on SQL and NoSQL DBMS
greSQL, MongoDB and Redis. Each DBMS has different levels of support for JSON: SQL Server has no native JSON support, and PostgreSQL only gained support in late 2013. NoSQL-based systems such as MongoDB launched in 2009 with native support in a derivative format called Binary JSON (BSON), and Redis allows storage of JSON but does not specifically support that data type natively. We explore and present our findings using datasets consisting of millions of tweets, and compare the performance of these four DBMS. Our experimental results show some interesting findings and provide guidelines on how to choose the most efficient DBMS based on different user needs.

2. Background and related work

2.1. Tweets and big data, collecting and storing

Twitter has grown significantly in the past several years and provides a new vector for data collection, offering companies valuable insight into their users and customers. This presents a technical challenge to collect and analyze all the data in an efficient manner. "Big Data", a term used to describe the massive influx of data being handled and studied (not exclusive to tweets), has been a large driving force for NoSQL adoption [16]. Traditional relational databases have not been able to provide acceptable response times for this new problem, and focus has started shifting to newer technologies such as NoSQL-based DBMS.

Big Data can be described in three V's: volume, variety and velocity. The sheer volume of data is difficult for SQL-based solutions because scalability was not one of the primary concerns decades ago. Though certainly scalable, database servers of the past were run on highly specialized hardware, needing more of the same or similar hardware to expand and scale out. NoSQL emphasizes horizontal scaling on commodity hardware, allowing a much lower cost to scale out.

Consistency, as a part of atomicity, consistency, isolation, durability (ACID), takes a back seat in NoSQL by not being guaranteed. Instead, redundant data is kept with "eventual consistency", which states that if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Redundant data also allows for lower read latency, as there is a higher chance of "finding" the data.

In SQL-based DBMS, the schema is set and defined in advance, and additional data added to the set must adhere to the schema. For guarantees and a system of checks on data, such as type, a strict schema can be helpful. NoSQL eschews this strictness and is flexible with its schema by not explicitly requiring conformity. This gives the developer/DBA more power (and responsibility) in maintaining the database by grouping together similar data without it all needing to be exactly the same. In some types of databases, such as key-value stores, the database itself knows nothing about the data it is storing.

New data is constantly added and stored, with low read latency still strongly desired. ACID overhead for all the data is high, and some users do not need all their data all the time, allowing consistency guarantees to be dropped in favor of faster performance. NoSQL does not always have ACID guarantees (they may be implemented by the developer at the application level), and better scalability helps maintain low read latency alongside redundant data.

NoSQL is currently the hot new solution to aid companies and developers in handling the challenges that Big Data brings.

2.2. DBMS types

Relational database systems based on SQL syntax have existed since the 1970s in many forms, and have been widely used as the primary means to host databases. Strong support for ACID and transactions makes them extremely dependable and resilient. In areas in which traceability and dependability are paramount (e.g. banking and finance), the framework of ACID will not be replaced by something else. The overhead of ACID makes SQL-based DBMS slower than alternatives that do not strictly conform to it.

Although the abbreviation NoSQL suggests the meaning "No SQL" or "non SQL", many software vendors prefer it to be viewed as "Not Only SQL", representing alternative methods for databases to store data [2].

The methods used by NoSQL-based DBMS are not new, and are as old as SQL-based ideas. However, the NoSQL products most popular today have been heavily influenced by Google BigTable, an in-house data storage system first introduced in [7,12].

Different methods of storing data include:

• column-store – columns consist of a unique name, value (content), and timestamp;
• document-oriented store – store objects as documents (frequently JSON);
• key-value store – use an associative array with a collection of key, value pairs;
• graph store – based on graph structure.

NoSQL does not guarantee ACID as SQL-based systems do, and is in fact sometimes classified as BASE – Basically Available, Soft state, Eventual consistency. Application developers have more leeway in the details of storage and implementation, at the risk of greater dependence on the developer. Of the two NoSQL DBMS used in our experiment, MongoDB is a document-oriented store, and Redis is an in-memory key-value store.

2.3. DBMS tested

Microsoft SQL Server 2014 is a continuation of Microsoft's long-running DBMS product line, with version 1.0 originally released in 1989. SQL Server is a relational DBMS offered in multiple tiers, ranging from the free Express to the paid Enterprise editions. The editions differ in available features, such as high availability, security, and replication options, and in database size and hardware support (e.g. limitations on the number of usable processors and cores). The edition used for testing is the Enterprise edition, whose limitations are bound by hardware and not software. The many extra features such as the previously mentioned high availability and replication are not used. The version used is 12.0.4100.1. The Python library used is pymssql 2.1.1.

PostgreSQL is a relational DBMS developed officially by the PostgreSQL Global Development Group, consisting of different companies and individuals [15]. The software is free and open source, released under the PostgreSQL License. The version used is 9.4.1 64-bit. The Python library used is psycopg2 2.6.1.

MongoDB is a document-oriented database with dynamic schemas, originally released in its current form as a standalone database product in 2009. It is released under the GNU Affero General Public License and is free and open source software. As of this writing, it is the most popular database of its type [6]. The version used is 3.0.1. The Python library used is pymongo 3.0.3.

Redis is an in-memory data structure server, supporting several different types of data structures such as lists, sets, hashes and bit arrays. The entire dataset is hosted in physical memory, with changes saved to disk at user-specified intervals. The version used is 2.8, using the Windows port, despite Linux being Redis's only officially supported platform. This decision was made to keep its environment similar to the other DBMS already running on Windows. The Python library used is redis 2.10.3.

2.4. Related work

There are several papers in the literature that compare NoSQL databases and relational DBMS in terms of types of NoSQL data store, query languages used in NoSQL, and advantages and disadvantages of NoSQL over relational DBMS [2,4,8,14]. However, very few papers have dealt specifically with benchmarking and comparing SQL and NoSQL DBMS. One such paper is [10], which only used datasets sized from 148 to 12,416 tuples/documents, and only compared MongoDB to SQL Server. Paper [1] benchmarked only the NoSQL DBMS MongoDB, ElasticSearch, OrientDB, and Redis against each other, with dataset sizes ranging from 1,000 to 100,000 records.

There are more informal sources for comparisons, specifically on personal and company blogs. An older but apparently well-known benchmark is the Yahoo! Cloud Serving Benchmark (YCSB) [5], an open source tool initially written in 2010 to test many different types of DBMS, including SQL and NoSQL DBMS. Some of the workloads defined by that benchmark have also been adopted for testing in this paper. The scope of our work differs from YCSB in that YCSB emphasizes scalability on multiple nodes and processes, and requires elasticity and high availability. Our testing is purposely confined to a single node with no such extra capabilities, rendering YCSB unsuitable for our specific needs.

The computer software company Datastax, which offers an enterprise distribution of the NoSQL DBMS Cassandra, performed an evaluation of three NoSQL DBMS: Cassandra, HBase and MongoDB [3]. Their testing utilized YCSB and revealed Cassandra to outperform the other DBMS products – Cassandra also being the product that the company offers as its primary business. Though we do not necessarily believe their results to be misleading, we retain some skepticism in favor of a less biased outlook.

3. Objectives

In addressing the aforementioned challenges of storing and accessing big data, traditional DBMS have difficulty meeting performance requirements. In this paper, we propose to evaluate the performance of four popular DBMS: Microsoft SQL Server, PostgreSQL, MongoDB and Redis. In particular, we test on Twitter datasets varying from one million to fifty million tweets. Our performance evaluation is aimed to fill the gap between NoSQL-only testing with small dataset sizes and extremely large scalable cloud systems. Specifically, we look at comparisons between SQL and NoSQL with large dataset sizes, without scalability and elasticity requirements.

4. Evaluation methodology

4.1. Workload overview

Each workload test (detailed below) involves running a core set of commands (shown in Table 1) 500 times, which is then looped and timed 100 times. Thus, the raw data will show 100 points of data. The dataset involved consists of 50 million tweets with unique tweet IDs (i.e., the "primary" dataset), with tests that use a smaller set of data using a portion of the primary dataset.

For each command requiring something to find (e.g. READ and UPDATE), a list of 500 tweet IDs of existing tweets is randomly chosen. This single list of random tweet IDs is the same for each workload test at a given size. For example, the tests using the 1 million tweet dataset will have a list of 500 tweet IDs spanning its range, and the tests using the 5 million tweet dataset will use a different list that spans the larger range of tweets.

The INSERT command uses a separate, smaller dataset (i.e., the "secondary" dataset) for its tests. The number of tweets is much smaller since the workload does not require many, though the JSON fields remain the same.

4.2. Workload descriptions

Workload A (50/50 R/W) uses UPDATE commands exclusively: it searches for 500 tweets using the tweet_id, and updates the user_name field with the string "aaabbb". Workload B (95/5 R/W) uses 450 READ commands searching for the tweet_id, and 50 UPDATE commands updating the user_name field with the string "bbbbbb". Workload C (100 R) uses 500 READ commands, searching using the tweet_id field. Workload Write (100 W) uses 1000 INSERT commands from the secondary dataset, inserted into a new (empty) table that is dropped (TRUNCATE or equivalent) at the end of each loop. This is done to keep each insertion "fresh".

4.3. Schema setup

Table 2 shows the database schema across all four DBMS. MongoDB specifically had an index created on the tweet_id field, as it appears to be proper setup to create indices for fields that are intended to be searched. Due to hardware constraints, the 25-million and 50-million datasets could not be tested in Redis.

4.4. Test platform specifications

The operating system was installed on the Samsung 850 PRO, and all virtual machines (VMs) were hosted only on the Samsung 840 EVO. VMs were managed through the included Hyper-V Manager. VM checkpoints were not utilized, to save on disk space.

Each DBMS product was tested in its own VM with its own reserved resources. During testing, only the DBMS being tested had its VM powered on – the others were powered off. Due to disk space constraints on

Table 1
Core commands

Command    SQL Server & PostgreSQL    MongoDB               Redis
READ       SELECT                     find_one              HMGET
UPDATE     UPDATE                     update_one            HSET
INSERT     INSERT                     insert_one            HMSET
TRUNCATE   TRUNCATE                   db.collection.drop    FLUSHDB

Table 2
Database schema

DBMS         tweet_id                created_at                    user_id    user_name    text
SQL Server   bigint (primary key)    datetime                      bigint     text         text
PostgreSQL   bigint (primary key)    timestamp without timezone    bigint     text         text
MongoDB      None specified, inserted as JSON
Redis        Key set via hashes      Value set via hashes
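As a concrete illustration of Tables 1 and 2, the sketch below (ours, not the paper's test code) shows how a single tweet record could be shaped for each storage model: a parameter tuple for a parameterized relational INSERT, a schemaless document for MongoDB, and a per-tweet hash for Redis. The helper names and the `tweet:<id>` key format are illustrative assumptions.

```python
# Sketch: one tweet record mapped onto the storage models of Table 2.
# Field names (tweet_id, created_at, user_id, user_name, text) follow
# the paper's schema; everything else is our own illustration.

def to_sql_params(tweet):
    """Parameter tuple for a parameterized INSERT (SQL Server / PostgreSQL)."""
    return (tweet["tweet_id"], tweet["created_at"],
            tweet["user_id"], tweet["user_name"], tweet["text"])

def to_mongo_doc(tweet):
    """MongoDB stores the record as-is; no schema is declared up front."""
    return dict(tweet)

def to_redis_hash(tweet):
    """Redis key-value model: one hash per tweet, keyed by tweet_id.
    Hash field values are strings, as Redis stores them."""
    key = "tweet:%d" % tweet["tweet_id"]
    fields = {k: str(v) for k, v in tweet.items() if k != "tweet_id"}
    return key, fields

tweet = {"tweet_id": 123456789, "created_at": "2015-04-01 12:00:00",
         "user_id": 42, "user_name": "example", "text": "hello"}

print(to_redis_hash(tweet)[0])  # tweet:123456789
```

The Redis hash layout is one plausible reading of "key set via hashes" in Table 2; the paper does not spell out its exact key scheme.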
the SSD, hard disk files of VMs had to be moved back and forth between the OS SSD and VM SSD. The details are shown in Tables 3 and 4.

Table 3
Virtual machine host

Operating System    Windows Server 2012 R2 Datacenter
CPU                 Intel i7-4790 @ 3.60 GHz
Memory              32 GB DDR3-1600
SSD                 Samsung 850 PRO 256 GB (OS)
                    Samsung 840 EVO 250 GB (VMs)

Table 4
Virtual machine setup

Operating System                             Windows 7 Professional SP1 x64
Disk Size                                    90 GB
Memory                                       16384 MB
Virtual Processors                           4
Virtual machine reserve (resource control)   100

5. Data preprocessing

Figure 1 describes how the data was collected, and how it was prepared into the datasets used in testing.

5.1. Gathering data

The dataset was gathered using Python through Twitter's Streaming API, and each collected tweet was written to a single text file. The filter for tweets contained only common words such as "a", "the" and "people", resulting in a quick collection of data. Tweepy, the API library used, does not run without a filter. The resulting dataset contained roughly 1.2 million unique tweets at a disk size of 4.8 GB. There is also a second, much smaller dataset with 18,994 unique tweets at a disk size of 82 MB that is used in Workload Write.

5.2. Filtering data

The dataset was then imported with a filter into PostgreSQL using its built-in JSON support. During import, only the tweet_id, created_at, user_id, user_name and tweet text fields were stored. These fields were chosen because they were guaranteed to be populated, and to keep the resulting dataset smaller and more manageable.

After the import into PostgreSQL, the data was then exported as JSON, now containing only the above five fields. This dataset measured only 287 MB, with the same number of tweets as the original dataset.
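The paper performs this field reduction inside PostgreSQL; the sketch below is our own Python equivalent, shown only to make the step concrete. The raw field names (`id`, `created_at`, `text`, `user.id`, `user.name`) are the standard Twitter API names for these values; the function name is ours. Lines the JSON decoder rejects are simply dropped, mirroring the cleanup of malformed records described in the next section.

```python
import json

def filter_tweet_line(line):
    """Parse one raw streamed tweet and keep only the five fields used
    in the paper (tweet_id, created_at, user_id, user_name, text).
    Returns None for lines the JSON decoder cannot parse."""
    try:
        raw = json.loads(line)
    except ValueError:  # malformed record: drop it
        return None
    return {
        "tweet_id": raw["id"],
        "created_at": raw["created_at"],
        "user_id": raw["user"]["id"],
        "user_name": raw["user"]["name"],
        "text": raw["text"],
    }

lines = [
    '{"id": 1, "created_at": "c", "text": "t", "user": {"id": 9, "name": "n"}}',
    'not json at all',
]
filtered = [t for t in map(filter_tweet_line, lines) if t is not None]
print(len(filtered))  # 1
```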
Fig. 2. Duplicating tweets commands.

There was an issue with Python's JSON decoder on some of the tweets regarding delimiters. Due to this issue, the dataset was filtered again through PostgreSQL to remove the offending records. The dataset was then trimmed to 1,170,302 tweets.

The 287 MB primary dataset is now the input file for the next step of processing. The secondary dataset is trimmed to 4.2 MB.

6. Experimental results and analysis

We limited our dataset size to 50 million tweets due to hardware constraints. Specifically, we found out

6.1. Workload A (50/50 R/W)

Tables 5–7 represent the time taken to perform Workload A, in milliseconds (lower is better). Figures 3(a)–(e) show the box plots; the red dot represents the mean and the red line represents the median.

Table 5
Mean time taken to perform Workload A

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   184.68     286.82     183.27      162.49      190.72
PostgreSQL   601.25     615.94     610.32      601.09      608.9
MongoDB      448.9      467.8      458.06      489.6       489.72
Redis        86.72      96.4       152.19      –           –

Table 6
Standard deviation of time taken to perform Workload A

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   22.05      229.58     63.34       48.63       64.58
PostgreSQL   22.28      62.33      55.71       55.84       57.35
MongoDB      10.92      24.48      20.14       25.95       18.10
Redis        25.76      29.71      25.80       –           –
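The measurement scheme of Section 4.1 (a core command run per tweet ID in a 500-ID batch, with the batch timed 100 times to yield 100 data points) can be sketched as follows. This is our own minimal harness with an in-memory dictionary standing in for a real DBMS client, not the paper's actual test code.

```python
import time
import statistics

def time_workload(op, ids, loops=100):
    """Run `op` once per ID in `ids`, timing the whole batch; repeat
    `loops` times so each loop contributes one data point (in ms)."""
    points = []
    for _ in range(loops):
        start = time.perf_counter()
        for tweet_id in ids:
            op(tweet_id)
        points.append((time.perf_counter() - start) * 1000.0)
    return points

# Stand-in for a real READ against a DBMS client library.
store = {i: "tweet %d" % i for i in range(1000)}
read_op = store.get

points = time_workload(read_op, list(range(500)), loops=100)
mean_ms = statistics.mean(points)
stdev_ms = statistics.stdev(points)
print(len(points))  # 100
```

The mean and standard deviation computed over the 100 points correspond to the quantities reported in Tables 5 and 6.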
[Figure 3(a)–(e): box plots for Workload A.]

Table 7
Mean % speed to perform Workload A

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           301%        234%     55%
PostgreSQL       33%         –           78%      18%
MongoDB          43%         129%        –        24%
Redis            180%        544%        421%     –

As we can see, SQL Server performed considerably better at this UPDATE test than PostgreSQL and even MongoDB, but is still marginally beaten by, or similar to, Redis. It suffered from much greater variance compared to the other DBMS, suggesting it is sensitive to background processes or to how the database is loaded into memory. PostgreSQL also suffered from an outlier at each dataset size, but had fewer than SQL Server and numbered similarly to MongoDB. MongoDB's performance is only slightly better than PostgreSQL's, and it seems that its writes, even without ACID transaction log overhead, are not that fast. For the first three dataset sizes that Redis participated in, it was generally the best of all four DBMS.

6.2. Workload B (95/5 R/W)

Tables 8–10 represent the time taken to perform Workload B. Figures 4(a)–(e) show the box plots. MongoDB is by far the best, as expected for a test heavily emphasizing reads, consistently beating everyone else with the exception of Redis at 10 million. The variance at 10 million for Redis is odd; viewing the trend from 1 million to 10 million, it looks like it may have continued its poor consistency. SQL Server's increased performance at 10 million is strange, even with fewer outliers compared to other sizes. However, it was less consistent than PostgreSQL in performance.

Table 8
Mean time taken to perform Workload B

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   400.78     395.31     236.72      349.53      413.9
PostgreSQL   242.82     242.81     241.25      244.85      244.21
MongoDB      75.89      74.31      74.95       78.35       76.25
Redis        167.03     162.19     58.75       –           –

Table 9
Standard deviation of time taken to perform Workload B

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   36.09      31.11      18.61       20.80       29.10
PostgreSQL   12.38      11.57      12.45       13.92       12.11
MongoDB      3.58       5.07       4.28        9.69        6.35
Redis        28.67      25.93      32.59       –           –

Table 10
Mean % speed to perform Workload B

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           68%         21%      36%
PostgreSQL       148%        –           31%      53%
MongoDB          473%        320%        –        170%
Redis            278%        188%        59%      –

6.3. Workload C (100 R)

Tables 11–13 represent the time taken to perform Workload C. Figures 5(a)–(e) show the box plots. SQL Server variance is very large for the 1 million and 25 million sizes, and not so much for the 50 million and 10 million sizes. We get the impression that SQL Server is quite sensitive to something that the other DBMS are not. Every size test is run sequentially, so it is strange that only SQL Server experiences this.

MongoDB takes the lead in both performance and consistency in this read-only test, performing even better than in Workload B since there are no writes. PostgreSQL and Redis swap places on which is the most consistent, but when Redis is not present, PostgreSQL places second behind MongoDB, ahead of SQL Server.

Table 11
Mean time taken to perform Workload C

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   275.31     354.85     234.37      267.03      355.78
PostgreSQL   222.97     223.75     220.47      224.07      220.78
MongoDB      27.37      30.38      28.74       29.58       29.82
Redis        170.47     174.53     78.28       –           –

Table 12
Standard deviation of time taken to perform Workload C

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   95.31      19.73      14.98       79.47       15.61
PostgreSQL   11.52      11.92      12.78       10.29       11.65
MongoDB      4.43       3.34       3.71        3.32        3.04
Redis        10.18      12.75      32.98       –           –

Table 13
Mean % speed to perform Workload C

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           75%         10%      47%
PostgreSQL       134%        –           13%      63%
MongoDB          1019%       762%        –        484%
Redis            211%        158%        21%      –

6.4. Workload Write (100 W)

Tables 14–15 represent the time taken to perform Workload Write. Figure 6 shows the box plots. SQL Server is by far the slowest at writes. The test was re-run several times, and performance was consistently bad compared to the other DBMS. PostgreSQL has a strong showing, trailing MongoDB and Redis slightly. Redis has the best write performance, but not significantly so considering its in-memory advantage. Redis also needs time to write from memory to disk (either at shutdown or at certain time increments), so
[Box plots (a)–(e).]

Table 14
Mean time taken to perform Workload Write

Mean (ms)    1000 inserts
SQL Server   2069.05
PostgreSQL   633.13
MongoDB      465.95
Redis        419.03

Table 15
Mean % speed to perform Workload Write

% speed of   SQL Server  PostgreSQL  MongoDB  Redis
SQL Server   –           31%         23%      20%
PostgreSQL   322%        –           73%      66%
MongoDB      443%        137%        –        91%
Redis        489%        152%        110%     –
[Box plots (a)–(e).]

it could be argued that MongoDB could be placed first overall.

6.5. Analysis

SQL Server's overall performance is not consistent across the workloads. Workload A's performance was almost uncharacteristically good, and was not repeated in any other workload. Workload Write's performance was abysmal despite conditions identical to the other DBMS. We theorize that SQL Server is picky about its running conditions and needs some tuning or, at the very least, guidance on proper setup to achieve performance that default installations will not provide.

PostgreSQL was consistently average or slightly above average in all tests, making it a predictable
F. Leung and B. Zhou / Performance evaluation of Twitter datasets on SQL and NoSQL DBMS 285
[11] Production Notes, http://docs.mongodb.org/manual/ NoSQL databases, International Journal of Computer Appli-
administration/production-notes/. cations (0975–888) 48(20) (2012).
[12] S. Ramanathan, S. Goel and S. Alagumalai, Comparison of [15] The PostgreSQL Global Development Group, http://www.
cloud database: Amazon’s SimpleDB and Google’s Bigtable, postgresql.org/community/contributors/.
in: Recent Trends in Information Systems (ReTIS), 2011 Inter- [16] S. Zillner, T. Becker, R. Munné, K. Hussain, S. Rusitschka,
national Conference on, 2011, pp. 165–168. H. Lippell, E. Curry and A.K. Ojo, Big data-driven innovation
[13] Seismic Waves, https://xkcd.com/723/. in industrial sectors, 2016, pp. 169–178.
[14] C.J. Tauro, S. Aravindh and A.B. Shreeharsha, Comparative
study of the new generation, agile, scalable, high performance