DOI 10.3233/WEB-160344
IOS Press
Abstract. Twitter has grown significantly in the past several years and provides a new vector for data collection, offering individual users and companies valuable insights. Collecting and analyzing all of this data efficiently presents a technical challenge. Traditional relational databases have not been able to provide acceptable response times for this new problem, and focus has started shifting to newer technologies such as NoSQL databases. In this paper, we try to answer the following question: "If I want to store and access millions of tweets for data analysis, which database system should I choose?" We selected four popular SQL and NoSQL database systems and tested them on Twitter datasets varying from one million to fifty million tweets. Each workload test involves running a core set of data operation commands. The experimental results are promising and provide guidelines for choosing the most efficient database system based on different user requirements.
2405-6456/16/$35.00 © 2016 – IOS Press and the authors. All rights reserved
276 F. Leung and B. Zhou / Performance evaluation of Twitter datasets on SQL and NoSQL DBMS
greSQL, MongoDB and Redis. Each DBMS has different levels of support for JSON: SQL Server has no native JSON support, and PostgreSQL only gained support in late 2013. NoSQL-based systems such as MongoDB launched in 2009 with native support in a derivative format called Binary JSON (BSON), and Redis allows storage of JSON but does not specifically support that data type natively. We explore and present our findings using datasets consisting of millions of tweets, and compare the performance of these four DBMS. Our experimental results show some interesting findings and provide guidelines on how to choose the most efficient DBMS based on different user needs.

2. Background and related work

2.1. Tweets and big data, collecting and storing

Twitter has grown significantly in the past several years and provides a new vector for data collection, offering companies valuable insight into their users and customers. This presents a technical challenge to collect and analyze all the data in an efficient manner. "Big Data", a term used to describe the massive influx of data being handled and studied (not exclusive to tweets), has been a large driving force for NoSQL adoption [16]. Traditional relational databases have not been able to provide acceptable response times for this new problem, and focus has started shifting to newer technologies such as NoSQL-based DBMS.

Big Data can be described in three V's: volume, variety and velocity. The sheer volume of data is difficult for SQL-based solutions because scalability was not one of the primary concerns decades ago. Though certainly scalable, database servers of the past were run on highly specialized hardware, needing more of the same or similar hardware to expand and scale out. NoSQL emphasizes horizontal scaling on commodity hardware, allowing a much lower cost to scale out.

Consistency, as a part of atomicity, consistency, isolation, durability (ACID), takes a back seat in NoSQL by not being guaranteed. Instead, redundant data is kept with "eventual consistency", which states that if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Redundant data also allows for lower read latency, as there is a higher chance of "finding" the data.

In SQL-based DBMS, the schema is set and defined in advance, and additional data added to the set must adhere to the schema. For guarantees and a system of checks on data, such as type, a strict schema can be helpful. NoSQL eschews this strictness and is flexible with its schema by not explicitly requiring conformity. This gives the developer/DBA more power (and responsibility) in maintaining the database by grouping together similar data without it all needing to be exactly the same. In some types of databases, such as key-value stores, the database itself knows nothing about the data it is storing.

New data is constantly added and stored, with low read latency still strongly desired. ACID overhead for all the data is high, and some users do not need all their data all the time, allowing consistency guarantees to be dropped in favor of faster performance. NoSQL does not always have ACID guarantees (they may be implemented by the developer at the application level), and better scalability helps maintain low read latency alongside redundant data.

NoSQL is currently the hot new solution to aid companies and developers in handling the challenges that Big Data brings.

2.2. DBMS types

Relational database systems based on SQL syntax have existed since the 1970s in many forms, and have been widely used as the primary means to host databases. Strong support for ACID and transactions makes them extremely dependable and resilient. In areas in which traceability and dependability are paramount (e.g. banking and finance), the framework of ACID will not be replaced by something else. The overhead of ACID makes SQL-based DBMS slower than alternatives that do not strictly conform to it.

Although the abbreviation NoSQL suggests the meaning "No SQL" or "non SQL", many software vendors prefer it to be viewed as "Not Only SQL", representing alternative methods for databases to store data [2].

The methods used by NoSQL-based DBMS are not new, and are as old as SQL-based ideas. However, the NoSQL products most popular today have been heavily influenced by Google BigTable, an in-house data storage system first introduced in [7,12].

Different methods of storing data include:

• column-store – columns consist of a unique name, value (content), and timestamp;
• document-oriented store – store objects as documents (frequently JSON);
• key-value store – use an associative array with a collection of key, value pairs;
• graph store – based on graph structure.

NoSQL does not guarantee ACID as SQL-based systems do, and is in fact sometimes classified as BASE – Basically Available, Soft state, Eventual consistency. Application developers have more leeway in the details of storage and implementation, at the risk of greater dependence on the developer. Of the two NoSQL DBMS used in our experiment, MongoDB is a document-oriented store, and Redis is an in-memory key-value store.

2.3. DBMS tested

Microsoft SQL Server 2014 is a continuation of Microsoft's long-running DBMS product line, with version 1.0 originally released in 1989. SQL Server is a relational DBMS offered in multiple tiers, ranging from the free Express to the paid Enterprise editions. The editions differ in available features, such as high availability, security, and replication options, and in database size and hardware support (e.g. limitations on the number of usable processors and cores). The edition used for testing is the Enterprise edition, whose limitations are bound by hardware and not software. The many extra features such as the previously mentioned high availability and replication are not used. The version used is 12.0.4100.1. The Python library used is pymssql 2.1.1.

PostgreSQL is a relational DBMS developed officially by the PostgreSQL Global Development Group, consisting of different companies and individuals [15]. The software is free and open source, released under the PostgreSQL License. The version used is 9.4.1 64-bit. The Python library used is psycopg2 2.6.1.

MongoDB is a document-oriented database with dynamic schemas, originally released in its current form as a standalone database product in 2009. It is released under the GNU Affero General Public License and is free and open source software. As of this writing, it is the most popular database of its type [6]. The version used is 3.0.1. The Python library used is pymongo 3.0.3.

Redis is an in-memory data structure server, supporting several different types of data structures such as lists, sets, hashes and bit arrays. The entire dataset is hosted in physical memory, with changes saved to disk at user-specified intervals. The version used is 2.8, using the Windows port, despite Linux being Redis's only officially supported platform. This decision was made to keep its environment similar to the other DBMS already running on Windows. The Python library used is redis 2.10.3.

2.4. Related work

There are several papers in the literature that compare NoSQL databases and relational DBMS in terms of types of NoSQL data store, query languages used in NoSQL, and advantages and disadvantages of NoSQL over relational DBMS [2,4,8,14]. However, very few papers have dealt specifically with benchmarking and comparing SQL and NoSQL DBMS. One such paper is [10], which only used datasets sized from 148 to 12,416 tuples/documents, and only compared MongoDB to SQL Server. Paper [1] benchmarked only the NoSQL DBMS MongoDB, ElasticSearch, OrientDB, and Redis against each other, with dataset sizes ranging from 1,000 to 100,000 records.

There are more informal sources for comparisons, specifically on personal and company blogs. An older but apparently well-known benchmark is the Yahoo! Cloud Serving Benchmark (YCSB) [5], an open source tool initially written in 2010 to test many different types of DBMS, including SQL and NoSQL DBMS. Some of the workloads defined by that benchmark have also been adopted for testing in this paper. The scope of our work differs from YCSB in that YCSB emphasizes scalability on multiple nodes and processes, and requires elasticity and high availability. Our testing is purposely confined to a single node with no such extra capabilities, rendering YCSB unsuitable for our specific needs.

The computer software company Datastax, which offers an enterprise distribution of the NoSQL DBMS Cassandra, performed an evaluation of three NoSQL DBMS: Cassandra, HBase and MongoDB [3]. Their testing utilized YCSB and revealed Cassandra to outperform the other DBMS products – Cassandra also being the product that the company offers as its primary business. Though we do not necessarily believe their results to be misleading, we retain some skepticism in favor of a less biased outlook.

3. Objectives

In addressing the aforementioned challenges of storing and accessing big data, traditional DBMS have difficulty meeting performance requirements. In this paper, we propose to evaluate the performance of four popular DBMS: Microsoft SQL Server, PostgreSQL, MongoDB and Redis. In particular, we test on Twitter datasets varying from one million to fifty million tweets. Our performance evaluation is aimed to fill the gap between NoSQL-only testing with small dataset sizes and extremely large scalable cloud systems. Specifically, we look at comparisons between SQL and NoSQL with large dataset sizes, without scalability and elasticity requirements.

4. Evaluation methodology

4.1. Workload overview

Each workload test (detailed below) involves running a core set of commands (shown in Table 1) 500 times, which is then looped and timed 100 times. Thus, the raw data will show 100 points of data. The dataset involved consists of 50 million tweets with unique tweet IDs (i.e., the "primary" dataset), with tests that use a smaller set of data using a portion of the primary dataset.

For each command requiring something to find (e.g. READ and UPDATE), a list of 500 tweet IDs of existing tweets is randomly chosen. This single list of random tweet IDs is the same for each workload test at a given size. For example, the tests using the 1 million tweet dataset will have a list of 500 tweet IDs spanning its range, and the tests using the 5 million tweet dataset will use a different list that spans the larger range of tweets.

The INSERT command uses a separate, smaller dataset (i.e., the "secondary" dataset) for its tests. The number of tweets is much smaller since the workload does not require many, though the JSON fields remain the same.

4.2. Workload descriptions

Workload A (50/50 R/W) uses UPDATE commands exclusively: it searches for 500 tweets using the tweet_id, and updates the user_name field with the string "aaabbb". Workload B (95/5 R/W) uses 450 READ commands searching for the tweet_id, and 50 UPDATE commands updating the user_name field with the string "bbbbbb". Workload C (100 R) uses 500 READ commands, searching using the tweet_id field. Workload Write (100 W) uses 1000 INSERT commands from the secondary dataset, inserted into a new (empty) table that is dropped (TRUNCATE or equivalent) at the end of each loop. This is done to keep each insertion "fresh".

4.3. Schema setup

Table 2 shows the database schema across all four DBMS. MongoDB specifically had an index created on the tweet_id field, as it appears to be proper setup to create indices for fields that are intended to be searched. Due to hardware constraints, the 25-million and 50-million datasets could not be tested in Redis.

4.4. Test platform specifications

The operating system was installed on the Samsung 850 PRO, and all virtual machines (VMs) were hosted only on the Samsung 840 EVO. VMs were managed through the included Hyper-V Manager. VM checkpoints were not utilized, to save on disk space.

Each DBMS product was tested in its own VM with its own reserved resources. During testing, only the DBMS being tested had its VM powered on – the others were powered off. Due to disk space constraints on

Table 1
Core commands

Command    SQL Server & PostgreSQL    MongoDB               Redis
READ       SELECT                     find_one              HMGET
UPDATE     UPDATE                     update_one            HSET
INSERT     INSERT                     insert_one            HMSET
TRUNCATE   TRUNCATE                   db.collection.drop    FLUSHDB

Table 2
Database schema

DBMS         tweet_id                created_at                    user_id    user_name    text
SQL Server   bigint (primary key)    datetime                      bigint     text         text
PostgreSQL   bigint (primary key)    timestamp without timezone    bigint     text         text
MongoDB      None specified, inserted as JSON
Redis        Key set via hashes      Value set via hashes
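As a concrete illustration of Tables 1 and 2, the sketch below (ours, not the paper's test code) shows how a single tweet record could be shaped for each storage model: a parameter tuple for a parameterized relational INSERT, a schemaless document for MongoDB, and a per-tweet hash for Redis. The helper names and the `tweet:<id>` key format are illustrative assumptions.

```python
# Sketch: one tweet record mapped onto the storage models of Table 2.
# Field names (tweet_id, created_at, user_id, user_name, text) follow
# the paper's schema; everything else is our own illustration.

def to_sql_params(tweet):
    """Parameter tuple for a parameterized INSERT (SQL Server / PostgreSQL)."""
    return (tweet["tweet_id"], tweet["created_at"],
            tweet["user_id"], tweet["user_name"], tweet["text"])

def to_mongo_doc(tweet):
    """MongoDB stores the record as-is; no schema is declared up front."""
    return dict(tweet)

def to_redis_hash(tweet):
    """Redis key-value model: one hash per tweet, keyed by tweet_id.
    Hash field values are strings, as Redis stores them."""
    key = "tweet:%d" % tweet["tweet_id"]
    fields = {k: str(v) for k, v in tweet.items() if k != "tweet_id"}
    return key, fields

tweet = {"tweet_id": 123456789, "created_at": "2015-04-01 12:00:00",
         "user_id": 42, "user_name": "example", "text": "hello"}

print(to_redis_hash(tweet)[0])  # tweet:123456789
```

The Redis hash layout is one plausible reading of "key set via hashes" in Table 2; the paper does not spell out its exact key scheme.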
the SSD, hard disk files of VMs had to be moved back and forth between the OS SSD and VM SSD. The details are shown in Tables 3 and 4.

Table 3
Virtual machine host

Operating System    Windows Server 2012 R2 Datacenter
CPU                 Intel i7-4790 @ 3.60 GHz
Memory              32 GB DDR3-1600
SSD                 Samsung 850 PRO 256 GB (OS)
                    Samsung 840 EVO 250 GB (VMs)

Table 4
Virtual machine setup

Operating System                             Windows 7 Professional SP1 x64
Disk Size                                    90 GB
Memory                                       16384 MB
Virtual Processors                           4
Virtual machine reserve (resource control)   100

5. Data preprocessing

Figure 1 describes how the data was collected, and how it was prepared into the datasets used in testing.

5.1. Gathering data

The dataset was gathered using Python through Twitter's Streaming API, and each collected tweet was written to a single text file. The filter for tweets contained only common words such as "a", "the" and "people", resulting in a quick collection of data. Tweepy, the API library used, does not run without a filter. The resulting dataset contained roughly 1.2 million unique tweets at a disk size of 4.8 GB. There is also a second, much smaller dataset with 18,994 unique tweets at a disk size of 82 MB that is used in Workload Write.

5.2. Filtering data

The dataset was then imported with a filter into PostgreSQL using its built-in JSON support. During import, only the tweet_id, created_at, user_id, user_name and tweet text fields were stored. These fields were chosen because they were guaranteed to be populated, and to keep the resulting dataset smaller and more manageable.

After the import into PostgreSQL, the data was then exported as JSON, now containing only the above five fields. This dataset measured only 287 MB, with the same number of tweets as the original dataset.
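The paper performs this field reduction inside PostgreSQL; the sketch below is our own Python equivalent, shown only to make the step concrete. The raw field names (`id`, `created_at`, `text`, `user.id`, `user.name`) are the standard Twitter API names for these values; the function name is ours. Lines the JSON decoder rejects are simply dropped, mirroring the cleanup of malformed records described in the next section.

```python
import json

def filter_tweet_line(line):
    """Parse one raw streamed tweet and keep only the five fields used
    in the paper (tweet_id, created_at, user_id, user_name, text).
    Returns None for lines the JSON decoder cannot parse."""
    try:
        raw = json.loads(line)
    except ValueError:  # malformed record: drop it
        return None
    return {
        "tweet_id": raw["id"],
        "created_at": raw["created_at"],
        "user_id": raw["user"]["id"],
        "user_name": raw["user"]["name"],
        "text": raw["text"],
    }

lines = [
    '{"id": 1, "created_at": "c", "text": "t", "user": {"id": 9, "name": "n"}}',
    'not json at all',
]
filtered = [t for t in map(filter_tweet_line, lines) if t is not None]
print(len(filtered))  # 1
```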
Fig. 2. Duplicating tweets commands.

There was an issue with Python's JSON decoder on some of the tweets regarding delimiters. Due to this issue, the dataset was filtered again through PostgreSQL to remove the offending records. The dataset was then trimmed to 1,170,302 tweets.

The 287 MB primary dataset is now the input file for the next step of processing. The secondary dataset is trimmed to 4.2 MB.

6. Experimental results and analysis

We limited our dataset size to 50 million tweets due to hardware constraints. Specifically, we found out

6.1. Workload A (50/50 R/W)

Tables 5–7 represent the time taken to perform Workload A, in milliseconds (lower is better). Figures 3(a)–(e) show the box plots; the red dot represents the mean and the red line represents the median.

Table 5
Mean time taken to perform Workload A

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   184.68     286.82     183.27      162.49      190.72
PostgreSQL   601.25     615.94     610.32      601.09      608.9
MongoDB      448.9      467.8      458.06      489.6       489.72
Redis        86.72      96.4       152.19      –           –

Table 6
Standard deviation of time taken to perform Workload A

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   22.05      229.58     63.34       48.63       64.58
PostgreSQL   22.28      62.33      55.71       55.84       57.35
MongoDB      10.92      24.48      20.14       25.95       18.10
Redis        25.76      29.71      25.80       –           –
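The measurement scheme of Section 4.1 (a core command run per tweet ID in a 500-ID batch, with the batch timed 100 times to yield 100 data points) can be sketched as follows. This is our own minimal harness with an in-memory dictionary standing in for a real DBMS client, not the paper's actual test code.

```python
import time
import statistics

def time_workload(op, ids, loops=100):
    """Run `op` once per ID in `ids`, timing the whole batch; repeat
    `loops` times so each loop contributes one data point (in ms)."""
    points = []
    for _ in range(loops):
        start = time.perf_counter()
        for tweet_id in ids:
            op(tweet_id)
        points.append((time.perf_counter() - start) * 1000.0)
    return points

# Stand-in for a real READ against a DBMS client library.
store = {i: "tweet %d" % i for i in range(1000)}
read_op = store.get

points = time_workload(read_op, list(range(500)), loops=100)
mean_ms = statistics.mean(points)
stdev_ms = statistics.stdev(points)
print(len(points))  # 100
```

The mean and standard deviation computed over the 100 points correspond to the quantities reported in Tables 5 and 6.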
[Figure 3(a)–(e): box plots for Workload A.]

Table 7
Mean % speed to perform Workload A

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           301%        234%     55%
PostgreSQL       33%         –           78%      18%
MongoDB          43%         129%        –        24%
Redis            180%        544%        421%     –

As we can see, SQL Server performed considerably better at this UPDATE test than PostgreSQL and even MongoDB, but is still marginally beaten by, or similar to, Redis. It suffered from much greater variance compared to the other DBMS, suggesting it is sensitive to background processes or to how the database is loaded into memory. PostgreSQL also suffered from an outlier at each dataset size, but had fewer than SQL Server and numbered similarly to MongoDB. MongoDB's performance is only slightly better than PostgreSQL's, and it seems that its writes, even without ACID transaction log overhead, are not that fast. For the first three dataset sizes that Redis participated in, it was generally the best of all four DBMS.

6.2. Workload B (95/5 R/W)

Tables 8–10 represent the time taken to perform Workload B. Figures 4(a)–(e) show the box plots. MongoDB is by far the best, as expected for a test heavily emphasizing reads, consistently beating everyone else with the exception of Redis at 10 million. The variance at 10 million for Redis is odd; viewing the trend from 1 million to 10 million, it looks like it may have continued its poor consistency. SQL Server's increased performance at 10 million is strange, even with fewer outliers compared to other sizes. However, it was less consistent than PostgreSQL in performance.

Table 8
Mean time taken to perform Workload B

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   400.78     395.31     236.72      349.53      413.9
PostgreSQL   242.82     242.81     241.25      244.85      244.21
MongoDB      75.89      74.31      74.95       78.35       76.25
Redis        167.03     162.19     58.75       –           –

Table 9
Standard deviation of time taken to perform Workload B

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   36.09      31.11      18.61       20.80       29.10
PostgreSQL   12.38      11.57      12.45       13.92       12.11
MongoDB      3.58       5.07       4.28        9.69        6.35
Redis        28.67      25.93      32.59       –           –

Table 10
Mean % speed to perform Workload B

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           68%         21%      36%
PostgreSQL       148%        –           31%      53%
MongoDB          473%        320%        –        170%
Redis            278%        188%        59%      –

6.3. Workload C (100 R)

Tables 11–13 represent the time taken to perform Workload C. Figures 5(a)–(e) show the box plots. SQL Server variance is very large for the 1 million and 25 million sizes, and not so much for the 50 million and 10 million sizes. We get the impression that SQL Server is quite sensitive to something that the other DBMS are not. Every size test is run sequentially, so it is strange that only SQL Server experiences this.

MongoDB takes the lead in both performance and consistency in this read-only test, performing even better than in Workload B since there are no writes. PostgreSQL and Redis swap places on which is the most consistent, but when Redis is not present, PostgreSQL places second behind MongoDB, ahead of SQL Server.

Table 11
Mean time taken to perform Workload C

Mean (ms)    1 million  5 million  10 million  25 million  50 million
SQL Server   275.31     354.85     234.37      267.03      355.78
PostgreSQL   222.97     223.75     220.47      224.07      220.78
MongoDB      27.37      30.38      28.74       29.58       29.82
Redis        170.47     174.53     78.28       –           –

Table 12
Standard deviation of time taken to perform Workload C

Stddev (ms)  1 million  5 million  10 million  25 million  50 million
SQL Server   95.31      19.73      14.98       79.47       15.61
PostgreSQL   11.52      11.92      12.78       10.29       11.65
MongoDB      4.43       3.34       3.71        3.32        3.04
Redis        10.18      12.75      32.98       –           –

Table 13
Mean % speed to perform Workload C

Mean % speed of  SQL Server  PostgreSQL  MongoDB  Redis
SQL Server       –           75%         10%      47%
PostgreSQL       134%        –           13%      63%
MongoDB          1019%       762%        –        484%
Redis            211%        158%        21%      –

6.4. Workload Write (100 W)

Tables 14–15 represent the time taken to perform Workload Write. Figure 6 shows the box plots. SQL Server is by far the slowest at writes. The test was re-run several times, and performance was consistently bad compared to the other DBMS. PostgreSQL has a strong showing, trailing MongoDB and Redis slightly. Redis has the best write performance, but not significantly so considering its in-memory advantage. Redis also needs time to write from memory to disk (either at shutdown or at certain time increments), so
[Box plots (a)–(e).]

Table 14
Mean time taken to perform Workload Write

Mean (ms)    1000 inserts
SQL Server   2069.05
PostgreSQL   633.13
MongoDB      465.95
Redis        419.03

Table 15
Mean % speed to perform Workload Write

% speed of   SQL Server  PostgreSQL  MongoDB  Redis
SQL Server   –           31%         23%      20%
PostgreSQL   322%        –           73%      66%
MongoDB      443%        137%        –        91%
Redis        489%        152%        110%     –
[Box plots (a)–(e).]

it could be argued that MongoDB could be placed first overall.

6.5. Analysis

SQL Server's overall performance is not consistent across the workloads. Workload A's performance was almost uncharacteristically good, and was not repeated in any other workload. Workload Write's performance was abysmal despite conditions identical to the other DBMS. We theorize that SQL Server is picky about its running conditions and needs some tuning or, at the very least, guidance on proper setup to achieve performance that default installations will not provide.

PostgreSQL was consistently average or slightly above average in all tests, making it a predictable
F. Leung and B. Zhou / Performance evaluation of Twitter datasets on SQL and NoSQL DBMS 285
[11] Production Notes, http://docs.mongodb.org/manual/ NoSQL databases, International Journal of Computer Appli-
administration/production-notes/. cations (0975–888) 48(20) (2012).
[12] S. Ramanathan, S. Goel and S. Alagumalai, Comparison of [15] The PostgreSQL Global Development Group, http://www.
cloud database: Amazon’s SimpleDB and Google’s Bigtable, postgresql.org/community/contributors/.
in: Recent Trends in Information Systems (ReTIS), 2011 Inter- [16] S. Zillner, T. Becker, R. Munné, K. Hussain, S. Rusitschka,
national Conference on, 2011, pp. 165–168. H. Lippell, E. Curry and A.K. Ojo, Big data-driven innovation
[13] Seismic Waves, https://xkcd.com/723/. in industrial sectors, 2016, pp. 169–178.
[14] C.J. Tauro, S. Aravindh and A.B. Shreeharsha, Comparative
study of the new generation, agile, scalable, high performance