Editor’s Note

For developers by developers

Dear Readers!

In this issue you’ll find loads of information and inspiration, because we write about a very interesting topic: MongoDB. MongoDB (from “humongous”) is an open source, document-oriented NoSQL database system written in C++. It is characterized by high scalability, high performance and the absence of a rigidly defined structure for the databases it manages.

In this issue you’ll find six fantastic articles.

After reading the first article, written by Krishnachytanya Ayyagari, you will be able to say No to SQL. NoSQL databases won’t replace relational databases, but will instead become a better option for certain types of projects. People will learn to look at their data and be able to choose from many databases for many needs. There will be a growing realization that the relational databases in use today are often good tools, but that other tools have their place as well.

In the article entitled “MongoDB Emerges as a NoSQL Leader”, Ric Johnson discusses core MongoDB topics from a technical and an administrative point of view. The basic objective of this article is to impart knowledge about a large-scale data store that scales easily with the support of replication.

From the third article, “MongoDB for an Open-Data Portal”, written by Stefan Edlich, Marc Boeker and Sonam Singh, you’ll learn, among other things: why MongoDB is the leading NoSQL database, the diversity of features and APIs this database offers, code examples that show how to interact with the database and, finally, why MongoDB has a lot of tools to ensure production use without much effort.

Shane R. Spencer wrote the article “Advanced Atomic Batch Information Processing”. From it you will learn about unique MongoDB batch processing techniques, traditional batch processing techniques, MongoDB atomic modifications and RDBMS atomic gotchas.

In the next article, “MongoDB”, written by Muhammad Idrees, you can read about: a basic idea of what MongoDB is, some cool features of this NoSQL database, an introduction to the client shell, and getting used to the basic database operations and data types supported by MongoDB.

Dileepa Jayathiloha, Ashan Fernando and Charith Sooriyaarachchi wrote the article “Thinking Big to Deal with Big Data: A Practical Insight into MongoDB”. Document databases in general, and MongoDB in particular, come in very handy when attacking problems where organized data with little or no schema needs to be dealt with.

I would like to thank our great experts and specialists in the MongoDB field; thanks to them we can publish the MongoDB issue today.

Angelika Gucwa and SDJ team

Managing: Angelika Gucwa angelika.gucwa@software.com.pl
Senior Consultant/Publisher: Paweł Marciniak
Editor in Chief: Grzegorz Tabaka grzegorz.tabaka@software.com.pl
Art Director: Patrycja Przybyłowicz patrycja.przybylowicz@software.com.pl
DTP: Patrycja Przybyłowicz
Production Director: Andrzej Kuca andrzej.kuca@software.com.pl
Marketing Director: Angelika Gucwa angelika.gucwa@software.com.pl
Proofreaders: Michael Munt, Nick Baronian, Dan Dieterle, Patrik Gange, Aby Rao, Jeffrey Smith
Betatesters: Paweł Brzęk, Francesco Consiglio, Keith DeBus, Demazy Mbella, Matteo Massaro, Arthur Tumanyan
Publisher: Hakin9 Media Sp. z o.o. SK, 02-682 Warszawa, ul. Bokserska 1, www.en.sdjournal.org

Whilst every effort has been made to ensure the high quality of the magazine, the editors make no warranty, express or implied, concerning the results of content usage. All trade marks presented in the magazine were used only for informative purposes. All rights to trade marks presented in the magazine are reserved by the companies which own them.

To create graphs and diagrams we used program by
Mathematical formulas created by Design Science MathType™

DISCLAIMER!
The techniques described in our articles may only be used in private, local networks. The editors hold no responsibility for misuse of the presented techniques or consequent data loss.


4/2012

Table of Contents

MongoDB

Say No to SQL
by Krishnachytanya Ayyagari
As the title says, let’s say No to SQL. OK, this simple statement raises many questions in the development community. To quote a few of them: Why do we need to say that? What motivates us to say so? What are the reasons? What benefits do we get if we say this? What is the relation between this agenda and MongoDB? And a bunch more. In this article we will discuss exactly these questions, and present a detailed survey of databases that use NoSQL along with an overview of NoSQL databases.

MongoDB Emerges as a NoSQL Leader
by Ric Johnson
In 2007, Eliot Horowitz and the 10gen team started with a concept. They wanted to engineer a tool that would combine the best features of traditional, relational databases and make them work in a distributed platform designed to combine elasticity, scalability and easy administration in a way tailored for modern web applications. The concept evolved into MongoDB.

MongoDB for an Open-Data Portal
by Stefan Edlich, Marc Boeker, Sonam Singh
Besides Hadoop, MongoDB is the leading NoSQL database because it is feature-rich and responds quickly to its community. We chose MongoDB to build an Open-Data Platform / a Marketplace for Data. In this article we introduce MongoDB with all its features and investigate how these features are useful for our needs. Practical experiences in creating and running such a platform will be presented, along with outstanding new features MongoDB recently introduced.

Thinking Big to Deal with Big Data: A Practical Insight into MongoDB
by Dileepa Jayathiloha, Ashan Fernando, Charith Sooriyaarachchi
NoSQL databases have become a popular topic in enterprise data architecture for the web and the cloud. MongoDB is one of the most popular open source pillars in this NoSQL family. NoSQL databases can be categorized into four classes: key-value, big table, document-oriented and graph; MongoDB falls under document-oriented databases. This article presents a practical insight into MongoDB while focusing on a case study where we detail the technical solution we implemented using MongoDB for a commercial problem. We explain how the problem was attacked using the strengths of MongoDB, along with a comparison with RDBMS and other NoSQL models. We also provide a pragmatic guide on when and where to use MongoDB.

Advanced Atomic Batch Information Processing
by Shane R. Spencer
Databases can be seen as reliable work queues when information that is inserted into them needs to be processed again in some way, regardless of how quickly or how often. The most common form of post processing is when information needs to be migrated from one database to another as a very simple synchronization to help with distributing load, performing a backup, or separating the information into more focused sets.

MongoDB
by Muhammad Idrees
MongoDB is part of the “new” NoSQL family of database systems. Instead of storing data in tables as is done in a “classical” relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. MongoDB is an open source, non-relational database system designed to meet the needs of modern Web 2.0 applications. It offers extensive built-in support for MapReduce-style aggregation, and geospatial indexes, to make data easier to aggregate and query. MongoDB has a developer-friendly data model, administrator-friendly configuration options, and natural-feeling language APIs provided by drivers and the database shell.


Say No to SQL
As the title says, let’s say No to SQL. OK, this simple statement raises many questions in the development community.

To quote a few of them: Why do we need to say that? What motivates us to say so? What are the reasons? What benefits do we get if we say this? What is the relation between this agenda and MongoDB? And a bunch more. In this article we will discuss exactly these questions, and present a detailed survey of databases that use NoSQL along with an overview of NoSQL databases.

What is the Concept of NoSQL

The term NoSQL describes a database system that is distributed, may not require fixed table schemas, usually avoids join operations, typically scales horizontally, does not expose a SQL interface and may be open source. Now, before we kick off the topic, let us see what two geeks, Mr. SQL and Mr. NoSQL, are discussing in Table 1.

Mr. SQL database says: I have a fixed layout. The structure of data in a relational database is predefined by the layout of the tables and the fixed names and types of the columns. That makes me more organized.
Mr. NoSQL database says: That is what makes you difficult to maintain, buddy.

Mr. SQL database says: OK, users can scale a relational database by running it on more powerful computers.
Mr. NoSQL database says: Exactly… more powerful and more expensive. And to scale beyond a certain point, though, you must be distributed across multiple servers. Also, you don’t work easily in a distributed manner, because joining your tables across a distributed system is difficult, as said by my friend, Craigslist software engineer Jeremy Zawodny.

Mr. SQL database says: Why can’t he then distribute me across different processors and work?
Mr. NoSQL database says: You aren’t designed to function with data partitioning, so distributing your functionality is a chore. You can ask Stephen O’Grady about this, an analyst with market research firm RedMonk.

Mr. SQL database says: Fine then, what about the fact that I can withstand complex data and give users the flexibility to interact with it?
Mr. NoSQL database says: With you, users must convert all data into tables. When the data doesn’t fit easily into a table, your structure can be complex, difficult, and slow to work with.

Mr. SQL database says: Then use SQL. Using SQL is convenient with structured data.
Mr. NoSQL database says: As you said, using the SQL language with other types of information is difficult because it’s designed to work with structured, relationally organized databases with fixed table information, explained Stefan Edlich, professor at the Beuth University of Applied Sciences in Berlin. However, SQL can entail large amounts of complex code and doesn’t work well with modern, agile development, he said.

Mr. SQL database says: I offer a big feature set and data integrity.
Mr. NoSQL database says: Yes, I agree, but the problem here is that database users often don’t need all the features, or the cost and complexity they add.

Table 1: Conversation of our SQL geeks


MongoDB Emerges as a NoSQL Leader
In 2007, Eliot Horowitz and the 10gen team started with a concept. They wanted to engineer a tool that would combine the best features of traditional, relational databases and make them work in a distributed platform designed to combine elasticity, scalability and easy administration in a way tailored for modern web applications. The concept evolved into MongoDB.

Unique in a field of new NoSQL databases, MongoDB is rooted in Binary JSON, a lightweight JavaScript-based data exchange format designed to be easily traversable and efficient in encoding and decoding. MongoDB is well suited to cloud applications because of its document-oriented data model. It achieves speed and manageability through the use of embedded documents and allows for easy horizontal scalability because of its reduced reliance on joins. Its schema-free design also increases development agility. These features, combined with recent partnerships with high-profile, large-volume users like Craigslist, MTV, and Disney, have catapulted MongoDB to the forefront of NoSQL technology.

Featuring index performance enhancements, new querying and shell features, and a host of other upgrades in its March 2012 release, MongoDB is a robust, open-source database platform characterized by continuous improvement and cutting-edge technological advances.

If you want a highly optimized database that supports agile, scalable development in an open source environment, you need to delve into MongoDB, a high-performance, document-based NoSQL database that allows users to store structured data as JSON-like documents with dynamic schemas. Its ease of integrating data with other applications sets it apart in terms of functionality and support. The goal of MongoDB is to bridge the gap between key-value stores and relational databases.

10gen began MongoDB development in 2007, and in 2009 it emerged as an open source NoSQL product under an AGPL license. It was created by former DoubleClick Founder and CTO Dwight Merriman and former DoubleClick engineer and ShopWiki Founder and CTO Eliot Horowitz. They combined their vision and their experience developing large-scale, highly robust systems to create an innovative kind of database that inherits various features of relational databases, such as indexes and dynamic queries. The approach shifts from a relational to a document-based database, which adds further benefits such as improved agility through flexible schemas. The prominent feature of the MongoDB data model is a simplified coding structure that improves the performance of grouping data and also helps developers map objects from object-oriented languages without an ORM layer. The flexible document model increases productivity. MongoDB is specifically designed to work with commodity servers in an elastic, virtualized environment, to save cost while keeping data reliable.

The Reason for Using MongoDB

In an age where information flows so rapidly, organizations need a sustainable and durable database that can grow over time, speed up development and enable flexible deployment. MongoDB is a highly optimized document-based database with built-in support for horizontal scalability, which helps users get their applications running quickly. MongoDB has been designed to cater to Big Data: a database running on a single server will eventually reach a scaling limit, whereas MongoDB scales by adding more servers, so capacity can be added whenever you want. It is a robust technology, as it fully supports consistency and transactional updates. Data integrity is guaranteed through journaling and replication. Auto-sharding, one of its most commendable options, allows users to distribute data across multiple nodes. Replica sets give high availability with automatic failover and recovery of database nodes within or across data centers.


MongoDB for an Open-Data Portal
Besides Hadoop, MongoDB is the leading NoSQL database because it is feature-rich and responds quickly to its community. We chose MongoDB to build an Open-Data Platform / a Marketplace for Data. In this article we introduce MongoDB with all its features and investigate how these features are useful for our needs. Practical experiences in creating and running such a platform will be presented, along with outstanding new features MongoDB recently introduced.

In a current research project at Beuth University of Technology (App.Sc.) Berlin, we had to develop a new and innovative Open-Data Platform / a Marketplace for Data. We therefore evaluated all database solutions on the market so far and chose MongoDB because of its unmatched set of innovative features. In the following text we want to outline these features and how they support our requirements for an Open-Data Platform.

MongoDB on its Way to the Top

Select any statistic about NoSQL and you will see MongoDB in one of the first places, perhaps together with its strongest competitors Hadoop, Redis or Cassandra. Nevertheless, we also knew that in earlier versions MongoDB had some issues concerning durability and the scaling architecture, which is not based on consistent hashing. But MongoDB has an incredibly open development process with a public Jira instance, and its developers listen carefully to customers.

But back to the roots. Created in C++ by Dwight Merriman and Eliot Horowitz for web shops like ShopWiki.com, MongoDB now has one of the largest installation bases in the world, with far over 1000 remarkable sites such as SourceForge, Craigslist, SAP, Eventbrite, Springer, CERN, GitHub, Grooveshark, The New York Times and many more. The more important point is that there are over 100 MongoDB hosters, and MongoDB is on its way to becoming a standard for PaaS platforms together with Redis. And there must be a reason why. One is for sure that MongoDB is moving fast in its versions and maturity; another is that there are already at least eight books available from O’Reilly, Manning, Apress and more.

Another important point is that MongoDB [1] does not feel completely different to developers who have experience with MySQL:

• Basic / Unique / Compound Indexes
• Transactions in terms of atomic updates
• Stored Procedures in terms of server-side JavaScript execution
• Cursors
• Views in terms of stored MapReduce collections
• Replication
• Lots of Web-Frontends
• Distributed binary storage in terms of using GridFS

Even some kind of triggers are easily possible if you trace the MongoDB logs. So not many features will be missed, apart from, for example, real ACID.

Up and running

The MongoDB installation is a matter of minutes and couldn’t be easier. Data is organized in the following way:

Mongo-Instance → Database → Collection → Document

The documents in MongoDB are stored using JSON [1]. Internally they are transferred via BSON [2]. JSON consists of key-value pairs, with the possibility of nesting objects and arrays. MongoDB comes with a shell (mongo), and you have to take care not to mix up the database and the collection here, because you easily will. You start the server with some typical options like

mongod --dbpath F:\DATABASES\open-data -v --rest


Advanced Atomic Batch Information Processing
MongoDB has been a major interest for the author over the past year and will continue to be part of several professional and personal projects. Recently an obsession has formed with how to effectively and efficiently allocate documents for parallel batch processing using atomic operations within MongoDB documents.
Typical Cases for Post Processing Information

Databases can be seen as reliable work queues when information that is inserted into them needs to be processed again in some way, regardless of how quickly or how often. The most common form of post processing is when information needs to be migrated from one database to another as a very simple synchronization to help with distributing load, performing a backup, or separating the information into more focused sets.

For just about any long-term database project, archiving information becomes a necessity as well. Moving information away from active databases into archive databases often involves a lot of verification checks to ensure the information was copied properly before it is removed from the active database. This is a form of post processing that involves several steps and potentially requires extra fields on both the active and the archive database in order to mark the information as having passed verification and to record when it was migrated.

Software developers interested in full text indexing find that effective post processing allows them to check documents before submitting them to indexing servers. This is similar to migrating information from one database to another.

Call centers that require analytics on call volume data often separate information into multiple databases in order to create secure work environments for a specific information set. When recording of telephone conversations is required, it is important to know whether a recording exists on disk that matches one of the fields in the call record information. This is an example of post processing that looks for extra information outside of the database and possibly fills in a few fields before marking the information as processed. Recordings are typically compressed further as well, which may be part of the post-processing for each call record before the call can be submitted to another database or even shown in search results.

Batch vs. Individual Information Processing

When a database requires any level of post-processing, an extra field is used to denote if or when specific information has been processed. This field is typically indexed to speed up the selection process when looking for unprocessed information within a larger set. Both approaches, processing information in batches or as individual rows or documents, take advantage of this processing field and the inherent atomicity that comes with most databases to create a lock that keeps information from being processed more than once in both serial and parallel post-processing environments.

Individual information processing is simple and straightforward, and is often the first step when database administrators and software developers start adding post-processing to databases. What they soon find out is that individual queries for a single unlocked piece of information, as well as individual requests to handle locking, are highly inefficient when attempting to process information as fast as possible. It is important to understand that each individual database request forces the database server to parse the request, queue it up for processing, and then start at the beginning of the index when finding the information to process. By choosing to lock a set of information rather than a single piece of information, the number of database requests and subsequent index scans is reduced substantially.
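The locking idea described above can be sketched without a database. The following is a minimal in-memory illustration in plain JavaScript, not MongoDB code: the `claimBatch` and `finishBatch` helpers, the `processing` and `processed` fields and the worker names are all hypothetical. It marks a whole set of unprocessed items in one pass, so that a second worker asking for work is not handed the same items.

```javascript
// In-memory stand-in for a collection; all field names are illustrative assumptions.
const queue = [];
for (let i = 1; i <= 5; i++) {
  queue.push({ id: i, processing: null, processed: false });
}

// Claim up to `limit` unclaimed, unprocessed items for `worker` in a single pass.
// A real database would have to perform this marking step atomically.
function claimBatch(items, worker, limit) {
  const claimed = [];
  for (const item of items) {
    if (claimed.length === limit) break;
    if (item.processing === null && !item.processed) {
      item.processing = worker; // the "lock": later callers skip this item
      claimed.push(item);
    }
  }
  return claimed;
}

// Mark every item in a claimed batch as done and release its lock.
function finishBatch(batch) {
  for (const item of batch) {
    item.processed = true;
    item.processing = null;
  }
}

const batchA = claimBatch(queue, "worker-a", 3);
const batchB = claimBatch(queue, "worker-b", 3);
finishBatch(batchA);

console.log(batchA.map((d) => d.id)); // ids claimed by worker-a
console.log(batchB.map((d) => d.id)); // ids claimed by worker-b
```

Because this sketch runs in a single thread, the claim step cannot be interrupted; against a real server the same guarantee would have to come from the database's own atomic update operations.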

Simplified batch Information Processing

A common practice of batch processing involves a batch processing program that simply selects a limited set of information that doesn’t have a boolean field set to `true` which marks it as batch processed. (Listing 1) Once the batch processor is done with each selected piece of information it updates the database again to change the `batch_processed` mark. (Listing 2) This technique is very simple and very fast but it lacks scalability. Another reason to stay away from this is that

MongoDB
MongoDB is part of the “new” NoSQL family of database systems. Instead of storing data in tables as is done in a “classical” relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

MongoDB is an open source, non-relational database system designed to meet the needs of modern Web 2.0 applications. It offers extensive built-in support for MapReduce-style aggregation, and geospatial indexes, to make data easier to aggregate and query. MongoDB has a developer-friendly data model, administrator-friendly configuration options, and natural-feeling language APIs provided by drivers and the database shell. Not only that, MongoDB has several unique features, such as atomic updates and indexed array keys, that greatly influence the kinds of schemas that make sense. A single MongoDB node is able to comfortably serve thousands of requests per second on cheap hardware. When you need to scale beyond that, you can use either replication (keeping several copies of the data on different servers) or sharding (partitioning the data across servers). MongoDB even includes logic to automatically load-balance your shards as your database and load increase. Let’s take a look at some of the main features that make it a good choice:

• MongoDB is well suited to handling large volumes of data. When a traditional relational database system becomes too expensive in terms of system resources (for example time and space, since large volumes of data need more processing), MongoDB can be a better alternative.
• MongoDB supports asynchronous insert operations. The application code asks MongoDB to insert a document and moves on to the next task without waiting for the server to respond. This frees the application to do its work without being stuck on one long database operation, and so improves responsiveness for the user. This makes it an excellent tool for logging. For example, your website can process an HTTP request, log various details of the request (for example the time, user agent, cookie information etc.) in the database, and then generate the output. Since the insertion is asynchronous, the output generation continues without delay.
• MapReduce is an approach to data processing which has two significant benefits over more traditional solutions. The first, and main, reason it was developed is performance. The second benefit of MapReduce is that you can write real code to do your processing. MapReduce code is substantially richer and lets you go a long way before you need a more specialized solution. It is a powerful and flexible tool for data processing and can be considered another useful feature alongside the asynchronous behavior.
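To make the MapReduce idea above concrete, here is a minimal sketch in plain JavaScript. It does not call MongoDB; the `mapReduce` driver, the `mapRequest` and `reduceCounts` functions and the sample request log are all hypothetical, chosen to echo the logging example. The point is only the shape of the technique: a map step emits a key and a value per document, and a reduce step combines all values that share a key.

```javascript
// Hypothetical request-log documents, shaped like the logging example above.
const requests = [
  { path: "/", userAgent: "Firefox" },
  { path: "/about", userAgent: "Chrome" },
  { path: "/", userAgent: "Firefox" },
  { path: "/contact", userAgent: "Safari" },
  { path: "/", userAgent: "Chrome" },
];

// Map step: emit one [key, value] pair per document.
function mapRequest(doc) {
  return [doc.userAgent, 1];
}

// Reduce step: combine all values emitted for one key.
function reduceCounts(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Tiny driver: group the emitted pairs by key, then reduce each group.
function mapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of docs) {
    const [key, value] = mapFn(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const hitsByAgent = mapReduce(requests, mapRequest, reduceCounts);
console.log(hitsByAgent); // request count per user agent
```

Because the map and reduce steps are ordinary functions, the processing logic can be as rich as the language allows, which is the "real code" benefit mentioned above.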

Extensive Data Model

MongoDB is a document-oriented database, as opposed to a relational one. The primary reason for moving away from the relational model is to make scaling out easier, so that your application can scale with little effort. Apart from scaling, there are many other advantages as well. The basic idea is to replace the concept of a “row” with a more flexible model, the “document”. Each collection (a table in a relational database) holds a set of documents (think of documents as rows in a relational database). By allowing embedded documents and arrays, the document-oriented approach makes it possible to represent complex hierarchical relationships with a single record. This fits very naturally into the way developers in modern object-oriented languages think about their data objects. Developers can map the objects of their programming language directly to the database level, and have to think less about how to save an object’s state in the data store and how to retrieve it back into an object. This developer-friendly, rich data model increases development speed and reduces design complexity, making systems easier to communicate and implement.
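As an illustration of representing a hierarchical relationship with a single record, the sketch below contrasts one embedded document with the same data spread over flat row sets. It is plain JavaScript under stated assumptions: the order, customer and item data, and the helper names `orderTotal` and `assembleOrder`, are invented for the example and are not taken from the article.

```javascript
// One order stored as a single document: the customer is an embedded
// document and the line items are an embedded array. All names are illustrative.
const order = {
  orderId: 1001,
  customer: { name: "Ada", city: "Berlin" },
  items: [
    { sku: "book-01", qty: 2, price: 15 },
    { sku: "pen-07", qty: 10, price: 1.5 },
  ],
};

// The same data in flat, relational style needs separate row sets and join keys.
const orderRows = [{ orderId: 1001, customerId: 7 }];
const customerRows = [{ customerId: 7, name: "Ada", city: "Berlin" }];
const itemRows = [
  { orderId: 1001, sku: "book-01", qty: 2, price: 15 },
  { orderId: 1001, sku: "pen-07", qty: 10, price: 1.5 },
];

// Reading the document: no lookup across tables is required.
function orderTotal(doc) {
  return doc.items.reduce((sum, item) => sum + item.qty * item.price, 0);
}

// Reading the rows: the application has to stitch the pieces back together.
function assembleOrder(orderId) {
  const row = orderRows.find((r) => r.orderId === orderId);
  const customer = customerRows.find((c) => c.customerId === row.customerId);
  const items = itemRows
    .filter((i) => i.orderId === orderId)
    .map(({ sku, qty, price }) => ({ sku, qty, price }));
  return { orderId, customer: { name: customer.name, city: customer.city }, items };
}

console.log(orderTotal(order));
console.log(JSON.stringify(assembleOrder(1001)) === JSON.stringify(order));
```

The embedded form mirrors the object a developer would hold in memory, which is why little or no mapping code is needed to save and reload it.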


Thinking Big to Deal with Big Data: A Practical Insight into MongoDB
NoSQL databases have become a popular topic in enterprise data architecture for the web and the cloud. MongoDB is one of the most popular open source pillars in this NoSQL family. NoSQL databases can be categorized into four classes: key-value, big table, document-oriented and graph; MongoDB falls under document-oriented databases.

This article presents a practical insight into MongoDB while focusing on a case study where we detail the technical solution we implemented using MongoDB for a commercial problem. We explain how the problem was attacked using the strengths of MongoDB, along with a comparison with RDBMS and other NoSQL models. We also provide a pragmatic guide on when and where to use MongoDB.

Historical Background

MongoDB was developed as part of a PaaS product by 10gen, similar to Google App Engine. In 2007, 10gen started developing MongoDB inside the 10gen app engine. But in 2008 they decided to separate the database part from the app engine and make it open source. This was a milestone for MongoDB, because it started to gain users and proved to be a successful product.

Why MongoDB?

Flexible documents
Unlike other databases, especially RDBMS, MongoDB stores data in documents. Data entities in an RDBMS are “flat”, but MongoDB documents can contain composite fields such as arrays and hashes. MongoDB documents are stored as JSON objects. More precisely, the storage form is binary JSON, which the MongoDB community calls BSON. The size of a single document is limited to 16 MB in the current release and will be increased in the future.

// simple MongoDB document
var data = {name: "charith", company: "99X Technology"};
db.employers.save(data);
Schema free
Schema in MongoDB is very different from schema in an RDBMS. MongoDB can be considered a schema-free database, which means that different data structures can be stored in the same collection.

Agile development
Agile development is used by many software projects today. The agile process promotes short, iterative development life cycles. Using an RDBMS in an agile project is not always practical, because the agile approach often introduces changes. As discussed above, MongoDB is schema free. This property suits agile development well, because schema changes follow from requirement changes. Rapid database schema changes are no longer a problem with a schema-free database like MongoDB.
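The schema-free point can be illustrated with a small sketch in plain JavaScript, where an array stands in for a collection. The documents, the `withField` helper and the field names are hypothetical and only show the idea: documents with different shapes sit side by side, and a requirement change adds a field to new documents without touching the old ones.

```javascript
// A "collection" modeled as a plain array; documents need not share fields.
const employees = [
  { name: "alice", company: "Example Co" },
  { name: "bob", company: "Example Co", skills: ["MongoDB", "Java"] },
  { name: "carol", contact: { email: "carol@example.com" } },
];

// A requirement change adds a new field to new documents only;
// older documents are left untouched and no migration step is needed.
employees.push({ name: "dave", company: "Example Co", startYear: 2012 });

// Queries simply skip documents that lack the field being tested.
function withField(collection, field) {
  return collection.filter((doc) => field in doc);
}

console.log(withField(employees, "skills").map((d) => d.name));
console.log(withField(employees, "company").length);
```

The trade-off is that the application, rather than the database, becomes responsible for coping with documents that lack a given field.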

Cloud ready
MongoDB is ready to run on commodity hardware, in virtualized environments and in the cloud. The database is able to expand with whatever hardware is present.

High performance
MongoDB does not require acknowledgement for data writes. This is very important when writing big data to a server. Rather than costly joins it uses embedding, which makes reads and writes fast. Indexing enhances query performance. MongoDB supports indexing, even indexing of keys from embedded documents and arrays.

Horizontally scalable
When data size keeps growing, new types of complexity emerge. The solution with most technologies, such as RDBMS, is vertical scaling by buying bigger servers. MongoDB is horizontally scalable, which means that growing data can be handled by adding more servers. The advantage here is that servers do not need to be upgraded when the data set gets bigger. The problem can be dealt with by incrementally introducing suitable computing platforms.
