Introduction to NoSQL databases
NoSQL, which stands for "Not Only SQL," is a category of database systems designed to handle large volumes of unstructured and semi-structured data. Unlike traditional relational databases, which rely on a structured schema and SQL for querying, NoSQL databases provide flexibility and scalability by allowing dynamic schemas and horizontal scaling across multiple servers. NoSQL databases are commonly grouped into the following types:
● Key-value pair
● Document-oriented
● Column-oriented
● Graph-based
● Time series
These types of databases are optimized for applications that need large data volumes, flexible data models, and low latency. To achieve these objectives, NoSQL databases employ various techniques, and it's important to note that not all NoSQL databases prioritize the same set of factors:
● Consistency
[Figure: an example of the kind of data you might store in a wide-column data store]
AWS offers Amazon Keyspaces (for Apache Cassandra) as a wide-column managed
database service.
4. Graph — Graph databases are used to store and query highly connected data. Data can be modeled in the form of entities (also referred to as nodes, or vertices) and the relationships between those entities (also referred to as edges). The strength or nature of the relationships also carries significant meaning in graph databases.
Users can then traverse the graph structure by starting at a defined set of nodes or edges and traveling across the graph, along defined relationship types or strengths, until they reach some defined condition. Results can be returned in the form of literals, lists, maps, or graph traversal paths. Graph databases provide query languages whose syntax is designed for traversing a graph structure or matching a certain structural pattern. (A minimal traversal sketch in Python appears after this list.)
[Figure: a social network graph. Given the people (nodes) and their relationships (edges), you can find out who the "friends of friends" of a particular person are; for example, the friends of Howard's friends.]
AWS offers Amazon Neptune as a managed graph database service.
5. Time series — A time series database is designed to store and retrieve data records that are sequenced by time: sets of data points that are associated with timestamps and stored in time order. Time series databases make it easy to analyze how measurements or events change over time; for example, temperature readings from weather sensors or intraday stock prices.
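As a minimal sketch of the friends-of-friends traversal mentioned in the graph example above, consider the following Python snippet. It uses a plain in-memory adjacency map rather than a real graph database such as Amazon Neptune, and the people and relationships are hypothetical.

```python
# Minimal friends-of-friends traversal over an in-memory adjacency map.
# A real graph database (e.g., Amazon Neptune) would express this in a
# graph query language such as Gremlin or openCypher instead.

friends = {  # hypothetical social network: person -> set of direct friends
    "Howard": {"Alice", "Bob"},
    "Alice": {"Howard", "Carol"},
    "Bob": {"Howard", "Dave"},
    "Carol": {"Alice"},
    "Dave": {"Bob"},
}

def friends_of_friends(graph, person):
    """Return nodes exactly two hops away, excluding the person and direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}

print(friends_of_friends(friends, "Howard"))  # {'Carol', 'Dave'}
```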
To navigate this landscape, numerous tools have been developed to streamline data management.
Among these tools are two prominent database management systems: Cassandra and MongoDB.
In this article, we will explore both systems' strengths and weaknesses to help compare
Cassandra vs. MongoDB and determine the optimal choice for your needs.
Key Takeaways:
1. Cassandra's distributed, decentralized design makes it ideal for managing extensive datasets across multiple commodity servers while ensuring high availability and fault tolerance.
2. MongoDB's flexible, document-based data model makes it suitable for applications with evolving data requirements and complex data structures.
3. The decision between Cassandra and MongoDB hinges on scalability needs,
consistency requirements, and data model complexity.
What Is Cassandra?
Apache Cassandra is an open-source, distributed NoSQL database built on a wide-column store model. Its decentralized, masterless architecture spreads data across the nodes of a cluster, making it well suited to write-heavy workloads that demand high availability and fault tolerance.
Pros
1. High Scalability: Cassandra can scale linearly to handle vast datasets, making it
suitable for large-scale applications.
2. High Availability: Its distributed architecture ensures high availability even during
node failures.
3. Flexible Data Model: The flexible data model allows for versatile data storage and
accommodates various use cases.
4. Tunable Consistency: Users have control over consistency levels, allowing them to tailor consistency requirements to application needs (see the sketch after this list).
5. Fast Writes: Cassandra excels in write-heavy workloads, providing low-latency write
operations.
6. Decentralized Architecture: The decentralized architecture eliminates single points of
failure, enhancing fault tolerance.
7. Community Support: Being open-source, Cassandra benefits from a large and active
community that provides support, documentation, and frequent updates.
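As an illustration of tunable consistency (item 4 above), here is a minimal sketch using the Python cassandra-driver package. The contact point, keyspace, and table are hypothetical; the consistency levels themselves (ONE, QUORUM) are standard Cassandra settings.

```python
# Minimal sketch: per-query tunable consistency with the Python cassandra-driver.
# Assumes a reachable Cassandra cluster and an existing demo.users table.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # hypothetical contact point
session = cluster.connect("demo")  # hypothetical keyspace

# Fast, weakly consistent write: only one replica must acknowledge.
fast_write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(fast_write, (1, "Ada"))

# Stronger read: a majority of replicas must agree before returning.
safe_read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(safe_read, (1,)).one().name)
```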
Cons
What Is MongoDB?
MongoDB is a NoSQL database that utilizes a flexible, document-based data model. Instead of
storing data in tables and rows like a traditional database, MongoDB stores data in collections of
JSON-like documents. These documents can vary in structure, allowing for dynamic schemas
and making them suitable for applications with evolving data requirements.
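To make the document model concrete, here is a minimal sketch using the Python pymongo driver; it assumes a MongoDB server on localhost, and the database, collection, and field names are hypothetical. Note that the two inserted documents deliberately have different structures.

```python
# Minimal sketch: dynamic schemas in MongoDB via pymongo.
# Assumes a MongoDB server listening on localhost:27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]  # hypothetical database/collection

# Two documents in the same collection with different fields:
products.insert_one({"name": "Laptop", "price": 999, "specs": {"ram_gb": 16}})
products.insert_one({"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]})

# Query across both shapes; documents lacking a field simply don't match.
for doc in products.find({"price": {"$lt": 100}}):
    print(doc["name"])
```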
Pros
1. Flexible Data Model: MongoDB's document-oriented data model makes it easy to
represent complex hierarchical data structures.
2. Scalability: MongoDB scales horizontally, allowing it to handle large volumes of data
and high traffic loads.
3. High Performance: MongoDB is designed for high throughput and low latency,
making it suitable for real-time applications.
4. Ease of Use: MongoDB's JSON-like documents and flexible schema make it easy for
developers to work with.
5. Community Support: MongoDB has a large and active community that provides
resources, tutorials, and support for developers.
6. Rich Query Language: MongoDB's query language supports various operations,
including complex aggregations and geospatial queries.
Cons
|                     | Cassandra                             | MongoDB                               |
| ------------------- | ------------------------------------- | ------------------------------------- |
| Cloud Offerings     | Compatible with major cloud providers | Compatible with major cloud providers |
| Languages Supported | Java, Python, C#, and more            | JavaScript, Python, Java, and more    |
Cassandra vs. MongoDB: Similarities
Both Cassandra and MongoDB are popular NoSQL databases that offer flexibility, scalability,
and high performance for handling large volumes of data. Despite their differences, they share
several similarities in terms of database type, data structure, scalability, and licensing:
Database Type
Both Cassandra and MongoDB fall under the umbrella of NoSQL databases, which means they depart from the traditional relational database model based on tables and SQL queries. NoSQL databases are designed to handle large-scale, distributed data sets and are often used in scenarios where flexibility, scalability, and performance are crucial.
Data Structure
Scalability
Licensing
● Cassandra is released under the Apache License 2.0, allowing users to use, modify,
and distribute the software without restrictions on usage or redistribution. This
permissive license has contributed to Cassandra's widespread adoption in various
industries.
● MongoDB is released under the Server Side Public License (SSPL), a copyleft license
derived from the GNU Affero General Public License (AGPL). The SSPL requires
users who offer MongoDB as a service to open-source the source code of their service,
ensuring that improvements to the software are shared with the community.
Cassandra vs. MongoDB: Which Should You Choose?
Choosing between Cassandra and MongoDB depends on your project's specific requirements and
characteristics. Cassandra might be better if you need a highly scalable, distributed database
optimized for write-heavy workloads and strong consistency. Cassandra's decentralized
architecture and support for linear scalability make it suitable for applications requiring high
availability and fault tolerance, such as real-time analytics and IoT platforms.
On the other hand, if your application requires flexible schema design, rich query capabilities,
and ease of development, MongoDB could be the preferred option. MongoDB's
document-oriented data model and powerful query language make it well-suited for applications
with evolving data requirements and complex data structures, such as content management
systems and e-commerce platforms. Ultimately, the decision between Cassandra and MongoDB
should be based on scalability needs, consistency requirements, data model complexity, and
developer preferences.
2. How do Cassandra and MongoDB store data?
Cassandra stores data using a partition key determined by the table's primary key, organizing
data across nodes in a cluster for high availability and fault tolerance. It utilizes a wide-column
store model. MongoDB stores data in BSON format (a binary representation of JSON
documents), with a dynamic schema that allows documents in a collection to have different fields
and structures.
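To illustrate how the primary key drives partitioning in Cassandra, here is a minimal sketch using the Python cassandra-driver; the contact point, keyspace, and table are hypothetical.

```python
# Minimal sketch: a Cassandra table whose partition key spreads rows
# across the cluster. Assumes a reachable cluster and a 'demo' keyspace.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,         -- partition key: decides which node owns the row
        reading_time timestamp, -- clustering column: orders rows within a partition
        value double,
        PRIMARY KEY ((sensor_id), reading_time)
    )
""")
```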
3. How do Cassandra and MongoDB compare in scalability?
Cassandra excels in scalability with its masterless architecture, allowing for seamless horizontal
scaling and no single point of failure. It is designed for distributed environments. MongoDB
supports horizontal scalability by sharding and distributing data across multiple servers but relies
on config servers and routing processes to manage the cluster.
4. How does the performance of Cassandra vs. MongoDB compare?
The performance of Cassandra and MongoDB can vary depending on the workload. Cassandra is
highly optimized for write-heavy workloads and can handle large volumes of data across many
commodity servers. MongoDB offers robust performance for read-heavy workloads and is
generally considered more versatile for a wider range of applications due to its
document-oriented model.
5. How do the communities and support for Cassandra vs. MongoDB compare?
The communities and support for both databases are strong, with large, active communities and
extensive documentation. MongoDB has a commercial entity behind it, offering professional
support and managed services, which can be a plus for enterprises. Cassandra, part of the Apache
Software Foundation, also has commercial support through third parties and is well-regarded in
the open-source community, ensuring good support options for both databases.
What is a data warehouse?
Data warehousing systems have been a part of business intelligence (BI) solutions for over three
decades, but they have evolved recently with the emergence of new data types and data hosting
methods. Traditionally, a data warehouse was hosted on-premises—often on a mainframe
computer—and its functionality was focused on extracting data from other sources, cleansing
and preparing the data, and loading and maintaining the data in a relational database. More
recently, a data warehouse might be hosted on a dedicated appliance or in the cloud, and most
data warehouses have added analytics capabilities and data visualization and presentation tools.
Data warehouse architecture
Bottom tier: The bottom tier consists of a data warehouse server, usually a relational
database system, which collects, cleanses, and transforms data from multiple data sources
through a process known as Extract, Transform, and Load (ETL) or a process known as
Extract, Load, and Transform (ELT). For most organizations that use ETL, the process
relies on automation, and is efficient, well-defined, continuous and batch-driven.
Middle tier: The middle tier consists of an OLAP (online analytical processing) server
which enables fast query speeds. Three types of OLAP models can be used in this tier,
which are known as ROLAP, MOLAP and HOLAP. The type of OLAP model used is
dependent on the type of database system that exists.
Top tier: The top tier is represented by some kind of front-end user interface or reporting
tool, which enables end users to conduct ad-hoc data analysis on their business data.
A short history of data warehouse architecture
Most data warehouses will be built around a relational database system, either on-premises or in
the cloud, where data is both stored and processed. Other components would include a metadata
management system and an API connectivity layer enabling the warehouse to pull data from
organizational sources and provide access to analytics and visualization tools.
A typical data warehouse has four main components: a central database, ETL tools, metadata and
access tools. All of these components are engineered for speed so that you can get results quickly
and analyze data on the fly.
The data warehouse has been around for decades. Born in the 1980s, it addressed the need to
optimize analytics on data. As companies’ business applications began to grow and
generate/store more data, they needed data warehouse systems that could both manage the data
and analyze it. At a high level, database admins could pull data from their operational systems
and add a schema to it via transformation before loading it into their data warehouse.
As data warehouse architecture evolved and grew in popularity, more people within a company
started using it to access data–and the data warehouse made it easy to do so with structured data.
This is where metadata became important. Reporting and dashboarding became a key use case,
and SQL (structured query language) became the de facto way of interacting with that data.
Components of data warehouse architecture
Let's take a closer look at each component.
ETL
When database analysts want to move data from a data source into their data warehouse, this is
the process they use. In short, ETL converts data into a usable format so that once it’s in the data
warehouse, it can be analyzed/queried/etc.
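As a minimal sketch of the ETL flow just described, the following Python script extracts rows from a CSV export, applies a couple of cleansing transforms, and loads the result into SQLite (standing in for a warehouse database); the file, table, and column names are hypothetical.

```python
# Minimal ETL sketch: extract from CSV, transform in memory, load into SQLite
# (standing in for a data warehouse). File and column names are hypothetical.
import csv
import sqlite3

# Extract: read raw rows from an operational export.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: cleanse and standardize (deduplicate, normalize names and amounts).
seen, clean = set(), []
for r in rows:
    if r["order_id"] in seen:
        continue  # drop duplicates
    seen.add(r["order_id"])
    clean.append((r["order_id"], r["customer"].strip().title(), float(r["amount"])))

# Load: write the prepared rows into the warehouse table.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, customer TEXT, amount REAL)")
con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", clean)
con.commit()
```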
Metadata
Metadata is data about data. Basically, it describes all of the data that's stored in a system to make it searchable. Examples of metadata include the author, date, or location of an article, the creation date of a file, the size of a file, and so on. Think of it like the column titles in a spreadsheet.
Metadata allows you to organize your data to make it usable, so you can analyze it to create
dashboards and reports.
SQL query processing
SQL is the de facto standard language for querying your data. This is the language that analysts
use to pull out insights from their data stored in the data warehouse. Typically data warehouses
have proprietary SQL query processing technologies tightly coupled with the compute. This
allows for very high performance when it comes to your analytics. One thing to note, however, is
that the cost of a data warehouse can start getting expensive the more data and SQL compute
resources you have.
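For example, an analyst might run an aggregation like the following sketch, which reuses the hypothetical orders table and SQLite stand-in from the ETL example above.

```python
# Minimal analytical query sketch against the hypothetical warehouse table.
import sqlite3

con = sqlite3.connect("warehouse.db")
query = """
    SELECT customer, COUNT(*) AS num_orders, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC
    LIMIT 10
"""
for customer, num_orders, total_spend in con.execute(query):
    print(f"{customer}: {num_orders} orders, {total_spend:.2f} total")
```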
Data layer:
The data layer is the access layer that allows users to actually get to the data. This is typically
where you’d find a data mart. This layer partitions segments of your data out depending on who
you want to give access to, so you can get very granular across your organization. For instance,
you may not want to give your sales team access to your HR team’s data, and vice versa.
Governance and security layer:
This is related to the data layer in that you need to be able to provide fine-grained access and
security policies across all your organization’s data. Typically data warehouses have very good
data governance and security capabilities built in, so you don’t need to do a lot of custom data
engineering work to include this. It’s important to plan for governance and security as you add
more data to your warehouse and as your company grows.
OLAP
OLAP tools are designed for multidimensional analysis of data in a data warehouse, which
contains both historical and transactional data. Common uses of OLAP include data mining and
other business intelligence apps, complex analytical calculations, and predictive scenarios, as
well as business reporting functions like financial analysis, budgeting, and forecast planning.
For a deep dive into the differences between these approaches, check out "OLAP vs. OLTP:
What's the Difference?"
Schemas in data warehouses
Schemas are ways in which data is organized within a database or data warehouse. There are two
main types of schema structures, the star schema and the snowflake schema, which will impact
the design of your data model.
Star schema: This schema consists of one fact table that can be joined to a number of denormalized dimension tables. It is considered the simplest and most common type of schema, and its users benefit from faster query speeds.
Snowflake schema: This schema extends the star schema by normalizing the dimension tables into multiple related tables. It reduces redundancy and storage, at the cost of more joins and somewhat more complex queries.
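Here is a minimal DDL sketch of a star schema, again using SQLite via Python as a stand-in for a warehouse engine; all table and column names are hypothetical: one sales fact table joined to denormalized date and product dimensions.

```python
# Minimal star schema sketch: one fact table plus denormalized dimensions.
# Table and column names are hypothetical; SQLite stands in for a warehouse.
import sqlite3

con = sqlite3.connect("warehouse.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g., 20240131
        year INTEGER, month INTEGER, day INTEGER
    );
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT, category TEXT        -- denormalized: category stored inline
    );
    CREATE TABLE IF NOT EXISTS fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER, revenue REAL
    );
""")
```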
Data warehouse vs. data lake
Using a data pipeline, a data warehouse gathers raw data from multiple sources into a central repository, structured using predefined schemas designed for data analytics. A data lake can be thought of as a data warehouse without the predefined schemas: it stores raw data in its native format. As a result, it enables more types of analytics than a data warehouse. Data lakes are commonly built on big data platforms such as Apache Hadoop.
Data warehouse vs. data mart
A data mart is a subset of a data warehouse that contains data specific to a particular business line or department. Because they contain a smaller subset of data, data marts enable a department or business line to discover more focused insights more quickly than is possible when working with the broader data warehouse data set.
Data warehouse vs. database
A database is built primarily for fast queries and transaction processing, not analytics. A database
typically serves as the focused data store for a specific application, whereas a data warehouse
stores data from any number (or even all) of the applications in your organization.
A database focuses on updating real-time data while a data warehouse has a broader scope,
capturing current and historical data for predictive analytics, machine learning, and other
advanced types of analysis.
Types of data warehouses
Cloud data warehouse
A cloud data warehouse is a data warehouse specifically built to run in the cloud, and it is offered
to customers as a managed service. Cloud-based data warehouses have grown more popular over
the last five to seven years as more companies use cloud computing services and seek to reduce
their on-premises data center footprint.
With a cloud data warehouse, the physical data warehouse infrastructure is managed by the cloud
company, meaning that the customer doesn’t have to make an upfront investment in hardware or
software and doesn’t have to manage or maintain the data warehouse solution.
On-premises data warehouse
A business can purchase a data warehouse license and then deploy a data warehouse on their
own on-premises infrastructure. Although this is typically more expensive than a cloud data
warehouse service, it might be a better choice for government entities, financial institutions, or
other organizations that want more control over their data or need to comply with strict security
or data privacy standards or regulations.
Benefits of a data warehouse
Better data quality: A data warehouse centralizes data from a variety of data sources, such as transactional systems, operational databases, and flat files. It then cleanses the operational data, eliminates duplicates, and standardizes it to create a single source of truth.
Faster business insights: Data from disparate sources limits the ability of decision makers to set business strategies with confidence. Data warehouses enable data integration, allowing business users to leverage all of a company's data in each business decision.
Data warehouse data makes it possible to report on themes, trends, aggregations, and
other relationships among data collected from an engineering lifecycle management
(ELM) app.
Gaining and growing competitive advantage: All of the above combine to help an organization find more opportunities in data, more quickly than is possible from disparate data stores.
The data lakehouse can also support your data science, ML and AI workloads in addition to your
reporting and dashboarding workloads. If you are looking to upgrade from data warehouse
architecture, then developing an open data lakehouse is the way to go.
1. Big Data Security Challenges:
- Data Complexity: Managing the security of diverse data types (structured, semi-structured,
unstructured) in big data ecosystems can be challenging. For instance, healthcare organizations
face complexities in securing electronic health records (EHRs), medical images, and patient data.
- Scalability: Traditional security measures may struggle to scale effectively in distributed big
data environments. For example, Apache Hadoop clusters require robust security measures to
protect data across multiple nodes and clusters.
- Data Governance: Establishing and enforcing data governance policies, access controls, and
data lifecycle management practices is crucial. For instance, financial institutions must comply
with regulatory requirements like the Payment Card Industry Data Security Standard (PCI DSS)
to secure payment data.
- Insider Threats: Malicious insiders or unintentional data leaks can pose significant risks. For
example, the Edward Snowden incident highlighted the impact of insider threats on national
security and data privacy.
2. Data Encryption and Protection Techniques:
- Data Encryption in Transit: Securing data during transmission using protocols like TLS
(Transport Layer Security). For example, HTTPS encrypts data exchanged between web
browsers and servers.
- Homomorphic Encryption: Performing computations on encrypted data without decrypting it
first, ensuring data privacy. For example, Microsoft Research's SEAL library enables
homomorphic encryption for cloud-based computations.
- Tokenization: Replacing sensitive data with non-sensitive tokens to reduce exposure. For
example, tokenization is used in payment processing to secure credit card information.
- Data Masking: Obfuscating sensitive data while maintaining usability. For example,
healthcare organizations use data masking to anonymize patient data for research and analytics.
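To make tokenization and data masking concrete, here is a minimal, hypothetical Python sketch; a production system would keep the token vault in a hardened, access-controlled service rather than an in-memory dictionary.

```python
# Minimal sketch of tokenization and data masking. The in-memory vault is a
# hypothetical stand-in for a secured token service; not for production use.
import secrets

vault = {}  # token -> original value (would live in a hardened store)

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random, meaningless token."""
    token = "tok_" + secrets.token_hex(8)
    vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; only authorized services should reach the vault."""
    return vault[token]

def mask_card(card_number: str) -> str:
    """Data masking: keep only the last four digits for display or analytics."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

card = "4111111111111111"
t = tokenize(card)
print(t, mask_card(card))     # e.g., tok_3f9a... ************1111
print(detokenize(t) == card)  # True
```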
3. Role-based Access Control in Distributed Systems:
- Role Definitions: Defining roles and permissions based on user responsibilities and access
needs. For example, an administrator role has full access, while a guest role has limited access.
- Access Control Policies: Implementing rules and conditions for granting or denying access to
resources. For example, AWS Identity and Access Management (IAM) allows fine-grained
access control based on policies.
- Centralized vs. Decentralized Control: Choosing between centralized control (e.g., LDAP,
Active Directory) and decentralized control (e.g., blockchain-based access control) based on
scalability and management needs.
- Audit and Monitoring: Tracking access events, detecting anomalies, and ensuring compliance
with security policies. For example, SIEM (Security Information and Event Management)
systems provide real-time monitoring and analysis of security events.
- Dynamic Access Control: Adapting access permissions based on changing user roles or
contexts. For example, Google Cloud Identity and Access Management (IAM) allows dynamic
role assignments based on attributes.
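As a minimal sketch of role definitions and permission checks, consider the following Python snippet; the roles, permissions, and users are hypothetical, and a real deployment would delegate this to a directory service or a cloud IAM policy engine.

```python
# Minimal RBAC sketch: roles map to permission sets, users map to roles.
# Roles, permissions, and users are hypothetical illustrations.
ROLE_PERMISSIONS = {
    "administrator": {"read", "write", "delete", "manage_users"},
    "analyst": {"read", "write"},
    "guest": {"read"},
}

USER_ROLES = {"alice": "administrator", "bob": "guest"}

def is_allowed(user: str, action: str) -> bool:
    """Grant access only if the user's role includes the requested permission."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("alice", "delete"))  # True: administrators have full access
print(is_allowed("bob", "write"))     # False: guests are read-only
```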
4. Data Privacy Practices:
- Data Minimization: Collecting only necessary data and minimizing storage of personal
information. For example, mobile apps often request minimal permissions to access user data to
comply with privacy regulations.