Big Data Notes

UNIT – 5
Problems with Relational Database Systems
1 – Maintenance Problem
The maintenance of the relational database becomes difficult over time due to the
increase in the data. Developers and programmers have to spend a lot of time
maintaining the database.
2 – Cost
The relational database system is costly to set up and maintain. The initial cost of the
software alone can be quite pricey for smaller businesses, but it gets worse when you
factor in hiring a professional technician who must also have expertise with that
specific kind of program.
3 – Physical Storage
A relational database stores data in tables of rows and columns, and it needs a lot of
physical storage because the tables and the structures that support operations on
them are all kept on disk. The storage requirement grows along with the data.
4 – Lack of Scalability
When a relational database is spread over multiple servers, its structure becomes
difficult to manage, especially when the quantity of data is large. As a result, the
data does not scale easily across different physical storage servers. As the database
becomes larger or is distributed across a greater number of servers, latency and
availability problems appear and overall performance suffers.
5 – Complexity in Structure
Relational databases can only store data in tabular form which makes it difficult to
represent complex relationships between objects. This is an issue because many
applications require more than one table to store all the necessary data required by
their application logic.
6 – Decrease in performance over time
A relational database can become slower over time, and not only because of its
reliance on multiple tables. A large number of tables and a large volume of data
increase the complexity of the system. This can lead to slow query response times, or
even to queries failing completely, depending on how many people are logged into the
server at a given time.
Introduction to NoSQL
NoSQL, originally referring to “non-SQL” or “non-relational”, is a type of database that
provides a mechanism for storing and retrieving data modeled in means other than the
tabular relations used in relational databases. Such databases have existed since the
late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the
early twenty-first century. NoSQL databases are used in real-time web applications
and big data, and their use is increasing over time.
• NoSQL systems are also sometimes called “Not only SQL” to emphasize the
fact that they may support SQL-like query languages. Motivations for the
NoSQL approach include simplicity of design, simpler horizontal scaling to
clusters of machines, and finer control over availability. The data structures
used by NoSQL databases differ from those used by default in relational
databases, which makes some operations faster in NoSQL. The suitability
of a given NoSQL database depends on the problem it should solve.
• NoSQL databases are a newer type of database management system that
has gained popularity in recent
years. Unlike traditional relational databases, NoSQL databases are
designed to handle large amounts of unstructured or semi-structured data,
and they can accommodate dynamic changes to the data model. This
makes NoSQL databases a good fit for modern web applications, real-time
analytics, and big data processing.
• Data structures used by NoSQL databases are sometimes also viewed as
more flexible than relational database tables. Many NoSQL stores
compromise consistency in favour of availability, speed and partition
tolerance. Barriers to the greater adoption of NoSQL stores include the use
of low-level query languages, lack of standardized interfaces, and huge
previous investments in existing relational databases.
• Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation,
Durability) transactions, but a few databases, such as MarkLogic,
Aerospike, FairCom c-treeACE, Google Spanner (though technically a
NewSQL database), Symas LMDB, and OrientDB, have made them central
to their designs.
• Most NoSQL databases offer eventual consistency, in which database
changes are propagated to all nodes over time. A query therefore might
not return updated data immediately and might instead read out-of-date
data, a problem known as stale reads. Some NoSQL systems may also
exhibit lost writes and other forms of data loss, and some provide
mechanisms such as write-ahead logging to avoid data loss.
• One simple example of a NoSQL database is a document database. In a
document database, data is stored in documents rather than tables. Each
document can contain a different set of fields, making it easy to
accommodate changing data requirements.
• Take, for instance, a database that holds data about employees. In a
relational database, this information might be stored in tables, with one
table for employee information and another table for department
information. In a document database, each employee would be stored as
a separate document, with all of their information contained within the
document.
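The employee example above can be sketched in Python. This is only an illustration:
it uses the standard-library sqlite3 module as a stand-in relational database and a
plain dictionary as a stand-in document. The table names, field names, and sample
values are all invented for the sketch.

```python
import json
import sqlite3

# Relational layout: employee and department data live in separate tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "department_id INTEGER REFERENCES department(id))"
)
conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employee VALUES (100, 'Asha', 1)")

# Reading one employee together with the department details needs a join.
row = conn.execute(
    "SELECT e.name, d.name FROM employee e "
    "JOIN department d ON d.id = e.department_id WHERE e.id = 100"
).fetchone()

# Document layout: everything about the employee sits in one document.
employee_doc = {
    "_id": 100,
    "name": "Asha",
    "department": {"name": "Engineering"},
    "skills": ["Python", "SQL"],  # an extra field that other documents may lack
}

print(row)
print(json.dumps(employee_doc, indent=2))
```

The relational version needs two tables and a join to rebuild the employee, while
the document version keeps the same information in a single nested record.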
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations
or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by
adding more nodes to a database cluster, making them well-suited for
handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a
document-based data model, where data is stored in semi-structured
format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value
data model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-
based data model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to
be highly available and to automatically handle node failures and data
replication across multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data
in a flexible and dynamic manner, with support for multiple data types and
changing data structures.
8. Performance: NoSQL databases are optimized for high performance and
can handle a high volume of reads and writes, making them suitable for big
data and real-time applications.
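To make features 3 to 5 more concrete, the sketch below shapes the same small data
set for a key-value, a document, and a column-based model using plain Python
dictionaries. It is a simplified picture under invented names and values, not how
Redis, MongoDB, or Cassandra store data internally.

```python
# The same user records shaped for three NoSQL data models (illustrative only).

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:42": '{"name": "Ravi", "city": "Pune"}'}

# Document: a semi-structured record whose fields the database can see.
document_store = {"users": [{"_id": 42, "name": "Ravi", "city": "Pune"}]}

# Column-based: values are grouped by column rather than by row.
column_store = {
    "name": {42: "Ravi", 43: "Meera"},
    "city": {42: "Pune", 43: "Delhi"},
}

# Reading one column touches only that column's data.
cities = list(column_store["city"].values())
print(cities)
```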
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling.
Sharding means partitioning the data and placing the partitions on
multiple machines in such a way that the order of the data is preserved.
Vertical scaling means adding more resources to the existing machine,
whereas horizontal scaling means adding more machines to handle the
data. Vertical scaling is limited by the capacity of a single machine, while
NoSQL databases are designed to make horizontal scaling comparatively
easy. Examples of horizontally scaling databases are MongoDB, Cassandra,
etc. Because of this scalability, NoSQL can handle a huge amount of data:
as the data grows, the database scales out to handle it efficiently.
2. Flexibility: NoSQL databases are designed to handle unstructured or
semi-structured data, which means that they can accommodate dynamic
changes to the data model. This makes NoSQL databases a good fit for
applications that need to handle changing data requirements.
3. High availability: The automatic replication feature of NoSQL databases
makes them highly available, because in case of a failure the data can be
restored from a replica to the previous consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they
can handle large amounts of data and traffic with ease. This makes them a
good fit for applications that need to handle large amounts of data or
traffic.
5. Performance: NoSQL databases are designed to handle large amounts of
data and traffic, which means that they can offer improved performance
compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than
traditional relational databases, as they are typically less complex and do
not require expensive hardware or software.
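Item 1 above describes sharding as partitioning data across machines while
preserving its order. A minimal sketch of that idea is range-based sharding, shown
below with in-memory dictionaries standing in for machines. The boundary values and
the helper names (shard_for, put, get) are assumptions made for the illustration.

```python
from bisect import bisect_right

# Hypothetical shard boundaries: keys below "h" go to shard 0,
# keys from "h" up to (but not including) "p" go to shard 1, the rest to shard 2.
BOUNDARIES = ["h", "p"]
shards = [dict() for _ in range(len(BOUNDARIES) + 1)]


def shard_for(key: str) -> int:
    """Return the index of the shard whose key range contains key."""
    return bisect_right(BOUNDARIES, key)


def put(key: str, value: str) -> None:
    shards[shard_for(key)][key] = value


def get(key: str):
    return shards[shard_for(key)].get(key)


for name in ["alice", "henry", "zara", "bob", "maria"]:
    put(name, f"profile of {name}")

# Each shard holds a contiguous, ordered range of keys.
print([sorted(shard) for shard in shards])
```

Adding capacity in this scheme means adding a boundary and a new shard, which is
the "more machines" form of scaling rather than a bigger single machine.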
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
1. Lack of standardization: There are many different types of NoSQL
databases, each with its own unique strengths and weaknesses. This lack
of standardization can make it difficult to choose the right database for a
specific application.
2. Lack of ACID compliance: Many NoSQL databases are not fully
ACID-compliant, which means that they do not guarantee the consistency,
integrity, and durability of data in the way relational databases do. This can
be a drawback for applications that require strong data consistency
guarantees.
3. Narrow focus: NoSQL databases have a narrow focus; they are mainly
designed for storage and provide comparatively little other functionality.
Relational databases are a better choice than NoSQL in the field of
transaction management.
4. Open-source: Many NoSQL databases are open-source, and there is no
reliable standard for NoSQL yet. In other words, two NoSQL database
systems are likely to differ from each other.
5. Lack of support for complex queries: Many NoSQL databases are not
designed to handle complex queries, which makes them a weaker fit for
applications that require complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the
maturity of traditional relational databases. This can make them less
reliable and less secure than traditional databases.
7. Management challenge: The purpose of big data tools is to make the
management of a large amount of data as simple as possible, but in
practice it is not so easy. Data management in NoSQL is much more
complex than in a relational database. NoSQL, in particular, has a
reputation for being challenging to install and even harder to manage on
a daily basis.
8. GUI is not available: GUI tools for accessing the database are not widely
available in the market.
9. Backup: Backup is a weak point for some NoSQL databases. MongoDB, for
example, has historically lacked a straightforward way to back up a
distributed cluster in a consistent manner.
10. Large document size: Some database systems like MongoDB and
CouchDB store data in JSON format. Because every document repeats its
key names, documents can be quite large, which costs storage, network
bandwidth, and speed. Descriptive key names actually hurt here, since
they increase the document size.
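The effect described in item 10 is easy to measure. The sketch below serializes two
documents that hold the same values but use different key names, and compares their
sizes in bytes. The field names and values are invented for the illustration.

```python
import json

# Two documents holding the same values; only the key names differ.
descriptive = {"customer_first_name": "Ann", "customer_last_name": "Lee"}
abbreviated = {"fn": "Ann", "ln": "Lee"}

size_descriptive = len(json.dumps(descriptive).encode("utf-8"))
size_abbreviated = len(json.dumps(abbreviated).encode("utf-8"))

# The key names are stored again in every document, so the longer names cost
# extra bytes for each record in the collection.
print(size_descriptive, size_abbreviated)
```

Shorter keys save space but make documents harder to read, so this is a trade-off
rather than a rule.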
Types of NoSQL database:
The types of NoSQL databases, with examples of database systems that fall into each
category, are:
1. Graph databases: Examples – Amazon Neptune, Neo4j
2. Key-value stores: Examples – Memcached, Redis, Coherence
3. Tabular (wide-column) stores: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:

1. When a huge amount of data needs to be stored and retrieved.
2. When the relationships between the data you store are not that important.
3. When the data changes over time and is not structured.
4. When support for constraints and joins is not required at the database level.
5. When the data is growing continuously and you need to scale the database
regularly to handle it.
In conclusion, NoSQL databases offer several benefits over traditional relational
databases, such as scalability, flexibility, and cost-effectiveness. However, they also
have several drawbacks, such as a lack of standardization, lack of ACID compliance,
and lack of support for complex queries. When choosing a database for a specific
application, it is important to weigh the benefits and drawbacks carefully to
determine the best fit.
SQL vs NoSQL: Which one is better to use?

Shakespeare was probably not thinking about databases when he wrote “To be, or not
to be”, but “SQL or NoSQL?” is still the critical question most companies face these
days. The biggest decision when it comes to choosing a database is picking a
relational (SQL) or a non-relational (NoSQL) database. While a relational database is
a viable option most times, it can be poorly suited to very large datasets and big data
analysis. This is the major reason for the popularity of NoSQL database systems in
major Internet companies like Google, Yahoo, Amazon, etc.
However, the decision to choose a database is not that simple (what is really?!!).
Both the SQL and NoSQL databases have different structures and different data
storage methods. So the choice between SQL vs NoSQL essentially boils down to the
type of database that is required for a particular project.
What’s so different?
Both SQL and NoSQL databases serve the same purpose i.e. storing data but they go
about it in vastly different ways. There are multiple differences between the SQL and
NoSQL databases and it is important to understand them in order to make an
informed choice about the type of database required.
Keeping that in mind, some of the important differences between the SQL and NoSQL
databases are given as follows:
1. Language:
Let’s imagine that in the database world, everyone speaks X Language. So it would
be quite confusing if you started speaking Y language in the middle of that. This is
the case with SQL databases. The SQL databases manipulate the data based on SQL
which is one of the most versatile and widely-used language options available. While
this makes it a safe choice especially for complex queries, it can also be restrictive.
This is because it requires the use of predefined schemas to determine the structure
of data before you work with it and changing the structure can be quite confusing
(like using Y language).
Now again imagine a database world where multiple languages are spoken.
While this world would be a little chaotic, speaking Y language would be fine because
you would be sure to find a fellow speaker! This is a NoSQL database, which has a
dynamic schema for unstructured data. Here, data is stored in many ways, which
means it can be document-oriented, column-oriented, graph-based, etc. This flexibility
means that documents can be created without having a defined structure and so each
document can have its own unique structure.
2. Scalability
Think about a tall building in your neighborhood. If given the option, would it be
better to add more floors in this building or create a new building entirely for more
residents?
This is the problem for SQL and NoSQL databases. SQL databases are vertically
scalable. This means that the load on a single server can be increased by increasing
things like RAM, CPU, or SSD. (More floors can be added to this building). On the
other hand, NoSQL databases are horizontally scalable. This means that more traffic
can be handled by sharding, or adding more servers in your NoSQL database. (More
buildings can be added to the neighborhood).
In the long run, it is better to add more buildings than floors as that is more stable
(Less chance of creating a Leaning Tower of Pisa!!!). Thus, NoSQL can ultimately
become larger and more powerful, making NoSQL databases the preferred choice for
large or ever-changing data sets.
3. Schema Design
A schema refers to the blueprint of a database, i.e. how the data is organized. The
schemas of an SQL database and a NoSQL database are markedly different. Let’s use a
joke to better understand this: some database admins walk into a NoSQL bar and soon
leave because they cannot find a table.
This basically means that the poor database admins couldn’t find a table in NoSQL
because there is no standard schema definition for NoSQL databases. They are either
key-value pairs, document-based, graph databases or wide-column stores depending
on the requirements. On the other hand, if those database admins had gone to a SQL
bar, they certainly would have found tables as SQL databases have a table-based
schema.
This difference in schema makes relational SQL databases a better option for
applications that require multi-row transactions such as an accounting system or for
legacy systems that were built for a relational structure. However, NoSQL databases
are much better suited for big data as flexibility is an important requirement which
is fulfilled by their dynamic schema.
4. Community
SQL is a mature technology (like your old but very wise uncle) and there are many
experienced developers who understand it. Also, great support is available for all SQL
databases from their vendors. There are even a lot of independent consultants who
can help with the SQL database for very large scale deployments.
On the other hand, NoSQL is comparatively new (the young and fun cousin!) and so
some NoSQL databases are reliant on community support. Also, only limited outside
experts are available for setting up and deploying large scale NoSQL deployments.
The Big Questions!!!
NoSQL is a recent technology compared to SQL. So naturally, there are lots of
questions in regards to it especially in the context of big data and data analytics.
Some of the major questions relating to this are addressed below:
Is NoSQL faster than SQL?
In general, NoSQL is not faster than SQL just as SQL is not faster than NoSQL. For
those that didn’t get that statement, it means that speed as a factor for SQL and
NoSQL databases depends on the context.
SQL databases are normalized databases where the data is broken down into various
logical tables to avoid data redundancy and data duplication. In this scenario, SQL
databases are faster than their NoSQL counterparts for joins, queries, updates, etc.
On the other hand, NoSQL databases are specifically designed for unstructured data
which can be document-oriented, column-oriented, graph-based, etc. In this case, a
particular data entity is stored together and not partitioned. So performing read or
write operations on a single data entity is faster for NoSQL databases as compared
to SQL databases.
Is NoSQL better for Big Data Applications?
They say “Necessity is the Mother of Invention!” and that certainly turned out to be
true in the case of NoSQL. The NoSQL databases for big data were specifically
developed by the top internet companies such as Google, Yahoo, Amazon, etc. as the
existing relational databases were not able to cope with the increasing data
processing requirements.
NoSQL databases have a dynamic schema that is much better suited for big data as
flexibility is an important requirement. Also, large amounts of analytical data can be
stored in NoSQL databases for predictive analysis. An example of this is data from
various social media sites such as Instagram, Twitter, Facebook, etc. NoSQL
databases are horizontally scalable and can ultimately become larger and more
powerful if required. All of this makes NoSQL databases the preferred choice for big
data applications.
The choice between SQL and NoSQL depends entirely on individual circumstances as
both of them have advantages as well as disadvantages. SQL databases are long
established with fixed schema design and a set structure. They are ideal for
applications that require multi-row transactions such as an accounting system or for
legacy systems that were built for a relational structure.
On the other hand, NoSQL databases are easily scalable, flexible and simple to use
as they have no rigid schema. They are ideal for applications with no specific schema
definitions such as content management systems, big data applications, real-time
analytics, etc.
Aggregate Data Model in NoSQL

As we know, NoSQL databases store data in formats other than the relational format.
NoSQL is used in nearly every industry nowadays. For people who interact with data
in databases, the aggregate data model helps in that interaction.
Features of NoSQL Databases:
• Schema Agnostic: NoSQL databases do not require any specific schema or
storage structure, unlike traditional RDBMS.
• Scalability: NoSQL databases scale horizontally; as data grows rapidly,
additional commodity hardware can be added while the scalability
features are preserved.
• Performance: To increase the performance of a NoSQL system, one can
add more commodity servers, which gives reliable and fast database
access with minimum overhead.
• High Availability: A traditional RDBMS relies on primary and secondary
nodes for fetching the data, whereas some NoSQL databases use a
masterless architecture.
• Global Availability: As data is replicated among multiple servers and
clouds, it is accessible from anywhere, which minimizes latency.
Aggregate Data Models:
The term aggregate means a collection of objects that we treat as a unit. In other
words, an aggregate is a collection of data that we interact with as a unit. These units
of data, or aggregates, form the boundaries for ACID operations.
Example of Aggregate Data Model:
The diagram for this example has two aggregates:

• Customer and Order are the two aggregates, and the link between them
represents a relationship between aggregates.
• The diamond shows how data fits into the aggregate structure.
• Customer contains a list of billing addresses.
• Payment also contains the billing address.
• The address appears three times, and it is copied each time.
• This design fits a domain where we don’t want shipping and billing
addresses to change once they are recorded.
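The two aggregates can be sketched as nested Python dictionaries. The field names
and values below are hypothetical; the point is only that the address is copied into
each place it is needed instead of being shared, and that the order refers to the
customer by id rather than embedding it.

```python
import copy

# Hypothetical data, loosely following the customer/order example above.
address = {"city": "Chicago"}

# Aggregate 1: the customer with its list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [copy.deepcopy(address)],
}

# Aggregate 2: the order, which embeds its own copies of the address.
order = {
    "id": 99,
    "customer_id": 1,  # a reference by id, not an embedded customer
    "order_items": [{"product_id": 27, "price": 32.45}],
    "shipping_address": copy.deepcopy(address),
    "payments": [
        {"card": "1000-1000", "billing_address": copy.deepcopy(address)}
    ],
}

# Changing the customer's address later leaves the order's copies untouched.
customer["billing_addresses"][0]["city"] = "Boston"
print(order["shipping_address"]["city"])
print(order["payments"][0]["billing_address"]["city"])
```

Because the order keeps its own copies, a historical order still shows the address
that was valid when it was placed.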
Consequences of Aggregate Orientation:
• Aggregation is not a logical data property. It is all about how the data is
being used by applications.
• An aggregate structure may help with some data interactions but be an
obstacle for others.
• It has an important consequence for transactions.
• Aggregate-oriented NoSQL databases generally don’t support ACID
transactions that span multiple aggregates, thus sacrificing some
consistency.
• Instead, aggregate-oriented databases support the atomic manipulation of
a single aggregate at a time.
Advantage:
• It can be used as a primary data source for online applications.
• Easy replication.
• No single point of failure.
• It provides fast performance and horizontal scalability.
• It can handle structured, semi-structured, and unstructured data with
equal effort.
Disadvantage:
• No standard rules.
• Limited query capabilities.
• Doesn’t work well with relational data.
• Not so popular in the enterprise.
• When the volume of data increases, it is difficult to maintain unique values.
Key-Value Data Model in NoSQL

A key-value data model or database is also referred to as a key-value store. It is a
non-relational type of database. It uses an associative array as its basic data model,
in which an individual key is linked with just one value in a collection. Keys are unique
identifiers for the values, and a value can be any kind of entity. The collection of
key-value pairs stored as separate records is called a key-value database, and it does
not have a predefined structure.
How do key-value databases work?

A key-value database associates a value, which can be anything from a simple string
to a complicated entity, with a key that is used to keep track of that entity. As in many
programming paradigms, a key-value database resembles a map object, array, or
dictionary, except that it is stored persistently and managed by a DBMS.
The key-value store uses an efficient and compact index structure so that it can find
a value quickly and reliably by its key. For example, Redis is a key-value store used to
track lists, maps, heaps, and primitive types (which are simple data structures) in a
persistent database. By supporting only a predetermined number of value types, Redis
can expose a very simple interface for querying and manipulating them, and when
configured appropriately it is capable of high throughput.
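The "dictionary that is stored persistently" idea can be tried with Python's
standard-library shelve module, which behaves like a dictionary backed by a file. It
is used here only as a stand-in for a real key-value store such as Redis; the key
names and values are invented for the sketch.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kv_store")

    # First session: store values under string keys.
    with shelve.open(path) as db:
        db["session:alice"] = {"cart": ["book"], "theme": "dark"}
        db["counter"] = 3

    # Second session: the data is still there because it was written to disk.
    with shelve.open(path) as db:
        session = db["session:alice"]       # get by key
        has_counter = "counter" in db       # existence check by key
        del db["counter"]                   # remove by key
        remaining_keys = sorted(db.keys())

print(session, has_counter, remaining_keys)
```

Every operation goes through a key; there is no way to ask "which sessions use the
dark theme" without reading the values one by one.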
When to use a key-value database:
Here are a few situations in which you can use a key-value database:-
• User session attributes in an online app such as finance or gaming, which
need real-time random data access.
• A caching mechanism for frequently accessed data or key-based designs.
• The application is built around queries that are based on keys.
Features:
• One of the simplest kinds of NoSQL data models.
• Key-value databases use simple functions for storing, getting, and
removing data.
• Key-value databases generally have no query language.
• Built-in redundancy makes this database more reliable.
Advantages:
• It is very easy to use. Due to the simplicity of the database, values can be
of any type, or even of different types when required.
• Its response time is fast due to its simplicity, provided that the
surrounding environment is well built and optimized.
• Key-value store databases are scalable vertically as well as horizontally.
• Built-in redundancy makes this database more reliable.
Disadvantages:
• As key-value databases have no query language, queries cannot be ported
from one database to another.
• The key-value store is not refined: you cannot query the database
without a key.
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
• Couchbase: It permits SQL-style querying and text search.
• Amazon DynamoDB: One of the most widely used key-value databases. It
is trusted by a large number of users, can easily handle a large number of
requests every day, and provides various security options.
• Riak: A distributed key-value database used to develop applications.
• Aerospike: An open-source, real-time database working with billions of
transactions.
• Berkeley DB: A high-performance, open-source database providing
scalability.
Document Databases in NoSQL

This section covers the document data model of NoSQL, along with its examples,
advantages, disadvantages, and applications.
Document Data Model:
A document data model is quite different from other data models because it stores
data in JSON, BSON, or XML documents. In this data model, documents can be nested
inside other documents, and particular elements can be indexed so that queries run
faster. Documents are often stored and retrieved in a form that is close to the data
objects used in applications, which means very little translation is required to use the
data in an application. JSON is a native format that is often used both to store and to
query data.
In the document data model, each document consists of key-value pairs. Below is an
example:
{
"Name" : "Yashodhra",
"Address" : "Near Patel Nagar",
"Email" : "yahoo123@yahoo.com",
"Contact" : "12345"
}
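The nesting and indexing mentioned above can be sketched with plain Python data
structures. In the sketch below, the nested "Address" field, the second and third
documents, and the city values are assumptions added for illustration; the index is
a simple dictionary, not the index structure of any particular database.

```python
from collections import defaultdict

# A tiny "collection" of documents. Each document nests an Address document.
collection = [
    {"Name": "Yashodhra", "Email": "yahoo123@yahoo.com",
     "Address": {"Area": "Patel Nagar", "City": "Delhi"}},
    {"Name": "Kiran", "Email": "kiran@example.com",
     "Address": {"Area": "MG Road", "City": "Pune"}},
    {"Name": "Dev", "Address": {"Area": "Karol Bagh", "City": "Delhi"}},
]

# Build a secondary index on the nested field Address.City.
city_index = defaultdict(list)
for position, doc in enumerate(collection):
    city_index[doc["Address"]["City"]].append(position)


def find_by_city(city):
    """Look up documents through the index instead of scanning them all."""
    return [collection[i] for i in city_index.get(city, [])]


print([doc["Name"] for doc in find_by_city("Delhi")])
```

Note that the third document has no "Email" field at all, which the document model
allows.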
Working of Document Data Model:
The document data model is a semi-structured data model in which a record and the
data associated with it are stored together in a single document, which means the
data is not completely unstructured. The main point is that the data is stored in a
document.
Features:
• Document Type Model: Data is stored in documents rather than tables or
graphs, so it becomes easy to map to objects in many programming
languages.
• Flexible Schema: The overall schema is very flexible; not all documents in
a collection need to have the same fields.
• Distributed and Resilient: Document databases are highly distributed,
which enables horizontal scaling and distribution of data.
• Manageable Query Language: The query language allows developers to
perform CRUD (Create, Read, Update, Delete) operations on the data.
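The flexible schema and CRUD features can be illustrated with a toy in-memory
collection. The Collection class and its method names below are hypothetical and do
not correspond to the client API of any real document database.

```python
import itertools


class Collection:
    """A toy in-memory document collection (not a real database client)."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def create(self, doc):
        doc_id = next(self._ids)
        self._docs[doc_id] = dict(doc, _id=doc_id)
        return doc_id

    def read(self, doc_id):
        return self._docs.get(doc_id)

    def update(self, doc_id, changes):
        self._docs[doc_id].update(changes)

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)


users = Collection()
# The two documents have different fields; no schema has to be declared.
a = users.create({"name": "Asha", "email": "asha@example.com"})
b = users.create({"name": "Bilal", "phone": "555-0100", "tags": ["admin"]})
users.update(a, {"city": "Pune"})  # adds a new field without any migration
users.delete(b)
print(users.read(a), users.read(b))
```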
Examples of Document Data Models :
• Amazon DocumentDB
• MongoDB
• Cosmos DB
• ArangoDB
• Couchbase Server
• CouchDB
Advantages:
• Schema-less: Document databases are very good at retaining existing data
at massive volumes because there are no restrictions on the format and
structure of data storage.
• Faster creation of documents and maintenance: It is very simple to create
a document, and beyond this, almost no maintenance is required.
• Open formats: Documents are built in a simple way using XML, JSON, and
related formats.
• Built-in versioning: As documents grow in size, they may also grow in
complexity; built-in versioning decreases conflicts.
Disadvantages:
• Weak Atomicity: Support for multi-document ACID transactions is limited.
A change involving two collections requires two separate queries, i.e. one
for each collection, which breaks atomicity requirements.
• Consistency Check Limitations: It is possible to search for documents that
are not connected to, for example, an author collection, but doing so can
hurt database performance.
• Security: Nowadays many web applications lack security, which in turn
results in the leakage of sensitive data. This is a point of concern, so one
must pay attention to web app vulnerabilities.
Applications of Document Data Model :
• Content Management: Document databases are widely used for video
streaming platforms, blogs, and similar services, because each item is
stored as a single document and the database is much easier to maintain
as the service evolves over time.
• Book Database: They are very useful for book databases because the data
model lets us nest data.
• Catalog: They are widely used for storing and reading catalog files because
reads stay fast even when catalogs have thousands of attributes stored.
• Analytics Platform: They are also widely used in analytics platforms.
NoSQL database misconceptions

Over the years, many misconceptions about NoSQL databases have spread throughout the
developer community. In this section, we'll discuss two of the most common misconceptions:
• Relationship data is best suited for relational databases.
• NoSQL databases don't support ACID transactions.
Misconception: relationship data is best suited for relational databases
A common misconception is that NoSQL databases or non-relational databases don’t store
relationship data well. NoSQL databases can store relationship data — they just store it
differently than relational databases do.
In fact, many find modeling relationship data in NoSQL databases easier than in
relational databases, because related data doesn’t have
to be split between tables. NoSQL data models allow related data to be nested within a single
data structure.
Misconception: NoSQL databases don't support ACID transactions
Another common misconception is that NoSQL databases don't support ACID transactions.
Some NoSQL databases like MongoDB do, in fact, support ACID transactions.
Note that the way data is modeled in NoSQL databases can eliminate the need for
multi-record transactions in many use cases. Consider an example where information
about a user and their hobbies is stored in both a relational database and a document
database. In order to ensure information about a user and their hobbies is updated
together in a relational database, we'd need to use a transaction to update records in
two tables. In order to do the same in a document database, we could update a single
document; no multi-record transaction is required.
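The user-and-hobbies comparison can be sketched as follows. The standard-library
sqlite3 module stands in for a relational database and a plain dictionary stands in
for a document; the table names, field names, and values are invented, and the
dictionary update only illustrates the shape of a single-document write.

```python
import sqlite3

# Relational version: user and hobbies are in two tables, so updating both
# consistently needs a transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE hobbies (user_id INTEGER, hobby TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Delhi')")
conn.execute("INSERT INTO hobbies VALUES (1, 'chess')")
conn.commit()

with conn:  # commits if the block succeeds, rolls back if it raises
    conn.execute("UPDATE users SET city = 'Pune' WHERE id = 1")
    conn.execute("INSERT INTO hobbies VALUES (1, 'cycling')")

# Document version: the hobbies are nested in the user document, so the same
# change is a single write to one document.
user_doc = {"_id": 1, "city": "Delhi", "hobbies": ["chess"]}
user_doc = {
    **user_doc,
    "city": "Pune",
    "hobbies": user_doc["hobbies"] + ["cycling"],
}

hobbies = [
    hobby
    for (hobby,) in conn.execute(
        "SELECT hobby FROM hobbies WHERE user_id = 1 ORDER BY rowid"
    )
]
print(hobbies, user_doc)
```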
Introduction to Graph Database on NoSQL

A graph database is a type of NoSQL database that is designed to handle data with
complex relationships and interconnections. In a graph database, data is stored as
nodes and edges, where nodes represent entities and edges represent the
relationships between those entities.
1. Graph databases are particularly well-suited for applications that require
deep and complex queries, such as social networks, recommendation
engines, and fraud detection systems. They can also be used for other types
of applications, such as supply chain management, network and
infrastructure management, and bioinformatics.
2. One of the main advantages of graph databases is their ability to handle and
represent relationships between entities. In many domains the relationships
between entities are as important as the entities themselves, and they are
often awkward to represent in a traditional relational database.
3. Another advantage of graph databases is their flexibility. Graph databases
can handle data with changing structures and can be adapted to new use
cases without requiring significant changes to the database schema. This
makes them particularly useful for applications with rapidly changing data
structures or complex data requirements.
4. However, graph databases may not be suitable for all applications. For
example, they may not be the best choice for applications that require
simple queries or that deal primarily with data that can be easily
represented in a traditional relational database. Additionally, graph
databases may require more specialized knowledge and expertise to use
effectively.
Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These
databases provide a range of features, including support for different data models,
scalability, and high availability, and can be used for a wide variety of applications.
A graph is a representation of data as nodes and the relationships between them,
which are drawn as edges. A graph database stores data in this form. It has three
components: nodes, relationships, and properties, which together are used to model
the data. The concept is based on graph theory, and graph databases in their current
form date from around the year 2000. They are commonly classed as NoSQL databases
because data is stored as nodes, relationships and properties rather than in the
tables of a traditional database. A graph database is very useful for heavily
interconnected data. Relationships between data are given priority and can therefore
be visualized easily. Graph databases are flexible, since new data can be added
without disturbing existing data. They are useful in fields such as social
networking, fraud detection and AI knowledge graphs.
The components are described as follows:
• Nodes: represent the objects or instances. A node is roughly equivalent to
a row in a relational table and acts as a vertex in the graph. Nodes are
grouped by applying a label to each member.
• Relationships: These are the edges in the graph. Each has a specific
direction and type, and together they form the patterns in the data. They
establish the connections between nodes.
• Properties: These are the pieces of information attached to nodes and, in
property graphs, to relationships.
Some examples of graph database software are Neo4j, Oracle NoSQL DB, Graph base,
etc. Of these, Neo4j is the most popular one.
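The three components can be sketched as plain Python data structures. This is a minimal illustration, not how any particular product stores its data; the class names, labels, relationship types and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    label: str                                       # groups similar nodes
    properties: dict = field(default_factory=dict)   # information about the node

@dataclass
class Relationship:
    start: int                                       # id of the source node
    end: int                                         # id of the target node
    rel_type: str                                    # e.g. "FRIEND_OF"
    properties: dict = field(default_factory=dict)

# Hypothetical sample data.
alice = Node(1, "Person", {"name": "Alice"})
bob = Node(2, "Person", {"name": "Bob"})
acme = Node(3, "Company", {"name": "Acme"})
nodes = {n.node_id: n for n in (alice, bob, acme)}

relationships = [
    Relationship(1, 2, "FRIEND_OF", {"since": 2020}),
    Relationship(1, 3, "WORKS_AT"),
]

def nodes_with_label(label):
    """Names of all nodes carrying the given label."""
    return [n.properties["name"] for n in nodes.values() if n.label == label]

def outgoing(node_id, rel_type):
    """Names of nodes reached from node_id over relationships of one type."""
    return [
        nodes[r.end].properties["name"]
        for r in relationships
        if r.start == node_id and r.rel_type == rel_type
    ]
```

Here `nodes_with_label("Person")` uses the label to group nodes, and `outgoing(1, "WORKS_AT")` follows a directed, typed relationship from one node to another.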
In traditional relational databases, relationships between data are not stored
directly; they are reconstructed at query time by joining tables on key values. In a
graph database, the relationships are prioritized and stored alongside the data.
Much of today's data is heavily interconnected, with each item connected directly or
indirectly to others. Because the model is based on graph theory, it is flexible and
works fast for associative data. Querying is also fast, because the desired nodes
can be reached by following relationships. Join operations are not required, which
reduces query cost. Relationships and properties are stored as first-class entities
in a graph database.
Graph databases also allow organizations to connect their data with external
sources. Organizations handle huge amounts of data, and storing heavily connected
data in tables can become cumbersome. For instance, to find a record that is
connected to a record in another table, a join is first performed between the tables
and the result is then searched, in the worst case row by row. A graph database
avoids this because it stores the relationships and properties along with the data.
To find a particular item, the nodes can be reached by following relationships,
without joins and without scanning row by row. The cost of a search therefore
depends mainly on the part of the graph that is traversed rather than on the total
amount of data.
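The difference between scanning rows and following stored relationships can be sketched as follows. The data is hypothetical, and the scan is a deliberate simplification: real relational databases normally use indexes to avoid examining every row.

```python
# Table-style storage: each relationship is a row (order_id, customer_id, item).
orders = [(101, 1, "laptop"), (102, 2, "phone"), (103, 1, "mouse")]

def items_by_scan(customer_id):
    # Every order row is examined, so the work grows with the table size.
    return [item for _, cust, item in orders if cust == customer_id]

# Graph-style storage: each node keeps direct references to its neighbours.
adjacency = {}
for _, cust, item in orders:
    adjacency.setdefault(cust, []).append(item)

def items_by_traversal(customer_id):
    # Only the edges attached to this one node are followed.
    return adjacency.get(customer_id, [])
```

Both functions return the same items for a customer; the second does so by touching only that customer's own edges, which is the idea behind the claim that lookup cost does not depend on the total amount of data.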
Types of Graph Databases:
• Property Graphs: These graphs are used for querying and analyzing data
by modelling the relationships among the data. A property graph comprises
vertices, which hold information about a particular subject, and edges,
which denote the relationships. Both vertices and edges can have
additional attributes called properties.
• RDF Graphs: RDF stands for Resource Description Framework. RDF graphs
focus more on data integration and are used to represent complex data with
well-defined semantics. Each statement consists of three elements: two
vertices and the edge connecting them, which reflect the subject, object and
predicate of a sentence. Vertices and edges are typically identified by a
URI (Uniform Resource Identifier).
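An RDF-style graph can be sketched as a set of (subject, predicate, object) triples. The example below uses plain Python tuples rather than an RDF library, and the `http://example.org/` URIs and the `match` helper are hypothetical names chosen for illustration.

```python
EX = "http://example.org/"  # hypothetical namespace for the example

# Each triple reads like a short sentence: subject, predicate, object.
triples = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "knows", EX + "carol"),
    (EX + "alice", EX + "worksAt", EX + "acme"),
}

def match(subject=None, predicate=None, obj=None):
    """Return the sorted triples that fit the pattern; None is a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )
```

Calling `match(predicate=EX + "knows")` returns every "knows" statement, which is the basic pattern-matching idea behind RDF queries.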
When to Use a Graph Database?
• Graph databases should be used for heavily interconnected data.
• They should be used when the amount of data is large and relationships
between the data matter.
• They can be used to present a cohesive picture of the data.
How Do Graphs and Graph Databases Work?
Graph databases provide graph models. Because the data is connected, they allow
users to perform traversal queries. Graph algorithms are also applied to find
patterns, paths and other relationships, thus enabling deeper analysis of the data.
These algorithms help to explore neighboring nodes, cluster vertices, and analyze
relationships and patterns. Long chains of joins are not required in this kind of
database.
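As one example of a traversal-based graph algorithm, the sketch below uses breadth-first search to find a shortest path between two nodes. The graph is a hypothetical adjacency list, and breadth-first search is just one of many algorithms a graph database might offer.

```python
from collections import deque

# Hypothetical directed graph stored as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def shortest_path(start, goal):
    """Breadth-first search: explore neighbours level by level."""
    queue = deque([[start]])   # each queue entry is a path from start
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None                # goal cannot be reached from start
```

The search only ever follows edges out of nodes it has already reached, so no join over the whole data set is involved.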
Examples of Graph Database Use:
• Recommendation engines in e-commerce use graph databases to give
customers accurate recommendations and updates about new products,
thus increasing sales and meeting customers' needs.
• Social media companies use graph databases to find "friends of friends",
or products that a user's friends like, and send suggestions to the user
accordingly.
• Graph databases play a major role in fraud detection. Users can build a
graph from the transactions between entities and store other important
information with it. Once the graph is created, a query over its
relationships can help to identify fraud.
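The "friends of friends" suggestion mentioned above can be sketched as a two-step traversal. The user names and friendships are made up for the example, and friendship is assumed to be mutual.

```python
# Hypothetical social graph: each user maps to the set of their friends.
friends = {
    "ann": {"ben", "cy"},
    "ben": {"ann", "dee"},
    "cy": {"ann", "dee", "eve"},
    "dee": {"ben", "cy"},
    "eve": {"cy"},
}

def friends_of_friends(user):
    """People two steps away who are not already friends with the user."""
    direct = friends.get(user, set())
    candidates = set()
    for friend in direct:
        candidates |= friends.get(friend, set())
    # Exclude existing friends and the user themselves.
    return sorted(candidates - direct - {user})
```

For "ann" this suggests "dee" and "eve", who are reachable through "ben" and "cy" but are not yet direct friends.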
Advantages of Graph Database:
• Relationships can be established with external data sources as well.
• No joins are required, since relationships are already stored.
• Query cost depends on the relationships traversed and not on the total
amount of data.
• It is flexible and agile.
• It is easy to manage data in the form of a graph.
• Efficient data modeling: Graph databases allow for efficient data modeling
by representing data as nodes and edges. This allows for more flexible and
scalable data modeling than traditional relational databases.
• Flexible relationships: Graph databases are designed to handle complex
relationships and interconnections between data elements. This makes
them well-suited for applications that require deep and complex queries,
such as social networks, recommendation engines, and fraud detection
systems.
• High performance: Graph databases are optimized for handling large and
complex datasets, making them well-suited for applications that require
high levels of performance and scalability.
• Scalability: Graph databases can be easily scaled horizontally, allowing
additional servers to be added to the cluster to handle increased data
volume or traffic.
• Easy to use: Graph databases are typically easier to use than traditional
relational databases. They often have a simpler data model and query
language, and can be easier to maintain and scale.
Disadvantages of Graph Database:
• Searching can become slower for very complex relationships.
• The query language is platform dependent.
• They are often a poor fit for high-volume transactional data.
• They have a smaller user base than relational databases.
• Limited use cases: Graph databases are not suitable for all applications. They
may not be the best choice for applications that require simple queries or
that deal primarily with data that can be easily represented in a traditional
relational database.
• Specialized knowledge: Graph databases may require specialized
knowledge and expertise to use effectively, including knowledge of graph
theory and algorithms.
• Immature technology: The technology for graph databases is relatively new
and still evolving, which means that it may not be as stable or well-
supported as traditional relational databases.
• Integration with other tools: Graph databases may not be as well-integrated
with other tools and systems as traditional relational databases, which can
make it more difficult to use them in conjunction with other technologies.
Overall, graph databases on NoSQL offer many advantages for applications
that require complex and deep relationships between data elements. They
are highly flexible, scalable, and performant, and can handle large and
complex datasets. However, they may not be suitable for all applications,
and may require specialized knowledge and expertise to use effectively.
Future of Graph Database:
A graph database is an excellent tool for storing interconnected data, but it cannot
completely replace the traditional database, because it is aimed at a particular
kind of heavily interconnected data. Although graph database technology is still
developing, it is becoming increasingly important as businesses and organizations
use big data and rely on graph databases for complex analysis. These databases are
therefore likely to matter for today's needs and tomorrow's success.
Schemaless Database
Traditional relational databases are well-defined, using a schema to describe every functional
element, including tables, rows, views, indexes, and relationships. By exerting a high degree
of control, the database administrator can improve performance and prevent capture of
low-quality, incomplete, or malformed data. In a SQL database, the schema is enforced by the
Relational Database Management System (RDBMS) whenever data is written to disk.
But in order to work, data needs to be heavily formatted and shaped to fit into the table
structure. This means sacrificing any undefined details during the save, or storing valuable
information outside the database entirely.
A schemaless database, like MongoDB, does not have these up-front constraints, mapping to
a more ‘natural’ database. Even when sitting on top of a data lake, each document is created
with a partial schema to aid retrieval. Any formal schema is applied in the code of your
applications; this layer of abstraction protects the raw data in the NoSQL database and allows
for rapid transformation as your needs change.
Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the
same time, using the right tools in the form of a schemaless database can unlock the value of
all of your structured and unstructured data types.
How does a schemaless database work?

In schemaless databases, information is stored in JSON-style documents which can have
varying sets of fields with different data types for each field. So, a collection could look like
this:
{ name : "Joe", age : 30, interests : "football" }
{ name : "Kate", age : 25 }
As you can see, the data itself normally has a fairly consistent structure. With the schemaless
MongoDB database, there is some additional structure — the system namespace contains an
explicit list of collections and indexes. Collections may be implicitly or explicitly created —
indexes must be explicitly declared.
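The idea of documents with varying sets of fields can be sketched with ordinary Python dicts. This is an in-memory illustration only, not the MongoDB API; the `find` and `fields_in_use` helpers and the third document are hypothetical.

```python
# A "collection" is a list of documents; documents need not share fields.
collection = [
    {"name": "Joe", "age": 30, "interests": "football"},
    {"name": "Kate", "age": 25},
]

def find(field, value):
    """Return documents whose field equals value; a missing field never matches."""
    return [doc for doc in collection if field in doc and doc[field] == value]

def fields_in_use():
    """All field names that appear in at least one document."""
    return sorted({key for doc in collection for key in doc})

# A document with a new field can be added without changing anything else.
collection.append({"name": "Li", "email": "li@example.com"})
```

No schema change was needed to store the `email` field; code that reads the collection simply has to cope with fields that may be absent.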
What are the benefits of using a schemaless database?
• Greater flexibility over data types
By operating without a schema, schemaless databases can store, retrieve, and query
any data type — perfect for big data analytics and similar operations that are powered
by unstructured data. Relational databases apply rigid schema rules to data, limiting
what can be stored.
• No pre-defined database schemas
The lack of schema means that your NoSQL database can accept any data type —
including those that you do not yet use. This future-proofs your database, allowing it
to grow and change as your data-driven operations change and mature.
• No data truncation
A schemaless database makes almost no changes to your data; each item is saved in
its own document with a partial schema, leaving the raw information untouched. This
means that every detail is always available and nothing is stripped to match the
current schema. This is particularly valuable if your analytics needs change at some
point in the future.
• Suitable for real-time analytics functions
With the ability to process unstructured data, applications built on NoSQL databases
are better able to process real-time data, such as readings and measurements from
IoT sensors. Schemaless databases are also ideal for use with machine learning and
artificial intelligence operations, helping to accelerate automated actions in your
business.
• Enhanced scalability and flexibility
With NoSQL, you can use whichever data model is best suited to the job. Graph
databases allow you to view relationships between data points, or you can use
traditional wide table views with an exceptionally large number of columns. You can
query, report, and model information however you choose. And as your requirements
grow, you can keep adding nodes to increase capacity and power.
When a record is saved to a relational database, anything (particularly metadata) that
does not match the schema is truncated or removed. Deleted at write, these details
cannot be recovered at a later point in time.
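The loss of details that do not fit a fixed schema can be sketched as follows. The column names and the sensor reading are hypothetical, and the sketch assumes the record is shaped to the schema before it is written, which in practice often happens in the application or loading code rather than inside the database itself.

```python
# Hypothetical fixed schema: only these columns exist in the table.
SCHEMA_COLUMNS = ("sensor_id", "temperature")

def save_to_table(record):
    # Only fields named in the schema survive; anything else is dropped.
    return tuple(record.get(col) for col in SCHEMA_COLUMNS)

def save_as_document(record):
    # The whole record is kept, including fields nobody planned for.
    return dict(record)

reading = {"sensor_id": "s1", "temperature": 21.5, "firmware": "1.0.3"}
row = save_to_table(reading)
doc = save_as_document(reading)
```

The `firmware` value is absent from `row` and cannot be recovered from it later, while `doc` still carries it.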
