
4 / 2 / 2021

Dr. Mohammed H. Al-Jammas


Hierarchy of Data
Bit: A binary digit (i.e., 0 or 1) that represents a circuit
that is either on or off.
Character: A basic building block of most
information, consisting of uppercase letters,
lowercase letters, numeric digits, or special
symbols.
Field: Typically a name, a number, or a combination
of characters that describes an aspect of a business
object or activity.
Record: A collection of data fields all related to
one object, activity, or individual.

File: A collection of related records.

Database: A collection of integrated and related files.


Data Entities, Attributes, and Keys
Entities, attributes, and keys are important database concepts.
An entity is a person, place, or thing (object) for which data is collected, stored, and maintained.
Examples of entities include employees, products, and customers. Most organizations organize
and store data as entities.
An attribute is a characteristic of an entity. For example, employee number, last name, first name,
hire date, and department number are attributes for an employee.
The specific value of an attribute, called a Data Item, can be found in the fields of the record
describing an entity.
A Key is a field within a record that is used to identify the record.

A Primary key is a field or set of fields that uniquely identifies the record. No other record can have the same primary key.
Example: a database for student marks. Field sizes are given in characters (ch).

Student record:
Student Number: 10 ch
Student Name: 1-Name: 12 ch, 2-Name: 12 ch, 3-Name: 12 ch, 4-Name: 12 ch, S-Name: 20 ch
Gender: 1 ch
Mobile No.: 14 ch
E-mail: 30 ch
Address: Country: 20 ch, Government: 20 ch, City: 20 ch, Town: 20 ch, Village: 20 ch, Street No., Home No.

Subject record:
Subject code: 10 ch
Subject title: 20 ch

Marks record:
Student No.: 10 ch
Academic Year: 9 ch
Subject code: 10 ch
A-Mark: 2 ch
F-Mark: 2 ch
Attempt: 1 ch
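A minimal sketch of how this schema might be declared in a relational DBMS, using Python's built-in sqlite3 module (the trimmed column list and the composite primary key on the Marks table are assumptions based on the layout above):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # Student: Student_No is the primary key, so no two students share it
    cur.execute("""
        CREATE TABLE Student (
            Student_No TEXT PRIMARY KEY,  -- 10 ch
            S_Name     TEXT,              -- 20 ch
            Gender     TEXT,              -- 1 ch
            Mobile_No  TEXT,              -- 14 ch
            Email      TEXT               -- 30 ch
        )""")

    # Subject: Subject_Code is the primary key
    cur.execute("""
        CREATE TABLE Subject (
            Subject_Code  TEXT PRIMARY KEY,  -- 10 ch
            Subject_Title TEXT               -- 20 ch
        )""")

    # Marks: one row per student, subject, and academic year, so the
    # primary key is assumed to be the combination of those three fields
    cur.execute("""
        CREATE TABLE Marks (
            Student_No    TEXT REFERENCES Student(Student_No),
            Academic_Year TEXT,  -- 9 ch
            Subject_Code  TEXT REFERENCES Subject(Subject_Code),
            A_Mark        TEXT,  -- 2 ch
            F_Mark        TEXT,  -- 2 ch
            Attempt       TEXT,  -- 1 ch
            PRIMARY KEY (Student_No, Academic_Year, Subject_Code)
        )""")
    conn.commit()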
The Traditional Approach to Data Management
The Database Approach
At one time, information systems
referenced specific files containing
relevant data. For example, a payroll
system would use a payroll file. Each
distinct operational system used data files
dedicated to that system.
Today, most organizations use the
Database approach to data management,
where multiple information systems share
a pool of related data.
To use the database approach to data management, additional software, a Database Management
System (DBMS), is required.
DBMS consists of a group of programs that can be used as an interface between a database and
the user of the database. Typically, this software acts as a buffer between the application programs
and the database itself.
Data Modeling and Database Characteristics
When building a database, an organization must carefully consider the following questions:
• Content. What data should be collected and at what cost?
• Access. What data should be provided to which users and when?
• Logical structure. How should data be arranged so that it makes sense to a given user?
• Physical organization. Where should data be physically located?
• Archiving. How long must this data be stored?
• Security. How can this data be protected from unauthorized access?
Data Modeling
Data Model: A diagram of entities and their relationships. Data modeling usually involves
developing an understanding of a specific business problem and then analyzing the data and
information needed to deliver a solution.

Relational Database Model: A simple but highly useful way to organize data into collections
of two-dimensional tables called relations.

In the relational model, data is placed in two-dimensional tables, or relations. As long as they
share at least one common attribute, these relations can be linked to produce useful output
information. In this example, all three tables include the Dept. number attribute.

Domain: The range of allowable values for a data attribute.


Manipulating Data
After entering data into a relational database, users can make inquiries and analyze the data.
Basic data manipulations include selecting, projecting, and joining

Selecting: Manipulating data to eliminate rows according to certain criteria.

Projecting: Manipulating data to eliminate columns in a table.

Joining: Manipulating data to combine two or more tables.

Linking: The ability to combine two or more tables through common data attributes to form a
new table with only the unique data attributes.
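A small sketch of these operations on plain Python lists of rows (the employee and department sample data are made-up; each dict plays the role of one record):

    employees = [
        {"name": "Ali",  "dept_no": 10, "salary": 900},
        {"name": "Sara", "dept_no": 20, "salary": 1200},
        {"name": "Omar", "dept_no": 10, "salary": 1100},
    ]
    departments = [
        {"dept_no": 10, "dept_name": "Accounting"},
        {"dept_no": 20, "dept_name": "Engineering"},
    ]

    # Selecting: eliminate rows according to a criterion
    high_paid = [row for row in employees if row["salary"] > 1000]

    # Projecting: eliminate columns, keeping only the name
    names_only = [{"name": row["name"]} for row in employees]

    # Joining/linking: combine the two tables on the common dept_no attribute
    joined = [
        {**e, "dept_name": d["dept_name"]}
        for e in employees
        for d in departments
        if e["dept_no"] == d["dept_no"]
    ]

    print(high_paid)   # rows for Sara and Omar
    print(names_only)  # just the name column
    print(joined)      # every employee together with a department name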
One of the primary advantages of a relational
database is that it allows tables to be linked,
as shown in the figure. This linkage reduces data
redundancy and allows data to be organized
more logically. The ability to link to the
manager’s Social Security number stored
once in the Manager table eliminates the need
to store it multiple times in the Project table.

The relational database model is widely used. It is easier to control, more flexible, and more
intuitive than other approaches because it organizes data in tables. Databases based on the
relational model include Oracle, IBM DB2, Microsoft SQL Server, Microsoft Access, MySQL,
Sybase, and others.
Data Cleansing

Data used in decision making must be accurate, complete, economical, flexible, reliable, relevant,
simple, timely, verifiable, accessible, and secure.

Data cleansing (data cleaning or data scrubbing): is the process of detecting and then
correcting or deleting incomplete, incorrect, inaccurate, or irrelevant records that reside in a
database. The goal of data cleansing is to improve the quality of the data used in decision making.
The “bad data” may have been caused by user data-entry errors or by data corruption during data
transmission or storage. Data cleansing is different from data validation, which involves the
identification of “bad data” and its rejection at the time of data entry.

One data cleansing solution is to identify and correct data by cross-checking it against a
validated data set.
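A minimal sketch of this cross-checking idea, assuming a hypothetical validated lookup table of correct city names keyed by postal code:

    # Validated reference data (hypothetical): postal code -> correct city name
    validated_cities = {"41001": "Mosul", "10001": "Baghdad"}

    # Incoming records, some with bad or unverifiable city values
    records = [
        {"postal_code": "41001", "city": "Mosl"},     # misspelled
        {"postal_code": "10001", "city": "Baghdad"},  # already correct
        {"postal_code": "99999", "city": "Unknown"},  # no validated entry
    ]

    cleaned, flagged = [], []
    for rec in records:
        correct = validated_cities.get(rec["postal_code"])
        if correct is None:
            flagged.append(rec)        # cannot be verified; set aside for review
        else:
            rec["city"] = correct      # correct the value from the validated set
            cleaned.append(rec)

    print(cleaned)  # records with corrected city names
    print(flagged)  # records that could not be cross-checked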
Database Management Systems (DBMSs)

A DBMS is a group of programs used as an interface between a database and application programs
or between a database and the user. Database management systems come in a wide variety of
types and capabilities, ranging from small inexpensive software packages to sophisticated systems
costing hundreds of thousands of dollars.

• SQL Databases
SQL (Structured Query Language): A special-purpose programming language for
accessing and manipulating data stored in a relational database. SQL was originally developed
at IBM Research.
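A short sketch of SQL in action, using Python's built-in sqlite3 module (the employee table and its rows are made-up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()
    cur.execute("CREATE TABLE employee (emp_no INTEGER PRIMARY KEY, last_name TEXT, dept_no INTEGER)")
    cur.executemany(
        "INSERT INTO employee VALUES (?, ?, ?)",
        [(1, "Ahmed", 10), (2, "Hassan", 20), (3, "Salim", 10)],
    )

    # One SELECT statement combines selecting (the WHERE clause)
    # and projecting (the column list)
    for row in cur.execute("SELECT last_name FROM employee WHERE dept_no = 10"):
        print(row)  # ('Ahmed',) then ('Salim',)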
Big Data
Big Data is the term used to describe data collections that are so enormous (terabytes or more)
and complex (from sensor data to social media data) that traditional data management software,
hardware, and analysis processes are incapable of dealing with them.

Characteristics of Big Data:


• Volume. In 2014, it was estimated that the volume of data that exists in the
digital universe was 4.4 zettabytes (one zettabyte equals one trillion gigabytes).
The digital universe is expected to grow to an amazing 44 zettabytes by 2020.
• Velocity. The velocity at which data is currently coming at us exceeds 5 trillion bits per second.
This rate is accelerating rapidly, and the volume of digital data is expected to double every two
years between now and 2020.
• Variety. Data today comes in a variety of formats. Some of the data is what computer scientists
call structured data: its format is known in advance, and it fits nicely into traditional databases.
However, most of the data that an organization must deal with is unstructured data, meaning
that it is not organized in any predefined manner. Unstructured data comes from sources such
as word-processing documents, social media, email, photos, surveillance video, and phone
messages.
Sources of Big Data

Much of this collected data is unstructured and does not fit neatly into traditional relational
database management systems.

(Figure: Sources of an organization's useful data)
Technologies Used to Process Big Data
Data Warehouses
A data warehouse is a database that holds
business information from many sources in
the enterprise, covering all aspects of the
company’s processes, products, and
customers.
Data warehouses allow managers to “drill
down” to get greater detail or “roll up” to
generate aggregate or summary reports. The
primary purpose is to relate information in
innovative ways and help managers and
executives make better decisions. A data
warehouse stores historical data that has
been extracted from operational systems and
external data sources.
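"Roll up" corresponds to aggregation and "drill down" to retrieving the detail behind a summary. A minimal sqlite3 sketch of both (the sales table and its rows are made-up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (region TEXT, city TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("North", "Mosul", 100.0), ("North", "Erbil", 150.0), ("South", "Basra", 80.0),
    ])

    # Roll up: aggregate detailed rows into one summary row per region
    for row in cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)  # e.g., ('North', 250.0) and ('South', 80.0)

    # Drill down: fetch the detail behind the 'North' summary row
    for row in cur.execute("SELECT city, amount FROM sales WHERE region = 'North'"):
        print(row)  # ('Mosul', 100.0) then ('Erbil', 150.0)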
Data Marts
A Data Mart is a subset of a data warehouse. Data marts bring the data warehouse concept (online
analysis of sales, inventory, and other vital business data gathered from transaction
processing systems) to small and medium-sized businesses and to departments within larger
companies. Rather than storing all enterprise data in one monolithic database, data marts contain a
subset of the data for a single aspect of a company's business.

Data Lakes
A traditional data warehouse is created by extracting (and discarding some data in the process),
transforming (modifying), and loading incoming data for predetermined and specific analyses and
applications. This process can be lengthy and computer intensive, taking days to complete. A data
lake (also called an enterprise data hub) takes a “store everything” approach to big data, saving all
the data in its raw and unaltered form. The raw data residing in a data lake is available when users
decide just how they want to use the data to glean new insights. Only when the data is accessed
for a specific analysis is it extracted from the data lake, classified, organized, edited, or
transformed. Thus a data lake serves as the definitive source of data in its original, unaltered form.
Its contents can include business transactions, clickstream data, sensor data, server logs, social
media, videos, and more.
NoSQL Databases
A NoSQL database provides a means to store and retrieve data that is modeled using some means
other than the simple two-dimensional tabular relations used in relational databases. Such
databases are being used to deal with the variety of data found in big data and Web applications.
A major advantage of NoSQL databases is the ability to spread data over multiple servers so that
each server contains only a subset of the total data. This so-called horizontal scaling capability
enables hundreds or even thousands of servers to operate on the data, providing faster response
times for queries and updates. Most relational database management systems have problems with
such horizontal scaling and instead require large, powerful, and expensive proprietary servers
and large storage systems.
Another advantage of NoSQL databases is that they do not require a predefined schema; data
entities can have attributes edited or assigned to them at any time.
The choice of a relational database management system versus a NoSQL solution depends on the
problem that needs to be addressed. Often, the data structures used by NoSQL databases are
more flexible than relational database tables and, in many cases, they can provide improved
access speed and redundancy.
The four main categories of NoSQL databases are:

• Key–value NoSQL databases are similar to SQL databases, but have only two columns (“key”
and “value”), with more complex information sometimes stored within the “value” columns
(see the sketch after this list).
• Document NoSQL databases are used to store, retrieve, and manage document-oriented
information, such as social media posts and multimedia, also known as semi-structured data.
• Graph NoSQL databases are used to understand the relationships among events, people,
transactions, locations, and sensor readings and are well suited for analyzing interconnections
such as when extracting data from social media.
• Column NoSQL databases store data in columns, rather than in rows, and are able to deliver
fast response times for large volumes of data.
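A minimal sketch of the key–value idea, using a plain Python dict as a stand-in for a key–value store (real systems expose a similar get/put interface; the customer keys and JSON values are hypothetical):

    import json

    store = {}  # conceptually, the whole database is one big key -> value map

    # "put": more complex information can be packed inside the value, e.g. as JSON
    store["customer:1001"] = json.dumps({"name": "Layla", "city": "Mosul"})
    store["customer:1002"] = json.dumps({"name": "Yusuf", "city": "Erbil"})

    # "get": look up by key, then unpack the value
    record = json.loads(store["customer:1001"])
    print(record["name"])  # Layla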
Hadoop is an open-source software framework that includes several software modules that
provide a means for storing and processing extremely large data sets. Hadoop has two primary
components:
• MapReduce, the data processing component
• the Hadoop Distributed File System (HDFS), for data storage
Hadoop divides data into subsets and
distributes the subsets onto different servers for
processing. A Hadoop cluster may consist of
thousands of servers. In a Hadoop cluster, a subset
of the data within the HDFS and the MapReduce
system are housed on every server in the cluster.
This places the data processing software on the
same servers where the data is stored, thus
speeding up data retrieval. This approach creates a
highly redundant computing environment that
allows the application to keep running even if
individual servers fail.
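The processing component follows the MapReduce pattern: a map step turns each piece of input into key–value pairs, and a reduce step aggregates all pairs that share a key. A single-machine Python sketch of the pattern (real Hadoop spreads the same steps across the servers of a cluster; the word-count task and toy input are the classic illustration, not part of Hadoop itself):

    from collections import defaultdict

    documents = ["big data needs big storage", "data moves fast"]  # toy input

    # Map: emit a (word, 1) pair for every word in every document
    pairs = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the pairs by key (the word)
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # Reduce: collapse each group into a single result per key
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)  # {'big': 2, 'data': 2, 'needs': 1, ...}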
