
IT6701 – INFORMATION MANAGEMENT

UNIT I - DATABASE MODELLING, MANAGEMENT AND DEVELOPMENT

Database design and modelling - Business Rules and Relationship; Java database Connectivity
(JDBC), Database connection Manager, Stored Procedures. Trends in Big Data systems
including NoSQL - Hadoop HDFS, MapReduce, Hive, and enhancements.

1.1 DATABASE DESIGN


1.2 DATABASE MODELLING

1.2.1 ER MODEL
1.2.1.1 BASIC CONCEPTS OF ER MODEL

i) Entity:
An entity is a thing in the real world. It may be a concrete object with a physical existence, for
example, a person, student, car, or employee, or it may be an object with a conceptual existence or
abstraction, for example, a company, a job, or a university course.

Strong versus Weak Entity:

A strong entity is independent of other entity types; it always possesses one or more
attributes that uniquely distinguish each occurrence of the entity.

A weak entity depends on some other entity type and does not possess attributes that uniquely
identify its occurrences.
ii) Attribute:
Attributes are properties that describe the entity. For example, a student can be described using
attributes such as registration_no, roll_no, name, class, marks, and total percentage.

iii) Null Values:
A null value is used when a particular entity does not have an applicable value for an attribute,
or when the value exists but is unknown.

iv) Complex attributes:
Composite and multivalued attributes can be nested to form complex attributes; for example, a
person may have more than one residence, and each residence may have a composite address and
several phone numbers.

1.2.2 ER TO RELATIONAL DATA MODEL:


Relationship versus Weak Relationship:

Relationships are meaningful associations between or among entities.

A weak relationship, or identifying relationship, is the connection that exists between a weak
entity type and its owner entity type.
Enhanced ER Model:
 Generalization
 Specialization
 Aggregation

Generalization is a bottom-up approach in which two lower-level entities combine to form a higher-
level entity. In generalization, the higher-level entity can also combine with other lower-level
entities to make a further higher-level entity.
Specialization is the opposite of generalization. It is a top-down approach in which one higher-level
entity can be broken down into two or more lower-level entities. In specialization, some higher-level
entities may not have lower-level entity sets at all.

Aggregation is a process in which a relationship between two entities is treated as a single entity.
For example, the relationship between Center and Course can itself act as an entity that participates
in a relationship with Visitor.
Normalization:
Normalization is a formal process for deciding which attributes should be grouped together in a
relation so that all anomalies are removed.
It is a logical design method that minimizes data redundancy and reduces design flaws.
• Consists of applying various "normal" forms to the database design.
• The normal forms break down large tables into smaller subsets.

First Normal Form (1NF)


Each attribute must be atomic
• No repeating columns within a row.
• No multi-valued columns.
1NF simplifies attributes
• Queries become easier.

Second Normal Form (2NF)


Each attribute must be functionally dependent on the primary key.
• Functional dependence - the property of one or more attributes that uniquely
determines the value of other attributes.
• Any non-dependent attributes are moved into a smaller (subset) table.
2NF improves data integrity.
• Prevents update, insert, and delete anomalies.

Functional Dependence
Name, dept_no, and dept_name are functionally dependent on emp_no. (emp_no -> name,
dept_no, dept_name)
Skills is not functionally dependent on emp_no since it is not unique to each emp_no.
2NF

Data Integrity

• Insert Anomaly - adding a row forces null values; e.g., a new department cannot be inserted
without also supplying a value for the primary key emp_no.
• Update Anomaly - a single change requires multiple updates, which causes performance
degradation; e.g., changing the IT dept_name to IS must be repeated in every employee row.
• Delete Anomaly - deleting a row removes information that is still wanted; e.g., deleting the IT
department also removes employee Barbara Jones from the database.
Third Normal Form (3NF)
Remove transitive dependencies.
• Transitive dependence - two separate entities exist within one table.
• Any transitive dependencies are moved into a smaller (subset) table.
3NF further improves data integrity.
• Prevents update, insert, and delete anomalies.

Transitive Dependence

Dept_no and dept_name are functionally dependent on emp_no; however, department can be
considered a separate entity, so the dependency of dept_name on emp_no through dept_no is transitive.

3NF
Boyce-Codd Normal Form (BCNF)
BCNF is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter
than 3NF. A table complies with BCNF if it is in 3NF and, for every functional dependency X -> Y,
X is a super key of the table.

Example: Suppose there is a company in which employees work in more than one department.
They store the data like this:
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600

Functional dependencies in the table above:


emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate key: {emp_id, emp_dept}


The table is not in BCNF because neither emp_id nor emp_dept alone is a key.
To make the table comply with BCNF, we can break it into three tables like this:

emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American

emp_dept table:
emp_dept dept_type dept_no_of_emp
Production and planning D001 200
stores D001 250
design and technical support D134 100
Purchasing department D134 600

emp_dept_mapping table:
emp_id emp_dept
1001 Production and planning
1001 stores
1002 design and technical support
1002 Purchasing department

Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate keys:
For first table: emp_id
For second table: emp_dept
For third table: {emp_id, emp_dept}

These tables are now in BCNF because, in both functional dependencies, the left-hand side is a key.

Fourth Normal Form (4NF)

A relation is said to be in 4NF when:
1. It is in BCNF.
2. There is no multivalued dependency in the relation.
A multivalued dependency (MVD) occurs when two or more independent multivalued attributes of
the same entity occur within the same table; it is denoted A ->-> B.

Multivalued dependencies cause delete and update anomalies, which decomposition to 4NF removes.

EMPID LANG SKILLS


101 ENG TEACH
101 HIN CONVER
101 ENG CONVER
101 HIN TEACH
202 ENG SING
202 HIN CONVER

Here each EMPID has more than one language and more than one skill, so
EMPID ->-> LANG
EMPID ->-> SKILLS
So, if employee 101 stops teaching in Hindi, we also have to delete one more row (ENG, TEACH)
to keep the table consistent.
To remove the anomaly, decompose the above relation into two relations:
1. EMP1 (EMPID, LANG)
2. EMP2 (EMPID, SKILLS)
EMPID LANG
101 ENG
101 HIN
202 ENG
202 HIN

EMPID SKILLS
101 TEACH
101 CONVER
202 SING
202 CONVER

Now both the relations are in 4NF.

Fifth Normal Form (5NF)


A relation is said to be in 5NF if and only if:
 It is in 4NF.

 If we can decompose the table further to eliminate redundancy and anomalies, then re-joining
the decomposed tables by means of their candidate keys must neither lose any of the original
data nor produce any new records. In simple words, joining two or more decomposed tables
should neither lose records nor create spurious records.
Consider an example of different subjects taught by different lecturers, with the lecturers taking
classes for different semesters.

Note: Assume that Semester 1 has Mathematics, Physics, and Chemistry, and Semester 2 has only
Mathematics in its academic year.

Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF and,
when the decomposed tables are joined using their keys, the join yields the correct records. Here,
we can represent each lecturer's subject area and classes in a better way. We can divide the above
table into three -
(SUBJECT, LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS)

Now each of the combinations is stored in one of three different tables. If we need to identify who
is teaching which subject to which semester, we join the keys of each table and get the result.

For example, to find who teaches Physics to Semester 1, we would select Physics and Semester 1
from table 3 above, join with table 1 using Subject to filter out the lecturer names, and then join
with table 2 using Lecturer to get the correct lecturer name. That is, we joined the key columns of
each table to get the correct data. Hence no data is lost and no spurious data is created, satisfying
the 5NF condition.

1.2.3 Physical Database Design and Performance:

Physical database design translates the logical data model into a set of SQL statements
that define the database. For relational database systems, it is relatively easy to translate from a
logical data model into a physical database. Rules for translation: Entities become tables in the
physical database.
The primary goal of physical database design is data processing efficiency. Designing
physical files and databases requires certain information that should have been collected and
produced during prior systems development phases.

The information needed for physical file and database design includes these
requirements:
• Normalized relations, including estimates for the range of the number of rows in each
table
• Definitions of each attribute, along with physical specifications such as maximum
possible length
• Descriptions of where and when data are used in various ways (entered, retrieved,
deleted, and updated, including typical frequencies of these events)
• Expectations or requirements for response time and data security, backup, recovery,
retention, and integrity
• Descriptions of the technologies (database management systems) used for implementing
the database.

Logical Design Compared with Physical Design

Physical database design requires several critical decisions that will affect the integrity and
performance of the application system. These key decisions include the following:

 Choosing the storage format (called data type) for each attribute from the logical data
model. The format and associated parameters are chosen to maximize data integrity and
to minimize storage space.
 Giving the database management system guidance regarding how to group attributes
from the logical data model into physical records.
 Giving the database management system guidance regarding how to arrange similarly
structured records in secondary memory (primarily hard disks), using a structure (called a
file organization) so that individual and groups of records can be stored, retrieved, and
updated rapidly.
 Selecting structures (including indexes and the overall database architecture) for storing
and connecting files to make retrieving related data more efficient.
 Preparing strategies for handling queries against the database that will optimize
performance and take advantage of the file organizations and indexes that you have
specified. Efficient database structures will be of benefit only if queries and the database
management systems that handle those queries are tuned to intelligently use those
structures.
1.3 BUSINESS RULES
A business rule is "a statement that defines or constrains some aspect of the business. It is
intended to assert business structure or to control or influence the behaviour of the business.
Rules prevent, cause, or suggest things to happen". For example, the following two
statements are common expressions of business rules that affect data processing and storage:
• "A student may register for a section of a course only if he or she has successfully
completed the prerequisites for that course."
• "A preferred customer qualifies for a 10 percent discount, unless he has an overdue
account balance."

THE BUSINESS RULES PARADIGM


The concept of business rules has been used in information systems for some time. There are
many software products that help organizations manage their business rules (for example,
JRules from ILOG, an IBM company).

In the database world, it has been more common to use the related term integrity constraint
when referring to such rules. The intent of this term is somewhat more limited in scope,
usually referring to maintaining valid data values and relationships in the database.

A business rules approach is based on the following premises:


• Business rules are a core concept in an enterprise because they are an expression of
business policy and guide individual and aggregate behavior. Well-structured business rules
can be stated in natural language for end users and in a data model for systems developers.
• Business rules can be expressed in terms that are familiar to end users. Thus, users can
define and then maintain their own rules.
• Business rules are highly maintainable. They are stored in a central repository, and each
rule is expressed only once, then shared throughout the organization. Each rule is discovered
and documented only once, to be applied in all systems development projects.
• Enforcement of business rules can be automated through the use of software that can
interpret the rules and enforce them using the integrity mechanisms of the database
management system.

Scope of Business Rules


In this chapter and the next, we are concerned with business rules that impact only an
organization's databases. Most organizations have a host of rules and/or policies that fall
outside this definition. For example, the rule "Friday is business casual dress day" may be an
important policy statement, but it has no immediate impact on databases. In contrast, the rule
"A student may register for a section of a course only if he or she has successfully completed
the prerequisites for that course" is within our scope because it constrains the transactions that
may be processed against the database.

GOOD BUSINESS RULES


Whether stated in natural language, a structured data model, or other information systems
documentation, a business rule will have certain characteristics if it is to be consistent with
the premises outlined previously.

GATHERING BUSINESS RULES


Business rules appear in descriptions of business functions, events, policies, units,
stakeholders, and other objects. These descriptions can be found in interview notes from
individual and group information systems requirements collection sessions, organizational
documents (e.g., personnel manuals, policies, contracts, marketing brochures, and technical
instructions), and other sources. Rules are identified by asking questions about who, what,
when, where, why, and how of the organization.
Data Names and Definitions
Fundamental to understanding and modelling data are naming and defining data objects. Data
objects must be named and defined before they can be used unambiguously in a model of
organizational data.

Characteristics of a Good Business Rule

DATA NAMES
There are guidelines for naming entities, relationships, and attributes as we develop the entity-
relationship data model, and there are also some general guidelines about naming any data object.
Data names should

• Relate to business, not technical (hardware or software), characteristics; so, Customer is a
good name, but File10, Bit7, and Payroll Report Sort Key are not good names.
• Be meaningful, almost to the point of being self-documenting (i.e., the definition will refine
and explain the name without having to state the essence of the object's meaning); you should
avoid using generic words such as has, is, person, or it.
• Be unique from the name used for every other distinct data object; words should be
included in a data name if they distinguish the data object from other similar data objects
(e.g., Home Address versus Campus Address).
• Be readable, so that the name is structured as the concept would most naturally be said (e.g.,
Grade Point Average is a good name, whereas Average Grade Relative To A, although
possibly accurate, is an awkward name).
• Be composed of words taken from an approved list; each organization often chooses a
vocabulary from which significant words in data names must be chosen (e.g., maximum is
preferred, never upper limit, ceiling, or highest); alternative, or alias, names also can be used,
as can approved abbreviations (e.g., CUST for CUSTOMER).
• Be repeatable, meaning that different people or the same person at different times should
develop exactly or almost the same name; this often means that there is a standard hierarchy
or pattern for names (e.g., the birth date of a student would be Student Birth Date and the
birth date of an employee would be Employee Birth Date).
• Follow a standard syntax, meaning that the parts of the name should follow a standard
arrangement adopted by the organization.

Salin (1990) suggests that you develop data names by


1. Preparing a definition of the data. (We talk about definitions next.)
2. Removing insignificant or illegal words (words not on the approved list for names); note
that the presence of AND and OR in the definition may imply that two or more data objects
are combined, and you may want to separate the objects and assign different names.
3. Arranging the words in a meaningful, repeatable way.
4. Assigning a standard abbreviation for each word.
5. Determining whether the name already exists, and if so, adding other qualifiers that make
the name unique.

DATA DEFINITIONS
A definition (sometimes called a structural assertion) is considered a type of business rule.
A definition is an explanation of a term or a fact. A term is a word or phrase that has a specific
meaning for the business.
Examples of terms are course, section, rental car, flight, reservation, and passenger. Terms
are often the key words used to form data names. Terms must be defined carefully and
concisely.

However, there is no need to define common terms such as day, month, person, or television,
because these terms are understood without ambiguity by most persons. A fact is an
association between two or more terms. A fact is documented as a simple declarative
statement that relates terms. Examples of facts that are definitions are the following (the
defined terms are underlined):
• "A course is a module of instruction in a particular subject area." This definition associates
two terms: module of instruction and subject area. We assume that these are common terms
that do not need to be further defined.
• "A customer may request a model of car from a rental branch on a particular date." This fact,
which is a definition of model rental request, associates the four underlined terms.

GOOD DATA DEFINITIONS


Some general guidelines to follow:
• Definitions (and all other types of business rules) are gathered from the same sources as all
requirements for information systems.
• Definitions will usually be accompanied by diagrams, such as entity-relationship diagrams.
The definition does not need to repeat what is shown on the diagram but rather supplement
the diagram.
• Definitions will be stated in the singular and explain what the data is, not what it is not. A
definition will use commonly understood terms and abbreviations and stand alone in its
meaning, not embedding other definitions within it.
A good definition may also document:
• Subtleties
• Special or exceptional conditions
• Examples
• Where, when, and how the data are created or calculated in the organization
• Whether the data are static or change over time
• Whether the data are singular or plural in their atomic form
• Who determines the value for the data
• Who owns the data (i.e., who controls the definition and usage)
• Whether the data are optional or whether empty (what we will call null) values are allowed
• Whether the data can be broken down into more atomic parts or are often combined with
other data into some more composite or aggregate form.
• A data object should not be added to a data model, such as an entity-relationship diagram,
until after it has been carefully defined (and named) and there is agreement on this definition.

1.4 JDBC (Java Database Connectivity)

Java Database Connectivity (JDBC) is an application programming interface (API) which allows
the programmer to connect and interact with databases. It provides methods to query and update
data in the database through update statements like SQL's CREATE, UPDATE, DELETE and
INSERT and query statements such as SELECT. Additionally, JDBC can run stored procedures.

1.4.1 JDBC Driver Types

JDBC drivers are divided into four types or levels. The different types of JDBC drivers are:

Type 1: JDBC-ODBC Bridge driver (Bridge)
Type 2: Native-API/partly Java driver (Native)
Type 3: All Java/Net-protocol driver (Middleware)
Type 4: All Java/Native-protocol driver (Pure)

The four types of JDBC drivers are elaborated in detail below:

Type 1 JDBC Driver


JDBC-ODBC Bridge driver
The Type 1 driver translates all JDBC calls into ODBC calls and sends them to the ODBC
driver. ODBC is a generic API. The JDBC-ODBC Bridge driver is recommended only for
experimental use or when no other alternative is available.

Type 1: JDBC-ODBC Bridge

Advantage
The JDBC-ODBC Bridge allows access to almost any database, since the database’s ODBC
drivers are already available.

Disadvantages
1. Since the Bridge driver is not written fully in Java, Type 1 drivers are not portable.
2. A performance issue is seen as a JDBC call goes through the bridge to the ODBC driver, then
to the database, and this applies even in the reverse process. They are the slowest of all driver
types.
3. The client system requires the ODBC Installation to use the driver.
4. Not good for the Web.

Type 2 JDBC Driver


Native-API/partly Java driver

The distinctive characteristic of Type 2 JDBC drivers is that they convert JDBC calls into
database-specific calls, i.e., the driver is specific to a particular database. Example: Oracle will
have an Oracle native API.
Type 2: Native api/ Partly Java Driver

Advantage
Type 2 drivers typically offer better performance than the JDBC-ODBC Bridge because there are
fewer layers of communication (tiers) than in Type 1, and the driver uses the native API, which is
database specific.

Disadvantage
1. The native API must be installed on the client system, and hence Type 2 drivers cannot be used
for the Internet.
2. Like Type 1 drivers, they are not written entirely in Java, which creates a portability issue.
3. If we change the database, we have to change the native API, as it is specific to a database.
4. Mostly obsolete now.
5. Usually not thread safe.

Type 3 JDBC Driver


All Java/Net-protocol driver

With a Type 3 driver, database requests are passed through the network to a middle-tier server.
The middle tier then translates the request for the database. The middle-tier server can in turn use
Type 1, Type 2, or Type 4 drivers.
Type 3: All Java/ Net-Protocol Driver

Advantage
1. This driver is server-based, so there is no need for any vendor database library to be present on
client machines.
2. This driver is fully written in Java and hence Portable. It is suitable for the web.
3. There are many opportunities to optimize portability, performance, and scalability.
4. The net protocol can be designed to make the client JDBC driver very small and fast to load.
5. The type 3 driver typically provides support for features such as caching (connections, query
results, and so on), load balancing, and advanced
system administration such as logging and auditing.
6. This driver is very flexible and allows access to multiple databases using one driver.
7. They are the most efficient amongst all driver types.

Disadvantage
It requires an additional middle-tier server application to be installed and maintained. Traversing
the recordset may take longer, since the data comes through the backend server.

Type 4 JDBC Driver


Native-protocol/all-Java driver

The Type 4 driver uses Java networking libraries to communicate directly with the database server.
Type 4: Native-protocol/all-Java driver

Advantage
1. The major benefit of Type 4 JDBC drivers is that they are completely written in Java, which
achieves platform independence and eliminates deployment administration issues. This type is the
most suitable for the web.
2. The number of translation layers is very small; i.e., Type 4 JDBC drivers do not have to translate
database requests to ODBC or a native connectivity interface or pass the request on to another
server, so performance is typically quite good.
3. You don’t need to install special software on the client or server. Further, these drivers can be
downloaded dynamically.

Disadvantage
With type 4 drivers, the user needs a different driver for each database.

1.4.2 Steps to Connect Java with Database

There are five steps to connect any Java application with a database using JDBC. They
are as follows:
 Register the driver class
 Creating connection
 Creating statement
 Executing queries
 Closing connection

1) Register the driver class


The forName() method of the Class class is used to register the driver class. This method
dynamically loads the driver class.
Syntax of forName() method:
public static void forName(String className)throws ClassNotFoundException
Example to register the OracleDriver class
Class.forName("oracle.jdbc.driver.OracleDriver");

2) Create the connection object


The getConnection() method of DriverManager class is used to establish connection with the
database.

Syntax of getConnection() method:


1) public static Connection getConnection(String url)throws SQLException
2) public static Connection getConnection(String url,String name,String password)
throws SQLException
Example to establish connection with the Oracle database
Connection con=DriverManager.getConnection(
"jdbc:oracle:thin:@localhost:1521:xe","system","password");

3) Create the Statement object


The createStatement() method of the Connection interface is used to create a Statement object.
The Statement object is responsible for executing queries against the database.

Syntax of createStatement() method:


public Statement createStatement()throws SQLException
Example to create the statement object
Statement stmt=con.createStatement();

4) Execute the query


The executeQuery() method of the Statement interface is used to execute queries against the
database. This method returns a ResultSet object that can be used to get all the records of a table.

Syntax of executeQuery() method:


public ResultSet executeQuery(String sql)throws SQLException
Example to execute query
ResultSet rs=stmt.executeQuery("select * from emp");
while(rs.next()){
System.out.println(rs.getInt(1)+" "+rs.getString(2));
}

5) Close the connection object


By closing the Connection object, the Statement and ResultSet objects will be closed automatically.
The close() method of the Connection interface is used to close the connection.

Syntax of close () method:


public void close()throws SQLException
Example to close connection
con.close();
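
Putting the five steps together, the following is a minimal end-to-end sketch. The connection URL,
the user name and password (system/password), and the emp table are taken from the examples
above; the try-with-resources form is an assumption about coding style, not part of the original
steps, and it performs the closing step automatically.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcDemo {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // 1) Register the driver class
        Class.forName("oracle.jdbc.driver.OracleDriver");
        // 2) Create the connection and 3) create the statement (both auto-closed below)
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:xe", "system", "password");
             Statement stmt = con.createStatement();
             // 4) Execute the query
             ResultSet rs = stmt.executeQuery("select * from emp")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getString(2));
            }
        } // 5) Connection, Statement, and ResultSet are closed automatically here
    }
}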
1.5 DATABASE CONNECTION MANAGER
An OLE DB connection manager enables a package to connect to a data source by using an OLE
DB provider. For example, an OLE DB connection manager that connects to SQL Server can use
the Microsoft OLE DB Provider for SQL Server.

When you add an OLE DB connection manager to a package, Integration Services creates a
connection manager that will resolve to an OLE DB connection at run time, sets the connection
manager properties, and adds the connection manager to the Connections collection on the
package.

The ConnectionManagerType property of the connection manager is set to OLEDB.

The OLE DB connection manager can be configured in the following ways:


 Provide a specific connection string configured to meet the requirements of the selected
provider.
 Depending on the provider, include the name of the data source to connect to.
 Provide security credentials as appropriate for the selected provider.
 Indicate whether the connection that is created from the connection manager is retained at
run time.
Data connections
Select an existing OLE DB data connection from the list.
Data connection properties
View properties and values for the selected OLE DB data connection.
New
Create an OLE DB data connection by using the Connection Manager dialog box.
Delete
Select a data connection, and then delete it by using the Delete button.
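
The same configuration choices (a provider-specific connection string, credentials, and whether a
connection is retained for reuse at run time) also appear when a connection manager is written by
hand in Java. The sketch below is only an illustrative analogy in JDBC terms, not the OLE DB
connection manager described above; the class name SimpleConnectionManager and its properties
are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical, minimal connection manager: it holds the connection string and
// credentials, and optionally retains one open connection for reuse at run time.
public class SimpleConnectionManager {
    private final String url;        // provider-specific connection string
    private final String user;
    private final String password;
    private final boolean retain;    // keep the connection open between uses?
    private Connection cached;

    public SimpleConnectionManager(String url, String user, String password, boolean retain) {
        this.url = url;
        this.user = user;
        this.password = password;
        this.retain = retain;
    }

    public synchronized Connection getConnection() throws SQLException {
        if (retain && cached != null && !cached.isClosed()) {
            return cached;                              // reuse the retained connection
        }
        Connection con = DriverManager.getConnection(url, user, password);
        if (retain) {
            cached = con;
        }
        return con;
    }

    public synchronized void close() throws SQLException {
        if (cached != null) {
            cached.close();
            cached = null;
        }
    }
}

In production Java code this role is usually played by a pooled javax.sql.DataSource rather than a
hand-written class.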

1.6 STORED PROCEDURES


A stored procedure in SQL Server is a group of one or more Transact-SQL statements or a
reference to a Microsoft .NET Framework common runtime language (CLR) method. Procedures
resemble constructs in other programming languages because they can:
 Accept input parameters and return multiple values in the form of output parameters to
the calling program.
 Contain programming statements that perform operations in the database. These include
calling other procedures.
 Return a status value to a calling program to indicate success or failure (and the reason
for failure).
Types of Stored Procedures
User-defined
A user-defined procedure can be created in a user-defined database or in all system databases
except the Resource database. The procedure can be developed in either Transact-SQL or as a
reference to a Microsoft .NET Framework common runtime language (CLR) method.

Temporary
Temporary procedures are a form of user-defined procedures. The temporary procedures are like
a permanent procedure, except temporary procedures are stored in tempdb. There are two types
of temporary procedures: local and global. They differ from each other in their names, their
visibility, and their availability. Local temporary procedures have a single number sign (#) as the
first character of their names; they are visible only to the current user connection, and they are
deleted when the connection is closed. Global temporary procedures have two number signs (##)
as the first two characters of their names; they are visible to any user after they are created, and
they are deleted at the end of the last session using the procedure.

System
System procedures are included with SQL Server. They are physically stored in the internal,
hidden Resource database and logically appear in the sys schema of every system- and user-
defined database. In addition, the msdb database also contains system stored procedures in
the dbo schema that are used for scheduling alerts and jobs. Because system procedures start
with the prefix sp_, we recommend that you do not use this prefix when naming user-defined
procedures.

SQL Server supports the system procedures that provide an interface from SQL Server to
external programs for various maintenance activities. These extended procedures use the xp_
prefix.

Extended User-Defined
Extended procedures enable creating external routines in a programming language such as C.
These procedures are DLLs that an instance of SQL Server can dynamically load and run.

Benefits of Stored Procedures

The following list describes some benefits of using procedures.


 Reduced server/client network traffic
 Stronger security
 Reuse of code
 Easier maintenance
 Improved performance
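
From the JDBC side, a stored procedure is invoked through a CallableStatement. The sketch below
is a hedged example: the procedure name getEmpName, its parameters, and the open Connection
con are hypothetical and only illustrate the call pattern (input parameter, output parameter,
execution).

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

public class StoredProcDemo {
    // Assumes an open Connection "con" and a stored procedure
    // getEmpName(IN emp_id INT, OUT emp_name VARCHAR) created beforehand (hypothetical).
    public static String callGetEmpName(Connection con, int empId) throws SQLException {
        try (CallableStatement cs = con.prepareCall("{call getEmpName(?, ?)}")) {
            cs.setInt(1, empId);                       // input parameter
            cs.registerOutParameter(2, Types.VARCHAR); // output parameter
            cs.execute();                              // run the procedure on the server
            return cs.getString(2);                    // read the returned value
        }
    }
}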
1.7 TRENDS IN BIG DATA SYSTEMS

WHAT IS BIG DATA?


Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. Big data is not merely data; rather, it has become a complete subject, which involves
various tools, techniques, and frameworks.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in
it will be of three types:
 Structured data : Relational data.
 Semi Structured data : XML data.
 Unstructured data : Word, PDF, Text, Media Logs.

BENEFITS OF BIG DATA


Big data is critical to our lives and is emerging as one of the most important technologies in the
modern world. The following are just a few benefits that are well known to all of us:
 Using the information kept in social networks like Facebook, marketing agencies are
learning about the response to their campaigns, promotions, and other advertising
media.
 Using information in social media, such as the preferences and product perceptions of
their consumers, product companies and retail organizations are planning their
production.
 Using the data regarding the previous medical history of patients, hospitals are providing
better and quicker service.

BIG DATA TECHNOLOGIES


Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in realtime and can protect data
privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology:

Operational Big Data


This includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run inexpensively
and efficiently. This makes operational big data workloads much easier to manage, cheaper, and
faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with
minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data


This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that may
touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities
provided by SQL, and a system based on MapReduce can be scaled up from single servers to
thousands of high-end and low-end machines.

These two classes of technology are complementary and frequently deployed together.

1.7.1 NoSQL Data Bases

A NoSQL ("non SQL" or "non-relational") database provides a mechanism for the storage and
retrieval of data that is modelled by means other than the tabular relations used in relational
databases.

Next-generation databases mostly address some of these points: being non-relational, distributed,
open source, and horizontally scalable.
The original intention was modern web-scale databases. The movement began in early 2009
and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication
support, simple API, eventually consistent / BASE (not ACID), a huge amount of data, and
more.

The term NoSQL was used by Carlo Strozzi in 1998 to name his lightweight, Strozzi
NoSQL open-source relational database that did not expose the standard SQL interface, but was
still relational.

Types and Examples:


There have been various approaches to classify NoSQL databases, each with different categories
and subcategories, some of which overlap. What follows is a basic classification by data model,
with examples:

Column: Accumulo, Cassandra, Druid, HBase, Vertica

Document: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, Lotus Notes,


MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB

Key-value: Aerospike, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex,


MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB

Graph: AllegroGraph, InfiniteGraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog


Multi-model: Alchemy Database, ArangoDB, CortexDB, FoundationDB, MarkLogic,
OrientDB.
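
As a small illustration of the document-store category, the sketch below inserts and retrieves a
JSON-like document with the MongoDB Java driver. It assumes the MongoDB sync driver is on the
classpath and a server is running at localhost:27017; the school database and students collection
are hypothetical names used only for this example.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class DocumentStoreDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("school");
            MongoCollection<Document> students = db.getCollection("students");
            // Schema-free insert: the document carries its own structure
            students.insertOne(new Document("roll_no", 1)
                    .append("name", "Asha")
                    .append("marks", 91));
            // Retrieve by a field value rather than by tabular relations and joins
            Document found = students.find(new Document("roll_no", 1)).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}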
1.7.2 HDFS (Hadoop File System)

The Hadoop File System (HDFS) was developed using a distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed
using low-cost hardware.

HDFS holds a very large amount of data and provides easier access. To store such huge data, the
files are stored across multiple machines. These files are stored in a redundant fashion to rescue the
system from possible data loss in case of failure. HDFS also makes applications available for
parallel processing.
Features of HDFS

 It is suitable for distributed storage and processing.


 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of namenode and datanode help users to easily check the status of
cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and
the namenode software; the namenode software can be run on ordinary commodity hardware. The
system hosting the namenode acts as the master server and does the following tasks:

 Manages the file system namespace.


 Regulates client’s access to files.
 It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.

 Datanodes perform read-write operations on the file systems, as per client request.
 They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one
or more segments and/or stored in individual datanodes. These file segments are called blocks. In
other words, the minimum amount of data that HDFS can read or write is called a block. The
default block size is 64 MB, but it can be changed as needed in the HDFS configuration.
Goals of HDFS
 Fault detection and recovery: Since HDFS includes a large number of commodity
hardware, failure of components is frequent. Therefore HDFS should have mechanisms
for quick and automatic fault detection and recovery.
 Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
 Hardware at data: A requested task can be done efficiently, when the computation
takes place near the data. Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.
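
Applications usually reach HDFS through the Hadoop FileSystem Java API (the same operations
are also available from the hadoop fs command line). The sketch below writes a small file and
reads it back; the namenode address hdfs://localhost:9000 and the path /user/demo/sample.txt are
assumptions made only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // namenode address (assumed)
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");
            // Write: the namenode records the metadata, the datanodes store the blocks
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }
            // Read the file back through streaming access
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}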

1.7.3 Map Reduce

MapReduce is a framework using which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner.

MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial.
The Algorithm
 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
 Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
 Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.

 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs


The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the Writable
Comparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
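
The classic word-count job shows this <key, value> flow end to end: the mapper emits <word, 1>
pairs, the framework shuffles and sorts them, and the reducer sums the counts per word. This is a
minimal sketch against the standard Hadoop MapReduce API; the input and output paths are taken
from the command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The input key is the byte offset of the line; the value is the line itself.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);              // emit <word, 1>
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                        // add up the 1s for this word
            }
            context.write(key, new IntWritable(sum));  // emit <word, total count>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);     // local pre-aggregation of map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}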
1.7.4 Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analysing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive

 It stores the schema in a database and the processed data in HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:
Unit Name Operation

User Interface Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS. The user interfaces that
Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).

Meta Store Hive chooses respective database servers to store the schema or
Metadata of tables, databases, columns in a table, their data
types, and HDFS mapping.

HiveQL Process Engine HiveQL is similar to SQL for querying on schema information in
the Metastore. It is one of the replacements for the traditional
approach of writing a MapReduce program. Instead of writing a
MapReduce program in Java, we can write a HiveQL query and
have it processed as a MapReduce job.

Execution Engine The conjunction of the HiveQL Process Engine and MapReduce is
the Hive Execution Engine. The execution engine processes the
query and generates results the same as MapReduce results. It
uses the flavor of MapReduce.

HDFS or HBASE Hadoop distributed file system or HBASE are the data storage
techniques to store data into file system.
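
Because HiveServer2 exposes a JDBC interface, a HiveQL query can be submitted from Java using
the same five JDBC steps shown earlier. The following is a minimal sketch, assuming the Hive
JDBC driver is on the classpath and HiveServer2 is listening at localhost:10000; the emp table and
the credentials are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // HiveQL looks like SQL; behind the scenes the execution engine
            // turns the query into MapReduce-style jobs over data stored in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM emp")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " " + rs.getString(2));
                }
            }
        }
    }
}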
