Unit 1 - Im-Notes PDF
Database design and modelling: business rules and relationships; Java Database Connectivity (JDBC), database connection manager, stored procedures. Trends in Big Data systems including NoSQL: Hadoop HDFS, MapReduce, Hive, and enhancements.
1.2.1 ER MODEL
1.2.1.1 BASIC CONCEPTS OF ER MODEL
i) Entity:
An entity is a thing in the real world. It may be a concrete object with a physical existence, for example a person, student, car, or employee, or it may be an object with a conceptual existence or abstraction, for example a company, a job, or a university course.
A strong entity is independent of other entity types. It always possesses one or more attributes that uniquely distinguish each occurrence of the entity.
A weak entity depends on some other entity type and does not possess attributes that uniquely identify its occurrences.
ii) Attribute:
Attributes are properties that describe an entity. For example, a student can be described using attributes such as registration_no, roll_no, name, class, marks, total percentage, etc.
Generalization is a bottom-up approach in which two lower-level entities combine to form a higher-level entity. In generalization, the higher-level entity can also combine with other lower-level entities to form a still higher-level entity.
Specialization is the opposite of generalization. It is a top-down approach in which one higher-level entity can be broken down into two or more lower-level entities. In specialization, some higher-level entities may not have lower-level entity sets at all.
Aggregation is a process in which a relationship between two entities is treated as a single entity. Here, the relationship between Center and Course acts as an entity in a relationship with Visitor.
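Specialization and generalization map naturally onto class inheritance in object-oriented code. The following is only an illustrative sketch (the entity and attribute names are invented, not taken from the notes):

```java
// Specialization: the higher-level entity Person is broken down into the
// lower-level entities Student and Employee. Read the same hierarchy
// bottom-up and it is generalization.
class Person {                      // higher-level entity
    String name;                    // shared attribute
    Person(String name) { this.name = name; }
}

class Student extends Person {      // lower-level entity
    String rollNo;                  // attribute specific to Student
    Student(String name, String rollNo) { super(name); this.rollNo = rollNo; }
}

class Employee extends Person {     // lower-level entity
    String empNo;
    Employee(String name, String empNo) { super(name); this.empNo = empNo; }
}

public class ErHierarchy {
    public static void main(String[] args) {
        Person p = new Student("Asha", "R-101");
        // Every Student occurrence is also a Person occurrence.
        System.out.println(p instanceof Person); // true
    }
}
```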
Normalization:
Normalization is a formal process for deciding which attributes should be grouped together in a relation so that all anomalies are removed.
It is a logical design method that minimizes data redundancy and reduces design flaws.
• Consists of applying various “normal” forms to the database design.
• The normal forms break down large tables into smaller subsets.
Functional Dependence
Name, dept_no, and dept_name are functionally dependent on emp_no. (emp_no -> name,
dept_no, dept_name)
Skills is not functionally dependent on emp_no since it is not unique to each emp_no.
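A functional dependency can be tested mechanically: X -> Y holds in a table exactly when no two rows agree on X but differ on Y. A minimal sketch (the sample rows below are illustrative):

```java
import java.util.*;

public class FdCheck {
    // Returns true if column x functionally determines column y in rows,
    // i.e. no two rows share the x value but differ in the y value.
    static boolean determines(List<String[]> rows, int x, int y) {
        Map<String, String> seen = new HashMap<>();
        for (String[] r : rows) {
            String prev = seen.putIfAbsent(r[x], r[y]);
            if (prev != null && !prev.equals(r[y])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // columns: emp_no, name, skill
        List<String[]> rows = Arrays.asList(
            new String[]{"E1", "Barbara Jones", "SQL"},
            new String[]{"E1", "Barbara Jones", "Java"},  // same emp_no, second skill
            new String[]{"E2", "Ravi Kumar", "SQL"});
        System.out.println(determines(rows, 0, 1)); // true: emp_no -> name
        System.out.println(determines(rows, 0, 2)); // false: skills not unique per emp_no
    }
}
```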
Second Normal Form (2NF)
Remove partial dependencies: every non-key attribute must depend on the whole of the primary key, not just part of it.
Data Integrity
• Insert Anomaly - adding a row forces null values. E.g., inserting a new department with no employees requires a null or dummy value for the primary key emp_no.
• Update Anomaly - a single logical change requires multiple updates, causing performance degradation and risking inconsistency. E.g., changing the IT dept_name to IS must be repeated in every employee row.
• Delete Anomaly - deleting a row removes wanted information. E.g., deleting the IT department also removes employee Barbara Jones from the database.
Third Normal Form (3NF)
Remove transitive dependencies.
• Transitive dependence - a non-key attribute depends on another non-key attribute, so two separate entities exist within one table.
• Any transitive dependencies are moved into a smaller (subset) table.
3NF further improves data integrity.
• Prevents update, insert, and delete anomalies.
Transitive Dependence
Dept_no and dept_name are functionally dependent on emp_no; however, department can be considered a separate entity and moved into its own table, giving 3NF.
Boyce-Codd Normal Form (BCNF)
BCNF is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter than 3NF. A table complies with BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a superkey of the table.
Example: Suppose there is a company wherein employees work in more than one department.
They store the data like this:
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600
emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American
emp_dept table:
emp_dept dept_type dept_no_of_emp
Production and planning D001 200
stores D001 250
design and technical support D134 100
Purchasing department D134 600
emp_dept_mapping table:
emp_id emp_dept
1001 Production and planning
1001 stores
1002 design and technical support
1002 Purchasing department
Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}
Candidate keys:
For first table: emp_id
For second table: emp_dept
For third table: {emp_id, emp_dept}
This design is now in BCNF, since in both functional dependencies the left-hand side is a key.
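One way to check that this decomposition is lossless is to natural-join the three tables back on their candidate keys and confirm that exactly the original four rows come back. A minimal sketch using the data above:

```java
import java.util.*;

public class BcnfJoin {
    // Natural-join the three BCNF tables back together on their candidate keys.
    static List<String> joinedRows() {
        Map<String, String> nationality = new HashMap<>();      // emp_id -> emp_nationality
        nationality.put("1001", "Austrian");
        nationality.put("1002", "American");

        Map<String, String[]> dept = new HashMap<>();           // emp_dept -> {dept_type, dept_no_of_emp}
        dept.put("Production and planning", new String[]{"D001", "200"});
        dept.put("stores", new String[]{"D001", "250"});
        dept.put("design and technical support", new String[]{"D134", "100"});
        dept.put("Purchasing department", new String[]{"D134", "600"});

        String[][] mapping = {                                  // {emp_id, emp_dept}
            {"1001", "Production and planning"}, {"1001", "stores"},
            {"1002", "design and technical support"}, {"1002", "Purchasing department"}};

        List<String> joined = new ArrayList<>();
        for (String[] m : mapping) {
            String[] d = dept.get(m[1]);
            joined.add(String.join(" | ", m[0], nationality.get(m[0]), m[1], d[0], d[1]));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> rows = joinedRows();
        rows.forEach(System.out::println);
        // 4 rows out, exactly the 4 rows of the original table: the join is lossless.
        System.out.println(rows.size()); // 4
    }
}
```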
Fourth Normal Form (4NF)
Here each empid has more than one skill and more than one language known, so there are two independent multivalued dependencies:
EMPID ->> LANG
EMPID ->> SKILLS
So, if 101 stops teaching in Hindi, then we also have to delete one more row (ENG, TEACH).
Decompose the above relation into two parts:
1: EMP1 (EMPID,LANG)
2: EMP2 (EMPID,SKILLS)
EMPID LANG
101 ENG
101 HIN
202 ENG
202 HIN
EMPID SKILLS
101 TEACH
101 CONVER
202 SING
202 CONVER
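Re-joining EMP1 and EMP2 on EMPID regenerates every language/skill combination for each employee, so nothing is lost, and dropping a language now means deleting a single EMP1 row. A small sketch using the rows above:

```java
import java.util.*;

public class Mvd4nf {
    // Natural join of EMP1 (EMPID, LANG) and EMP2 (EMPID, SKILLS) on EMPID.
    static List<String> join(String[][] emp1, String[][] emp2) {
        List<String> out = new ArrayList<>();
        for (String[] l : emp1)
            for (String[] s : emp2)
                if (l[0].equals(s[0]))                 // match on EMPID
                    out.add(l[0] + " " + l[1] + " " + s[1]);
        return out;
    }

    public static void main(String[] args) {
        String[][] emp1 = {{"101", "ENG"}, {"101", "HIN"}, {"202", "ENG"}, {"202", "HIN"}};
        String[][] emp2 = {{"101", "TEACH"}, {"101", "CONVER"}, {"202", "SING"}, {"202", "CONVER"}};
        List<String> rows = join(emp1, emp2);
        rows.forEach(System.out::println);
        // 2 languages x 2 skills for each of the 2 employees: 8 combinations in all.
        System.out.println(rows.size()); // 8
    }
}
```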
Fifth Normal Form (5NF)
If we can decompose a table further to eliminate redundancy and anomalies, then when we re-join the decomposed tables by means of their candidate keys, we should not lose the original data, nor should any new records arise. In simple words, joining two or more decomposed tables should neither lose records nor create new ones.
Consider an example of different Subjects taught by different lecturers and the lecturers taking
classes for different semesters.
Note: Please consider that Semester 1 has Mathematics, Physics and Chemistry and Semester 2
has only Mathematics in its academic year!!
Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF and, when we join the pieces using keys, it yields the correct records. Here, we can represent each lecturer's subject area and their classes in a better way. We can divide the above table into three:
(SUBJECT, LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS)
Now, each of the combinations is in three different tables. If we need to identify who is teaching which subject to which semester, we need to join the keys of each table to get the result.
For example, to find who teaches Physics to Semester 1, we select Physics and Semester 1 from table 3 above, join with table 1 using Subject to filter out the lecturer names, and then join with table 2 using Lecturer to get the correct lecturer name. That is, we joined the key columns of each table to get the correct data. Hence there is no lost data and no new data, satisfying the 5NF condition.
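The join sequence just described can be traced in code. This is a hedged sketch: the lecturer names below are invented for illustration, but the subject/semester data follows the note (Semester 1 has Mathematics, Physics, and Chemistry; Semester 2 has only Mathematics):

```java
import java.util.*;

public class FifthNf {
    // Hypothetical data consistent with the note; lecturer names are invented.
    static String[][] subjectLecturer = {{"Maths", "Alice"}, {"Physics", "Bob"}, {"Chemistry", "Carol"}};
    static String[][] lecturerClass = {{"Alice", "Semester1"}, {"Alice", "Semester2"},
                                       {"Bob", "Semester1"}, {"Carol", "Semester1"}};
    static String[][] subjectClass = {{"Maths", "Semester1"}, {"Maths", "Semester2"},
                                      {"Physics", "Semester1"}, {"Chemistry", "Semester1"}};

    // Who teaches `subject` to `semester`? Select from table 3 (subjectClass),
    // join table 1 (subjectLecturer) on Subject, join table 2 (lecturerClass) on Lecturer.
    static List<String> whoTeaches(String subject, String semester) {
        List<String> result = new ArrayList<>();
        for (String[] sc : subjectClass) {
            if (!sc[0].equals(subject) || !sc[1].equals(semester)) continue;
            for (String[] sl : subjectLecturer) {
                if (!sl[0].equals(subject)) continue;
                for (String[] lc : lecturerClass)
                    if (lc[0].equals(sl[1]) && lc[1].equals(semester))
                        result.add(sl[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(whoTeaches("Physics", "Semester1")); // [Bob]
    }
}
```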
Physical database design translates the logical data model into a set of SQL statements
that define the database. For relational database systems, it is relatively easy to translate from a
logical data model into a physical database. Rules for translation: Entities become tables in the physical database.
The primary goal of physical database design is data processing efficiency. Designing
physical files and databases requires certain information that should have been collected and
produced during prior systems development phases.
The information needed for physical file and database design includes these
requirements:
• Normalized relations, including estimates for the range of the number of rows in each
table
• Definitions of each attribute, along with physical specifications such as maximum
possible length
• Descriptions of where and when data are used in various ways (entered, retrieved,
deleted, and updated, including typical frequencies of these events)
• Expectations or requirements for response time and data security, backup, recovery,
retention, and integrity
• Descriptions of the technologies (database management systems) used for implementing the database.
Physical database design requires several critical decisions that will affect the integrity and performance of the application system. These key decisions include the following:
• Choosing the storage format (called data type) for each attribute from the logical data model. The format and associated parameters are chosen to maximize data integrity and to minimize storage space.
• Giving the database management system guidance regarding how to group attributes from the logical data model into physical records.
• Giving the database management system guidance regarding how to arrange similarly structured records in secondary memory (primarily hard disks), using a structure (called a file organization) so that individual and groups of records can be stored, retrieved, and updated rapidly.
• Selecting structures (including indexes and the overall database architecture) for storing and connecting files to make retrieving related data more efficient.
• Preparing strategies for handling queries against the database that will optimize performance and take advantage of the file organizations and indexes that you have specified. Efficient database structures will be of benefit only if queries and the database management systems that handle those queries are tuned to intelligently use those structures.
1.3 BUSINESS RULES
A business rule is "a statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behaviour of the business. Rules prevent, cause, or suggest things to happen". For example, the following two statements are common expressions of business rules that affect data processing and storage:
• "A student may register for a section of a course only if he or she has successfully completed the prerequisites for that course."
• "A preferred customer qualifies for a 10 percent discount, unless he has an overdue account balance."
In the database world, it has been more common to use the related term integrity constraint
when referring to such rules. The intent of this term is somewhat more limited in scope,
usually referring to maintaining valid data values and relationships in the database.
DATA NAMES
There are guidelines for naming entities, relationships, and attributes as we develop the entity-relationship data model, but there are also some general guidelines about naming any data object.
Data names should
DATA DEFINITIONS
A definition (sometimes called a structural assertion) is considered a type of business rule. A definition is an explanation of a term or a fact. A term is a word or phrase that has a specific meaning for the business.
Examples of terms are course, section, rental car, flight, reservation, and passenger. Terms are often the key words used to form data names. Terms must be defined carefully and concisely.
However, there is no need to define common terms such as day, month, person, or television, because these terms are understood without ambiguity by most persons. A fact is an association between two or more terms. A fact is documented as a simple declarative statement that relates terms. Examples of facts that are definitions are the following (the defined terms are underlined):
• "A course is a module of instruction in a particular subject area." This definition associates two terms: module of instruction and subject area. We assume that these are common terms that do not need to be further defined.
• "A customer may request a model of car from a rental branch on a particular date." This fact, which is a definition of model rental request, associates the four underlined terms.
Java Database Connectivity (JDBC) is an application programming interface (API) which allows
the programmer to connect and interact with databases. It provides methods to query and update
data in the database through update statements like SQL's CREATE, UPDATE, DELETE and
INSERT and query statements such as SELECT. Additionally, JDBC can run stored procedures.
JDBC drivers are divided into four types or levels. The different types of JDBC drivers are:
Type 1: JDBC-ODBC Bridge Driver
Advantage
The JDBC-ODBC Bridge allows access to almost any database, since the database’s ODBC
drivers are already available.
Disadvantages
1. Since the Bridge driver is not written fully in Java, Type 1 drivers are not portable.
2. A performance issue is seen as a JDBC call goes through the bridge to the ODBC driver, then
to the database, and this applies even in the reverse process. They are the slowest of all driver
types.
3. The client system requires the ODBC Installation to use the driver.
4. Not good for the Web.
Type 2: Native API/Partly Java Driver
The distinctive characteristic of Type 2 JDBC drivers is that they convert JDBC calls into database-specific calls, i.e., the driver is specific to a particular database. Example: Oracle will have an Oracle native API.
Advantage
Type 2 drivers typically offer better performance than the JDBC-ODBC Bridge, because there are fewer layers of communication (tiers) than in Type 1 and the driver uses a native API that is database specific.
Disadvantage
1. The native API must be installed on the client system, and hence Type 2 drivers cannot be used for the Internet.
2. Like Type 1 drivers, they are not written entirely in Java, which creates a portability issue.
3. If we change the database, we have to change the native API, as it is specific to a database.
4. Mostly obsolete now.
5. Usually not thread safe.
Type 3: All Java/Net-Protocol Driver
With Type 3, database requests are passed through the network to the middle-tier server. The middle tier then translates the request to the database. The middle-tier server can in turn use Type 1, Type 2, or Type 4 drivers.
Advantage
1. This driver is server-based, so there is no need for any vendor database library to be present on
client machines.
2. This driver is fully written in Java and hence Portable. It is suitable for the web.
3. There are many opportunities to optimize portability, performance, and scalability.
4. The net protocol can be designed to make the client JDBC driver very small and fast to load.
5. The type 3 driver typically provides support for features such as caching (connections, query
results, and so on), load balancing, and advanced
system administration such as logging and auditing.
6. This driver is very flexible and allows access to multiple databases using one driver.
7. They are the most efficient amongst all driver types.
Disadvantage
It requires another server application to install and maintain. Traversing the recordset may take
longer, since the data comes through the backend server.
Type 4: Native-Protocol/All-Java Driver
The Type 4 driver uses Java networking libraries to communicate directly with the database server.
Advantage
1. The major benefit of Type 4 JDBC drivers is that they are completely written in Java, which achieves platform independence and eliminates deployment administration issues. They are most suitable for the web.
2. The number of translation layers is very small, i.e., Type 4 JDBC drivers do not have to translate database requests to ODBC or a native connectivity interface, or pass the request on to another server, so performance is typically quite good.
3. You don’t need to install special software on the client or server. Further, these drivers can be
downloaded dynamically.
Disadvantage
With type 4 drivers, the user needs a different driver for each database.
There are five steps to connect any Java application with a database using JDBC. They are as follows:
1. Register the driver class
2. Create the connection
3. Create the statement
4. Execute queries
5. Close the connection
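A minimal sketch of the five steps. The driver class name, JDBC URL, and credentials below are illustrative placeholders (shown here for PostgreSQL, not prescribed by the notes); with no driver on the classpath, the program simply reports that no database is available:

```java
import java.sql.*;

public class JdbcSteps {
    // The five JDBC steps; URL, user, and password are illustrative placeholders.
    static String run() {
        try {
            Class.forName("org.postgresql.Driver");                    // 1. register the driver class
            try (Connection con = DriverManager.getConnection(         // 2. create the connection
                     "jdbc:postgresql://localhost:5432/testdb", "user", "secret");
                 Statement st = con.createStatement();                 // 3. create the statement
                 ResultSet rs = st.executeQuery("SELECT 1")) {         // 4. execute a query
                rs.next();
                return "result: " + rs.getInt(1);
            }                                                          // 5. close (try-with-resources)
        } catch (ClassNotFoundException | SQLException e) {
            return "no database available: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The try-with-resources block performs step 5 automatically: the ResultSet, Statement, and Connection are closed in reverse order even if a query fails.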
When you add an OLE DB connection manager to a package, Integration Services creates a
connection manager that will resolve to an OLE DB connection at run time, sets the connection
manager properties, and adds the connection manager to the Connections collection on the
package.
Temporary
Temporary procedures are a form of user-defined procedures. The temporary procedures are like
a permanent procedure, except temporary procedures are stored in tempdb. There are two types
of temporary procedures: local and global. They differ from each other in their names, their
visibility, and their availability. Local temporary procedures have a single number sign (#) as the
first character of their names; they are visible only to the current user connection, and they are
deleted when the connection is closed. Global temporary procedures have two number signs (##)
as the first two characters of their names; they are visible to any user after they are created, and
they are deleted at the end of the last session using the procedure.
System
System procedures are included with SQL Server. They are physically stored in the internal,
hidden Resource database and logically appear in the sys schema of every system- and user-
defined database. In addition, the msdb database also contains system stored procedures in
the dbo schema that are used for scheduling alerts and jobs. Because system procedures start
with the prefix sp_, we recommend that you do not use this prefix when naming user-defined
procedures.
SQL Server supports the system procedures that provide an interface from SQL Server to
external programs for various maintenance activities. These extended procedures use the xp_
prefix.
Extended User-Defined
Extended procedures enable creating external routines in a programming language such as C.
These procedures are DLLs that an instance of SQL Server can dynamically load and run.
Thus Big Data includes huge volume, high velocity, and extensible variety of data. The data in
it will be of three types.
Structured data : Relational data.
Semi Structured data : XML data.
Unstructured data : Word, PDF, Text, Media Logs.
To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology:
NoSQL Big Data systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run inexpensively
and efficiently. This makes operational big data workloads much easier to manage, cheaper, and
faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with
minimal coding and without the need for data scientists and additional infrastructure.
MapReduce provides a new method of analyzing data that is complementary to the capabilities
provided by SQL, and a system based on MapReduce that can be scaled up from single servers
to thousands of high and low end machines.
These two classes of technology are complementary and frequently deployed together.
Next Generation Databases mostly address some of these points: being non-relational, distributed, open source, and horizontally scalable.
The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, a simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more.
The term NoSQL was used by Carlo Strozzi in 1998 to name his lightweight, Strozzi
NoSQL open-source relational database that did not expose the standard SQL interface, but was
still relational.
The Hadoop Distributed File System (HDFS) was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and
the namenode software. It is a software that can be run on commodity hardware. The system
having the namenode acts as the master server and it does the following tasks:
Datanode
Datanodes perform read-write operations on the file systems, as per client request. They also perform operations such as block creation, deletion, and replication, according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as per the need by changing the HDFS configuration.
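The number of blocks a file occupies follows by ceiling division on the block size; for example, with the 64 MB default, a 200 MB file occupies four blocks, the last one only partly filled. A quick sketch:

```java
public class HdfsBlocks {
    static final long BLOCK_MB = 64;               // HDFS default block size from the notes

    // Number of blocks needed for a file of the given size (ceiling division).
    static long blocksFor(long fileMb) {
        return (fileMb + BLOCK_MB - 1) / BLOCK_MB;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(200));        // 4 blocks: 64 + 64 + 64 + 8 MB
        System.out.println(blocksFor(64));         // 1
        // Since blocks are stored redundantly (e.g. replication factor 3),
        // the cluster would hold three copies of each block.
        System.out.println(blocksFor(200) * 3);    // 12 block replicas across datanodes
    }
}
```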
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data: A requested task can be done efficiently, when the computation
takes place near the data. Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.
MapReduce is a framework using which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
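The three stages can be simulated in a single plain-Java process with the classic word-count example: the map stage emits (word, 1) pairs, the shuffle stage groups them by key, and the reduce stage sums each group. This is a sketch of the programming model only, not Hadoop's actual API:

```java
import java.util.*;

public class WordCount {
    static Map<String, Integer> mapReduce(List<String> lines) {
        // Map stage: break each input line into (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle stage: group the pairs by key (a TreeMap also sorts the keys).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce stage: combine each group's values into a single count.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data is big", "data is data");
        System.out.println(mapReduce(input)); // {big=2, data=3, is=2}
    }
}
```

In real Hadoop the pairs would be partitioned across many mappers and reducers on different nodes; the logic of each stage, however, is the same.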
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analysing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:
Unit Name and Operation:
User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data storage techniques used to store data in the file system.