
ADVANCED DATABASE

MANAGEMENT SYSTEM

| WEEK 11 AND 12 |
➢ CENTRALIZED VS DISTRIBUTED VERSION
CONTROL
• There is no doubt that version control makes developers' work easier and faster.
• In most organizations, developers use either:
• Centralized Version Control System (CVCS)
- like Subversion (SVN) or Concurrent Versions System (CVS)
• Distributed Version Control System (DVCS)
- like Git (written in C), Mercurial (written in Python), or Bazaar (written in Python).
• In a DVCS there is also a master branch or central version of the code, but it works in a different way than centralized source control.
• CENTRALIZED VERSION CONTROL SYSTEM
- The server is the master repository that contains all of the versions of the
code.
- To work on any project, the user or client first needs to get the code from the master repository or server.
- So, the client communicates with the server and pulls all the code or current
version of the code from the server to their local machine.
- In other words, you take an update from the master repository and get a local copy of the code on your system.
- Committing a change simply means merging your own code into the master
repository or making a new version of the source code.
- Centralized source control is getting the latest version of the code from a
central repository that will contain other people’s code as well, making your
own changes in the code, and then committing or merging those changes into
the central repository.
• DISTRIBUTED VERSION CONTROL SYSTEM
- In distributed version control, most of the model is the same as in the centralized approach.
- The only major difference is that instead of one single repository on the server, every developer or client has a copy of the entire history of the code and all of its branches on their local machine.
- Every client or user can work locally and disconnected, which is more convenient than centralized source control; that is why it is called distributed.
- You don't need to rely on the central server; you can clone the entire history or copy of the code to your hard drive.
- You clone the code from the master repository to your own hard drive, make changes against your own repository, and commit those changes to your local repository. At this point your local repository holds 'change sets' but is still disconnected from the master repository (which will receive different 'sets of changes' from each individual developer's repository). To communicate with it, you issue a request to the master repository and push your local repository's code to it. (A minimal sketch of this cycle, using Git, follows this list.)
- PULLING: getting the new change sets from a repository.
- PUSHING: merging your local repository's 'set of changes' into the master repository.
- Changes are not merged straight into the master repository after you make them: first you commit all the changes to your own server or repository, and only then is the 'set of changes' merged into the master repository.
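- Below is a minimal sketch of the clone/commit/push cycle described above, driven from Python. It assumes Git is installed; the repository URL and file name are hypothetical placeholders.

```python
# Sketch of the DVCS cycle: clone the full history, commit locally
# (offline), then push the local change sets to the master repository.
# The URL and file name below are hypothetical placeholders.
import subprocess

def git(*args):
    """Run one git command and fail loudly if it errors."""
    subprocess.run(["git", *args], check=True)

# 1. Clone the entire history from the master repository to local disk.
git("clone", "https://example.com/project.git")

# 2. Work locally: stage a change and commit it to the LOCAL repository.
#    No network connection is needed for this step.
git("-C", "project", "add", "report.txt")
git("-C", "project", "commit", "-m", "Describe the change")

# 3. Pull other developers' change sets, then push yours to the master.
git("-C", "project", "pull", "--rebase")
git("-C", "project", "push")
```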

• BASIC DIFFERENCE WITH PROS AND CONS


- Centralized version control is easier to learn than distributed. As a beginner you have to remember the commands for all the operations in a DVCS, and working with a DVCS can be confusing initially. A CVCS is easy to learn and easy to set up.
- The biggest advantage of a DVCS is that it allows you to work offline and gives flexibility. You have the entire history of the code on your own hard drive, so all your changes go to your own repository, which requires no internet connection; this is not the case with a CVCS.
- A DVCS is faster than a CVCS because you don't need to communicate with the remote server for each and every command. You do everything locally, which lets you work faster than with a CVCS.
- Working on branches is easy in a DVCS. Every developer has the entire history of the code, so developers can share their changes before merging all their 'sets of changes' to the remote server. In a CVCS it is difficult and time-consuming to work on branches because it requires communicating with the server directly.
- If the project has a long history or contains large binary files, downloading the entire project in a DVCS can take more time and space than usual, whereas in a CVCS you only check out the current version; you don't save the entire history or complete project on your own machine, so there is no requirement for additional space.
- If the main server goes down or crashes in a DVCS, you can still recover a backup of the entire history of the code from any local repository where the full revision history is already saved. This is not the case with a CVCS, where a single remote server holds the entire code history.
- Merge conflicts with other developers' code are fewer in a DVCS because every developer works on their own piece of code. Merge conflicts are more common in a CVCS in comparison to a DVCS.
- In a DVCS, developers sometimes take advantage of having the entire history of the code and work for too long in isolation, which is not a good thing. This is not the case with a CVCS.
• From Google Trends and all the above points, it's clear that DVCS has more advantages and is more popular than CVCS. When choosing a version control system, however, it also depends on which one is more convenient for you to learn as a beginner. You can choose either, but a DVCS gives more benefit once you get used to its commands.
• CENTRALIZED SYSTEMS
- Are systems that use client/server architecture where one or more client nodes
are directly connected to a central server.
- This is the most commonly used type of system in many organizations, where a client sends a request to a company server and receives the response (for example, Wikipedia).
• CHARACTERISTICS OF CENTRALIZED SYSTEM
- Presence of a global clock: As the entire system consists of a central node (a server/master) and many client nodes (computers/slaves), all client nodes sync up with the global clock (the clock of the central node).
- One single central unit: One single central unit which serves/coordinates all
the other nodes in the system.
- Dependent failure of components: Central node failure causes the entire
system to fail. This makes sense because when the server is down, no other
entity is there to send/receive responses/requests.
• SCALING
- Only vertical scaling on the central server is possible. Horizontal scaling would contradict the single-central-unit characteristic of this system.
• COMPONENTS OF CENTRALIZED SYSTEM
- Node (Computer, Mobile, etc.).
- Server.
- Communication link (Cables, Wi-Fi, etc.).
• ARCHITECTURE OF CENTRALIZED SYSTEM
- Client-Server architecture. The central node that serves the other nodes in the
system is the server node and all the other nodes are the client nodes.
• LIMITATIONS OF CENTRALIZED SYSTEM
- Can’t scale up vertically after a certain limit – After a limit, even if you
increase the hardware and software capabilities of the server node, the
performance will not increase appreciably leading to a cost/benefit ratio < 1.
- Bottlenecks can appear when the traffic spikes – the server has only a finite number of open ports on which it can listen for connections from client nodes, so when high traffic occurs (such as a shopping sale), the server can essentially suffer the equivalent of a Denial-of-Service or Distributed Denial-of-Service attack.
• ADVANTAGES OF CENTRALIZED SYSTEM
- Easy to physically secure. It is easy to secure and service the server and client nodes by virtue of their location.
- Smooth and elegant personal experience – A client has a dedicated system which he uses (for example, a personal computer) and the company has a similar system which can be modified to suit custom needs.
- Dedicated resources (memory, CPU cores, etc)
- More cost-efficient for small systems up to a certain limit – As the central
systems take fewer funds to set up, they have an edge when small systems
have to be built
- Quick updates are possible – Only one machine to update.
- Easy detachment of a node from the system. Just remove the connection of
the client node from the server and voila! Node detached.
• DISADVANTAGES OF CENTRALIZED SYSTEM
- Highly dependent on the network connectivity – The system can fail if the
nodes lose connectivity as there is only one central node.
- No graceful degradation of the system – abrupt failure of the entire system
- Less possibility of data backup. If the server node fails and there is no
backup, you lose the data straight away
- Difficult server maintenance – There is only one server node and due to
availability reasons, it is inefficient and unprofessional to take the server down
for maintenance. So, updates have to be done on-the-fly (hot updates) which is
difficult and the system could break.
• APPLICATION OF CENTRALIZED SYSTEM
- Application development – Very easy to set up a central server and send client requests. Modern frameworks come with default test servers which can be launched with a couple of commands, for example the Express server or the Django server.
- Data analysis – Easy to do data analysis when all the data is in one place and
available for analysis
- Personal computing
• USE CASES
- Centralized databases – all the data in one server for use.
- Single-player games like Need For Speed or GTA Vice City – an entire game in one system (commonly, a personal computer)
- Application development by deploying test servers leading to easy
debugging, easy deployment, easy simulation
- Personal Computers
- Organizations Using – National Informatics Center (India), IBM

• DECENTRALIZED SYSTEMS
- These are other types of systems that have been gaining a lot of popularity,
primarily because of the massive hype of Bitcoin. Now many organizations
are trying to find the application of such systems.
- In decentralized systems, every node makes its own decision. The final
behavior of the system is the aggregate of the decisions of the individual
nodes. Note that there is no single entity that receives and responds to the
request.
- EXAMPLE: BITCOIN. Bitcoin for example because it is the most popular
use case of decentralized systems. No single entity/organization owns the
bitcoin network. The network is a sum of all the nodes who talk to each other
for maintaining the amount of bitcoin every account holder has.
• CHARACTERISTICS OF DECENTRALIZED SYSTEM
- Lack of a global clock: Every node is independent of each other and hence,
has different clocks that they run and follow.
- Multiple central units (Computers/Nodes/Servers): More than one central
unit which can listen for connections from other nodes
- Dependent failure of components: one central node failure causes a part of
the system to fail; not the whole system
• SCALING
- Vertical scaling is possible. Each node can add resources (hardware, software)
to itself to increase the performance leading to an increase in the performance
of the entire system.
• COMPONENTS OF DECENTRALIZED SYSTEM
- Node (Computer, Mobile, etc.)
- Communication link (Cables, Wi-Fi, etc.)
• ARCHITECTURE OF DECENTRALIZED SYSTEM
- peer-to-peer architecture – all nodes are peers of each other. No one node
has supremacy over other nodes
- master-slave architecture – One node can become a master by voting and
help in coordinating of a part of the system but this does not mean the node
has supremacy over the other node which it is coordinating
• LIMITATIONS OF DECENTRALIZED SYSTEM
- May lead to the problem of coordination at the enterprise level – When
every node is the owner of its own behavior, it’s difficult to achieve collective
tasks
- Not suitable for small systems – Not beneficial to build and operate small
decentralized systems because of the low cost/benefit ratio
- No way to regulate a node on the system – no superior node overseeing the
behavior of subordinate nodes
• ADVANTAGES OF DECENTRALIZED SYSTEM
- Minimal problem of performance bottlenecks occurring – The entire load
gets balanced on all the nodes; leading to minimal to no bottleneck situations
- High availability – Some nodes (computers, mobiles, servers) are always available/online for work, leading to high availability
- More autonomy and control over resources – As each node controls its own
behavior, it has better autonomy leading to more control over resources
• DISADVANTAGES OF DECENTRALIZED SYSTEM
- Difficult to achieve global big tasks – No chain of command to command
others to perform certain tasks
- No regulatory oversight
- Difficult to know which node failed – Each node must be pinged for
availability checking and partitioning of work has to be done to actually find
out which node failed by checking the expected output with what the node
generated
- Difficult to know which node responded – When a request is served by a
decentralized system, the request is actually served by one of the nodes in the
system but it is actually difficult to find out which node indeed served the
request.
• APPLICATIONS OF DECENTRALIZED SYSTEM
- Private networks – peer nodes joined with each other to make a private
network.
- Cryptocurrency – Nodes joined to become a part of a system in which digital
currency is exchanged without any trace and location of who sent what to
whom. However, in bitcoin, we can see the public address and amount of
bitcoin transferred, but those public addresses are mutable and hence difficult
to trace.
• USE CASES
- Blockchain
- Decentralized databases – the entire database is split into parts and distributed to different nodes for storage and use; for example, records with names starting from 'A' to 'K' in one node, 'L' to 'N' in a second node, and 'O' to 'Z' in a third node (see the routing sketch after this list)
- Cryptocurrency
- Organizations Using – Bitcoin, Tor network
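- A minimal sketch of the alphabetic split described above: a hypothetical routing function decides which node stores a record based on the first letter of its key.

```python
# Route a record to a storage node by the first letter of its key,
# matching the A-K / L-N / O-Z split described above. Node names are
# hypothetical placeholders.
def node_for(name: str) -> str:
    first = name[:1].upper()
    if "A" <= first <= "K":
        return "node1"
    if "L" <= first <= "N":
        return "node2"
    if "O" <= first <= "Z":
        return "node3"
    raise ValueError(f"no node assigned for key {name!r}")

print(node_for("Alice"))   # node1
print(node_for("Miguel"))  # node2
print(node_for("Zoe"))     # node3
```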

• DISTRIBUTED SYSTEMS
- In distributed systems, multiple nodes work together and coordinate with one another to accomplish a common goal. To the end user, the collection of nodes appears and responds as one single coherent system.
- EXAMPLE: GOOGLE SEARCH SYSTEM. Each request is worked upon
by hundreds of computers which crawl the web and return the relevant results.
To the user, Google appears to be one system, but it actually is multiple
computers working together to accomplish one single task (return the results
to the search query).
• CHARACTERISTICS OF DISTRIBUTED SYSTEM
- Concurrency of components: Nodes apply consensus protocols to agree on
the same values/transactions/commands/logs.
- Lack of a global clock: All nodes maintain their own clock.
- Independent failure of components: In a distributed system, nodes fail
independently without having a significant effect on the entire system. If one
node fails, the entire system sans the failed node continues to work.
• SCALING
- Horizontal and vertical scaling is possible.
• COMPONENTS OF DISTRIBUTED SYSTEM
- Node (Computer, Mobile, etc.)
- Communication link (Cables, Wi-Fi, etc.)
• ARCHITECTURE OF DISTRIBUTED SYSTEM
- peer-to-peer – all nodes are peers of each other and work towards a common
goal
- client-server – some nodes become server nodes for the role of coordinator,
arbiter, etc.
- n-tier architecture – different parts of an application are distributed in
different nodes of the systems and these nodes work together to function as an
application for the user/client
• LIMITATIONS OF DISTRIBUTED SYSTEM
- Difficult to design and debug algorithms for the system. These algorithms
are difficult because of the absence of a common clock; so no temporal
ordering of commands/logs can take place.
- Nodes can have different latencies which have to be kept in mind while
designing such algorithms. The complexity increases with the increase in the
number of nodes.
- No common clock causes difficulty in the temporal ordering of
events/transactions
- Difficult for a node to get the global view of the system and hence take
informed decisions based on the state of other nodes in the system
• ADVANTAGES OF DISTRIBUTED SYSTEM
- Lower latency than a centralized system – Distributed systems have low latency because of their high geographical spread, leading to less time to get a response
• DISADVANTAGES OF DISTRIBUTED SYSTEM
- Difficult to achieve consensus
- The conventional way of logging events by absolute time they occur is not
possible here
• APPLICATIONS OF DISTRIBUTED SYSTEM
- Cluster computing – a technique in which many computers are coupled together to work so that they achieve global goals. The computer cluster acts as if it were a single computer
- Grid computing – All the resources are pooled together for sharing in this kind of computing, essentially turning the systems into a powerful supercomputer
• USE CASES
- SOA-based systems
- Multiplayer online games
- Organizations Using – Apple, Google, Facebook.
• DISTRIBUTED DATABASE SYSTEM
- A distributed database is basically a database that is not limited to one system; it is spread over different sites, i.e., on multiple computers or over a network of computers.
- A distributed database system is located on various sites that don’t share
physical components.
- This may be required when a particular database needs to be accessed by
various users globally.
- It needs to be managed such that for the users it looks like one single database.
• TYPES
1. HOMOGENEOUS DATABASE
- In a homogeneous distributed database, all sites store the database identically. The operating system, database management system, and the data structures used are all the same at all sites. Hence, they're easy to manage.
2. HETEROGENEOUS DATABASE
- In a heterogeneous distributed database, different sites can use different
schema and software that can lead to problems in query processing and
transactions. Also, a particular site might be completely unaware of the other
sites. Different computers may use a different operating system, different
database application. They may even use different data models for the
database. Hence, translations are required for different sites to communicate.
• DISTRIBUTED DATA STORAGE
1. REPLICATION
- In this approach, the entire relation is stored redundantly at two or more sites. If the entire database is available at all sites, it is a fully redundant database. In replication, systems maintain copies of every single piece of data at each site.
2. FRAGMENTATION
- The key point is to make sure that each of the fragments can be used to
reconstruct the original relation (i.e. there isn't any loss of data).
- Horizontal Fragmentation-Splitting by rows: The relation is fragmented
into groups of tuples so that each tuple is assigned to at least one fragment.
- Vertical Fragmentation-Splitting by columns: The schema of the relation is
divided into smaller schemas. Each fragment must contain a common
candidate key so as to ensure a lossless join.
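- The sketch below illustrates both kinds of fragmentation on a hypothetical customer relation, using SQLite; cust_id is the common candidate key that makes the vertical fragments rejoinable without loss.

```python
# Horizontal vs. vertical fragmentation on a hypothetical customer
# relation, using Python's built-in sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY,
                           name TEXT, city TEXT, balance REAL);
    INSERT INTO customer VALUES (1, 'Ana', 'Manila', 100.0),
                                (2, 'Ben', 'Cebu',   250.0);

    -- Horizontal fragmentation: split by rows (here, by city), so
    -- every tuple lands in at least one fragment.
    CREATE TABLE customer_manila AS
        SELECT * FROM customer WHERE city = 'Manila';
    CREATE TABLE customer_cebu AS
        SELECT * FROM customer WHERE city = 'Cebu';

    -- Vertical fragmentation: split by columns; each fragment keeps
    -- the candidate key cust_id so a join is lossless.
    CREATE TABLE customer_contact AS SELECT cust_id, name, city FROM customer;
    CREATE TABLE customer_account AS SELECT cust_id, balance FROM customer;
""")

# Reconstruct the original relation from the vertical fragments.
rows = con.execute("""
    SELECT c.cust_id, c.name, c.city, a.balance
    FROM customer_contact c JOIN customer_account a USING (cust_id)
""").fetchall()
print(rows)  # [(1, 'Ana', 'Manila', 100.0), (2, 'Ben', 'Cebu', 250.0)]
```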
• APPLICATIONS OF DISTRIBUTED DATABASE
- It is used in Corporate Management Information System.
- It is used in multimedia applications.
- Used in Military’s control system, Hotel chains etc.
- It is also used in manufacturing control system.
• ADVANTAGES OF DISTRIBUTED DATABASE
- Distributed database as a collection of multiple interrelated databases
distributed over a computer network and a distributed database management
system as a software system that basically manages a distributed database
while making the distribution transparent to the user.
- Distributed database management basically proposed for the various reason
from organizational decentralization and economical processing to greater
autonomy. Some of these advantages are as follows:
1. Management of data with different levels of transparency
- Ideally, a database should be distribution transparent in the sense of hiding the
details of where each file is physically stored within the system. The
following types of transparencies are basically possible in the distributed
database system:
- Network transparency: This refers to freedom for the user from the operational details of the network. It is of two types: location transparency and naming transparency.
- Replication transparency: This makes the user unaware of the existence of copies; copies of data may be stored at multiple sites for better availability, performance, and reliability.
- Fragmentation transparency: This makes the user unaware of the existence of fragments, whether vertical or horizontal.
2. Increased Reliability and availability
- Reliability is the probability that a system is running at a given point in time. In a distributed system, when one site fails the remaining sites can continue operating, and data replicated at more than one site remains available.
3. Easier Expansion
- In a distributed environment, expansion of the system in terms of adding more data, increasing database sizes, or adding more processors is much easier.
4. Improved Performance
- We can achieve interquery and intraquery parallelism by executing multiple queries at different sites and by breaking up a query into a number of subqueries that execute in parallel, which leads to improvement in performance.
• FUNCTIONS OF DISTRIBUTED DATABASE SYSTEM
- Distribution leads to increased complexity in the system design and implementation. This complexity is accepted in order to achieve potential advantages such as:
1. Network Transparencies
2. Increased Reliability
3. Improved Performance
4. Easier Expansion
• FUNCTIONS OF CENTRALIZED DBMS
- The basic function of a centralized DBMS is that it provides a complete view of your data. For example, we can query for the number of customers worldwide who are willing to buy.
- The second basic function of a centralized DBMS is that it is easier to manage than distributed systems.
• FUNCTIONS OF DISTRIBUTED DATABASE SYSTEM
1. Keeping track of data- The basic function of DDBMS is to keep track of the
data distribution, fragmentation and replication by expanding the DDBMS
catalog.
2. Distributed Query Processing- The basic function of DDBMS is its ability to access remote sites and to transmit queries and data among the various sites via a communication network.
3. Replicated Data Management- The basic function of DDBMS is basically to
decide which copy of a replicated data item to access and to maintain the
consistency of copies of replicated data items.
4. Distributed Database Recovery- The ability to recover from the individual
site crashes and from new types of failures such as failure of communication
links.
5. Security- The basic function of DDBMS is to execute Distributed Transaction
with proper management of the security of the data and the
authorization/access privilege of users.
6. Distributed Directory Management- A directory basically contains
information about data in the database. The directory may be global for the
entire DDB, or local for each site. The placement and distribution of the
directory may have design and policy issues.
7. Distributed Transaction Management- The basic function of DDBMS is its ability to devise execution strategies for queries and transactions that access data from more than one site, to synchronize the access to distributed data, and to maintain the integrity of the complete database.
• But these functions increase the complexity of a DDBMS over a centralized DBMS.
| WEEK 14 AND 15 |
➢ DATA WAREHOUSING
• WHAT IS DATA WAREHOUSING?
- Data Warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights.
- Data Warehousing is a vital component of business intelligence.
- It also serves as a library of all of a business's data.
• DATA WAREHOUSE SYSTEM
- Data warehouse system is also known by the following name:
✓ Decision Support System (DSS)
- Is a computer program application used to improve a company's
decision-making capabilities.
✓ Executive Information System (EIS)
- Is a decision support system (DSS) used to assist senior executives in
the decision-making process.
- It does this by providing easy access to important data needed to
achieve strategic goals in an organization.
✓ Management Information System (MIS)
- is a computer system consisting of hardware and software that serves
as the backbone of an organization's operations.
- An MIS gathers data from multiple online systems, analyzes the
information, and reports data to aid in management decision-making.
✓ Business Intelligence Solution (BI)
- serve to retrieve, process, analyze and report data for making
informed business decisions.
✓ Analytic Application
- business intelligence applications software, specially designed to
measure and increase the performance of specific business operations
(HR, procurement, finance, sales, marketing, ...) or industries.
✓ Data Warehouse
• HISTORY OF DATA WAREHOUSE
- The data warehouse helps users to understand and enhance their organization's performance.
- Here are some key events in the evolution of the data warehouse:
✓ 1960 – Dartmouth and General Mills, in a joint research project, develop the terms dimensions and facts.
✓ 1970 – AC Nielsen and IRI introduce dimensional data marts for retail sales.
✓ 1983 – Teradata Corporation introduces a database management system specifically designed for decision support.
- Data warehousing started in the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the Business Data Warehouse.
- However, the real concept was given by Bill Inmon, who is considered the father of the data warehouse. He has written about a variety of topics on the building, usage, and maintenance of the warehouse and the Corporate Information Factory.
- Data Warehouse is NOT a new thing.
• HOW A DATA WAREHOUSE WORKS
- A data warehouse works as a central repository where information arrives from one or more data sources.
- Data flows into a data warehouse from the transactional system and other
relational databases.
- Data may be:
1. Structured
2. Semi-Structured
3. Unstructured data
• TYPES OF DATA WAREHOUSE
- Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW): Enterprise Data Warehouse is a
centralized warehouse. It provides decision support service across the
enterprise.
2. Operational Data Store (ODS): An Operational Data Store is a data store used when neither the data warehouse nor OLTP systems can support an organization's reporting needs.
3. Data Mart: A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance.
• GENERAL STAGES OF DATA WAREHOUSE
- The following are general stages of use of the data warehouse (DWH):
✓ Offline Operational Database: In this stage, data is just copied from
an operational system to another server. In this way, loading,
processing, and reporting of the copied data do not impact the
operational system’s performance.
✓ Offline Data Warehouse: Data in the Datawarehouse is regularly
updated from the Operational Database. The data in Datawarehouse is
mapped and transformed to meet the Datawarehouse objectives.
✓ Real Time Data Warehouse: In this stage, Data warehouses are
updated whenever any transaction takes place in operational database.
For example, Airline or railway booking system.
✓ Integrated Data Warehouse: In this stage, Data Warehouse are
updated continuously when the operational system performs a
transaction. The Datawarehouse then generates transactions which are
passed back to the operational system.
• COMPONENTS OF DATA WAREHOUSE
- Four components of Data Warehouses are:
✓ Load Manager: Load manager is also called the front component.
✓ Warehouse Manager: Warehouse manager performs operations
associated with the management of the data in the warehouse.
✓ Query Manager: Query manager is also known as backend
component.
✓ End User Access Tools: This is categorized into five different groups
like:
1. Data Reporting
2. Query Tools
3. Application Development Tools
4. EIS Tools
5. OLAP Tools and Data Mining Tools
• WHO NEEDS DATA WAREHOUSE?
- DWH (Data Warehouse) is needed for all types of users like:
✓ Decision makers who rely on mass amounts of data.
✓ Users who use customized, complex processes to obtain information
from multiple data sources.
✓ It is also used by the people who want simple technology to access the
data.
✓ It is also essential for those people who want a systematic approach to making decisions.
✓ If the user wants fast performance on a huge amount of data, which is a necessity for reports, grids, or charts, then a data warehouse proves useful.
✓ A data warehouse is the first step if you want to discover 'hidden patterns' of data flows and groupings.
• WHAT IS A DATA WAREHOUSE USED FOR?
- Here, are most common sectors where Data Warehouse is used:
✓ Airline
✓ Banking
✓ Healthcare
✓ Public Sector
✓ Investment and Insurance Sector
✓ Retail Chain
✓ Telecommunications
✓ Hospital and Industry
• STEPS TO IMPLEMENT DATA WAREHOUSE
- The best way to address the business risk associated with a Data Warehouse
implementation is to employ a three-prong strategy as below:
1. Enterprise Strategy
2. Phased Delivery
3. Iterative Prototyping
• BEST PRACTICES TO IMPLEMENT A DATA WAREHOUSE
- Decide on a plan to test the consistency, accuracy, and integrity of the data.
- The data warehouse must be well integrated, well defined and time stamped.
- While designing the Data Warehouse, make sure you use the right tool, stick to the life cycle, take care with data conflicts, and be ready to learn from your mistakes.
- Never replace operational systems and reports.
- Don’t spend too much time on extracting, cleaning and loading data.
- Ensure to involve all stakeholders including business personnel in Data
Warehouse implementation process.
- Establish that Data warehousing is a joint or team project. You don’t want to
create Data warehouse that is not useful to the end users.
- Prepare a training plan for the end users.
• THE FUTURE OF DATA WAREHOUSING
- Change in regulatory constraints may limit the ability to combine source of
disparate data. These disparate sources may include unstructured data which is
difficult to store.
- As databases grow, the estimates of what constitutes a very large database continue to grow, and it is complex to build and run data warehouse systems that keep increasing in size. The hardware and software resources available today do not allow keeping a large amount of data online.
- Multimedia data cannot be manipulated as easily as text data, whereas textual information can be retrieved by the relational software available today. This could be a research subject.
• DATA WAREHOUSE TOOLS
- Many data warehousing tools are available in the market. Here are some of the most prominent ones:
1. MarkLogic - MarkLogic Server is designed to securely store and manage a variety of data to run transactional, operational, and analytical applications.
2. Oracle - Oracle Database is the first database designed for enterprise grid computing, the most flexible and cost-effective way to manage information and applications.
3. Amazon Redshift - AWS Redshift is a data warehouse product built by Amazon Web Services. It is used for large-scale data storage and analysis, and is frequently used to perform large database migrations.

| WEEK 14 AND 15 |
➢ DATA WAREHOUSING: CONTINUATION
• DATA WAREHOUSING
- Data warehousing is the secure electronic storage of information by a business
or other organization.
- The goal of data warehousing is to create a trove of historical data that can be
retrieved and analyzed to provide useful insight into the organization's
operations.
- Data warehousing is a vital component of business intelligence.
- That wider term encompasses the information infrastructure that modern
businesses use to track their past successes and failures and inform their
decisions for the future.
✓ Data warehousing is the storage of information over time by a business
or other organization.
✓ New data is periodically added by people in various key departments
such as marketing and sales.
✓ The warehouse becomes a library of historical data that can be
retrieved and analyzed in order to inform decision-making in the
business.
✓ The key factors in building an effective data warehouse include
defining the information that is critical to the organization and
identifying the sources of the information.
✓ A database is designed to supply real-time information.
✓ A data warehouse is designed as an archive of historical information.
• HOW DATA WAREHOUSING WORKS
- The need to warehouse data evolved as businesses began relying on computer
systems to create, file, and retrieve important business documents.
- The concept of data warehousing was introduced in 1988 by IBM
researchers Barry Devlin and Paul Murphy.
- Data warehousing is designed to enable the analysis of historical data.
- Comparing data consolidated from multiple heterogeneous (different) sources
can provide insight into the performance of a company.
- A data warehouse is designed to allow its users to run queries and analyses on
historical data derived from transactional sources.
- Data added to the warehouse do not change and cannot be altered.
- The warehouse is the source that is used to run analytics on past events, with a
focus on changes over time.
- Warehoused data must be stored in a manner that is secure, reliable, easy to
retrieve, and easy to manage.
• MAINTAINING THE DATA WAREHOUSE
- There are certain steps that are taken to maintain a data warehouse.
- One step is data extraction, which involves gathering large amounts of data
from multiple source points.
- After a set of data has been compiled, it goes through data cleaning, the
process of combing through it for errors and correcting or excluding any that
are found.
- The cleaned-up data are then converted from a database format to a warehouse
format.
- Once stored in the warehouse, the data goes through sorting, consolidating, and summarizing, so that it will be easier to use.
- Over time, more data are added to the warehouse as the various data sources are updated. (A small sketch of these maintenance steps follows this list.)
- A key book on data warehousing is W. H. Inmon's "Building the Data Warehouse," a practical guide that was first published in 1990 and has been reprinted several times.
- Today, businesses can invest in cloud-based data warehouse software services
from companies including Microsoft, Google, Amazon, and Oracle, among
others.
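- A small sketch of those maintenance steps with hypothetical in-memory data: extract from two sources, clean out bad rows, convert to a warehouse format, and summarize.

```python
# Sketch of warehouse maintenance: extraction, cleaning, conversion,
# and summarizing. All source data and field names are hypothetical.
from datetime import date

# 1. Extraction: gather rows from multiple source points.
source_a = [{"cust": "Ana", "amount": "100.0"},
            {"cust": "Ben", "amount": "oops"}]   # a bad row
source_b = [{"cust": "Ana", "amount": "50.5"}]
raw = source_a + source_b

# 2. Cleaning: correct or exclude rows with errors.
def clean(rows):
    for row in rows:
        try:
            yield row["cust"], float(row["amount"])
        except ValueError:
            pass  # a real system would log or repair the bad row

# 3-4. Conversion to the warehouse format (add a load date) and
# summarizing: consolidate one total per customer for easier use.
totals = {}
for cust, amount in clean(raw):
    totals[cust] = totals.get(cust, 0.0) + amount

warehouse = [{"cust": c, "total": t, "loaded": date.today().isoformat()}
             for c, t in totals.items()]
print(warehouse)  # Ana's two source rows consolidate to 150.5
```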
• WHAT IS DATA MINING?
- Businesses warehouse data primarily for data mining.
- That involves looking for patterns of information that will help them improve
their business processes.
- A good data warehousing system makes it easier for different departments
within a company to access each other's data.
- For example, a marketing team can assess the sales team's data in order to
make decisions about how to adjust their sales campaigns.
• 5 STEPS OF DATA MINING
- The data mining process breaks down into five steps:
1. An organization collects data and loads it into a data warehouse.
2. The data are then stored and managed, either on in-house servers or in a
cloud service.
3. Business analysts, management teams, and information technology professionals access and organize the data.
4. Application software sorts the data.
5. The end-user presents the data in an easy-to-share format, such as a graph
or table.
• DATA WAREHOUSING VS DATABASES
- A data warehouse is not the same as a database:
✓ A database is a transactional system that monitors and updates real-
time data in order to have only the most recent data available.
✓ A data warehouse is programmed to aggregate structured data over
time.
- For example, a database might only have the most recent address of a
customer, while a data warehouse might have all the addresses for the
customer for the past 10 years.
- Data mining relies on the data warehouse. The data in the warehouse are sifted
for insights into the business over time.
• ADVANTAGES AND DISADVANTAGES OF DATA WAREHOUSES
- Data warehousing is intended to give a company a competitive advantage.
- It creates a resource of pertinent information that can be tracked over time and
analyzed in order to help a business make more informed decisions.
- It also can drain company resources and burden its current staff with routine
tasks intended to feed the warehouse machine.
- The Corporate Finance Institute identifies these potential disadvantages of
maintaining a data warehouse:
✓ It takes considerable time and effort to create and maintain the
warehouse.
✓ Gaps in information, caused by human error, can take years to surface,
damaging the integrity and usefulness of the information.
✓ When multiple sources are used, inconsistencies between them can
cause information losses.
- Advantages:
✓ Provides fact-based analysis on past company performance to inform
decision-making.
✓ Serves as a historical archive of relevant data.
✓ Can be shared across key departments for maximum usefulness.
- Disadvantages:
✓ Creating and maintaining the warehouse is resource-heavy.
✓ Input errors can damage the integrity of the information archived.
✓ Use of multiple sources can cause inconsistencies in the data.
• WHAT IS A DATA WAREHOUSE AND WHAT IS IT USED FOR?
- A data warehouse is an information storage system for historical data that can
be analyzed in numerous ways.
- Companies and other organizations draw on the data warehouse to gain
insight into past performance and plan improvements to their operations.
• WHAT ARE THE STAGES OF DATA WAREHOUSING?
- There are at least seven stages to the creation of a data warehouse, according
to ITPro Today, an industry publication. They include:
✓ Determining the business objectives and its key performance
indicators.
✓ Collecting and analyzing the appropriate information.
✓ Identifying the core business processes that contribute the key data.
✓ Constructing a conceptual data model that shows how the data are
displayed to the end-user.
✓ Locating the sources of the data and establishing a process for feeding
data into the warehouse.
✓ Establishing a tracking duration. Data warehouses can become unwieldy.
Many are built with levels of archiving, so that older information is
retained in less detail.
✓ Implementing the plan.
• IS SQL A DATA WAREHOUSE?
- SQL, or Structured Query Language, is a computer language that is used to
interact with a database in terms that it can understand and respond to.
- It contains a number of commands such as "select," "insert," and "update." It
is the standard language for relational database management systems.
- A database is not the same as a data warehouse, although both are stores of
information.
- A database is an organized collection of information.
- A data warehouse is an information archive that is continuously built from
multiple sources.
• THE BOTTOM LINE: SUMMARY
- The data warehouse is a company's repository of information about its
business and how it has performed over time.
- Created with input from employees in each of its key departments, it is the
source for analysis that reveals the company's past successes and failures and
informs its decision-making.

| WEEK 16 |
➢ DATABASE CONNECTIVITY AND WEB
TECHNOLOGIES
• CONNECTIVITY
- A DBMS must have access to data, or it won't do much.
- The methods used to access the data can vary according to the available route
to the data.
- When the database is on the computer you are using, it is much simpler than
when it is stored on a network device.
• THREE LAYERED VIEW OF DATA ACCESS
1. TOP LAYER
- an interface between an application trying to use the database and the
middle layer;
- the request that comes out of this layer will be formatted in one of several
protocols.
2. MIDDLE LAYER
- where transformation and translation of the data request take place, changing
from an understood language supported in the top layer to the language of the
actual database used to create the data layer;
- the middleware may pass the request to a system, or may read the data directly
3. DATA LAYER
- where the actual data is stored, in the language of the database and of the
storage medium;
- there may be server software that functions at this level.
• TECHNOLOGIES: ODBC, OLE, ADO.NET, JDBC
- When the request comes from an application, it flows from the application to
the middleware, to the database. Responses flow in the opposite direction.
❖ Native SQL connectivity
- whatever method your DBMS provides for access through its proprietary
interface;
- this method may provide the best access to features the vendor provides in its
own interface, but the method and interface will only be useful for data stored
with that DBMS
❖ Microsoft’s Open Database Connectivity (ODBC), Data Access Objects
(DAO), and Remote Data Objects (RDO)
- ODBC - allows Windows programs to use SQL with databases through an
Application Programming Interface (API); the text states that this is the most
widely supported interface
- DAO - also uses an API; when used with MS Access, can provide more
features of that application
- RDO - interface better suited to server-based databases such as MS SQL,
Oracle, and DB2
- DAO and RDO use services from ODBC
- Applications access ODBC through an API, a driver manager manages
connections to the database, and an ODBC driver does the communications
with the database.
- To establish an ODBC connection, you must create a Data Source Name for
it.
- That process requires an ODBC driver, a unique name for the connection, and
driver parameters including a server name, the name and location of the
database, a user name, and credentials to access the database.
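- A minimal sketch of using such a connection from Python, assuming the third-party pyodbc package and a DSN named SalesDB that was already created; the user name, password, and table are hypothetical.

```python
# Connect through a previously created Data Source Name (DSN).
# Assumes the third-party pyodbc package; the DSN name, credentials,
# and table name are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=SalesDB;UID=app_user;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM customers")
print(cursor.fetchone()[0])   # number of rows in the customers table
conn.close()
```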
❖ Microsoft’s Object Linking and Embedding for Databases (OLE-DB)
- used for connection to relational and non-relational databases; based on
Microsoft Component Object Model (COM);
- objects used in this method may be consumers (that request and use data), or
providers (that connect to a data source and provide data)
- To make it more confusing, providers can be plain data providers, or service
providers that are inserted between the consumer and the (data) provider to
perform more management services, or a selection of services.
- Scripting language support is not included, so another layer is added called
ActiveX Data Objects (ADO) which provides a path for scripts to access
OLE-DB, then to access the database.
❖ Microsoft’s ActiveX Data Objects (ADO.NET)
- used to access data through .NET environments, through an extension of the
ADO/OLE-DB method;
- allows manipulation of a dataset, which is a downloaded copy of a database,
which is manipulated then synchronized with the master copy of the database;
this allows processing of data pulled across the Internet, which is then placed
into the original source
❖ Oracle’s Java Database Connectivity (JDBC)
- concerns the use of Java, the programming language, not JavaScript, the
scripting language;
- Java programs are supposed to work on any platform equipped to run it,
regardless of the local operating system; running a Java program works
through a locally installed API:
- the application calls the API, which calls the JDBC driver, which calls
appropriate middleware, which accesses the database
• WEB TO DATABASE MIDDLEWARE
- The text discusses several methods used to access data across Internet
connections:
❖ Server-Side Extensions
- Services of all kinds across the Internet can be handled through web servers.
The World Wide Web is the most successful platform on the Internet, causing
most users to forget that other services exist.
- A request for a web page is a request for a data object, or a collection of them,
managed by a web server.
- The text mentions that some pages are dynamic, created on the fly based on a
user request.
- To construct such a page, the server must pull data from a database other than
its own.
- To do this, server-side extensions are used to make requests to the appropriate
database, retrieving counts of items on hand, shipping options, vendor
choices, and other data important to a customer.
❖ Web Server Interfaces
- There are two commonly used methods that allow the web server to
communicate through the extensions:
1. Common Gateway Interface (CGI)
- script files that communicate with the middleware; they can be written in any
of several languages, provided the server supports that language;
- scripts run through interpreters, which can slow down requests when traffic is
heavy; each request must make a new connection to the database
2. Application Programming Interface (API)
- APIs use dynamic link libraries (DLLs), which are compiled code, letting
them run faster than scripts; DLLs run in memory, so they run faster, once
loaded;
- an API can share its connection with the database with multiple requests,
opening the bottleneck in a way that the CGI scripts cannot
❖ Client-Side Extensions
- These are extensions to web browsers; they run on the client device, not the
server device; the text mentions five types:
1. plug-in extensions are additional software added to the browser that run
when they are needed
2. JavaScript is an embedded script file (interpreted) that is part of a web
page
3. Java applets are programs that can be run by a client if the Java runtime
environment has been installed;
4. ActiveX controls are like Java applets, but they use code invented by
Microsoft, and only run in Microsoft browsers
5. VBScript is a Microsoft script language, and its scripts can be run in web
pages or separately
❖ Web Application Servers
- These are specialized server programs that can perform the middleware duties
for the primary web server. The text presents a dozen tasks and features a web
application server can perform, showing how useful they can be.
❖ Web Database Development
- The last item in this section is about making web pages that are meant to interface with a database. The language used to do this can be a blend of SQL and HTML.
- The text also mentions that PHP can be used for this kind of web page coding.
• EXTENSIBLE MARKUP LANGUAGE
- The text describes document types that are agreed upon by business partners,
written in XML, and used to facilitate orders from one company to another.
- The text refers to this kind of document needing a document type definition
(DTD) that must be shared between these business partners. Using this kind of
agreed upon format prevents misunderstandings, and makes it possible to
process orders and invoices much faster.
- Regular transactions of this type can benefit from definitions of such
documents in an XML schema document (XSD), which is a meta-language
description of a DTD document used in a partner system.
- This is very much like a schema for a database, which is used to describe all
of the objects used in it. Look at the example on page 707.
- It should remind you of a set of declarations of fields in a database, even
though it is written in XML.
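- The sketch below shows the idea in miniature, assuming the third-party lxml package: a tiny hypothetical order schema plays the role of the shared definition, and a partner's document is validated against it before processing.

```python
# Validate a partner's XML order against a shared schema definition.
# Assumes the third-party lxml package; the element names are
# hypothetical stand-ins for an agreed business document format.
from lxml import etree

schema = etree.XMLSchema(etree.XML("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string"/>
        <xs:element name="qty"  type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""))

order = etree.XML("<order><item>widget</item><qty>12</qty></order>")
print(schema.validate(order))  # True: the document matches the agreed format
```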
• CLOUD COMPUTING SERVICES
✓ a computing model
✓ for enabling ubiquitous, convenient, on-demand network access
✓ to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services)
✓ that can be rapidly provisioned and released with minimal management effort or service provider interaction
- The bottom line is that a vendor is selling/leasing services that are available as
needed from devices that are owned or operated by that vendor.
- It should be obvious that such a model depends heavily on reliable, fast, error
free data access, typically over an Internet connection.
- The text mentions several features of this kind of service that make cloud
services attractive to a business:
✓ no hardware cost for the client, except for client-side machines
✓ no problems when you scale up such a service, and it may scale down
automatically; shared infrastructure means the client does not have to
build it
✓ web interfaces may mean little software cost for the clients
• SEVERAL TYPES OF CLOUDS
❖ Public
- Services on this cloud are for sale/lease to the general public. Amazon and
Google provide this type of service in their clouds.
- This cloud is managed only by the provider.
❖ Private
- Services on this cloud are for a specific organization. It was built by them or
for them, but it is only used by them.
- It may be most useful to international organizations. This cloud may be
managed by the organization or by a third party.
❖ Community
- This cloud is for a collection of organizations that share a common interest.
- Governments and schools may have this kind of cloud.
- This cloud may be managed by the organizations involved or by a third party.
• SEVERAL TYPES OF CLOUD SERVICES
❖ Software as a Service (SaaS)
- Applications are run in the cloud from client devices.
- Devices can be simple web computers, smart devices, or fully functional
computers.
- The applications may be the same for all clients of that service.
- Examples: Office 365 and Google Docs.
❖ Platform as a Service (PaaS)
- Clients can develop their own applications with tools from the vendor, and
can use the vendor's network to deploy those applications to the client's
workstations and devices.
- Examples: Microsoft Azure with .NET, Google Application Engine with
Python or Java.
❖ Infrastructure as a Service (IaaS)
- In this one, the client can choose to add or remove storage, servers,
processors, and personal computers that work across the WAN link.
- An example would be the servers that are available from Amazon Web
Services.
| WEEK 16 |
➢ DATABASE MANAGEMENT AND DESIGN
- Most transactions take more than one SQL command, which means that we
have to be careful about where each transaction starts and ends.
- The number of steps in a transaction is measured by the number of database
requests (commands) that are actually used.
- The text reminds us that there may be several operating system calls that
happen as a result of each database request, but these are not counted as parts
of the transaction.
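- A minimal sketch of those boundaries using Python's built-in sqlite3 module: one transaction made of two database requests that must succeed or fail together. The table and amounts are hypothetical.

```python
# One transaction, several database requests, explicit boundaries.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
con.commit()

# The 'with' block wraps everything inside it in BEGIN ... COMMIT,
# and rolls back automatically if any request raises an error, so a
# half-finished transfer is never left behind.
with con:
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    con.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")

print(con.execute("SELECT id, balance FROM account").fetchall())
# [(1, 450.0), (2, 150.0)]
```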
• CONCURRENCY CONTROL
- Concurrency control is the coordination of the simultaneous actions of multiple users of a database system.
❖ Lost update problem
- When two transactions are updating the same data element (such as a
customer's balance) coordination is vital. Assume two transactions read a
customer's record at the same moment. One takes an order and increases the
current balance. The other takes a payment and decreases the customer's
balance. Now imagine that they each read the customer's record before the
other transaction updates the balance.
❖ Uncommitted data problem
- A related problem occurs when one transaction reads data, saves a new value,
then rolls back, while another transaction reads the new saved value,
calculates a change to it, then saves its new change, oblivious to the fact that
the number it is working from is a false value.
- No transaction should be allowed to start working with data that is still in flux in another transaction.
❖ Inconsistent retrievals problem
- This one comes from reading a table that is being updated while you are
reading it. You get some old data, some new data, then perform an operation
on your faulty dataset, resulting in a faulty interpretation of the data.
- This one is tricky. If you think about it, you could defend the idea that the data
looked the way you read it at a specific moment in time.
- Taking any aggregate measure on a database should not be considered if that database is being updated. A database can only be consistent before and after updates, never during updates.
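- A minimal sketch (plain Python, standing in for a DBMS) of the lost update race described above: both transactions read the same starting balance before either writes, so the first write is silently overwritten.

```python
# The lost update problem, simulated with plain Python variables.
balance = 100              # the customer's stored balance

t1_read = balance          # T1 reads 100 (an order: +30)
t2_read = balance          # T2 reads 100 (a payment: -20), BEFORE T1 writes

balance = t1_read + 30     # T1 writes 130
balance = t2_read - 20     # T2 writes 80 -- T1's update is lost

print(balance)             # 80, but the correct final balance is 110
```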
• LOCKING METHODS
- A lock is like a tag that says only a particular process is granted access to a
data object until that access is released.
- A pessimistic lock is a lock that is placed on a data object with the
assumption that another process is likely to cause a problem.
- A lock should be placed on data before reading it, held on that data while
changing it, and released only after the data changes are committed.
- A lock manager is the part of a DBMS that automatically applies locks and constrains other processes from ignoring them.
- A lock can be placed on any of several kinds of data objects. The level of precision of a lock is called its granularity.
• LOCK LEVEL
❖ Database
- The entire database is locked for one process.
- Good for a batch update, but not good for sharing among many users.
❖ Table
- Locks an entire single table.
- Only one process can access it, but another process can access a different
table.
- Not very useful when users need to update multiple tables in a single
transaction.
❖ Page
- Locks a section of the file that fills a portion of memory, commonly a 4KB
block.
- This is more useful than the table lock when you have large tables, but less
useful than a row lock.
❖ Row
- Locks a single row at a time in a table.
- Multiple transactions can access different rows at the same time.
- Can lead to a deadly embrace/deadlock if the database is not clever enough to
avoid it.
❖ Field
- Locks only a single field in a single row of a table.
- Very flexible, but takes a lot of processing to manage it for more than a few
users.
• LOCK TYPE/ MODE
❖ Binary
- If a lock is binary, it can only be locked or unlocked.
- If an object is locked, only the transaction having the lock can use the object.
- If an object is unlocked, any transaction can lock it.
- Once a transaction is done with an object, the lock must be unlocked.
- Generally considered too limiting for concurrency.
❖ Shared/Exclusive
- This type allows more choices.
- A lock is exclusive to the process that applies the lock.
- This is typically used by a process that has to change the object.
- A shared lock is a read-only permission, allowing multiple processes to have shared locks on the same object at the same time.
- In this system, an object can be unlocked, shared, or exclusive.
- An exclusive lock can only be granted if the object in question is unlocked.
- A shared lock can be granted if the object is shared or unlocked.
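- A minimal sketch of these rules as a small Python class (hypothetical, not a real DBMS lock manager):

```python
# Shared/exclusive locking rules: an object is unlocked, shared
# (readable by many), or exclusive (held by exactly one writer).
class SXLock:
    def __init__(self):
        self.readers = set()   # transactions holding a shared lock
        self.writer = None     # transaction holding the exclusive lock

    def acquire_shared(self, txn) -> bool:
        # Granted only while no exclusive lock is held.
        if self.writer is None:
            self.readers.add(txn)
            return True
        return False

    def acquire_exclusive(self, txn) -> bool:
        # Granted only if the object is completely unlocked.
        if self.writer is None and not self.readers:
            self.writer = txn
            return True
        return False

    def release(self, txn):
        self.readers.discard(txn)
        if self.writer == txn:
            self.writer = None

lock = SXLock()
print(lock.acquire_shared("T1"))     # True: unlocked -> shared
print(lock.acquire_shared("T2"))     # True: shared locks coexist
print(lock.acquire_exclusive("T3"))  # False: object is currently shared
```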
• METHODS TO AVOID LOCK PROBLEMS
❖ Two-phase locking
- This is a method that addresses serializability. It does not address deadlocks. (A sketch of the two phases follows this list.)
❖ Growing Phase
- The first phase is the growing phase.
- The transaction acquires all locks needed to do its job, one by one. When it
has them all, it is at the stage called the locked point.
- No actual work is done until the locked point is reached.
❖ Shrinking Phase
- The second phase is the shrinking phase. As you might guess from the name,
all locks held by the transaction are released in this phase.
- During this phase, no new locks are allowed.
- This method does not prevent deadlocks. If two transactions require the same
locks, and each has obtained one or more of them, both transactions are
blocked from attaining their locked points. As the text notes, this can only
happen with exclusive locks.
- Shared locks have no limit on the number of transactions that may have them
in common.
❖ Deadlock prevention
- If the system notices that a deadlock is possible, the transaction being processed is aborted, its actions (if any) are rolled back, and its currently held locks are released.
- The transaction is rescheduled. If we were practicing two phase locking, there
should have been no actions from this transaction yet.
❖ Deadlock detection
- If a deadlock is detected (already happening), one of the transactions is chosen as the victim. This one is aborted and rolled back.
- The other transaction is allowed to continue. The victim transaction is
restarted.
❖ Deadlock avoidance
- The method in this case is to grant access only to those resources that cannot
cause a deadlock.
- This makes more sense than the text's description, which reads more like a description of two-phase locking.
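- The sketch referenced under two-phase locking above: a transaction acquires every lock in the growing phase, does its work only after the locked point, and releases everything in the shrinking phase. It reuses the hypothetical SXLock class from the earlier sketch.

```python
# Two-phase locking over the hypothetical SXLock objects defined in
# the shared/exclusive sketch above.
def run_two_phase(txn, locks, work):
    held = []
    # Growing phase: acquire all the locks, one by one; no work yet.
    for lock in locks:
        if not lock.acquire_exclusive(txn):
            # Cannot reach the locked point; back out so the
            # transaction can be rescheduled (deadlock prevention).
            for h in held:
                h.release(txn)
            return False
        held.append(lock)
    # Locked point: every lock is held, so do the actual work.
    work()
    # Shrinking phase: release all locks; no new locks may be taken.
    for h in held:
        h.release(txn)
    return True
```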
• TIMESTAMPING METHODS
- It is a method often used with network compatible applications. In this case,
the DBMS reads the current system time when new requests are received,
converts it to a global time, and each request is given a unique timestamp.
- Timestamps must be unique to ensure that each request has a unique spot in a
queue.
- The text refers to timestamps as being global. It means that the same time
system is in effect regardless of the physical location of the requester.
- Often, timestamping is done with reference to Greenwich Mean Time (GMT)
or Universal Time Coordinated (UTC), which seems to have been renamed as
Coordinated Universal Time.
- The text gives us a new word that goes along with timekeeping:
monotonicity. It is a noun. It means the quality of either never increasing or
never decreasing.
- With reference to timestamps, it means never decreasing.
• METHODS OF MANAGING TIMESTAMP SYSTEMS
❖ The Wait/Die Scheme
- It is like a first-come, first-served system regarding requests for locks.
- If the transaction with the earlier timestamp is requesting a lock that would
cause a problem:
 The later transaction is favored.
 The transaction with the earlier timestamp waits for the other
transaction to finish and release its locks.
 Then execution of the transaction with the earlier timestamp continues.
- If the transaction with the later timestamp is requesting a lock
that would cause a problem:
 The earlier transaction is favored.
 The transaction having the later timestamp is chosen to die.
 It is stopped, rolled back, and rescheduled with its same timestamp.
 The transaction with the earlier timestamp continues.
❖ The Wound/Wait Scheme
- If the transaction with the earlier timestamp is requesting a lock
that would cause a problem:
 The earlier transaction is favored.
 The earlier transaction wounds/preempts the later one.
 The later transaction is stopped, rolled back, and rescheduled with its
same timestamp.
- If the transaction with the later timestamp is requesting a lock
that would cause a problem:
 The earlier transaction is favored.
 The transaction with the later timestamp waits for the other transaction
to finish and release its locks.
 Then execution of the transaction with the later timestamp continues.
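- The two schemes differ only in which side yields; a minimal sketch, where a smaller timestamp means an earlier (older) transaction and the return value says what happens:

    def wait_die(requester_ts, holder_ts):
        # Older requester waits; younger requester dies (rolled back and
        # rescheduled with its original timestamp).
        return "requester waits" if requester_ts < holder_ts else "requester dies"

    def wound_wait(requester_ts, holder_ts):
        # Older requester wounds (preempts) the younger holder; younger
        # requester waits for the older holder to finish.
        return "holder is wounded" if requester_ts < holder_ts else "requester waits"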
• OPTIMISTIC METHODS
- What if we assume that most locks do not actually conflict with each other?
This is an optimistic approach. With this approach we do not use locks or
timestamps. Transactions have three phases instead:
❖ Read - Read the database, make a dataset, make changes to the dataset.
❖ Validation - Check whether changes made to the dataset would create integrity or
consistency errors in the database. If no errors (positive test), proceed to the next
phase. If errors are detected (negative test), discard the dataset and restart the
transaction.
❖ Write - This phase is only entered if no errors are found in the validation phase.
That being so, changes are written from the dataset to the database.
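- A minimal sketch of the three phases, assuming each row carries a version number; in a real DBMS the validation and write would be one atomic step, and every name here is invented:

    def run_optimistic(db, key, transform, retries=10):
        for _ in range(retries):
            # Read phase: copy the row and its version into a private dataset.
            version, value = db[key]
            new_value = transform(value)       # changes touch only the copy
            # Validation phase: did anyone commit a newer version meanwhile?
            if db[key][0] == version:
                # Write phase: publish the dataset with a bumped version.
                db[key] = (version + 1, new_value)
                return new_value
            # Validation failed: discard the dataset and restart.
        raise RuntimeError("could not commit after repeated restarts")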
• ANSI LEVELS OF TRANSACTION ISOLATION
- The ANSI standard uses different terms to describe the amount of isolation,
that is, the protection of a transaction's data from other transactions.
- It begins by defining three kinds of reads that a transaction might allow
other transactions to make:
❖ dirty read - another transaction can read data that is not yet committed (saved)
❖ nonrepeatable read - reads of the same row at different times give different
results, because the row may have been changed or deleted
❖ phantom read - a query run at two different times produces more matching rows
at the later time
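- The four ANSI isolation levels are defined by which of these three reads each one still permits; the mapping below encodes that (True means the anomaly can occur at that level):

    ANSI_ISOLATION = {
        #                    dirty  nonrepeatable  phantom
        "READ UNCOMMITTED": (True,  True,  True),
        "READ COMMITTED":   (False, True,  True),
        "REPEATABLE READ":  (False, False, True),
        "SERIALIZABLE":     (False, False, False),
    }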
• RECOVERY MANAGEMENT
- We have considered several times in this chapter how to recover from a failed
transaction.
- The question in this discussion is how to recover from a failed database.
- The text begins by discussing events that could cause this kind of failure:
✓ hardware failure
✓ software failure
✓ human error or intentional human intervention
✓ physical disaster, such as fire, flood, power failure (The text calls this
one natural disaster, but there is nothing natural about a power failure.)
- When a recovery is needed, it may rely on one or more of these features:
❖ Write-Ahead-Log Protocol
- This tells the DBMS to write down a transaction's details in a log
before the transaction is processed.
- The idea is that a system failure that takes place during the
transaction will not prevent us from rolling back or completing the
transaction as necessary (see the sketch after this list of features).
❖ Redundant Transaction Logs
- This means to store backup copies of the logs in different places,
each of which is unlikely to suffer from a disaster that wipes out the
others.
❖ Database Buffers
- These are copies of parts of the database that are used as draft copies
of the data a transaction is working on. It is much faster to change a
copy in RAM, then write several changes at once, than it is to write
every change to the database as it happens.
- It is the same idea as using a dataset to hold changes that are written
to the database when they are all ready.
❖ Database Checkpoints
- When all open buffers are written at once to the disk version of the
database, that is a database checkpoint.
- Checkpoints need to be run often enough that the disk copy of the
database stays reasonably current.
- When a checkpoint has run, it is recorded in the transaction log.
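- A combined sketch of the features above: changes land in RAM buffers, every change is appended to the log before it is applied (write-ahead), and a checkpoint flushes all open buffers at once and records itself in the log; the data structures are stand-ins, not a real DBMS:

    log = []          # stands in for the (redundant) transaction log
    buffers = {}      # stands in for the database buffers in RAM
    disk = {}         # stands in for the database files on disk

    def apply_change(txn, key, value):
        log.append(("change", txn, key, value))   # write ahead of the change
        buffers[key] = value                      # fast: only RAM is touched

    def commit(txn):
        log.append(("commit", txn))               # the transaction's COMMIT point

    def checkpoint():
        disk.update(buffers)                      # write all open buffers at once
        buffers.clear()
        log.append(("checkpoint",))               # record the checkpoint in the log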
• KINDS OF RECOVERY
- The first type is a deferred-write (also called deferred update) recovery.
- This is done on systems that update the transaction log immediately, but they
do not update the database immediately.
- The text gives us four steps to conduct this kind of recovery:
1. Find the last checkpoint in the transaction log. This is the last time the
database was updated.
2. Ignore transactions that were completed before the last checkpoint. They
were written during that checkpoint.
3. For any transaction that was committed after the last checkpoint, use the
DBMS to reconstruct that transaction from the transaction log.
4. Ignore transactions that were rolled back or still in progress
after the last checkpoint.
- The second type is a write-through (also called an immediate update)
recovery. This is done on systems that update the database as transactions are
processed, before they are committed.
- On such systems, if there is an aborted transaction, that transaction needs to be
rolled back. That extra undo step is what makes this procedure different:
1. Find the last checkpoint in the transaction log, as above.
2. Ignore transactions that were completed before the last checkpoint. They
were written during that checkpoint.
3. For any transaction that was committed after the last checkpoint, use the
DBMS to reconstruct that transaction from the transaction log.
4. For any transaction that was rolled back after the last checkpoint, or had not
reached its COMMIT point, use the DBMS to roll back or undo each of them,
up to the checkpoint.
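- A sketch of the deferred-write procedure over the toy log from the earlier sketch; steps 2 and 4 happen by omission, and a write-through recovery would additionally undo the changes of uncommitted transactions:

    def recover_deferred(log, disk):
        # Step 1: find the last checkpoint record in the log.
        last_cp = max((i for i, rec in enumerate(log) if rec[0] == "checkpoint"),
                      default=-1)
        tail = log[last_cp + 1:]
        # Which transactions reached their COMMIT point after the checkpoint?
        committed = {rec[1] for rec in tail if rec[0] == "commit"}
        # Step 3: redo the changes of committed transactions. Steps 2 and 4:
        # everything before the checkpoint, and every transaction not in
        # `committed`, is simply skipped.
        for rec in tail:
            if rec[0] == "change" and rec[1] in committed:
                _, txn, key, value = rec
                disk[key] = value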
• PARTS OF A DBMS
- As you know, tables are stored in files.
- A file can contain more than one table.
- Storage space for them is allocated on a storage device when a database is
created. In some environments, files simply grow as needed. In other
environments, such as Oracle, storage space may need to be expanded as files
grow large.
- This expansion can be in predetermined increments called extents.
- An extent has an advantage: it is an allocation of contiguous (sequential)
memory addresses on a storage device.
- This makes access to files stored in this address range faster than access to
files stored in discontinuous blocks.
- This disk space is allocated for a particular kind of files, such as a database or
part of one, and it can be called a tablespace or a file group.
- There may be one for the entire database, or one for each kind of file, such as
tables, indexes, user files, and temporary storage which works like a swap file.
- Buffer cache: this is space allocated in RAM to allow very fast access
to the most recently used data.
- The idea is that data you just worked with may be the data you will need again
in a moment.
- This can be data you have just pulled from a disk, or changed data you have
not yet stored on a disk.
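- A tiny illustration of the buffer-cache idea: recently used pages stay in RAM, and the least recently used page is evicted when the cache fills; OrderedDict is used purely for brevity:

    from collections import OrderedDict

    class BufferCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = OrderedDict()    # page_id -> data, oldest first

        def get(self, page_id, read_from_disk):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)    # mark as recently used
                return self.pages[page_id]
            data = read_from_disk(page_id)         # slow path: go to disk
            self.pages[page_id] = data
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)     # evict least recently used
            return data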
• THREE PERSPECTIVES OF OPTIMIZATION
❖ Automatic or manual - An automatic optimization is chosen by the DBMS. A manual one
is chosen by the programmer/user.
❖ Execution time (dynamic) or compile time (static) - An optimization method may be chosen
at the time a query is run. If so, it is a dynamic method. If a query is
embedded in a program that is compiled (perhaps written in one of the
C variants), the optimization method may be chosen at that time,
instead of when the query actually runs. That makes the method a
static method.
❖ Statistically based or rule based - A statistically based method relies on measurements of
the system, such as number of files or records, number of tables,
number of users with rights, and so on. These statistics change over
time. They can be generated dynamically or manually. A rule-based
method, on the other hand, is based on general rules set by the user or
a DBA.
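- To contrast the last pair: a statistically based choice consults current measurements, while a rule-based choice applies a fixed rule; the cost numbers below are invented for illustration:

    def choose_statistical(stats, has_index, selectivity):
        # Estimate from current statistics how many rows the query touches;
        # an index pays off only for a small fraction of the table.
        estimated_rows = stats["row_count"] * selectivity
        index_cost = estimated_rows * 3        # invented per-lookup penalty
        full_scan_cost = stats["row_count"]
        if has_index and index_cost < full_scan_cost:
            return "index scan"
        return "full table scan"

    def choose_rule_based(has_index):
        # A fixed rule set by the user or a DBA, ignoring current statistics.
        return "index scan" if has_index else "full table scan"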