
PARALLEL DATABASE

-----------------------------------------------------------------------------------

INTRODUCTION
A parallel database system seeks to improve performance through parallelization of
various operations, such as loading data, building indexes, and evaluating queries.

Parallel systems improve processing and I/O speeds by using multiple processors and
disks in parallel.

NEED
We know that data volumes are large and there are many manipulations, storage and retrieval operations, etc., so we have to increase the speed of data processing.
This is where parallelism helps.
Parallel machines are becoming quite common and affordable nowadays, since the prices of microprocessors, memory, and disks have dropped sharply.
BENEFITS
It improves response time: when a number of processors work in parallel, a task completes much faster.
It improves throughput: the rate at which tasks are completed increases.

**HOW TO MEASURE BENEFITS**


SPEED UP
Adding more resources results in proportionally less running time for a fixed amount of data.
Taking an example: if a task is given to a single-processor system, it takes a time TS to produce the output.
If the same task is given to a parallel machine with multiple processors and disks, it takes a time TL.
TL will be comparatively smaller: if one processor takes 60 seconds to complete the job, then TS = 60 seconds, but for the same job TL = 20 seconds, because the parallel machine has 3 processors that share the work.
SPEEDUP = TS / TL
Next is SCALE UP
It is the factor that expresses how much more work can be done in the same period of time by a system "n" times larger.
In speedup we talked about time; here we talk about the volume of work.
Going by an example: if a single-processor system can process 100 transactions in a given amount of time, and the parallel system can process 300 transactions in the same amount of time, then the value of SCALEUP is 3, that is, 300/100.
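A minimal sketch of these two measures, simply plugging in the example figures from above (they are illustrative numbers, not measured values):

```python
def speedup(t_single, t_parallel):
    # SPEEDUP = TS / TL
    return t_single / t_parallel

def scaleup(work_parallel, work_single):
    # SCALEUP = work done by the larger system / work done by the small system, in the same time
    return work_parallel / work_single

print(speedup(60, 20))    # 3.0 -> the 3-processor machine is 3x faster
print(scaleup(300, 100))  # 3.0 -> the larger system does 3x the work in the same time
```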

ARCHITECTURE
In parallel database systems there are three major architectures:
shared-memory
shared-disk
shared-nothing.
First we'll take a look at --
SHARED MEMORY
Also known as Symmetric multiprocessing (SMP).
Multiple CPUs are attached to an interconnection network, and all the processors share a common memory area over that network.
Advantages of a shared-memory system:
It gives high-speed data access for a limited number of processors.
Communication between processors is efficient.
It also has certain disadvantages:
It cannot scale beyond roughly 80 or 100 CPUs in parallel.
The bus or the interconnection network becomes a bottleneck as the number of CPUs grows.
SHARED DISK ARCHITECTURE
Various CPUs are attached to an interconnection network.
Each CPU has its own memory and all of them have access to the same disk. Any
processor can access any disk.
The memory is not shared among CPUs therefore each node has its own copy of the
operating system and DBMS.
Advantages of shared disk -->
Since each processor has its own memory, the memory bus is not a bottleneck.
It offers a degree of fault tolerance: if one processor fails, the others can take over its tasks, since the database is on disks accessible to all of them.
**Problems with Shared Memory and Shared Disk Architecture**
As more processors are added, the existing processors slow down because of increased memory contention and contention for network bandwidth.
It has been observed that a system with 1,000 processors can be only about 4% as effective as a single-processor system.
This is why the shared-nothing architecture was adopted.
SHARED NOTHING ARCHITECTURE
Each processor has its own local main memory and disk space.
No two processors can access the same storage area.
All communication between processors happens over a communication network.
There is homogeneity of nodes in a shared-nothing architecture.
Advantages of shared nothing -->
It scales well to a large number of processors, because the interconnection network carries only queries and results, not every memory or disk access.
Adding nodes adds processing power, memory, and disk capacity together, which makes near-linear speedup and scale-up possible.
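A toy sketch of the shared-nothing idea using Python's multiprocessing module (the worker function, queue, and data are made up for illustration): each node works only on its own local partition, and all coordination happens by passing messages rather than by reading another node's memory.

```python
from multiprocessing import Process, Queue

def node(node_id, local_rows, results):
    # Each node computes over its own local data only...
    total = sum(local_rows)
    # ...and communicates with the others purely via messages (the queue).
    results.put((node_id, total))

if __name__ == "__main__":
    partitions = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7, 8, 9]}  # toy local data per node
    results = Queue()
    procs = [Process(target=node, args=(i, rows, results)) for i, rows in partitions.items()]
    for p in procs: p.start()
    for p in procs: p.join()
    print(sorted(results.get() for _ in procs))
```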
TYPES OF PARALLELISM
First we'll talk about I/O PARALLELISM.
It is like splitting: say you have some data; to achieve parallelism, the data is split into smaller chunks and each chunk is placed on a separate disk for a separate processor.
Suppose the data comes from three different branches and we've split it into separate chunks; then the three branch managers can access the three disks separately.
Next comes INTERQUERY PARALLELISM.
Say a DBMS serves several different applications, so queries arrive from many sources.
What happens is that each query is handled independently, in parallel, on a different processor.
Now comes INTRAQUERY PARALLELISM.
In this case there is only one query, and we speed up that single query by evaluating parts of it in parallel.

**Data Partitioning**
As we discussed, I/O parallelism reduces the time required to retrieve relations from disk by partitioning the relations across multiple disks, so that we can work on the data in parallel.

The types of partitioning are discussed below. First,


HORIZONTAL partitioning
If there is a table, you simply take some rows and move them to disk 1, some other rows to disk 2, some others to disk 3, and so on.
If you split it row-wise, it is horizontal partitioning.
Vertical partitioning is exactly the same idea, but column-wise.
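A small sketch of both ideas (the "disks" here are just Python lists standing in for separate storage): horizontal partitioning assigns whole rows to different disks, while vertical partitioning assigns whole columns.

```python
table = [
    {"id": 1, "name": "sam", "salary": 100},
    {"id": 2, "name": "ana", "salary": 200},
    {"id": 3, "name": "raj", "salary": 300},
]

# Horizontal partitioning: row-wise, each disk gets some of the rows.
disk1, disk2 = table[:2], table[2:]

# Vertical partitioning: column-wise, each disk gets some of the columns
# (the key is usually repeated so the pieces can be rejoined later).
disk_a = [{"id": r["id"], "name": r["name"]} for r in table]
disk_b = [{"id": r["id"], "salary": r["salary"]} for r in table]

print(disk1, disk2)
print(disk_a, disk_b)
```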

Round Robin Partitioning - This strategy scans the relation in any order and sends the i-th tuple to disk number i mod n.
What it ensures is an even distribution of tuples across disks;
that is, each disk has approximately the same number of tuples as the others.
Disadvantages of round robin partitioning:
It is only suitable for full table scans.
It is not suitable for point queries or range queries, for example:
SELECT * FROM employee WHERE name = 'sam';
SELECT * FROM employee WHERE id BETWEEN 3 AND 5;
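A minimal sketch of round-robin partitioning (again with lists standing in for disks): the i-th tuple simply goes to disk i mod n, which balances the load but gives no help in locating a particular key.

```python
def round_robin_partition(tuples, n):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)   # i-th tuple goes to disk i mod n
    return disks

disks = round_robin_partition(range(10), 3)
print(disks)
# A point query such as "WHERE id = 4" still has to search every disk,
# which is why round robin suits full table scans but not point or range queries.
```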
HASH PARTITIONING
What we do in hashing is take the input, pass it through a hashing function, and get a value.
Hash partitioning maps data to partitions based on a hashing algorithm applied to the partitioning key that you identify.
The hashing algorithm distributes rows evenly among partitions, giving partitions of approximately the same size.
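A sketch of the idea, assuming a simple modulo hash on the partitioning key: the hash value alone determines the target partition, so a point query on the key needs to touch only one disk.

```python
def hash_partition(tuples, key, n):
    disks = [[] for _ in range(n)]
    for t in tuples:
        h = hash(t[key]) % n      # hashing function applied to the partitioning key
        disks[h].append(t)
    return disks

rows = [{"id": i, "name": f"emp{i}"} for i in range(9)]
disks = hash_partition(rows, "id", 3)
# A point query "WHERE id = 5" only needs the disk numbered hash(5) % 3.
print([len(d) for d in disks])    # roughly equal partition sizes
```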

RANGE PARTITIONING
The range partitioning strategy partitions the data based on the values of the partitioning attribute.
We need to choose a partitioning vector, i.e., the set of boundary values on which to partition.
Range partitioning can lead to data skew;
that is, some partitions get more tuples and some partitions get fewer tuples.
Two types of skew:
Attribute-value skew.
Partition skew.
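A sketch of range partitioning with a made-up partitioning vector: each tuple is routed according to where its key falls among the boundary values, which helps range queries but can skew if the boundaries are chosen badly.

```python
import bisect

def range_partition(tuples, key, vector):
    # vector = sorted boundary values; len(vector) + 1 partitions
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        disks[bisect.bisect_left(vector, t[key])].append(t)
    return disks

rows = [{"id": i} for i in range(1, 11)]
disks = range_partition(rows, "id", [3, 7])   # partitions: id <= 3, 3 < id <= 7, id > 7
print([len(d) for d in disks])
# If most ids fell below 3, the first partition would get far more tuples
# than the others -- that is the data skew problem mentioned above.
```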
PARALLELIZING SEQUENTIAL OPERATOR EVALUATION CODE
An elegant software architecture for parallel DBMSs enables us to readily parallelize existing code for sequentially evaluating a relational operator.
The basic idea is to use parallel data streams.
Streams are merged as needed to provide the inputs for a relational operator,
and the output of an operator is split as needed to parallelize subsequent processing.
A parallel evaluation plan therefore consists of a dataflow network of relational, merge, and split operators.
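A very rough sketch of the stream idea (plain Python functions standing in for the real dataflow operators): a split operator fans one stream out into several, a merge operator combines several streams into one, and the ordinary sequential operator in between never needs to know it is part of a parallel plan.

```python
def split(stream, n):
    """Fan a single stream of tuples out into n smaller streams (round robin)."""
    outputs = [[] for _ in range(n)]
    for i, t in enumerate(stream):
        outputs[i % n].append(t)
    return outputs

def merge(streams):
    """Combine several input streams into one stream for the next operator."""
    for s in streams:
        yield from s

def select_even(stream):
    """An unmodified sequential operator: it just reads one input stream."""
    return [t for t in stream if t % 2 == 0]

parts = split(range(10), 3)                 # parallelize the input
partial = [select_even(p) for p in parts]   # run the unchanged operator on each part
print(list(merge(partial)))                 # merge the outputs for the next operator
```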

PARALLELIZING INDIVIDUAL OPERATIONS


There are three techniques here:
BULK LOADING & SCANNING
**Scanning a relation**: Pages can be read in parallel while scanning a relation, and the retrieved tuples can then be merged, provided the relation is partitioned across several disks.
**Bulk loading**: If a relation has associated indexes, any sorting of data entries required for building the indexes during bulk loading can also be done in parallel.
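A small sketch of a parallel scan, assuming the relation is already partitioned across "disks" (here just lists) and using a thread pool to read the partitions concurrently before merging the retrieved tuples:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy partitions of one relation spread over three disks.
disks = [[1, 2, 3], [4, 5], [6, 7, 8]]

def scan(partition):
    # Stand-in for reading the pages of one partition from its own disk.
    return [t for t in partition if t > 2]   # e.g. a selection pushed into the scan

with ThreadPoolExecutor(max_workers=len(disks)) as pool:
    pieces = list(pool.map(scan, disks))     # pages read in parallel, one worker per disk

result = [t for piece in pieces for t in piece]   # merge the retrieved tuples
print(result)
```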

PARALLEL SORTING
First, redistribute all tuples in the relation using range partitioning.
Each processor then sorts the tuples assigned to it.
The entire sorted relation can be retrieved by visiting the processors in an order corresponding to the ranges assigned to them.
The problem of data skew, which we have already seen (some partitions get more tuples and some get fewer), can be solved by sampling the data at the outset to determine good range partitioning points.
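A sketch of the whole pipeline under the assumptions above (range partition, independent local sorts, then visit the partitions in range order; the processors are simulated with a thread pool):

```python
from concurrent.futures import ThreadPoolExecutor
import bisect

def parallel_sort(values, vector):
    # 1. Redistribute tuples using range partitioning on the chosen vector.
    parts = [[] for _ in range(len(vector) + 1)]
    for v in values:
        parts[bisect.bisect_left(vector, v)].append(v)
    # 2. Each "processor" sorts its own partition independently.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    # 3. Visiting the partitions in range order yields the fully sorted relation.
    return [v for part in sorted_parts for v in part]

print(parallel_sort([9, 1, 7, 3, 8, 2, 6], vector=[3, 6]))   # [1, 2, 3, 6, 7, 8, 9]
```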

PARALLEL JOIN
The basic idea for joining A and B in parallel is to decompose the join into a collection of k smaller joins by using partitioning.

By using the same partitioning function for both A and B, we ensure that the union
of the k smaller joins computes the join of A and B.
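A sketch of such a partitioned join on made-up relations A and B: both are split with the same hash function on the join key, each of the k smaller joins is computed independently, and the union of their results is the full join.

```python
def partition(rel, key, k):
    parts = [[] for _ in range(k)]
    for t in rel:
        parts[hash(t[key]) % k].append(t)   # same partitioning function for A and B
    return parts

def local_join(a_part, b_part, key):
    # One of the k smaller joins, evaluated independently of the others.
    return [{**a, **b} for a in a_part for b in b_part if a[key] == b[key]]

A = [{"id": i, "a": i * 10} for i in range(6)]
B = [{"id": i, "b": i * 100} for i in range(3, 9)]
k = 3
A_parts, B_parts = partition(A, "id", k), partition(B, "id", k)

# The union of the k smaller joins equals the join of A and B.
result = [row for i in range(k) for row in local_join(A_parts[i], B_parts[i], "id")]
print(result)   # matches on id = 3, 4, 5
```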

And that's it for now. THANK YOU!
