Professional Documents
Culture Documents
-----------------------------------------------------------------------------------
---------------------------------------------------------------
INTRODUCTION
A parallel database system seeks to improve performance through parallelization of
various operations, such as loading data, building indexes, and evaluating queries.
Parallel systems improve processing and I/O speeds by using multiple processors and
disks in parallel.
NEED
we know that as there is large of volume data ,many manipulations,storing retriving
etc, so we have to increase the speed of data processing.
This is where parallelism helps.
Parallel machines are becoming quite common nowadays and affordable even,the prices
of microprocessors, memory and disks have dropped sharply.
BENEFITS
It improves response time. When a number of processes work in parallel they are
quite faster.
It improves thoughput.It obviously improves the output rate.
ARCHITECTURE
In Parallel database systems there are 3 major architectures
shared-memory
shared-disk
shared-nothing.
First we'll take a look at --
SHARED MEMORY
Also known as Symmetric multiprocessing (SMP).
There are multiple CPUs that are attached to an interconnection network. you have
all the processors sharing a common memory area using a interconnected network
Now what are the ADVANTAGES of shared memory system
It has high-speed data access for a limited number of processors.
The communication is efficient.
It also have certain disadvantages
It cannot use beyond 80 or 100 CPUs in parallel.
The bus or the interconnection network gets block due to the increment of the large
number of CPUs.
SHARED DISK ARCHITECTURE
Various CPUs are attached to an interconnection network.
Each CPU has its own memory and all of them have access to the same disk. Any
processor can access any disk.
The memory is not shared among CPUs therefore each node has its own copy of the
operating system and DBMS.
Advantages of shared disk -->
..
**Problems with Shared Memory and Shared Disk Architecture**
As more Processors are added the existir processors slow down because of increased
memory contention and network bandwith
A system with 1000 Processors is only 4% percent as effective as a
single processor system
This is why they went for-
SHARED NOTHING ARCHITECTURE
Each processor has a local main memory ar disk space.
here ,No two processors can access the same storage area.
All communication between processors happen over a communication network.
There is a homogeneity of nodes in a shared nothing architecture
Advantages of Shred nothing-->
..
TYPES OF PARALLELISM'
first we'll talk about I/O PARALLELISM
It is like splitting , say you some data so to achieve parallelism I'll split this
data into smaller chunks, for each processor I give this chunk on seprate disks.
Suppose this is data from two or three different branches and We've split them into
seprate chunks so the three branch managers can access the three disks seprately.
Then next comes INTERQUERY PARALLELISM
say I have a dbms software that actually runs on different applications and so it
creates different queries from all sources
So what happens is Query is handlled independently parallely in different
processors
Now comes INTRAQUERY
so in this case there is only one query that I want to speed up this query by doing
something in parallel
**Data Partitioning**
So as we talked I/O Parallelism is reducing the time requireed to retrive
relations from disk by partitioning the relations on multiple disks so that we can
work on this data parallely.
Round Robin Partitioning- This strategy scans the relation in any order and sends
the ith tuple to disk number mod n
What it ensures is an even distribution of tuples across disks;
that is, each disk has approximately the same number of tuples as the others.
Disadvantages of round robin partioning
It is Only suitable for full table scans
Not suitable for point queries or range queries.
Select from employee where name='sam';
Select from employee where id between 3 and 5;
HASH PARTITIONING
So what we do in hashing is we give the input, pass it through hashing function and
get a value.
Hash partitioning maps data to partition based on a hashing algorithm that applies
to the partitioning key that you identify.
• The hashing algorithm evenly distributes rows among partitions, giving partitions
approximately the same size.
RANGE PARTIONING
Range partitioning strategy partitions the data based on the partitioning
attributes values.
We need to find set of range vectors on which we are about to partition.
range partitioning can lead to data skew;
that is,Some partition gets more tuples and some partition gets lesser tuples
Two Types of Skew
Attribute-Skew.
Partition Skew
PARALLELIZING SEQUENTIAL OPERATOR EVALUTION CODE
An elegant software architecture for parallel DBMSs that enables us to readily
parallelize existing code for sequentially evaluating a relational operator.
The basic idea here is to use parallel data streams.
Streams tht are merged as needed to provide the inputs for a relational operator.
The output of an operator is split as needed to parallelize subsequent processing.
And a parallel evaluation plan that consists of a dataflow network of relational,
merge, and split operators.
PARALLEL SORTING
First redistribute all tuples in the relation using range partitioning.
Each processor then sorts the tuples assigned to it
The entire sorted relation can be retrieved by visiting the processors in an order
corresponding to the ranges assigned to them.
The problem of data skew we have already seen where
Some partition gets more tuples and some partition gets lesser tuples which can be
solved by sampling
the data at the outset to determine good range partiong points.
PARALLEL JOIN
The basic idea for joining A and B in parallel is to decompose the join into a
collection of k smaller joins by using partition.
By using the same partitioning function for both A and B, we ensure that the union
of the k smaller joins computes the join of A and B.