
DBT20CS353 - ASSIGNMENT 3 - DATA DISTRIBUTION AND DISTRIBUTED QUERY PROCESSING

NAME : CHAITHRA
SRN: PES1UG20CS543
SECTION: I
(a) Load a million rows of data into a (transaction) table that should have distributed storage across multiple drives of your PC. A program may be used to create the data; if so, submit the program code.

PYTHON CODE TO LOAD A MILLION ROWS OF DATA

bk => Book table
import random
import string
from decimal import Decimal

import mysql.connector

# Establish a connection to the MySQL database
mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    database="pes1ug20cs543_library"
)

# Create a cursor object to execute SQL queries
mycursor = mydb.cursor()

# Define the number of rows to generate
num_rows = 1000000

# Parameterized INSERT statement for the bk (book) table
sql = ("INSERT INTO bk (`ISBN`, `Title`, `Cost`, `IsReserved`, `Edition`, "
       "`PubliPlace`, `Publisher`, `CopyYr`, `ShelfID`, `SubName`) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")

# Generate and insert data into the table
for i in range(num_rows):
    isbn = i + 1
    title = ''.join(random.choices(string.ascii_letters, k=random.randint(1, 100)))
    cost = Decimal(random.uniform(0, 1000)).quantize(Decimal('0.01'))
    is_reserved = random.randint(0, 1)
    edition = random.randint(1, 100)
    publi_place = ''.join(random.choices(string.ascii_letters, k=random.randint(1, 30)))
    publisher = ''.join(random.choices(string.ascii_letters, k=random.randint(1, 30)))
    copy_year = random.randint(1900, 2023)
    shelf_id = random.randint(1, 1000)
    sub_name = ''.join(random.choices(string.ascii_letters, k=random.randint(1, 30)))

    # Insert the row into the table
    val = (isbn, title, cost, is_reserved, edition, publi_place, publisher,
           copy_year, shelf_id, sub_name)
    mycursor.execute(sql, val)

# Commit the changes to the database
mydb.commit()

# Close the database connection
mydb.close()
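Once the load completes, a quick sanity check (a minimal query, assuming the bk table above) confirms the row count:

SELECT COUNT(*) FROM bk;
-- expected result: 1000000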

SQL QUERIES TO PARTITION AND INSERT DATA INTO THE TABLE


CREATE TABLE partition_table(
ISBN int(11) NOT NULL,
Title varchar(100) NOT NULL,
Cost decimal(5,2) NOT NULL,
IsReserved tinyint(1) NOT NULL,
Edition int(11) NOT NULL,
PubliPlace varchar(30) DEFAULT NULL,
Publisher varchar(30) NOT NULL,
CopyYr decimal(4,0) NOT NULL,
ShelfID int(11) DEFAULT NULL,
SubName varchar(30) DEFAULT NULL,
PRIMARY KEY (ISBN)
);

Partition based on range (cost column)


Note: MySQL's RANGE partitioning expression must evaluate to an integer, so the DECIMAL Cost column is wrapped in FLOOR() here. The partitioning column must also be part of every unique key, which is why the primary key is redefined to include Cost. To place partitions on different drives, each PARTITION clause can additionally specify a DATA DIRECTORY option (supported by InnoDB for partitioned tables).

ALTER TABLE partition_table
DROP PRIMARY KEY,
ADD PRIMARY KEY (ISBN, Cost)
PARTITION BY RANGE (FLOOR(Cost)) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (500),
    PARTITION p2 VALUES LESS THAN (900),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

INSERT INTO partition_table
(ISBN, Title, Cost, IsReserved, Edition, PubliPlace, Publisher, CopyYr, ShelfID, SubName)
SELECT ISBN, Title, Cost, IsReserved, Edition, PubliPlace, Publisher, CopyYr, ShelfID, SubName
FROM bk;
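To confirm how the copied rows are distributed across the partitions, the information schema can be queried; a sketch, assuming the database name used in the loader above (note that TABLE_ROWS is an estimate for InnoDB tables):

SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = 'pes1ug20cs543_library'
  AND TABLE_NAME = 'partition_table';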
(b) Measure the performance of this humongous table using multiple (min 6) varieties of queries with different table combinations; record and compare the results.

(c) The Explain/Analyze plan outputs must be part of the submission in addition to the query results.

1) Cost of the book between 101 and 499

select * from bk where cost between 101 and 499;

Original table – rows accessed: 899770

Partitioned table – rows accessed: 398138, all from partition p1, which covers the cost range 100 to 500.

Analyzing the above query:

Original table scan → 1027.1 ms
Partitioned table scan → 430.53 ms
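The row counts and partition names above can be read from the EXPLAIN output, and the timings from EXPLAIN ANALYZE (available in MySQL 8.0.18+); a sketch of the statements, assuming the tables above:

EXPLAIN SELECT * FROM bk WHERE Cost BETWEEN 101 AND 499;
EXPLAIN SELECT * FROM partition_table WHERE Cost BETWEEN 101 AND 499;
-- the partitions column of the second plan should list only p1

EXPLAIN ANALYZE SELECT * FROM partition_table WHERE Cost BETWEEN 101 AND 499;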
2) Select the books with cost > 400

select isbn, cost, title from bk where cost > 400;

The bk (book) table scan reads more rows than the partitioned-table scan.

Original table → 518.6 ms
Partitioned table → 505.68 ms

Since only 3 of the 4 partitions were used (p1, p2, p3; p0 was pruned), the scan of the partitioned table took slightly less time than the scan of the original table.

3) Join query

SELECT b.ISBN, b.Title, bc.CopyID
FROM bk b
JOIN issue bc ON b.ISBN = bc.ISBN
WHERE b.Cost < 500;
Since this query joins only 9 rows with cost < 500, we compare the time required to perform that join: the partitioned table takes less time (0.1581) than the original table (0.2345).

Only two partitions were used (p0 and p1).
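The partition usage can be verified with EXPLAIN on the partitioned variant of the join (a sketch; the issue table's columns are assumed to match the query above):

EXPLAIN SELECT b.ISBN, b.Title, bc.CopyID
FROM partition_table b
JOIN issue bc ON b.ISBN = bc.ISBN
WHERE b.Cost < 500;
-- the partitions column should list only p0 and p1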

4) Aggregate

Original table rows scanned → 899770
Partitioned table rows scanned → 493166

Original table scan → 742.43 ms
Partitioned table scan → 484.03 ms

Only partitions p2 and p3 were scanned, so the partitioned table took less time than the original table.
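The aggregate query text itself is not reproduced here; a form consistent with the reported scan of only p2 and p3 (the partitions holding costs of 500 and above) would be, for example:

SELECT COUNT(*), AVG(Cost)
FROM partition_table
WHERE Cost >= 500;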
5) ORDER BY AND GROUP BY

Original table scan → 5330.9 ms
Partitioned table scan → 4728.4 ms
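The query used is not shown here; a representative ORDER BY plus GROUP BY query over this schema (columns as defined in the CREATE TABLE above) might be:

SELECT Publisher, COUNT(*) AS num_books, AVG(Cost) AS avg_cost
FROM partition_table
GROUP BY Publisher
ORDER BY num_books DESC;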

6) Index scan

[Screenshots: EXPLAIN/ANALYZE outputs for the original table without and with an index, and for the partitioned table without and with an index]

Original table without index scan → 953.36 ms
Original table with index scan → 11.33 ms
Partitioned table without index scan → 476.43 ms
Partitioned table with index scan → 0.139 ms
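The index definition is not shown here; a sketch consistent with these measurements would be a secondary index on the filtered column of each table (index names hypothetical), after which the timed query is re-run:

CREATE INDEX idx_bk_cost ON bk (Cost);
CREATE INDEX idx_part_cost ON partition_table (Cost);

-- e.g. a cost-filtered lookup that can use the index
SELECT ISBN, Cost, Title FROM partition_table WHERE Cost > 990;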

The partitioned table with an index scan took the least time of all the cases. Hence we can conclude that partitioning a table reduces the time required to scan it, since only the required partitions are scanned.
