
6CS030 Big Data

Coursework – Part 1
Worksheet One – 5%
Hand-out: Week 2. Demo: Week 3 Workshop

Student Id: [1928446]

Student Name: [Preeti Lama]

Group: [C3G1]

Remainder value: 1

Degree Level Qualification


a) Identify any cleaning issues in the worksheet and take steps to rectify the issues.
Document what changes have been made.

b) Convert the worksheet so that it is in a suitable format for a relational table and
import into Oracle.
Choose appropriate names for the table and column names.
Document what changes you have made.

Documentation on cleaning the Excel file

Oracle expects the first row to contain the column headings. Because this file
contained some unnecessary lines, those lines were removed.

The first line now looks like this:

After removing the unnecessary data and the column totals, the bottom line now looks
like this:

Now that the data is in an appropriate format, a pivot table is created using the
PivotTable Wizard.
In the new worksheet that appears, the row and column buttons are unselected, which
reduces the data to just one value:

The data reappears when that value is clicked:

The column and table names are changed before importing into Oracle:

Row: Local_Authority

Column: Q_Year

Value.

Because this data was placed in a new worksheet, that worksheet is renamed
Qualification Sheet.
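
The pivot step above reshapes the worksheet from one column per year into one record per authority and year. A minimal Python sketch of that reshaping is given here for reference. It assumes a wide layout with a `Local_Authority` column followed by year columns, and the authority names and figures are made up for illustration, not taken from the worksheet.

```python
# Hypothetical wide-format rows: one row per local authority, one column per year.
# The names and numbers below are illustrative only, not the coursework data.
wide_rows = [
    {"Local_Authority": "Authority A", "2010": 25.1, "2011": 26.4},
    {"Local_Authority": "Authority B", "2010": 30.0, "2011": None},
]


def unpivot(rows, key="Local_Authority"):
    """Turn every (authority, year) cell into its own relational record."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == key:
                continue  # the key column identifies the row; it is not a year
            long_rows.append(
                {"Local_Authority": row[key], "Q_Year": int(column), "Value": value}
            )
    return long_rows


long_rows = unpivot(wide_rows)
for record in long_rows:
    print(record)
```

Each output record has the three columns used for the import (Local_Authority, Q_Year, Value), which is the shape a relational table expects.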

Missing Values

The data in the spreadsheet has some issues in terms of missing values. Before editing
the information, a copy of the worksheet was created.
Data with only some values missing was repaired using AVERAGE and MEDIAN, whereas
data that contained no values at all was removed completely.
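
The fill rule described above can be sketched in Python as follows. This is only an illustration of the idea, not the Excel steps actually used: the function name `fill_missing` and the sample figures are assumptions, and `None` stands in for an empty cell.

```python
from statistics import mean, median


def fill_missing(values, method="mean"):
    """Replace None entries with the mean or median of the known values.

    Returns None when there are no known values at all, signalling that
    the series should be dropped rather than filled.
    """
    known = [v for v in values if v is not None]
    if not known:
        return None
    fill = mean(known) if method == "mean" else median(known)
    return [fill if v is None else v for v in values]


series = [20.0, None, 22.0, 30.0]            # hypothetical values, one gap
filled_mean = fill_missing(series, "mean")
filled_median = fill_missing(series, "median")
dropped = fill_missing([None, None])          # nothing to average: drop it

print(filled_mean)
print(filled_median)
print(dropped)
```

The mean is pulled towards outliers while the median is not, so the two methods can give noticeably different fill values for the same series.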

After the process was completed, the sheet was checked for any remaining null
values:
To visualise the data, a graph was created:

The chart above shows that the data fluctuates and that data is missing for the years
2010 and 2011, so the solution is to replace the hash values with the average.
Once all the data had been cleaned, it was imported into Oracle.

The data after it was successfully imported:


c) Write two SQL commands that demonstrate an OLAP query. Show the results

a) SELECT LOCAL_AUTHORITY, Q_YEAR, ROUND(AVG(VALUE), 2) AS AVG
   FROM QUALIFICATION
   GROUP BY CUBE(LOCAL_AUTHORITY, Q_YEAR)
   ORDER BY LOCAL_AUTHORITY ASC;

b) SELECT LOCAL_AUTHORITY, Q_YEAR, ROUND(AVG(VALUE), 2) AS AVG
   FROM QUALIFICATION
   GROUP BY ROLLUP(LOCAL_AUTHORITY, Q_YEAR)
   ORDER BY Q_YEAR ASC;

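
To show how the two queries differ, the following Python sketch emulates the grouping sets that ROLLUP and CUBE generate over two columns. It is an illustration under stated assumptions, not Oracle's implementation: the three sample rows are made up, and `None` plays the role of the NULL that Oracle places in a rolled-up column.

```python
from collections import defaultdict

# Hypothetical (Local_Authority, Q_Year, Value) rows; not the coursework data.
rows = [
    ("Authority A", 2010, 20.0),
    ("Authority A", 2011, 30.0),
    ("Authority B", 2010, 40.0),
]

# Column positions kept in each grouping set (0 = authority, 1 = year).
ROLLUP_SETS = [(0, 1), (0,), ()]         # detail, per authority, grand total
CUBE_SETS = [(0, 1), (0,), (1,), ()]     # ROLLUP sets plus per-year subtotals


def aggregate(rows, grouping_sets):
    """Average Value for every grouping set; None marks a rolled-up column."""
    result = {}
    for keep in grouping_sets:
        buckets = defaultdict(list)
        for authority, year, value in rows:
            key = (authority if 0 in keep else None, year if 1 in keep else None)
            buckets[key].append(value)
        for key, values in buckets.items():
            result[key] = round(sum(values) / len(values), 2)
    return result


rollup = aggregate(rows, ROLLUP_SETS)
cube = aggregate(rows, CUBE_SETS)

print(len(rollup), "rows from the ROLLUP-style grouping")
print(len(cube), "rows from the CUBE-style grouping")
```

The CUBE-style result contains everything in the ROLLUP-style result plus the per-year averages across all authorities, which is why query a) returns more rows than query b).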
d) Name one advantage to using this approach. For future reference include a brief
explanation of why you think this is an advantage.

OLAP systems help greatly in organizing huge amounts of information into a
convenient form for the end user. They suit all users, from small and medium to
large corporate groups.

High Speed of data processing:


The major advantage of OLAP is the speed of query execution. A correctly
designed cube can typically process a query within about five seconds. The data
is available whenever it is required, which helps decisions to be taken
immediately. No time has to be spent on calculations or on composing complex,
heavyweight reports. All the data is stored in one place, ready to be worked
with, so even a complex report can be produced from the warehouse in a few
minutes.

Advantage of cleaning data:


Since data is of major importance in every enterprise and every small decision
relies on it, data cleaning helps to avoid costly errors and can increase the
efficiency of customer acquisition. The major advantage of cleaning the data is
that it makes decision making much easier and more efficient.

e) Name one disadvantage to using this approach. For future reference include a
brief explanation of why you think this is a disadvantage

Despite the advantages of this tool, online analytical processing, like every
other technology, has disadvantages. The major disadvantage of this system is
its limitations.
Potential Risk
Because of its limited computation and interactive analysis ability, an OLAP
tool carries considerable potential risk. Implementation also relies heavily on
IT. Poor computational ability can result in a failure to deliver large amounts
of data and may sometimes make decision making difficult. Being unable to
provide a reference or solve a complex problem may result in great loss, and
this risk may lead to the failure of the OLAP project. It can therefore be
difficult to provide valuable links to the decision maker, although this depends
on the system type and the OLAP software.

The process of data cleaning helps to avoid dangerous problems, but data
cleaning itself can sometimes be more dangerous. There is no obvious pattern to
the missing data, so changing it is not a simple matter. Statistical analysis of
this data is also difficult, because numerical data may be loaded as VARCHAR
when it is imported into Oracle.
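
The text-versus-number problem mentioned above can be illustrated with a short Python sketch. It assumes the values arrive as strings, as they would from a VARCHAR column, and that placeholders such as a hash or an empty cell should become missing values. The helper name `to_number` and the sample strings are hypothetical.

```python
def to_number(text):
    """Return a float for numeric text, or None when the text is not numeric."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")  # drop padding and thousands separators
    try:
        return float(cleaned)
    except ValueError:
        return None  # placeholders such as "#" or "" are treated as missing


raw = ["25.4", " 1,200 ", "#", ""]   # hypothetical values read as text
parsed = [to_number(v) for v in raw]
print(parsed)
```

Converting the values before analysis, or cleaning placeholders out of the worksheet before import so the column can be numeric, avoids having to perform averages on text.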
