Data Analysis (Reviewer)

PRELIM
Data Analysis
-Is the process of inspecting, cleansing, transforming and modelling data with the
goal of discovering useful information.
“METHODS”
1. Data Mining - Discovers patterns in large data sets using methods from statistics,
artificial intelligence (AI), machine learning, and databases.
2. Text Analytics - The process of deriving useful information from text.
3. Business Intelligence(BI) - Transforms data into actionable intelligence for business
purposes.
4. Data Visualization - The graphical representation of data using charts, graphs,
maps, etc.
AI - An “intelligent” computer uses AI to think like a human and perform tasks on its own.
Machine Learning - How a computer system develops its intelligence.
Data Mining Tools

-Data mining tools allow organizations to uncover patterns and relationships in data that can be used
to make predictions and data-driven business decisions.
1. RapidMiner - Written in the Java programming language.

2. Orange - An Open Source machine learning and data visualization tool.
3. Weka - Named after an inquisitive flightless bird on the isles of New Zealand.
4. KNIME - A powerful tool with a GUI that shows a network of data nodes.
-Popular among financial data analysts.
5. R Programming - R is written in C and Fortran.
Data Science
-is the study and extraction of useful information from raw data.
Data science uses:
*Scientific algorithms
*Processes
*Systems
*Modern tools
*and Techniques
Data Scientist
-They collect data, analyze it, and share their insights with technology leaders and
businesses to help organizations solve issues.
ADT(Abstract Data Types)
NOTABLE ADTs:
-List
-Stack
-Queue
ADT features
-Abstraction
-Better conceptualization
-Robust
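As a rough illustration of two of the ADTs listed above, the sketch below shows a stack and a queue in Python. The choice of a plain list for the stack and `collections.deque` for the queue is an assumption made for this example, not something the reviewer prescribes.

```python
from collections import deque

# Stack: last in, first out (LIFO). A Python list can act as a stack.
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
top = stack.pop()        # removes the most recently added item

# Queue: first in, first out (FIFO). deque removes from the front efficiently.
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
front = queue.popleft()  # removes the earliest added item

print(top, front)
```

The caller only uses the operations (push, pop, enqueue, dequeue) and does not need to know how the items are stored, which is the abstraction that the ADT features refer to.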
MIDTERM
Sets
- German mathematician Georg Cantor introduced the concept of sets.
- A set is an unordered collection of distinct elements.
Cardinality
-the number of elements in a set
Example:
|{1,4,3,5}|=4
The cardinality is “4”
Types of Sets
1. Finite Set - A set that contains a definite number of elements.
2. Infinite Set - A set that contains an infinite number of elements.
3. Subset - A set whose elements all belong to a given set (another set or the same set).
For example, A = {1, 2, 3} is a subset of B = {1, 2, 3, 4, 10}.
4. Proper Subset - Any subset of a set except the set itself. For example, if A = {1, 2, 3}, then its
proper subsets are {}, {1}, {2}, {3}, {1, 2}, {2, 3}, and {1, 3}, but the set itself {1, 2, 3} is NOT a
proper subset of A.
5. Universal Set - The set that contains every element under consideration. For example, for
A = {x,y,z} and B = {1,2,3,x,y}, the universal set associated with these two sets is U = {1,2,3,x,y,z}.
6. Equal Sets - Two sets that contain exactly the same elements.
7. Equivalent Sets - Two sets that have the same cardinality.
8. Overlapping Sets - Two sets that have at least one element in common.
9. Disjoint Sets - Two sets A and B that do not have even one element in common.
10. Venn Diagram - Invented in 1880 by John Venn, a schematic diagram that shows all possible
logical relations between different mathematical sets.
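The set types above can be checked with Python's built-in `set` type. This is a minimal sketch; the variable names are chosen only for illustration.

```python
A = {1, 2, 3}
B = {1, 2, 3, 4, 10}

cardinality = len({1, 4, 3, 5})          # number of elements in the set
is_subset = A <= B                       # every element of A is also in B
is_proper_subset_of_itself = A < A       # a set is never a proper subset of itself
are_equal = {1, 2, 3} == {3, 2, 1}       # order does not matter in a set
are_equivalent = len({1, 2, 3}) == len({"x", "y", "z"})  # same cardinality
are_overlapping = bool({1, 2} & {2, 3})  # at least one common element
are_disjoint = {1, 2}.isdisjoint({3, 4}) # no common element

print(cardinality, is_subset, is_proper_subset_of_itself)
print(are_equal, are_equivalent, are_overlapping, are_disjoint)
```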
Set Operations
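1. Set Union - If A = {10,11,12,13} and B = {13,14,15}, then A∪B = {10,11,12,13,14,15}.
2. Set Intersection - If A = {1,2,3}, B = {2,5,7}, and C = {3,5,7}, then A∩B = {2}, A∩C = {3}, and
B∩C = {5,7}. No element belongs to all three sets, so A∩B∩C = ∅.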
3. Set Difference/Relative Complement - If A = {10,11,12,13} and B = {13,14,15}, then
(A−B) = {10,11,12} and (B−A) = {14,15}.
4. Complement of a Set - If the universal set is all prime numbers up to 25 and set A = {2, 3, 5}, then
the complement of A is every element of the universal set that is not in A.
Step 1: Check for the universal set and the set for which you need to find the complement. U = {2, 3, 5,
7, 11, 13, 17, 19, 23}, A = {2, 3, 5}.
Step 2: Subtract, that is (U - A). Here,
U - A = A'
= {2, 3, 5, 7, 11, 13, 17, 19, 23} - {2, 3, 5}
= {7, 11, 13, 17, 19, 23}
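The four operations worked through so far (union, intersection, difference, and complement) map directly onto Python's set operators. The sketch below reuses the numbers from the union, difference, and complement examples; the name `P` for the set {2, 3, 5} is only an illustrative choice.

```python
A = {10, 11, 12, 13}
B = {13, 14, 15}

union = A | B            # elements in A or B
intersection = A & B     # elements in both A and B
a_minus_b = A - B        # elements of A that are not in B
b_minus_a = B - A        # elements of B that are not in A

# Complement: subtract the set from the universal set.
U = {2, 3, 5, 7, 11, 13, 17, 19, 23}   # primes up to 25
P = {2, 3, 5}
complement = U - P

print(sorted(union), sorted(intersection))
print(sorted(a_minus_b), sorted(b_minus_a), sorted(complement))
```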
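5. Cartesian Product/Cross Product - The set of all ordered pairs whose first element comes from
the first set and whose second element comes from the second set. For two non-empty sets
C = {x, y, z} and D = {1, 2, 3},
C × D = {(x,1), (x,2), (x,3), (y,1), (y,2), (y,3), (z,1), (z,2), (z,3)}.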
6. Power Set - The set of all subsets of a set. For a set S={a,b,c,d} let us calculate the subsets −
Subsets with 0 elements − ∅ (the empty set)
Subsets with 1 element − {a},{b},{c},{d}
Subsets with 2 elements − {a,b},{a,c},{a,d},{b,c},{b,d},{c,d}
Subsets with 3 elements − {a,b,c},{a,b,d},{a,c,d},{b,c,d}
Subsets with 4 elements − {a,b,c,d}
Hence, P(S)= {∅,{a},{b},{c},{d},{a,b},{a,c},{a,d},{b,c},{b,d},{c,d},{a,b,c},{a,b,d},{a,c,d},{b,c,d},
{a,b,c,d}}
The power set has 2^4 = 16 elements.
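A Cartesian product and a power set can both be generated with the standard `itertools` module. This is a sketch under the assumption that lists of pairs and tuples are an acceptable stand-in for sets of ordered pairs and sets of subsets.

```python
from itertools import chain, combinations, product

# Cartesian product: every ordered pair (c, d) with c from C and d from D.
C = ["x", "y", "z"]
D = [1, 2, 3]
cartesian = list(product(C, D))

# Power set: all subsets of S, grouped by size 0, 1, ..., len(S).
S = ["a", "b", "c", "d"]
power_set = list(
    chain.from_iterable(combinations(S, size) for size in range(len(S) + 1))
)

print(len(cartesian), cartesian[:3])
print(len(power_set), power_set[:5])
```

The empty tuple `()` in `power_set` plays the role of the empty set, and the count of subsets is 2 raised to the number of elements.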
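7. Partitioning of a Set - A partition splits a set into non-empty, non-overlapping subsets that
together cover the whole set. One possible partition of {1, 2, 3, 4, 5, 6} is {1, 3}, {2}, {4, 5, 6}.
8. Bell Numbers - The Bell number Bn counts the ways to partition a set of n elements.
Let S={1,2,3}, n=|S|=3
The possible partitions are −
1. {1,2,3}
2. {1},{2,3}
3. {1,2},{3}
4. {1,3},{2}
5. {1},{2},{3}
Hence B3=5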
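Counting partitions by hand gets tedious for larger sets. One common way to compute Bell numbers is the Bell triangle; the sketch below uses that technique, and the function name `bell_number` is a hypothetical one chosen for this example.

```python
def bell_number(n):
    """Return the n-th Bell number using the Bell triangle."""
    row = [1]  # first row of the triangle; its first entry is B0
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        # Each further entry is the entry to its left plus the entry above-left.
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print([bell_number(n) for n in range(6)])
```

For n = 3 this gives 5, matching the five partitions of {1,2,3} listed above.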
9. Relations - Suppose there are two sets, X = {4, 36, 49, 50} and Y = {1, -2, -6, -7, 7, 6, 2}.
A relation that states that "(x, y) is in the relation R if x is a square of y" can be represented using
ordered pairs as R = {(4, -2), (4, 2), (36, -6), (36, 6), (49, -7), (49, 7)}.
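The relation in this example can be built with a set comprehension that keeps only the pairs satisfying the rule "x is a square of y". This is a minimal sketch using the same X and Y.

```python
X = {4, 36, 49, 50}
Y = {1, -2, -6, -7, 7, 6, 2}

# Keep the ordered pair (x, y) only when x equals y squared.
R = {(x, y) for x in X for y in Y if x == y * y}

print(sorted(R))
```

Note that 50 from X and 1 from Y appear in no pair, because 50 is not the square of any element of Y and 1 squared is not in X.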
Types Of Relations
-Empty
-Universal
-Identity
-Inverse
-Reflexive
-Symmetric
-Transitive
-Equivalence
-One to One
-One to Many
-Many to One
-Many to Many
PRE FINALS
Algorithm
-The word "algorithm" is derived from the name of the Persian mathematician Muhammad ibn Mūsā al-Khwārizmī.
-We first demonstrate the algorithm using pseudocode, which explains the algorithm in an English-like
syntax.
-The same algorithm is shown in a programming language.
The first Algorithm

-Because a cooking recipe could be considered an algorithm, the first algorithm could go back as far as
written language.
-Euclid's algorithm, which finds the greatest common divisor of two numbers, was first described around 300 B.C.
-Ada Lovelace is credited as being the first computer programmer and the first person to develop an
algorithm for a machine (the Analytical Engine).
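To make the idea of an algorithm expressed in a programming language concrete, here is a sketch of Euclid's greatest-common-divisor algorithm, a well-known procedure dating to about 300 B.C. It is offered only as an illustration; the function name `gcd` is chosen for this example.

```python
def gcd(a, b):
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    while b != 0:
        # Replace the pair (a, b) with (b, remainder of a divided by b).
        a, b = b, a % b
    return a

print(gcd(48, 18))
```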
TYPES OF ANALYSIS
Best case: The input for which the algorithm takes the least time.
Worst case: The input for which the algorithm takes the most time.
Average case: Take all random inputs, calculate the computation time for each, and average the
results.
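The three cases can be seen by counting comparisons in a linear search. This is a sketch under stated assumptions: the list `data` is made up for the example, comparisons stand in for time, and the average is taken only over successful searches.

```python
def linear_search(items, target):
    """Return (index, comparisons); index is -1 when target is absent."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1
        if item == target:
            return index, comparisons
    return -1, comparisons

data = [7, 3, 9, 1, 5]

best = linear_search(data, 7)    # target is the first item
worst = linear_search(data, 4)   # target is absent, so every item is checked
# Average over searching for each item that is present in the list.
average = sum(linear_search(data, t)[1] for t in data) / len(data)

print(best, worst, average)
```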
Cost Models
1. Uniform cost model - Assigns a constant cost to every machine operation, regardless of the size of
the numbers involved.
2. Logarithmic cost model - Assigns a cost to every machine operation proportional to the number of
bits involved.
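The difference between the two cost models can be sketched for a single addition. Both functions below are hypothetical helpers written for this example, and charging the bit length of the larger operand is an assumption about how "proportional to the number of bits" is measured.

```python
def uniform_cost(a, b):
    """Uniform model: one addition costs 1, whatever the operand sizes."""
    return 1

def logarithmic_cost(a, b):
    """Logarithmic model: cost grows with the number of bits in the operands."""
    return max(a.bit_length(), b.bit_length())

small = (5, 3)
large = (2 ** 100, 1)

print(uniform_cost(*small), uniform_cost(*large))
print(logarithmic_cost(*small), logarithmic_cost(*large))
```

Under the uniform model both additions cost the same, while the logarithmic model charges far more for the 101-bit operand.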
Run-Time Analysis
-is a theoretical classification that estimates and anticipates the increase in running time (or run-time or
execution time) of an algorithm as its input size increases.
Data Science
- is the study of data to extract meaningful insights for business.
What is data science used for?

1.Descriptive analysis examines data to gain insights into what happened or what is
happening in the data environment.
2.Diagnostic analysis is a deep-dive or detailed data examination to understand why
something happened.
3.Predictive analysis uses historical data to make accurate forecasts about data
patterns that may occur in the future.
4.Prescriptive analytics takes predictive data to the next level. It not only predicts
what is likely to happen but also suggests an optimum response to that outcome.
DS Process
OSEMN
-OBTAIN DATA
-SCRUB DATA
-EXPLORE DATA
-MODEL DATA
-INTERPRET DATA
DS Techniques
1.Classification is the sorting of data into specific groups or categories.
2.Regression is the method of finding a relationship between two seemingly unrelated data points.
3.Clustering is the method of grouping closely related data together to look for patterns and
anomalies.
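Of the three techniques above, regression is the easiest to sketch without extra libraries. The example below fits a straight line by ordinary least squares; the data points are invented for illustration, and a single-variable linear fit is only one simple form of regression.

```python
# Hypothetical data: x could be hours studied, y could be a score.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares slope: covariance of x and y divided by variance of x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimate y for a new x using the fitted line."""
    return slope * x + intercept

print(slope, intercept, predict(6))
```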
DS Technologies
Artificial intelligence: Machine learning models and related software are used for
predictive and prescriptive analysis.
Cloud computing: Cloud technologies have given data scientists the flexibility and
processing power required for advanced data analytics.
IoT (Internet of Things): Refers to various devices that can automatically connect to
the internet. These devices collect data for data science initiatives. They generate
massive data which can be used for data mining and data extraction.
Quantum computing: Quantum computers can perform complex calculations at high
speed. Skilled data scientists use them for building complex quantitative algorithms.
Statistical Methods
- mainly useful to ensure that your data are interpreted correctly.
Steps in the DA Process
1.Pose a Question
2.What to Measure and How to Measure
3.Data Collection
4.Data Cleaning
5.Summarizing and Visualizing Data
6.Data Modeling
7.Optimize and Repeat
Mean Median and Mode

1. Mode - The given is… 5, 6, 5, 7, 5, 8, 9, 5
The mode is 5 because it is the most frequent item.
2. Median - We order the dataset first: 5, 5, 5, 5, 6, 7, 8, 9
Then we take the middle one in the order.
But since it's an even-numbered dataset, we have two numbers in the middle (5 and 6).
We add them (5+6) and divide the sum by 2.
The median is 5.5
3. Mean - The given is the same…5, 6, 5, 7, 5, 8, 9, 5

The mean would be 6.25.
The formula would be…
Mean = (Sum of Elements) / (Cardinality)
Mean = 50 / 8
Mean = 6.25
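The same three measures can be checked with Python's standard `statistics` module. This is a minimal sketch using the dataset from the worked examples.

```python
import statistics

data = [5, 6, 5, 7, 5, 8, 9, 5]

mode = statistics.mode(data)      # most frequent item
median = statistics.median(data)  # average of the two middle items for an even-sized dataset
mean = statistics.mean(data)      # sum of elements divided by the number of elements

print(mode, median, mean)
```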
Variance & Standard Deviation

The given is… 3, 5, 8, 1
The mean = 4.25
The variance ≈ 6.69
Standard deviation ≈ 2.59
First, the mean is (3 + 5 + 8 + 1) / 4 = 4.25
After getting that, we get the variance (here the population variance, which divides by the number of items).
[(3 - 4.25)^2 + (5 - 4.25)^2 + (8 - 4.25)^2 + (1 - 4.25)^2] / 4 = 26.75 / 4 = 6.6875
Standard deviation is taken by getting the square root of the variance (√6.6875 ≈ 2.59).
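The calculation can be verified with the `statistics` module. The sketch assumes the population formulas (dividing by the number of items, as in the worked example), so it uses `pvariance` and `pstdev` rather than the sample versions `variance` and `stdev`.

```python
import statistics

data = [3, 5, 8, 1]

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # population variance: divide by n
std_dev = statistics.pstdev(data)      # square root of the population variance

print(mean, variance, round(std_dev, 2))
```

If the data were a sample from a larger population, `statistics.variance` and `statistics.stdev` would divide by n - 1 instead and give larger values.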
