KJSCE/IT/TY /SEMVI/EDA/2020-21

Experiment No.1

Title: Data exploration (Attribute identification and categorization)

(A constituent college under Somaiya Vidyavihar University)



Batch: B-1 Roll No.: 1814091 Experiment No.:1

Aim: Data exploration by applying attribute identification and categorization


___________________________________________________________________________

Resources needed: Any Data Repositories


___________________________________________________________________________
Theory:

To make data ready for the analytics process, data exploration is an essential step for
developing a high-level understanding of the data. Data exploration includes detailed analysis
of the attributes, their data values, and their visualization. It aims at identifying possible
relationships between two or more variables/objects. Broadly, there are two types of attributes:
numeric and categorical.

Categorical Attribute:
In a categorical attribute, each value represents some kind of category, code, or state.
Categorical attributes are either nominal or ordinal, depending on how much information the
coding of the values provides. The values of a nominal attribute are symbols or names of
things. Nominal means "relating to names."
E.g. hair color and occupation are two attributes describing person objects. Possible values
for hair color are black, brown, blond, red, auburn, gray, and white. For occupation, possible
values are teacher, dentist, programmer, farmer, etc.

An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known. For
example, a grade attribute with values A+, A, A-, B, C, or a Student_progress attribute with
values good, average, poor. The central tendency of an ordinal attribute can be represented by
its mode and its median (the middle value in an ordered sequence), but the mean cannot be
defined. Nominal, binary, and ordinal attributes are qualitative. That is, they describe a
feature of an object without giving an actual size or quantity. The values of such qualitative
attributes are typically words representing categories.
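
The point about central tendency can be illustrated with a minimal Python sketch. It ranks the grade values from the example above and then takes the mode and the median by rank. The sample list `grades` and the helper name `ordinal_summary` are assumptions made for illustration only; they are not part of the experiment's data.

```python
from collections import Counter
from statistics import median_low

# Category order taken from the grade example above, lowest to highest.
GRADE_ORDER = ["C", "B", "A-", "A", "A+"]


def ordinal_summary(values, order):
    """Return (mode, median) of an ordinal attribute, given its category order."""
    rank = {category: i for i, category in enumerate(order)}
    # Mode: the most frequent category; it needs no ordering at all.
    mode = Counter(values).most_common(1)[0][0]
    # Median: the middle value by rank. median_low always returns an observed
    # rank, so the result is a real category (no averaging of categories).
    median = order[median_low(rank[v] for v in values)]
    return mode, median


# Hypothetical sample of grades, for illustration only.
grades = ["A+", "A+", "A+", "A", "B", "C", "C", "A-", "B"]
print(ordinal_summary(grades, GRADE_ORDER))  # ('A+', 'A-')
```

A mean is deliberately not computed: averaging the ranks would assume equal spacing between grades, which an ordinal attribute does not guarantee.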

Numeric Attributes:

A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer
or real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to
providing a ranking of values, such attributes allow us to compare and quantify the difference
between values. Because the zero point is arbitrary, ratios of such values are not meaningful.
For example, temperature in Celsius or Fahrenheit, and calendar dates.

Ratio-Scaled Attributes:
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another
value. In addition, the values are ordered, and we can also compute the difference between
values, as well as the mean, median, and mode.

For example, years of experience
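
The difference between the two scales can be sketched in a few lines of Python. The readings below are hypothetical values chosen for illustration, and `celsius_to_kelvin` is a helper introduced here, not something from the experiment.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return c + 273.15


# Interval-scaled: differences are meaningful, ratios are not.
t_low, t_high = 10.0, 20.0               # hypothetical readings in Celsius
difference = t_high - t_low              # "10 degrees warmer" is a valid statement
naive_ratio = t_high / t_low             # 2.0, yet 20 C is not "twice as hot" as 10 C
kelvin_ratio = celsius_to_kelvin(t_high) / celsius_to_kelvin(t_low)  # roughly 1.035

# Ratio-scaled: zero means "none", so ratios are meaningful.
exp_a, exp_b = 4, 8                      # hypothetical years of experience
experience_ratio = exp_b / exp_a         # 2.0: b has twice the experience of a

print(difference, naive_ratio, round(kelvin_ratio, 3), experience_ratio)
```

The same pair of temperatures gives a ratio of 2.0 in Celsius but only about 1.035 in Kelvin, which is why ratios are reserved for attributes with an inherent zero point.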

___________________________________________________________________________

Procedure / Approach /Algorithm / Activity Diagram:

1. Download a large dataset for the purpose of exploration and ensure that the dataset has a
variety of attributes; the number of attributes must be at least 25.
2. Identify the category of each attribute in the dataset you have selected.
3. Identify the attributes that can provide useful information, either collectively or
individually. Also discuss the information provided by each attribute and how it will be
computed.
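
Step 2 can be partly automated. The sketch below is one possible heuristic, not the method used in this experiment: it labels a column from its raw string values. The function name `categorize` and the `ordered_levels` parameter are assumptions introduced for illustration.

```python
def categorize(values, ordered_levels=None):
    """Heuristically label an attribute from its raw string values.

    ordered_levels: optional list of category names in rank order. Supplying
    it is how the caller declares that an attribute is ordinal.
    """
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "unknown"
    if ordered_levels is not None and set(present) <= set(ordered_levels):
        return "ordinal"
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # Non-numeric text: names or codes with no inherent order.
        return "nominal (binary)" if len(set(present)) == 2 else "nominal"
    # Whole numbers suggest counting; fractional values suggest measuring.
    if all(n.is_integer() for n in numbers):
        return "numeric, discrete"
    return "numeric, continuous"


print(categorize(["Holiday", "Working Day", "Working Day"]))
print(categorize(["High", "Low", "Medium"], ordered_levels=["Low", "Medium", "High"]))
print(categorize(["33.72", "31.55"]))
print(categorize(["2", "0", "14"]))
```

The heuristic cannot tell interval-scaled from ratio-scaled attributes, since that depends on whether zero is meaningful, and it cannot detect an ordering unless one is supplied. Those decisions still need to be made by reading the attribute description.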
___________________________________________________________________________

Results: (Program printout with output / Document printout as per the format)

Attribute Name | Attribute Type | Reason
Date | Interval-scaled | Dates are measured on a scale of equal-size units (24 hours).
Islamic Date | Interval-scaled | Dates are measured on a scale of equal-size units (24 hours).
Blast Day type | Discrete, Nominal | It has mainly 2 values (Holiday, Working day).
Holiday Type | Nominal, Discrete | The attribute Holiday Type takes a finite set of values that have no meaningful order.
Time | Interval-scaled | Time is measured on a scale of equal-size units.
City | Nominal, Discrete | The attribute City takes a finite set of values that have no meaningful order.
Latitude | Continuous | Represented as floating-point values.
Longitude | Continuous | Represented as floating-point values.
Province | Nominal, Discrete | The attribute Province takes a finite set of values that have no meaningful order.
Location | Nominal, Discrete | The attribute Location takes a finite set of values that have no meaningful order.
Location Category | Nominal, Discrete | It does not have a meaningful order, but the values provide enough information to distinguish one from another and cannot be represented as integers.
Location Sensitivity | Ordinal | It has mainly 3 values (High, Medium, Low), which can be arranged in a meaningful order.
Open/Closed space | Discrete, Nominal | It has mainly 2 values (Open, Closed).
Influencing Event/Event | Nominal, Discrete | It does not have a meaningful order, but the values provide enough information to distinguish one from another and cannot be represented as integers.
Target Type | Nominal, Discrete | It does not have a meaningful order, but the values provide enough information to distinguish one from another and cannot be represented as integers.
Target sect if any | Nominal, Discrete | It does not have a meaningful order, but the values provide enough information to distinguish one from another and cannot be represented as integers.
Killed min | Numeric, Discrete | It is a count (the minimum number of people killed), so it takes only whole-number values.
Killed max | Numeric, Discrete | It is a count (the maximum number of people killed), so it takes only whole-number values.
Injured min | Numeric, Discrete | It is a count (the minimum number of people injured), so it takes only whole-number values.
Injured max | Numeric, Discrete | It is a count (the maximum number of people injured), so it takes only whole-number values.
No. of suicide blasts | Numeric, Discrete | It is a count (the number of suicide blasts), so it takes only whole-number values.
Explosive weight(max) | Continuous | It is the weight per item and can be any positive number.
Hospital names | Discrete | It has a finite set of values, which cannot be represented as integers.
Temperature(C) | Continuous | Represented as floating-point values.
Temperature(F) | Continuous | Represented as floating-point values.

Attribute Name | Information provided | How/What to compute?
Blast Day type | Tells whether the blast took place on a working day or not. | Can be used to find a pattern, i.e. to compute on which day type most blasts occur.
Time | The time at which the blast occurred. | With this information, a prediction can be made as to what time a blast is most likely to happen.
Location Sensitivity | Tells how sensitive the location is, i.e. high or low. | The sensitivity of the location can help the armed forces estimate the number of officials needed at a location.
Open/Closed space | Tells whether the blast took place in an open or a closed space. | Can be used together with the 'Killed max' and 'Injured max' columns to relate the number of people killed/injured to the space type.
Killed max | The maximum number of people killed. | Can be used together with other columns to compute the injury-to-kill rate.
Injured max | The maximum number of people injured. | The maximum number of people injured can be found and then used for predictions.
No. of suicide blasts | Tells the number of suicide bombers. | Can be used together with other columns to compute the relationship between the number of suicide bombers and other factors.
Explosive weight(max) | The maximum weight of the explosive used. | Can be used together with other columns to compute the relationship between the weight of the bomb, the number of people killed/injured, and other factors.
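
Two of the computations listed in the table above can be sketched as follows. The rows are hypothetical values made up for illustration and are not taken from the dataset; the function names are likewise assumptions. The dictionary keys reuse the attribute names from the table.

```python
from collections import Counter, defaultdict

# Hypothetical rows, for illustration only; NOT values from the dataset.
rows = [
    {"Blast Day type": "Working Day", "Open/Closed space": "Open",
     "Killed max": 10, "Injured max": 30},
    {"Blast Day type": "Holiday", "Open/Closed space": "Closed",
     "Killed max": 20, "Injured max": 20},
    {"Blast Day type": "Working Day", "Open/Closed space": "Closed",
     "Killed max": 5, "Injured max": 15},
]


def blasts_per_day_type(rows):
    """Count how many blasts fall on each day type."""
    return Counter(r["Blast Day type"] for r in rows)


def injured_to_killed_by_space(rows):
    """Ratio of total injured to total killed, grouped by space type."""
    killed = defaultdict(int)
    injured = defaultdict(int)
    for r in rows:
        space = r["Open/Closed space"]
        killed[space] += r["Killed max"]
        injured[space] += r["Injured max"]
    # Skip groups with no recorded deaths to avoid dividing by zero.
    return {s: injured[s] / killed[s] for s in killed if killed[s] > 0}


print(blasts_per_day_type(rows).most_common(1))
print(injured_to_killed_by_space(rows))
```

On the real dataset, the same grouping could be applied after loading the CSV file, with missing values handled first.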

Dataset:


___________________________________________________________________________
Questions:
1. Compare Discrete and Continuous Attributes. Give at least 5 examples of each.

BASIS FOR COMPARISON | DISCRETE ATTRIBUTES | CONTINUOUS ATTRIBUTES
Meaning | Attributes that assume a finite or countably infinite number of isolated values. | Attributes that can assume an infinite number of different values.
Range of specified number | Complete | Incomplete
Values | Obtained by counting. | Obtained by measuring.
Classification | Non-overlapping | Overlapping
Assumes | Distinct or separate values. | Any value between two given values.
Represented by | Isolated points | Connected points

Examples:

Discrete Attribute

• Number of printing mistakes in a book.
• Number of road accidents in New Delhi.
• Number of siblings of an individual.
• Gender
• The number of students in a class (you can't have half a student).


Continuous Attribute

• Height of a person
• Age of a person
• Profit earned by the company
• Temperature of the day
• Weight of a person

___________________________________________________________________________
Outcomes: CO1 : Summarize the data

___________________________________________________________________________

Conclusion: (Conclusion to be based on the objectives and outcomes achieved)

We were able to categorize the attributes of the dataset into various types. This helps us
understand the data better and decide the best computational technique to apply to each attribute.

Grade: AA / AB / BB / BC / CC / CD /DD

Signature of faculty in-charge with date


