DATA MINING
DATA PROCESSING
DATA CLEANING
The quality of your data is critical to the final analysis. Data that is
incomplete, noisy, or inconsistent can affect your results. Data cleaning in
data mining is the process of detecting and removing corrupt or inaccurate
records from a record set, table, or database.
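A minimal sketch of this idea in plain Python is shown here. The records, the field names (`id`, `name`, `age`), and the plausibility rule for age (0 to 120) are all hypothetical and chosen only for illustration; real cleaning rules depend on the data set.

```python
# Hypothetical records; None and empty strings stand in for missing values.
records = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "", "age": 29},       # incomplete: empty name
    {"id": 3, "name": "Raj", "age": -5},    # inaccurate: impossible age
    {"id": 1, "name": "Ann", "age": 34},    # duplicate of id 1
    {"id": 4, "name": "Lee", "age": None},  # incomplete: missing age
    {"id": 5, "name": "Mia", "age": 41},
]


def is_valid(record):
    """Return True when the record is complete and its age is plausible."""
    if not record["name"]:
        return False
    age = record["age"]
    # The 0..120 range is an assumed plausibility rule, not a standard.
    return age is not None and 0 <= age <= 120


def clean(rows):
    """Drop invalid rows and rows whose id was already seen."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if not is_valid(row) or row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        cleaned.append(row)
    return cleaned


cleaned_records = clean(records)
print(cleaned_records)
```

Only the records with ids 1 and 5 survive: the others are incomplete, inaccurate, or duplicated.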
• Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated into a single
description for analysis. This is a crucial step, since the accuracy of data analysis
insights depends heavily on the quantity and quality of the data used. Gathering accurate,
high-quality data in a large enough quantity is necessary to produce relevant results.
• Collected data supports decisions on everything from financing and the business
strategy of a product to pricing, operations, and marketing strategies.
• For example, sales data may be aggregated to compute monthly and annual total amounts.
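The sales example can be sketched in a few lines of Python. The `sales` list and its ISO date strings are made-up sample data, and grouping by a string key is just one simple way to do it.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
sales = [
    ("2023-01-05", 120.0),
    ("2023-01-20", 80.0),
    ("2023-02-11", 200.0),
    ("2024-01-03", 50.0),
]

monthly_totals = defaultdict(float)
annual_totals = defaultdict(float)
for date, amount in sales:
    year, month, _day = date.split("-")
    monthly_totals[f"{year}-{month}"] += amount  # summary per month
    annual_totals[year] += amount                # summary per year

print(dict(monthly_totals))
print(dict(annual_totals))
```

Each daily record is folded into its month and its year, so the detailed rows are replaced by a summary.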
Discretization:
It is the process of transforming continuous data into a set of small
intervals. Most data mining activities in the real world involve
continuous attributes, yet many existing data mining frameworks
are unable to handle them.
Also, even if a data mining task can manage a continuous
attribute, its efficiency can improve significantly when the
continuous attribute is replaced with discrete values.
For example, numeric ranges (1-10, 11-20) or age labels (young, middle age, senior).
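The age example might look like the following sketch. The cut points (30 and 60) are assumptions made for illustration; the text does not say where "young", "middle age", and "senior" begin.

```python
import bisect

# Assumed cut points: below 30 is "young", 30-59 "middle age", 60+ "senior".
BOUNDARIES = [30, 60]
LABELS = ["young", "middle age", "senior"]


def discretize_age(age):
    """Map a continuous age to the label of the interval it falls in."""
    # bisect_right counts how many boundaries are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect.bisect_right(BOUNDARIES, age)]


ages = [12, 29, 30, 45, 60, 83]
age_groups = [discretize_age(a) for a in ages]
print(age_groups)
```

Changing `BOUNDARIES` and `LABELS` together gives a different set of intervals without touching the function.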
Normalization: Data normalization involves converting all data variables
into a given range.
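One common way to do this is min-max scaling, sketched below. The text does not name a specific method, so the choice of min-max scaling, the function name `min_max_normalize`, and the sample incomes are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound
        # to avoid dividing by zero.
        return [new_min for _ in values]
    span = old_max - old_min
    return [
        new_min + (v - old_min) / span * (new_max - new_min)
        for v in values
    ]


incomes = [20000, 35000, 50000, 80000]  # hypothetical sample values
normalized = min_max_normalize(incomes)
print(normalized)
```

The smallest value maps to the lower end of the range and the largest to the upper end; everything else keeps its relative position.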
Concept of Hierarchy?
DATA INTEGRATION
Data integration is a data pre-processing technique that combines data from multiple heterogeneous data sources into a
coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
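As a rough illustration, the sketch below combines two small in-memory sources into one view. The two sources, their field names (`customer_id`, `cust`, `total`), and the `integrate` helper are hypothetical; they stand in for a database table and a flat file that use different names for the same key.

```python
# Hypothetical source 1: rows from a customer database table.
db_customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Raj"},
]

# Hypothetical source 2: rows parsed from a flat file with a different key name.
file_orders = [
    {"cust": 1, "total": 250.0},
    {"cust": 1, "total": 40.0},
    {"cust": 3, "total": 99.0},  # no matching customer in source 1
]


def integrate(customers, orders):
    """Build one unified view keyed by customer id, reconciling key names."""
    unified = {
        c["customer_id"]: {"name": c["name"], "order_total": 0.0}
        for c in customers
    }
    for order in orders:
        # Keep orders for unknown customers instead of silently dropping them.
        entry = unified.setdefault(
            order["cust"], {"name": None, "order_total": 0.0}
        )
        entry["order_total"] += order["total"]
    return unified


unified_view = integrate(db_customers, file_orders)
print(unified_view)
```

Keeping the unmatched order (customer 3) with a missing name is a design choice here; a stricter pipeline might reject it or route it to a separate cleaning step.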
• A lot of connectors. There are many systems and applications in the world; the more pre-built connectors your Data Integration
tool has, the more time your team will save.
• Open source. Open source architectures typically provide more flexibility while helping to avoid vendor lock-in.
• Portability. It's important, as companies increasingly move to hybrid cloud models, to be able to build your data integrations once
and run them anywhere.
• Ease of use. Data integration tools should be easy to learn and easy to use, with a GUI that makes visualizing your
data pipelines simpler.
• A transparent price model. Your data integration tool provider should not penalize you for increasing the number of connectors or
data volumes.
• Cloud compatibility. Your data integration tool should work natively in a single cloud, multi-cloud, or hybrid cloud environment.
• Gathering: You may need to collect data that is relevant to the analysis from
multiple sources.
• Cleansing: Data may have issues that you want to resolve before analysis.
• Formatting: Data is transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations.
• Blending: Data is combined or blended from various sources to develop our
analytical dataset.
• Sampling -
Data Mining: Concepts and Techniques April 26, 2024 16
Validation
• Important steps:
• Observe the key results of the model.
• Ensure the results make sense within the context of business problems.
• Determine whether to proceed or take a step back.
• Iterate.
Class interval   Frequency (f)   Midpoint (M)   Relative frequency   Cumulative frequency   fM
9 - 11           7               10             7/60 = .1167         55                     70
11 - 13          5               12             5/60 = .0833         60                     60
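Mean = $\frac{\sum fM}{\sum f} = \frac{416}{60} \approx 6.93$, where f is the frequency of each class and M is its midpoint.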
Median of Grouped Data
Median = $L + \frac{N/2 - cf}{f_{med}} \times W$
L = the lower limit of the median class interval = 7
cf = the cumulative total of the frequencies up to, but not including, the frequency of the
median class = 29
f_med = the frequency of the median class = 19
W = the width of the median class = 2
N = the total of the frequencies = 60
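Substituting the values listed above into the standard grouped-data median formula can be checked with a short sketch. The function name `grouped_median` and its parameter names are chosen here for illustration; the formula inside is the usual textbook one for grouped data.

```python
def grouped_median(L, N, cf, f_med, W):
    """Median of grouped data: L + ((N/2 - cf) / f_med) * W."""
    return L + ((N / 2 - cf) / f_med) * W


# Values taken from the definitions above.
median = grouped_median(L=7, N=60, cf=29, f_med=19, W=2)
print(round(median, 3))  # 7.105
```

Half of the 60 observations is 30, which is one observation past the 29 accumulated before the median class, so the median sits 1/19 of the way into the class of width 2 that starts at 7.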
DATA VISUALIZATION
TOOLS USED FOR DATA VISUALIZATION
Tableau
Looker
Zoho Analytics
Sisense
IBM Cognos Analytics
Qlik Sense
Domo
Microsoft Power BI
Klipfolio
SAP Analytics Cloud
Top 8 Python Libraries for Data Visualization
1. Matplotlib
2. Plotly
3. Seaborn
4. GGplot
5. Altair
6. Bokeh
7. Pygal
8. Geoplotlib
Data Mining Knowledge Representation
Background knowledge:
Concept hierarchies: Concept hierarchies
are induced by a partial order over the values
of a given attribute. Depending on the type of
the ordering relation, we distinguish several
types of concept hierarchies.
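A partial order over attribute values can be sketched as a child-to-parent map. The "location" attribute, the place names, and the helper functions below are hypothetical and show only one simple, tree-shaped kind of hierarchy.

```python
# Hypothetical hierarchy for a "location" attribute: each value maps to its parent.
PARENT = {
    "Chicago": "Illinois",
    "Springfield": "Illinois",
    "Illinois": "USA",
    "Toronto": "Ontario",
    "Ontario": "Canada",
}


def ancestors(value):
    """Return the chain of increasingly general concepts above a value."""
    chain = []
    while value in PARENT:
        value = PARENT[value]
        chain.append(value)
    return chain


def is_more_general(general, specific):
    """True when `general` lies strictly above `specific` in the partial order."""
    return general in ancestors(specific)


print(ancestors("Chicago"))
print(is_more_general("USA", "Springfield"))
print(is_more_general("Illinois", "Toronto"))
```

The order is partial because some pairs, such as "Illinois" and "Toronto", are not comparable: neither is a generalization of the other.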
Representing Input Data And Output Knowledge