
DATA COLLECTION

Data collection involves gathering the relevant data required to build the model. It comprises the following steps:

1. DATA SOURCES: Identifying sources of data that are relevant to the problem. These sources include structured databases: one might collect data from a relational database like MySQL or PostgreSQL that stores customer information, sales transactions, or user interactions (Rautmare and Bhalerao, 2016). Another source is unstructured text data, gathered from customer reviews, support tickets, or social media posts. Sensor data is a further source, involving the collection of readings such as humidity and temperature.
2. DATA ACQUISITION: Once data sources are identified, one needs to acquire the data. This involves querying databases, that is, using Structured Query Language (SQL) to extract relevant data from them (Buneman et al., 1982). Web scraping involves collecting data from websites that do not provide APIs; web scraping libraries like Selenium can be used to extract the required information from web pages. API integration is used if data is available through APIs, where programming languages like Python are used to make API requests and retrieve the data. For example, one might use the Twitter API to collect tweets related to a specific topic (see the acquisition sketch after this list).
3. DATA QUALITY ASSESSMENT: After acquiring data, it is essential to assess its quality and suitability. This involves checking for potential issues such as missing values, duplicate records, and inconsistent formatting. It also includes techniques such as outlier detection, where one might employ statistical methods such as the z-score or box plots to identify and handle outliers, either by removing them or applying appropriate transformations (Asikoglu, 2017). For missing values, one might use methods like imputation or deleting rows with missing values, depending on the impact of the missing data on the problem (see the quality-assessment sketch after this list).
4. DATA INTEGRATION AND MERGING: In some cases, one might need to combine data from multiple sources to create a comprehensive dataset (Curran and Hussong, 2009). This can involve merging datasets based on common identifiers or performing joins across different tables in a database. For example, if customer data is stored in one table and sales data in another, one might join them on customer ID to create a unified dataset (see the merging sketch after this list).
5. DATA PRIVACY AND ETHICS: Ensure that one complies with data privacy regulations and ethical considerations while collecting and handling the data. This includes acquiring the necessary permissions, anonymizing sensitive information, and protecting the security of the data (see the anonymization sketch after this list).
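
The following is a minimal Python sketch of the acquisition step, assuming a hypothetical SQLite database file customers.db (with a customers table) and a hypothetical JSON endpoint https://api.example.com/v1/posts; both are illustrative stand-ins rather than real systems, and the same pattern applies to MySQL or PostgreSQL with their respective drivers.

    import sqlite3
    import requests

    # Query a relational database with SQL (SQLite here so the example is
    # self-contained; swap in a MySQL/PostgreSQL driver for those systems).
    conn = sqlite3.connect("customers.db")  # hypothetical database file
    rows = conn.execute(
        "SELECT customer_id, name, email FROM customers"  # hypothetical table
    ).fetchall()
    conn.close()

    # Retrieve data through an API: a GET request to a hypothetical JSON endpoint.
    response = requests.get(
        "https://api.example.com/v1/posts",
        params={"topic": "machine learning"},
        timeout=10,
    )
    response.raise_for_status()
    records = response.json()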
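
As a sketch of the quality-assessment step, the snippet below uses pandas to impute a missing value and flags outliers with the z-score technique cited above; the column name value, the sample data, and the threshold of 3 are all illustrative choices.

    import pandas as pd

    # Small illustrative series with one missing value and one extreme value.
    df = pd.DataFrame({"value": [10, 11, 12, None, 10, 11, 12, 10, 11, 12, 11, 1000]})

    # Impute the missing value (the median is more robust to outliers than the mean).
    df["value"] = df["value"].fillna(df["value"].median())

    # Flag values lying more than 3 standard deviations from the mean.
    z = (df["value"] - df["value"].mean()) / df["value"].std()
    outliers = df[z.abs() > 3]  # here: the row containing 1000

    # One handling option named in step 3: remove the flagged rows.
    df_clean = df[z.abs() <= 3]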
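
For the integration-and-merging step, a pandas join on a shared customer ID looks like the following; the table contents are made-up sample data.

    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Carol"],
    })
    sales = pd.DataFrame({
        "customer_id": [1, 1, 3],
        "amount": [250.0, 99.5, 42.0],
    })

    # Join customer data and sales data on the common identifier.
    unified = customers.merge(sales, on="customer_id", how="inner")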
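
Finally, one simple way to anonymize sensitive fields, offered as a sketch rather than a full compliance solution, is to replace direct identifiers with salted hashes (strictly speaking, pseudonymization); the email values and salt below are made up.

    import hashlib

    def pseudonymize(value: str, salt: str) -> str:
        # Replace a direct identifier with a salted SHA-256 digest so records
        # remain linkable without exposing the raw value.
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

    salt = "store-this-secret-separately"  # illustrative; keep real salts out of the dataset
    emails = ["alice@example.com", "bob@example.com"]
    anonymized = [pseudonymize(e, salt) for e in emails]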

REFERENCE LIST

Asikoglu, O., 2017. Outlier detection in extreme value series. Neural Networks, 4(5).

Buneman, P., Frankel, R.E. and Nikhil, R., 1982. An implementation technique for database query languages. ACM Transactions on Database Systems (TODS), 7(2), pp.164-186.

Curran, P.J. and Hussong, A.M., 2009. Integrative data analysis: the simultaneous analysis of multiple data sets. Psychological Methods, 14(2), p.81.

Rautmare, S. and Bhalerao, D.M., 2016, October. MySQL and NoSQL database comparison for IoT application. In 2016 IEEE International Conference on Advances in Computer Applications (ICACA) (pp. 235-238). IEEE.
