Data science is an emerging interdisciplinary field in which data scientists
analyze data to predict what is likely to happen in a line of business. Data
scientists, much like business analysts, need to be critical thinkers and
results-oriented, with strong industry-specific knowledge and the communication
and other soft skills that allow them to explain highly technical results to their
non-technical counterparts. An ideal data scientist has both the engineering
skills to acquire and manage large data sets and the statistician’s skills to
extract value from those data sets and present it to a large audience (Woods,
2011). In addition, they can use business intelligence reports to analyze what
has happened in the past. A business can therefore increase its value by
analyzing the past with business intelligence and predicting the future with
data science, provided the analytical problems are solved in the right way.

Being a data scientist today is challenging for three reasons: 1) Conflict with
the IT department, which may not cooperate in providing all the data needed and
may strongly object to data scientists using programming tools such as Python to
explore the data. 2) Unrealistic expectations from industries and firms, which
expect data scientists to know everything and to solve every business problem.
3) The number of big data tools keeps growing as technologies diversify, and data
scientists must choose and apply the right tool or AI technique for each specific
business problem.

Big data is the large volume of fast-generated data that inundates a business on
a day-to-day basis and is hard to manage with conventional IT infrastructure.
There is no fixed threshold that defines the volume, and the definition changes
over time. According to Appendix 4, 80% of big data is unstructured, in the form
of text documents, images, and videos. Data scientists can nevertheless use
either structured or unstructured big data to accomplish a data science task and
solve a business problem. In the 21st century, many multinational corporations
(MNCs) use big data analytics in their daily operations. For example, on a daily
basis Facebook processes 0.5 petabytes, Walmart generates 40 petabytes of
transactional data, and Google processes 24 petabytes (1 petabyte (PB) = 1,000
terabytes (TB) = 10^6 gigabytes (GB)). Besides data scientists, an industry or
firm may also face challenges in its business innovation. For example, if a
company plans to store 40 PB in its system, it needs 4,000 drives (which implies
10 TB per drive) at $200 each, costing $0.8 million plus tax and the budget for
other accessories. The availability of that hardware on the market must also be
factored into the lead time. In addition, at a transfer speed of 100 MB/sec it
would take roughly 12.7 years to finish writing 40 PB to the hard drives. A firm
therefore needs to consider cost, space, lead time, and so on before investing in
technology innovation.
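
The storage example can be checked with a few lines of arithmetic. The sketch
below assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), a 365-day
year, and a single continuous transfer stream; the variable names are
illustrative only. Under those assumptions the result should land close to the
transfer time quoted above.

```python
# Back-of-the-envelope check of the 40 PB storage example.
# Assumptions: decimal units, a 365-day year, one uninterrupted transfer stream.
PB = 10**15  # bytes in one petabyte
TB = 10**12  # bytes in one terabyte
MB = 10**6   # bytes in one megabyte

data_bytes = 40 * PB
drive_count = 4_000
price_per_drive_usd = 200
transfer_rate = 100 * MB  # bytes per second

# Capacity each drive must provide for 4,000 drives to hold 40 PB.
capacity_per_drive_tb = data_bytes / drive_count / TB

# Hardware cost before tax and accessories.
hardware_cost_usd = drive_count * price_per_drive_usd

# Time to write all the data at the stated transfer speed.
transfer_seconds = data_bytes / transfer_rate
transfer_years = transfer_seconds / (365 * 24 * 3600)

print(f"Capacity per drive: {capacity_per_drive_tb:.0f} TB")
print(f"Hardware cost: ${hardware_cost_usd:,}")
print(f"Transfer time: {transfer_years:.1f} years")
```

Writing to many drives in parallel would shorten the transfer time, so the
single-stream figure is best read as an upper bound.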

To go deeper: how can we reframe a business problem as a data science question
and approach the analytics problem? First, we must understand the business issue
and examine the available big data, then analyze the validated, useful data,
impute missing values, identify the root cause, and prescribe a remedy. A data
science project is not a one-man show. Besides the data scientist, it involves
the database administrator (DBA) and the data engineer, who are responsible for
providing and preparing the data, while the business user, business intelligence
analyst, project sponsor, and project manager also play their roles in completing
a successful project. Once everything is set up, we can select the framework to
work with, either the “Data Analytic Life Cycle” or the “Microsoft Team Data
Science Process (TDSP)”.

The data analytic lifecycle (refer to Appendix 6) consists of six steps: 1)
Discovery, 2) Data Preparation, 3) Model Planning, 4) Model Building, 5)
Communicate Results, and 6) Operationalize. We use an online mortgage approval
scenario to illustrate this lifecycle.

1) Discovery

This is an important step in which the business problem is defined and understood
by conducting interviews with the stakeholders.

2) Data Preparation

Establish the analytic sandbox and ETLT (extract, transform, load, and
transform) the data, followed by data exploration and conditioning (removing
outliers and handling missing data), and finally summarize and visualize the
data. First, get to know the data by understanding each data code, then proceed
to visualization. Anscombe’s Quartet, a set of four datasets with nearly
identical summary statistics that look very different when plotted, is a common
illustration of why this matters. Visualization is important before analysis
because data is easier to read in visual form, and it applies to any domain,
which makes analysis and communication with people easier (refer to Appendix 7A
& 7B). In the analysis stage, we can analyze the relationship between two
variables and summarize the findings.
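
To see why plotting matters before analysis, the sketch below computes summary
statistics for the first two of Anscombe's four datasets using only the Python
standard library. The values are Anscombe's published data; the helper name
`pearson` is illustrative and not part of any library.

```python
from statistics import mean, variance

# First two datasets of Anscombe's Quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

r1 = pearson(x, y1)
r2 = pearson(x, y2)

for name, y, r in (("I", y1, r1), ("II", y2, r2)):
    print(f"Dataset {name}: mean={mean(y):.2f} "
          f"variance={variance(y):.2f} correlation={r:.3f}")
```

The two datasets produce almost the same mean, variance, and correlation, yet a
scatter plot shows the first as a noisy linear trend and the second as a smooth
curve. Summary statistics alone would not reveal that difference.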

3) Model Planning

After data analysis, select a model suited to the specific business problem.
There are various categories of techniques to choose from (refer to Appendix 8).

4) Model Building

Build the training and test datasets: 80% of the labeled data is used for
training, while the remaining 20% is held out for testing only, with its labels
hidden from the model. Once the model is set up, train the selected model,
evaluate the fitted model, and adjust it as needed to obtain accurate results.
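
A minimal sketch of the 80/20 split follows, using only the Python standard
library. The mortgage application records, their field names, and the
`train_test_split` helper are hypothetical and made up for illustration; a real
project would load actual application data and might use a library routine
instead.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled mortgage applications; "approved" is the label.
applications = [
    {"id": i, "income": 30_000 + 500 * i, "approved": i % 3 != 0}
    for i in range(100)
]

train, test = train_test_split(applications)
print(len(train), "training records,", len(test), "test records")
```

Shuffling before splitting keeps the held-out records from all coming from one
part of the file, and the model is scored only on records it did not see during
training.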

5) Communicate Results

The data scientist prepares different types of presentations for different
purposes. The target audience includes management, analysts, and other
stakeholders. The aim is to present the findings, predictions, and
recommendations for solving the business problem.

6) Operationalize

Operationalize the model by providing the code and technical documentation to
the responsible department for deployment and further action once the
communicated results have been approved. At this point the lifecycle is
considered complete, but the work is not finished: the model must be monitored
and refined, new attributes can be considered, and the delivery mechanism can be
simplified with self-service reporting.

Data science combines the scientific method, math and statistics,
specialized programming, advanced analytics, AI, and even storytelling to
uncover and explain the business insights buried in data (What Is Data Science?,
2020).
