
Utilization of Big Data in Research
Lala Septem Riza

Graduate School
2023
Outline
1. Introduction to Data Science
2. Phenomena and Definition of Big Data
3. Platforms, Technologies, Tools, and Methods in Big Data Analysis
4. Implementations and Research in Big Data
Introduction to Data Science
• Data science is the study that focuses on knowledge extraction from
data: data collection, preparation, analysis, visualization,
management, recommendation, etc.
• Data science is an interdisciplinary field that requires hacking skills
(i.e., programming), math and statistics knowledge, and substantive
expertise in a field of science.
Processes in Data Science
1. Objectives: ask the right questions to find out what the problem is.
2. Data collection: get relevant data for analyzing the problem.
3. Data preprocessing: explore the data and correct errors (cleaning and organizing).
4. Computational and data models: descriptive, predictive, etc.
5. Reporting, dissemination, and publication.

Data Science: Software and Implementations
Final Goals in Data Analysis

1. Decision analytics: supports decision-making with visual analytics that reflect reasoning.
2. Descriptive analytics: provides insight from historical data with
reporting, score cards, clustering, etc.
3. Predictive analytics: employs predictive modeling using statistical
and machine learning techniques.
4. Prescriptive analytics: recommends decisions using optimization,
simulation, etc.
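
The difference between descriptive and predictive analytics can be illustrated with a minimal sketch. The study-hours and score values below are hypothetical, invented only for illustration, and the straight-line least-squares fit is just one possible predictive model.

```python
from statistics import mean

# Hypothetical data (not from any study): weekly study hours and exam scores.
hours = [2, 4, 6, 8, 10]
scores = [55, 62, 71, 78, 84]

# Descriptive analytics: summarize what has already happened.
average_score = mean(scores)

# Predictive analytics: fit score = slope * hours + intercept by least squares.
mean_h, mean_s = mean(hours), mean(scores)
slope = (
    sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores))
    / sum((h - mean_h) ** 2 for h in hours)
)
intercept = mean_s - slope * mean_h


def predict(study_hours):
    """Estimate the score for an unseen number of study hours."""
    return slope * study_hours + intercept


print(f"Average score so far: {average_score}")
print(f"Estimated score for 12 hours: {predict(12):.1f}")
```

Prescriptive analytics would go one step further and search for the input (here, study hours) that best meets a stated goal under given constraints.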

Phenomena of Big Data
Volume of digital data, 2010 to 2025 (in zettabytes; 1 zettabyte = 10^21 bytes).
Internet Activities

Petabyte: 10^15 bytes
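
The storage units mentioned here are decimal (SI) powers of ten. As a small worked example, the sketch below converts between them; the `UNITS` table and the `convert` helper are illustrative names, not part of any library.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity from one decimal storage unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# One zettabyte expressed in petabytes: 10^21 / 10^15 = 10^6.
print(convert(1, "ZB", "PB"))
```

So a single zettabyte corresponds to one million petabytes, which gives a sense of the scale on the chart.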


The Shift of Marketplace
What is Big Data?
1. Volume: the huge amounts of data being stored.
2. Velocity: the lightning speed at which data streams must be processed and analyzed.
3. Variety: the different sources and forms from which data is collected, such as numbers, text, video, images, and audio.
9 Vs in Big Data Definitions
Historical/Traditional technologies don’t work
because …
Challenges in Big Data
Technology and Method in Big Data Analysis
Issues in Big Data technologies:
1. Computational models: how the data are processed and analyzed (data analysis/data science).
2. Database/storage frameworks: the technologies and mechanisms used to write, read, and manage Big Data efficiently. Fault tolerance, availability, consistency, scalability, and heterogeneity of Big Data must be handled as well.
Big Data Platform
Big Data Platforms
• Redundant and reliable: the platform replicates data automatically, so no data is lost when a machine goes down.
• Runs on commodity hardware: there is no need to buy special hardware, expensive RAID arrays, or redundant hardware; reliability is built into the software.
• Scale out rather than scale up.
• Bring code to data rather than data to code.
• Fault tolerant/Deal with failures.
• Break disk read barrier.
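
The "bring code to data" idea can be sketched in plain Python. This is only an illustration under stated assumptions: the `partitions` list stands in for data blocks stored on different machines, and `local_count` and `merge_counts` are hypothetical helper names. Each partition is summarized where it lives, and only the small summaries are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: on a real cluster each list would sit on a different node.
partitions = [
    ["big", "data", "platform"],
    ["data", "science", "data"],
    ["big", "cluster"],
]


def local_count(partition):
    """'Map' step: runs next to the data and returns a small summary."""
    return Counter(partition)


def merge_counts(left, right):
    """'Reduce' step: combines two partial summaries into one."""
    return left + right


# Only the per-partition summaries travel, never the raw records.
partial_counts = [local_count(p) for p in partitions]
total_counts = reduce(merge_counts, partial_counts, Counter())

print(total_counts.most_common(2))  # [('data', 3), ('big', 2)]
```

Because each summary is computed independently, a failed partition can be recomputed on its own, which is also what makes scaling out across many commodity machines practical.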
Introduction to Apache Hadoop
Hadoop History Timeline
• In April 2008, Hadoop broke a world record to become the
fastest system to sort an entire terabyte of data. Running on
a 910-node cluster, Hadoop sorted 1 terabyte in 209
seconds (just under 3.5 minutes), beating the previous year’s
winner of 297 seconds.
• In November of the same year, Google reported that its
MapReduce implementation sorted 1 terabyte in 68
seconds.
• Then, in April 2009, it was announced that a team at Yahoo!
had used Hadoop to sort 1 terabyte in 62 seconds.
• In the 2014 competition, a team from Databricks were joint
winners of the Gray Sort benchmark. They used a 207-node
Spark cluster to sort 100 terabytes of data in 1,406 seconds,
a rate of 4.27 terabytes per minute.
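
To compare the records in the timeline, the sort times can be turned into throughput. The sketch below uses only the figures quoted above; note that the runs used different cluster sizes and hardware, so the rates are not a like-for-like comparison.

```python
# (label, terabytes sorted, seconds) as quoted in the timeline.
records = [
    ("Hadoop, April 2008", 1, 209),
    ("Google MapReduce, November 2008", 1, 68),
    ("Yahoo! Hadoop, April 2009", 1, 62),
    ("Databricks Spark, 2014", 100, 1406),
]


def terabytes_per_minute(terabytes, seconds):
    """Sorting throughput in terabytes per minute."""
    return terabytes / seconds * 60


rates = {label: terabytes_per_minute(tb, s) for label, tb, s in records}

for label, rate in rates.items():
    print(f"{label}: {rate:.2f} TB/min")
```

The last entry reproduces the quoted rate: 100 / 1,406 × 60 ≈ 4.27 terabytes per minute.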
Hadoop Version
Hadoop Distributed File System (HDFS)
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on clusters
of commodity hardware.
• Very large files: hundreds of megabytes, gigabytes, or terabytes
in size.
• Streaming data access: a write-once, read-many-times pattern.
• Commodity hardware: runs on clusters of commodity hardware.
• HDFS is not a good fit for:
• Low-latency data access
• Lots of small files
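
Why many small files are a problem can be shown with a back-of-the-envelope estimate. HDFS keeps file and block metadata in the memory of a central node (the namenode), so the number of files matters more than their total size. The sketch assumes a 128 MB block size and roughly 150 bytes of metadata per file or block entry; both numbers are commonly cited rules of thumb, not values taken from these slides, and the function names are illustrative.

```python
import math

BLOCK_SIZE_BYTES = 128 * 1024**2  # assumption: 128 MB block size
BYTES_PER_OBJECT = 150            # assumption: rough metadata cost per file or block entry


def block_count(file_size_bytes):
    """Number of block entries for a file; even a tiny file needs one."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE_BYTES))


def namenode_bytes(file_sizes):
    """Rough metadata estimate: one entry per file plus one per block."""
    return sum((1 + block_count(size)) * BYTES_PER_OBJECT for size in file_sizes)


one_gib = 1024**3
large_files = [one_gib] * 10                       # 10 GiB stored as 10 large files
small_files = [one_gib * 10 // 100_000] * 100_000  # about the same volume as 100,000 small files

print(namenode_bytes(large_files))  # 13500
print(namenode_bytes(small_files))  # 30000000
```

Under these assumptions the same volume of data costs about 13.5 KB of metadata as ten large files but about 30 MB as one hundred thousand small files.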
Research in Big Data
Implementations of Big Data Analysis

• Google: uses Big Data for search, recommendation, etc.
• Amazon: collects customers' behavior as Big Data for its recommendation system.
• Facebook: uses Big Data analysis for image recognition when tagging, deepfakes, People You May Know, etc.
Papers Related to Big Data
1. Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic
repeats detection using Boyer-Moore algorithm on Apache Spark
Streaming. Telkomnika, 18(2), 783-791.
2. Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in
education: a state of the art, limitations, and future research directions.
International Journal of Educational Technology in Higher Education,
17(1), 1-23.
3. Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M.
T., ... & Hasan, M. (2022). Student Performance Monitor: A Big Data
Analytical Application. In Proceedings of International Conference on
Data Science and Applications (pp. 759-771). Springer, Singapore.
Big Data in Bioinformatics
Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic repeats
detection using Boyer-Moore algorithm on Apache Spark Streaming. Telkomnika,
18(2), 783-791.
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
• Identifying and classifying repeats are fundamental annotation tasks, because repeats are tied to the evolution of genomes and to diseases and need to be distinguished from other gene types.
• Detecting genomic repeats is essentially a string-matching (pattern-matching) task: looking for a pattern in a large text.
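
As an illustration of pattern matching in a large text, here is a minimal sketch of Boyer-Moore search using only the bad-character rule. It is a simplified textbook variant (the full algorithm also uses a good-suffix rule) and is not the implementation from the paper; the function names are illustrative.

```python
def bad_character_table(pattern):
    """Map each character to the index of its last occurrence in the pattern."""
    return {char: index for index, char in enumerate(pattern)}


def boyer_moore_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    matches = []
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare the pattern with the text from right to left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            matches.append(shift)
            if shift + m < n:
                # Align the next text character with its last occurrence in the pattern.
                shift += m - last.get(text[shift + m], -1)
            else:
                shift += 1
        else:
            # Bad-character rule: skip ahead based on the mismatched text character.
            shift += max(1, j - last.get(text[shift + j], -1))
    return matches


print(boyer_moore_search("GATTACAGATTACA", "GATTACA"))  # [0, 7]
```

The right-to-left comparison is what lets the algorithm skip several positions at once when it meets a character that does not occur in the pattern, which pays off on long texts such as chromosome sequences.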
Research Objective
• This research aims to build a big-data computational model and to implement the Boyer-Moore algorithm for finding string patterns in human chromosome genome data available on the Ensembl site.
• Apache Spark is an open-source cluster computing framework for
large data processing.
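
One way a cluster framework can parallelize such a search is to split the sequence into chunks and search each chunk independently. The sketch below simulates this in plain Python with an ordinary loop in place of cluster workers; it is an assumption about how the work could be divided, not the method used in the paper, and all function names are hypothetical. Each chunk carries `len(pattern) - 1` extra characters so that matches crossing a chunk boundary are not lost.

```python
def find_all(text, pattern):
    """All start indices of pattern in text, including overlapping ones."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def split_with_overlap(sequence, chunk_size, overlap):
    """Yield (offset, chunk) pairs; each chunk has `overlap` extra trailing characters."""
    for offset in range(0, len(sequence), chunk_size):
        yield offset, sequence[offset:offset + chunk_size + overlap]


def chunked_search(sequence, pattern, chunk_size):
    """Search each chunk separately and merge the global match positions."""
    if not pattern or chunk_size <= 0:
        raise ValueError("pattern must be non-empty and chunk_size positive")
    overlap = len(pattern) - 1
    hits = set()
    for offset, chunk in split_with_overlap(sequence, chunk_size, overlap):
        # On a cluster, this loop body is the part that would run on a worker.
        hits.update(offset + pos for pos in find_all(chunk, pattern))
    return sorted(hits)


sequence = "ACGTACGTAC" * 3
print(chunked_search(sequence, "GTAC", chunk_size=7) == find_all(sequence, "GTAC"))  # True
```

The per-chunk search could be swapped for any string-matching routine, including a Boyer-Moore implementation.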
Research Method in Genomic Repeats

• 4 working environments:
• On personal computers
• On virtual machines in a Google Cloud project
• On HDFS
• With Apache Spark Streaming
• Data collection (around 3.9 GB): human DNA sequences, which can be downloaded freely from ftp://ftp.Ensembl.Org/pub/release-95/fasta/homo_sapiens/dna/
Results: Speed Comparisons
Big Data in Education
Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in education: a state of
the art, limitations, and future research directions. International Journal of
Educational Technology in Higher Education, 17(1), 1-23.
Big data in education
• In the educational realm, a large volume of data is produced through
online courses, teaching and learning activities.
• Academic data can help teachers analyze their teaching pedagogy and make changes according to students' needs and requirements.
• The large-scale administrative data can play a tremendous role in
managing various educational problems.
• Therefore, it is essential for professionals to understand the effectiveness of big data in education in order to minimize educational issues.
What research themes have been addressed in educational studies of big data?
Roadmap: Big Data in Education
Student Performance Monitor: A Big Data Analytical Application
Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M. T., ... &
Hasan, M. (2022). Student Performance Monitor: A Big Data Analytical Application.
In Proceedings of International Conference on Data Science and Applications (pp.
759-771). Springer, Singapore.
Objectives
• To analyze Program Learning Outcome (PLO) in Outcome Based
Education (OBE) by using Big Data Analytics.

The outcome-based education (OBE) system is an educational theory where every part of the curriculum is centered around outcomes or goals that a student must accomplish to successfully complete their program.
Proposed Analysis
Results
Other Example: Data Analysis in Education

[Figure: predictive modeling loop. Labels: real world; sensor 1 … sensor k; teacher … student; non-text data; text data; joint mining of non-text and text data; multiple predictors (features); predictive model; predicted values of real-world variables; change the world.]
Big Data for Education

[Figure: Towards Intelligent MOOC. Axes: quality and scalability. Labels: small classrooms; MOOC; intelligent MOOC; "Big Data Technology". Goals: automate grading with machine learning; automate question answering on forums.]


[Figure: Traditional manual grading. Labels: submitted assignments; graded assignments; grades (93, 85, …).]
[Figure: Proposed automated grading. Labels: submitted assignments; clustering; multi-dimensional grade predictor; batch grading; grade verification; detailed grading results; graded assignments; improvement; performance and behavior analysis.]
References
• Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in education: a state of the art, limitations,
and future research directions. International Journal of Educational Technology in Higher Education,
17(1), 1-23.
• Big Data Education System Leaderboard, University of Illinois at Urbana-Champaign, The Data and Information Systems Laboratories, http://times.cs.uiuc.edu/czhai/pub/bigdata-education-zhai.pptx
• Favaretto, M., De Clercq, E., Schneble, C. O., & Elger, B. S. (2020). What is your definition of Big Data?
Researchers’ understanding of the phenomenon of the decade. PloS one, 15(2), e0228987.
• Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M. T., ... & Hasan, M. (2022).
Student Performance Monitor: A Big Data Analytical Application. In Proceedings of International
Conference on Data Science and Applications (pp. 759-771). Springer, Singapore.
• Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic repeats detection using Boyer-Moore
algorithm on Apache Spark Streaming. Telkomnika, 18(2), 783-791.
