
Big data and Data Science

Case study
Dr. Purnima Gandhi
Computer Science and Engineering
Institute of Technology
Big Data characteristics
Traditional data vs. Big Data
Attribute | Traditional data | Big data
Volume | GB | Constantly updated (TB to PB currently)
Data generation rate | Per hour, per day, … | More rapid
Structure | Structured | Semi-structured, unstructured
Data source | Centralized | Fully distributed
Data integration | Easy | Difficult
Data store | RDBMS | HDFS, NoSQL
Access | Interactive | Batch or real time / near real time
Analysis vs Analytics
• Mathematics and statistics
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
• Machine learning
• Artificial intelligence
• Many more…
Big Data
No single standard definition exists.
• “Big Data” is data whose scale, diversity, and complexity require new
architectures, techniques, algorithms, and analytics to manage it and to
extract value and hidden knowledge from it.
• “Big data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.” - The
McKinsey Global Institute, 2012
Big Data pipeline
• Big Data Analytics (BDA) is an interdisciplinary, emerging technology
• BDA is not straightforward
• The term "data pipeline" describes a set of processes that move data
from one place to another. ... Big data pipelines can also use the
same transformations and load data into a variety of repositories,
including relational databases, data lakes, and data warehouses.
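A minimal sketch of such a pipeline (extract, transform, load), using only the Python standard library. The CSV content, the `payments` table, and all function names are assumptions made for illustration; an in-memory SQLite database stands in for the relational destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for a source system.
RAW_CSV = """user_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,USD
"""

def extract(text):
    """Read rows from CSV text (the source end of the pipeline)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with a missing amount; normalise types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["user_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(records, conn):
    """Write the cleaned records into a relational table (the destination)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

def run_pipeline(text, conn):
    """Chain the three stages and report row count and total amount."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone()

conn = sqlite3.connect(":memory:")
result = run_pipeline(RAW_CSV, conn)
print(result)
```

The same `transform` step could feed a data lake or a warehouse instead; only `load` would change.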
Big Data and Data Science
• “… the sexy job in the next 10 years will be statisticians.” Hal Varian, Google Chief
Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million
managers/analysts by 2018 (McKinsey Global Institute, June 2011)
• New Data Science institutes are being created or repurposed: NYU, Columbia,
Washington, UCB, ...
• New degree programs, courses, and boot camps:
• e.g., at Berkeley: Stats, I-School, CS, Astronomy…
• One proposal (elsewhere) for an MS in “Big Data Science”
Data Science - Case study
I - Text Emotions Detection
• People express emotions in many forms, such as facial expressions,
gestures, speech, and text. Detecting emotions in text is a content-based
classification problem and a natural language processing task.
• Detecting a person’s emotions is difficult in general, and detecting them
from written text alone is harder still, because text lacks the cues
carried by the face, gestures, and voice.
Methodology
• A person's emotions are important in applications such as chatbots,
customer support forums, and customer reviews.
• Text → tokens → emotional words → machine learning algorithms
(content-based classification)
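The text → tokens → classifier flow can be sketched as follows. This is an illustration under stated assumptions, not the method of the referenced article: the training sentences and labels are made up, and a small multinomial naive Bayes model is used as one possible classifier. Instead of a separate emotional-word lexicon, the model learns from token counts which words signal each emotion.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny hand-made training set; purely illustrative, not real data.
TRAIN = [
    ("i am so happy and excited today", "joy"),
    ("what a wonderful lovely surprise", "joy"),
    ("i feel sad and lonely tonight", "sadness"),
    ("this loss makes me cry", "sadness"),
    ("i am furious and angry about this", "anger"),
    ("i hate this terrible rude service", "anger"),
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(samples):
    """Count tokens per emotion label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("I am happy", model))
print(predict("so angry and rude", model))
```

A real system would train on a labelled emotion dataset and would usually add stop-word removal and a held-out test set.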
Emotion detection

https://thecleverprogrammer.com/2021/02/19/text-emotions-detection-with-machine-learning/
II Hotel Recommendation System
• A hotel recommendation system aims to predict which hotel a user is
most likely to choose from among all hotels, using customer
reviews and ratings.
• Algorithm/methodology: natural language processing
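One simple way to combine review text and ratings is sketched below. The hotel records, the stop-word list, and the function names are hypothetical; the ranking rule (keyword overlap with the user's request, ties broken by rating) is an assumption chosen for brevity rather than the approach of any particular system.

```python
import re

# Hypothetical hotel records: (name, average rating out of 5, review text).
HOTELS = [
    ("Sea View Inn", 4.6, "quiet rooms near the beach with friendly staff"),
    ("City Central", 4.1, "business hotel close to the airport and metro"),
    ("Budget Stay", 3.2, "cheap rooms near the beach but noisy at night"),
]

STOPWORDS = {"a", "the", "and", "to", "with", "at", "but", "for", "near"}

def keywords(text):
    """Lowercase, tokenize, and drop common stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def recommend(query, hotels, top_n=2):
    """Rank hotels by keyword overlap with the query, ties broken by rating."""
    wanted = keywords(query)
    scored = []
    for name, rating, review in hotels:
        overlap = len(wanted & keywords(review))
        if overlap:
            scored.append((overlap, rating, name))
    scored.sort(reverse=True)
    return [name for _, _, name in scored[:top_n]]

print(recommend("quiet beach hotel for a family", HOTELS))
```

Replacing the overlap count with TF-IDF or embedding similarity would keep the same structure.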
Hotel recommendation

https://thecleverprogrammer.com/2021/02/13/hotel-recommendation-system-with-machine-learning/
III Customer Personality Analysis
• Customer personality analysis helps a business tailor its product
to target customers drawn from different customer
segments. For example, instead of spending money to market a new
product to every customer in the company’s database, a company can
analyse which customer segment is most likely to buy the product
and then market the product only to that segment.
Ask questions
The most important part of a customer personality analysis is getting
the answers to questions such as:
• What people say about your product: this reveals customers’ attitude
towards the product.
• What people do: this reveals how people actually behave, rather than
what they say about your product.
• Methodology: clustering to summarize customer segments + the Apriori
algorithm
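The Apriori step named above can be sketched in a few lines. The baskets are made-up examples and `min_support` is an assumed threshold; the function finds every itemset whose support (fraction of baskets containing it) meets that threshold, growing itemsets one item at a time and pruning candidates that contain an infrequent subset.

```python
from itertools import combinations

# Hypothetical purchase baskets, one set of product categories per customer.
BASKETS = [
    {"wine", "meat", "fruit"},
    {"wine", "meat"},
    {"wine", "fish"},
    {"meat", "fruit"},
    {"wine", "meat", "fish"},
]

def apriori(baskets, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    items = sorted({i for b in baskets for i in b})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        level = {s: support(s) for s in current}
        level = {s: sup for s, sup in level.items() if sup >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

result = apriori(BASKETS, min_support=0.4)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Association rules such as "wine buyers also buy meat" are then derived from these frequent itemsets by comparing supports.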
Know your data
Clustering customers
Based on age, income and seniority
• Stars: Long-standing customers with high income and high spending.
• Need Attention: New customers with below-average income and low
spending.
• High Potential: New customers with high income and high
spending.
• Leaky Bucket: Long-standing customers with below-average income and
low spending.
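A rough sketch of how four such segments could be recovered from data. The customer records are invented, and plain k-means with a deterministic farthest-point start is used here as a stand-in; the slides do not state which clustering algorithm was applied. Features are min-max scaled so that income does not dominate the distance.

```python
import math

# Hypothetical customers: (seniority in months, yearly income, total spend).
CUSTOMERS = [
    (30, 70000, 1300), (28, 68000, 1200),   # long-standing, high income/spend
    (29, 30000, 100),  (31, 32000, 150),    # long-standing, low income/spend
    (5, 72000, 1250),  (6, 69000, 1100),    # new, high income/spend
    (4, 29000, 90),    (7, 31000, 120),     # new, low income/spend
]

def scale(rows):
    """Min-max scale each column to [0, 1] so no feature dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialisation (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

labels = kmeans(scale(CUSTOMERS), k=4)
print(labels)
```

Segment names such as "Stars" or "Leaky Bucket" are then assigned by inspecting each cluster's average seniority, income, and spend.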
Customer personality analysis
The biggest buyers of wine are:
• Customers with an average income of around $69,500.
• Customers with an average total spend of approximately $1,252.
• Customers registered with the company for approximately 21 months.
• Customers with a graduate degree.
• Customers who are also heavy consumers of meat products.

https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/
Data repositories
• UCI machine learning repository
• Kaggle
• Data.world

*List is not exhaustive; students are advised to explore more data
repositories.