Assignment on "Exploring Public Datasets" Subject: Big Data Technologies

ASSIGNMENT ON "EXPLORING PUBLIC DATASETS"
SUBMITTED TO: PROF. MOHAN KRISHNA
SUBMITTED BY: GAUNEKAR MANGUIRISH EKNATH, 19138
SUBJECT: BIG DATA TECHNOLOGIES
DOMAIN CHOSEN: MARKETING
Data set 1: Consumer reviews of Amazon products
• This is a list of over 34,000 consumer reviews for Amazon products like the Kindle, Fire TV Stick, and more, drawn from a product database. The dataset includes basic product information, rating, review text, and more for each product.
• Rationale: the dataset was chosen to understand how Amazon uses its customer insights and to understand the ratings of products sold by Amazon.
• Problem statement: to use this data to analyze Amazon's most successful consumer electronics product launches, discover insights into consumer reviews, and assist with machine learning models. E.g.:
• What are the most reviewed Amazon products?
• What are the initial and current number of customer reviews for each product?
• How do the reviews in the first 90 days after a product launch compare to the price of the product?
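The first question above (the most reviewed products) can be sketched as a simple count per product. This is a minimal illustration only: the rows and the field names `product` and `rating` are made-up assumptions, not the actual columns or values of the dataset.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset's column names and values may differ.
reviews = [
    {"product": "Kindle", "rating": 5},
    {"product": "Fire TV Stick", "rating": 4},
    {"product": "Kindle", "rating": 4},
    {"product": "Fire Tablet", "rating": 3},
    {"product": "Kindle", "rating": 5},
    {"product": "Fire TV Stick", "rating": 5},
]


def most_reviewed(rows, top_n=3):
    """Return (product, review_count) pairs, most reviewed first."""
    counts = Counter(row["product"] for row in rows)
    return counts.most_common(top_n)


top_products = most_reviewed(reviews)
print(top_products)
```

On the full dataset the same count would probably be run with a big data tool rather than in memory; the logic of grouping reviews by product and ranking by count stays the same.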
continued
3Vs:
• Volume: the data consists of around 3000 data points which can be analysed.
• Variety: the data consists of both categorical and quantitative variables. It is a well-structured dataset with proper segments for reviews, ratings, type of item, and text-based reviews.
• Velocity: the velocity of this type of data is very high, as people buy regularly and the reviews change at a very fast pace on online sites such as Amazon.
Challenges:
• The data set should be compatible with big data tools.
• Generating appropriate, correct, and meaningful insights from the dataset is a challenge.
Potential insights:
• The main insight from analysing the dataset will be to understand how Amazon utilises its customer ratings and feedback to set its goals.
Data set 2: Start-ups in India
• The data comprises 816 start-up companies across India.
• The rationale behind selecting this dataset was to understand how the start-up market in India has changed over the years and how it will take shape in the future.
• Problem statement: to determine whether India is a suitable market for start-ups and, if it is, which industry sector and city are most suitable for entering a start-up venture.
• The data consists of both quantitative and categorical variables.
Continued…..
3Vs:
• Volume: the size of the data is 8.55 MB. The dataset consists of 816 data points of start-up companies across India.
• Variety: the data is well structured, with both quantitative and categorical variables, including the operating status and city of the start-ups in India.
• Velocity: as the market is changing day by day, the data on start-ups is also bound to change. It therefore has a high velocity, as data regarding new start-ups will be added frequently.
Challenges:
• It will be a challenge to use the data in the future, as the data about the start-ups changes rapidly.
Potential insights:
• By analysing the data we will come to know the present state of the start-up market in India.
• We will come to know which cities and industries hold an upper hand in the start-up market.
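The city and industry comparison described above can be sketched as a grouped count plus the share of start-ups still operating in each group. This is only a sketch under stated assumptions: the rows and the field names `city`, `industry`, and `status` are hypothetical and are not taken from the dataset itself.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset's columns and values may differ.
startups = [
    {"city": "Bengaluru", "industry": "Fintech", "status": "operating"},
    {"city": "Mumbai", "industry": "E-commerce", "status": "closed"},
    {"city": "Bengaluru", "industry": "E-commerce", "status": "operating"},
    {"city": "Delhi", "industry": "Fintech", "status": "operating"},
    {"city": "Bengaluru", "industry": "Fintech", "status": "closed"},
]


def count_by(rows, key):
    """Count how many start-ups fall into each value of `key`."""
    return Counter(row[key] for row in rows)


def operating_share(rows, key):
    """Fraction of start-ups in each group whose status is 'operating'."""
    totals = Counter()
    operating = Counter()
    for row in rows:
        totals[row[key]] += 1
        if row["status"] == "operating":
            operating[row[key]] += 1
    return {group: operating[group] / totals[group] for group in totals}


by_city = count_by(startups, "city")
share_by_industry = operating_share(startups, "industry")
print(by_city.most_common())
print(share_by_industry)
```

Calling the same two helpers with `"city"` or `"industry"` as the key gives both views without duplicating the grouping logic.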
Data set 3: Concerns housing values in suburbs of Boston
• The dataset addresses the housing values and their dependencies in the Boston suburbs in tracking the air
• The per capita income of the town can be taken as the dependent variable.
• It consists of many variables, such as average rooms, nitric oxide concentration of the area, distance between employment centres, etc.
• From this data set we can learn on what basis people in Boston select their homes.
• The purpose behind choosing this data set was to understand the nature of house buying and the factors it depends on, which can be applied elsewhere.
• Problem statement: whether Boston is a better place for residential setup and, if it is, what the key factors are that determine the selection procedure for buying.
• By analysing the data we can get to know and guide new buyers in Boston.
continued…..
3Vs:
• Volume: the data set consists of 510 responses which can be analysed.
• Variety: most of the variables in the dataset are quantitative. The data set is well structured, with proper notation and values for the variables, and some variables are encoded as dummies.
• Velocity: the velocity of this data is relatively low, as data in this field is generated from time to time rather than frequently.
Challenges:
• To use the data with big data tools and generate insights.
• To remove duplication of data which may occur during analysis.
Potential insights:
• We will get to know the buying behaviour of people purchasing a house in Boston.
• With this analysis we can guide new and potential buyers in the Boston area.
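The duplicate-removal challenge and the search for factors that drive housing values can be sketched together: drop repeated rows first, then measure how strongly one factor moves with the house value. This is a minimal sketch, not the method used in the assignment. The rows are made-up (average rooms, house value) pairs, and Pearson correlation is just one assumed way to gauge dependence.

```python
from math import sqrt

# Made-up (average rooms, house value) pairs for illustration only;
# they are not values from the dataset. One row is deliberately repeated.
rows = [(6.5, 24.0), (6.4, 21.6), (7.1, 34.7), (6.5, 24.0), (5.9, 18.0)]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


unique_rows = drop_duplicates(rows)
rooms = [r[0] for r in unique_rows]
values = [r[1] for r in unique_rows]
r_value = pearson(rooms, values)
print(len(rows), len(unique_rows), round(r_value, 3))
```

A coefficient near 1 or -1 suggests the factor moves closely with house value, while a value near 0 suggests little linear relationship. Repeating the calculation for each variable gives a rough ranking of candidate factors.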
Thank you
