You are on page 1of 12

1

Big Data

Student’s Name

Date
2

Big Data

MongoDB exercise 3
Question 1. How many total games in the dataset did the Los Angeles Lakers play?
```javascript
db.nba_games.find({
$or: [{ home_team: "Los Angeles Lakers" }, { away_team: "Los Angeles Lakers" }]
}).count()
```
Question 2. How many games between January 1, 2000, and December 31, 2009, did the
Chicago Bulls play against the Detroit Pistons?
```javascript
db.nba_games.find({
date: {
$gte: ISODate("2000-01-01"),
$lte: ISODate("2009-12-31")
},
$or: [
{ home_team: "Chicago Bulls", away_team: "Detroit Pistons" },
{ home_team: "Detroit Pistons", away_team: "Chicago Bulls" }
]
}).count()
```

Question 3. Who played in more NBA games in this dataset, Karl Malone or John
Stockton? (Using separate queries)
This requires us to do two searches and then compare the outcomes to get the solution.
```javascript
const karlMaloneGames = db.nba_games.find({
$or: [
{ home_team: "Utah Jazz" },
{ away_team: "Utah Jazz" },
{ home_team: "Los Angeles Lakers" },
3

{ away_team: "Los Angeles Lakers" }


]
}).count();

const johnStocktonGames = db.nba_games.find({


$or: [
{ home_team: "Utah Jazz" },
{ away_team: "Utah Jazz" }
]
}).count();

if (karlMaloneGames > johnStocktonGames) {


print("Karl Malone played in more games.");
} else if (johnStocktonGames > karlMaloneGames) {
print("John Stockton played in more games.");
} else {
print("Karl Malone and John Stockton played in the same number of games.");
}
```

Question 4. How many NBA games did the Minnesota Timberwolves play at home
between March 1, 2008, and February 5, 2010?
```javascript
db.nba_games.find({
home_team: "Minnesota Timberwolves",
date: {
$gte: ISODate("2008-03-01"),
$lte: ISODate("2010-02-05")
}
}).count()
```
4

Question 5. List all games played in 2003 in which the winning team scored more than 125
points.
```javascript
db.nba_games.find({
date: {
$gte: ISODate("2003-01-01"),
$lte: ISODate("2003-12-31")
},
"winning_score": { $gt: 125 }
})
```

Question 6. For each game that the New York Knicks played in 2011, list the following:
i. the two teams that played in the game (e.g. New York Knicks and Boston Celtics)
ii. The final score for the game
iii. The date of the game
```javascript
db.nba_games.find({
date: {
$gte: ISODate("2011-01-01"),
$lte: ISODate("2011-12-31")
},
$or: [{ home_team: "New York Knicks" }, { away_team: "New York Knicks" }]
}, {
_id: 0,
home_team: 1,
away_team: 1,
home_score: 1,
away_score: 1,
date: 1
})
```
5

Question 7. List all of the teams that the Boston Celtics played between February 1,
2003, and March 27, 2009, when the Celtics were the home team.
```javascript
db.nba_games.distinct("away_team", {
home_team: "Boston Celtics",
date: {
$gte: ISODate("2003-02-01"),
$lte: ISODate("2009-03-27")
}
})
```
Exercise 4
If there is a folder called "games" that contains documents about NBA matches, and each of
those documents has a "scores" column that has the results of each game, and a
"season_start" field that contains the date when the NBA season officially begins, then you
may accomplish the following:
```javascript
db.games.aggregate([
// Compare the 2002-2003 records against the current records
{
$match: {
season_start: {
$gte: new Date("2002-07-01"),
$lte: new Date("2003-06-30")
}
}
},
// Reverse the scores matrix to get data for each game
{
$unwind: "$scores"
},
// Sort the results by team and tally the points
6

{
$group: {
_id: "$scores.team",
totalPoints: { $sum: "$scores.points" }
}
},
// Order from highest point total to lowest (from highest performing team to lowest)
{
$sort: { totalPoints: -1 }
},
// Plan to rearrange the sequence of things
{
$project: {
_id: 1,
totalPoints: 1
}
},
// Rearrange by highest to lowest scoring team.
{
$sort: { totalPoints: 1 }
},
// Just use the first outcome (the team with the fewest points)
{
$limit: 1
}
]);
```

Breaking down the tasks into their component parts:


1. The code given automatically arranges the list from best to worst in terms of total points.
7

2. We switch the order of sorting to go from lowest-scoring team to highest by adding the
'$project' phase after the first '$sort'.
3. The '$match' phase uses a date range to filter the documents and locate the 2002-2003
campaign.
4. To determine which team had the greatest point average during the 2002-2003 season,
replace the '$sum' aggregation operator with '$avg'. Adjust the following in the '$group'
phase:

```javascript
{
$group: {
_id: "$scores.team",
avgPointsPerGame: { $avg: "$scores.points" }
}
}
```

5. To sum the point totals for each player, you'll need to `$unwind` the arrays within the
`box.players` subdocument. Here's how you can modify the query:

```javascript
db.games.aggregate([
{
$match: {
season_start: {
$gte: new Date("2002-07-01"),
$lte: new Date("2003-06-30")
}
}
},
{
$unwind: "$scores"
},
{
8

$unwind: "$box.players"
},
{
$unwind: "$box.players.stats"
},
{
$group: {
_id: "$box.players.player_name",
totalPoints: { $sum: "$box.players.stats.points" }
}
},
{
$sort: { totalPoints: -1 }
}
]);
```
In order to tally up the scoring totals for every player during the 2002-2003 period, this query
first unravels the'scores' array followed by recursively unravels the nested arrays inside
'box.players'.

Exercise - Machine Learning problem framing


Problem Statement: Predicting Housing Prices
Description: Let's say you're trying to estimate the selling price of homes for a real estate
company. The number of rooms, baths, square footage, location, and selling price of
previously sold homes are all at your disposal. Your goal is to create a machine learning
model that can utilize these characteristics to estimate a home's market value.
Solving this issue using a spreadsheet:
Collect Information Gathering home sales records is usually the first step. Let's pretend you
have a data collection with the following variables for the sake of this example:
Identification number of the home;
 bedrooms;
 bathrooms;
 square footage;
 location;
 asking price
9

Preparing the Data: It's possible that the data has to be prepped and cleaned. Examples
include normalizing numerical characteristics, dealing with missing values, and encoding
categorical variables like "neighbourhood."
Third, you can always add more functionality by using feature engineering. You might, for
instance, develop a function that displays the price per square foot or a flag that indicates
whether or not the home includes a garage or a swimming pool.
To train and test hypotheses, you need divide your data set in two. Your machine learning
model may be trained on one subset and tested on another.
Create a Model: A basic regression model may be created in Excel by using the built-in
functions and formulae. If you want to utilize characteristics to anticipate prices, you might,
for instance, use the LINEST function to build a linear regression model.
Use the validation dataset to evaluate your model's accuracy. 6. The accuracy of your model's
home price forecasts may be evaluated using several metrics such as Mean Absolute Error
(MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
Now that you have a trained model, you can use it to generate predictions about upcoming
listings of homes for sale.
This issue may be represented in a simple spreadsheet as follows:
House Bedrooms Bathroom Square Neighbourhood Selling Predicted
ID s Footage Price Price
1 3 2 1800 Suburb A 250,000 [Prediction]
2 4 3 2200 Suburb B 320,000 [Prediction]
3 2 1 1200 Suburb A 175,000 [Prediction]

Working with High Velocity Data exercise


Question 1
Putting Google's Data to Use in Driverless Vehicles:
a. Google's extensive database of maps, street-view pictures, and geotagged data may be
used to facilitate and enhance the company's autonomous vehicles in the manners
listed below:
b. A high-resolution map is essential for autonomous navigation, and Google Maps can
give just that. The car can make educated judgments thanks to these maps that include
road designs, markings for lanes, and traffic signals.
c. Baseline data for actual time visual identification may be found in Street-View
photos. They aid in object recognition, allowing the car to better understand its
surroundings and react accordingly.
d. Google can improve route planning and respond to changing traffic circumstances
with the use of immediate traffic information from Google Maps.
e. Geospatial Data: Geospatial data, such as elevation, weather, and road quality, may be
part of location-tagged data. With this information, we can better plan our routes and
adjust to changing road conditions.
10

The car's ability to find locations and deliver services to passengers may be improved by
including data about nearby businesses and sites of interest.

Question 2
Comparison of Web and Car Data Use:
Many of the same data sources utilized for autonomous cars may be used for web-based apps,
but there are some important distinctions to keep in mind.
a. Maps as well as Street-View: Maps serve the same purpose for navigation in both
settings. In contrast to autonomous cars, where maps are actively employed for
decision-making and localisation, consumers mostly see maps in web-based apps.
b. Street-View Pictures: These pictures are used for visual exploration in online apps.
These photos are utilized for real-time object detection and context comprehension in
autonomous cars.
c. Traffic Data: Web and autonomous vehicles utilize traffic data to improve their
routes. To guarantee safe driving, however, autonomous cars would need to respond
to changing traffic circumstances in real time.
d. Geospatial Data: Whereas geospatial data may be utilized for services that use
location in online applications, in autonomous cars, it impacts choices like making
speed changes depending on elevation and dealing with poor weather.
e. Local Organization Data: This information is utilized by online apps to aid users in
locating local businesses. It might be used to provide recommendations to passengers
or to speed up deliveries made by autonomous cars.
Question 3.
Sorting by Time Period:
Different data streams and jobs may be categorized according to the following timescales:
a. Tasks that need to be completed immediately, like collision avoidance, and data
sources that update in real time, such as traffic data and obstacle recognition, belong
here.
b. Timescales close to real-time are used for data sources like real-time maps and traffic
forecasts and for activities like optimizing routes, which need frequent but not
instantaneous changes.
c. Timescales on the order of days or weeks may be used to refine the vehicle's
behaviour based on prior driving experiences and information on the state of the
roads.
d. Maintaining long-term dependability requires incorporating data on road repairs and
modifications to infrastructure over a period of months.
e. Consideration of urban planning, transportation management, or significant
infrastructure modifications using data collected over many years at once is an
example of the batch timeframe.
Google's autonomous driving software must accommodate data from several epochs without
a hitch if it is to keep passengers and drivers safe and productive on the road.

Exercise - Graph Data Science Library analytics (optional advanced exercise)


11

Question 1
Some neo4j analytic graph algorithms require the extraction of a projection because:
a. Simplifying the original network, which may have been too complicated to apply
specific algorithms due to the presence of different node types and interactions. The
network is made more manageable for examination by extracting a projection, which
isolates the nodes and interactions that are most important to the topic at hand.
b. Execution Time: Computationally-intensive analytic graph algorithms are a common
problem. The efficiency of these methods may be greatly enhanced by extracting a
projection, which generates a subgraph of much lower size and complexity.
Question 2
Collection of Community Detection Algorithms for Graph Data Science in Neo4j:
a. Louvain Using the idea of modularity, this technique finds groups or subgraphs inside
a graph. Social network analysis is only one area where this method may be used to
help find clusters of people that have common traits.
b. Label Propagation is a community discovery approach that uses the connections
between nodes to distribute labels to those nodes. It may be used to do things like find
clusters of related articles or isolate certain functions in biological networks.
Question 3
Algorithm for Centrality Betweenness:
a. To identify hubs that link otherwise isolated elements of a network, the Betweenness
Centrality measure was developed. Number of shortest pathways that go via each
node is determined.
b. Transportation networks are one real-world example of where Betweenness Centrality
may be useful. A city's travel efficiency may be greatly impacted if major
thoroughfares or public transit nodes are closed.
Question 4
To find the Jaccard Score:
A pair of sets' Jaccard Similarities Score is determined by dividing their intersection size by
their union size. The Node Similarity technique in Neo4j uses this to determine the degree to
which two nodes have similar sets of neighbours.
It can be shown mathematically that the Jaccard Similarities Score (J) between two nodes A
and B is:
J(A, B) = (Union of A's neighbours and B's neighbours) / (A's neighbours’ and B's)
intersection
Here, "neighbours(A)" stands for the collection of nodes that are linked to the node A,
whereas "neighbours(B)" stands for the collection of nodes that link to node B. The Jaccard
Similarities Score calculates the degree to which these sets overlap, therefore indicating the
degree of similarity within the individual nodes.
Reference
12

Chaudhary, K., & Gupta, M. (2019, January). Analyzing IPL dataset with MongoDB. In 2019 9th
International Conference on Cloud Computing, Data Science & Engineering (Confluence) (pp. 212-
216). IEEE.

You might also like