6CS030 Big Data
2022/23
Internal Assignment 2
This worksheet is based on MongoDB.
1. There are three JSON exports from Twitter.
You need to analyze just one of the JSON datasets.
First take your student number and divide it by 3. Use the remainder value (modulus) to
pick one of the following worksheets:
Remainder Value | JSON Dataset to use | Dataset Generated From
0               | kathmandu [Link]    | Kathmandu Post
1               | Nepal_Cricket.json  | Nepal Cricket
2               | [Link]              | Nepal Republic Media
For example, if your student number is 1712345, then 1712345 mod 3 = 2
(1712345 = 3 × 570781 + 2), so you would use the [Link] dataset. See the
Remainder spreadsheet if you are not sure how to do this.
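The selection rule can be sketched in Python as follows. This is only an illustration: the dictionary uses the "Dataset Generated From" labels from the table rather than file names, and `pick_dataset` is a hypothetical helper name that does not come from the assignment.

```python
# Map each remainder to the source its dataset was generated from.
# Labels come from the worksheet table; the names here are hypothetical.
DATASET_BY_REMAINDER = {
    0: "Kathmandu Post",
    1: "Nepal Cricket",
    2: "Nepal Republic Media",
}


def pick_dataset(student_number: int) -> str:
    """Return the dataset source for a student number using modulus 3."""
    return DATASET_BY_REMAINDER[student_number % 3]


remainder = 1712345 % 3          # the worksheet's example student number
chosen = pick_dataset(1712345)
print(remainder, chosen)         # prints: 2 Nepal Republic Media
```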
Based on my university ID, I got a remainder value of 0, so I used the
[Link] dataset.
2. Examine your dataset and carry out the following tasks:
Task no Task
a Import the data into your own MongoDB database:
- Show the command to do
Command: use assignment_2
Command: mongoimport --db mydb --collection KathmanduPost --file "//Users/zeus/Downloads/Datasets_89718/[Link]"
- Write a command to show how many documents are in your
collection
There were a total of 550 documents in the collection, as the command below shows.
Command: [Link]()
b Analyze the data
Write command to
- Show one document
Command: [Link]()
- Show the unique values in one field
Each value in the "lang" field is a language code identifying the language
used in the document. The codes mean:
- da - Danish
- en - English
- hi - Hindi
- in - Indonesian
- ne - Nepali
- nl - Dutch
- und - Undefined or unknown language
Command: [Link]("lang")
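What a distinct-values query computes can be sketched in plain Python. This is not the MongoDB call itself: the documents below are hypothetical stand-ins for the collection, and `distinct` is an illustrative helper. The result is sorted here only to make the output easy to read.

```python
# Hypothetical sample documents; only the "lang" field matters here.
docs = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "ne"},
    {"id": 3, "lang": "en"},
    {"id": 4, "lang": "und"},
    {"id": 5},  # a document without the field contributes nothing
]


def distinct(documents, field):
    """Return the unique values of `field`, skipping documents that lack it."""
    return sorted({d[field] for d in documents if field in d})


langs = distinct(docs, "lang")
print(langs)  # prints: ['en', 'ne', 'und']
```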
- Shows a set of documents based on some criteria. Output just
two fields from the document
I selected the documents whose "favorite_count" is greater than 0 and output
just two fields from them, ‘text’ and ‘id’, excluding the default ‘_id’ field.
Command: [Link]({ favorite_count: { $gt: 0 } }, { text: 1, id: 1, _id: 0 })
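The filter-then-project idea behind this query can be sketched in plain Python. The sample documents and the `find` helper are hypothetical and only mimic the behaviour on an in-memory list; they are not part of MongoDB or of the dataset.

```python
# Hypothetical sample documents standing in for tweets.
docs = [
    {"_id": "a", "id": 101, "text": "Hypothetical tweet one", "favorite_count": 3},
    {"_id": "b", "id": 102, "text": "Hypothetical tweet two", "favorite_count": 0},
    {"_id": "c", "id": 103, "text": "Hypothetical tweet three", "favorite_count": 7},
]


def find(documents, predicate, fields):
    """Keep documents matching `predicate` and return only the named fields."""
    return [
        {f: d[f] for f in fields if f in d}
        for d in documents
        if predicate(d)
    ]


# Criteria: favorite_count greater than 0; output only "text" and "id".
liked = find(docs, lambda d: d.get("favorite_count", 0) > 0, ["text", "id"])
for row in liked:
    print(row)
```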
- Use a regular expression to search for some criteria. The
search should be case insensitive
I searched for the word ‘news’ in the ‘text’ field of the documents, and made
sure the search ignores whether the letters are uppercase or lowercase.
Command: [Link]({ text: { $regex: /news/i } })
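The effect of a case-insensitive pattern can be checked with Python's `re` module. This is a sketch on hypothetical sample texts, not the MongoDB query; note that, like an unanchored regular expression, it also matches "news" inside longer words.

```python
import re

# Hypothetical sample documents; only the "text" field is searched.
docs = [
    {"id": 1, "text": "Breaking NEWS from the valley"},
    {"id": 2, "text": "Match report and scores"},
    {"id": 3, "text": "Subscribe to our daily newsletter"},
]

# re.IGNORECASE plays the role of the "i" flag in /news/i.
pattern = re.compile(r"news", re.IGNORECASE)

matches = [d for d in docs if pattern.search(d.get("text", ""))]
matched_ids = [d["id"] for d in matches]
print(matched_ids)  # prints: [1, 3]
```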
c Reshape the collection
Write a command to:
- Update a field within the collection
First, I found the documents in which the “time_zone” field has the value null.
I then set every null “time_zone” value to “unknown”.
Command: [Link]({ time_zone: null })
Command: [Link]({ time_zone: null }, { $set: { time_zone: "unknown" } })
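The update step can be sketched in plain Python on hypothetical documents. One detail worth noting is that a MongoDB filter of the form `{ field: null }` matches documents where the field is null and also documents where the field is missing; the sketch mirrors that with `dict.get`. The helper name `set_unknown_time_zone` is illustrative only.

```python
# Hypothetical sample documents; None stands in for a null value.
docs = [
    {"id": 1, "time_zone": None},
    {"id": 2, "time_zone": "Kathmandu"},
    {"id": 3},  # field missing entirely
]


def set_unknown_time_zone(documents):
    """Set time_zone to "unknown" where it is null or missing; return the count."""
    modified = 0
    for d in documents:
        if d.get("time_zone") is None:  # matches both null and missing
            d["time_zone"] = "unknown"
            modified += 1
    return modified


modified_count = set_unknown_time_zone(docs)
print(modified_count)                    # prints: 2
print([d["time_zone"] for d in docs])    # prints: ['unknown', 'Kathmandu', 'unknown']
```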
- Create a new collection based on a subset of the dataset.
Include a query to show a document from the new collection
This uses an aggregation pipeline: it first matches the documents where
“time_zone” equals “unknown”, then writes the results to a new collection
called “newCollection”.
Command: [Link]([{ $match: { time_zone: "unknown" } }, { $out: "newCollection" }])
Now viewing the details stored in the “newCollection” collection.
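The match-then-write-out step, followed by a look at one document from the result, can be sketched in plain Python. The source documents are hypothetical and `new_collection` is just a list standing in for the new collection; this illustrates the idea rather than the actual aggregation call.

```python
# Hypothetical source documents after the time_zone update.
source = [
    {"id": 1, "time_zone": "unknown", "text": "Hypothetical tweet one"},
    {"id": 2, "time_zone": "Kathmandu", "text": "Hypothetical tweet two"},
    {"id": 3, "time_zone": "unknown", "text": "Hypothetical tweet three"},
]

# Match stage: keep only documents whose time_zone equals "unknown".
# Out stage: store copies of them as a separate collection.
new_collection = [dict(d) for d in source if d.get("time_zone") == "unknown"]

# Show one document from the new collection (None if it is empty).
first = new_collection[0] if new_collection else None
print(len(new_collection))  # prints: 2
print(first)
```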
d Name one advantage to using this approach for handling Big Data
and include a brief explanation of why you think this is an
advantage.
This approach allows data to be processed and reshaped quickly, which is a
distinct advantage in data management. Even with vast amounts of data,
complex tasks such as sorting, grouping, and altering data can be executed
efficiently, which supports better decision-making and helps maintain a
competitive edge.
e Name one disadvantage to using this approach for handling Big Data
and include a brief explanation of why you think this is a
disadvantage.
Managing and enhancing the aggregation process for handling extensive data
volumes can prove challenging. As data scales up, ensuring smooth aggregation
becomes increasingly complex. This necessitates a deep understanding of data
organization and tailored queries to optimize efficiency. Failure to manage these
aspects effectively can lead to process slowdowns across the board.
For this exercise you can either use the Mongo Shell or Python Notebook to carry out the
commands.
Python
If using a Python Notebook you will not be able to use the command to import data within the
notebook, however, you can document what command you ran in your notebook.
You can use the Print Option to create a PDF version of the file and upload it. Do check that it has
printed all the pages and not just the first page (if so, submit the notebook).
Upload
Upload one of the following: A Word Document, ipynb file or PDF version of the Python Notebook
which shows evidence of the above tasks.