Professional Documents
Culture Documents
Coursera Worksheet
1. Profile the data by finding the total number of records for each of
the tables below:
2. Find the total distinct records by either the foreign key or primary
key for each table. If two foreign keys are listed in the table, please
specify which foreign key.
Note: Primary Keys are denoted in the ER-Diagram with a yellow key icon.
3. Are there any columns with null values in the Users table? Indicate
"yes," or "no."
Answer: No
select count(*)
from user
where id is null or name is null or review_count is null or
yelping_since is null or
useful is null or funny is null or cool is null or fans is null or
average_stars is null
or compliment_hot is null or compliment_more is null or
compliment_profile is null or
compliment_cute is null or compliment_list is null or
compliment_note is null or compliment_plain is null
or compliment_cool is null or compliment_funny is null or
compliment_writer is null or compliment_photos is null;
4. For each table and column listed below, display the smallest
(minimum), largest (maximum), and average (mean) value for the following
fields:
i. Avon
Copy and Paste the Resulting Table Below (2 columns – star rating
and count):
+-------+-------+
| stars | count |
+-------+-------+
| 3.5 | 3 |
| 2.5 | 2 |
| 4.0 | 2 |
| 1.5 | 1 |
| 4.5 | 1 |
| 5.0 | 1 |
+-------+-------+
ii. Beachwood
Copy and Paste the Resulting Table Below (2 columns – star rating
and count):
+-------+-------+
| stars | count |
+-------+-------+
| 5.0 | 5 |
| 3.0 | 2 |
| 3.5 | 2 |
| 4.5 | 2 |
| 2.0 | 1 |
| 2.5 | 1 |
| 4.0 | 1 |
+-------+-------+
select Name,review_count
from user
order by review_count desc
limit 3;
+--------+--------------+
| name | review_count |
+--------+--------------+
| Gerald | 2000 |
| Sara | 1629 |
| Yuri | 1339 |
+--------+--------------+
A simple data pull with review_counts and fans show that there is
no strong correlation between the two as there are quite a few cases
where in the fan following is high
despite their reviews being lower and vice-versa. For Example, Amy
has the most fans but her review_count is not as high as Lisa, bernice
has a high review_count but
fan follwing is lower.
+-----------+--------------+------+
| name | review_count | fans |
+-----------+--------------+------+
| Amy | 609 | 503 |
| Mimi | 968 | 497 |
| Harald | 1153 | 311 |
| Gerald | 2000 | 253 |
| Christine | 930 | 173 |
| Lisa | 813 | 159 |
| Cat | 377 | 133 |
| William | 1215 | 126 |
| Fran | 862 | 124 |
| Lissa | 834 | 120 |
| Mark | 861 | 115 |
| Tiffany | 408 | 111 |
| bernice | 255 | 105 |
| Roanna | 1039 | 104 |
| Angela | 694 | 101 |
| .Hon | 1246 | 101 |
| Ben | 307 | 96 |
| Linda | 584 | 89 |
| Christina | 842 | 85 |
| Jessica | 220 | 84 |
| Greg | 408 | 81 |
| Nieves | 178 | 80 |
| Sui | 754 | 78 |
| Yuri | 1339 | 76 |
| Nicole | 161 | 73 |
+-----------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)
9. Are there more reviews with the word "love" or with the word "hate" in
them?
Answer: There are more reviews with the word love than hate,
+------------+------------+
| love_count | hate_count |
+------------+------------+
| 1780 | 232 |
+------------+------------+
1. Pick one city and category of your choice and group the businesses in
that city or category by their overall star rating.
Compare the businesses with 2-3 stars to the businesses with 4-5
stars and answer the following questions. Include your code.
ii. Do the two groups you chose to analyze have a different number of
reviews?
Business with stars 4-5 have more reviews on an average than the
other group.
iii. Are you able to infer anything from the location data provided
between these two groups? Explain.
The Postal codes indicate that businesses with Higher rating are
closer to the city center than other.
2. Group business based on the ones that are open and the ones that are
closed. What differences can you find between the ones that are still
open and the ones that are closed? List at least two differences and the
SQL code you used to arrive at your answer.
i. Difference 1:
Closed business had very less reviews compared to the ones that
are open
ii. Difference 2:
Closed business had much less checkins compared to open
businesses
select is_open,
count(distinct b.id) as total_reviews,
sum(c.count) as checkin_counts
from business a
left join review b
on (a.id=b.business_id)
left join checkin c
on (a.id=c.business_id)
group by 1;
+---------+---------------+----------------+
| is_open | total_reviews | checkin_counts |
+---------+---------------+----------------+
| 0 | 71 | 15 |
| 1 | 565 | 823 |
+---------+---------------+----------------+
3. For this last part of your analysis, you are going to choose the type
of analysis you want to conduct on the Yelp dataset and are going to
prepare the data for analysis.
Ideas for analysis include: Parsing out keywords and business attributes
for sentiment analysis, clustering businesses to find commonalities or
anomalies between them, predicting the overall star rating for a
business, predicting the number of fans a user will have, and so on.
These are just a few examples to get you started, so feel free to be
creative and come up with your own problem you want to solve. Provide
answers, in-line, to all of the following:
I'd try to think on the lines of how well we can predict a user's fan
following based on his/her attributes such as reviews, friends, elite
years etc.
ii. Write 1-2 brief paragraphs on the type of data you will need for your
analysis and why you chose that data:
+------------------------+--------------+---------------+--------+-------
-+--------+---------+-------------+------+------+
| id | review_count | years_yelping | useful | funny
| cool | friends | elite_years | tips | fans |
+------------------------+--------------+---------------+--------+-------
-+--------+---------+-------------+------+------+
| ---1lKK3aKOuomHnwAkAow | 245 | 13 | 67 | 22
| 9 | None | None | None | 15 |
| ---udAKDsn0yQXmzbWQNSw | 43 | 6 | 1 | 0
| 0 | None | None | None | 1 |
| --0kuuLmuYBe3Rmu0Iycww | 26 | 10 | 10 | 2
| 0 | None | None | None | 2 |
| --2vR0DIsmQ6WfcSzKWigw | 1153 | 8 | 122921 | 122419
| 122890 | None | None | None | 311 |
| --3l8wysfp49Z2TLnyT0vg | 111 | 7 | 97 | 57
| 32 | None | None | None | 2 |
| --3WaS23LcIXtxyFULJHTA | 213 | 10 | 63 | 6
| 2 | None | None | None | 10 |
| --41c9Tl0C9OGewIR7Qyzg | 239 | 9 | 64 | 15
| 3 | None | None | None | 23 |
| --4q8EyqThydQm-eKZpS-A | 400 | 12 | 405 | 313
| 72 | None | None | None | 23 |
| --56mD0sm1eOogphi2FFLw | 211 | 10 | 3111 | 3164
| 3168 | None | None | None | 39 |
| --5Dnq-kuMq5bqvlgDKORQ | 7 | 9 | 17 | 4
| 4 | None | None | None | 4 |
| --66hzx80CeVZcrm4AKJtQ | 19 | 11 | 1 | 0
| 0 | None | None | None | 1 |
| --6CV8BPNofy7jt1JavD-g | 32 | 9 | 0 | 1
| 0 | None | None | None | 2 |
| --6Ke7_lBBM6XAramtPoWw | 15 | 11 | 19 | 1
| 0 | None | None | None | 6 |
| --8g9UaBe0xQ4FD0q34h_A | 55 | 11 | 2 | 2
| 1 | None | None | None | 1 |
| --BumyUHiO_7YsHurb9Hkw | 38 | 3 | 0 | 0
| 0 | None | 3 | None | 1 |
| --CIuK7sUpaNzalLAlHJKA | 188 | 9 | 2 | 1
| 1 | None | None | None | 2 |
| --cPqjzKHqHKmGala65zwg | 138 | 8 | 16 | 9
| 4 | None | None | None | 11 |
| --eQVss9nAx54FWsZHZgpA | 15 | 10 | 26 | 0
| 2 | None | None | None | 3 |
| --EVSb3jbKVL3WJ5NUCuCA | 26 | 8 | 0 | 0
| 0 | None | None | None | 1 |
| --E_9aMNd1pgJFxe-oJ1OQ | 10 | 9 | 2 | 0
| 0 | None | None | None | 1 |
| --F69eR3LsY-6Wl_YlOGWw | 3 | 10 | 2 | 0
| 0 | None | None | None | 1 |
| --fF_pQlaU9sME-HLCoHlQ | 18 | 8 | 11 | 6
| 7 | None | None | None | 4 |
| --FR2eTGLEzdLuGb57Tsfw | 5 | 5 | 1 | 0
| 0 | None | None | None | 1 |
| --iYUTSkH-LjfQt9EN8Nnw | 10 | 9 | 25 | 0
| 0 | None | None | None | 1 |
| --J3HPoNe-IJ0xE10Z_sDg | 134 | 8 | 9 | 0
| 1 | None | None | None | 4 |
+------------------------+--------------+---------------+--------+-------
-+--------+---------+-------------+------+------+
(Output limit exceeded, 25 of 2357 total rows shown)
iv. Provide the SQL code you used to create your final dataset: