Background
This project is built on top of the data challenge that Panoply released in April 2019.
Panoply is a cloud data warehouse that lets you gather data from different data
sources (e.g. AWS S3, Google Analytics) into one place and then connect it
to different business intelligence tools (e.g. Chartio, Mode) for analytics and
insights.
https://towardsdatascience.com/instagram-data-analysis-ce03aa4a472a 1/25
2/5/23, 9:08 PM Instagram Data Analysis Using Panoply and Mode | by Ka Hou Sio | Towards Data Science
Panoply has recently integrated its data warehouse with the Instagram API to collect
data. This challenge is about using Panoply as an ETL tool to explore Instagram data
for marketing use (e.g. promotion, segmentation).
In this challenge, participants are asked to set up a Panoply account and connect it to
their own Instagram data or the provided Instagram data to perform analyses, draw insights
and build visualizations for storytelling. If you have an Instagram account, you can use the
data from your own account. If you prefer to use the data provided by
Panoply, you can choose from 2 accounts:
Then you can use any BI tool of your choice for data visualization. The final
deliverable of this challenge, written in English, consists of data visualization(s)
communicating your findings and the SQL queries that you used.
Project Design
The steps of my project are:
1. Explore Panoply, using the resources it provided for the challenge and the documentation
on its own website.
2. Create a Panoply free trial account, connect to the provided data source, and connect to a BI
tool.
Tools
The tools and technologies that I use in this project are Panoply,
Instagram data, SQL and Mode.
Process
I started by reading the resources that Panoply provided and exploring their website to
better understand what Panoply is and what role it plays in the data analytics process.
Then I created a Panoply free trial account and followed their documentation to
connect to Amazon S3 and collect the provided Instagram data.
1. Go to https://panoply.io.
2. Select the data source you want to connect to; for this project I used Amazon S3.
Enter your credentials and start collecting data from your source into the Panoply data
warehouse.
meaningful insights. Below are the tables and the columns that I have used.
Metrics
After my research, I identified some metrics and questions that an Instagram
account owner would want answered in order to improve their account's awareness, given
the data the API provides.
Performance difference between posts that have a location tag and posts that have none.
Cohort Analysis.
2. Once you are in your account dashboard, click the down arrow on the left, next
to your name.
3. A new drop-down menu will appear; select the Connect a Database tab.
For this project, we used Amazon Redshift because that is the database that Panoply
uses to store our tables. After you enter your credentials and connect to your Panoply database,
Mode will start to collect all the tables into your account database in Mode. Once it is
done, you can start using SQL to analyze your data in Mode.
Analysis
Below are the queries that I used to answer the questions in the Metrics
section.
with t AS
(SELECT
value as hashtag,
likes_count as likes,
comments_count as comments
FROM public.shinestyinstagram_instagram_media m
left JOIN public.shinestyinstagram_instagram_media_tags mt
ON m.id = mt.instagram_media_id
)
select
hashtag,
AVG(likes) as avg_likes,
AVG(comments) as avg_comments
from
t
where hashtag is not null
group by 1
The visual below shows the average likes for each hashtag; #housetonstrong and
#theperfectcrime have the highest average likes.
The visual below shows the average comments for each hashtag used;
#buttstuff and #macrobrews have the highest average comments.
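To make the shape of the hashtag query easier to follow, here is a minimal sketch that runs the same join-and-average logic on toy data. It assumes an in-memory SQLite database in place of Redshift, and the short table names `media` and `media_tags` plus the helper `avg_engagement_by_hashtag` are hypothetical stand-ins, not the names used in the Panoply warehouse.

```python
import sqlite3


def avg_engagement_by_hashtag(media, tags):
    """Average likes and comments per hashtag.

    media: rows of (id, likes_count, comments_count)
    tags:  rows of (instagram_media_id, value)
    Table names here are simplified, hypothetical stand-ins.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE media (id TEXT, likes_count INTEGER, comments_count INTEGER)"
    )
    con.execute("CREATE TABLE media_tags (instagram_media_id TEXT, value TEXT)")
    con.executemany("INSERT INTO media VALUES (?, ?, ?)", media)
    con.executemany("INSERT INTO media_tags VALUES (?, ?)", tags)
    rows = con.execute(
        """
        SELECT mt.value              AS hashtag,
               AVG(m.likes_count)    AS avg_likes,
               AVG(m.comments_count) AS avg_comments
        FROM media m
        LEFT JOIN media_tags mt ON m.id = mt.instagram_media_id
        WHERE mt.value IS NOT NULL   -- drops posts that carry no hashtag
        GROUP BY mt.value
        ORDER BY avg_likes DESC
        """
    ).fetchall()
    con.close()
    return rows


# Toy data: post "b" carries two hashtags, post "c" carries none.
toy_media = [("a", 100, 10), ("b", 300, 30), ("c", 50, 5)]
toy_tags = [("a", "x"), ("b", "x"), ("b", "y")]
print(avg_engagement_by_hashtag(toy_media, toy_tags))
```

Note that a post with several hashtags contributes its full like and comment counts to each of them, so the per-hashtag averages are not independent of one another.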
Performance By Hashtag
SELECT
DATE_TRUNC('week', created_time)::DATE as week,
SUM(comments_count) AS total_comments,
AVG(comments_count) AS avg_comments,
SUM(likes_count) AS total_likes,
AVG(likes_count) AS avg_likes,
count(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
ORDER BY
1
select
TO_CHAR(created_time, 'DY') as day,
COUNT(distinct media_id) AS nums_of_post_got_commented,
COUNT(distinct from_username) AS nums_of_commenter,
ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)),
0) as average_commenter_per_post
from
public.shinestyinstagram_instagram_comments
group by
1
order by
1
The visual below shows that Thursday and Friday are the days when users
comment the most.
select
TO_CHAR(created_time, 'HH24') as hour,
COUNT(distinct media_id) AS nums_of_post_got_commented,
COUNT(distinct from_username) AS nums_of_commenter,
ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)),
0) as average_commenter_per_post
from
public.shinestyinstagram_instagram_comments
group by
1
order by
1
SELECT
TO_CHAR(created_time, 'HH24') as hour,
SUM(comments_count) AS total_comments,
AVG(comments_count) AS avg_comments,
SUM(likes_count) AS total_likes,
AVG(likes_count) AS avg_likes,
count(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
ORDER BY 1
I show this query because I think it is not appropriate to use
this insight to recommend posting on IG between 7am and 4pm.
This query uses the time a post was created to calculate the numbers of
likes and comments. In contrast, the previous visual uses the time a user commented to
calculate performance, which I think is more accurate. So I would
suggest that @shinestythreads post between 11pm and 2am to get more comment
engagement. Unfortunately, the API didn’t provide the same information for likes, which I
would have liked to use to calculate like engagement.
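The comment-time approach can be illustrated with a small sketch that buckets comments by the hour they were written, rather than by the hour the post was created. It assumes SQLite in place of Redshift, so `strftime('%H', ...)` stands in for `TO_CHAR(created_time, 'HH24')`; the table name `comments` and the helper `commenters_by_hour` are hypothetical.

```python
import sqlite3


def commenters_by_hour(comments):
    """Distinct commenters per commented post, grouped by the hour of the comment.

    comments: rows of (media_id, from_username, created_time),
              with created_time as a 'YYYY-MM-DD HH:MM:SS' string.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE comments (media_id TEXT, from_username TEXT, created_time TEXT)"
    )
    con.executemany("INSERT INTO comments VALUES (?, ?, ?)", comments)
    rows = con.execute(
        """
        SELECT strftime('%H', created_time)  AS hour,
               COUNT(DISTINCT media_id)      AS posts_commented,
               COUNT(DISTINCT from_username) AS commenters,
               -- cast before dividing to avoid integer division
               CAST(COUNT(DISTINCT from_username) AS REAL)
                   / COUNT(DISTINCT media_id) AS commenters_per_post
        FROM comments
        GROUP BY hour
        ORDER BY hour
        """
    ).fetchall()
    con.close()
    return rows


toy_comments = [
    ("p1", "u1", "2019-04-01 23:05:00"),
    ("p1", "u2", "2019-04-01 23:40:00"),
    ("p2", "u1", "2019-04-02 09:00:00"),
]
print(commenters_by_hour(toy_comments))
```

Because the hour comes from the comment's own timestamp, a post published in the morning that collects its comments late at night is counted under the late-night hours, which is the behavior the recommendation relies on.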
Performance by Hour
SELECT
type,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments,
COUNT(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
There are three types of media that IG offers right now: image, video and carousel.
The visuals below show that video has the highest average comments and
carousel has the highest average likes. Overall, video appears to be the best media type for
engagement.
SELECT
filter,
type,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments,
COUNT(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1, 2
In the visual below, we can see that the filter named Crema has the best performance, apart from
the Normal filter, for image media.
In the visual below, we can see that the filter named Ashby has the best performance, apart from
the Normal filter, for video media.
For carousel media, we could do the same to get insights, but I didn’t do so here because
@shinestythreads only uses the Normal filter for carousel media.
SELECT
location,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments
FROM
(SELECT
name as location,
m.likes_count,
m.comments_count
FROM
public.shinestyinstagram_instagram_media_location l
LEFT JOIN public.shinestyinstagram_instagram_media m
ON l.instagram_media_id = m.id
) as t
GROUP BY
1
This visual shows that among all the posts that have a location tag, Augusta National
Golf Club has the highest average likes and comments.
Find out the performance difference between posts that have a location tag and posts that have
none:
WITH t AS
(SELECT
m.id,
m.likes_count,
m.comments_count,
l.name as location
FROM
public.shinestyinstagram_instagram_media m
LEFT JOIN
public.shinestyinstagram_instagram_media_location l
ON
m.id = l.instagram_media_id
),
w as
(SELECT
*,
(CASE WHEN location IS NULL THEN 0 ELSE 1 END) AS have_location
FROM t
)
SELECT
have_location,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments
FROM
w
GROUP BY
1
Posts with a location tag have more average likes but slightly fewer average comments
compared to posts that have no location tag.
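The flagging step can be sketched on toy data as follows. This is an illustration under stated assumptions, not the query that produced the result above: it uses SQLite instead of Redshift, folds the two CTEs into one statement, and the table names `media` and `media_location` and the helper `engagement_by_location_flag` are hypothetical.

```python
import sqlite3


def engagement_by_location_flag(media, locations):
    """Compare engagement for posts with and without a location tag.

    media:     rows of (id, likes_count, comments_count)
    locations: rows of (instagram_media_id, name)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE media (id TEXT, likes_count INTEGER, comments_count INTEGER)"
    )
    con.execute("CREATE TABLE media_location (instagram_media_id TEXT, name TEXT)")
    con.executemany("INSERT INTO media VALUES (?, ?, ?)", media)
    con.executemany("INSERT INTO media_location VALUES (?, ?)", locations)
    rows = con.execute(
        """
        SELECT CASE WHEN l.name IS NULL THEN 0 ELSE 1 END AS have_location,
               SUM(m.likes_count)    AS total_likes,
               AVG(m.likes_count)    AS avg_likes,
               SUM(m.comments_count) AS total_comments,
               AVG(m.comments_count) AS avg_comments
        FROM media m
        -- LEFT JOIN keeps posts without a location; their l.name is NULL
        LEFT JOIN media_location l ON m.id = l.instagram_media_id
        GROUP BY have_location
        ORDER BY have_location
        """
    ).fetchall()
    con.close()
    return rows


toy_media = [("a", 100, 10), ("b", 300, 20), ("c", 200, 40)]
toy_locations = [("a", "Somewhere")]
print(engagement_by_location_flag(toy_media, toy_locations))
```

The direction of the join matters here: starting from the media table keeps untagged posts in the result, which is what makes the comparison between the two groups possible.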
SELECT
*
FROM
(SELECT
from_username as username,
COUNT(media_id) as nums_of_comments,
RANK() OVER(ORDER BY nums_of_comments DESC)
FROM
public.shinestyinstagram_instagram_comments
GROUP BY
1
ORDER BY
2 DESC
) as t
WHERE
rank >1 and rank <=15
This visual shows the most active commenters (users), not including the user
@shinestythreads.
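The query above filters on `rank > 1`, which leaves the account owner out only if that account holds rank 1 on its own. As a hedged alternative sketch, the same ranking can be done by excluding the account by name; the helper `top_commenters` and its parameters are hypothetical, and the ranking follows the usual SQL `RANK()` semantics, where tied rows share a rank and the next rank skips ahead.

```python
from collections import Counter


def top_commenters(usernames, exclude, limit=14):
    """Rank commenters by number of comments, skipping one account by name.

    usernames: one entry per comment (the commenter's username)
    exclude:   username to leave out, e.g. the account owner
    limit:     keep rows whose rank is <= limit
    Returns a list of (username, comment_count, rank).
    """
    counts = Counter(u for u in usernames if u != exclude)
    # Sort by count descending, then by name for a stable, readable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    for position, (user, n) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == n:
            rank = ranked[-1][2]  # tie: share the rank of the previous row
        else:
            rank = position       # no tie: rank equals the row position
        ranked.append((user, n, rank))
    return [row for row in ranked if row[2] <= limit]


toy = ["shop", "shop", "shop", "ann", "ann", "bob", "bob", "cy"]
print(top_commenters(toy, exclude="shop"))
```

Filtering by name keeps the result correct even when another user ties with, or out-comments, the account owner.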
Cohort Analysis:
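with t AS
(select
media_id,
from_username as username,
-- assumed: the lines from here down to "w AS" are missing in the source
DATE_TRUNC('week', created_time) as week
from
public.shinestyinstagram_instagram_comments
),
w AS
(select
username,
min(week) as first_time_commenting
from
t
GROUP by
1
)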
SELECT
x.cohort::DATE AS week,
MAX(x.week_number) OVER (PARTITION BY x.cohort) AS total_nums_of_week,
x.week_number,
MAX(x.nums_of_commenter) OVER (PARTITION BY x.cohort) AS
nums_of_new_commenter,
x.nums_of_commenter,
x.nums_of_commenter/MAX(x.nums_of_commenter) OVER (PARTITION BY
x.cohort)::FLOAT AS retention_rate
FROM
(SELECT
w.first_time_commenting as cohort,
FLOOR(EXTRACT('day' FROM t.week - w.first_time_commenting)/7) AS
week_number,
COUNT(DISTINCT t.username) AS nums_of_commenter
FROM
t t
LEFT JOIN
w w
ON
t.username = w.username
GROUP BY
1,2) as x
ORDER BY 1,2,3
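The cohort logic can be easier to follow in plain code. The sketch below is an illustration on toy data, not the query above: each user is assigned to the week of their first comment, and for every later week we compute the share of that cohort that commented again. The helpers `week_start` and `cohort_retention` are hypothetical, and Monday-based weeks are an assumption of this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d (assumed week boundary)."""
    return d - timedelta(days=d.weekday())


def cohort_retention(comments):
    """Weekly retention of commenters by cohort.

    comments: iterable of (username, date of comment)
    Returns {(cohort_week, week_number): retention_rate}.
    """
    comments = list(comments)

    # Cohort = week of each user's first comment.
    first_week = {}
    for user, day in comments:
        wk = week_start(day)
        if user not in first_week or wk < first_week[user]:
            first_week[user] = wk

    # Distinct users active in each (cohort, weeks since first comment).
    active = defaultdict(set)
    for user, day in comments:
        cohort = first_week[user]
        week_number = (week_start(day) - cohort).days // 7
        active[(cohort, week_number)].add(user)

    # Every user is active in week 0 of their own cohort,
    # so the week-0 count is the cohort size.
    return {
        key: len(users) / len(active[(key[0], 0)])
        for key, users in active.items()
    }


toy = [
    ("ann", date(2019, 4, 1)),
    ("bob", date(2019, 4, 3)),
    ("ann", date(2019, 4, 9)),
    ("cy", date(2019, 4, 10)),
]
for key, rate in sorted(cohort_retention(toy).items()):
    print(key, rate)
```

Using the week-0 count as the cohort size plays the same role as taking the maximum number of commenters per cohort in the query, since no later week can have more distinct users from a cohort than its first week.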
Above are all the analyses I did for the project. I didn’t upload the data I used
to my GitHub for confidentiality reasons. If you are interested in
the resources I used for this project, please visit my GitHub repo.
If you have any questions, feel free to comment below. Thank you so much for reading!