Replica of Youtube Analysis
Scraping data is not easy, especially on social media sites, which have already issued regulations to protect users' data. We did try with Facebook and Twitter. If you can scrape the data through some sort of API, it is usually a third-party API that is not officially provided by the site itself. To be exact: yes, we CAN scrape data, but not really transparently. We can only take home the data that the social media sites choose to send us, the amount of data we can scrape is very limited, and one of our members was nearly banned from Facebook for what they did with Facebook data.
So we chose Youtube: not exactly a social media platform, but a video site that is open for developers to access its data and contains a huge number of videos. By collecting how users engage with videos, we can understand the behavior and patterns of Youtube users.
## Our purpose
By collecting data, we will build a dataset to train a machine learning model (the scope of work of the final project) to predict the view count of a certain title, or to calculate how much a certain title (that we type in) will contribute to how viral a video becomes; alternatively, we may do sentiment analysis on the text data.
For now we only need to collect the data and clean it for training the model, so what we have is just a pile of input datasets, and that pile will be used for training. As our lecturer put it, "do whatever you want, as long as you end up with one chunk of cleaned data ready to run a model on." Running a model, however, has two parts, training and testing, so we will scrape additional data to test the model later.
That is the scope of this midterm, so we only do this much here; at the end of the term we will go much further.
• Create your own Youtube Data API v3 key in the Google Console for Developers, because the API key listed below belongs to us and is about to exceed its allowed quota. Given the size of the data we scrape, make a new API key every time you rerun the code.
• The step that fetches video data will take a lot of time, because as you can see later in our submitted .csv file of all video data, it has more than 100 thousand rows, and the file's size is 43.4 MB (which is really large in comparison with most of the non-video datasets we found on Kaggle).
!pip install google-api-python-client
# In Google Colab, packages are not saved and built into the system as
# they are in PyCharm, so we need to install them separately before the
# import steps.
Requirement already satisfied: google-api-python-client in /usr/local/lib/python3.10/dist-packages (2.84.0)
Requirement already satisfied: httplib2<1dev,>=0.15.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (0.22.0)
Requirement already satisfied: google-auth<3.0.0dev,>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (2.17.3)
Requirement already satisfied: google-auth-httplib2>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (0.1.0)
Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (2.11.1)
Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (4.1.1)
Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (1.60.0)
Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0.dev0,>=3.19.5 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (3.20.3)
Requirement already satisfied: requests<3.0.0.dev0,>=2.18.0 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (2.31.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (5.3.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (0.3.0)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (1.16.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (4.9)
Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /usr/local/lib/python3.10/dist-packages (from httplib2<1dev,>=0.15.0->google-api-python-client) (3.1.1)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (0.5.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (2023.7.22)
# API setup
from googleapiclient.discovery import build

api_service_name = "youtube"
api_version = "v3"
api_key = "AIzaSyC3ovk22_2y24J_AF6pt3YdI9JAnCU4Bpg"  # lol, we have changed this API key tons of times; we are about to run out of Google accounts for this
youtube = build(api_service_name, api_version, developerKey=api_key)
channel_ids = set()

# Fetch channels based on search terms (search_terms is the list of
# keywords we defined earlier in the notebook)
for term in search_terms:
    request = youtube.search().list(q=term, type='channel', part='id', maxResults=50)
    # Errors usually arise on this line, because the Youtube Data API v3
    # only allows us 10,000 quota units per day under normal usage
    response = request.execute()
    for item in response['items']:
        channel_ids.add(item['id']['channelId'])

# Now you can use 'channel_ids' in your main analysis
# (top_channels is assembled from channel_ids elsewhere in the notebook)
for channel in top_channels:
    print(f"{channel['name']}: {channel['id']}")
youtube = build('youtube','v3',developerKey=api_key)
def get_channel_stats(youtube, channel_ids):
    all_data = []
    # Request snippet, statistics and contentDetails for up to 50 channels at once
    request = youtube.channels().list(
        part='snippet,contentDetails,statistics',
        id=','.join(channel_ids)
    )
    response = request.execute()
    for i in range(len(response['items'])):
        data = dict(
            Channel_name=response['items'][i]['snippet']['title'],
            Subscribers=int(response['items'][i]['statistics']['subscriberCount']),
            Views=int(response['items'][i]['statistics']['viewCount']),
            Total_videos=int(response['items'][i]['statistics']['videoCount']),
            playlist_id=response['items'][i]['contentDetails']['relatedPlaylists']['uploads']
        )
        all_data.append(data)
    return all_data
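A minimal sketch of how we then call this function and load the result into pandas (the DataFrame name channel_data matches the dtypes output below; the exact cell in our notebook may differ slightly):

import pandas as pd

# Collect the per-channel statistics and load them into a DataFrame
channel_stats = get_channel_stats(youtube, list(channel_ids))
channel_data = pd.DataFrame(channel_stats)
channel_data.dtypes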
Channel_name object
Subscribers int64
Views int64
Total_videos int64
playlist_id object
dtype: object
channel_data

                                   Channel_name  Subscribers         Views  Total_videos               playlist_id
0                                YouTube Movies    174000000             0             0  UUlgRkhTL3_hImCAmdLfDE4g
1                    Cocomelon - Nursery Rhymes    165000000  168520383123          1007  UUbCmjCuTUZos6Inko4u57UQ
2                                     PewDiePie    111000000   29122295331          4719  UU-lHJZR3Gqxm24_Vd_AJ5Yw
3                                        Gaming     93700000             0             0  UUOpNcN46UbXVtpKMrmU4Abg
4                                     BANGTANTV     76400000   21358776894          2331  UULkAepWjdylmXSltofFvsYQ
5                                        Sports     75000000             0             0  UUEgdi0XIXXZ-qJOFPf4JSKw
6   Pinkfong Baby Shark - Kids' Songs & Stories     69600000   39865360048          2963  UUcdwLMPsaU2ezNSJU1nFoBQ
7                                    Movieclips     59800000   59781790192         39413  UU3gNmTGu-TTbFPpfSs5kNkg
8                         ABS-CBN Entertainment     45000000   50837677357        197733  UUstEtN0pgOmCf02EdXsGChw
9                                MrBeast Gaming     38500000    6797805872           141  UUIPPMRA040LQr5QPyJEbmXA
10                                   Markiplier     35500000   20556728913          5512  UU7_YxT-KID8kRbqZo7MyscQ
11                                Aditya Movies     29400000   10609662887          5649  UUX_uPA_dGf7wXjuMEaSKLJA
12                            JYP Entertainment     27800000   20525268539          1833  UUaO6TYtlC8U5ttz62hTrZgg
13                           Rans Entertainment     25600000    6533705814          3834  UUvA9_f5Lwk-poMynabtrZPg
14                                        Ninja     23700000    2565491881          1792  UUAW-NpUFkMyCNrvRSSGIvDQ
15                         Marvel Entertainment     20200000    5719490335          8665  UUvC4D8onUfXzvjTOM-dBfEA
16                             Like Nastya Vlog     18600000    8255279489           510  UUCI5Xsd_gCbZb9eTeOf9FdQ
17                                        Apple     18300000    1052250254           178  UUE_M8A5yxnLfW0KghEeajjw
18                              Linus Tech Tips     15400000    7299972254          6585  UUXuqSBlHAE6Xw-yeJA0Tunw
19                                     ABC News     15000000   13306852770         82205  UUBi2mrWuNuyYy4gbM6fU18Q
Above we have the overall data of 20 channels. We will write it out to a .csv file for later use in cleaning and basic analysis.
You can also see above that 3 channels have 0 videos, which is quite interesting. These are Youtube "Topics": users can subscribe to a topic, and whenever a video is categorized under that topic they can see it without subscribing to tons of individual channels related to that topic.
However, this can hinder us in further exploration and analysis, so stay tuned: below you will see the code that skips scraping video data for these channels. A sketch of both steps follows.
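This sketch is our illustration (the .csv filename here is assumed, not necessarily the one we submitted): export the channel overview and flag the zero-video "Topic" channels that we will skip later.

# Write the channel overview to a .csv file for later cleaning/analysis
channel_data.to_csv('channel_data.csv', index=False)

# The zero-video "Topic" channels to skip when scraping videos
topic_channels = channel_data[channel_data['Total_videos'] == 0]['Channel_name']
print(topic_channels.tolist())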
def get_video_ids(youtube, playlist_id):
    video_ids = []
    try:
        # Fetch the first page of the uploads playlist
        request = youtube.playlistItems().list(
            part='contentDetails',
            playlistId=playlist_id,
            maxResults=50
        )
        response = request.execute()
        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])
        next_page_token = response.get('nextPageToken')
        more_pages = True
        # Keep paging until Youtube stops returning a nextPageToken
        while more_pages:
            if next_page_token is None:
                more_pages = False
            else:
                request = youtube.playlistItems().list(
                    part='contentDetails',
                    playlistId=playlist_id,
                    maxResults=50,
                    pageToken=next_page_token
                )
                response = request.execute()
                for item in response['items']:
                    video_ids.append(item['contentDetails']['videoId'])
                next_page_token = response.get('nextPageToken')
    except Exception as e:
        print(f"Error fetching videos for playlist {playlist_id}: {e}")
    return video_ids
def get_video_details(youtube, video_ids):
    all_video_stats = []
    # The API accepts at most 50 video IDs per request, so batch them
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part='snippet,statistics',
            id=','.join(video_ids[i:i + 50])
        )
        response = request.execute()
        for video in response['items']:
            video_stats = dict(
                Title=video['snippet'].get('title', 'N/A'),
                Tags=video['snippet'].get('tags', []),
                Category_ID=video['snippet'].get('categoryId', 'N/A'),
                Published_date=video['snippet'].get('publishedAt', 'N/A'),
                Views=video['statistics'].get('viewCount', 0),
                Likes=video['statistics'].get('likeCount', 0),
                Comments=video['statistics'].get('commentCount', 0)
            )
            all_video_stats.append(video_stats)
    return all_video_stats
In the code for # Skip videos with comments turned off, we tried to remove the videos whose channels had turned comments off. However, we did not succeed at that part: as you can see, our dataset still contains a large number of videos with 0 comments.
This is hard because some videos are simply not interesting or controversial enough for people to interact with by commenting. When we checked the data again, 10,050 rows of video data have 0 comments, and within those 10,050 rows it is tricky for us to separate out the videos whose comment function is actually turned off.
We think it will need deeper intervention, or something else in the Google Developer console; if time allows after this midterm, we will go back and review that.
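One idea we may try later (an untested sketch, not the code we ran): when comments are disabled, the API response tends to omit the commentCount field from statistics entirely, whereas a video with comments enabled but no comments reports 0. Checking for the key's presence, instead of defaulting it to 0, would let us tell the two cases apart.

def comments_disabled(video):
    # Heuristic: a missing 'commentCount' key usually means comments are
    # turned off, while an enabled video with no comments reports '0'
    return 'commentCount' not in video.get('statistics', {})

# Possible usage inside the stats loop above:
# if comments_disabled(video):
#     continue  # skip videos with comments turned off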
def get_video_categories(youtube):
    # Ask the API for the list of video categories available in a region
    request = youtube.videoCategories().list(part="snippet", regionCode="US")
    response = request.execute()
    # Map numeric category IDs to their readable titles
    category_mapping = {item['id']: item['snippet']['title'] for item in response['items']}
    return category_mapping

category_mapping = get_video_categories(youtube)
The code right above is essentially asking Youtube: can you give me a list of all video categories available in the US? (as you can see, the regionCode we chose above is US). We chose US because text in English is easier to analyze than text in Vietnamese, which is a really good deal for tech newbies like us, and the keywords we typed into the search bar were in English.
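For example (an illustrative line of ours, assuming the combined DataFrame all_vids_data we build later), this mapping can turn the numeric Category_ID column into readable names:

# Map numeric category IDs (e.g. '10') to names (e.g. 'Music')
all_vids_data['Category'] = all_vids_data['Category_ID'].map(category_mapping)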
vids_data = []

for channel_name, playlist_id in zip(channel_data['Channel_name'], channel_data['playlist_id']):
    video_ids = get_video_ids(youtube, playlist_id)
    # Skip the zero-video "Topic" channels identified earlier
    if not video_ids:
        print(f"Skipping channel {channel_name} due to lack of video IDs.")
        continue
    vids_data.append(pd.DataFrame(get_video_details(youtube, video_ids)))
As mentioned somewhere above, there are some channels that do not have any videos, and we want to skip the scraping process for those channels; their names are listed above.
Have a look at our data after more than 10 minutes of scraping. You can check out the submitted csv file for more detail.
vids_data

[      Title                                            Tags                        Category_ID
0      dad life                                         [pewdiepie, pewds, pewdie]           20
1      Reacting to my Wife's baby memes
2      I'm the best dad (proof)
3      I'm a dad now
4      I Made A Street Lamp... And No One Noticed
...    ...
4454   Dual Minecraft Lets Play! Episode [003] - Expl...
4455   Dual Minecraft Lets Play! Episode [002] - New ...
4456   Call of Duty: Black Ops: Wager Match: Gun Game
4457   Blacklight Tango Down: Team Deathmatch 38-4 (P...
4458   Minecraft Multiplayer Fun

       Tags                                               Category_ID
0      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
1      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
2      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
3      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
4      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
...    ...                                                         ...  ]
As you can see, we originally wanted to scrape the Favourites data, which would indicate how many people saved a video as their favourite. However, Youtube does not publish that figure, so all the Favourite values shown above are 0. We need to drop that column before proceeding any further.
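A minimal sketch of that drop (assuming, as the printed output suggests, that vids_data is a list of per-channel DataFrames and the column is named Favourites):

# Drop the all-zero Favourites column from each per-channel DataFrame
vids_data = [df.drop(columns=['Favourites'], errors='ignore') for df in vids_data]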
vids_data

[      Title                                               Tags                                               Category_ID
0      Are You Sleeping Brother John? | CoComelon Nur...  [cocomelon, abckidtv, nursery rhymes, children...          27
1      🔴 CoComelon LIVE Halloween Mix 🎃! Wheels on th...  [cocomelon, abckidtv, nursery rhymes, children...          27
2      Humpty Dumpty Grocery Store + Wheels on the Bu...
3      Which Halloween Costume Do You Like? Halloween...
4      Play Outside at the Farm with Baby Animals | C...
...    ...
1004   Learn the ABCs: "P" is for Pig and Penguin
1005   Learn the ABCs: "L" is for Lion and Ladybug
1006   Learn the ABCs: "K" is for Kangaroo
1007   ABC Song with Cute Ending
1008   ABC Song

       Tags (first rows of the remaining channels)        Category_ID
       [pewdiepie, pewds, pewdie]                                  20
       [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...          10
       [babies, baby, baby shark, baby shark challeng...          27
       [songs for kids, kids song, free kids songs, k...          27
       [Nanny McPhee Returns 2010, Nanny McPhee Retur...           1
       [2023 Kapamilya Channel, ABS-CBN Entertainment...          24
       []                                                          20
       [Minecraft, challenge, Minecraft challenge, ga...          20
       [markiplier, 3 scary games, horror games, the ...          20
       [A Aa Movie, A Aa Movie shorts, A Aa Movie sce...           1
       [rans, raffi ahmad, nagita, nagita slavina, ra...          24
       [ninja, ninga, ninjashyper, ninja clips, ninja...          20
       [marvel, comics]                                            24
       [Настя, Настя и папа, Лайк Настя, like nastya ...          22
       [USB, usb extender, usb extension, remote comp...          28
       [abc, abcnl, ai, anonymous, artificial, book, ...          25
...]
As you can see above, the Favourite column is dropped. Now we need to write the data out to a csv file, because we do not want to run all the code above again due to the Youtube Data API limit, which only allows 10,000 quota units a day.
## Printing data
/content/drive/MyDrive/AI in EDT project/all_videos_data_rawdata_final.csv
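A sketch of the export cell that produced the path above (assuming the combined DataFrame is named all_vids_data, as in the shape check below):

# Combine every per-channel DataFrame and export the raw video data
all_vids_data = pd.concat(vids_data, ignore_index=True)
out_path = '/content/drive/MyDrive/AI in EDT project/all_videos_data_rawdata_final.csv'
all_vids_data.to_csv(out_path, index=False)
print(out_path)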
The file is already written to my Drive folder. Because we are not good at fixing bugs and errors, we separate our code section by section for ease of fixing. As you can see, this is a bit chunky for us to do, but it is the only safe way for us.
import os

# 1. Print the shape of the DataFrame
print(f"The dataset has {all_vids_data.shape[0]} rows and {all_vids_data.shape[1]} columns.")
Just a small bit of printing to show you how large our dataset is in shape and size. We are really proud of ourselves for the time and hard work we have dedicated here.
Having invested a huge amount of time figuring out how to scrape data effectively, we do not want to run that part of the code again, so we choose to work on each section separately.
The scope of scraping data in section 1 ends right above, as we exported 2 .csv files: one for channel data and one for all video data.
Now we come to cleaning the data. Unlike traditional datasets that only have numerical data, we have here the titles, categories, and tags of the videos (which are text). So our approach to data cleaning and data analysis is a bit more complicated and different.
Below we will have 2 sub-sections of data cleaning: in one we do the conventional numerical cleaning, and in the other we try to clean the string data that we have.
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length. (source: scikit-learn documentation, "Text feature extraction")
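As a concrete illustration (our own toy example, not part of the scraping code), scikit-learn's CountVectorizer turns variable-length titles into fixed-size count vectors:

from sklearn.feature_extraction.text import CountVectorizer

# Toy titles standing in for our Title column
titles = [
    "ABC Song with Cute Ending",
    "Minecraft Multiplayer Fun",
    "ABC Song",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)  # sparse matrix, one row per title
print(vectorizer.get_feature_names_out())
print(X.toarray())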
That is, what we have done here is only a light clean followed by a word cloud, so it does not yet mean very much for analyzing the model; we will see what to do next.
For now we have divided our work into separate parts; the text ...