
#Overview of our project

Scraping data is not easy, especially on social media sites that already enforce regulations protecting users' data. We did try with Facebook and Twitter. Where you can scrape data through some sort of API, it is usually a third-party API that is not officially provided by the site itself. To be precise, yes, we CAN scrape data, but not really transparently: we can take home whatever data the social media sites send to us, but the size of the scraped data is very limited, and one of our members was nearly banned from Facebook for what they did with Facebook data.

So we chose YouTube: not really a social network, but a video site that is open for developers to access its data, contains a large number of videos, and, by collecting how users engage with those videos, lets us understand the behavior and patterns of YouTube users.

##Our purpose

• collect as much data as possible


• perform data cleaning

With the collected dataset we will train a machine learning model (the scope of the final project) to predict the views of a given title, or to estimate how much a title we type in would contribute to the viral potential of a video; alternatively, we could do sentiment analysis.

For now we only need to collect the data and clean it for model training, so the deliverable at this stage is the cleaned input dataset that we will later use for training. As our instructor put it, "do whatever you like, as long as you end up with a clean block of data to run a model on." Running a model, however, has two parts, training and testing, so we will scrape a separate dataset for testing later.

In short, this midterm is limited to collection and cleaning; the final project will go considerably further.

##Structure of the project

#1. Scraping data

A note if you run this section again yourself:

• Create your own YouTube Data API v3 key in the Google Developer Console, because the API key used below belongs to us and is about to exceed its allowed quota. Given the size of our scraped data, make a new key every time you re-run the code.
• The step that fetches video data takes a long time: as you can see later in our submitted .csv file of all video data, it has more than 100 thousand rows, and the file is 43.4 MB (really large in comparison with most of the non-video datasets we found on Kaggle).
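Given the quota note above, a rough back-of-envelope estimate tells you whether a fresh key will survive one full re-run. The numbers below are assumptions, not measurements: the documented v3 costs (100 units per `search.list` call; roughly 1 unit per `playlistItems.list`/`videos.list` page), the 21 search terms used below, and a video dataset of about 100,000 rows.

```python
# Rough quota estimate for one full re-run of this notebook.
# Assumptions: search.list = 100 units/call, list pages of 50 items = 1 unit each,
# default daily quota = 10,000 units.
SEARCH_TERMS = 21          # number of search keywords used below
VIDEOS_SCRAPED = 100_000   # approximate size of the final video dataset

search_cost = SEARCH_TERMS * 100       # one search.list call per term
pages = -(-VIDEOS_SCRAPED // 50)       # ceil division: pages of 50 videos
total = search_cost + 2 * pages        # playlistItems pages + videos.list pages

print(search_cost, total)  # 2100 6100 -- already over half the daily quota
```

This ignores the per-channel `channels.list` calls (cheap, ~1 unit each), but it explains why one key barely covers a single run.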
!pip install google-api-python-client
# In Google Colab, packages are not saved and built into the environment
# as in PyCharm, so we need to install them before the import steps.
Requirement already satisfied: google-api-python-client in /usr/local/lib/python3.10/dist-packages (2.84.0)

!pip install pandas

Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)

!pip install seaborn

Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.12.2)

##Connecting to the YouTube Data API

from googleapiclient.discovery import build

import pandas as pd  # several cells below re-import these; in the final
# refinement pass we will remove the duplicated imports
import seaborn as sns

# API setup
api_service_name = "youtube"
api_version = "v3"
api_key = "AIzaSyC3ovk22_2y24J_AF6pt3YdI9JAnCU4Bpg"  # we have changed
# this API key many times already; we are about to run out of Google
# accounts for this
youtube = build(api_service_name, api_version, developerKey=api_key)

# Popular search terms used to fetch a variety of channels. We could
# steer the outcome toward a specific interest here; however, we are
# still undecided about which machine learning model and application to
# pursue in the next steps, so we picked random hot search keywords from
# the internet.
search_terms = ["music", "gaming", "vlog", "tutorial", "entertainment",
                "news", "sports", "apple", "technology", "movies",
                "PewDiePie", "ASMR", "Markiplier", "Pewdiepie Vs T Series",
                "Fortnite", "Baby Shark", "BTS", "Minecraft", "Vietnam",
                "Viet Nam", "university"]

channel_ids = set()
# Fetch channels based on search terms
for term in search_terms:
    request = youtube.search().list(q=term, type='channel', part='id',
                                    maxResults=50)
    response = request.execute()  # errors usually arise on this line
    # because the YouTube Data v3 API only allows 10,000 quota units for
    # normal usage
    for item in response['items']:
        channel_ids.add(item['id']['channelId'])

# Fetch detailed information about each channel
channels_data = []
for channel_id in channel_ids:
    request = youtube.channels().list(id=channel_id,
                                      part='snippet,statistics')
    response = request.execute()
    channel_info = response['items'][0]
    channels_data.append({
        'name': channel_info['snippet']['title'],
        'id': channel_id,
        'subscribers': int(channel_info['statistics']['subscriberCount'])
    })

# Sort channels by subscriber count and keep the top 20. This is the
# number of channels we scrape data for, sorted by subscriber count;
# we found 20 channels to be a sustainable figure that lets the later
# parts of the code run without crashing while scraping data.
top_channels = sorted(channels_data, key=lambda x: x['subscribers'],
                      reverse=True)[:20]

# Convert the set to a list
channel_ids = list(channel_ids)

# Now the 'channel_ids' list can be used in the main analysis
for channel in top_channels:
    print(f"{channel['name']}: {channel['id']}")

YouTube Movies: UClgRkhTL3_hImCAmdLfDE4g
Cocomelon - Nursery Rhymes: UCbCmjCuTUZos6Inko4u57UQ
PewDiePie: UC-lHJZR3Gqxm24_Vd_AJ5Yw
Gaming: UCOpNcN46UbXVtpKMrmU4Abg
BANGTANTV: UCLkAepWjdylmXSltofFvsYQ
Sports: UCEgdi0XIXXZ-qJOFPf4JSKw
Pinkfong Baby Shark - Kids' Songs & Stories: UCcdwLMPsaU2ezNSJU1nFoBQ
Movieclips: UC3gNmTGu-TTbFPpfSs5kNkg
ABS-CBN Entertainment: UCstEtN0pgOmCf02EdXsGChw
MrBeast Gaming: UCIPPMRA040LQr5QPyJEbmXA
Markiplier: UC7_YxT-KID8kRbqZo7MyscQ
Aditya Movies: UCX_uPA_dGf7wXjuMEaSKLJA
JYP Entertainment: UCaO6TYtlC8U5ttz62hTrZgg
Rans Entertainment: UCvA9_f5Lwk-poMynabtrZPg
Ninja: UCAW-NpUFkMyCNrvRSSGIvDQ
Marvel Entertainment: UCvC4D8onUfXzvjTOM-dBfEA
Like Nastya Vlog: UCCI5Xsd_gCbZb9eTeOf9FdQ
Apple: UCE_M8A5yxnLfW0KghEeajjw
Linus Tech Tips: UCXuqSBlHAE6Xw-yeJA0Tunw
ABC News: UCBi2mrWuNuyYy4gbM6fU18Q

##Scrape channel statistics

youtube = build('youtube', 'v3', developerKey=api_key)

def get_channel_stats(youtube, channel_ids):
    all_data = []

    # Split the channel_ids list into chunks of 50
    chunks = [channel_ids[i:i + 50] for i in range(0, len(channel_ids), 50)]

    for chunk in chunks:
        request = youtube.channels().list(
            part='snippet,contentDetails,statistics',
            id=','.join(chunk)
        )
        response = request.execute()

        for i in range(len(response['items'])):
            data = dict(
                Channel_name=response['items'][i]['snippet']['title'],
                Subscribers=int(response['items'][i]['statistics']['subscriberCount']),
                Views=int(response['items'][i]['statistics']['viewCount']),
                Total_videos=int(response['items'][i]['statistics']['videoCount']),
                playlist_id=response['items'][i]['contentDetails']['relatedPlaylists']['uploads']
            )
            all_data.append(data)

    # Sort by subscribers in descending order and keep the top 20 channels
    sorted_data = sorted(all_data, key=lambda x: x['Subscribers'],
                         reverse=True)[:20]
    return sorted_data

channel_statistics = get_channel_stats(youtube, channel_ids)

channel_data = pd.DataFrame(channel_statistics)
channel_data['Subscribers'] = pd.to_numeric(channel_data['Subscribers'])
channel_data['Views'] = pd.to_numeric(channel_data['Views'])
channel_data['Total_videos'] = pd.to_numeric(channel_data['Total_videos'])
channel_data.dtypes

Channel_name object
Subscribers int64
Views int64
Total_videos int64
playlist_id object
dtype: object

channel_data

                                   Channel_name  Subscribers         Views  Total_videos               playlist_id
0                                YouTube Movies    174000000             0             0  UUlgRkhTL3_hImCAmdLfDE4g
1                    Cocomelon - Nursery Rhymes    165000000  168520383123          1007  UUbCmjCuTUZos6Inko4u57UQ
2                                     PewDiePie    111000000   29122295331          4719  UU-lHJZR3Gqxm24_Vd_AJ5Yw
3                                        Gaming     93700000             0             0  UUOpNcN46UbXVtpKMrmU4Abg
4                                     BANGTANTV     76400000   21358776894          2331  UULkAepWjdylmXSltofFvsYQ
5                                        Sports     75000000             0             0  UUEgdi0XIXXZ-qJOFPf4JSKw
6   Pinkfong Baby Shark - Kids' Songs & Stories     69600000   39865360048          2963  UUcdwLMPsaU2ezNSJU1nFoBQ
7                                    Movieclips     59800000   59781790192         39413  UU3gNmTGu-TTbFPpfSs5kNkg
8                         ABS-CBN Entertainment     45000000   50837677357        197733  UUstEtN0pgOmCf02EdXsGChw
9                                MrBeast Gaming     38500000    6797805872           141  UUIPPMRA040LQr5QPyJEbmXA
10                                   Markiplier     35500000   20556728913          5512  UU7_YxT-KID8kRbqZo7MyscQ
11                                Aditya Movies     29400000   10609662887          5649  UUX_uPA_dGf7wXjuMEaSKLJA
12                            JYP Entertainment     27800000   20525268539          1833  UUaO6TYtlC8U5ttz62hTrZgg
13                           Rans Entertainment     25600000    6533705814          3834  UUvA9_f5Lwk-poMynabtrZPg
14                                        Ninja     23700000    2565491881          1792  UUAW-NpUFkMyCNrvRSSGIvDQ
15                         Marvel Entertainment     20200000    5719490335          8665  UUvC4D8onUfXzvjTOM-dBfEA
16                             Like Nastya Vlog     18600000    8255279489           510  UUCI5Xsd_gCbZb9eTeOf9FdQ
17                                        Apple     18300000    1052250254           178  UUE_M8A5yxnLfW0KghEeajjw
18                              Linus Tech Tips     15400000    7299972254          6585  UUXuqSBlHAE6Xw-yeJA0Tunw
19                                     ABC News     15000000   13306852770         82205  UUBi2mrWuNuyYy4gbM6fU18Q

Above we have the overall data of the 20 channels. We will write it out to a .csv file for the later cleaning and basic analysis.

from google.colab import drive

drive.mount("/content/drive")
FOLDER_PATH = "/content/drive/MyDrive/AI in EDT project/"

# Save the channel_data DataFrame to a CSV file
channel_file_path = FOLDER_PATH + "channel_data_raw.csv"
channel_data.to_csv(channel_file_path, index=False)

# Print the path of the saved file
print(channel_file_path)

Drive already mounted at /content/drive; to attempt to forcibly
remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/AI in EDT project/channel_data_raw.csv

You can also see above that 3 channels have 0 videos, which is quite interesting. These are YouTube "Topic" channels: users can subscribe to a topic, and whenever a video is categorized under that topic they can see it without subscribing to the many individual channels related to it.

However, this would hinder further exploration and analysis, so stay tuned: below you will see the code that skips scraping video data for these channels.
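Another way to implement that skip is to filter the channel table on Total_videos before scraping at all. A minimal sketch on a toy stand-in for channel_data (the names and counts are taken from the table above):

```python
import pandas as pd

# Toy stand-in for channel_data, using values from the table above
channel_data = pd.DataFrame({
    'Channel_name': ['YouTube Movies', 'Cocomelon - Nursery Rhymes',
                     'Gaming', 'Sports'],
    'Total_videos': [0, 1007, 0, 0],
})

# Keep only channels that actually have uploads
active = channel_data[channel_data['Total_videos'] > 0]
print(active['Channel_name'].tolist())  # ['Cocomelon - Nursery Rhymes']
```

Filtering up front would save the wasted playlistItems.list calls (and their 404 errors) that the loop below currently absorbs.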

##Get video ids

def get_video_ids(youtube, playlist_id):
    video_ids = []
    try:
        request = youtube.playlistItems().list(
            part='contentDetails',
            playlistId=playlist_id,
            maxResults=50
        )
        response = request.execute()

        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])

        next_page_token = response.get('nextPageToken')
        more_pages = True

        while more_pages:
            if next_page_token is None:
                more_pages = False
            else:
                request = youtube.playlistItems().list(
                    part='contentDetails',
                    playlistId=playlist_id,
                    maxResults=50,
                    pageToken=next_page_token
                )
                response = request.execute()

                for item in response['items']:
                    video_ids.append(item['contentDetails']['videoId'])

                next_page_token = response.get('nextPageToken')

    except Exception as e:
        print(f"Error fetching videos for playlist {playlist_id}: {e}")

    return video_ids

##Scrape video details

def get_video_details(youtube, video_ids):
    all_video_stats = []

    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part='snippet,statistics,contentDetails',
            id=','.join(video_ids[i:i+50]))
        response = request.execute()

        for video in response['items']:
            # Skip videos with comments turned off
            if 'commentCount' not in video['statistics']:
                continue

            video_stats = dict(
                Title=video['snippet'].get('title', 'N/A'),
                Tags=video['snippet'].get('tags', []),
                Category_ID=video['snippet'].get('categoryId', 'N/A'),
                Published_date=video['snippet'].get('publishedAt', 'N/A'),
                Views=video['statistics'].get('viewCount', 0),
                Likes=video['statistics'].get('likeCount', 0),
                Comments=video['statistics'].get('commentCount', 0)
            )
            all_video_stats.append(video_stats)

    return all_video_stats

In the code commented # Skip videos with comments turned off, we were trying to remove videos whose channels turned comments off. However, we did not fully succeed at that part: as you can see, our dataset still contains a large number of videos with 0 comments.

This is hard because some videos are simply not interesting or controversial enough for people to interact with by commenting; when we checked the data again, there were 10,050 rows of videos with 0 comments, and among those 10,050 rows it is tricky to tell the comments-turned-off videos apart.

We think this needs deeper intervention, or something else in the Google Developer Console; if time allows after this midterm, we will go back and review it.
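For reference, the distinction the filter above relies on is that the statistics object omits commentCount entirely when comments are disabled, while a merely unpopular video still reports a commentCount of '0'. A toy sketch of that check (the dict shapes here are assumptions modeled on the videos.list statistics field):

```python
# Toy statistics dicts illustrating the two cases (shapes assumed from
# the v3 videos.list response, where counts arrive as strings)
disabled = {'viewCount': '100', 'likeCount': '5'}                            # comments turned off
zero_comments = {'viewCount': '100', 'likeCount': '5', 'commentCount': '0'}  # enabled, just no comments

def comments_disabled(stats):
    # The API drops the key entirely when commenting is disabled
    return 'commentCount' not in stats

print(comments_disabled(disabled), comments_disabled(zero_comments))  # True False
```

So the videos with 0 comments that remain in our dataset are the second case, and removing them would discard genuinely unpopular videos rather than comments-disabled ones.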

def get_video_categories(youtube):
    request = youtube.videoCategories().list(part="snippet",
                                             regionCode="US")
    response = request.execute()
    category_mapping = {item['id']: item['snippet']['title']
                        for item in response['items']}
    return category_mapping

category_mapping = get_video_categories(youtube)
The code right above is essentially asking YouTube: can you give me a list of all video categories available in the US? (as you can see, the regionCode we chose is US). Text in English is easier to analyze than text in Vietnamese, which is a really good deal for tech newbies like us, and the keywords we typed into the search bar are in English, so we chose US.
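A small offline sketch of how such a mapping gets applied later with pandas .map (the IDs and names here are assumptions based on the US category list, e.g. '10' for Music and '20' for Gaming):

```python
import pandas as pd

# Toy mapping in the shape returned by get_video_categories
# (assumed entries from the US category list)
category_mapping = {'1': 'Film & Animation', '10': 'Music', '20': 'Gaming'}

# Toy video frame with string category IDs, as the API returns them
videos = pd.DataFrame({'Category_ID': ['10', '20', '10']})
videos['Category'] = videos['Category_ID'].map(category_mapping)
print(videos['Category'].tolist())  # ['Music', 'Gaming', 'Music']
```

Note the keys are strings, matching categoryId in the video snippet; mapping with integer keys would silently produce NaN.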

##Looping the video details scraping process over the YouTube channels

Below is where we scrape a really great dataset. However, due to the limits on the data a normal Google account can access, and because some fields are quite confusing to us, we decided to scrape only each video's title, tags, publish date and month, its like and comment counts, and the category it is filed under.

vids_data = []

for channel_name in channel_data['Channel_name']:
    playlist_id = channel_data.loc[channel_data['Channel_name'] ==
                                   channel_name, 'playlist_id'].iloc[0]
    video_ids = get_video_ids(youtube, playlist_id)

    if not video_ids:
        print(f"Skipping channel {channel_name} due to lack of video IDs.")
        continue

    video_details = get_video_details(youtube, video_ids)
    video_data = pd.DataFrame(video_details)

    # Convert data types and extract additional features safely.
    # We need the column checks below because we kept encountering
    # "KeyError" on 'Published_date', 'Views' and other columns of the
    # video data.
    if 'Published_date' in video_data.columns:
        video_data['Published_date'] = pd.to_datetime(video_data['Published_date']).dt.date
    if 'Views' in video_data.columns:
        video_data['Views'] = pd.to_numeric(video_data['Views'])
    if 'Likes' in video_data.columns:
        video_data['Likes'] = pd.to_numeric(video_data['Likes'])
    if 'Comments' in video_data.columns:
        video_data['Comments'] = pd.to_numeric(video_data['Comments'])
    if 'Published_date' in video_data.columns:
        video_data['Month'] = pd.to_datetime(video_data['Published_date']).dt.strftime('%b')

    # Convert category IDs to category names
    if 'Category_ID' in video_data.columns:
        video_data['Category'] = video_data['Category_ID'].map(category_mapping)

    # Append the video data for the current channel to the main list
    vids_data.append(video_data)

Error fetching videos for playlist UUlgRkhTL3_hImCAmdLfDE4g:
<HttpError 404 when requesting
https://youtube.googleapis.com/youtube/v3/playlistItems?
part=contentDetails&playlistId=UUlgRkhTL3_hImCAmdLfDE4g&maxResults=50&
key=AIzaSyC3ovk22_2y24J_AF6pt3YdI9JAnCU4Bpg&alt=json returned "The
playlist identified with the request's <code>playlistId</code>
parameter cannot be found.". Details: "[{'message': "The playlist
identified with the request's <code>playlistId</code> parameter cannot
be found.", 'domain': 'youtube.playlistItem', 'reason':
'playlistNotFound', 'location': 'playlistId', 'locationType':
'parameter'}]">
Skipping channel YouTube Movies due to lack of video IDs.
Error fetching videos for playlist UUOpNcN46UbXVtpKMrmU4Abg:
<HttpError 404 when requesting
https://youtube.googleapis.com/youtube/v3/playlistItems?
part=contentDetails&playlistId=UUOpNcN46UbXVtpKMrmU4Abg&maxResults=50&
key=AIzaSyC3ovk22_2y24J_AF6pt3YdI9JAnCU4Bpg&alt=json returned "The
playlist identified with the request's <code>playlistId</code>
parameter cannot be found.". Details: "[{'message': "The playlist
identified with the request's <code>playlistId</code> parameter cannot
be found.", 'domain': 'youtube.playlistItem', 'reason':
'playlistNotFound', 'location': 'playlistId', 'locationType':
'parameter'}]">
Skipping channel Gaming due to lack of video IDs.
Error fetching videos for playlist UUEgdi0XIXXZ-qJOFPf4JSKw:
<HttpError 404 when requesting
https://youtube.googleapis.com/youtube/v3/playlistItems?
part=contentDetails&playlistId=UUEgdi0XIXXZ-
qJOFPf4JSKw&maxResults=50&key=AIzaSyC3ovk22_2y24J_AF6pt3YdI9JAnCU4Bpg&
alt=json returned "The playlist identified with the request's
<code>playlistId</code> parameter cannot be found.". Details:
"[{'message': "The playlist identified with the request's
<code>playlistId</code> parameter cannot be found.", 'domain':
'youtube.playlistItem', 'reason': 'playlistNotFound', 'location':
'playlistId', 'locationType': 'parameter'}]">
Skipping channel Sports due to lack of video IDs.

As mentioned above, some channels do not have any videos and we want to skip the scraping process for them; the names of those channels are listed in the output above.

Have a look at our data after more than 10 minutes of scraping. You can check the submitted csv file for more detail.
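For the submitted file, the per-channel frames collected in vids_data can be combined into one flat table with pd.concat. A minimal sketch on toy frames (the output filename below is an assumption; the real export path follows the channel CSV above):

```python
import pandas as pd

# Two toy per-channel frames standing in for the elements of vids_data
a = pd.DataFrame({'Title': ['v1', 'v2'], 'Views': [10, 20]})
b = pd.DataFrame({'Title': ['v3'], 'Views': [30]})
vids_data = [a, b]

# One flat table for cleaning and analysis; ignore_index renumbers rows 0..n-1
all_videos = pd.concat(vids_data, ignore_index=True)
print(len(all_videos))  # 3
# all_videos.to_csv(FOLDER_PATH + "video_data_raw.csv", index=False)  # filename assumed
```

ignore_index matters here because every channel's frame starts its own index at 0; without it the combined table would carry duplicate row labels.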

vids_data
[       Title                                               Tags                                                 Category_ID  Published_date     Views    Likes  Favorites  Comments  Month  Category
 0      dad life                                            [pewdiepie, pewds, pewdie]                                    20      2023-09-15   3006394   218561          0      8262    Sep   Gaming
 1      Reacting to my Wife's baby memes                    [pewdiepie, pewds, pewdie]                                    24      2023-09-10   2433876   187681          0      3309    Sep   Entertainment
 2      I'm the best dad (proof)                            [pewdiepie, pewds, pewdie, pewdiepie son, pewd...             20      2023-08-30   2617240   216452          0      6378    Aug   Gaming
 3      I'm a dad now                                       [pewdiepie, pewds, pewdie]                                    20      2023-08-11  12739703  1470970          0    100941    Aug   Gaming
 4      I Made A Street Lamp... And No One Noticed          [pewdiepie, pewds, pewdie]                                    24      2023-06-30   3194681   201679          0      9325    Jun   Entertainment
 ...    ...                                                 ...                                                          ...             ...       ...      ...        ...       ...    ...   ...
 4454   Dual Minecraft Lets Play! Episode [003] - Expl...   [minecraft, mindcraft, mind, dual, commentary,...             20      2010-12-19    827995    11666          0      1653    Dec   Gaming
 4455   Dual Minecraft Lets Play! Episode [002] - New ...   [Military, Penis, exploded, in, Xebaz, face, m...             20      2010-12-19   1946711    33344          0      4661    Dec   Gaming
 4456   Call of Duty: Black Ops: Wager Match: Gun Game      [yt:quality=high, Call, of, Duty, Black, Ops, ...             20      2010-12-16   4217990   222335          0     46913    Dec   Gaming
 4457   Blacklight Tango Down: Team Deathmatch 38-4 (P...   [yt:quality=high, blacklight, tango, down, com...             20      2010-12-10   1909649    43400          0      5255    Dec   Gaming
 4458   Minecraft Multiplayer Fun                           [yt:quality=high, minecraft, alpha, Minecraft,...             20      2010-10-03  21781040   920979          0    160736    Oct   Gaming

 [4459 rows x 10 columns],

        Title                                               Tags                                                 Category_ID  Published_date     Views    Likes  Favorites  Comments  Month  Category
 0      V 'Layover' MV Making Film                          [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-19    546974   150709          0      5785    Sep   Music
 1      [n 월의 석진] Message from Jin : Sep 2023 💌             [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-16   1938195   468656          0     23088    Sep   Music
 2      V 'Blue' @ NAVER NPOP                               [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-15    569781   149628          0      4534    Sep   Music
 3      [#슈취타] 우리 태형이 하고 싶은거 다해 😉💜 - EP.18 #SUGA with ...   [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-11    477747    95642          0      1170    Sep   Music
 4      [#슈취타] 더 뚜렷해진 색깔로 빛날 완전체 ️⃣ 방탄 ️⃣ - EP.18 #S...      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-11    354655    81263          0       826    Sep   Music
 ...    ...                                                 ...                                                          ...             ...       ...      ...        ...       ...    ...   ...
 2324   130108 RAP MONSTER                                  [HIPHOP, 방탄소년단, bangtan, BTS, 랩몬스터, RAP MONSTE...             10      2013-01-09    363504    69321          0      2962    Jan   Music
 2325   130107 RAP MONSTER                                  [HIPHOP, 방탄소년단, bangtan, BTS, 랩몬스터, rap monster]              10      2013-01-07    646254    89882          0      4542    Jan   Music
 2326   흔한 연습생의 크리스마스 Video Edit by 방탄소년단                   [HIPHOP, 방탄소년단, bangtan, BTS, christmas]                      10      2012-12-23   2418578   334756          0     10929    Dec   Music
 2327   Let's Introduce BANGTAN ROOM by 방탄소년단               [HIPHOP, 방탄소년단, bangtan, BTS]                                 10      2012-12-22   2318941   367378          0     13950    Dec   Music
 2328   닥투 - RAP MONSTER of 방탄소년단                           [1219 대선, 방탄소년단, 투표, 선거, 닥치고, 닥치고투표]                          10      2012-12-17   3232368   452408          0     32883    Dec   Music

 [2329 rows x 10 columns]]

As you can see, we originally wanted to scrape the Favorites figure, which indicates how many people saved a video as a favorite. However, because YouTube does not publish that figure, all the Favorites data shown above is 0. We need to drop that column before proceeding any further.

# Drop the 'Favorites' column from each channel's frame if it exists
# and only contains zeros
for df in vids_data:
    if 'Favorites' in df.columns and df['Favorites'].sum() == 0:
        df.drop('Favorites', axis=1, inplace=True)

vids_data

[       Title                                               Tags                                                 Category_ID  Published_date      Views   Likes  Comments  Month  Category
 0      Are You Sleeping Brother John? | CoComelon Nur...   [cocomelon, abckidtv, nursery rhymes, children...             27      2023-09-19     388502    1796         0    Sep   Education
 1      🔴 CoComelon LIVE Halloween Mix 🎃! Wheels on th...   [cocomelon, abckidtv, nursery rhymes, children...             27      2023-09-18    1921881   13620         0    Sep   Education
 2      Humpty Dumpty Grocery Store + Wheels on the Bu...   [cocomelon, abckidtv, nursery rhymes, children...             27      2023-09-16    4599679   20252         0    Sep   Education
 3      Which Halloween Costume Do You Like? Halloween...   [cocomelon, abckidtv, nursery rhymes, children...             27      2023-09-15    1272792   36801         0    Sep   Education
 4      Play Outside at the Farm with Baby Animals | C...   [cocomelon, abckidtv, nursery rhymes, children...             27      2023-09-12    5118501   13386         0    Sep   Education
 ...    ...                                                 ...                                                          ...             ...        ...     ...       ...    ...   ...
 1004   Learn the ABCs: "P" is for Pig and Penguin          [baby songs, sing-along, abckidtv, preschool, ...              1      2007-06-20    9659508    5041         0    Jun   Film & Animation
 1005   Learn the ABCs: "L" is for Lion and Ladybug         [baby songs, kids education, kindergarten, chi...              1      2007-06-20   24653431   21469         0    Jun   Film & Animation
 1006   Learn the ABCs: "K" is for Kangaroo                 [cocomelon, kids entertainment, kindergarten, ...              1      2007-06-20    8799218    4525         0    Jun   Film & Animation
 1007   ABC Song with Cute Ending                           [toddler, children songs, abckidtv, kids video...             27      2006-09-02  296942569  332389         0    Sep   Education
 1008   ABC Song                                            [kids animation, baby songs, children songs, k...             27      2006-09-01   29042840   39775         0    Sep   Education

 [1009 rows x 9 columns],

        Title                                               Tags                                                 Category_ID  Published_date     Views    Likes  Comments  Month  Category
 0      dad life                                            [pewdiepie, pewds, pewdie]                                    20      2023-09-15   3065109   221079      8211    Sep   Gaming
 1      Reacting to my Wife's baby memes                    [pewdiepie, pewds, pewdie]                                    24      2023-09-10   2452099   188551      3205    Sep   Entertainment
 2      I'm the best dad (proof)                            [pewdiepie, pewds, pewdie, pewdiepie son, pewd...             20      2023-08-30   2620072   216570      6378    Aug   Gaming
 3      I'm a dad now                                       [pewdiepie, pewds, pewdie]                                    20      2023-08-11  12744664  1471182    100944    Aug   Gaming
 4      I Made A Street Lamp... And No One Noticed          [pewdiepie, pewds, pewdie]                                    24      2023-06-30   3195241   201695      9325    Jun   Entertainment
 ...    ...                                                 ...                                                          ...             ...       ...      ...       ...    ...   ...
 4454   Dual Minecraft Lets Play! Episode [003] - Expl...   [minecraft, mindcraft, mind, dual, commentary,...             20      2010-12-19    828006    11665      1653    Dec   Gaming
 4455   Dual Minecraft Lets Play! Episode [002] - New ...   [Military, Penis, exploded, in, Xebaz, face, m...             20      2010-12-19   1946750    33345      4661    Dec   Gaming
 4456   Call of Duty: Black Ops: Wager Match: Gun Game      [yt:quality=high, Call, of, Duty, Black, Ops, ...             20      2010-12-16   4218037   222334     46913    Dec   Gaming
 4457   Blacklight Tango Down: Team Deathmatch 38-4 (P...   [yt:quality=high, blacklight, tango, down, com...             20      2010-12-10   1909714    43400      5255    Dec   Gaming
 4458   Minecraft Multiplayer Fun                           [yt:quality=high, minecraft, alpha, Minecraft,...             20      2010-10-03  21781569   920988    160736    Oct   Gaming

 [4459 rows x 9 columns],

        Title                                               Tags                                                 Category_ID  Published_date     Views    Likes  Comments  Month  Category
 0      V 'Layover' MV Making Film                          [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-19    680801   172494      6468    Sep   Music
 1      [n 월의 석진] Message from Jin : Sep 2023 💌             [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-16   1956170   471508     23164    Sep   Music
 2      V 'Blue' @ NAVER NPOP                               [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-15    580254   150507      4552    Sep   Music
 3      [#슈취타] 우리 태형이 하고 싶은거 다해 😉💜 - EP.18 #SUGA with ...   [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-11    480787    95951      1176    Sep   Music
 4      [#슈취타] 더 뚜렷해진 색깔로 빛날 완전체 ️⃣ 방탄 ️⃣ - EP.18 #S...      [방탄소년단, BTS, BANGTAN, 알엠, RM, 슈가, SUGA, 제이홉, j...             10      2023-09-11    355981    81493       827    Sep   Music
 ...    ...                                                 ...                                                          ...             ...       ...      ...       ...    ...   ...
 2324   130108 RAP MONSTER                                  [HIPHOP, 방탄소년단, bangtan, BTS, 랩몬스터, RAP MONSTE...             10      2013-01-09    363531    69324      2962    Jan   Music
 2325   130107 RAP MONSTER                                  [HIPHOP, 방탄소년단, bangtan, BTS, 랩몬스터, rap monster]              10      2013-01-07    646296    89886      4542    Jan   Music
 2326   흔한 연습생의 크리스마스 Video Edit by 방탄소년단                   [HIPHOP, 방탄소년단, bangtan, BTS, christmas]                      10      2012-12-23   2418742   334759     10930    Dec   Music
 2327   Let's Introduce BANGTAN ROOM by 방탄소년단               [HIPHOP, 방탄소년단, bangtan, BTS]                                 10      2012-12-22   2319029   367386     13951    Dec   Music
 2328   닥투 - RAP MONSTER of 방탄소년단                           [1219 대선, 방탄소년단, 투표, 선거, 닥치고, 닥치고투표]                          10      2012-12-17   3232644   452428     32883    Dec   Music

 [2329 rows x 9 columns],

        Title                                               Tags                                                 Category_ID  Published_date     Views    Likes  Comments  Month  Category
 0      If You're Happy and You Know It | Nursery Rhym...   [babies, baby, baby shark, baby shark challeng...             27      2023-09-19     11186       72         0    Sep   Education
 1      The Scared Tiny Little Boat | Outdoor Songs | ...   [babies, baby, baby shark, baby shark challeng...             27      2023-09-18     34209      163         0    Sep   Education
 2      Dancing William! #william #shorts                   [babies, baby, baby shark, baby shark challeng...             27      2023-09-18    117755     3019         0    Sep   Education
 3      Winter Vehicle Friends | Car Songs for Kids | ...   [babies, baby, baby shark, baby shark challeng...             27      2023-09-17     33652      156         0    Sep   Education
 4      Baby Shark Lost His Pet! #bigshow #babyshark        [pinkfong, family, kids, children, toddlers, b...             27      2023-09-17    153115        0         0    Sep   Education
 ...    ...                                                 ...                                                          ...             ...       ...      ...       ...    ...   ...
 1271   Hickory Dickory Dock | Best Kids Songs | PINKF...   [pinkfong, songs, family, songs for kids, kids...             27      2014-10-22    464588      222         0    Oct   Education
 1272   A Sailor Went to Sea | Best Kids Songs | PINKF...   [pinkfong, songs, family, songs for kids, kids...             27      2014-10-15    533818      404         0    Oct   Education
 1273   The Princess and the Frog | Fairy Tales | Musi...   [The Princess And The Frog (Film), cuentos, cu...             27      2014-08-06   1156357     1449         0    Aug   Education
 1274   The Alphabet Song | Best Kids Songs | PINKFONG...   [abc, phonics, abc phonics, abc songs, phonics...             27      2014-08-01    684619      482         0    Aug   Education
 1275   Farm Animal Songs Collection Vol. 2 | Best Kid...   [songs for kids, kids song, free kids songs, k...             27      2014-06-26    543858      446         0    Jun   Education

 [1276 rows x 9 columns],


Title \
0 Nanny McPhee Returns (2010) - Nanny McPhee Lea...
1 Nanny McPhee Returns (2010) - The Magic Burp &...
2 Nanny McPhee Returns (2010) - Defusing The Bom...
3 Nanny McPhee Returns (2010) - The Bomb in the ...
4 Nanny McPhee Returns (2010) - Dropping the Bom...
... ...
19862 Cujo (4/8) Movie CLIP - You're Rabid! (1983) HD
19863 Cujo (3/8) Movie CLIP - What's the Matter? (19...
19864 Cujo (2/8) Movie CLIP - Cujo Won't Hurt Him (1...
19865 Cujo (1/8) Movie CLIP - A Bat Bites Cujo (1983...
19866 Valley Girl (11/12) Movie CLIP - Randy Stalks ...

Tags Category_ID
\
0 [Nanny McPhee Returns 2010, Nanny McPhee Retur... 1

1 [Nanny McPhee Returns 2010, Nanny McPhee Retur... 1

2 [Nanny McPhee Returns 2010, Nanny McPhee Retur... 1

3 [Nanny McPhee Returns 2010, Nanny McPhee Retur... 1

4 [Nanny McPhee Returns 2010, Nanny McPhee Retur... 1

... ... ...

19862 [cujo, cujo trailer, cujo part 1, cujo the mov... 1

19863 [cujo, cujo trailer, cujo part 1, cujo the mov... 1

19864 [cujo, cujo trailer, cujo part 1, cujo the mov... 1

19865 [cujo, cujo trailer, cujo part 1, cujo the mov... 1

19866 [valley girl, valley girl trailer, valley girl... 1

Published_date Views Likes Comments Month


Category
0 2023-09-19 2713 32 0 Sep Film &
Animation
1 2023-09-19 1641 27 0 Sep Film &
Animation
2 2023-09-19 5443 214 0 Sep Film &
Animation
3 2023-09-19 1314 20 0 Sep Film &
Animation
4 2023-09-19 1451 24 0 Sep Film &
Animation
... ... ... ... ... ... ..
.
19862 2016-08-12 492130 3678 537 Aug Film &
Animation
19863 2016-08-12 946906 6699 549 Aug Film &
Animation
19864 2016-08-12 951408 6236 305 Aug Film &
Animation
19865 2016-08-12 2314592 14261 1522 Aug Film &
Animation
19866 2016-08-12 61639 419 64 Aug Film &
Animation

[19867 rows x 9 columns],


Title \
0 Magandang Buhay (5/5) | September
2023 20,
1 Magandang Buhay (4/5) | September
2023 20,
2 Magandang Buhay (3/5) | September
2023 20,
3 Magandang Buhay (2/5) | September
2023 20,
4 Magandang Buhay (1/5) | September
2023 20,
... ...
19702 Lolo Hugo asks Ali who is more attractive,betw...
19703 Jalle and Eich gets their first win as daily c...
19704 Miss Q&A Tyra proudly introduces her live-in p...
19705 Bon Joey Trinidad bags her 9th crown | Miss Q ...
19706 Kulitan at tapatan nina INAH EVANS & RUTH VS. ...

Tags Category_ID
\
0 [2023 Kapamilya Channel, ABS-CBN Entertainment... 24

1 [2023 Kapamilya Channel, ABS-CBN Entertainment... 24

2 [2023 Kapamilya Channel, ABS-CBN Entertainment... 24

3 [2023 Kapamilya Channel, ABS-CBN Entertainment... 24

4 [2023 Kapamilya Channel, ABS-CBN Entertainment... 24

... ... ...

19702 [ABS-CBN Online, ABS-CBN Philippines, Philippi... 24

19703 [ABS-CBN Entertianment, It's Showtime, Kapamil... 24


19704 [ABS-CBN Entertianment, It's Showtime, Kapamil... 24

19705 [ABS-CBN Entertianment, It's Showtime, Kapamil... 24

19706 [ABS-CBN, ABS-CBN Online, ABS-CBN Philippines,... 24

Published_date Views Likes Comments Month Category


0 2023-09-20 275 0 0 Sep Entertainment
1 2023-09-20 65 0 0 Sep Entertainment
2 2023-09-20 77 2 0 Sep Entertainment
3 2023-09-20 75 2 0 Sep Entertainment
4 2023-09-20 161 3 0 Sep Entertainment
... ... ... ... ... ... ...
19702 2022-09-12 9910 217 4 Sep Entertainment
19703 2022-09-12 14621 105 12 Sep Entertainment
19704 2022-09-12 20639 299 18 Sep Entertainment
19705 2022-09-12 14149 172 16 Sep Entertainment
19706 2022-09-12 2843 16 1 Sep Entertainment

[19707 rows x 9 columns],


Title \
0 Press This Button = Win $100,000!
1 Would You Rather Have $10,000 or This Mystery ...
2 100 Assassins vs 10 Real Cops!
3 If You Build It, I'll Pay For It!
4 World's Hardest Challenge!
.. ...
136 $10,000 Obstacle Course - Challenge
137 Last to Survive Random Blocks wins $10,000 - C...
138 Last to Survive Arena wins $10,000 - Challenge
139 $10,000 Bank robbery - Challenge
140 Last To Survive Wins $10,000 - Challenge

Tags Category_ID \
0 [] 20
1 [] 20
2 [] 20
3 [] 20
4 [] 20
.. ... ...
136 [Minecraft, challenge, Minecraft challenge, ga... 20
137 [Minecraft, challenge, Minecraft challenge, ga... 20
138 [Minecraft, challenge, Minecraft challenge, ga... 20
139 [Minecraft, challenge, Minecraft challenge, ga... 20
140 [Minecraft, challenge, Minecraft challenge, ga... 22

Published_date Views Likes Comments Month Category

0 2023-07-07 47879812 997604 19942 Jul Gaming


1 2023-06-17 29812082 611631 11866 Jun Gaming

2 2023-04-05 61008668 947171 9940 Apr Gaming

3 2022-12-31 57010860 1299543 27455 Dec Gaming

4 2022-12-16 27702996 631110 22943 Dec Gaming

.. ... ... ... ... ... ...

136 2020-05-22 12822301 320409 8469 May Gaming

137 2020-05-20 28472518 687620 17025 May Gaming

138 2020-05-16 29884047 469072 12118 May Gaming

139 2020-05-14 18182140 350588 7703 May Gaming

140 2020-05-12 57330607 865352 24571 May People & Blogs

[141 rows x 9 columns],


Title \
0 3 SCARY GAMES #102
1 Skibidi.EXE
2 Smash or Pass: All ∞ Pokémon
3 Ethan is DEAD to Me
4 Ethan is DEAD to Me
... ...
5402 Cry of Fear Reaction Compilation #1
5403 Amnesia Custom Story Reaction Compilation #1
5404 Nightmare House Reaction Compilation
5405 Penumbra Reaction Compilation
5406 Amnesia Reaction Compilation

Tags
Category_ID \
0 [markiplier, 3 scary games, horror games, the ... 20

1 [markiplier, scary games, skibidi, skibidi toi... 20

2 [Pokémon, Pokemon, Markiplier, Smash or pass, ... 20

3 [scary games, secret, gaming, garten of banban... 20

4 [scary games, secret, gaming, garten of banban... 20

... ... ...

5402 [Cry of Fear, Cry of Fear Reactions, Lets Play... 20


5403 [Amnesia Reaction Compilation, Amnesia: The Da... 20

5404 [Nightmare House 2, Penumbra Reactions, Lets P... 20

5405 [Penumbra Reaction Compilation, Penumbra: Over... 20

5406 [Amnesia Reaction Compilation, Amnesia Gamepla... 20

Published_date Views Likes Comments Month Category


0 2023-09-16 1801950 114166 4259 Sep Gaming
1 2023-09-13 1847105 113453 7846 Sep Gaming
2 2023-09-11 1645200 147295 7077 Sep Gaming
3 2023-09-09 1635921 89640 2908 Sep Gaming
4 2023-09-09 0 21 0 Sep Gaming
... ... ... ... ... ... ...
5402 2012-05-26 1135220 14554 882 May Gaming
5403 2012-05-26 976098 12066 669 May Gaming
5404 2012-05-26 1253218 15782 987 May Gaming
5405 2012-05-26 957374 11976 865 May Gaming
5406 2012-05-26 3532328 84029 14435 May Gaming

[5407 rows x 9 columns],


Title \
0 #Samantha Latest #Shorts | Tag That Frustrated...
1 #LoveStory #Shorts | #HindiDubbedMovie | #Naga...
2 "Vennela Kishore" Birthday Special Mashup | MC...
3 Nithin Latest Hindi Insta Shorts | #AAa2 #newy...
4 #Nethraa Movie #RoboShankar #MottaRajendran #S...
... ...
5636 Ravi Teja About Him Self in Bhadra Movie - Rav...
5637 Ravi Teja Hilarious Comedy in Marriage Functio...
5638 Ravi Teja Comedy Scenes In Bhadra Movie - Ravi...
5639 Padmanabham Druken Comedy in Bhadra Movie - Pa...
5640 Ravi Teja Comedy Scenes In Bhadra Movie | Ravi...

Tags
Category_ID \
0 [A Aa Movie, A Aa Movie shorts, A Aa Movie sce... 1

1 [#LoveStory Hindi Dubbed Movies, AdityaMovies,... 1

2 [vennela kishore comedy, #vennelakishore, #mck... 1

3 [A Aa 2 Insta reels, A Aa 2 yt shorts, chalmoh... 1

4 [nethraa movie yt scenes, nethraa movie scenes... 1

... ... ...


5636 [Bhadra, Bhadra movie, Bhadra full movie, Bhad... 1

5637 [Bhadra, Bhadra Full Movie, Bhadra Telugu Full... 1

5638 [Badra comedy scenes, Bhadra movie comedy scen... 23

5639 [Bhadra, Badra comedy scenes, Bhadra movie com... 23

5640 [Bhadra, Badra comedy scenes, Bhadra movie com... 1

Published_date Views Likes Comments Month Category

0 2023-09-20 2908 177 1 Sep Film & Animation

1 2023-09-19 16969 984 3 Sep Film & Animation

2 2023-09-19 9804 301 12 Sep Film & Animation

3 2023-09-19 25517 1498 4 Sep Film & Animation

4 2023-09-19 10317 407 2 Sep Film & Animation

... ... ... ... ... ... ...

5636 2012-02-24 318619 1835 35 Feb Film & Animation

5637 2012-02-24 1843965 6807 74 Feb Film & Animation

5638 2012-02-24 161869 849 11 Feb Comedy

5639 2012-02-24 280843 1242 12 Feb Comedy

5640 2012-02-24 271996 1291 20 Feb Film & Animation

[5641 rows x 9 columns],


Title \
0 Xdinary Heroes 〈Livelock〉 The Beginning
1 A2K ep.20 "Team Evaluation Rankings"
2 [Nizi Project Season 2] Part 1 #10 Highlights
3 A2K ep.21 "The Grand Finale Begins"
4 [Nizi Project Season 2] Part 1 - Team Mission ...
... ...
1828 Wonder Girls _ Yubin
1829 Wonder Girls "Take it!" M/V
1830 J.Y.Park with Lil Jon!
1831 Wonder Girls "Stupid" M/V
1832 Wonder Girls "Tell me" M/V
Tags
Category_ID \
0 [JYP Entertainment, JYP] 10

1 [JYP Entertainment, JYP, A2K, America2Korea, R... 10

2 [JYP Entertainment, JYP, Globalaudition, Audit... 10

3 [JYP Entertainment, JYP, A2K, America2Korea, R... 10

4 [JYP Entertainment, JYP] 10

... ... ...

1828 [Wonder, Girls, Yubin, Rap] 24

1829 [Wonder, Girls, Take, it, 박진영, 원더걸스, 2AM, 2PM,... 24


1830 [jyp, lil, jon, j.y.park, 박진영, korea, 원더걸스, 2A... 24
1831 [Wonder, Girls, Stupid, 박진영, 원더걸스, 2AM, 2PM, J... 24
1832 [Wonder, Girls, Tell, me, irony, 박진영, 원더걸스, 2A... 24

Published_date Views Likes Comments Month Category

0 2023-09-20 128 408 48 Sep Music

1 2023-09-19 1343353 65132 9618 Sep Music

2 2023-09-18 178543 4015 327 Sep Music

3 2023-09-16 1 4287 942 Sep Music

4 2023-09-15 661387 12412 1732 Sep Music

... ... ... ... ... ... ...

1828 2008-01-29 133781 1153 147 Jan Entertainment

1829 2008-01-28 3051093 5908 941 Jan Entertainment

1830 2008-01-28 154523 1271 229 Jan Entertainment

1831 2008-01-28 2526443 10134 1344 Jan Entertainment

1832 2008-01-28 33446611 811525 30126 Jan Entertainment

[1833 rows x 9 columns],


Title \
0 DITINGGAL RAFFI GIGI KE SPANYOL CIPUNG GELAR P...
1 NAGITA BUKA LOKER PNS LAGI???!!!CIPUNG MAU IKU...
2 NAGITA JAJAN FURNITURE TERUUUS!!! CIPUNG MULAI...
3 KONSER BLACKPINK YANG PALING BERKESAN BUAT NAG...
4 Salah Potong Rambut
... ...
3769 GREEN & BLUE - TIPS MAKE UP MEMSYE
3770 Diary Mamank Rans - "Hari Yang Riweh"
3771 GRAND LAUNCHING RA JEANS
3772 RA JEANS 60"
3773 RA JEANS 15"

Tags
Category_ID \
0 [rans, raffi ahmad, nagita, nagita slavina, ra... 24

1 [rans, raffi ahmad, nagita, nagita slavina, ra... 24

2 [rans, raffi ahmad, nagita, nagita slavina, ra... 24

3 [rans, raffi ahmad, nagita, nagita slavina, ra... 24

4 [rans, raffi ahmad, nagita, nagita slavina, ra... 24

... ... ...

3769 [RAFFI AHMAD, NAGITA SLAVINA, TIPS AND TRICK, ... 22

3770 [Diary Mamank, RAFFI AHMAD, NAGITA SLAVINA, RA... 22

3771 [raffi ahmad, nagita slavina, ra jeans] 22

3772 [raffiahmad, nagitaslavina, raffigigi, rajeans] 22

3773 [raffiahmad, nagitaslavina, rajeans, raffigigi] 22

Published_date Views Likes Comments Month Category


0 2023-09-19 274940 10523 532 Sep Entertainment
1 2023-09-18 110188 4469 287 Sep Entertainment
2 2023-09-18 208327 7092 189 Sep Entertainment
3 2023-09-17 373480 13172 560 Sep Entertainment
4 2023-09-17 12548 486 50 Sep Entertainment
... ... ... ... ... ... ...
3769 2016-03-07 784251 7721 413 Mar People & Blogs
3770 2016-03-07 4526771 60516 3134 Mar People & Blogs
3771 2016-01-27 96537 2256 355 Jan People & Blogs
3772 2015-12-28 105062 2183 391 Dec People & Blogs
3773 2015-12-28 165702 3613 884 Dec People & Blogs

[3774 rows x 9 columns],


Title \
0 Fortnite All NEW Birthday Update
1 Stop Trolling Bro
2 This New Loadout is Breaking Season 4
3 Blocked From Playing Builds
4 SEASON 4 Problem..
... ...
1785 Suddoth 2 + Aussie + cookie Monster
1786 Formal vs Ninja Game 4 Sanctuary
1787 Insane overkill reaction!!
1788 Incredible Pit Start
1789 First video of many

Tags
Category_ID \
0 [ninja, ninga, ninjashyper, ninja clips, ninja... 20

1 [ninja, ninga, ninjashyper, ninja clips, ninja... 20

2 [ninja, ninga, ninjashyper, ninja clips, ninja... 20

3 [ninja, ninga, ninjashyper, ninja clips, ninja... 20

4 [ninja, ninga, ninjashyper, ninja clips, ninja... 20

... ... ...

1785 [justin.tv, ninja] 20

1786 [justin.tv] 20

1787 [justin.tv, ninja] 20

1788 [Ninja, Halo, Reach, MLG, Awesome, Maggie, Cow... 20

1789 [Ninja, Mlg, Hyper] 20

Published_date Views Likes Comments Month Category


0 2023-09-19 59563 4104 216 Sep Gaming
1 2023-09-19 35945 2251 32 Sep Gaming
2 2023-09-17 183470 8351 533 Sep Gaming
3 2023-09-16 92977 7937 86 Sep Gaming
4 2023-09-15 199818 9710 404 Sep Gaming
... ... ... ... ... ... ...
1785 2011-12-15 68174 1170 150 Dec Gaming
1786 2011-12-15 192336 2977 400 Dec Gaming
1787 2011-12-15 444856 8534 674 Dec Gaming
1788 2011-12-15 624961 14416 1024 Dec Gaming
1789 2011-12-15 7366868 469835 46647 Dec Gaming

[1790 rows x 9 columns],


Title \
0 Loki Cosplay Transformation!
1 Marvel Studios' Loki Season 2 and Marvel's Spi...
2 Marvel Studios’ Loki Season 2 | Amazing Loki
3 King T'Challa | Marvel Studios' Legends
4 The Dora Milaje | Marvel Studios' Legends
... ...
8591 The Weekly Watcher August 15, 2008
8592 The Weekly Watcher August 22, 2008
8593 Weekly Watcher August 29, 2008
8594 The Weekly Watcher September 5, 2008
8595 The Weekly Watcher September 12, 2008

Tags
Category_ID \
0 [marvel, comics] 24

1 [marvel, comics] 24

2 [marvel, comics] 24

3 [marvel, comics] 24

4 [marvel, comics] 24

... ... ...

8591 [Marvel, Comics, Marvel.com, Weekly, Watcher, ... 24

8592 [Marvel, Comics, Marvel.com, Weekly, Watcher, ... 24

8593 [Marvel, Comics, Marvel.com, Weekly, Watcher, ... 24

8594 [Alexa, Mendez, Marvel, Comics, Marvel.com, We... 24

8595 [Marvel Comics, Alexa, Mendez, Weekly, Watcher... 24

Published_date Views Likes Comments Month Category


0 2023-09-19 25004 2214 22 Sep Entertainment
1 2023-09-18 43829 2510 94 Sep Entertainment
2 2023-09-18 755004 38175 1120 Sep Entertainment
3 2023-09-18 45148 3661 313 Sep Entertainment
4 2023-09-17 33772 2038 83 Sep Entertainment
... ... ... ... ... ... ...
8591 2008-09-22 2723 54 18 Sep Entertainment
8592 2008-09-22 3632 63 27 Sep Entertainment
8593 2008-09-22 6478 109 36 Sep Entertainment
8594 2008-09-22 6902 157 44 Sep Entertainment
8595 2008-09-19 109639 3195 1037 Sep Entertainment
[8596 rows x 9 columns],
Title \
0 Настя и папа на активном отдыхе в Турции
1 Настя много занимается, чтобы стать умнее
2 Настя в 3 классе! Школьные истории
3 Настя и папа учатся готовить шоколад на фабрик...
4 Настя и папа на семейном отдыхе в парке Land o...
.. ...
176 Настя и папа в парке аттракционов - Влог
177 Настя и друзья отправились в путешествие по Ам...
178 Настя и выпускной в детском саду с Ромой и Дианой
179 Влог: Настя и друзья собирают сюрпризы.
180 Настя и её необычные игрушки

Tags Category_ID \
0 [Настя, Настя и папа, Лайк Настя, like nastya ... 22
1 [Настя, Настя и папа, Лайк Настя, like nastya ... 22
2 [Настя, Настя и папа, Лайк Настя, like nastya ... 22
3 [Настя, Настя и папа, Лайк Настя, like nastya ... 22
4 [Настя, Настя и папа, Лайк Настя, like nastya ... 22
.. ... ...
176 [в парке, влог, парк аттракционов, настя и пап... 22
177 [настя, лайк настя, рома и диана, катя и макси... 22
178 [настя, рома и диана, влог, для детей, лайк на... 22
179 [сюрпризы, egg hunt, toys, влог, настя, лайк н... 22
180 [игрушки, настя, лайк настя, nastya, like nast... 22

Published_date Views Likes Comments Month Category

0 2023-09-18 157985 531 0 Sep People & Blogs

1 2023-09-14 227048 911 0 Sep People & Blogs

2 2023-09-11 258291 1054 0 Sep People & Blogs

3 2023-09-07 609668 2150 0 Sep People & Blogs

4 2023-09-04 974036 2776 0 Sep People & Blogs

.. ... ... ... ... ... ...

176 2019-09-06 23639256 89162 0 Sep People & Blogs

177 2019-08-24 36136243 125800 0 Aug People & Blogs

178 2019-06-19 141711735 421901 0 Jun People & Blogs

179 2019-04-27 36906853 100016 0 Apr People & Blogs


180 2019-03-20 52312138 136712 0 Mar People & Blogs

[181 rows x 9 columns],


Empty DataFrame
Columns: []
Index: [],
Title \
0 Have I been doing this the expensive way for n...
1 Young People Try a Mac from 1996!
2 Unity? More Like Divorce - WAN Show September ...
3 LTT TV - 24/7 Tech Tips
4 10 Reasons I Daily Drive a Foldable
... ...
6286 Personal Grooming with a USB Shaver (Linus Tec...
6287 Lenovo Ideapad S10 Netbook Unboxing and Overview
6288 eVGA X58 Motherboard Overview (Linus Tech Tips...
6289 Asus Rampage 2 Extreme Motherboard (Linus Tech...
6290 Undisclosed AMD Processor at 3.6GHz (Linus Tec...

Tags
Category_ID \
0 [USB, usb extender, usb extension, remote comp... 28

1 [apple, macintosh, old mac, apple ii, first ma... 28

2 [pcmr, building, competition, gamers, how to, ... 28

3 [pcmr, building, competition, gamers, how to, ... 28

4 [Samsung, Apple, Google, Phone, Foldable, Fold... 28

... ... ...

6286 [USB, shaver, Syba, rechargable] 28

6287 [lenovo, ideapad, s10, unboxing, netbook, subn... 28

6288 [evga, x58, motherboard, sli, linus, tech, tip... 28

6289 [asus, motherboard, rampage, extreme, computer... 28

6290 [amd, cpu, processor, overclocking, linus, tec... 28

Published_date Views Likes Comments Month


Category
0 2023-09-20 157666 10021 583 Sep Science &
Technology
1 2023-09-17 848871 37169 2185 Sep Science &
Technology
2 2023-09-16 553000 12123 1500 Sep Science &
Technology
3 2023-09-15 0 6 0 Sep Science &
Technology
4 2023-09-14 826618 40261 2260 Sep Science &
Technology
... ... ... ... ... ...
...
6286 2008-12-13 168683 3387 514 Dec Science &
Technology
6287 2008-12-10 289847 5314 1204 Dec Science &
Technology
6288 2008-11-29 118660 2655 614 Nov Science &
Technology
6289 2008-11-27 266312 4148 765 Nov Science &
Technology
6290 2008-11-25 1010799 45658 9087 Nov Science &
Technology

[6291 rows x 9 columns],


Title \
0 Kashmir Hill on how facial recognition tech ch...
1 Family claims hidden camera recorded teen in p...
2 Photojournalist Brian Frank on spirit of worke...
3 Californians facing a mental health toll from ...
4 Biden extends support for Ukraine in UN speech
... ...
19540 900,000 children getting COVID vaccines: White...
19541 Consumer prices continue to rise
19542 New Jersey man gets 41 months in Capitol Hill ...
19543 The latest on Day 4 of Ahmaud Arbery murder trial
19544 Day 7 of the Kyle Rittenhouse trial concludes ...

Tags Category_ID
\
0 [abc, abcnl, ai, anonymous, artificial, book, ... 25

1 [news, breaking news, live news, daily news, w... 25

2 [abc, abcnl, brian, california, central, frank... 25

3 [abc, abcnl, california, climate, disaster, ea... 25

4 [ABC, America, Biden, Joe, Nations, News, Poli... 25

... ... ...

19540 [abc, children, covid-19, eligible, fda, for, ... 24


19541 [abc, chain, consumer, costs, gas, groceries, ... 24

19542 [Fairlamb, Scott, abc, assaulting, capitol, fe... 24

19543 [Ahmaud, Arbery, BLM, Black, Brunswick, Georgi... 24

19544 [BLM, Black, Kenosha, Koribanics, Kyle, Lives,... 24

Published_date Views Likes Comments Month Category

0 2023-09-20 214 2 1 Sep News & Politics

1 2023-09-20 5831 252 21 Sep News & Politics

2 2023-09-20 307 1 0 Sep News & Politics

3 2023-09-20 594 2 5 Sep News & Politics

4 2023-09-20 8476 356 455 Sep News & Politics

... ... ... ... ... ... ...

19540 2021-11-11 45850 352 1672 Nov Entertainment

19541 2021-11-11 12842 263 178 Nov Entertainment

19542 2021-11-11 120938 1729 1725 Nov Entertainment

19543 2021-11-11 75927 913 1213 Nov Entertainment

19544 2021-11-11 70126 431 1250 Nov Entertainment

[19545 rows x 9 columns]]

As you can see above, the favorite count is dropped. Now we need to export the data to .csv,
because we do not want to run all of the code above again: the Youtube Data API only allows
10,000 quota units a day.
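As a back-of-the-envelope check (a sketch, assuming the usual costs: 1 quota unit per `playlistItems.list` or `videos.list` call with up to 50 items per page, while a `search.list` call costs 100 units), a dataset of our size fits in one day's quota, but not by enough to re-run it casually:

```python
import math

DAILY_QUOTA = 10_000      # default Youtube Data API v3 daily quota
ITEMS_PER_PAGE = 50       # max results per playlistItems.list / videos.list page
COST_PER_CALL = 1         # assumed cost of one such list call, in quota units

n_videos = 101_846        # size of our final dataset

# One playlistItems.list page plus one batched videos.list call per 50 videos
calls = 2 * math.ceil(n_videos / ITEMS_PER_PAGE)
units = calls * COST_PER_CALL
print(units, units <= DAILY_QUOTA)  # → 4074 True
```

So one full scrape fits in a day, but a few repeated debugging runs would burn through the quota, which is why we save the raw output once and reuse it.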

##Exporting the data

from google.colab import drive

# Because we want to output the .csv file of our scraped data, we
# first need to connect to our Google Drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly
remount, call drive.mount("/content/drive", force_remount=True).
FOLDER_PATH = "/content/drive/MyDrive/AI in EDT project/"

# Instead of concatenating in a loop, append data to a list and
# concatenate once at the end
all_data_list = []
for vid_data in vids_data:
    all_data_list.append(vid_data)

all_vids_data = pd.concat(all_data_list, ignore_index=True)

# Save the combined dataframe to a single .csv file
file_path = FOLDER_PATH + "all_videos_data_rawdata_final.csv"
all_vids_data.to_csv(file_path, index=False)
print(file_path)

/content/drive/MyDrive/AI in EDT project/all_videos_data_rawdata_final.csv

The file is now saved in our Drive folder. Because we are not good at fixing bugs and errors, we
separate our code section by section for ease of fixing. As you can see, this is a bit clunky for
us, but it is the only safe way for us.

import os

# 1. Print the shape of the DataFrame
print(f"The dataset has {all_vids_data.shape[0]} rows and "
      f"{all_vids_data.shape[1]} columns.")

# 2. Print the size of the file in MB
file_path = FOLDER_PATH + "all_videos_data_rawdata_final.csv"
file_size_MB = os.path.getsize(file_path) / (1024 * 1024)  # Convert bytes to MB
print(f"The file size is {file_size_MB:.2f} MB.")

The dataset has 101846 rows and 9 columns.
The file size is 43.37 MB.

Just a small check to show how large our dataset is, in both shape and file size. We are really
proud of the time and hard work we have dedicated here.

Having invested a huge amount of time figuring out how to scrape data effectively, we do not want
to run that part of the code again, so we choose to run each section separately.
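One way to make this section-by-section workflow safer is a small cache-first helper: load the exported .csv if it already exists, and only call the scraping code when it does not. This is a sketch, where `scrape_fn` stands in for whatever function of ours produces the DataFrame:

```python
import os

import pandas as pd


def load_or_scrape(csv_path, scrape_fn):
    """Reuse a previously exported .csv instead of spending API quota again."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = scrape_fn()  # only hit the Youtube Data API when no cache exists
    df.to_csv(csv_path, index=False)
    return df
```

With this, accidentally re-running a notebook cell costs a disk read instead of thousands of quota units.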

The scope of scraping data in section 1 ends right above, as we exported two .csv files: one for
the channel data and one for all of the video data.

#2. Cleaning data

Now we come to cleaning the data. Unlike traditional datasets that contain only numerical data,
ours also has the titles, categories, and tags of the videos, which are text. So our approach to
data cleaning and data analysis is a bit more complicated and different.
We split the cleaning into the two sub-sections below: in one we do the conventional numerical
cleaning, and in the other we try to clean the string data that we have.

##2.1. Cleaning numerical data
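As a preview of what this sub-section does, here is a minimal, illustrative pass over a toy sample that reuses our column names; the exact steps we run on the real dataset may differ:

```python
import pandas as pd

# Toy sample reusing our column names (illustrative values only)
raw = pd.DataFrame({
    "Title": ["Video A", "Video B", "Video B", None],
    "Views": ["1000", "250", "250", "50"],
    "Likes": [10, None, None, 2],
    "Published_date": ["2023-09-19", "2023-09-18", "2023-09-18", "bad date"],
})

df = raw.drop_duplicates()            # drop exact duplicate rows
df = df.dropna(subset=["Title"])      # a video must have a title
df["Views"] = pd.to_numeric(df["Views"], errors="coerce")           # force numeric type
df["Published_date"] = pd.to_datetime(df["Published_date"], errors="coerce")
df["Likes"] = df["Likes"].fillna(0)   # treat missing likes as 0

print(df.dtypes)
```

After this, counts are real numbers instead of strings, dates are real datetimes, and duplicated or title-less rows are gone, which is the minimum needed before any statistics or model training.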

##2.2. Cleaning text data (string data)

Text Analysis is a major application field for machine learning algorithms. However, the raw data,
a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them
expect numerical feature vectors with a fixed size rather than raw text documents with variable
length. source:
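To make this concrete, here is a toy bag-of-words conversion in plain Python: every title becomes a fixed-size vector of word counts over a shared vocabulary. In practice a library class such as scikit-learn's CountVectorizer does this for us, but the idea is just counting:

```python
from collections import Counter

titles = [
    "bts concert film",
    "bts dance practice",
    "minecraft challenge",
]

# Build a fixed vocabulary from every word that appears in any title
vocab = sorted({word for title in titles for word in title.split()})

# Each title becomes a fixed-size vector of word counts
vectors = [[Counter(title.split())[word] for word in vocab] for title in titles]

print(vocab)       # → ['bts', 'challenge', 'concert', 'dance', 'film', 'minecraft', 'practice']
print(vectors[0])  # → [1, 0, 1, 0, 1, 0, 0]
```

These fixed-size vectors are what a model can actually consume, whereas the raw titles cannot be fed in directly.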

In other words, what we do here is only a rough cleaning followed by a word cloud; it does not
yet mean much for analyzing the model. We will see what they do next.

For now, we have divided our work into separate parts; the text

A source to consult for how they handle natural language processing
