Freedium

Let's Use Python to Scrape Some Online Movies/Videos
Peng Cao

~3 min read · January 7, 2024 (Updated: January 7, 2024) · Free: No

Let's walk through the steps to scrape and reconstruct a video from the web with
Python. Most websites do not serve a video as a single file; they split it into
small segments and list those segments in an M3U8 file. By downloading the
segments and arranging them in the order given by the M3U8 file, we can
reconstruct the complete video.

How videos are stored


Typically, to display a video resource on a webpage, there must be a <video> tag:

<video src="xxx.mp4"></video>

On most video websites, the src attribute inside this <video> tag is not the
actual download address of the video. Almost no video website directly provides
a download address within the <video> tag.

Serving the whole video as one large file would lead to a poor user experience,
as it negatively impacts both network speed and memory usage.

A better solution is to slice the video into segments (ts). Each segment is assigned
a unique URL. Once all the segments are obtained, they can be properly arranged
and merged to create a complete video.

Since the video is divided into numerous small fragments, a file is required to
record the paths of these fragments in playback order. This file is generally an
M3U file; when its content is encoded in UTF-8, it is called an M3U8 file. Most
major video platforms now use M3U8 files.
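To make the format concrete, here is a minimal sketch of what such a file can look like and how the fragment paths might be pulled out of it. The playlist text, the segment names and the helper list_segments are made up for illustration; real playlists usually carry more tags.

```python
# Hypothetical playlist; the segment names are invented for illustration.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXTINF:4.5,
seg2.ts
#EXT-X-ENDLIST
"""


def list_segments(playlist_text):
    """Return the segment paths in playback order."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with "#" are tags or comments; everything else is a path
        if line and not line.startswith("#"):
            segments.append(line)
    return segments


segments = list_segments(SAMPLE_PLAYLIST)
print(segments)
```

The order of the non-comment lines is the playback order, which is why the later steps can rebuild the video simply by following the file from top to bottom.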

Websites that take this approach load a video in the following sequence:

1. Request the M3U8 file.

2. Load the segment (ts) files.

3. Play the video normally.

This method offers several advantages, such as saving network resources: only
the segments that are actually watched need to be transferred. When a user
fast-forwards, the player can locate and request the corresponding ts file
directly, which improves the user experience and reduces the load on the server.
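As a rough sketch of why segmenting helps with fast-forwarding: given the duration of each segment (the #EXTINF values in an M3U8 file), it is cheap to work out which single segment covers a target time. The function name segment_for_time and the durations below are assumptions for illustration, not how any particular player is implemented.

```python
from bisect import bisect_right
from itertools import accumulate


def segment_for_time(durations, seconds):
    """Return the index of the segment that contains the given playback time."""
    if not durations:
        raise ValueError("durations must not be empty")
    if seconds < 0:
        raise ValueError("playback time must not be negative")
    # End time of each segment, e.g. [10.0, 20.0, 24.5]
    end_times = list(accumulate(durations))
    index = bisect_right(end_times, seconds)
    # Clamp times at or past the end of the video to the last segment
    return min(index, len(durations) - 1)


durations = [10.0, 10.0, 4.5]  # hypothetical #EXTINF values, in seconds
print(segment_for_time(durations, 12.0))
```

Only the segment at the returned index (and the ones after it) has to be fetched, instead of everything from the start of the video.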

Steps to obtain and reconstruct the video


1. Obtain the first-level M3U8 file address by inspecting the webpage source
code.

import json

import requests
from lxml import etree


def get_first_m3u8_url():
    # Fetch the page source code
    url = "https://www.yunbtv.org/vodplay/sandadui-2-1.html"
    resp = requests.get(url)
    resp.encoding = "utf-8"

    tree = etree.HTML(resp.text)

    # Locate the inline script that defines the player configuration.
    # This line is cut off in the source after "/text"; the ending is an assumed completion.
    script_content = tree.xpath('//script[contains(text(), "player_aaaa")]/text()')[0]

    # Extract the JSON part from the script
    # (the closing ") + 1]" is likewise an assumed completion)
    json_str = script_content[script_content.find('{'):script_content.rfind('}') + 1]

    # Parse the JSON string
    data = json.loads(json_str)

    # Extract the URL value
    url_value = data.get("url", "")

    print(url_value)
    # Return the address so the next step can use it
    return url_value
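The JSON extraction can be tried without touching the network. The sketch below applies the same find/rfind slicing to a made-up script body; the helper extract_player_url and the sample fields are assumptions, and the real page may structure its script differently.

```python
import json


def extract_player_url(script_text):
    """Pull the "url" field out of a script that assigns a JSON object."""
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    data = json.loads(script_text[start:end + 1])
    return data.get("url", "")


# Hypothetical script body; the real page's fields may differ.
sample = 'var player_aaaa={"flag":"play","url":"https:\\/\\/example.com\\/v\\/index.m3u8"}'
print(extract_player_url(sample))
```

Slashes are often escaped as \/ inside inline JSON; json.loads turns them back into plain slashes, so no manual replacement is needed.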

2. Download the first-level M3U8 file and extract the second-level M3U8 file
address.

import requests


def download_m3u8_file(first_m3u8_url):
    resp = requests.get(first_m3u8_url)
    resp.encoding = "utf-8"
    # The last whitespace-separated token is the path of the second-level M3U8 file
    url2 = resp.text.split()[-1]
    # Remove the last segment of the first URL (remove '/index.m3u8')
    base_url = first_m3u8_url.rsplit('/', 1)[0]
    # Second-level M3U8 address
    second_m3u8_url = f"{base_url}/{url2}"

    # Download M3U8 file
    m3u8_resp = requests.get(second_m3u8_url)
    m3u8_resp.encoding = "utf-8"

    with open("m3u8.txt", mode="w", encoding="utf-8") as f:
        f.write(m3u8_resp.text)
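Joining the base URL and the path by hand works when the first-level file lists a plain relative path. A slightly more forgiving sketch uses urllib.parse.urljoin, which also copes with root-relative paths and full URLs. The helper resolve_second_m3u8 and the sample playlist are hypothetical.

```python
from urllib.parse import urljoin


def resolve_second_m3u8(first_m3u8_url, first_m3u8_text):
    """Return the absolute URL of the last playlist listed in a first-level M3U8."""
    entries = [
        line.strip()
        for line in first_m3u8_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    if not entries:
        raise ValueError("no playlist entry found")
    # urljoin handles relative paths, root-relative paths and absolute URLs
    return urljoin(first_m3u8_url, entries[-1])


# Hypothetical first-level playlist and URL
master_text = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1000000,RESOLUTION=1280x720
1000kb/hls/index.m3u8
"""
print(resolve_second_m3u8("https://example.com/20240107/abc/index.m3u8", master_text))
```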

3. Parse the second-level M3U8 file and crawl the video segments.

import asyncio
import os

import aiofiles
import aiohttp


# Download a single ts file
async def download_one(url):
    print("Downloading: " + url)
    file_name = url.split("/")[-1]
    # Retry up to 10 times to guard against transient download failures
    for i in range(10):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    # Treat HTTP error statuses as failures so they are retried
                    resp.raise_for_status()
                    content = await resp.content.read()
            async with aiofiles.open(f"./TsFiles/{file_name}", mode="wb") as f:
                await f.write(content)
            break
        except Exception:
            print("Download failed: " + url)
            # Wait longer after each failed attempt
            await asyncio.sleep((i + 1) * 5)


# Download all ts files
async def download_all_ts():
    # Make sure the output directory exists
    os.makedirs("./TsFiles", exist_ok=True)
    # Prepare the task list
    tasks = []
    # Read the m3u8 file
    with open("m3u8.txt", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and all lines starting with #
            if not line or line.startswith("#"):
                continue
            task = asyncio.create_task(download_one(line))
            tasks.append(task)

    # Wait for all tasks to finish (asyncio.wait rejects an empty list)
    if tasks:
        await asyncio.wait(tasks)
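Starting one task per segment can mean hundreds of simultaneous requests for a long video. One possible way to cap that, sketched here with asyncio.Semaphore, is to let only a fixed number of downloads run at once. The names run_limited and fake_fetch, the limit, and the URLs are assumptions; fake_fetch stands in for a real download so the sketch runs offline.

```python
import asyncio


async def run_limited(urls, fetch, limit=8):
    """Run fetch(url) for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    # gather keeps results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url):
    # Stand-in for a real download so the sketch runs without network access
    await asyncio.sleep(0.01)
    return url.split("/")[-1]


urls = [f"https://example.com/seg{i}.ts" for i in range(5)]  # hypothetical URLs
names = asyncio.run(run_limited(urls, fake_fetch, limit=2))
print(names)
```

In a real run, a coroutine that downloads one segment would be passed in place of fake_fetch.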

4. Merge the TS files to reconstruct the MP4 file. This step relies on the
ffmpeg executable.

import os


def merge_ts_files():
    print("Merging files")
    name_list = []
    with open("m3u8.txt", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and all lines starting with #
            if not line or line.startswith("#"):
                continue
            file_name = line.split("/")[-1]
            name_list.append(file_name)

    # Write the file list in the format expected by ffmpeg's concat demuxer
    with open("./TsFiles/m3u8.txt", mode="w", encoding="utf-8") as f:
        for data in name_list:
            f.write("file " + "'" + data + "'" + "\n")

    # Record the current working directory
    now_dir = os.getcwd()
    # Change the working directory
    os.chdir("./TsFiles")
    # The command is cut off in the source after "out"; "out.mp4" is an assumed output name
    os.system("D:\\ffmpeg\\ffmpeg.exe -f concat -safe 0 -i m3u8.txt -c copy out.mp4")
    # Switch back to the original working directory after all operations
    os.chdir(now_dir)
    print("File merging completed")
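A variation on the merge step, sketched under a few assumptions: ffmpeg is reachable on the PATH as "ffmpeg", the output is named out.mp4, and the helper names build_concat_list and merge_with_ffmpeg are made up here. It passes the working directory to subprocess.run instead of calling os.chdir, and raises an error if ffmpeg exits with a failure code.

```python
import subprocess


def build_concat_list(playlist_text):
    """Turn M3U8 text into the lines of an ffmpeg concat list file."""
    lines = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        file_name = line.split("/")[-1]
        lines.append(f"file '{file_name}'")
    return lines


def merge_with_ffmpeg(ts_dir, list_name="m3u8.txt", output="out.mp4", ffmpeg="ffmpeg"):
    """Run ffmpeg's concat demuxer inside ts_dir; raises if ffmpeg fails."""
    command = [ffmpeg, "-f", "concat", "-safe", "0", "-i", list_name, "-c", "copy", output]
    # cwd replaces the os.chdir round trip; check=True surfaces ffmpeg errors
    subprocess.run(command, cwd=str(ts_dir), check=True)


# Hypothetical playlist with absolute segment URLs
sample = (
    "#EXTM3U\n"
    "#EXTINF:10.0,\nhttps://example.com/a/seg0.ts\n"
    "#EXTINF:10.0,\nhttps://example.com/a/seg1.ts\n"
)
print("\n".join(build_concat_list(sample)))
```

merge_with_ffmpeg is only defined, not called, so the sketch runs on a machine without ffmpeg installed.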

Thanks for reading! Happy hacking!


#coding #programming #web-scraping #python #software-engineering
