[爬虫脚本] 基于Python的网站爬虫实战：全自动下载优品PPT模板

项目背景

在许多工作与学习场景中，高质量的PPT模板是提升演示文稿专业度的关键资源。“优品PPT”（ypppt.com）是一个提供大量免费、优质PPT模板的公开网站。然而，当需要批量获取或按分类归档这些资源时，手动下载的方式效率低下且过程繁琐。

为了解决这一痛点，本文将通过一个完整的实战项目，演示如何利用Python构建一个自动化爬虫，实现对该网站PPT模板的批量下载和分类归档。

需求分析与目标设定

在编码之前，首先需要明确项目的核心需求和预期目标。一个健壮的爬虫脚本应具备以下功能：

分类自动发现：动态抓取网站的所有模板分类，避免因网站结构更新导致脚本失效。
分页智能遍历：准确探测每个分类下的总页数，确保数据抓取的完整性。
深度链接解析：模拟用户操作，从列表页 -> 详情页 -> 下载页，最终定位到文件的真实下载地址。
自动化下载与归档：将文件下载至本地，并依据其分类和原始标题进行结构化存储。
良好的交互性：提供命令行接口（CLI），允许用户选择目标分类和下载范围。
基础反爬应对：通过设置User-Agent和随机延时，降低被目标网站屏蔽的风险，并能妥善处理百度网盘等特殊链接。

技术栈选型

针对上述需求，我们选择一套成熟且高效的Python库组合：

requests: 业界标准的HTTP库，用于与服务器进行网络通信，获取网页HTML。
BeautifulSoup4: 强大的HTML/XML解析库，能够轻松地从复杂的文档树中提取所需数据。
os: Python内置库，用于处理文件系统操作，如创建目录。
re: Python内置的正则表达式库，用于从非结构化文本中匹配和提取特定模式的数据（如网盘提取码）。

核心实现逻辑解析

爬虫的执行流程遵循一个清晰的逻辑链条。下面对各关键步骤的函数实现进行分析。

1. 发现并提取全部分类 (`get_ppt_categories`)

爬虫的入口点是获取所有可爬取的目标范围。通过分析模板主页 (/moban/) 的DOM结构，可以定位到包含分类信息的导航菜单，并从中提取所有分类的名称和URL。

# 实现思路：
# 1. 向目标URL发送GET请求。
# 2. 使用BeautifulSoup解析响应的HTML。
# 3. 通过CSS选择器 `div.menu a` 精准定位到所有分类的<a>标签。
# 4. 遍历标签列表，提取文本内容（分类名）和`href`属性（URL），并进行有效性过滤，最终存入字典。
def get_ppt_categories():
    # ... 源码略 ...

2. 动态探测分类的总页数 (`get_total_pages`)

为了实现完整抓取，必须预先知道每个分类下列表页的总数。一种稳健的策略是模拟用户翻页行为：持续请求下一页，直到出现“下一页”按钮消失或服务器返回404状态码为止。

关键细节：通过开发者工具分析发现，该站点的分页URL模式为 list-{page}.html。在代码中正确构造此URL是探测成功的关键。

# 实现思路：
# 1. 初始化页码为1，进入一个无限循环。
# 2. 根据当前页码构造URL并发起请求。
# 3. 检查HTTP状态码，若为404则表明已超出最大页数，终止循环。
# 4. 解析页面，查找文本为“下一页”的<a>标签。若不存在，同样终止循环。
# 5. 若存在，则页码加一，并加入一个短暂的随机延时，继续下一次探测。
def get_total_pages(category_url):
    # ... 源码略 ...

3. 抓取列表页中的详情页URL (`get_ppt_list_from_category`)

在确定了页数范围后，即可遍历所有列表页，提取其中每个PPT条目指向详情页的URL。

# 实现思路：
# 1. 循环遍历指定的页数。
# 2. 访问每个列表页的URL。
# 3. 使用CSS选择器 `ul.posts.clear > li > a.p-title` 提取所有详情页的链接。
# 4. 将提取到的URL添加到一个全局列表中。
def get_ppt_list_from_category(category_url, max_pages):
    # ... 源码略 ...

4. 解析并下载最终文件 (`download_ppt_file`)

这是整个流程的终点。此函数负责完成从详情页到最终文件下载的完整链路。

该过程涉及两次页面跳转：首先从详情页找到下载页的链接，再从下载页解析出文件的直接下载地址（Direct Link）。

# 实现思路：
# 1. 访问下载页URL，解析HTML。
# 2. 提取页面H1标签作为文件名，并找到最终文件所在的<a>标签。
# 3. **特殊情况处理**：检查链接中是否包含 "pan.baidu.com"。如果是，则使用正则表达式搜索页面文本中的“提取码”，并输出提示信息，不执行下载。
# 4. **常规下载**：对于直链，使用 `requests.get(stream=True)` 发起流式下载请求。
# 5. 清理文件名中的非法字符，并以二进制块（chunk）的形式将文件内容写入本地磁盘。
def download_ppt_file(download_page_url, save_folder):
    # ... 源码略 ...

使用说明

环境配置：确保已安装Python 3，并通过pip安装必要的依赖库：
```
pip install requests beautifulsoup4
```
代码保存：将附录中的完整源码保存为 .py 文件，例如 ypppt_spider.py。
脚本执行：在终端中运行该脚本：
```
python ypppt_spider.py
```
交互式操作：根据终端输出的提示，依次输入目标分类的编号和希望下载的页数（或输入all下载全部）。脚本将自动执行后续任务。

完整源码

import os
import time
import requests
from bs4 import BeautifulSoup
import random
import re

# --- 全局配置 ---
BASE_URL = "https://www.ypppt.com"
SAVE_DIR = "PPT_Downloads_Perfect"
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64 ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

# ==============================================================================
# 函数区: 修复分页URL格式问题
# ==============================================================================

def get_ppt_categories():
    """从模板主页 /moban/ 获取所有分类"""
    print("正在从模板主页探测全部分类目录...")
    try:
        response = requests.get(f"{BASE_URL}/moban/", headers=HEADERS)
        response.raise_for_status()
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        
        category_links = soup.select('div.menu a')
        if not category_links:
            print("错误：在模板主页未能探测到任何分类链接。")
            return None
            
        categories = {}
        for link in category_links:
            name = link.text.strip()
            href = link.get('href', '')
            # 过滤有效分类链接
            if name and href and '/moban/' in href and not link.has_attr('style'):
                categories[name] = href
        
        print(f"✓ 全部分类目录探测成功！共找到 {len(categories)} 个分类。")
        return categories
    except Exception as e:
        print(f"错误：访问模板主页失败 - {e}")
        return None

def get_total_pages(category_url):
    """修复分页URL格式，使用连字符而非下划线"""
    print(f"  - 正在探测分类 “{category_url}” 的总页数...")
    total_pages = 1
    base_url = f"{BASE_URL}{category_url}".rstrip('/')
    
    while True:
        # 关键修复：分页URL使用连字符 "list-{page}.html" 而非下划线
        current_url = base_url if total_pages == 1 else f"{base_url}/list-{total_pages}.html"
        print(f"    探测第{total_pages}页: {current_url}")  # 调试输出当前URL
        try:
            response = requests.get(current_url, headers=HEADERS, timeout=10)
            print(f"    响应状态: {response.status_code}")  # 显示响应状态码
            
            if response.status_code == 404:
                print(f"    第{total_pages}页不存在，停止探测")
                break
            
            response.raise_for_status()
            response.encoding = 'utf-8'
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # 根据页面源码优化下一页按钮识别
            next_page = soup.find('a', string='下一页')
            
            if not next_page:
                print(f"    未找到下一页按钮，当前页{total_pages}为最后一页")
                break
            
            # 验证下一页URL是否有效（防止虚假按钮）
            next_page_url = next_page.get('href', '')
            if not next_page_url or 'list-' not in next_page_url:
                print(f"    下一页链接无效，停止探测")
                break
            
            total_pages += 1
            time.sleep(random.uniform(1, 1.5))  # 延迟避免反爬
            
        except Exception as e:
            print(f"    探测第{total_pages}页出错: {e}，停止探测")
            break
    
    print(f"  - ✓ 探测到总页数为: {total_pages}")
    return total_pages

def get_ppt_list_from_category(category_url, max_pages):
    """修复列表页URL格式，匹配网站实际分页"""
    all_detail_urls = []
    for page_num in range(1, max_pages + 1):
        base_cat_url = category_url.rstrip('/')
        # 修复列表页URL格式：使用连字符
        list_url = f"{BASE_URL}{base_cat_url}" if page_num == 1 else f"{BASE_URL}{base_cat_url}/list-{page_num}.html"
        
        print(f"  - 正在分析第 {page_num}/{max_pages} 页: {list_url}")
        try:
            response = requests.get(list_url, headers=HEADERS)
            if response.status_code == 404:
                print("  - 提示: 页面不存在，已到达最后一页。")
                break
            response.raise_for_status()
            response.encoding = 'utf-8'
            soup = BeautifulSoup(response.text, 'html.parser')
            ppt_links = soup.select('ul.posts.clear > li > a.p-title')
            if not ppt_links:
                print("  - 提示: 此页未找到PPT链接，已到达最后一页。")
                break
            all_detail_urls.extend([link['href'] for link in ppt_links])
            time.sleep(random.uniform(0.5, 1.5))
        except Exception as e:
            print(f"  - 错误: 分析列表页 {list_url} 时出错 - {e}")
            break
    return all_detail_urls

def get_download_page_url(detail_page_url):
    try:
        full_url = f"{BASE_URL}{detail_page_url}"
        response = requests.get(full_url, headers=HEADERS)
        response.raise_for_status()
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        down_button = soup.select_one('a.down-button')
        return down_button['href'] if down_button and down_button.has_attr('href') else None
    except Exception as e:
        print(f"  - 错误: 访问详情页 {full_url} 时出错 - {e}")
        return None

def download_ppt_file(download_page_url, save_folder):
    try:
        full_url = f"{BASE_URL}{download_page_url}"
        response = requests.get(full_url, headers=HEADERS)
        response.raise_for_status()
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        title = (soup.select_one('div.de > h1') or soup.new_tag('h1')).text.strip().replace(' - 下载页', '') or f"ppt_{random.randint(1000, 9999)}"
        final_link_tag = soup.select_one('ul.down.clear > li > a')
        if not final_link_tag or not final_link_tag.has_attr('href'):
            print(f"  - 警告：在下载页 {full_url} 未找到最终下载链接。")
            return
        file_url = final_link_tag['href']
        if "pan.baidu.com" in file_url:
            page_text = soup.get_text()
            match = re.search(r'提取码\s*[:：]\s*([a-zA-Z0-9]{4})', page_text)
            pass_code = match.group(1) if match else "未找到"
            print(f"  - 提示：检测到百度网盘资源，请手动下载。")
            print(f"    - 标题: {title}\n    - 地址: {file_url}\n    - 提取码: {pass_code}")
            return
        print(f"  ● 正在下载：{title}")
        file_response = requests.get(file_url, headers=HEADERS, stream=True)
        file_response.raise_for_status()
        file_name = f"{title}.zip"
        file_name = re.sub(r'[\\/*?:"<>|]', "", file_name)
        save_path = os.path.join(save_folder, file_name)
        with open(save_path, 'wb') as f:
            for chunk in file_response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"    ✓ 下载完成，已保存至: {save_path}")
    except Exception as e:
        print(f"  - 错误：下载文件时失败 - {e}")

# ==============================================================================
# 主程序区
# ==============================================================================

if __name__ == "__main__":
    categories = get_ppt_categories()
    if not categories: exit()

    print("\n--- 请选择您要下载的PPT分类 ---")
    cat_list = list(categories.items())
    for i, (name, url) in enumerate(cat_list):
        print(f"  [{i+1}] {name}")
    
    while True:
        try:
            choice = int(input("\n请输入分类编号: ").strip())
            if 1 <= choice <= len(cat_list):
                selected_index = choice - 1
                break
            else: print("输入编号超出范围，请重新输入。")
        except ValueError: print("输入无效，请输入一个数字编号。")

    cat_name, cat_url = cat_list[selected_index]
    
    # 自动探测总页数（已修复URL格式问题）
    total_pages = get_total_pages(cat_url)
    
    while True:
        try:
            page_choice_str = input(f"请输入您想下载的页数 (输入 'all' 下载全部 {total_pages} 页，或输入具体数字): ").strip().lower()
            if page_choice_str == 'all':
                pages_to_scrape = total_pages
                break
            pages_to_scrape = int(page_choice_str)
            if 0 < pages_to_scrape <= total_pages:
                break
            else:
                print(f"页数必须在 1 到 {total_pages} 之间。")
        except ValueError: print("输入无效，请输入 'all' 或一个数字。")

    if not os.path.exists(SAVE_DIR): os.makedirs(SAVE_DIR)
    
    category_folder = os.path.join(SAVE_DIR, cat_name)
    if not os.path.exists(category_folder): os.makedirs(category_folder)
    
    print(f"\n{'='*20} 开始下载分类: “{cat_name}” (共 {pages_to_scrape} 页) {'='*20}")

    detail_page_urls = get_ppt_list_from_category(cat_url, pages_to_scrape)
    
    if not detail_page_urls:
        print("未能从此分类获取到任何PPT列表。")
    else:
        total_count = len(detail_page_urls)
        print(f"\n成功获取到 {total_count} 个PPT，开始逐一处理...")
        for i, detail_url in enumerate(detail_page_urls):
            print(f"\n--- 处理第 {i+1}/{total_count} 个PPT: {detail_url} ---")
            download_page_url = get_download_page_url(detail_url)
            
            if download_page_url:
                download_ppt_file(download_page_url, category_folder)
            else:
                print(f"  - 警告: 未能从 {detail_url} 获取到下载页面链接。")
            
            sleep_time = random.uniform(1, 2)
            print(f"      ...等待 {sleep_time:.2f} 秒...")
            time.sleep(sleep_time)
    
    print(f"\n{'='*20} “{cat_name}” 分类下载任务已完成！ {'='*20}")
    print(f"所有文件已保存在 “{os.path.abspath(SAVE_DIR)}” 文件夹中。")
    print("\n--- 解压提示 ---\n如果解压后的文件名出现乱码，请尝试使用命令: unzip -O GBK '文件名.zip'")