Write a Python program that crawls every link on a given website and generates a sitemap in both HTML and XML formats | ChatGPT | 威海佰年网络技术有限公司


ChatGPT 3 Type: program Tags: python


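Compiled from ChatGPT-generated content found online. The text carries no reference value; the program and code snippets are offered for reference only.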

  

This can be implemented with Scrapy, a Python crawling framework.

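1. Install the Scrapy framework: pip install scrapy

2. Create a project: scrapy startproject project_name

3. Enter the project directory and create a spider: cd project_name && scrapy genspider spider_name website (replace website with the target domain, for example www.example.com)

4. In the spider file, define start_urls and allowed_domains, then implement the page downloader and the URL handler: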

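import scrapy
from urllib.parse import urlparse
from xml.sax.saxutils import escape, quoteattr


class Spider_name(scrapy.Spider):
    name = "Spider_name"
    # Hosts the crawl is allowed to visit
    allowed_domains = ["www.example.com"]
    start_urls = [
        "https://www.example.com/",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Every on-site URL discovered so far
        self.found_urls = set(self.start_urls)

    def parse(self, response):
        """
        Page downloader: collect the links on a page and follow new ones.
        """
        # Extract every href on the current page
        for href in response.xpath("//a/@href").getall():
            # Resolve relative links and drop the #fragment
            url = response.urljoin(href).split("#")[0]
            # Skip mailto:, javascript: and other non-HTTP schemes
            if not url.startswith(("http://", "https://")):
                continue
            # Keep only links that stay on the target site
            if urlparse(url).hostname not in self.allowed_domains:
                continue
            # Avoid crawling the same URL twice
            if url not in self.found_urls:
                self.found_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)

    def closed(self, reason):
        """
        Scrapy calls this hook when the spider finishes; write the results.
        """
        self.save_result(sorted(self.found_urls))

    def save_result(self, list_urls):
        """
        Save the collected URLs.
        """
        # The markup inside the original strings was lost, so the tags below
        # are an assumption: a plain link list for HTML and the sitemaps.org
        # urlset layout for XML.
        # Generate the HTML version
        with open("urls.html", "w", encoding="utf-8") as fp:
            fp.write("<html><body><ul>\n")
            for url in list_urls:
                fp.write("<li><a href=%s>%s</a></li>\n" % (quoteattr(url), escape(url)))
            fp.write("</ul></body></html>\n")
        # Generate the XML version
        with open("urls.xml", "w", encoding="utf-8") as fp:
            fp.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fp.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in list_urls:
                fp.write("<url><loc>" + escape(url) + "</loc></url>\n")
            fp.write("</urlset>\n")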
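
The URL handling and the two output formats can also be tried without Scrapy. The following is a minimal sketch using only the standard library; the helper names normalize_links, build_xml_sitemap and build_html_sitemap are hypothetical and not part of Scrapy, and the XML layout assumes the sitemaps.org urlset format, which the original text does not specify.

```python
from html import escape
from urllib.parse import urldefrag, urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed output format)
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def normalize_links(page_url, hrefs, allowed_host):
    """Resolve hrefs against page_url; keep unique HTTP(S) URLs on allowed_host."""
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links, then strip any #fragment
        url, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar schemes
        if parts.hostname != allowed_host:
            continue  # skip off-site links
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def build_xml_sitemap(urls):
    """Return the URLs as an XML sitemap string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # ElementTree escapes characters such as & in the text
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_html_sitemap(urls):
    """Return the URLs as a simple HTML link list."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return "<!DOCTYPE html>\n<html><body><ul>\n" + items + "\n</ul></body></html>\n"
```

Both builders return strings, so the results can be written to urls.html and urls.xml with an ordinary open(..., "w", encoding="utf-8") call once the crawl has finished.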

Public @ 2023-02-18 14:07:52
