Python Learning Summary

Python Programming Basics

Setting Up the Programming Environment

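Download Python: go to https://www.python.org/downloads/windows/, choose a Python 3 release, and download the executable installer.
Install Python: the Python 3.6 installer can add Python to the PATH environment variable automatically; with Python 2 the environment variable has to be configured manually.
Install and configure PyCharm: download it from http://www.jetbrains.com/pycharm/ and choose the Community edition.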

Python Fundamentals

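Variables and types, variable naming, operators
if ... elif ... else
for-in loops
Functions and parameters: the def keyword, default parameter values, variable-length arguments
Variable scope: local scope, enclosing (nested) scope, global scope
Strings and common string methods
Common data structures: list (allows duplicates, array-like), tuple (values cannot be modified), set (no duplicate values), dict (key-value pairs, similar to a map)
Object-oriented programming: defining classes, inheritance (a subclass inherits the attributes and methods of its parent class)
File reading: file open modes, and catching exceptions to make programs more robust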
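
Several of the topics above can be illustrated together in one short sketch. Every name in it (describe, make_counter, summarize, Animal, Dog, read_text) is hypothetical and chosen only for illustration; it is one possible way to exercise these features, not a prescribed pattern.

```python
def describe(name, greeting="Hello", *tags):
    """Default parameter value plus variable-length positional arguments."""
    suffix = " [" + ", ".join(tags) + "]" if tags else ""
    return "%s, %s%s" % (greeting, name, suffix)


def make_counter():
    """Enclosing (nested) scope: the inner function updates count via nonlocal."""
    count = 0

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


def summarize(words):
    """A list keeps duplicates, a set drops them, a dict maps word -> count."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return sorted(set(words)), counts


class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return "%s makes a sound" % self.name


class Dog(Animal):
    """Inherits __init__ and the name attribute from Animal; overrides speak."""

    def speak(self):
        return "%s barks" % self.name


def read_text(path, default=""):
    """Return the file's text, or default if the file cannot be opened."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return default
```

Catching OSError in read_text means a missing or unreadable file yields the fallback value instead of stopping the program, which is the kind of robustness the last item refers to.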

Basic Web Crawler Techniques

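Definition: a web crawler is a bot program that automatically browses the World Wide Web and collects information according to a set of rules.
Types: general-purpose web crawlers (which expand from a seed URL to the whole web using a depth-first or breadth-first strategy), focused web crawlers, incremental web crawlers, and Deep Web crawlers
Workflow: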

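Set the crawl target (the seed page or start page) and fetch the page.
If the server cannot be reached, retry the download up to a specified number of times.
Set a user agent or hide the real IP when necessary; otherwise the page may be inaccessible.
Decode the fetched page as needed, then extract the required information.
Extract the links in the fetched page by some means (for example, regular expressions).
Process the links further (fetch those pages and repeat the steps above).
Persist the useful information for later processing.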
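
The retry, user-agent, decoding, and link-extraction steps above can be sketched with the standard library alone. This is a minimal sketch under assumptions: the names fetch, fetch_with_retries, and extract_links, the USER_AGENT string, and the default retry count and delay are all hypothetical and not taken from the notes.

```python
import re
import time
import urllib.request
from urllib.parse import urljoin

# Hypothetical user-agent string, used only for illustration.
USER_AGENT = "example-crawler/0.1"


def fetch(url, timeout=10):
    """Download url with a custom User-Agent and decode the body to text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_with_retries(url, retries=3, delay=1.0, fetcher=fetch):
    """Call fetcher(url); on a network error, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fetcher(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == retries:
                raise
            time.sleep(delay)


def extract_links(html, base_url):
    """Pull href values out of the markup with a regex and make them absolute."""
    hrefs = re.findall(r'href=["\'](.*?)["\']', html)
    return [urljoin(base_url, href) for href in hrefs]
```

The fetcher parameter exists so the retry logic can be exercised without network access. The regular expression is deliberately crude; an HTML parser is usually more reliable on real pages.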

How it works:

A site's robots.txt file states, for each user-agent, which content may or may not be accessed.
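
A crawler can check these rules before fetching a page. The sketch below uses the standard library's urllib.robotparser; the sample rules, the build_parser helper, and the user-agent name are hypothetical. It parses rules from a string so it runs without network access, whereas a real crawler would more likely point the parser at the site's robots.txt URL and call read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed_public = parser.can_fetch("example-crawler", "http://example.com/public/page")
allowed_private = parser.can_fetch("example-crawler", "http://example.com/private/page")
```

Under these sample rules, allowed_public is True and allowed_private is False.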

Crawler design:

Download pages: urllib.request or requests
Parse pages: BeautifulSoup
Simulate interaction and handle JavaScript-driven dynamic pages: Selenium

from urllib.parse import urljoin
import re
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Baiduspider'}
base_url = 'https://www.zhihu.com/'
seed_url = urljoin(base_url, 'explore')
l1 = [0, 0, 0, 0]  # l1[k - 1] counts the new links discovered at depth k


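def spider(url, c):
    """Crawl from url down to depth c; return the page titles and visited links."""
    link_set = []  # visited URLs in crawl order (a list, despite the name)
    titles = []
    link_set.append(url)
    print('Begin!')
    dfs(url, 1, c, link_set, titles)
    print('End: crawled %d pages in total!' % len(link_set))
    return titles, link_set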


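def dfs(url, k, c, links, titles):
    """Depth-first crawl: fetch url at depth k and stop descending at depth c."""
    try:
        resp = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        titles.append(None)  # keep titles aligned with links
        return
    soup = BeautifulSoup(resp.text, 'lxml')
    href_regex = re.compile(r'^/question')
    # soup.title is None when the page has no <title> tag
    titles.append(soup.title.string if soup.title else None)
    if k == c:
        return
    for a_tag in soup.find_all('a', {'href': href_regex}):
        href = a_tag.attrs['href']
        full_url = urljoin(base_url, href)
        if full_url in links:
            continue
        links.append(full_url)
        l1[k - 1] += 1
        dfs(full_url, k + 1, c, links, titles)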


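titles, links = spider(seed_url, 4)
# 'w' opens the file in text mode; 'wb' would write binary data instead
with open('test.txt', 'w', encoding='utf-8') as f:
    # One line per page: URL, a tab, then the page title
    for link, title in zip(links, titles):
        f.write('%s\t%s\n' % (link, title))
    # Then the number of new links found at each depth
    for x in l1:
        f.write(str(x) + '\n')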
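
The crawler above follows links depth-first. For comparison, here is a minimal sketch of the breadth-first strategy mentioned earlier. The names bfs_crawl and get_links are hypothetical: get_links stands in for "fetch the page and extract its links" and is passed in as a parameter, so the traversal can be tried on an in-memory link graph without network access.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_depth):
    """Visit pages level by level, starting from seed (depth 1).

    get_links(url) must return the URLs linked from that page.
    Pages at max_depth are recorded but their links are not followed.
    """
    visited = {seed}           # set membership tests are O(1)
    order = [seed]             # URLs in the order they were discovered
    queue = deque([(seed, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"]}
pages = bfs_crawl("a", lambda url: graph.get(url, []), max_depth=3)
```

With this graph and max_depth=3, pages is ['a', 'b', 'c', 'd']: "e" is never reached because "d" already sits at the depth limit.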

Scrapy: An Advanced Crawling Framework

Rules

name: required
At least one of start_urls or start_requests must be present
Run: scrapy runspider quotes_spider.py -o quotes.json
parse: the default callback

Key concepts: CSS selectors, the yield keyword
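
Because spider callbacks rely on yield, a plain-Python generator may help show what the keyword does. This sketch does not use Scrapy; the function parse_rows and its input data are hypothetical.

```python
def parse_rows(rows):
    """Yield one dict per (text, author) pair instead of building a full list."""
    for text, author in rows:
        # Execution pauses here until the caller asks for the next item.
        yield {"text": text, "author": author}


rows = [("First quote", "Author A"), ("Second quote", "Author B")]
items = parse_rows(rows)      # nothing has run yet; items is a generator
first = next(items)           # runs the body up to the first yield
remaining = list(items)       # consumes the rest
```

A callback written this way hands results back one at a time, so the caller can process each item as it arrives rather than waiting for the whole page to be parsed.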

response.follow

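Supports relative URLs, so there is no need to call urljoin
Accepts selectors as well as strings
Accepts <a> elements directly, using their href attribute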

Main commands

Show help:

scrapy -h

Create a project:

scrapy startproject -h

scrapy startproject tutorial

Run a spider:

scrapy crawl -h

scrapy crawl quotes

import scrapy


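# Alternative version (commented out because it shares the class name below):
# it extracts each quote's text and author, then follows the "next page" link.
# class QuoteSpider(scrapy.Spider):
#     name = 'quotes'
#     start_urls = [
#         'http://quotes.toscrape.com/tag/humor/',
#     ]
#
#     def parse(self, response):
#         for quote in response.css('div.quote'):
#             yield {
#                 'text': quote.css('span.text::text').get(),
#                 'author': quote.xpath('span/small/text()').get(),
#             }
#         # Look for the next page once per response, outside the loop
#         next_page = response.css('li.next a::attr("href")').get()
#         if next_page is not None:
#             yield response.follow(next_page, self.parse)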

class QuoteSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The page number is the second-to-last segment of URLs like .../page/1/
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
