Casual Notes on Scraping Taobao

Required libraries:

re
time
random
selenium
openpyxl

Crawler approach

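Open https://www.taobao.com/ with Selenium, type the keyword into the search box and press Enter, then scan the QR code yourself to log in.
First check whether a CAPTCHA has appeared: fetch the page source and look for the text "亲，请拖动下方滑块完成验证" ("Please drag the slider below to complete verification"). If the text is absent, return the content.
If a slider is present, try to pass it automatically, up to three times. If all three attempts fail, solve it by hand.
Next, get the total number of result pages and build the link for each page.
Check for the CAPTCHA once more, then drag the scroll bar to imitate a real user.
Collect all the data.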
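
The step "get the total number of pages and build the link for each page" could look roughly like the sketch below. The notes do not show this part, so everything here is an assumption: the pager text format ("共 N 页"), the `s.taobao.com/search` URL with an `s` item offset, the step of 44 items per page, and the helper names `parse_total_pages` and `build_page_urls` are illustrative only and should be checked against the live site.

```python
import re
from urllib.parse import quote

ITEMS_PER_PAGE = 44  # assumption: item offset between consecutive result pages


def parse_total_pages(pager_text):
    """Pull the page count out of pager text such as '共 100 页，'."""
    match = re.search(r"共\s*(\d+)\s*页", pager_text)
    # Fall back to a single page when no pager text is found
    return int(match.group(1)) if match else 1


def build_page_urls(keyword, total_pages):
    """Return one search URL per result page, addressed by item offset."""
    # Assumed URL pattern; the keyword is percent-encoded for the query string
    base = "https://s.taobao.com/search?q=" + quote(keyword)
    return [f"{base}&s={page * ITEMS_PER_PAGE}" for page in range(total_pages)]
```

With these helpers, `build_page_urls(self.search_content, parse_total_pages(pager_text))` would give the list of links to visit, where `pager_text` is whatever text the pager element on the results page contains.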

Core code:

Checking for a slider CAPTCHA:

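def validation(self):
    # Text shown on the slider CAPTCHA page; kept in Chinese so it matches the page source
    marker = "亲，请拖动下方滑块完成验证"
    content = self.driver.page_source
    attempts = 0
    # Try the automatic slide up to three times
    while marker in content and attempts < 3:
        content = self.hua_kuai()
        attempts += 1
    # Still blocked after three attempts: hand over to the user
    if marker in content:
        print("Tried to pass the slider automatically but could not; please slide it manually.\n")
        input("After sliding manually, wait for the page to finish loading, then type 1 and press Enter to continue: ")
    return self.driver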

Getting past the slider:

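# `ac` is assumed to be an alias for Selenium's ActionChains
from selenium.webdriver import ActionChains as ac
from selenium.webdriver.common.by import By

def hua_kuai(self):
    ele = self.driver.find_element(By.XPATH, '//*[@id="nc_1_n1z"]')
    # Press and hold the slider
    ac(self.driver).click_and_hold(ele).perform()
    # Drag the slider: 300 is the horizontal distance to slide, with a small random vertical offset
    ac(self.driver).move_by_offset(300, random.randint(-5, 5)).perform()
    # Release the mouse button
    ac(self.driver).release().perform()
    # Wait for the page to load
    time.sleep(2)
    try:
        # Click the "slide again" button if it appeared
        self.driver.find_element(By.XPATH, '//*[@id="nc_1_refresh1"]').click()
    except Exception:
        # No retry button (or it is not clickable), so there is nothing to do
        pass
    return self.driver.page_source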
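
The drag above moves the full distance in a single jump. One possible variation, which is not part of the original code, is to split the distance into many small moves that start fast and slow down. The sketch below shows the idea; `build_track` and `drag_like_human` are hypothetical helper names, the step count and pause lengths are arbitrary, and whether this actually helps pass the check is not something these notes establish.

```python
import random


def build_track(distance, steps=20):
    """Split `distance` pixels into `steps` integer moves that start large and shrink."""
    # Position after step i follows 1 - (1 - t)^2 (ease-out), so early moves are the largest
    positions = [round(distance * (1 - (1 - i / steps) ** 2)) for i in range(steps + 1)]
    return [b - a for a, b in zip(positions, positions[1:])]


def drag_like_human(driver, element, distance):
    """Drag `element` to the right by `distance` pixels in small, slightly jittered moves."""
    # Imported here so build_track can be used without Selenium installed
    from selenium.webdriver import ActionChains

    chain = ActionChains(driver).click_and_hold(element)
    for dx in build_track(distance):
        # Small vertical jitter and a short pause between moves
        chain = chain.move_by_offset(dx, random.randint(-2, 2)).pause(random.uniform(0.01, 0.03))
    chain.release().perform()
```

Inside `hua_kuai`, the three separate `ac(...)` calls could then be replaced by `drag_like_human(self.driver, ele, 300)`.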

Initial stage:

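from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def get_page(self):
    # Open the Taobao home page
    self.driver.get('https://www.taobao.com/')
    time.sleep(3)  # Pause briefly so the page can settle
    # Type the keyword into the search box and press Enter to search
    self.driver.find_element(By.XPATH, "//input[@aria-label='请输入搜索文字']").send_keys(self.search_content, Keys.ENTER)
    # Switch to QR-code login
    self.driver.find_element(By.XPATH, '//*[@id="login"]/div[1]/i').click()
    # Allow 20 seconds to log in to your account; adjust to your own speed
    time.sleep(20)
    # Enter the loop that fetches the data for each page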
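
The per-page loop is not shown in these notes. As a rough illustration of the "drag the scroll bar like a real user" step that would run inside it, the sketch below scrolls the page down in a few increments with random pauses. `scroll_like_human` is a hypothetical helper, and the step count and pause range are arbitrary assumptions.

```python
import random
import time


def scroll_like_human(driver, steps=5, pause_range=(0.5, 1.5)):
    """Scroll down the page in `steps` increments with a random pause after each."""
    for i in range(1, steps + 1):
        fraction = i / steps
        # Scroll to a fraction of the document height so lazily loaded items get a chance to render
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight * arguments[0]);", fraction
        )
        time.sleep(random.uniform(*pause_range))
```

In the loop, each iteration would presumably load one page link with `self.driver.get(url)`, call `self.validation()`, then `scroll_like_human(self.driver)` before reading the product data.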