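Technical Overview
Selenium is an automated testing framework for web applications. It runs tests by simulating user actions in a browser, such as clicking buttons and entering text. It supports multiple browsers and operating systems, and test scripts can be written in a variety of programming languages. Its main strengths are ease of use, flexibility, and stability, which is why it is widely used in web testing.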
Installation and Usage
Installation
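The installation step itself is not shown here. Selenium's Python bindings are published on PyPI under the name selenium, so `pip install selenium` is the usual route. As a small sketch, the snippet below checks whether the package is visible to the current interpreter. The helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selenium")
    if version is None:
        print("selenium is not installed; try: pip install selenium")
    else:
        print("selenium " + version + " is installed")
```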
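Demo
The demo uses webdriver.Chrome(options=options) from the Selenium framework to obtain a driver for a newly launched Google Chrome instance. It then calls driver.get(url_detail) to load the current page, including its dynamically rendered content, and uses time.sleep(search_delta_time) to add a fixed delay that gives the content time to finish loading.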
    import time

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchWindowException
    from selenium.webdriver.common.by import By

    # Excerpt from a method of the spider class: constants, logger, proxy_flag,
    # visit_url, key_word, cur_page and search_delta_time are defined elsewhere.
    proxy_address = "http://" + constants.proxy_server_ip + ":" + str(constants.proxy_server_port)
    proxy = {
        "proxyType": "manual",
        "httpProxy": proxy_address,
        "ftpProxy": proxy_address,
        "sslProxy": proxy_address,
        "noProxy": "",
        "proxyAutoconfigUrl": ""
    }

    options = webdriver.ChromeOptions()
    if proxy_flag == 'True':
        # Route browser traffic through the internal proxy.
        options.set_capability("proxy", proxy)
        logger.info("current use internal proxy, proxy content: " + str(proxy['httpProxy']))

    driver = webdriver.Chrome(options=options)
    mode = ''
    url = "https://" + visit_url + "/tags/" + key_word + "/artworks?" + mode
    while True:
        # Build the URL for the current result page and load it.
        url_detail = url_process_page(url, current_page=cur_page)
        logger.info("current use url : " + str(url_detail))
        driver.get(url_detail)
        # Fixed delay so that dynamically rendered content has time to load.
        time.sleep(search_delta_time)
        logger.debug("start load href save url to txt.")
        load_save_flag = load_href_save(driver, key_word)
        if load_save_flag:
            try:
                save_img_url(driver, key_word)
                cur_page += 1
                logger.success("save img all finish,current page: " + str(cur_page))
            except NoSuchWindowException as nswe:
                logger.warning("chrome force exit! detail:" + str(nswe))
                # The browser window is gone, so stop paging instead of
                # calling driver.get() on a closed window.
                break
        else:
            break
    self.success_tips()
    constants.spider_image_flag = False
    logger.warning("google chrome will exit! ")
    driver.quit()
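The demo calls url_process_page and builds the proxy dictionary by hand, but neither step is shown as a reusable helper. A rough sketch of how such helpers might look follows. Both function bodies are assumptions, including the query parameter name "p" for the page number; they are not the original project's code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_proxy_capability(ip, port):
    """Build a manual proxy capability dict with the same keys as the demo.

    Hypothetical helper: the demo builds this dict inline.
    """
    address = "http://" + ip + ":" + str(port)
    return {
        "proxyType": "manual",
        "httpProxy": address,
        "ftpProxy": address,
        "sslProxy": address,
        "noProxy": "",
        "proxyAutoconfigUrl": "",
    }


def url_process_page(url, current_page, page_param="p"):
    """Return url with the page number set in its query string.

    Assumption: the site reads the page number from a query parameter.
    The name "p" is a placeholder, not taken from the original project.
    """
    parts = urlsplit(url)
    # Drop any existing page parameter and keep the others in order.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != page_param]
    query.append((page_param, str(current_page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With these definitions, `url_process_page("https://example.com/tags/cat/artworks?", 2)` yields a URL ending in `artworks?p=2`, and calling it again with another page number replaces the old value instead of appending a second one.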
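The code below uses driver.find_elements(By.CSS_SELECTOR, "a") to collect every link element on the page and stores each element's href value in a list. The list can be filtered later to complete the dynamic scraping of the target data.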
    def load_href_save(driver, key_word):
        image_urls_list = []
        try:
            # Collect every <a> element on the current page.
            image_elements = driver.find_elements(By.CSS_SELECTOR, "a")
            for image_element in image_elements:
                # Read the resolved href through JavaScript and keep the result.
                image_url = driver.execute_script("return arguments[0].href;", image_element)
                if not image_url:
                    continue  # skip anchors that have no href
                image_urls_list.append(image_url)
                logger.debug("load href and start save img url: " + image_url)
            return True
        except Exception as un_e:
            logger.error("Error, unknown error, detail:" + str(un_e))
            return False
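The filtering step mentioned earlier is not shown in the original. As one possible sketch, the function below keeps only absolute http(s) links whose path contains a given fragment and drops duplicates while preserving order. The function name `filter_hrefs` and the "/artworks/" fragment are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def filter_hrefs(hrefs, path_fragment="/artworks/"):
    """Keep unique http(s) URLs whose path contains path_fragment, in first-seen order."""
    seen = set()
    kept = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield an empty string or None
        parts = urlsplit(href)
        if parts.scheme not in ("http", "https"):
            continue  # skip javascript:, mailto: and similar links
        if path_fragment not in parts.path:
            continue  # not one of the target pages
        if href in seen:
            continue  # already collected
        seen.add(href)
        kept.append(href)
    return kept
```

A function like this could be applied to the list of href values once the loop has finished, so the scraping code and the selection rule stay separate and the rule is easy to change.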