mirror of
https://github.com/NikolaiT/se-scraper.git
synced 2024-11-28 10:33:54 +01:00
46 lines
1.0 KiB
Plaintext
24.12.2018
- fix interface to scrape() [DONE]
- add to GitHub


24.1.2019

- fix issue #3: add functionality to add keyword file
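
A minimal sketch of how a keyword file could be read. The function name
parseKeywords, the file name keywords.txt and the file format (one keyword
per line, blank lines and lines starting with '#' skipped) are assumptions
for illustration, not the project's actual interface.

```javascript
// Hypothetical helper: turn the contents of a keyword file into a list of keywords.
// Assumed format: one keyword per line; blank lines and '#' comment lines are skipped.
function parseKeywords(text) {
    return text
        .split(/\r?\n/)                 // accept both Unix and Windows line endings
        .map((line) => line.trim())     // drop surrounding whitespace
        .filter((line) => line.length > 0 && !line.startsWith('#'));
}

// Reading from disk (the file name is an assumption):
// const fs = require('fs');
// const keywords = parseKeywords(fs.readFileSync('keywords.txt', 'utf8'));
```

Keeping the parsing separate from the file access makes it easy to test
without touching the disk.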

27.1.2019

- Add functionality to block images and CSS from loading as described here:

https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
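
A rough sketch of the blocking idea, using Puppeteer's request interception.
The names BLOCKED_RESOURCE_TYPES and handleRequest are made up for this
example, and the set of blocked types is an assumption that follows the
item above (images and CSS only).

```javascript
// Resource types to skip. 'image' and 'stylesheet' are values that
// Puppeteer's request.resourceType() can return; the choice is an assumption.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet']);

// Hypothetical request handler: abort blocked resource types, let the rest through.
function handleRequest(request) {
    if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
        return request.abort();
    }
    return request.continue();
}

// Wiring it into a Puppeteer page (requires the puppeteer package):
// await page.setRequestInterception(true);
// page.on('request', handleRequest);
```

Interception has to be enabled before the page navigates, otherwise the
first requests are not seen by the handler.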

29.1.2019

- implement proxy support functionality
- implement proxy check

- implement scraping more than 1 page
    - do it for Google
    - and Bing

- implement DuckDuckGo scraping
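
For the proxy support item above, one possible approach is to pass
Chromium's --proxy-server switch when launching the browser. This is only a
sketch: buildLaunchOptions is a hypothetical helper, and the proxy address
in the usage comment is a placeholder.

```javascript
// Hypothetical helper: build Puppeteer launch options that route traffic through a proxy.
// `--proxy-server` is a Chromium command-line switch; `proxy` is expected to be a
// string such as 'socks5://host:port' or 'http://host:port'.
function buildLaunchOptions(proxy) {
    const args = [];
    if (proxy) {
        args.push(`--proxy-server=${proxy}`);
    }
    return { headless: true, args };
}

// Usage (requires the puppeteer package; the address is a placeholder):
// const browser = await puppeteer.launch(buildLaunchOptions('socks5://127.0.0.1:9050'));
```

The switch applies to the whole browser process, so every tab opened in that
browser shares the same proxy. A proxy check could then load a page that
echoes the client IP address and compare it with the expected one; that part
is not shown here.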


30.1.2019

- modify all scrapers to use the generic class where it makes sense
    - Bing, Baidu, Google, DuckDuckGo

7.2.2019
- add num_requests to test cases [done]



TODO:
- add captcha service solving support
- check whether new instances run in the same browser and whether we can have one proxy per tab worker

- write test case for:
    - pluggable
    - full metadata (log HTTP headers, log IP address)