se-scraper/TODO.md at 0db6e068da8f12f8afefe666529e5dd0b5d3206e

mirror of https://github.com/NikolaiT/se-scraper.git synced 2024-11-08 00:33:58 +01:00

Nikolai Tschacher 78fe12390b better user agents now, added option to include screenshots as base64 in results

2019-07-18 20:19:15 +02:00

24.12.2018

- fix interface to scrape() [DONE]
- add to Github

24.1.2019

- fix issue #3: add functionality to add keyword file
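
A minimal sketch of what keyword-file support could look like. It assumes a plain-text file with one keyword per line; the function names `parseKeywords` and `readKeywordFile` are hypothetical and not taken from the se-scraper code.

```javascript
const fs = require('fs');

// Split the raw text of a keyword file into a list of keywords.
// Assumed format: one keyword per line. Blank lines are skipped and
// surrounding whitespace is trimmed.
function parseKeywords(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
}

// Read a keyword file from disk; the path is supplied by the caller.
function readKeywordFile(path) {
    return parseKeywords(fs.readFileSync(path, 'utf8'));
}
```

Keeping the parsing separate from the file access makes the format handling easy to test without touching the disk.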

27.1.2019

- Add functionality to block images and CSS from loading as described here:
    https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
    https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
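
The approach from the linked articles can be sketched roughly as follows, using Puppeteer's request interception. The helper names `shouldBlock` and `blockImagesAndCss` are hypothetical, and `page` is assumed to be a Puppeteer `Page` created elsewhere.

```javascript
// Resource types that are dropped. Only images and stylesheets are listed,
// matching the TODO item; extending the set is an assumption left to the caller.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet']);

// Decide whether a request of the given resource type should be aborted.
function shouldBlock(resourceType) {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Enable request interception on a Puppeteer page and abort every
// request whose resource type is in the blocked set.
async function blockImagesAndCss(page) {
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (shouldBlock(request.resourceType())) {
            request.abort();
        } else {
            request.continue();
        }
    });
}
```

Once interception is enabled, every request has to be either aborted or continued, otherwise the page stalls; that is why the handler has an explicit `else` branch.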

29.1.2019

- implement proxy support functionality
    - implement proxy check

- implement scraping more than 1 page
    - do it for google
    - and bing
- implement duckduckgo scraping

30.1.2019

- modify all scrapers to use the generic class where it makes sense
    - Bing, Baidu, Google, Duckduckgo

7.2.2019

- add num_requests to test cases [done]

25.2.2019

- https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
- add support for browsing with multiple browsers, using this neat library:
    https://github.com/thomasdondorf/puppeteer-cluster [done]

28.2.2019

- write test case for multiple browsers/proxies
- write test case and example for multiple tabs with bing
- make README.md nicer, using https://github.com/thomasdondorf/puppeteer-cluster/blob/master/README.md as a template

11.6.2019

- TODO: fix amazon scraping
- change api of remaining test cases [done]
- TODO: implement custom search engine parameters on scrape()

12.6.2019

- remove unnecessary sleep() calls and replace with waitFor selectors
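
One possible shape for this change, as a sketch only: wait for a result selector with a timeout instead of sleeping for a fixed time. The helper name `waitForResults` and the default timeout are assumptions; `page` is assumed to be a Puppeteer `Page`.

```javascript
// Wait until `selector` appears on the page instead of sleeping.
// Resolves to true if the element showed up, false otherwise.
async function waitForResults(page, selector, timeoutMs = 5000) {
    try {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch (err) {
        // Puppeteer rejects (for example with a TimeoutError) when the
        // selector does not appear in time. Any rejection is treated
        // here as "no results".
        return false;
    }
}
```

Compared with a static `sleep()`, this returns as soon as the results are rendered and only spends the full timeout when nothing shows up.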

16.7.2019

- resolve issues
    - fix this https://github.com/NikolaiT/se-scraper/issues/37 [done]
- use puppeteer stealth plugin: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
    - we will need to look at the concurrency implementation of puppeteer-cluster [no TypeScript support :(, I will not support this right now]
- use random user agents plugin: https://github.com/intoli/user-agents [done]
- add screenshot capability (take the screenshot after parsing)
    - store as base64 [done]
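
A rough sketch of the screenshot step, assuming it runs after the page has been parsed. The helper name `withScreenshot` and the `screenshot` result field are hypothetical; `page` is assumed to be a Puppeteer `Page`.

```javascript
// Take a full-page screenshot and attach it to the parsed results
// as a base64 string. The original results object is not mutated.
async function withScreenshot(page, results) {
    const screenshot = await page.screenshot({
        encoding: 'base64',
        fullPage: true,
    });
    return { ...results, screenshot };
}
```

Base64 keeps the results serializable as plain JSON, at the cost of roughly a third more bytes than the binary image.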

TODO:

- fix googlenewsscraper waiting for results and parsing; remove the static sleep [done]
- when using multiple browsers and random user agents, pass a random user agent to each entry of perBrowserOptions
- don't create a new tab when opening a new scraper
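
The per-browser user agent item could be approached along these lines. This is a sketch under assumptions: `buildPerBrowserOptions` is a hypothetical helper, `pickUserAgent` is any caller-supplied function that returns a user agent string, and the user agent is passed through Chrome's `--user-agent` command-line switch.

```javascript
// Build one launch-options object per browser, each with its own
// user agent. `pickUserAgent` is called once per browser.
function buildPerBrowserOptions(browserCount, pickUserAgent) {
    const options = [];
    for (let i = 0; i < browserCount; i++) {
        options.push({ args: [`--user-agent=${pickUserAgent()}`] });
    }
    return options;
}
```

With the user-agents package mentioned above, `pickUserAgent` could be `() => new UserAgent().toString()`, and the returned array would be handed to the cluster as `perBrowserOptions`.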
