### 24.12.2018
- fix the interface to scrape() [done]
- add to GitHub
### 24.1.2019
- fix issue #3: add support for reading keywords from a keyword file
### 27.1.2019
- add functionality to block images and CSS from loading, as described here:
    - https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
    - https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
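
A minimal sketch of the blocking idea above, assuming Puppeteer's request interception (`page.setRequestInterception` and `request.resourceType()`) is used. The helper name `blockImagesAndCss` is hypothetical and not part of se-scraper:

```javascript
// Resource types to skip; 'image' and 'stylesheet' are values that
// Puppeteer reports via request.resourceType().
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet']);

// Hypothetical helper (not part of se-scraper): aborts image and CSS
// requests on the given Puppeteer page and lets everything else through.
async function blockImagesAndCss(page) {
    // Interception must be enabled before requests can be aborted.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
            request.abort();
        } else {
            request.continue();
        }
    });
}
```

It would be called once per page, before the first `page.goto(...)`, so that the handler is in place when the search page starts loading.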
### 29.1.2019
- implement proxy support
- implement proxy check
- implement scraping more than one page
    - do it for Google
    - and Bing
- implement DuckDuckGo scraping
### 30.1.2019
- modify all scrapers to use the generic class where it makes sense
    - Bing, Baidu, Google, DuckDuckGo
### 7.2.2019
- add num_requests to test cases [done]
### 25.2.2019
- https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
- add support for browsing with multiple browsers, using this neat library:
    - https://github.com/thomasdondorf/puppeteer-cluster [done]
### 28.2.2019
- write test case for multiple browsers/proxies
- write test case and example for multiple tabs with bing
- make README.md nicer, using https://github.com/thomasdondorf/puppeteer-cluster/blob/master/README.md as a template
### 11.6.2019
- TODO: fix Amazon scraping
- change api of remaining test cases [done]
- TODO: implement custom search engine parameters on scrape()
### 12.6.2019
- remove unnecessary sleep() calls and replace them with waitFor selectors
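
A rough sketch of what replacing a static sleep could look like, assuming Puppeteer's `page.waitForSelector` is the wait primitive. The helper name `waitForResults`, the selector, and the default timeout are illustrative assumptions, not se-scraper's actual code:

```javascript
// Hypothetical helper: waits until `selector` is present on the page
// instead of sleeping for a fixed amount of time.
// Resolves to true if the element showed up, false otherwise.
async function waitForResults(page, selector, timeoutMs = 5000) {
    try {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch (err) {
        // Any rejection (typically Puppeteer's timeout error when the
        // selector never appears) is reported as "not found".
        return false;
    }
}
```

Compared with a fixed sleep, this returns as soon as the results container exists and only pays the full timeout when the page really has no results.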
### TODO:
1. fix googlenewsscraper waiting for results and parsing. remove the static sleep [done]
2. when using multiple browsers with random user agents, pass a random user agent to each entry of perBrowserOptions
3. don't create a new tab when opening a new scraper
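
For item 2 above, one possible shape is sketched below. It assumes each entry of perBrowserOptions is a set of browser launch options and that the user agent is set through Chrome's `--user-agent=` command-line flag. The function names and the user agent strings in the usage note are made up for illustration:

```javascript
// Picks one element of `items`; `random` is injectable so the choice
// can be made deterministic in tests.
function pickRandom(items, random = Math.random) {
    return items[Math.floor(random() * items.length)];
}

// Hypothetical helper: builds one launch-options object per browser,
// each with its own randomly chosen user agent appended to `args`.
function buildPerBrowserOptions(numBrowsers, userAgents, baseOptions = {}, random = Math.random) {
    return Array.from({ length: numBrowsers }, () => ({
        ...baseOptions,
        args: [
            ...(baseOptions.args || []),
            `--user-agent=${pickRandom(userAgents, random)}`,
        ],
    }));
}
```

For example, `buildPerBrowserOptions(3, ['ExampleAgent/1.0', 'ExampleAgent/2.0'], { headless: true })` yields three option objects that share `headless: true` but may differ in their `--user-agent=` argument. The result could then be handed to puppeteer-cluster as `perBrowserOptions`.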