2018-12-24 14:25:02 +01:00
24.12.2018

- fix interface to scrape() [DONE]
- add to Github

2019-01-24 15:50:03 +01:00
24.1.2019

- fix issue #3: add functionality to add keyword file
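
A minimal sketch of how a keyword file could be loaded, assuming one keyword per line. The helper name `loadKeywords` and the file layout are assumptions for illustration, not taken from the project.

```javascript
const fs = require('fs');

// Hypothetical helper: read keywords from a plain-text file,
// one keyword per line, trimming whitespace and skipping blank lines.
function loadKeywords(filePath) {
    return fs.readFileSync(filePath, 'utf8')
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
}
```

The returned array could then be passed wherever the scraper currently expects its list of keywords.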

2019-01-27 01:27:52 +01:00
27.1.2019

- Add functionality to block images and CSS from loading as described here:
https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
2019-01-27 15:54:56 +01:00
https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
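
Blocking images and CSS can be sketched with Puppeteer's request interception. This is a rough outline, not the project's implementation: the names `BLOCKED_TYPES`, `handleRequest` and `blockImagesAndCss` are made up here, and `page` is assumed to be a Puppeteer page object.

```javascript
// Resource types to skip. 'image' and 'stylesheet' are values that
// Puppeteer's request.resourceType() can return.
const BLOCKED_TYPES = new Set(['image', 'stylesheet']);

// Abort requests for blocked resource types, let everything else through.
function handleRequest(request) {
    if (BLOCKED_TYPES.has(request.resourceType())) {
        request.abort();
    } else {
        request.continue();
    }
}

// Turn on interception for a Puppeteer page and install the handler.
// Must be called before page.goto() to affect the page load.
async function blockImagesAndCss(page) {
    await page.setRequestInterception(true);
    page.on('request', handleRequest);
}
```

Keeping the decision in a plain function makes it easy to try out with a stub request object, without starting a browser.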

2019-01-29 22:48:08 +01:00
29.1.2019

- implement proxy support functionality
- implement proxy check

- implement scraping more than 1 page
  - do it for google
  - and bing

- implement duckduckgo scraping
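
For the proxy item above, one possible approach is to pass the proxy to Chromium at launch. The sketch below only builds the options object for `puppeteer.launch()`. `--proxy-server` is a Chromium command-line switch; the function name `buildLaunchOptions` and the shape of `config` are assumptions.

```javascript
// Hypothetical helper: build the options for puppeteer.launch().
// config.proxy is assumed to be a string such as 'socks5://127.0.0.1:9050'.
function buildLaunchOptions(config = {}) {
    const args = [];
    if (config.proxy) {
        // Chromium routes all browser traffic through this proxy.
        args.push(`--proxy-server=${config.proxy}`);
    }
    return { headless: true, args };
}
```

Because the switch applies to the whole browser process, this gives one proxy per browser instance, not one per tab.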

TODO:

2019-01-29 13:29:24 +01:00
- think about implementing ticker search for: https://quotes.wsj.com/MSFT?mod=searchresults_companyquotes
- add proxy support
- add captcha service solving support
- check whether new instances run in the same browser and whether we can have one proxy per tab worker

TODO:

- think about whether it makes sense to introduce a generic scraping class
- is scraping abstractable, or is every scraper too unique?
- don't make the same mistakes as with GoogleScraper

TODO:

okay, it's fucking time to make a generic scraping class like in GoogleScraper
I feel like history repeats itself

class Scraper {
    constructor(options = {}) {
    }

    // stubs; each search engine is expected to override these
    async load_search_engine() {}
    async search_keyword() {}
    async new_page() {}
    async detected() {}
}

then each search engine derives from this generic class

some search engines do not need such an abstract class, because they are too complex
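
A rough sketch of how a search engine could derive from the generic class. Everything beyond the method names listed above is an assumption: the `run()` method, the `results` field, the `keyword` parameter and the `FakeEngineScraper` class are invented for illustration, and `new_page()` is left out for brevity.

```javascript
// Generic base class: run() fixes the order of the steps, the
// individual steps are left to the concrete search engine.
class Scraper {
    constructor(options = {}) {
        this.options = options;
        this.results = {};
    }

    async load_search_engine() {}
    async search_keyword(keyword) {}
    async detected() { return false; }

    // Hypothetical driver: load the engine once, then search each
    // keyword until the scraper reports that it was detected.
    async run(keywords) {
        await this.load_search_engine();
        for (const keyword of keywords) {
            if (await this.detected()) break;
            this.results[keyword] = await this.search_keyword(keyword);
        }
        return this.results;
    }
}

// Hypothetical engine; a real one would drive a browser page here.
class FakeEngineScraper extends Scraper {
    async load_search_engine() {
        this.loaded = true;
    }

    async search_keyword(keyword) {
        return [`result for ${keyword}`];
    }
}
```

An engine that is too complex for this shape can simply skip the base class and implement its own flow.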