se-scraper/TODO.txt

24.12.2018
    - fix interface to scrape() [DONE]
    - add to GitHub


24.1.2019

    - fix issue #3: add functionality to add keyword file
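
Note on the keyword file: a minimal sketch, assuming the file holds one keyword per line. The names parseKeywords and readKeywordFile are hypothetical helpers for illustration, not part of the existing scrape() interface.

```javascript
const fs = require('fs');

// Hypothetical helper: turn raw file contents into a clean keyword list.
// Assumes one keyword per line; blank lines and surrounding whitespace are dropped.
function parseKeywords(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
}

// Hypothetical helper: read a keyword file from disk (UTF-8 assumed).
function readKeywordFile(path) {
    return parseKeywords(fs.readFileSync(path, 'utf8'));
}
```

The parsing step is kept separate from the file read so it can be tested without touching the disk.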

27.1.2019

    - Add functionality to block images and CSS from loading as described here:

        https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
        https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
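
Note on blocking images and CSS: a minimal sketch using Puppeteer's request interception, assuming `page` is a Puppeteer Page. The helper names (shouldBlock, handleRequest, blockImagesAndCss) are made up for illustration, and the set of blocked types is an assumption that can be extended.

```javascript
// Resource types to drop, as reported by Puppeteer's request.resourceType().
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet']);

// Hypothetical helper: true if a request of this type should be aborted.
function shouldBlock(resourceType) {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Abort blocked resource types, let everything else through.
function handleRequest(request) {
    if (shouldBlock(request.resourceType())) {
        request.abort();
    } else {
        request.continue();
    }
}

// Enable interception first, then register the handler, so that
// abort()/continue() are only called on intercepted requests.
async function blockImagesAndCss(page) {
    await page.setRequestInterception(true);
    page.on('request', handleRequest);
}
```

This would be called once per page before page.goto(); whether it should be on by default or behind a config flag is left open.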

29.1.2019

    - implement proxy support functionality
        - implement proxy check

    - implement scraping of more than one page
        - do it for Google
        - and Bing

    - implement DuckDuckGo scraping
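
Note on proxy support: a minimal sketch, assuming the proxy is passed to Chromium through its --proxy-server command-line switch when the browser is launched. The helper name launchOptionsForProxy and the headless default are assumptions for illustration.

```javascript
// Hypothetical helper: build Puppeteer launch options for an optional proxy.
// `proxy` is expected in scheme://host:port form, e.g. 'socks5://127.0.0.1:9050'.
// Proxies that need a username and password are not covered here; for those,
// page.authenticate() would be needed in addition.
function launchOptionsForProxy(proxy) {
    const args = [];
    if (proxy) {
        args.push(`--proxy-server=${proxy}`);
    }
    return { headless: true, args };
}
```

Usage would be along the lines of puppeteer.launch(launchOptionsForProxy('http://127.0.0.1:3128')). Because the switch is set at browser launch, it applies to every tab of that browser. A proxy check could then load an IP echo page through the browser and compare the reported address with the expected one.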


30.1.2019

    - modify all scrapers to use the generic class where it makes sense
        - Bing, Baidu, Google, DuckDuckGo

7.2.2019
    - add num_requests to test cases [DONE]


TODO:
    - add captcha service solving support
    - check whether new instances run in the same browser and whether we can have one proxy per tab worker

    - write test case for:
        - pluggable
        - full metadata (log http headers, log ip address)