se-scraper/TODO.md

### 24.12.2018
    - fix interface to scrape() [done]
    - add to GitHub


### 24.1.2019
    - fix issue #3: add functionality to add keyword file

### 27.1.2019
    - Add functionality to block images and CSS from loading as described here:
        https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
        https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
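
A minimal sketch of the resource-blocking idea, assuming Puppeteer's request interception (`page.setRequestInterception`, `request.resourceType()`, `request.abort()`, `request.continue()`). The helper names `shouldBlock` and `blockImagesAndCss` are hypothetical and not part of se-scraper:

```javascript
// Resource types to skip. Only images and CSS are listed, matching the TODO item;
// the set is an assumption and can be extended.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet']);

// Decide whether a request should be aborted, based on its resource type.
function shouldBlock(resourceType) {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Enable interception on a Puppeteer page and abort requests for blocked types.
// `page` is expected to be a Puppeteer Page object.
async function blockImagesAndCss(page) {
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (shouldBlock(request.resourceType())) {
            request.abort();
        } else {
            request.continue();
        }
    });
}
```

The helper would be called once per page, before `page.goto(...)`, so that the first navigation is already filtered.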

### 29.1.2019
    - implement proxy support functionality
        - implement proxy check

    - implement scraping more than 1 page
        - do it for google
        - and bing
    - implement duckduckgo scraping


### 30.1.2019
    - modify all scrapers to use the generic class where it makes sense
        - Bing, Baidu, Google, DuckDuckGo

### 7.2.2019
    - add num_requests to test cases [done]

### 25.2.2019
    - https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
    - add support for browsing with multiple browsers, use this neat library:
    - https://github.com/thomasdondorf/puppeteer-cluster [done]
    
    
### 28.2.2019
    - write test case for multiple browsers/proxies
    - write test case and example for multiple tabs with bing
    - make README.md nicer, using https://github.com/thomasdondorf/puppeteer-cluster/blob/master/README.md as a template


### 11.6.2019
    - TODO: fix amazon scraping
    - change api of remaining test cases [done]
    - TODO: implement custom search engine parameters on scrape()
    
### 12.6.2019
    - remove unnecessary sleep() calls and replace with waitFor selectors
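
One way the sleep-to-selector change could look, as a rough sketch. It assumes Puppeteer's `page.waitForSelector(selector, { timeout })` and that a timeout surfaces as an error whose `name` is `TimeoutError`. The helper name `waitForResults`, the selector, and the default timeout are hypothetical:

```javascript
// Wait until `selector` appears instead of sleeping for a fixed time.
// Resolves to true when the element shows up, false when the wait times out.
// Any other error (for example a closed page) is rethrown.
async function waitForResults(page, selector, timeoutMs = 5000) {
    try {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch (err) {
        // Assumption: Puppeteer's timeout error carries the name 'TimeoutError'.
        if (err && err.name === 'TimeoutError') {
            return false;
        }
        throw err;
    }
}
```

Compared with a static `sleep()`, this returns as soon as the results are rendered and makes a slow or blocked page visible as a `false` return value.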


### 16.7.2019

- resolve issues
    - fix this https://github.com/NikolaiT/se-scraper/issues/37 [done]
    
- use puppeteer stealth plugin: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

    - we will need to look at the concurrency impl of puppeteer-cluster [no typescript support :(), I will not support this right now]

- use random user agents plugin: https://github.com/intoli/user-agents [done]

- add screenshot capability (take the screenshot after parsing)
    - store as b64 [done]


### 12.8.2019

- add static test case for bing [done]
- add options that minimize the output of the `html_output` flag:
    `clean_html_output` will remove all JS and CSS from the html 
    `clean_data_images` removes all data images from the html
    [done]
    
    
### 13.8.2019
- Write test case for clean html output [done]
- Consider better compression algorithm. [done] There is the brotli algorithm, but this is only supported
  in very recent versions of Node.js
- what else can we remove from the dom [done] Removing comment nodes now! They are large in Bing.
- remove all whitespace and \n and \t from html
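
A rough sketch of the cleaning and compression steps above, not the se-scraper implementation. It uses regexes instead of a DOM parser, and it collapses whitespace runs into a single space rather than removing all whitespace (removing everything would glue words together). Note that collapsing also alters whitespace inside `<pre>` and inline scripts. The function names are hypothetical. Brotli is used only when `zlib.brotliCompressSync` exists in the running Node.js version, otherwise gzip:

```javascript
const zlib = require('zlib');

// Remove HTML comment nodes with a regex (an approximation, not a real parser).
function stripComments(html) {
    return html.replace(/<!--[\s\S]*?-->/g, '');
}

// Collapse runs of spaces, \n and \t into a single space and trim the ends.
function collapseWhitespace(html) {
    return html.replace(/\s+/g, ' ').trim();
}

// Apply both cleaning steps in order.
function minifyHtml(html) {
    return collapseWhitespace(stripComments(html));
}

// Compress with Brotli when the runtime supports it, otherwise fall back to gzip.
// Returns the encoding name so the consumer knows how to decompress.
function compressHtml(html) {
    const input = Buffer.from(html, 'utf8');
    if (typeof zlib.brotliCompressSync === 'function') {
        return { encoding: 'br', data: zlib.brotliCompressSync(input) };
    }
    return { encoding: 'gzip', data: zlib.gzipSync(input) };
}
```

Running `minifyHtml` before `compressHtml` keeps the two concerns separate: the first shrinks the markup itself, the second only changes how it is stored.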

### TODO:
1. fix googlenewsscraper waiting for results and parsing. remove the static sleep [done]
2. when using multiple browsers and random user agent, pass a random user agent to each perBrowserOptions

3. don't create a new tab when opening a new scraper