# Search Engine Scraper - se-scraper

[![npm](https://img.shields.io/npm/v/se-scraper.svg?style=for-the-badge)](https://www.npmjs.com/package/se-scraper)
[![Donate](https://img.shields.io/badge/donate-paypal-blue.svg?style=for-the-badge)](https://www.paypal.me/incolumitas)
[![Known Vulnerabilities](https://snyk.io/test/github/NikolaiT/se-scraper/badge.svg)](https://snyk.io/test/github/NikolaiT/se-scraper)

This node module allows you to scrape search engines concurrently with different proxies.

If you don't have much technical experience or don't want to purchase proxies, you can use [my scraping service](https://scrapeulous.com/).

##### Table of Contents
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Using Proxies](#proxies)
- [Examples](#examples)
- [Scraping Model](#scraping-model)
- [Technical Notes](#technical-notes)
- [Advanced Usage](#advanced-usage)
- [Special Query String Parameters for Search Engines](#query-string-parameters)


Se-scraper supports the following search engines:
* Google
* Google News
* Google News App version (https://news.google.com)
* Google Image
* Amazon
* Bing
* Bing News
* Baidu
* YouTube
* Infospace
* DuckDuckGo
* Webcrawler
* Reuters
* CNBC
* Marketwatch

This module uses puppeteer and a modified version of [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster/). It was created by the developer of [GoogleScraper](https://github.com/NikolaiT/GoogleScraper), a module with 1800 stars on GitHub.

## Installation

You need a working installation of **node** and the **npm** package manager.

Install **se-scraper** by entering the following command in your terminal:

```bash
npm install se-scraper
```

If you **don't** want puppeteer to download a complete chromium browser, add this variable to your environment. Note that in this case the library is not guaranteed to run out of the box.

```bash
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
```

## Quickstart

Create a file named `run.js` with the following contents:

```js
const se_scraper = require('se-scraper');

let config = {
    search_engine: 'google',
    debug: false,
    verbose: false,
    keywords: ['news', 'scraping scrapeulous.com'],
    num_pages: 3,
    output_file: 'data.json',
};

function callback(err, response) {
    if (err) { console.error(err) }
    console.dir(response, {depth: null, colors: true});
}

se_scraper.scrape(config, callback);
```

Start scraping by running the command `node run.js`.

## Proxies

**se-scraper** will create one browser instance per proxy. The maximum concurrency therefore equals the number of proxies plus one (your own IP).

```js
const se_scraper = require('se-scraper');

let config = {
    search_engine: 'google',
    debug: false,
    verbose: false,
    keywords: ['news', 'scrapeulous.com', 'incolumitas.com', 'i work too much'],
    num_pages: 1,
    output_file: 'data.json',
    proxy_file: '/home/nikolai/.proxies', // one proxy per line
    log_ip_address: true,
};

function callback(err, response) {
    if (err) { console.error(err) }
    console.dir(response, {depth: null, colors: true});
}

se_scraper.scrape(config, callback);
```

With a proxy file such as the following:

```text
socks5://53.34.23.55:55523
socks4://51.11.23.22:22222
```

This will scrape with **three** browser instances, each with its own IP address. Unfortunately, it is currently not possible to scrape with a different proxy per tab, because Chromium does not support that.
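
As a rough illustration of the counting above, the following sketch parses a proxy list in the one-proxy-per-line format and derives the number of browser instances. The helper names `parseProxyList` and `browserCount` are hypothetical and not part of the **se-scraper** API.

```js
// Hypothetical helpers, not part of se-scraper: they only illustrate how a
// one-proxy-per-line file maps to the number of browser instances.
function parseProxyList(text) {
    return text
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line.length > 0); // ignore blank lines
}

// One browser per proxy, plus one browser that uses your own IP.
function browserCount(proxies) {
    return proxies.length + 1;
}

const proxies = parseProxyList('socks5://53.34.23.55:55523\nsocks4://51.11.23.22:22222\n');
console.log(browserCount(proxies)); // 3
```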

## Examples

* [Simple example scraping google](examples/quickstart.js) yields [these results](examples/results/data.json)
* [Simple example scraping baidu](examples/baidu.js) yields [these results](examples/results/baidu.json)
* [Scrape with one proxy per browser](examples/proxies.js) yields [these results](examples/results/proxyresults.json)
* [Scrape 100 keywords on Bing with multiple tabs in one browser](examples/multiple_tabs.js) produces [this](examples/results/bing.json)
* [Scrape two keywords on Amazon](examples/amazon.js) produces [this](examples/results/amazon.json)
* [Inject your own scraping logic](examples/pluggable.js)


## Scraping Model

**se-scraper** scrapes search engines only. In order to introduce concurrency into this library, it is necessary to define the scraping model. Then we can decide how we divide and conquer.

#### Scraping Resources

What are common scraping resources?

1. **Memory and CPU**. Necessary to launch multiple browser instances.
2. **Network Bandwidth**. Rarely the bottleneck.
3. **IP Addresses**. Websites often block IP addresses after a certain number of requests from the same IP address. Can be circumvented by using proxies.
4. **Spoofable identifiers** such as browser fingerprints or user agents. Those will be handled by **se-scraper**.

#### Concurrency Model

**se-scraper** should be able to run without any concurrency at all. This is the default case. No concurrency means only one browser/tab is searching at a time.

For concurrent use, we will make use of a modified [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster).

One scrape job is fully defined by:

* 1 search engine such as `google`
* `M` pages
* `N` keywords/queries
* `K` proxies and `K+1` browser instances (because when we have no proxies available, we will scrape with our dedicated IP)

Then **se-scraper** will create `K+1` dedicated browser instances, each with a unique IP address. Each browser will get `N/(K+1)` keywords and will issue `N/(K+1) * M` total requests to the search engine.
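
To make the arithmetic concrete, here is a minimal sketch that distributes `N` keywords over `K+1` browsers and counts the requests each one issues. The function `planJob` and the round-robin assignment are assumptions made for illustration; the text above does not specify how **se-scraper** actually assigns keywords.

```js
// Hypothetical sketch: distribute keywords round-robin over K+1 browsers.
// `null` stands for the browser that scrapes without a proxy (own IP).
function planJob(keywords, proxies, numPages) {
    const browsers = [null, ...proxies].map((proxy) => ({
        proxy,
        keywords: [],
        requests: 0,
    }));
    keywords.forEach((keyword, i) => {
        const browser = browsers[i % browsers.length];
        browser.keywords.push(keyword);
        browser.requests += numPages; // one request per result page
    });
    return browsers;
}

// N = 6 keywords, K = 2 proxies, M = 2 pages
const plan = planJob(
    ['news', 'scrapeulous.com', 'incolumitas.com', 'i work too much', 'learn js', 'scraping'],
    ['socks5://53.34.23.55:55523', 'socks4://51.11.23.22:22222'],
    2
);
for (const browser of plan) {
    console.log(browser.proxy || 'own IP', browser.keywords.length, browser.requests);
}
```

With these numbers each of the three browsers gets `6/3 = 2` keywords and issues `2 * 2 = 4` requests.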

The problem is that the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) only allows identical options for all new browser instances. Therefore, it is not trivial to launch a cluster of browsers with distinct proxy settings. Right now, every browser has the same options; it is not possible to set options on a per-browser basis.

Possible solutions:

1. Create an [upstream proxy router](https://github.com/GoogleChrome/puppeteer/issues/678).
2. Modify the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) to accept a list of proxy strings and then `pop()` from this list on every new call to `workerInstance()` in [src/Cluster.ts](https://github.com/thomasdondorf/puppeteer-cluster/blob/master/src/Cluster.ts). I wrote an [issue here](https://github.com/thomasdondorf/puppeteer-cluster/issues/107). **I ended up doing this**.
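
The idea behind the second option can be sketched without any browser. The class below is a hypothetical illustration, not the actual modification made to puppeteer-cluster: it hands out one proxy per new worker by popping from a list and turns it into Chromium's `--proxy-server` launch flag.

```js
// Hypothetical sketch of the "pop a proxy per worker" idea; the class and
// method names are illustrative and not the puppeteer-cluster API.
class ProxyPool {
    constructor(proxies) {
        this.proxies = [...proxies]; // copy so the caller's list is untouched
    }

    // Launch arguments for the next browser. When the list is exhausted,
    // no proxy flag is added and the browser uses the machine's own IP.
    nextLaunchArgs(baseArgs = []) {
        const proxy = this.proxies.pop();
        return proxy ? [...baseArgs, `--proxy-server=${proxy}`] : [...baseArgs];
    }
}

const pool = new ProxyPool(['socks5://53.34.23.55:55523', 'socks4://51.11.23.22:22222']);
console.log(pool.nextLaunchArgs(['--no-sandbox'])); // uses the socks4 proxy
console.log(pool.nextLaunchArgs(['--no-sandbox'])); // uses the socks5 proxy
console.log(pool.nextLaunchArgs(['--no-sandbox'])); // no proxy left: own IP
```

Because each call consumes one entry, `K` proxies yield `K` proxied browsers and every further browser falls back to the direct connection.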


## Technical Notes

Scraping is done with a headless chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
 
If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at **hire@incolumitas.com**

The chromium browser is started with the following flags to prevent
scraping detection.

```js
var ADDITIONAL_CHROME_FLAGS = [
    '--disable-infobars',
    '--window-position=0,0',
    '--ignore-certificate-errors',
    '--ignore-certificate-errors-spki-list',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--disable-gpu',
    '--window-size=1920x1080',
    '--hide-scrollbars',
    '--disable-notifications',
];
```

Furthermore, to avoid loading unnecessary resources and to speed up
scraping a great deal, we instruct chrome not to load images, stylesheets, fonts, and media:

```js
await page.setRequestInterception(true);
page.on('request', (req) => {
    let type = req.resourceType();
    const block = ['stylesheet', 'font', 'image', 'media'];
    if (block.includes(type)) {
        req.abort();
    } else {
        req.continue();
    }
});
```
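
The blocking decision can also be isolated in a small function and tried out without a browser. The sketch below uses a hand-written stand-in for puppeteer's request object; `handleRequest` and `fakeRequest` are hypothetical names, and tying the behavior to a boolean such as `block_assets` is an assumption about how the option might be wired up.

```js
// Hypothetical refactoring of the handler above; `handleRequest` and
// `fakeRequest` are illustrative names, not part of se-scraper or puppeteer.
const BLOCKED_TYPES = new Set(['stylesheet', 'font', 'image', 'media']);

// Aborts blocked resource types when blockAssets is true, otherwise lets
// the request through. Returns the action taken, for easy inspection.
function handleRequest(req, blockAssets = true) {
    if (blockAssets && BLOCKED_TYPES.has(req.resourceType())) {
        req.abort();
        return 'aborted';
    }
    req.continue();
    return 'continued';
}

// Minimal stand-in exposing the three request methods the handler uses.
function fakeRequest(type) {
    return {
        resourceType: () => type,
        abort() {},
        continue() {},
    };
}

console.log(handleRequest(fakeRequest('image')));        // aborted
console.log(handleRequest(fakeRequest('document')));     // continued
console.log(handleRequest(fakeRequest('image'), false)); // continued
```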

#### Making puppeteer and headless chrome undetectable

Consider the following resources:

* https://intoli.com/blog/making-chrome-headless-undetectable/
* https://intoli.com/blog/not-possible-to-block-chrome-headless/
* https://news.ycombinator.com/item?id=16179602

**se-scraper** implements the countermeasures against headless chrome detection proposed on those sites.

The most recent detection countermeasures can be found here:

* https://github.com/paulirish/headless-cat-n-mouse/blob/master/apply-evasions.js

**se-scraper** makes use of those anti-detection techniques.

To check whether evasion works, pass the `test_evasion` flag in the config:

```js
let config = {
    // check if headless chrome escapes common detection techniques
    test_evasion: true
};
```

This creates a screenshot named `headless-test-result.png` in the directory where the scraper was started, showing whether all tests have passed.

## Advanced Usage

Use **se-scraper** by calling it with a script such as the one below.

```js
const se_scraper = require('se-scraper');

let config = {
    // the user agent to scrape with
    user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    // if random_user_agent is set to true, a random user agent is chosen
    random_user_agent: true,
    // how long to sleep between requests. a random sleep interval within the range [a,b]
    // is drawn before every request. empty string for no sleeping.
    sleep_range: '[1,2]',
    // which search engine to scrape
    search_engine: 'google',
    // whether debug information should be printed
    // debug info is useful for developers when debugging
    debug: false,
    // whether verbose program output should be printed
    // this output is informational
    verbose: true,
    // an array of keywords to scrape
    keywords: ['scrapeulous.com', 'scraping search engines', 'scraping service scrapeulous', 'learn js'],
    // alternatively you can specify a keyword_file. this overrides the keywords array
    keyword_file: '',
    // the number of pages to scrape for each keyword
    num_pages: 2,
    // whether to start the browser in headless mode
    headless: true,
    // path to output file, data will be stored in JSON
    output_file: 'examples/results/advanced.json',
    // whether to prevent images, css, fonts from being loaded
    // will speed up scraping a great deal
    block_assets: true,
    // path to js module that extends functionality
    // this module should export the functions:
    // get_browser, handle_metadata, close_browser
    // must be an absolute path to the module
    //custom_func: resolve('examples/pluggable.js'),
    custom_func: '',
    // use a proxy for all connections
    // example: 'socks5://78.94.172.42:1080'
    // example: 'http://118.174.233.10:48400'
    proxy: '',
    // a file with one proxy per line. Example:
    // socks5://78.94.172.42:1080
    // http://118.174.233.10:48400
    proxy_file: '',
    // check if headless chrome escapes common detection techniques
    // this is a quick test and should be used for debugging
    test_evasion: false,
    // log ip address data
    log_ip_address: false,
    // log http headers
    log_http_headers: false,
    puppeteer_cluster_config: {
        timeout: 10 * 60 * 1000, // max timeout set to 10 minutes
        monitor: false,
        concurrency: 1, // one scraper per tab
        maxConcurrency: 2, // scrape with 2 tabs
    }
};

function callback(err, response) {
    // stop here on error: response may be undefined, so reading response.results would throw
    if (err) { console.error(err); return; }

    /* response object has the following properties:

        response.results - json object with the scraping results
        response.metadata - json object with metadata information
        response.statusCode - status code of the scraping process
     */

    console.dir(response.results, {depth: null, colors: true});
}

se_scraper.scrape(config, callback);
```

[Output for the above script on my machine.](examples/results/advanced.json)
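
The `sleep_range` option in the config above is a string such as `'[1,2]'`, or an empty string for no sleeping. The following sketch shows one way such a value could be parsed and sampled. It is not the library's actual implementation: the function names are hypothetical, and treating the bounds as seconds is an assumption, since the unit is not stated above.

```js
// Hypothetical sketch, not se-scraper's actual implementation.
// Parses a range string such as '[1,2]'; returns null for '' (no sleeping).
function parseSleepRange(value) {
    if (value === '') {
        return null;
    }
    const [min, max] = JSON.parse(value);
    if (typeof min !== 'number' || typeof max !== 'number' || min > max) {
        throw new Error(`invalid sleep_range: ${value}`);
    }
    return { min, max };
}

// Draws a duration uniformly between min and max (assumed to be seconds).
// `random` is injectable so the function can be tested deterministically.
function drawSleepSeconds(range, random = Math.random) {
    return range.min + random() * (range.max - range.min);
}

const range = parseSleepRange('[1,2]');
if (range !== null) {
    console.log(`sleeping for ${drawSleepSeconds(range).toFixed(2)}s`);
}
```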

### Query String Parameters

You can add your custom query string parameters to the configuration object by specifying a `google_settings` key. In general: `{{search engine}}_settings`.

For example you can customize your google search with the following config:

```js
let config = {
    search_engine: 'google',
    // use specific search engine parameters for various search engines
    google_settings: {
        google_domain: 'google.com',
        gl: 'us', // The gl parameter determines the Google country to use for the query.
        hl: 'en', // The hl parameter determines the Google UI language to return results.
        start: 0, // Determines the results offset to use, defaults to 0.
        num: 100, // Determines the number of results to show, defaults to 10. Maximum is 100.
    },
}
```
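
To show how such settings could end up in a request, here is a minimal sketch that turns a settings object into a search URL with Node's built-in `URLSearchParams`. The function `buildGoogleUrl` and the URL layout are assumptions for illustration and may differ from what **se-scraper** does internally.

```js
// Hypothetical sketch: how a settings object could be turned into a search URL.
// The function name and URL layout are assumptions, not se-scraper internals.
function buildGoogleUrl(keyword, settings) {
    // google_domain selects the host; everything else becomes a query parameter
    const { google_domain = 'google.com', ...params } = settings;
    const query = new URLSearchParams({ q: keyword });
    for (const [key, value] of Object.entries(params)) {
        query.set(key, String(value));
    }
    return `https://www.${google_domain}/search?${query.toString()}`;
}

const url = buildGoogleUrl('scraping search engines', {
    google_domain: 'google.com',
    gl: 'us',
    hl: 'en',
    start: 0,
    num: 100,
});
console.log(url);
// https://www.google.com/search?q=scraping+search+engines&gl=us&hl=en&start=0&num=100
```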