This commit is contained in:
Nikolai Tschacher 2020-05-17 22:06:33 +02:00
parent 1694ee92d0
commit 0278b24f0d

@@ -1,4 +1,6 @@
# Search Engine Scraper - se-scraper
# [The maintained successor of se-scraper is the general purpose crawling infrastructure](https://github.com/NikolaiT/Crawling-Infrastructure)
## Search Engine Scraper - se-scraper
[![npm](https://img.shields.io/npm/v/se-scraper.svg?style=for-the-badge)](https://www.npmjs.com/package/se-scraper)
[![Donate](https://img.shields.io/badge/donate-paypal-blue.svg?style=for-the-badge)](https://www.paypal.me/incolumitas)
@@ -98,7 +100,7 @@ tschachn/se-scraper latest 897e1aeeba78 21 minutes
You can check the [latest tag here](https://hub.docker.com/r/tschachn/se-scraper/tags). In the example below, the latest tag is **latest**. This will most likely remain **latest** in the future.
Run the docker image and map the internal port 3000 to the external
port 3000:
```bash
@@ -289,7 +291,7 @@ Then **se-scraper** will create `K+1` dedicated browser instances with a unique
The problem is that the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) only allows identical options for subsequent new browser instances. Therefore, it is not trivial to launch a cluster of browsers with distinct proxy settings. Right now, every browser has the same options. It's not possible to set options on a per-browser basis.
Solution:
1. Create an [upstream proxy router](https://github.com/GoogleChrome/puppeteer/issues/678).
2. Modify the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) to accept a list of proxy strings and then `pop()` from this list on every new call to `workerInstance()` in https://github.com/thomasdondorf/puppeteer-cluster/blob/master/src/Cluster.ts. I wrote an [issue here](https://github.com/thomasdondorf/puppeteer-cluster/issues/107). **I ended up doing this**.
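
The idea behind option 2 can be sketched independently of puppeteer-cluster. The helper below is hypothetical: the name `createLaunchOptionsFactory` and the proxy addresses are placeholders for illustration, not part of puppeteer-cluster or se-scraper. It assumes each new browser is launched with the options object it returns, and it relies only on Chromium's `--proxy-server` command-line flag.

```javascript
// Hypothetical helper: hands out one proxy per newly launched browser.
// Not the actual puppeteer-cluster patch, only an illustration of the pop() idea.
function createLaunchOptionsFactory(baseOptions, proxies) {
  // Copy the list so the caller's array is not mutated by pop().
  const remaining = [...proxies];

  return function nextLaunchOptions() {
    // pop() returns undefined once every proxy has been handed out.
    const proxy = remaining.pop();
    const args = [...(baseOptions.args || [])];
    if (proxy !== undefined) {
      // --proxy-server is a standard Chromium command-line switch.
      args.push(`--proxy-server=${proxy}`);
    }
    return { ...baseOptions, args };
  };
}

// Placeholder proxies, assumed for this example only.
const nextLaunchOptions = createLaunchOptionsFactory(
  { headless: true, args: ['--no-sandbox'] },
  ['socks5://127.0.0.1:9050', 'http://127.0.0.1:8080']
);

const first = nextLaunchOptions();  // uses the last proxy in the list
const second = nextLaunchOptions(); // uses the next one
const third = nextLaunchOptions();  // list exhausted: no proxy flag added
```

Each returned object could then be passed to `puppeteer.launch()` when a new worker browser is created, so that no two browsers share a proxy. A browser launched after the list is exhausted simply runs without a proxy flag.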
@@ -298,7 +300,7 @@ Solution:
## Technical Notes
Scraping is done with a headless Chromium browser using the automation library puppeteer. Puppeteer is a Node library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at **hire@incolumitas.com**
The chromium browser is started with the following flags to prevent
@@ -506,4 +508,4 @@ let scrape_config = {
num: 100, // Determines the number of results to show, defaults to 10. Maximum is 100.
},
}
```
```