Mirror of https://github.com/NikolaiT/se-scraper.git, synced 2024-11-21 15:13:13 +01:00
ad
parent 1694ee92d0
commit 0278b24f0d
README.md (12 lines changed)
@@ -1,4 +1,6 @@
-# Search Engine Scraper - se-scraper
+# [The maintained successor of se-scraper is the general purpose crawling infrastructure](https://github.com/NikolaiT/Crawling-Infrastructure)

+## Search Engine Scraper - se-scraper
+
 [![npm](https://img.shields.io/npm/v/se-scraper.svg?style=for-the-badge)](https://www.npmjs.com/package/se-scraper)
 [![Donate](https://img.shields.io/badge/donate-paypal-blue.svg?style=for-the-badge)](https://www.paypal.me/incolumitas)
@@ -98,7 +100,7 @@ tschachn/se-scraper latest 897e1aeeba78 21 minutes

 You can check the [latest tag here](https://hub.docker.com/r/tschachn/se-scraper/tags). In the example below, the latest tag is **latest**. This will most likely remain **latest** in the future.

-Run the docker image and map the internal port 3000 to the external
+Run the docker image and map the internal port 3000 to the external
 port 3000:

 ```bash
@@ -289,7 +291,7 @@ Then **se-scraper** will create `K+1` dedicated browser instances with a unique

 The problem is that the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) only allows identical options for subsequent new browser instances. Therefore, it is not trivial to launch a cluster of browsers with distinct proxy settings. Right now, every browser has the same options. It's not possible to set options on a per-browser basis.

-Solution:
+Solution:

 1. Create an [upstream proxy router](https://github.com/GoogleChrome/puppeteer/issues/678).
 2. Modify the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) to accept a list of proxy strings and then pop() from this list at every new call to `workerInstance()` in https://github.com/thomasdondorf/puppeteer-cluster/blob/master/src/Cluster.ts. I wrote an [issue here](https://github.com/thomasdondorf/puppeteer-cluster/issues/107). **I ended up doing this**.
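The second option above (pop one proxy from a list each time a new browser worker is created) can be sketched roughly as follows. This is only an illustration, not the actual puppeteer-cluster patch: `createProxyAllocator` and `nextLaunchOptions` are hypothetical names, the proxy addresses are placeholders, and the sketch assumes the proxy is passed to the browser through Chromium's `--proxy-server` command-line flag.

```javascript
// Hypothetical helper: returns a function that builds launch options for the
// next browser worker, taking one proxy off the list on every call.
function createProxyAllocator(proxies) {
  const remaining = [...proxies]; // copy so the caller's array is not mutated

  return function nextLaunchOptions(baseOptions = {}) {
    const proxy = remaining.pop(); // undefined once the list is exhausted
    const args = [...(baseOptions.args || [])];
    if (proxy !== undefined) {
      // --proxy-server is a Chromium command-line switch.
      args.push(`--proxy-server=${proxy}`);
    }
    return { ...baseOptions, args };
  };
}

// Placeholder proxies; with two proxies, a third worker runs without one.
const nextLaunchOptions = createProxyAllocator([
  'socks5://127.0.0.1:9050',
  'http://127.0.0.1:8080',
]);

const first = nextLaunchOptions({ headless: true });
const second = nextLaunchOptions({ headless: true });
const third = nextLaunchOptions({ headless: true });

console.log(first.args, second.args, third.args);
```

Because `pop()` removes from the end of the array, proxies are handed out in reverse list order. Each returned object could then be passed as the launch options of one browser instance, so every worker gets its own proxy while all other options stay shared.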
@@ -298,7 +300,7 @@ Solution:
 ## Technical Notes

 Scraping is done with a headless chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
-
+
 If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at **hire@incolumitas.com**

 The chromium browser is started with the following flags to prevent
@@ -506,4 +508,4 @@ let scrape_config = {
 num: 100, // Determines the number of results to show, defaults to 10. Maximum is 100.
 },
 }
-```
+```