forked from extern/se-scraper

commit 0278b24f0d ("ad"), parent 1694ee92d0

README.md
@@ -1,4 +1,6 @@
-# Search Engine Scraper - se-scraper
+# [The maintained successor of se-scraper is the general purpose crawling infrastructure](https://github.com/NikolaiT/Crawling-Infrastructure)
+
+## Search Engine Scraper - se-scraper
 
 [![npm](https://img.shields.io/npm/v/se-scraper.svg?style=for-the-badge)](https://www.npmjs.com/package/se-scraper)
 [![Donate](https://img.shields.io/badge/donate-paypal-blue.svg?style=for-the-badge)](https://www.paypal.me/incolumitas)
@@ -98,7 +100,7 @@ tschachn/se-scraper latest 897e1aeeba78 21 minutes
 
 You can check the [latest tag here](https://hub.docker.com/r/tschachn/se-scraper/tags). In the example below, the latest tag is **latest**. This will most likely remain **latest** in the future.
 
-Run the docker image and map the internal port 3000 to the external
+Run the docker image and map the internal port 3000 to the external
 port 3000:
 
 ```bash
@@ -289,7 +291,7 @@ Then **se-scraper** will create `K+1` dedicated browser instances with a unique
 
 The problem is that the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) only allows identical options for subsequent new browser instances. Therefore, it is not trivial to launch a cluster of browsers with distinct proxy settings. Right now, every browser has the same options; it is not possible to set options on a per-browser basis.
 
-Solution:
+Solution:
 
 1. Create an [upstream proxy router](https://github.com/GoogleChrome/puppeteer/issues/678).
 2. Modify the [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) to accept a list of proxy strings and pop() from this list on every new call to `workerInstance()` in https://github.com/thomasdondorf/puppeteer-cluster/blob/master/src/Cluster.ts. I wrote an [issue here](https://github.com/thomasdondorf/puppeteer-cluster/issues/107). **I ended up doing this.**
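The pop()-per-worker idea in step 2 can be sketched independently of puppeteer-cluster itself. The function below is a hypothetical illustration (the name `perBrowserLaunchOptions` and the returned option shape are assumptions, not the library's real API): each time a new browser instance is created, it consumes one proxy string from the list and turns it into a `--proxy-server` launch argument.

```javascript
// Hypothetical sketch of the "pop() a proxy per new browser instance" idea.
// perBrowserLaunchOptions is an assumed helper name, not puppeteer-cluster API.
function perBrowserLaunchOptions(proxyList) {
  // Take the next unused proxy; fail loudly if the cluster asks for
  // more browsers than we have proxies.
  const proxy = proxyList.pop();
  if (proxy === undefined) {
    throw new Error('more browser instances requested than proxies supplied');
  }
  // Chromium accepts a per-process proxy via this command-line flag.
  return { args: [`--proxy-server=${proxy}`] };
}

// Example: three proxies allow three distinctly configured browsers.
const proxies = ['http://p1:3128', 'http://p2:3128', 'http://p3:3128'];
console.log(perBrowserLaunchOptions(proxies).args[0]); // --proxy-server=http://p3:3128
```

Each browser instance then launches with its own proxy flag, which is exactly what the uniform-options limitation above prevents today.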
@@ -298,7 +300,7 @@ Solution:
 ## Technical Notes
 
 Scraping is done with a headless Chromium browser using the automation library Puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
 
 If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at **hire@incolumitas.com**.
 
 The Chromium browser is started with the following flags to prevent
@@ -506,4 +508,4 @@ let scrape_config = {
         num: 100, // Determines the number of results to show, defaults to 10. Maximum is 100.
     },
 }
-```
+```
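For context, the `num` option in the final hunk sits inside the engine settings of a `scrape_config` object. The sketch below is a hedged reconstruction: only the `num: 100` line appears in this diff; every other field name and value is an assumption, not taken from the README.

```javascript
// Hedged reconstruction of the surrounding scrape_config object.
// Only `num: 100` appears in the diff above; search_engine, keywords,
// and google_settings are assumptions about the enclosing structure.
let scrape_config = {
    search_engine: 'google',
    keywords: ['example query'],
    google_settings: {
        num: 100, // Determines the number of results to show, defaults to 10. Maximum is 100.
    },
};
```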