forked from extern/se-scraper
fixed some errors and way better README
This commit is contained in:
parent 089e410ec6
commit 79d32a315a

328 README.md
@@ -1,40 +1,59 @@
# Search Engine Scraper
# Search Engine Scraper - se-scraper

This node module supports scraping several search engines.
[![npm](https://badgen.now.sh/npm/v/se-scraper)](https://www.npmjs.com/package/se-scraper)
[![Known Vulnerabilities](https://snyk.io/test/github/NikolaiT/se-scraper/badge.svg)](https://snyk.io/test/github/NikolaiT/se-scraper)

Right now it's possible to scrape the following search engines
This node module allows you to scrape search engines concurrently with different proxies.

If you don't have much technical experience or don't want to purchase proxies, you can use [my scraping service](https://scrapeulous.com/).

##### Table of Contents
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Using Proxies](#proxies)
- [Examples](#examples)
- [Scraping Model](#scraping-model)
- [Technical Notes](#technical-notes)
- [Advanced Usage](#advanced-usage)

Se-scraper supports the following search engines:
* Google
* Google News
* Google News App version (https://news.google.com)
* Google Image
* Bing
* Bing News
* Baidu
* Youtube
* Infospace
* Duckduckgo
* Webcrawler
* Reuters
* cnbc
* Cnbc
* Marketwatch

This module uses puppeteer and puppeteer-cluster (modified version). It was created by the Developer of https://github.com/NikolaiT/GoogleScraper, a module with 1800 Stars on Github.
This module uses puppeteer and a modified version of [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster/). It was created by the Developer of [GoogleScraper](https://github.com/NikolaiT/GoogleScraper), a module with 1800 Stars on Github.

### Quickstart
## Installation

**Note**: If you **don't** want puppeteer to download a complete chromium browser, add this variable to your environment. Then this library is not guaranteed to run out of the box.
You need a working installation of **node** and the **npm** package manager.

```bash
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
```

Then install with
Install **se-scraper** by entering the following command in your terminal

```bash
npm install se-scraper
```

then create a file `run.js` with the following contents
If you **don't** want puppeteer to download a complete chromium browser, add this variable to your environment. Then this library is not guaranteed to run out of the box.

```bash
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
```

## Quickstart

Create a file named `run.js` with the following contents

```js
const se_scraper = require('se-scraper');
@@ -58,9 +77,9 @@ se_scraper.scrape(config, callback);

Start scraping by firing up the command `node run.js`

#### Scrape with proxies
## Proxies

**se-scraper** will create one browser instance per proxy. So the maximal ammount of concurency is equivalent to the number of proxies plus one (your own IP).
**se-scraper** will create one browser instance per proxy. So the maximal amount of concurrency is equivalent to the number of proxies plus one (your own IP).

```js
const se_scraper = require('se-scraper');
@@ -84,16 +103,24 @@ function callback(err, response) {
se_scraper.scrape(config, callback);
```

With a proxy file such as (invalid proxies of course)
With a proxy file such as

```text
socks5://53.34.23.55:55523
socks4://51.11.23.22:22222
```

This will scrape with **three** browser instance each having their own IP address. Unfortunately, it is currently not possible to scrape with different proxies per tab (chromium issue).
This will scrape with **three** browser instances, each having its own IP address. Unfortunately, it is currently not possible to scrape with different proxies per tab. Chromium does not support that.

### Scraping Model
## Examples

* [Simple example scraping google](examples/quickstart.js) yields [these results](examples/results/data.json)
* [Scrape with one proxy per browser](examples/proxies.js) yields [these results](examples/results/proxyresults.json)
* [Scrape 100 keywords on Bing with multiple tabs in one browser](examples/multiple_tabs.js) produces [this](examples/results/bing.json)
* [Inject your own scraping logic](examples/pluggable.js)

## Scraping Model

**se-scraper** scrapes search engines only. In order to introduce concurrency into this library, it is necessary to define the scraping model. Then we can decide how we divide and conquer.
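One natural division suggested by this model is to split the keyword list across workers (one worker per proxy, plus one for your own IP). The sketch below only illustrates the idea; se-scraper's actual scheduling is delegated to puppeteer-cluster:

```javascript
// Sketch of divide-and-conquer over keywords: distribute them
// round-robin across N workers. This is an illustration of the model,
// not se-scraper's real scheduler (puppeteer-cluster handles that).
function divideKeywords(keywords, numWorkers) {
  const buckets = Array.from({ length: numWorkers }, () => []);
  keywords.forEach((kw, i) => buckets[i % numWorkers].push(kw));
  return buckets;
}

const buckets = divideKeywords(['a', 'b', 'c', 'd', 'e'], 2);
console.log(buckets); // [ [ 'a', 'c', 'e' ], [ 'b', 'd' ] ]
```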

@@ -129,18 +156,11 @@ Solution:
2. Modify [puppeteer-cluster library](https://github.com/thomasdondorf/puppeteer-cluster) to accept a list of proxy strings and then pop() from this list at every new call to `workerInstance()` in https://github.com/thomasdondorf/puppeteer-cluster/blob/master/src/Cluster.ts. I wrote an [issue here](https://github.com/thomasdondorf/puppeteer-cluster/issues/107). **I ended up doing this**.

### Technical Notes
## Technical Notes

Scraping is done with a headless chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

No multithreading is supported for now. Only one scraping worker per `scrape()` call.

We will soon support parallelization. **se-scraper** will support an architecture similar to:

1. https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
2. https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html

If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at hire@incolumitas.com
If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at **hire@incolumitas.com**

The chromium browser is started with the following flags to prevent scraping detection.
@@ -162,7 +182,7 @@ var ADDITIONAL_CHROME_FLAGS = [
```

Furthermore, to avoid loading unnecessary resources and to speed up
scraping a great deal, we instruct chrome to not load images and css:
scraping a great deal, we instruct chrome to not load images and css and media:

```js
await page.setRequestInterception(true);
@@ -177,7 +197,7 @@ page.on('request', (req) => {
});
```
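The interception handler above boils down to a per-request allow/block decision on the resource type. A standalone sketch of that decision follows; the exact set of blocked types here is an assumption based on the `block_assets` description, not copied from se-scraper's source:

```javascript
// Sketch of the block_assets decision: abort requests for heavy
// resource types, let everything else through. The blocked set is an
// assumption inferred from the README text (images, css, fonts, media).
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// In puppeteer this would be wired up roughly like:
// page.on('request', (req) =>
//   shouldBlock(req.resourceType()) ? req.abort() : req.continue());

console.log(shouldBlock('image'));    // true
console.log(shouldBlock('document')); // false
```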

### Making puppeteer and headless chrome undetectable
#### Making puppeteer and headless chrome undetectable

Consider the following resources:
@@ -204,59 +224,69 @@ let config = {

It will create a screenshot named `headless-test-result.png` in the directory where the scraper was started that shows whether all tests have passed.

### Advanced Usage
## Advanced Usage

Use se-scraper by calling it with a script such as the one below.
Use **se-scraper** by calling it with a script such as the one below.

```js
const se_scraper = require('se-scraper');
const resolve = require('path').resolve;

// options for scraping
event = {
let config = {
    // the user agent to scrape with
    user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    // if random_user_agent is set to True, a random user agent is chosen
    random_user_agent: true,
    // whether to select manual settings in visible mode
    set_manual_settings: false,
    // log ip address data
    log_ip_address: false,
    // log http headers
    log_http_headers: false,
    // how long to sleep between requests. a random sleep interval within the range [a,b]
    // is drawn before every request. empty string for no sleeping.
    sleep_range: '[1,1]',
    sleep_range: '[1,2]',
    // which search engine to scrape
    search_engine: 'google',
    compress: false, // compress
    // whether debug information should be printed
    // debug info is useful for developers when debugging
    debug: false,
    // whether verbose program output should be printed
    // this output is informational
    verbose: true,
    keywords: ['scrapeulous.com'],
    // an array of keywords to scrape
    keywords: ['scrapeulous.com', 'scraping search engines', 'scraping service scrapeulous', 'learn js'],
    // alternatively you can specify a keyword_file. this overwrites the keywords array
    keyword_file: '',
    // the number of pages to scrape for each keyword
    num_pages: 2,
    // whether to start the browser in headless mode
    headless: true,
    // the number of pages to scrape for each keyword
    num_pages: 1,
    // path to output file, data will be stored in JSON
    output_file: '',
    // whether to prevent images, css, fonts and media from being loaded
    output_file: 'examples/results/advanced.json',
    // whether to prevent images, css, fonts from being loaded
    // will speed up scraping a great deal
    block_assets: true,
    // path to js module that extends functionality
    // this module should export the functions:
    // get_browser, handle_metadata, close_browser
    // must be an absolute path to the module
    //custom_func: resolve('examples/pluggable.js'),
    custom_func: '',
    // path to a proxy file, one proxy per line. Example:
    // use a proxy for all connections
    // example: 'socks5://78.94.172.42:1080'
    // example: 'http://118.174.233.10:48400'
    proxy: '',
    // a file with one proxy per line. Example:
    // socks5://78.94.172.42:1080
    // http://118.174.233.10:48400
    proxy_file: '',
    proxies: [],
    // check if headless chrome escapes common detection techniques
    // this is a quick test and should be used for debugging
    test_evasion: false,
    // settings for puppeteer-cluster
    monitor: false,
    // log ip address data
    log_ip_address: false,
    // log http headers
    log_http_headers: false,
    puppeteer_cluster_config: {
        timeout: 10 * 60 * 1000, // max timeout set to 10 minutes
        monitor: false,
        concurrency: 1, // one scraper per tab
        maxConcurrency: 2, // scrape with 2 tabs
    }
};

function callback(err, response) {
@@ -275,198 +305,4 @@ function callback(err, response) {
se_scraper.scrape(config, callback);
```
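The `sleep_range: '[1,2]'` string encodes an interval in seconds from which a random sleep is drawn before every request. A sketch of how such a string can be parsed and sampled; the helper names are made up and this is not se-scraper's internal code:

```javascript
// Hypothetical parser for the sleep_range option: '[a,b]' -> draw a
// random sleep duration (in seconds) from the closed interval [a, b].
function parseSleepRange(range) {
  const match = /^\[(\d+),(\d+)\]$/.exec(range.trim());
  if (!match) throw new Error(`invalid sleep_range: ${range}`);
  return [Number(match[1]), Number(match[2])];
}

function drawSleepSeconds(range) {
  const [a, b] = parseSleepRange(range);
  return a + Math.random() * (b - a);
}

const [lo, hi] = parseSleepRange('[1,2]');
console.log(lo, hi); // 1 2
```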

Supported options for the `search_engine` config key:

```javascript
'google'
'google_news_old'
'google_news'
'google_image'
'bing'
'bing_news'
'infospace'
'webcrawler'
'baidu'
'youtube'
'duckduckgo_news'
'reuters'
'cnbc'
'marketwatch'
```
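Since `search_engine` is a plain string, a typo only surfaces at runtime. A small guard like the following (a hypothetical helper, not part of se-scraper) catches that early, using the supported list above:

```javascript
// Hypothetical guard: validate the search_engine config value against
// the list of supported engines from this README before scraping.
const SUPPORTED_ENGINES = [
  'google', 'google_news_old', 'google_news', 'google_image',
  'bing', 'bing_news', 'infospace', 'webcrawler', 'baidu',
  'youtube', 'duckduckgo_news', 'reuters', 'cnbc', 'marketwatch',
];

function assertSupportedEngine(engine) {
  if (!SUPPORTED_ENGINES.includes(engine)) {
    throw new Error(`unsupported search_engine: ${engine}`);
  }
  return engine;
}

console.log(assertSupportedEngine('google')); // google
```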

Output for the above script on my machine:

```text
{ 'scraping scrapeulous.com':
   { '1':
      { time: 'Tue, 29 Jan 2019 21:39:22 GMT',
        num_results: 'Ungefähr 145 Ergebnisse (0,18 Sekunden) ',
        no_results: false,
        effective_query: '',
        results:
         [ { link: 'https://scrapeulous.com/',
             title: 'Scrapeuloushttps://scrapeulous.com/Im CacheDiese Seite übersetzen',
             snippet: 'Scrapeulous.com allows you to scrape various search engines automatically ... or to find hidden links, Scrapeulous.com enables you to scrape a ever increasing ...',
             visible_link: 'https://scrapeulous.com/',
             date: '',
             rank: 1 },
           { link: 'https://scrapeulous.com/about/',
             title: 'About - Scrapeuloushttps://scrapeulous.com/about/Im CacheDiese Seite übersetzen',
             snippet: 'Scrapeulous.com allows you to scrape various search engines automatically and in large quantities. The business requirement to scrape information from ...',
             visible_link: 'https://scrapeulous.com/about/',
             date: '',
             rank: 2 },
           { link: 'https://scrapeulous.com/howto/',
             title: 'Howto - Scrapeuloushttps://scrapeulous.com/howto/Im CacheDiese Seite übersetzen',
             snippet: 'We offer scraping large amounts of keywords for the Google Search Engine. Large means any number of keywords between 40 and 50000. Additionally, we ...',
             visible_link: 'https://scrapeulous.com/howto/',
             date: '',
             rank: 3 },
           { link: 'https://github.com/NikolaiT/se-scraper',
             title: 'GitHub - NikolaiT/se-scraper: Javascript scraping module based on ...https://github.com/NikolaiT/se-scraperIm CacheDiese Seite übersetzen',
             snippet: '24.12.2018 - Javascript scraping module based on puppeteer for many different search ... for many different search engines... https://scrapeulous.com/.',
             visible_link: 'https://github.com/NikolaiT/se-scraper',
             date: '24.12.2018 - ',
             rank: 4 },
           { link: 'https://github.com/NikolaiT/GoogleScraper/blob/master/README.md',
             title: 'GoogleScraper/README.md at master · NikolaiT/GoogleScraper ...https://github.com/NikolaiT/GoogleScraper/blob/.../README.mdIm CacheÄhnliche SeitenDiese Seite übersetzen',
             snippet: 'GoogleScraper - Scraping search engines professionally. Scrapeulous.com - Scraping Service. GoogleScraper is a open source tool and will remain a open ...',
             visible_link: 'https://github.com/NikolaiT/GoogleScraper/blob/.../README.md',
             date: '',
             rank: 5 },
           { link: 'https://googlescraper.readthedocs.io/',
             title: 'Welcome to GoogleScraper\'s documentation! — GoogleScraper ...https://googlescraper.readthedocs.io/Im CacheDiese Seite übersetzen',
             snippet: 'Welcome to GoogleScraper\'s documentation!¶. Contents: GoogleScraper - Scraping search engines professionally · Scrapeulous.com - Scraping Service ...',
             visible_link: 'https://googlescraper.readthedocs.io/',
             date: '',
             rank: 6 },
           { link: 'https://incolumitas.com/pages/scrapeulous/',
             title: 'Coding, Learning and Business Ideas – Scrapeulous.com - Incolumitashttps://incolumitas.com/pages/scrapeulous/Im CacheDiese Seite übersetzen',
             snippet: 'A scraping service for scientists, marketing professionals, analysts or SEO folk. In autumn 2018, I created a scraping service called scrapeulous.com. There you ...',
             visible_link: 'https://incolumitas.com/pages/scrapeulous/',
             date: '',
             rank: 7 },
           { link: 'https://incolumitas.com/',
             title: 'Coding, Learning and Business Ideashttps://incolumitas.com/Im CacheDiese Seite übersetzen',
             snippet: 'Scraping Amazon Reviews using Headless Chrome Browser and Python3. Posted on Mi ... GoogleScraper Tutorial - How to scrape 1000 keywords with Google.',
             visible_link: 'https://incolumitas.com/',
             date: '',
             rank: 8 },
           { link: 'https://en.wikipedia.org/wiki/Search_engine_scraping',
             title: 'Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scrapingIm CacheDiese Seite übersetzen',
             snippet: 'Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a ...',
             visible_link: 'https://en.wikipedia.org/wiki/Search_engine_scraping',
             date: '',
             rank: 9 },
           { link: 'https://readthedocs.org/projects/googlescraper/downloads/pdf/latest/',
             title: 'GoogleScraper Documentation - Read the Docshttps://readthedocs.org/projects/googlescraper/downloads/.../latest...Im CacheDiese Seite übersetzen',
             snippet: '23.12.2018 - Contents: 1 GoogleScraper - Scraping search engines professionally. 1. 1.1 ... For this reason, I created the web service scrapeulous.com.',
             visible_link: 'https://readthedocs.org/projects/googlescraper/downloads/.../latest...',
             date: '23.12.2018 - ',
             rank: 10 } ] },
     '2':
      { time: 'Tue, 29 Jan 2019 21:39:24 GMT',
        num_results: 'Seite 2 von ungefähr 145 Ergebnissen (0,20 Sekunden) ',
        no_results: false,
        effective_query: '',
        results:
         [ { link: 'https://pypi.org/project/CountryGoogleScraper/',
             title: 'CountryGoogleScraper · PyPIhttps://pypi.org/project/CountryGoogleScraper/Im CacheDiese Seite übersetzen',
             snippet: 'A module to scrape and extract links, titles and descriptions from various search ... Look [here to get an idea how to use asynchronous mode](http://scrapeulous.',
             visible_link: 'https://pypi.org/project/CountryGoogleScraper/',
             date: '',
             rank: 1 },
           { link: 'https://www.youtube.com/watch?v=a6xn6rc9GbI',
             title: 'scrapeulous intro - YouTubehttps://www.youtube.com/watch?v=a6xn6rc9GbIDiese Seite übersetzen',
             snippet: 'scrapeulous intro. Scrapeulous Scrapeulous. Loading... Unsubscribe from ... on Dec 16, 2018. Introduction ...',
             visible_link: 'https://www.youtube.com/watch?v=a6xn6rc9GbI',
             date: '',
             rank: 3 },
           { link: 'https://www.reddit.com/r/Python/comments/2tii3r/scraping_260_search_queries_in_bing_in_a_matter/',
             title: 'Scraping 260 search queries in Bing in a matter of seconds using ...https://www.reddit.com/.../scraping_260_search_queries_in_bing...Im CacheDiese Seite übersetzen',
             snippet: '24.01.2015 - Scraping 260 search queries in Bing in a matter of seconds using asyncio and aiohttp. (scrapeulous.com). submitted 3 years ago by ...',
             visible_link: 'https://www.reddit.com/.../scraping_260_search_queries_in_bing...',
             date: '24.01.2015 - ',
             rank: 4 },
           { link: 'https://twitter.com/incolumitas_?lang=de',
             title: 'Nikolai Tschacher (@incolumitas_) | Twitterhttps://twitter.com/incolumitas_?lang=deIm CacheÄhnliche SeitenDiese Seite übersetzen',
             snippet: 'Learn how to scrape millions of url from yandex and google or bing with: http://scrapeulous.com/googlescraper-market-analysis.html … 0 replies 0 retweets 0 ...',
             visible_link: 'https://twitter.com/incolumitas_?lang=de',
             date: '',
             rank: 5 },
           { link: 'http://blog.shodan.io/hostility-in-the-python-package-index/',
             title: 'Hostility in the Cheese Shop - Shodan Blogblog.shodan.io/hostility-in-the-python-package-index/Im CacheDiese Seite übersetzen',
             snippet: '22.02.2015 - https://zzz.scrapeulous.com/r? According to the author of the website, these hostile packages are used as honeypots. Honeypots are usually ...',
             visible_link: 'blog.shodan.io/hostility-in-the-python-package-index/',
             date: '22.02.2015 - ',
             rank: 6 },
           { link: 'https://libraries.io/github/NikolaiT/GoogleScraper',
             title: 'NikolaiT/GoogleScraper - Libraries.iohttps://libraries.io/github/NikolaiT/GoogleScraperIm CacheDiese Seite übersetzen',
             snippet: 'A Python module to scrape several search engines (like Google, Yandex, Bing, ... https://scrapeulous.com/ ... You can install GoogleScraper comfortably with pip:',
             visible_link: 'https://libraries.io/github/NikolaiT/GoogleScraper',
             date: '',
             rank: 7 },
           { link: 'https://pydigger.com/pypi/CountryGoogleScraper',
             title: 'CountryGoogleScraper - PyDiggerhttps://pydigger.com/pypi/CountryGoogleScraperDiese Seite übersetzen',
             snippet: '19.10.2016 - Look [here to get an idea how to use asynchronous mode](http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html). ### Table ...',
             visible_link: 'https://pydigger.com/pypi/CountryGoogleScraper',
             date: '19.10.2016 - ',
             rank: 8 },
           { link: 'https://hub.docker.com/r/cimenx/data-mining-penandtest/',
             title: 'cimenx/data-mining-penandtest - Docker Hubhttps://hub.docker.com/r/cimenx/data-mining-penandtest/Im CacheDiese Seite übersetzen',
             snippet: 'Container. OverviewTagsDockerfileBuilds · http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html. Docker Pull Command. Owner. profile ...',
             visible_link: 'https://hub.docker.com/r/cimenx/data-mining-penandtest/',
             date: '',
             rank: 9 },
           { link: 'https://www.revolvy.com/page/Search-engine-scraping',
             title: 'Search engine scraping | Revolvyhttps://www.revolvy.com/page/Search-engine-scrapingIm CacheDiese Seite übersetzen',
             snippet: 'Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a ...',
             visible_link: 'https://www.revolvy.com/page/Search-engine-scraping',
             date: '',
             rank: 10 } ] } } }
```
[Output for the above script on my machine.](examples/results/advanced.json)
@@ -1,47 +1,47 @@
24.12.2018
### 24.12.2018
- fix interface to scrape() [DONE]
- add to Github

24.1.2018
### 24.1.2018
- fix issue #3: add functionality to add keyword file

27.1.2019
### 27.1.2019
- Add functionality to block images and CSS from loading as described here:
  https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
  https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/

29.1.2019
### 29.1.2019
- implement proxy support functionality
- implement proxy check
- implement scraping more than 1 page
  - do it for google
  - and bing
- implement duckduckgo scraping

30.1.2019
### 30.1.2019
- modify all scrapers to use the generic class where it makes sense
  - Bing, Baidu, Google, Duckduckgo

7.2.2019
### 7.2.2019
- add num_requests to test cases [done]

25.2.2019
### 25.2.2019
- https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
- add support for browsing with multiple browsers, use this neat library:
  - https://github.com/thomasdondorf/puppeteer-cluster [done]

### 28.2.2019
- write test case for multiple browsers/proxies
- write test case and example for multiple tabs with bing
- make README.md nicer. https://github.com/thomasdondorf/puppeteer-cluster/blob/master/README.md as template

TODO:
### TODO:
- write test case for proxy support and cluster support
- add captcha service solving support
- check if news instances run the same browser and if we can have one proxy per tab worker
- write test case for:
  - pluggable
134 examples/multiple_tabs.js Normal file
@@ -0,0 +1,134 @@
const se_scraper = require('./../index.js');

const Cluster = {
    CONCURRENCY_PAGE: 1, // shares cookies, etc.
    CONCURRENCY_CONTEXT: 2, // no cookie sharing (uses contexts)
    CONCURRENCY_BROWSER: 3, // no cookie sharing and individual processes (uses contexts)
};

let keywords = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia',
    'Phoenix', 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin',
    'Indianapolis', 'Jacksonville', 'San Francisco', 'Columbus', 'Charlotte',
    'Fort Worth', 'Detroit', 'El Paso', 'Memphis', 'Seattle', 'Denver',
    'Washington', 'Boston', 'Nashville-Davidson', 'Baltimore', 'Oklahoma City',
    'Louisville/Jefferson County', 'Portland', 'Las Vegas', 'Milwaukee',
    'Albuquerque', 'Tucson', 'Fresno', 'Sacramento', 'Long Beach', 'Kansas City',
    'Mesa', 'Virginia Beach', 'Atlanta', 'Colorado Springs', 'Omaha', 'Raleigh',
    'Miami', 'Oakland', 'Minneapolis', 'Tulsa', 'Cleveland', 'Wichita',
    'Arlington', 'New Orleans', 'Bakersfield', 'Tampa', 'Honolulu', 'Aurora',
    'Anaheim', 'Santa Ana', 'St. Louis', 'Riverside', 'Corpus Christi',
    'Lexington-Fayette', 'Pittsburgh', 'Anchorage', 'Stockton', 'Cincinnati',
    'St. Paul', 'Toledo', 'Greensboro', 'Newark', 'Plano', 'Henderson',
    'Lincoln', 'Buffalo', 'Jersey City', 'Chula Vista', 'Fort Wayne', 'Orlando',
    'St. Petersburg', 'Chandler', 'Laredo', 'Norfolk', 'Durham', 'Madison',
    'Lubbock', 'Irvine', 'Winston-Salem', 'Glendale', 'Garland', 'Hialeah',
    'Reno', 'Chesapeake', 'Gilbert', 'Baton Rouge', 'Irving', 'Scottsdale',
    'North Las Vegas', 'Fremont', 'Boise City', 'Richmond', 'San Bernardino'];

let config = {
    search_engine: 'bing',
    debug: false,
    verbose: true,
    keywords: keywords,
    num_pages: 1, // how many pages per keyword
    output_file: 'examples/results/bing.json',
    log_ip_address: false,
    headless: true,
    puppeteer_cluster_config: {
        timeout: 10 * 60 * 1000, // max timeout set to 10 minutes
        monitor: false,
        concurrency: Cluster.CONCURRENCY_PAGE, // one scraper per tab
        maxConcurrency: 7, // scrape with 7 tabs
    }
};

function callback(err, response) {
    if (err) {
        console.error(err)
    }
    console.dir(response, {depth: null, colors: true});
}

se_scraper.scrape(config, callback);
@@ -3,17 +3,17 @@ const se_scraper = require('./../index.js');
let config = {
    search_engine: 'google',
    debug: false,
    verbose: false,
    keywords: ['news', 'scrapeulous.com', 'incolumitas.com', 'i work too much'],
    verbose: true,
    keywords: ['news', 'scrapeulous.com', 'incolumitas.com', 'i work too much', 'what to do?', 'javascript is hard'],
    num_pages: 1,
    output_file: 'data.json',
    output_file: 'examples/results/proxyresults.json',
    proxy_file: '/home/nikolai/.proxies', // one proxy per line
    log_ip_address: true,
    log_ip_address: false,
};

function callback(err, response) {
    if (err) { console.error(err) }
    console.dir(response, {depth: null, colors: true});
    //console.dir(response, {depth: null, colors: true});
}

se_scraper.scrape(config, callback);
@@ -6,7 +6,7 @@ let config = {
    verbose: false,
    keywords: ['news', 'se-scraper'],
    num_pages: 1,
    output_file: 'data.json',
    output_file: 'examples/results/data.json',
};

function callback(err, response) {
730
examples/results/advanced.json
Normal file
730
examples/results/advanced.json
Normal file
@ -0,0 +1,730 @@
|
||||
{
  "scrapeulous.com": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:22:28 GMT",
      "num_results": "Ungefähr 200 Ergebnisse (0,25 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://scrapeulous.com/",
          "title": "Scrapeuloushttps://scrapeulous.com/Im CacheDiese Seite übersetzen",
          "snippet": "Scraping search engines like Google, Bing and Duckduckgo in large quantities from many geographical regions with real browsers.",
          "visible_link": "https://scrapeulous.com/",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://scrapeulous.com/about/",
          "title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/about/Im CacheDiese Seite übersetzen",
          "snippet": "Scrapeulous.com allows you to scrape various search engines automatically and in large quantities. The business requirement to scrape information from ...",
          "visible_link": "https://scrapeulous.com/about/",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://blog.scrapeulous.com/",
          "title": "Scrapeulous.com Bloghttps://blog.scrapeulous.com/Im CacheDiese Seite übersetzen",
          "snippet": "04.02.2019 - This clean blog serves to publish the latest announcements and changes for scrapeulous.com We will publish instrucitons and general tutorials ...",
          "visible_link": "https://blog.scrapeulous.com/",
          "date": "04.02.2019 - ",
          "rank": 3
        },
        {
          "link": "https://scrapeulous.com/news/",
          "title": "Scraping search engines with real browsers in large ... - Scrapeuloushttps://scrapeulous.com/news/Im CacheDiese Seite übersetzen",
          "snippet": "Scrapeulous.com News Api allows you to query the most recent world news for an index composed of developed market equities. The performance of those ...",
          "visible_link": "https://scrapeulous.com/news/",
          "date": "",
          "rank": 4
        },
        {
          "link": "https://scrapeulous.com/howto/",
          "title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/howto/Im CacheDiese Seite übersetzen",
          "snippet": "06.02.2019 - We offer scraping large amounts of keywords for the Google Search Engine. Large means any number of keywords between 30 and 50000.",
          "visible_link": "https://scrapeulous.com/howto/",
          "date": "06.02.2019 - ",
          "rank": 5
        },
        {
          "link": "https://scrapeulous.com/contact/",
          "title": "Contact - Scrapeuloushttps://scrapeulous.com/contact/Im CacheDiese Seite übersetzen",
          "snippet": "Contact scrapeulous.com. Your email address. Valid email address where we are going to contact you. We will not send spam mail. Your inquiry.",
          "visible_link": "https://scrapeulous.com/contact/",
          "date": "",
          "rank": 6
        },
        {
          "link": "https://scrapeulous.com/faq/",
          "title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/faq/Im CacheDiese Seite übersetzen",
          "snippet": "02.02.2019 - Scraping search engines like Google, Bing and Duckduckgo in large quantities from many geographical regions with real browsers.",
          "visible_link": "https://scrapeulous.com/faq/",
          "date": "02.02.2019 - ",
          "rank": 7
        },
        {
          "link": "https://scrapeulous.com/scrape/",
          "title": "Scraping search engines with real browsers in large ... - Scrapeuloushttps://scrapeulous.com/scrape/Im CacheDiese Seite übersetzen",
          "snippet": "It is super easy to use scrapeulous.com, because you can just upload a text/CSV file with your keywords and submit your email address. With this information ...",
          "visible_link": "https://scrapeulous.com/scrape/",
          "date": "",
          "rank": 8
        },
        {
          "link": "https://incolumitas.com/",
          "title": "Coding, Learning and Business Ideashttps://incolumitas.com/Im CacheDiese Seite übersetzen",
          "snippet": "About · Contact · GoogleScraper · Lichess Autoplay-Bot · Projects · Scrapeulous.com · Site Notice · SVGCaptcha · Home Archives Categories Tags Atom ...",
          "visible_link": "https://incolumitas.com/",
          "date": "",
          "rank": 9
        },
        {
          "link": "https://twitter.com/scrapeulous",
          "title": "Scrapeulous.com (@scrapeulous) | Twitterhttps://twitter.com/scrapeulousIm CacheDiese Seite übersetzen",
          "snippet": "The latest Tweets from Scrapeulous.com (@scrapeulous): \"Creating software to realize the best scraping service at https://t.co/R5NUqSSrB5\"",
          "visible_link": "https://twitter.com/scrapeulous",
          "date": "",
          "rank": 10
        }
      ]
    },
    "2": {
      "time": "Thu, 28 Feb 2019 14:22:30 GMT",
      "num_results": "Seite 2 von ungefähr 200 Ergebnissen (0,21 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://incolumitas.com/pages/scrapeulous/",
          "title": "Coding, Learning and Business Ideas – Scrapeulous.comhttps://incolumitas.com/pages/scrapeulous/Im CacheDiese Seite übersetzen",
          "snippet": "In autumn 2018, I created a scraping service called scrapeulous.com. There you can purchase scrape jobs that allow you to upload a keyword file which in turn ...",
          "visible_link": "https://incolumitas.com/pages/scrapeulous/",
          "date": "",
          "rank": 11
        },
        {
          "link": "https://www.youtube.com/watch?v=a6xn6rc9GbI",
          "title": "scrapeulous intro - YouTubehttps://www.youtube.com/watch?v=a6xn6rc9GbIDiese Seite übersetzen",
          "snippet": "Introduction for https://scrapeulous.com. ... scrapeulous intro. Scrapeulous Scrapeulous. Loading ...",
|
||||
"visible_link": "https://www.youtube.com/watch?v=a6xn6rc9GbI",
|
||||
"date": "",
|
||||
"rank": 12
|
||||
},
|
||||
{
|
||||
"link": "https://www.youtube.com/channel/UCJs1Xei5LRefg9GwFYdYhOw",
|
||||
"title": "Scrapeulous Scrapeulous - YouTubehttps://www.youtube.com/.../UCJs1Xei5LRefg9GwFYdYhOwIm CacheDiese Seite übersetzen",
|
||||
"snippet": "How to use scrapeulous.com - Duration: 3 minutes, 42 seconds. 32 minutes ago; 4 views. Introduction for https://scrapeulous.com. Show more. This item has ...",
|
||||
"visible_link": "https://www.youtube.com/.../UCJs1Xei5LRefg9GwFYdYhOw",
|
||||
"date": "",
|
||||
"rank": 13
|
||||
},
|
||||
{
|
||||
"link": "https://googlescraper.readthedocs.io/en/latest/README.html",
|
||||
"title": "GoogleScraper - Scraping search engines professionally ...https://googlescraper.readthedocs.io/en/latest/README.htmlIm CacheDiese Seite übersetzen",
|
||||
"snippet": "Scrapeulous.com - Scraping Service¶. GoogleScraper is a open source tool and will remain a open source tool in the future. Some people however would want ...",
|
||||
"visible_link": "https://googlescraper.readthedocs.io/en/latest/README.html",
|
||||
"date": "",
|
||||
"rank": 14
|
||||
},
|
||||
{
|
||||
"link": "https://github.com/NikolaiT/se-scraper",
|
||||
"title": "GitHub - NikolaiT/se-scraper: Javascript scraping module based on ...https://github.com/NikolaiT/se-scraperIm CacheDiese Seite übersetzen",
|
||||
"snippet": "const se_scraper = require('se-scraper'); let config = { search_engine: 'google', debug: false, verbose: false, keywords: ['news', 'scraping scrapeulous.com'], ...",
|
||||
"visible_link": "https://github.com/NikolaiT/se-scraper",
|
||||
"date": "",
|
||||
"rank": 15
|
||||
},
|
||||
{
|
||||
"link": "https://www.npmjs.com/package/se-scraper",
|
||||
"title": "se-scraper - npmhttps://www.npmjs.com/package/se-scraperIm CacheDiese Seite übersetzen",
|
||||
"snippet": "07.02.2019 - homepage. scrapeulous.com. repository. github. last publish. 20 days ago. collaborators. avatar. Test with RunKit · Report a vulnerability. Help.",
|
||||
"visible_link": "https://www.npmjs.com/package/se-scraper",
|
||||
"date": "07.02.2019 - ",
|
||||
"rank": 16
|
||||
},
|
||||
{
|
||||
"link": "https://pypi.org/project/CountryGoogleScraper/",
|
||||
"title": "CountryGoogleScraper · PyPIhttps://pypi.org/project/CountryGoogleScraper/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Look [here to get an idea how to use asynchronous mode](http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html). ### Table of Contents 1.",
|
||||
"visible_link": "https://pypi.org/project/CountryGoogleScraper/",
|
||||
"date": "",
|
||||
"rank": 17
|
||||
},
|
||||
{
|
||||
"link": "https://medium.com/@scrapeulous/in-case-you-dont-want-to-go-through-the-hassle-of-creating-your-own-well-maintained-scraping-code-32b029985d4a",
|
||||
"title": "In case you don't want to go through the hassle of creating your own ...https://medium.com/@scrapeulous/in-case-you-dont-want-to-go-through-the-hassle-of-c...",
|
||||
"snippet": "05.02.2019 - And if you want to use some services when you plan to scraper large amounts of keywords, https://scrapeulous.com/ offers scraping for various ...",
|
||||
"visible_link": "https://medium.com/@scrapeulous/in-case-you-dont-want-to-go-through-the-hassle-of-c...",
|
||||
"date": "05.02.2019 - ",
|
||||
"rank": 18
|
||||
},
|
||||
{
|
||||
"link": "https://readthedocs.org/projects/googlescraper/downloads/pdf/latest/",
|
||||
"title": "GoogleScraper Documentation - Read the Docshttps://readthedocs.org/projects/googlescraper/downloads/.../latest...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "03.02.2019 - 1.1 Scrapeulous.com - Scraping Service. GoogleScraper is a open source tool and will remain a open source tool in the future. Some people ...",
|
||||
"visible_link": "https://readthedocs.org/projects/googlescraper/downloads/.../latest...",
|
||||
"date": "03.02.2019 - ",
|
||||
"rank": 19
|
||||
},
|
||||
{
|
||||
"link": "http://blog.shodan.io/hostility-in-the-python-package-index/",
|
||||
"title": "Hostility in the Cheese Shop - Shodan Blogblog.shodan.io/hostility-in-the-python-package-index/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "22.02.2015 - https://zzz.scrapeulous.com/r? According to the author of the website, these hostile packages are used as honeypots. Honeypots are usually ...",
|
||||
"visible_link": "blog.shodan.io/hostility-in-the-python-package-index/",
|
||||
"date": "22.02.2015 - ",
|
||||
"rank": 20
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
  "scraping service scrapeulous": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:22:34 GMT",
      "num_results": "Ungefähr 100 Ergebnisse (0,45 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://scrapeulous.com/",
          "title": "Scraping search engines with real browsers in large quantitieshttps://scrapeulous.com/Im CacheDiese Seite übersetzen",
          "snippet": "Scraping search engines like Google, Bing and Duckduckgo in large ... Scrapeulous.com allows you to scrape various search engines automatically and in large ...",
          "visible_link": "https://scrapeulous.com/",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://scrapeulous.com/faq/",
          "title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/faq/Im CacheDiese Seite übersetzen",
          "snippet": "02.02.2019 - Scraping search engines like Google, Bing and Duckduckgo in ... After all, Google itself is the biggest web scraper known in history. ..... We offer general and specialized services to extract and scrape data from the internet.",
          "visible_link": "https://scrapeulous.com/faq/",
          "date": "02.02.2019 - ",
          "rank": 2
        },
        {
          "link": "https://googlescraper.readthedocs.io/en/latest/README.html",
          "title": "GoogleScraper - Scraping search engines professionally ...https://googlescraper.readthedocs.io/en/latest/README.htmlIm CacheDiese Seite übersetzen",
          "snippet": "Scrapeulous.com - Scraping Service¶. GoogleScraper is a open source tool and will remain a open source tool in the future. Some people however would want ...",
          "visible_link": "https://googlescraper.readthedocs.io/en/latest/README.html",
          "date": "",
          "rank": 3
        },
        {
          "link": "https://github.com/NikolaiT/se-scraper",
          "title": "GitHub - NikolaiT/se-scraper: Javascript scraping module based on ...https://github.com/NikolaiT/se-scraperIm CacheDiese Seite übersetzen",
          "snippet": "mdIm CacheÄhnliche SeitenDiese Seite übersetzen', snippet: 'GoogleScraper - Scraping search engines professionally. Scrapeulous.com - Scraping Service.",
          "visible_link": "https://github.com/NikolaiT/se-scraper",
          "date": "",
          "rank": 4
        },
        {
          "link": "https://readthedocs.org/projects/googlescraper/downloads/pdf/latest/",
          "title": "GoogleScraper Documentation - Read the Docshttps://readthedocs.org/projects/googlescraper/downloads/.../latest...Im CacheDiese Seite übersetzen",
          "snippet": "03.02.2019 - Contents: 1 GoogleScraper - Scraping search engines professionally. 1. 1.1 ... For this reason, I created the web service scrapeulous.com.",
          "visible_link": "https://readthedocs.org/projects/googlescraper/downloads/.../latest...",
          "date": "03.02.2019 - ",
          "rank": 5
        },
        {
          "link": "https://incolumitas.com/pages/scrapeulous/",
          "title": "Coding, Learning and Business Ideas – Scrapeulous.comhttps://incolumitas.com/pages/scrapeulous/Im CacheDiese Seite übersetzen",
          "snippet": "A scraping service for scientists, marketing professionals, analysts or SEO folk. In autumn 2018, I created a scraping service called scrapeulous.com. There you ...",
          "visible_link": "https://incolumitas.com/pages/scrapeulous/",
          "date": "",
          "rank": 6
        },
        {
          "link": "https://medium.com/@scrapeulous/responses",
          "title": "Responses – Scrapeulous Scrapeulous – Mediumhttps://medium.com/@scrapeulous/responsesIm CacheDiese Seite übersetzen",
          "snippet": "Responses published by Scrapeulous Scrapeulous on Medium. ... And if you want to use some services when you plan to scraper large amounts of keywords, ...",
          "visible_link": "https://medium.com/@scrapeulous/responses",
          "date": "",
          "rank": 7
        },
        {
          "link": "https://www.grepsr.com/",
          "title": "Grepsr | Web Scraping Service Platformhttps://www.grepsr.com/Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "Simplify data extraction with easy-to-use web scraping service platform and manage it better with powerful features and 24/7 support. Sign up free!",
          "visible_link": "https://www.grepsr.com/",
          "date": "",
          "rank": 8
        },
        {
          "link": "https://www.quora.com/Which-one-is-the-best-data-scraping-services",
          "title": "Which one is the best data scraping services? - Quorahttps://www.quora.com/Which-one-is-the-best-data-scraping-serv...Diese Seite übersetzen",
          "snippet": "We can list N number of data scraping service providers, but the thing is we have ... Scrapeulous.com - A simple solution to scrape various search engines from ...",
          "visible_link": "https://www.quora.com/Which-one-is-the-best-data-scraping-serv...",
          "date": "",
          "rank": 9
        },
        {
          "link": "https://www.promptcloud.com/",
          "title": "PromptCloud: Fully Managed Web Scraping Servicehttps://www.promptcloud.com/Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "Our web scraping service helps you extract data from websites without any technical hassle — leverage our decade-old expertize and dedicated support.",
          "visible_link": "https://www.promptcloud.com/",
          "date": "",
          "rank": 10
        }
      ]
    },
    "2": {
      "time": "Thu, 28 Feb 2019 14:22:36 GMT",
      "num_results": "Seite 2 von ungefähr 19 Ergebnissen (0,33 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "title": "Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scrapingIm CacheDiese Seite übersetzen",
          "snippet": "Search engine scraping is the process of harvesting URLs, descriptions, or other information .... When scraping websites and services the legal part is often a big concern for companies, for web scraping it greatly depends on the country a ...",
          "visible_link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "date": "",
          "rank": 11
        },
        {
          "link": "https://www.reddit.com/r/bigseo/comments/ao71gz/feedback_wanted_search_engine_scraping_software/",
          "title": "[Feedback wanted] Search engine scraping software for large ...https://www.reddit.com/.../feedback_wanted_search_engine_scra...Im CacheDiese Seite übersetzen",
          "snippet": "I am the creator of [scrapeulous.com](https://scrapeulous.com), a search engine scraping service. Back in 2013 or so, I created GoogleScraper, ...",
          "visible_link": "https://www.reddit.com/.../feedback_wanted_search_engine_scra...",
          "date": "",
          "rank": 12
        },
        {
          "link": "https://twitter.com/scrapeulous",
          "title": "Scrapeulous.com (@scrapeulous) | Twitterhttps://twitter.com/scrapeulousIm CacheDiese Seite übersetzen",
          "snippet": "The latest Tweets from Scrapeulous.com (@scrapeulous): \"Creating software to realize the best scraping service at https://t.co/R5NUqSSrB5\"",
          "visible_link": "https://twitter.com/scrapeulous",
          "date": "",
          "rank": 13
        },
        {
          "link": "http://firstpress.com.ng/tag/engine/",
          "title": "engine Archives - first pressfirstpress.com.ng/tag/engine/Im CacheDiese Seite übersetzen",
          "snippet": "08.02.2019 - I am the creator of scrapeulous.com, a search engine scraping service. Back in 2013 or so, I created GoogleScraper, a simple python library ...",
          "visible_link": "firstpress.com.ng/tag/engine/",
          "date": "08.02.2019 - ",
          "rank": 14
        },
        {
          "link": "https://libraries.io/github/NikolaiT/GoogleScraper",
          "title": "NikolaiT/GoogleScraper - Libraries.iohttps://libraries.io/github/NikolaiT/GoogleScraperIm CacheDiese Seite übersetzen",
          "snippet": "A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, Baidu and others) by using proxies ... https://scrapeulous.com/.",
          "visible_link": "https://libraries.io/github/NikolaiT/GoogleScraper",
          "date": "",
          "rank": 15
        },
        {
          "link": "https://www.tiki-toki.com/timeline/entry/625522/Google-Hacking-History-by-Bishop-Fox/",
          "title": "Google Hacking History by Bishop Fox - Tiki-Tokihttps://www.tiki-toki.com/.../Google-Hacking-History-by-Bishop-...Im CacheDiese Seite übersetzen",
          "snippet": "Indexed and makes searchable service banners for whole Internet for HTTP (Port 80), .... http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html ...",
          "visible_link": "https://www.tiki-toki.com/.../Google-Hacking-History-by-Bishop-...",
          "date": "",
          "rank": 16
        },
        {
          "link": "http://lifehackbuddy.com/ohhqdeh/ptyah5y.php?mdzaiooef=google-scraping",
          "title": "Google scraping - LifeHack Buddylifehackbuddy.com/ohhqdeh/ptyah5y.php?mdzaiooef...scrapingIm CacheDiese Seite übersetzen",
          "snippet": "03.04.2018 - Some people however would want to quickly have a paid service that lets them scrape some data from Google or any other search engine.",
          "visible_link": "lifehackbuddy.com/ohhqdeh/ptyah5y.php?mdzaiooef...scraping",
          "date": "03.04.2018 - ",
          "rank": 17
        },
        {
          "link": "https://www.robtex.com/cidr/167.99.240.0-20",
          "title": "Robtexhttps://www.robtex.com/cidr/167.99.240.0-20Im CacheDiese Seite übersetzen",
          "snippet": "167.99.240.80, A, unitedprint-se.com · 167.99.240.82, A, protectservice.info ... 167.99.241.130, A, mail.mothlive.info · 167.99.241.135, A, www.scrapeulous.com.",
          "visible_link": "https://www.robtex.com/cidr/167.99.240.0-20",
          "date": "",
          "rank": 18
        },
        {
          "link": "https://domain-status.com/archives/2018-9-16/com/transferred/170",
          "title": "Page 170 of 227 for .com transferred domains on 2018-09-16https://domain-status.com/archives/2018-9-16/com/.../170Im CacheDiese Seite übersetzen",
          "snippet": "16.09.2018 - ... scramblednest.com · scrapeulous.com · scrapling.com · scrapmetalservice.com · scrappysyrups.com · scratch4kidz.com · scratchcardsbonus.",
          "visible_link": "https://domain-status.com/archives/2018-9-16/com/.../170",
          "date": "16.09.2018 - ",
          "rank": 19
        }
      ]
    }
  },
  "scraping search engines": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:22:28 GMT",
      "num_results": "Ungefähr 21.800.000 Ergebnisse (0,47 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "title": "Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scraping",
          "snippet": "",
          "visible_link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "title": "Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scraping",
          "snippet": "",
          "visible_link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "title": "Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scrapingIm CacheDiese Seite übersetzen",
          "snippet": "Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a specific form of screen scraping or web scraping dedicated to search engines only.",
          "visible_link": "https://en.wikipedia.org/wiki/Search_engine_scraping",
          "date": "",
          "rank": 3
        },
        {
          "link": "https://rotatingproxies.com/blog/2017/07/search-engine-easiest-scrape/",
          "title": "Which Search Engine is Easiest to Scrape? | RotatingProxies Bloghttps://rotatingproxies.com/blog/.../search-engine-easiest-scrape/Im CacheDiese Seite übersetzen",
          "snippet": "23.07.2017 - Without search engines, the internet would be one big pile of mush. Content left, right and center, but nothing tangible to point you in the correct ...",
          "visible_link": "https://rotatingproxies.com/blog/.../search-engine-easiest-scrape/",
          "date": "23.07.2017 - ",
          "rank": 4
        },
        {
          "link": "https://stackoverflow.com/questions/5403270/search-engine-that-allows-results-to-be-scraped",
          "title": "Search engine that allows results to be scraped? - Stack Overflowhttps://stackoverflow.com/.../search-engine-that-allows-results-to-...Im CacheDiese Seite übersetzen",
          "snippet": "12.11.2011 - You might be actually looking for Google JSON Search API: it allows you to ... search Google from your program, and is easier to use than screen-scraping.",
          "visible_link": "https://stackoverflow.com/.../search-engine-that-allows-results-to-...",
          "date": "12.11.2011 - ",
          "rank": 5
        },
        {
          "link": "http://www.scrapebox.com/search-engine-scraper",
          "title": "Search Engine Scraper - ScrapeBoxwww.scrapebox.com/search-engine-scraperIm CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "With the release of ScrapeBox v2.0 we have created the fastest, most power Search Engine Scraper ever built. It's the first desktop SERP Scraper we have ever ...",
          "visible_link": "www.scrapebox.com/search-engine-scraper",
          "date": "",
          "rank": 6
        },
        {
          "link": "https://www.reddit.com/r/scrapinghub/comments/6i3p41/good_search_engine_for_scraping/",
          "title": "Good search engine for scraping : scrapinghub - Reddithttps://www.reddit.com/.../scrapinghub/.../good_search_engine_f...Im CacheDiese Seite übersetzen",
          "snippet": "Google has anti-scraping captchas so I'm looking for something else, are there any other options?",
          "visible_link": "https://www.reddit.com/.../scrapinghub/.../good_search_engine_f...",
          "date": "",
          "rank": 7
        },
        {
          "link": "https://github.com/NikolaiT/GoogleScraper",
          "title": "GitHub - NikolaiT/GoogleScraper: A Python module to scrape several ...https://github.com/NikolaiT/GoogleScraperIm CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.",
          "visible_link": "https://github.com/NikolaiT/GoogleScraper",
          "date": "",
          "rank": 8
        },
        {
          "link": "http://scraping.pro/search-queries-in-a-search-engine-for-scraping/",
          "title": "Search queries in a search engine for scraping - Scraping.proscraping.pro/search-queries-in-a-search-engine-for-scraping/Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "10.07.2015 - So you might compose those urls according to your needs, store them in file(s) (since there could be millions of them) and then feed them into a ...",
          "visible_link": "scraping.pro/search-queries-in-a-search-engine-for-scraping/",
          "date": "10.07.2015 - ",
          "rank": 9
        },
        {
          "link": "https://www.searchenginejournal.com/scrape-google-serp-custom-extractions/267211/",
          "title": "How to Scrape SERPs to Optimize for Search Intent - Search Engine ...https://www.searchenginejournal.com › SEOIm CacheDiese Seite übersetzen",
          "snippet": "05.09.2018 - Having trouble gaining visibility for an important set of keywords? Here's how to use custom extractions to analyze SERP intent to diagnose ...",
          "visible_link": "https://www.searchenginejournal.com › SEO",
          "date": "05.09.2018 - ",
          "rank": 10
        },
        {
          "link": "https://scrapeulous.com/howto/",
          "title": "Scraping search engines with real browsers in large quantitieshttps://scrapeulous.com/howto/Im CacheDiese Seite übersetzen",
          "snippet": "06.02.2019 - We offer scraping large amounts of keywords for the Google Search Engine. Large means any number of keywords between 30 and 50000.",
          "visible_link": "https://scrapeulous.com/howto/",
          "date": "06.02.2019 - ",
          "rank": 11
        }
      ]
    },
    "2": {
      "time": "Thu, 28 Feb 2019 14:22:30 GMT",
      "num_results": "Seite 2 von ungefähr 21.800.000 Ergebnissen (0,42 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://www.maxresultsseo.com/search-engine-scraper",
          "title": "Google Scraper - Scrape unlimited results from Google + Binghttps://www.maxresultsseo.com/search-engine-scraperIm CacheDiese Seite übersetzen",
          "snippet": "Google Scraper is a desktop software tool that allows you to scrape results from search engines such as Google and Bing. It will also allow you to check Moz DA ...",
          "visible_link": "https://www.maxresultsseo.com/search-engine-scraper",
          "date": "",
          "rank": 12
        },
        {
          "link": "https://searchnewscentral.com/blog/2011/09/28/how-to-scrape-search-engines-without-pissing-them-off/",
          "title": "How to: Scrape search engines without pissing them off | Search News ...https://searchnewscentral.com/.../how-to-scrape-search-engines-w...Im CacheDiese Seite übersetzen",
          "snippet": "28.09.2011 - You can learn a lot about a search engine by scraping its results. It's the only easy way you can get an hourly or daily record of exactly what ...",
          "visible_link": "https://searchnewscentral.com/.../how-to-scrape-search-engines-w...",
          "date": "28.09.2011 - ",
          "rank": 13
        },
        {
          "link": "https://www.quora.com/What-SEO-agencies-or-services-use-for-scraping-Search-Engines",
          "title": "What SEO agencies or services use for scraping Search Engines? - Quorahttps://www.quora.com/What-SEO-agencies-or-services-use-for-s...Diese Seite übersetzen",
          "snippet": "13.12.2016 - To detect scraping, search engines need to see patterns. So, if you can randomize a few things, you can scrape. Most companies I know: ...",
          "visible_link": "https://www.quora.com/What-SEO-agencies-or-services-use-for-s...",
          "date": "13.12.2016 - ",
          "rank": 14
        },
        {
          "link": "https://scrapemasters.com/services/search-engines-scraping/",
          "title": "Google web scraping service. Scrape Google search results. Google ...https://scrapemasters.com/services/search-engines-scraping/Im CacheDiese Seite übersetzen",
          "snippet": "We turn any search engines (Google, Bing, Yahoo) results page (SERP) into structured data and deliver results through API or any other way convenient for you.",
          "visible_link": "https://scrapemasters.com/services/search-engines-scraping/",
          "date": "",
          "rank": 15
        },
        {
          "link": "https://www.seroundtable.com/google-apis-reduce-scraping-25978.html",
          "title": "Google Believes Providing APIs Won't Reduce Search Results Scrapinghttps://www.seroundtable.com/google-apis-reduce-scraping-2597...Im CacheDiese Seite übersetzen",
          "snippet": "29.06.2018 - Filed Under Google Search Engine Optimization ... if Google provided an API for this tool, it would reduce scraping of the Google search results.",
          "visible_link": "https://www.seroundtable.com/google-apis-reduce-scraping-2597...",
          "date": "29.06.2018 - ",
          "rank": 16
        },
        {
          "link": "https://www.youtube.com/watch?v=wuHvwFHoZ4U",
          "title": "How to use Search Engine Scraper - YouTubehttps://www.youtube.com/watch?v=wuHvwFHoZ4UDiese Seite übersetzen",
          "snippet": "Get the software from: http://www.bottopia.com/software/search-engine-scraper/ ... Web Scraping a Search ...",
          "visible_link": "https://www.youtube.com/watch?v=wuHvwFHoZ4U",
          "date": "",
          "rank": 17
        },
        {
          "link": "https://www.youtube.com/watch?v=BcQRIr3noOI",
          "title": "How to Scrape Google Search Results Quickly, Easily and for Free ...https://www.youtube.com/watch?v=BcQRIr3noOIÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "How to Scrape Google Search Results Quickly, Easily and for Free ... *UPDATE 6/30/2015 - Set your search ...",
          "visible_link": "https://www.youtube.com/watch?v=BcQRIr3noOI",
          "date": "",
          "rank": 18
        },
        {
          "link": "https://forum.gsa-online.de/discussion/20784/scrape-box-google-scraping-settings-next-best-engines-after-google-to-scrape-own-list-or-urls",
          "title": "Scrape Box Google scraping settings + Next Best engines after ...https://forum.gsa-online.de › ... › Need HelpIm CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "1) I am using Scrapebox with 20 Semi-dedicated proxies. ... Yes, scrape Bing or another search engine - you will be able to rank in Google if ...",
          "visible_link": "https://forum.gsa-online.de › ... › Need Help",
          "date": "",
          "rank": 19
        },
        {
          "link": "https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/",
          "title": "What is web scraping - Part 1 - Beginner's guide - ScrapeHerohttps://www.scrapehero.com/a-beginners-guide-to-web-scraping-...Im CacheDiese Seite übersetzen",
          "snippet": "29.01.2018 - For example, SERP monitoring services scrape search engine results periodically to show you how your search rankings have changed over ...",
          "visible_link": "https://www.scrapehero.com/a-beginners-guide-to-web-scraping-...",
          "date": "29.01.2018 - ",
          "rank": 20
        },
        {
          "link": "https://code.i-harness.com/de/q/4207e0",
          "title": "search-engine - Was ist der Unterschied zwischen Web-Crawling und ...https://code.i-harness.com/de/q/4207e0Im Cache",
          "snippet": "Web Scraping , um eine minimale Definition zu verwenden, ist der Prozess der Verarbeitung eines Webdokuments und das Extrahieren von Informationen ...",
          "visible_link": "https://code.i-harness.com/de/q/4207e0",
          "date": "",
          "rank": 21
        }
      ]
    }
  },
  "learn js": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:22:34 GMT",
      "num_results": "Ungefähr 646.000.000 Ergebnisse (0,33 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://www.learn-js.org/",
          "title": "Learn JavaScript - Free Interactive JavaScript Tutorialhttps://www.learn-js.org/Im CacheDiese Seite übersetzen",
          "snippet": "Learn-JS.org is a free interactive JavaScript tutorial for people who want to learn JavaScript, fast.",
          "visible_link": "https://www.learn-js.org/",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://learnjavascript.online/",
          "title": "Learn JavaScripthttps://learnjavascript.online/Im CacheDiese Seite übersetzen",
          "snippet": "Learn JavaScript Online: The easiest way to learn & practice modern JavaScript.",
          "visible_link": "https://learnjavascript.online/",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://www.w3schools.com/js/",
          "title": "JavaScript Tutorial - W3Schoolshttps://www.w3schools.com/js/Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "JavaScript is the programming language of HTML and the Web. JavaScript is easy to learn. This tutorial will teach you JavaScript from basic to advanced.",
          "visible_link": "https://www.w3schools.com/js/",
          "date": "",
          "rank": 3
        },
        {
          "link": "https://www.codecademy.com/learn/introduction-to-javascript",
          "title": "JavaScript Tutorial: Learn JavaScript For Free | Codecademyhttps://www.codecademy.com/learn/introduction-to-javascriptIm CacheDiese Seite übersetzen",
          "snippet": "Learn JavaScript and Javascript arrays to build interactive websites and pages that adapt to every device. Add dynamic behavior, store information, and handle ...",
          "visible_link": "https://www.codecademy.com/learn/introduction-to-javascript",
          "date": "",
          "rank": 4
        },
        {
          "link": "https://developer.mozilla.org/en-US/docs/Learn/JavaScript",
          "title": "JavaScript - Learn web development | MDNhttps://developer.mozilla.org/en-US/docs/Learn/JavaScriptIm CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "19.02.2019 - JavaScript is arguably more difficult to learn than related technologies such as HTML and CSS. Before attempting to learn JavaScript, you are ...",
          "visible_link": "https://developer.mozilla.org/en-US/docs/Learn/JavaScript",
          "date": "19.02.2019 - ",
          "rank": 5
        },
        {
          "link": "https://www.thebalancecareers.com/learn-javascript-online-2071405",
          "title": "Places to Learn JavaScript Online - The Balance Careershttps://www.thebalancecareers.com › ... › EducationIm CacheDiese Seite übersetzen",
          "snippet": "24.01.2019 - Want to dip your toes into the world of JavaScript? To learn JavaScript, check out the free and paid options for training and certification online.",
          "visible_link": "https://www.thebalancecareers.com › ... › Education",
          "date": "24.01.2019 - ",
          "rank": 6
        },
        {
          "link": "https://javascript.info/",
          "title": "The Modern Javascript Tutorialhttps://javascript.info/Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "Here we learn JavaScript, starting from scratch and go on to advanced ... Learning how to manage the browser page: add elements, manipulate their size and ...",
          "visible_link": "https://javascript.info/",
          "date": "",
          "rank": 7
        },
        {
          "link": "https://java.com/de/download/help/enable_browser.xml",
          "title": "Wie aktiviere ich Java in meinem Webbrowser?https://java.com/de/download/help/enable_browser.xmlJavaScript im Browser aktivieren, damit Anzeigen ... - Google Supporthttps://support.google.com/adsense/answer/12654?hl=deJavaScript: Aufgaben und Anwendungsbereiche - molilyhttps://molily.de/js/aufgaben.html",
          "snippet": "",
          "visible_link": "https://java.com/de/download/help/enable_browser.xmlhttps://support.google.com/adsense/answer/12654?hl=dehttps://molily.de/js/aufgaben.html",
          "date": "",
          "rank": 8
        },
        {
          "link": "https://java.com/de/download/help/enable_browser.xml",
          "title": "Wie aktiviere ich Java in meinem Webbrowser?https://java.com/de/download/help/enable_browser.xml",
          "snippet": "",
          "visible_link": "https://java.com/de/download/help/enable_browser.xml",
          "date": "",
          "rank": 9
        },
        {
          "link": "https://support.google.com/adsense/answer/12654?hl=de",
          "title": "JavaScript im Browser aktivieren, damit Anzeigen ... - Google Supporthttps://support.google.com/adsense/answer/12654?hl=de",
          "snippet": "",
          "visible_link": "https://support.google.com/adsense/answer/12654?hl=de",
          "date": "",
          "rank": 10
        },
        {
          "link": "https://molily.de/js/aufgaben.html",
|
||||
"title": "JavaScript: Aufgaben und Anwendungsbereiche - molilyhttps://molily.de/js/aufgaben.html",
|
||||
"snippet": "",
|
||||
"visible_link": "https://molily.de/js/aufgaben.html",
|
||||
"date": "",
|
||||
"rank": 11
|
||||
},
|
||||
{
|
||||
"link": "https://learnjavascript.today/",
|
||||
"title": "Learn JavaScript from scratchhttps://learnjavascript.today/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Build anything you want with JavaScript. Have you tried learning JavaScript for a while now, but feel that you're not making progress?",
|
||||
"visible_link": "https://learnjavascript.today/",
|
||||
"date": "",
|
||||
"rank": 12
|
||||
}
|
||||
]
|
||||
},
|
||||
"2": {
|
||||
"time": "Thu, 28 Feb 2019 14:22:36 GMT",
|
||||
"num_results": "Seite 2 von ungefähr 646.000.000 Ergebnissen (0,27 Sekunden) ",
|
||||
"no_results": false,
|
||||
"effective_query": "",
|
||||
"results": [
|
||||
{
|
||||
"link": "https://www.youtube.com/watch?v=PkZNo7MFNFg",
|
||||
"title": "Learn JavaScript - Full Course for Beginners - YouTubehttps://www.youtube.com/watch?v=PkZNo7MFNFgDiese Seite übersetzen",
|
||||
"snippet": "This complete 134-part JavaScript tutorial for beginners will teach you everything you need to know to get ...",
|
||||
"visible_link": "https://www.youtube.com/watch?v=PkZNo7MFNFg",
|
||||
"date": "",
|
||||
"rank": 13
|
||||
},
|
||||
{
|
||||
"link": "https://medium.com/coderbyte/50-resources-to-help-you-start-learning-javascript-in-2017-4c70b222a3b9",
|
||||
"title": "50 resources to help you start learning JavaScript in 2017 - Mediumhttps://medium.com/.../50-resources-to-help-you-start-learning-ja...Im CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "04.01.2017 - Over the last few years JavaScript has been surging in popularity as the language itself is growing, a majority of coding bootcamps teach the ...",
|
||||
"visible_link": "https://medium.com/.../50-resources-to-help-you-start-learning-ja...",
|
||||
"date": "04.01.2017 - ",
|
||||
"rank": 14
|
||||
},
|
||||
{
|
||||
"link": "https://medium.freecodecamp.org/want-to-learn-javascript-heres-a-free-24-part-course-to-get-you-started-e7777baf86fb",
|
||||
"title": "Want to learn JavaScript? Here's a free 24-part course to get you started.https://medium.freecodecamp.org/want-to-learn-javascript-heres-...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "19.04.2018 - The first concept you'll need to learn is variables, which are for storing values. In modern JavaScript there are two keywords for doing that: let ...",
|
||||
"visible_link": "https://medium.freecodecamp.org/want-to-learn-javascript-heres-...",
|
||||
"date": "19.04.2018 - ",
|
||||
"rank": 15
|
||||
},
|
||||
{
|
||||
"link": "http://javascriptissexy.com/how-to-learn-javascript-properly/",
|
||||
"title": "How to Learn JavaScript Properly | JavaScript Is Sexyjavascriptissexy.com/how-to-learn-javascript-properly/Ähnliche Seiten",
|
||||
"snippet": "You do want to learn JavaScript. I presume you are here for that reason, and you have made a wise decision. For if you want to develop modern websites and ...",
|
||||
"visible_link": "javascriptissexy.com/how-to-learn-javascript-properly/",
|
||||
"date": "",
|
||||
"rank": 16
|
||||
},
|
||||
{
|
||||
"link": "https://www.sololearn.com/Course/JavaScript/",
|
||||
"title": "JavaScript Tutorial | SoloLearn: Learn to code for FREE!https://www.sololearn.com/Course/JavaScript/Im CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "Our tutorial will teach you the fundamentals of JavaScript programming. You can learn how to make your website more interactive, change your website content, ...",
|
||||
"visible_link": "https://www.sololearn.com/Course/JavaScript/",
|
||||
"date": "",
|
||||
"rank": 17
|
||||
},
|
||||
{
|
||||
"link": "https://www.javascript.com/try",
|
||||
"title": "Start learning JavaScript with our free real time tutorialhttps://www.javascript.com/tryIm CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "Start learning JavaScript with our interactive simulator for free. Our easy to follow JavaScript tutorials for beginners will have you coding the basics in no time.",
|
||||
"visible_link": "https://www.javascript.com/try",
|
||||
"date": "",
|
||||
"rank": 18
|
||||
},
|
||||
{
|
||||
"link": "https://learnxinyminutes.com/docs/javascript/",
|
||||
"title": "Learn javascript in Y Minuteshttps://learnxinyminutes.com/docs/javascript/Im CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "JavaScript was created by Netscape's Brendan Eich in 1995. It was originally intended as a simpler scripting language for websites, complementing the use of ...",
|
||||
"visible_link": "https://learnxinyminutes.com/docs/javascript/",
|
||||
"date": "",
|
||||
"rank": 19
|
||||
},
|
||||
{
|
||||
"link": "https://www.pluralsight.com/paths/javascript",
|
||||
"title": "JavaScript | Pluralsighthttps://www.pluralsight.com/paths/javascriptIm CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "This learning path includes JavaScript tutorials for both the new programmer looking to get started and the advanced web developer wanting to solidify and ...",
|
||||
"visible_link": "https://www.pluralsight.com/paths/javascript",
|
||||
"date": "",
|
||||
"rank": 20
|
||||
},
|
||||
{
|
||||
"link": "https://www.edx.org/learn/javascript",
|
||||
"title": "Learn Javascript from Harvard, Microsoft, and more | edXhttps://www.edx.org/learn/javascriptIm CacheDiese Seite übersetzen",
|
||||
"snippet": "Free Javascript courses online. Learn Javascript programming and advance your career with free computer science courses from top institutions. Join now.",
|
||||
"visible_link": "https://www.edx.org/learn/javascript",
|
||||
"date": "",
|
||||
"rank": 21
|
||||
},
|
||||
{
|
||||
"link": "http://www.bestprogramminglanguagefor.me/why-learn-javascript",
|
||||
"title": "Why Learn JavaScript - Best Programming Languagewww.bestprogramminglanguagefor.me/why-learn-javascriptIm CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "JavaScript has become an essential web technology along with HTML and CSS, as most browsers implement JavaScript. Thus, You must learn JavaScript if you ...",
|
||||
"visible_link": "www.bestprogramminglanguagefor.me/why-learn-javascript",
|
||||
"date": "",
|
||||
"rank": 22
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
6903 examples/results/bing.json (new file)
File diff suppressed because it is too large
158 examples/results/data.json (new file)
@@ -0,0 +1,158 @@
{
  "news": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:24:51 GMT",
      "num_results": "Ungefähr 25.270.000.000 Ergebnisse (0,49 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://news.google.de/",
          "title": "Google Newshttps://news.google.de/Ähnliche Seiten",
          "snippet": "Ausführliche und aktuelle Beiträge - von Google News aus verschiedenen Nachrichtenquellen aus aller Welt zusammengetragen.",
          "visible_link": "https://news.google.de/",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://www.rtl.de/cms/news.html",
          "title": "News: Aktuelle Nachrichten, Schlagzeilen und Videos | RTL.dehttps://www.rtl.de/cms/news.html",
          "snippet": "Aktuelle Nachrichten aus Deutschland und der Welt auf einen Blick: Bei RTL.de finden Sie die News von heute, spannende Hintergründe und Videos.",
          "visible_link": "https://www.rtl.de/cms/news.html",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://www.zeit.de/news/index",
          "title": "Schlagzeilen, News und Newsticker | ZEIT ONLINE - Die Zeithttps://www.zeit.de/news/index",
          "snippet": "Aktuelle News und Schlagzeilen im Newsticker von ZEIT ONLINE. Lesen Sie hier die neuesten Nachrichten.",
          "visible_link": "https://www.zeit.de/news/index",
          "date": "",
          "rank": 3
        },
        {
          "link": "https://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "title": "News aktuell aus Deutschland und der Welt - Bild.dehttps://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "snippet": "Aktuelle News aus Deutschland, Europa und der Welt. Alle Informationen, Bilder und Videos zu Skandalen, Krisen und Sensationen bei BILD.de.",
          "visible_link": "https://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "date": "",
          "rank": 4
        },
        {
          "link": "http://www.news.de/",
          "title": "news.de - mehr als Nachrichten und News, die Sie bewegenwww.news.de/Ähnliche Seiten",
          "snippet": "Promi News und Aktuelles aus Sport, TV & Web. Jetzt Sportnachrichten von Fußball bis Boxen und das Neueste aus Klatsch und Tratsch per Newsticker, Fotos ...",
          "visible_link": "www.news.de/",
          "date": "",
          "rank": 5
        },
        {
          "link": "https://www.mopo.de/news",
          "title": "News - Aktuelle Nachrichten aus Deutschland und der Welt. | MOPO.dehttps://www.mopo.de/news",
          "snippet": "News - Aktuelle Nachrichten aus Hamburg, der Welt, zum HSV und der Welt der Promis.",
          "visible_link": "https://www.mopo.de/news",
          "date": "",
          "rank": 6
        },
        {
          "link": "https://www.t-online.de/nachrichten/",
          "title": "Politik aktuell: Nachrichten aus Deutschland, Europa und der Welthttps://www.t-online.de/nachrichten/",
          "snippet": "Trump trifft Kim: Der Nordkorea-Gipfel in Vietnam im News-Blog · Krise in Venezuela: Aktuelle Entwicklungen, ... E-Mails und News unterwegs immer dabei.",
          "visible_link": "https://www.t-online.de/nachrichten/",
          "date": "",
          "rank": 7
        },
        {
          "link": "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUmxHZ0pFUlNnQVAB?hl=de&gl=DE&ceid=DE%3Ade",
          "title": "Google News - Schlagzeilen - Neuestehttps://news.google.com/.../CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUm...",
          "snippet": "Mit Google News kannst du zum Thema Schlagzeilen vollständige Artikel lesen, Videos ansehen und in Tausenden von Titeln stöbern.",
          "visible_link": "https://news.google.com/.../CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUm...",
          "date": "",
          "rank": 8
        },
        {
          "link": "https://www.n-tv.de/",
          "title": "Nachrichten, aktuelle Schlagzeilen und Videos - n-tv.dehttps://www.n-tv.de/",
          "snippet": "Nachrichten seriös, schnell und kompetent. Artikel und Videos aus Politik, Wirtschaft, Börse, Sport und News aus aller Welt.",
          "visible_link": "https://www.n-tv.de/",
          "date": "",
          "rank": 9
        }
      ]
    }
  },
  "se-scraper": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:24:51 GMT",
      "num_results": "Ungefähr 16.400.000 Ergebnisse (0,27 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://www.npmjs.com/package/se-scraper",
          "title": "se-scraper - npmhttps://www.npmjs.com/package/se-scraperIm CacheDiese Seite übersetzen",
          "snippet": "07.02.2019 - A simple library using puppeteer to scrape several search engines such as Google, Duckduckgo and Bing.",
          "visible_link": "https://www.npmjs.com/package/se-scraper",
          "date": "07.02.2019 - ",
          "rank": 1
        },
        {
          "link": "https://github.com/NikolaiT/se-scraper",
          "title": "GitHub - NikolaiT/se-scraper: Javascript scraping module based on ...https://github.com/NikolaiT/se-scraperIm CacheDiese Seite übersetzen",
          "snippet": "Javascript scraping module based on puppeteer for many different search engines... - NikolaiT/se-scraper.",
          "visible_link": "https://github.com/NikolaiT/se-scraper",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://github.com/nyancat18/Se-Scraper",
          "title": "GitHub - nyancat18/Se-Scraper: se-scraper your siteshttps://github.com/nyancat18/Se-ScraperIm CacheDiese Seite übersetzen",
          "snippet": "se-scraper your sites. Contribute to nyancat18/Se-Scraper development by creating an account on GitHub.",
          "visible_link": "https://github.com/nyancat18/Se-Scraper",
          "date": "",
          "rank": 3
        },
        {
          "link": "http://konjugator.reverso.net/konjugation-franzosisch-verb-se%20scraper.html",
          "title": "Konjugation se scraper | Konjugieren verb se scraper Französisch ...konjugator.reverso.net/konjugation-franzosisch-verb-se%20scraper.html",
          "snippet": "Reverso-Konjugation: Konjugation des französischen Verbs se scraper, Konjugator für französische Verben, unregelmäßige Verben, Übersetzung,Grammatik.",
          "visible_link": "konjugator.reverso.net/konjugation-franzosisch-verb-se%20scraper.html",
          "date": "",
          "rank": 4
        },
        {
          "link": "https://swedishicescraper.se/",
          "title": "Swedish Ice Scraper: Onlinehttps://swedishicescraper.se/Im CacheDiese Seite übersetzen",
          "snippet": "The original Swedish Ice Scraper - best in test. ... solid Acrylic Glass and use diamond polishing to sharpen the scraping edges. ... info@swedishicescraper.se.",
          "visible_link": "https://swedishicescraper.se/",
          "date": "",
          "rank": 5
        },
        {
          "link": "https://www.blackhatworld.com/seo/any-yandex-scrapers-available-or-universal-se-scraper.243421/",
          "title": "Any yandex scrapers available? Or universal SE scraper ...https://www.blackhatworld.com › ... › Black Hat SEO ToolsIm CacheDiese Seite übersetzen",
          "snippet": "10.10.2010 - Mostly blogs & stuff like that. Is Hrefer for yandex only or there are other SEs? How much is it? Advertise on BHW ...",
          "visible_link": "https://www.blackhatworld.com › ... › Black Hat SEO Tools",
          "date": "10.10.2010 - ",
          "rank": 6
        },
        {
          "link": "http://network.ubotstudio.com/forum/index.php/topic/8648-sell-free-sescraper-scrape-search-engines-with-long-lists-of-queries/",
          "title": "[SELL] FREE - SEscraper - scrape search engines with long lists of ...network.ubotstudio.com › ... › Sell › Bots and ScriptsIm CacheDiese Seite übersetzen",
          "snippet": "03.12.2011 - SEscraper. Scrape results from: Google Yahoo Bing AOL Enter one or more queries as well as an optional list of keywords to append to each ...",
          "visible_link": "network.ubotstudio.com › ... › Sell › Bots and Scripts",
          "date": "03.12.2011 - ",
          "rank": 7
        },
        {
          "link": "https://netpeaksoftware.com/blog/netpeak-checker-3-0-serp-scraping",
          "title": "Netpeak Checker 3.0: SERP Scraping – Netpeak Software Bloghttps://netpeaksoftware.com/.../netpeak-checker-3-0-serp-scrapin...Im CacheDiese Seite übersetzen",
          "snippet": "19.09.2018 - With a new tool under an 'SE Scraper' nickname you can get Google, Bing, Yahoo, and Yandex search results in a structured table with a lot of ...",
          "visible_link": "https://netpeaksoftware.com/.../netpeak-checker-3-0-serp-scrapin...",
          "date": "19.09.2018 - ",
          "rank": 8
        }
      ]
    }
  }
}
172 examples/results/out.json (new file)
@@ -0,0 +1,172 @@
{ 'scraping scrapeulous.com':
   { '1':
      { time: 'Tue, 29 Jan 2019 21:39:22 GMT',
        num_results: 'Ungefähr 145 Ergebnisse (0,18 Sekunden) ',
        no_results: false,
        effective_query: '',
        results:
         [ { link: 'https://scrapeulous.com/',
             title: 'Scrapeuloushttps://scrapeulous.com/Im CacheDiese Seite übersetzen',
             snippet: 'Scrapeulous.com allows you to scrape various search engines automatically ... or to find hidden links, Scrapeulous.com enables you to scrape a ever increasing ...',
             visible_link: 'https://scrapeulous.com/',
             date: '',
             rank: 1 },
           { link: 'https://scrapeulous.com/about/',
             title: 'About - Scrapeuloushttps://scrapeulous.com/about/Im CacheDiese Seite übersetzen',
             snippet: 'Scrapeulous.com allows you to scrape various search engines automatically and in large quantities. The business requirement to scrape information from ...',
             visible_link: 'https://scrapeulous.com/about/',
             date: '',
             rank: 2 },
           { link: 'https://scrapeulous.com/howto/',
             title: 'Howto - Scrapeuloushttps://scrapeulous.com/howto/Im CacheDiese Seite übersetzen',
             snippet: 'We offer scraping large amounts of keywords for the Google Search Engine. Large means any number of keywords between 40 and 50000. Additionally, we ...',
             visible_link: 'https://scrapeulous.com/howto/',
             date: '',
             rank: 3 },
           { link: 'https://github.com/NikolaiT/se-scraper',
             title: 'GitHub - NikolaiT/se-scraper: Javascript scraping module based on ...https://github.com/NikolaiT/se-scraperIm CacheDiese Seite übersetzen',
             snippet: '24.12.2018 - Javascript scraping module based on puppeteer for many different search ... for many different search engines... https://scrapeulous.com/.',
             visible_link: 'https://github.com/NikolaiT/se-scraper',
             date: '24.12.2018 - ',
             rank: 4 },
           { link: 'https://github.com/NikolaiT/GoogleScraper/blob/master/README.md',
             title: 'GoogleScraper/README.md at master · NikolaiT/GoogleScraper ...https://github.com/NikolaiT/GoogleScraper/blob/.../README.mdIm CacheÄhnliche SeitenDiese Seite übersetzen',
             snippet: 'GoogleScraper - Scraping search engines professionally. Scrapeulous.com - Scraping Service. GoogleScraper is a open source tool and will remain a open ...',
             visible_link: 'https://github.com/NikolaiT/GoogleScraper/blob/.../README.md',
             date: '',
             rank: 5 },
           { link: 'https://googlescraper.readthedocs.io/',
             title: 'Welcome to GoogleScraper\'s documentation! — GoogleScraper ...https://googlescraper.readthedocs.io/Im CacheDiese Seite übersetzen',
             snippet: 'Welcome to GoogleScraper\'s documentation!¶. Contents: GoogleScraper - Scraping search engines professionally · Scrapeulous.com - Scraping Service ...',
             visible_link: 'https://googlescraper.readthedocs.io/',
             date: '',
             rank: 6 },
           { link: 'https://incolumitas.com/pages/scrapeulous/',
             title: 'Coding, Learning and Business Ideas – Scrapeulous.com - Incolumitashttps://incolumitas.com/pages/scrapeulous/Im CacheDiese Seite übersetzen',
             snippet: 'A scraping service for scientists, marketing professionals, analysts or SEO folk. In autumn 2018, I created a scraping service called scrapeulous.com. There you ...',
             visible_link: 'https://incolumitas.com/pages/scrapeulous/',
             date: '',
             rank: 7 },
           { link: 'https://incolumitas.com/',
             title: 'Coding, Learning and Business Ideashttps://incolumitas.com/Im CacheDiese Seite übersetzen',
             snippet: 'Scraping Amazon Reviews using Headless Chrome Browser and Python3. Posted on Mi ... GoogleScraper Tutorial - How to scrape 1000 keywords with Google.',
             visible_link: 'https://incolumitas.com/',
             date: '',
             rank: 8 },
           { link: 'https://en.wikipedia.org/wiki/Search_engine_scraping',
             title: 'Search engine scraping - Wikipediahttps://en.wikipedia.org/wiki/Search_engine_scrapingIm CacheDiese Seite übersetzen',
             snippet: 'Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a ...',
             visible_link: 'https://en.wikipedia.org/wiki/Search_engine_scraping',
             date: '',
             rank: 9 },
           { link: 'https://readthedocs.org/projects/googlescraper/downloads/pdf/latest/',
             title: 'GoogleScraper Documentation - Read the Docshttps://readthedocs.org/projects/googlescraper/downloads/.../latest...Im CacheDiese Seite übersetzen',
             snippet: '23.12.2018 - Contents: 1 GoogleScraper - Scraping search engines professionally. 1. 1.1 ... For this reason, I created the web service scrapeulous.com.',
             visible_link: 'https://readthedocs.org/projects/googlescraper/downloads/.../latest...',
             date: '23.12.2018 - ',
             rank: 10 } ] },
     '2':
      { time: 'Tue, 29 Jan 2019 21:39:24 GMT',
        num_results: 'Seite 2 von ungefähr 145 Ergebnissen (0,20 Sekunden) ',
        no_results: false,
        effective_query: '',
        results:
         [ { link: 'https://pypi.org/project/CountryGoogleScraper/',
             title: 'CountryGoogleScraper · PyPIhttps://pypi.org/project/CountryGoogleScraper/Im CacheDiese Seite übersetzen',
             snippet: 'A module to scrape and extract links, titles and descriptions from various search ... Look [here to get an idea how to use asynchronous mode](http://scrapeulous.',
             visible_link: 'https://pypi.org/project/CountryGoogleScraper/',
             date: '',
             rank: 1 },
           { link: 'https://www.youtube.com/watch?v=a6xn6rc9GbI',
             title: 'scrapeulous intro - YouTubehttps://www.youtube.com/watch?v=a6xn6rc9GbIDiese Seite übersetzen',
             snippet: 'scrapeulous intro. Scrapeulous Scrapeulous. Loading... Unsubscribe from ... on Dec 16, 2018. Introduction ...',
             visible_link: 'https://www.youtube.com/watch?v=a6xn6rc9GbI',
             date: '',
             rank: 3 },
           { link: 'https://www.reddit.com/r/Python/comments/2tii3r/scraping_260_search_queries_in_bing_in_a_matter/',
             title: 'Scraping 260 search queries in Bing in a matter of seconds using ...https://www.reddit.com/.../scraping_260_search_queries_in_bing...Im CacheDiese Seite übersetzen',
             snippet: '24.01.2015 - Scraping 260 search queries in Bing in a matter of seconds using asyncio and aiohttp. (scrapeulous.com). submitted 3 years ago by ...',
             visible_link: 'https://www.reddit.com/.../scraping_260_search_queries_in_bing...',
             date: '24.01.2015 - ',
             rank: 4 },
           { link: 'https://twitter.com/incolumitas_?lang=de',
             title: 'Nikolai Tschacher (@incolumitas_) | Twitterhttps://twitter.com/incolumitas_?lang=deIm CacheÄhnliche SeitenDiese Seite übersetzen',
             snippet: 'Learn how to scrape millions of url from yandex and google or bing with: http://scrapeulous.com/googlescraper-market-analysis.html … 0 replies 0 retweets 0 ...',
             visible_link: 'https://twitter.com/incolumitas_?lang=de',
             date: '',
             rank: 5 },
           { link: 'http://blog.shodan.io/hostility-in-the-python-package-index/',
             title: 'Hostility in the Cheese Shop - Shodan Blogblog.shodan.io/hostility-in-the-python-package-index/Im CacheDiese Seite übersetzen',
             snippet: '22.02.2015 - https://zzz.scrapeulous.com/r? According to the author of the website, these hostile packages are used as honeypots. Honeypots are usually ...',
             visible_link: 'blog.shodan.io/hostility-in-the-python-package-index/',
             date: '22.02.2015 - ',
             rank: 6 },
           { link: 'https://libraries.io/github/NikolaiT/GoogleScraper',
             title: 'NikolaiT/GoogleScraper - Libraries.iohttps://libraries.io/github/NikolaiT/GoogleScraperIm CacheDiese Seite übersetzen',
             snippet: 'A Python module to scrape several search engines (like Google, Yandex, Bing, ... https://scrapeulous.com/ ... You can install GoogleScraper comfortably with pip:',
             visible_link: 'https://libraries.io/github/NikolaiT/GoogleScraper',
             date: '',
             rank: 7 },
           { link: 'https://pydigger.com/pypi/CountryGoogleScraper',
             title: 'CountryGoogleScraper - PyDiggerhttps://pydigger.com/pypi/CountryGoogleScraperDiese Seite übersetzen',
             snippet: '19.10.2016 - Look [here to get an idea how to use asynchronous mode](http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html). ### Table ...',
             visible_link: 'https://pydigger.com/pypi/CountryGoogleScraper',
             date: '19.10.2016 - ',
             rank: 8 },
           { link: 'https://hub.docker.com/r/cimenx/data-mining-penandtest/',
             title: 'cimenx/data-mining-penandtest - Docker Hubhttps://hub.docker.com/r/cimenx/data-mining-penandtest/Im CacheDiese Seite übersetzen',
             snippet: 'Container. OverviewTagsDockerfileBuilds · http://scrapeulous.com/googlescraper-260-keywords-in-a-second.html. Docker Pull Command. Owner. profile ...',
             visible_link: 'https://hub.docker.com/r/cimenx/data-mining-penandtest/',
             date: '',
             rank: 9 },
           { link: 'https://www.revolvy.com/page/Search-engine-scraping',
             title: 'Search engine scraping | Revolvyhttps://www.revolvy.com/page/Search-engine-scrapingIm CacheDiese Seite übersetzen',
             snippet: 'Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a ...',
             visible_link: 'https://www.revolvy.com/page/Search-engine-scraping',
             date: '',
             rank: 10 } ] } } }
478 examples/results/proxyresults.json (new file)
@@ -0,0 +1,478 @@
{
  "news": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:26:28 GMT",
      "num_results": "Ungefähr 25.270.000.000 Ergebnisse (0,40 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://news.google.de/",
          "title": "Google Newshttps://news.google.de/Ähnliche Seiten",
          "snippet": "Ausführliche und aktuelle Beiträge - von Google News aus verschiedenen Nachrichtenquellen aus aller Welt zusammengetragen.",
          "visible_link": "https://news.google.de/",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "title": "News aktuell aus Deutschland und der Welt - Bild.dehttps://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "snippet": "Aktuelle News aus Deutschland, Europa und der Welt. Alle Informationen, Bilder und Videos zu Skandalen, Krisen und Sensationen bei BILD.de.",
          "visible_link": "https://www.bild.de/news/startseite/news/news-16804530.bild.html",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://www.rtl.de/cms/news.html",
          "title": "News: Aktuelle Nachrichten, Schlagzeilen und Videos | RTL.dehttps://www.rtl.de/cms/news.html",
          "snippet": "Aktuelle Nachrichten aus Deutschland und der Welt auf einen Blick: Bei RTL.de finden Sie die News von heute, spannende Hintergründe und Videos.",
          "visible_link": "https://www.rtl.de/cms/news.html",
          "date": "",
          "rank": 3
        },
        {
          "link": "https://www.zeit.de/news/index",
          "title": "Schlagzeilen, News und Newsticker | ZEIT ONLINE - Die Zeithttps://www.zeit.de/news/index",
          "snippet": "Aktuelle News und Schlagzeilen im Newsticker von ZEIT ONLINE. Lesen Sie hier die neuesten Nachrichten.",
          "visible_link": "https://www.zeit.de/news/index",
          "date": "",
          "rank": 4
        },
        {
          "link": "http://www.news.de/",
          "title": "news.de - mehr als Nachrichten und News, die Sie bewegenwww.news.de/Ähnliche Seiten",
          "snippet": "Promi News und Aktuelles aus Sport, TV & Web. Jetzt Sportnachrichten von Fußball bis Boxen und das Neueste aus Klatsch und Tratsch per Newsticker, Fotos ...",
          "visible_link": "www.news.de/",
          "date": "",
          "rank": 5
        },
        {
          "link": "https://www.t-online.de/nachrichten/",
          "title": "Politik aktuell: Nachrichten aus Deutschland, Europa und der Welthttps://www.t-online.de/nachrichten/",
          "snippet": "Trump trifft Kim: Der Nordkorea-Gipfel in Vietnam im News-Blog · Krise in Venezuela: Aktuelle Entwicklungen, ... E-Mails und News unterwegs immer dabei.",
          "visible_link": "https://www.t-online.de/nachrichten/",
          "date": "",
          "rank": 6
        },
        {
          "link": "https://www.mopo.de/news",
          "title": "News - Aktuelle Nachrichten aus Deutschland und der Welt. | MOPO.dehttps://www.mopo.de/news",
          "snippet": "News - Aktuelle Nachrichten aus Hamburg, der Welt, zum HSV und der Welt der Promis.",
          "visible_link": "https://www.mopo.de/news",
          "date": "",
          "rank": 7
        },
        {
          "link": "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUmxHZ0pFUlNnQVAB?hl=de&gl=DE&ceid=DE%3Ade",
          "title": "Google News - Schlagzeilen - Neuestehttps://news.google.com/.../CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUm...",
          "snippet": "Mit Google News kannst du zum Thema Schlagzeilen vollständige Artikel lesen, Videos ansehen und in Tausenden von Titeln stöbern.",
          "visible_link": "https://news.google.com/.../CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtUm...",
          "date": "",
          "rank": 8
        },
        {
          "link": "https://www.n-tv.de/",
          "title": "Nachrichten, aktuelle Schlagzeilen und Videos - n-tv.dehttps://www.n-tv.de/",
          "snippet": "Nachrichten seriös, schnell und kompetent. Artikel und Videos aus Politik, Wirtschaft, Börse, Sport und News aus aller Welt.",
          "visible_link": "https://www.n-tv.de/",
          "date": "",
          "rank": 9
        }
      ]
    }
  },
  "i work too much": {
    "1": {
      "time": "Thu, 28 Feb 2019 14:26:30 GMT",
      "num_results": "Ungefähr 4.500.000.000 Ergebnisse (0,33 Sekunden) ",
      "no_results": false,
      "effective_query": "",
      "results": [
        {
          "link": "https://www.themuse.com/advice/3-reasons-you-work-too-muchand-how-to-overcome-each-one",
          "title": "3 Reasons You Work Too Much and How to Stop- The Musehttps://www.themuse.com/.../3-reasons-you-work-too-muchand-h...Im CacheÄhnliche SeitenDiese Seite übersetzen",
          "snippet": "There are three main reasons people work too much. Here's how to fight back against each one and attain better work-life balance.",
          "visible_link": "https://www.themuse.com/.../3-reasons-you-work-too-muchand-h...",
          "date": "",
          "rank": 1
        },
        {
          "link": "https://www.linguee.de/englisch-deutsch/uebersetzung/too+much+work.html",
          "title": "too much work - Deutsch-Übersetzung – Linguee Wörterbuchhttps://www.linguee.de/englisch-deutsch/uebersetzung/too+much+work.htmlIm Cache",
          "snippet": "Viele übersetzte Beispielsätze mit \"too much work\" – Deutsch-Englisch Wörterbuch und Suchmaschine für Millionen von Deutsch-Übersetzungen.",
          "visible_link": "https://www.linguee.de/englisch-deutsch/uebersetzung/too+much+work.html",
          "date": "",
          "rank": 2
        },
        {
          "link": "https://www.bustle.com/p/am-i-working-too-much-7-signs-its-time-to-slow-down-76583",
          "title": "Am I Working Too Much? 7 Signs It's Time To Slow Down - Bustlehttps://www.bustle.com/.../am-i-working-too-much-7-signs-its-ti...Im CacheDiese Seite übersetzen",
          "snippet": "28.08.2017 - Our society prides hard work so much, it can seem like there's no such thing as working too much. But there absolutely is. An overly demanding ...",
          "visible_link": "https://www.bustle.com/.../am-i-working-too-much-7-signs-its-ti...",
          "date": "28.08.2017 - ",
          "rank": 3
        },
        {
          "link": "https://www.lifehack.org/articles/lifestyle/ask-the-entrepreneurs-15-signs-youre-working-too-much-and-burning-out.html",
          "title": "15 Signs You're Working Too Much and Burning Out - Lifehackhttps://www.lifehack.org/.../ask-the-entrepreneurs-15-signs-youre...Im CacheDiese Seite übersetzen",
          "snippet": "Use that warning to evaluate if you're working too many hours or on tasks that can be easily outsourced, so you can fully enjoy every client conversation and ...",
          "visible_link": "https://www.lifehack.org/.../ask-the-entrepreneurs-15-signs-youre...",
|
||||
"date": "",
|
||||
"rank": 4
|
||||
},
|
||||
{
|
||||
"link": "https://www.huffingtonpost.com/chevonne-harris/24-things-only-people-who-work-entirely-too-much-will-understand_b_5510723.html",
|
||||
"title": "24 Things Only People Who Work Entirely Too Much Will Understand ...https://www.huffingtonpost.com/.../24-things-only-people-who-...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "20.06.2014 - To all the people who are on a first-name basis with the office cleaning crew, are unfazed by empty parking lots on dark nights and can't go ...",
|
||||
"visible_link": "https://www.huffingtonpost.com/.../24-things-only-people-who-...",
|
||||
"date": "20.06.2014 - ",
|
||||
"rank": 5
|
||||
},
|
||||
{
|
||||
"link": "https://www.quora.com/How-much-work-is-too-much-work",
|
||||
"title": "How much work is too much work? - Quorahttps://www.quora.com/How-much-work-is-too-much-workÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "19.09.2015 - I am a Workaholic and worst part of this is - i know that i am a workaholic. Dd it ever happened to you that after all those years of no great skills you suddenly get ...",
|
||||
"visible_link": "https://www.quora.com/How-much-work-is-too-much-work",
|
||||
"date": "19.09.2015 - ",
|
||||
"rank": 6
|
||||
},
|
||||
{
|
||||
"link": "https://www.theodysseyonline.com/16-signs-you-work-too-much",
|
||||
"title": "16 Signs You Work Too Much - Odysseyhttps://www.theodysseyonline.com/16-signs-you-work-too-much",
|
||||
"snippet": "You try to get coverage but because you're one of the few people at work who works too much, no one really wants to come in any more than their normal 8-15 ...",
|
||||
"visible_link": "https://www.theodysseyonline.com/16-signs-you-work-too-much",
|
||||
"date": "",
|
||||
"rank": 7
|
||||
},
|
||||
{
|
||||
"link": "https://www.thealternativedaily.com/how-too-much-work-ruins-health/",
|
||||
"title": "How Much Work Is Too Much For Your Mental And Physical Health?https://www.thealternativedaily.com/how-too-much-work-ruins-h...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Full time workers in the U.S. will typically clock up 47 hours per week of work — and that only includes paid work. Meanwhile, Aussies at the Australian National ...",
|
||||
"visible_link": "https://www.thealternativedaily.com/how-too-much-work-ruins-h...",
|
||||
"date": "",
|
||||
"rank": 8
|
||||
},
|
||||
{
|
||||
"link": "https://medium.com/s/story/when-you-enjoy-work-too-much-3d5083a0da5a",
|
||||
"title": "Can You Enjoy Work Too Much? – Member Feature Stories – Mediumhttps://medium.com/.../when-you-enjoy-work-too-much-3d5083...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "10.09.2018 - Experiencing fun and relaxation at work doesn't make you a workaholic. We all find joy and meaning in different aspects of life.",
|
||||
"visible_link": "https://medium.com/.../when-you-enjoy-work-too-much-3d5083...",
|
||||
"date": "10.09.2018 - ",
|
||||
"rank": 9
|
||||
},
|
||||
{
|
||||
"link": "https://www.healthline.com/health/working-too-much-health-effects",
|
||||
"title": "7 Health Effects of Working Too Much - Healthlinehttps://www.healthline.com/.../working-too-much-health-effectsIm CacheDiese Seite übersetzen",
|
||||
"snippet": "03.05.2017 - From increased risk of heart disease to poor sleep, working too much can take a toll on your health. Here are some of the side effects, along ...",
|
||||
"visible_link": "https://www.healthline.com/.../working-too-much-health-effects",
|
||||
"date": "03.05.2017 - ",
|
||||
"rank": 10
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"scrapeulous.com": {
|
||||
"1": {
|
||||
"time": "Thu, 28 Feb 2019 14:26:29 GMT",
|
||||
"num_results": "Ungefähr 200 Ergebnisse (0,31 Sekunden) ",
|
||||
"no_results": false,
|
||||
"effective_query": "",
|
||||
"results": [
|
||||
{
|
||||
"link": "https://scrapeulous.com/",
|
||||
"title": "Scrapeuloushttps://scrapeulous.com/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Scraping search engines like Google, Bing and Duckduckgo in large quantities from many geographical regions with real browsers.",
|
||||
"visible_link": "https://scrapeulous.com/",
|
||||
"date": "",
|
||||
"rank": 1
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/about/",
|
||||
"title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/about/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Scrapeulous.com allows you to scrape various search engines automatically and in large quantities. The business requirement to scrape information from ...",
|
||||
"visible_link": "https://scrapeulous.com/about/",
|
||||
"date": "",
|
||||
"rank": 2
|
||||
},
|
||||
{
|
||||
"link": "https://blog.scrapeulous.com/",
|
||||
"title": "Scrapeulous.com Bloghttps://blog.scrapeulous.com/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "04.02.2019 - This clean blog serves to publish the latest announcements and changes for scrapeulous.com We will publish instrucitons and general tutorials ...",
|
||||
"visible_link": "https://blog.scrapeulous.com/",
|
||||
"date": "04.02.2019 - ",
|
||||
"rank": 3
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/news/",
|
||||
"title": "Scraping search engines with real browsers in large ... - Scrapeuloushttps://scrapeulous.com/news/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Scrapeulous.com News Api allows you to query the most recent world news for an index composed of developed market equities. The performance of those ...",
|
||||
"visible_link": "https://scrapeulous.com/news/",
|
||||
"date": "",
|
||||
"rank": 4
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/faq/",
|
||||
"title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/faq/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "02.02.2019 - Scraping search engines like Google, Bing and Duckduckgo in large quantities from many geographical regions with real browsers.",
|
||||
"visible_link": "https://scrapeulous.com/faq/",
|
||||
"date": "02.02.2019 - ",
|
||||
"rank": 5
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/howto/",
|
||||
"title": "Scraping search engines with real browsers in ... - Scrapeulous.comhttps://scrapeulous.com/howto/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "06.02.2019 - We offer scraping large amounts of keywords for the Google Search Engine. Large means any number of keywords between 30 and 50000.",
|
||||
"visible_link": "https://scrapeulous.com/howto/",
|
||||
"date": "06.02.2019 - ",
|
||||
"rank": 6
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/contact/",
|
||||
"title": "Contact - Scrapeuloushttps://scrapeulous.com/contact/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "Contact scrapeulous.com. Your email address. Valid email address where we are going to contact you. We will not send spam mail. Your inquiry.",
|
||||
"visible_link": "https://scrapeulous.com/contact/",
|
||||
"date": "",
|
||||
"rank": 7
|
||||
},
|
||||
{
|
||||
"link": "https://scrapeulous.com/scrape/",
|
||||
"title": "Scraping search engines with real browsers in large ... - Scrapeuloushttps://scrapeulous.com/scrape/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "It is super easy to use scrapeulous.com, because you can just upload a text/CSV file with your keywords and submit your email address. With this information ...",
|
||||
"visible_link": "https://scrapeulous.com/scrape/",
|
||||
"date": "",
|
||||
"rank": 8
|
||||
},
|
||||
{
|
||||
"link": "https://incolumitas.com/",
|
||||
"title": "Coding, Learning and Business Ideashttps://incolumitas.com/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "About · Contact · GoogleScraper · Lichess Autoplay-Bot · Projects · Scrapeulous.com · Site Notice · SVGCaptcha · Home Archives Categories Tags Atom ...",
|
||||
"visible_link": "https://incolumitas.com/",
|
||||
"date": "",
|
||||
"rank": 9
|
||||
},
|
||||
{
|
||||
"link": "https://twitter.com/scrapeulous",
|
||||
"title": "Scrapeulous.com (@scrapeulous) | Twitterhttps://twitter.com/scrapeulousIm CacheDiese Seite übersetzen",
|
||||
"snippet": "The latest Tweets from Scrapeulous.com (@scrapeulous): \"Creating software to realize the best scraping service at https://t.co/R5NUqSSrB5\"",
|
||||
"visible_link": "https://twitter.com/scrapeulous",
|
||||
"date": "",
|
||||
"rank": 10
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"what to do?": {
|
||||
"1": {
|
||||
"time": "Thu, 28 Feb 2019 14:26:31 GMT",
|
||||
"num_results": "Ungefähr 25.270.000.000 Ergebnisse (0,67 Sekunden) ",
|
||||
"no_results": false,
|
||||
"effective_query": "",
|
||||
"results": [
|
||||
{
|
||||
"link": "https://www.mydomaine.com/things-to-do-when-bored",
|
||||
"title": "96 Things to Do When You're Bored | MyDomainehttps://www.mydomaine.com/things-to-do-when-boredIm CacheDiese Seite übersetzen",
|
||||
"snippet": "29.12.2018 - This book changed my life in many ways, but one of my key takeaways has to do with boredom. I am never bored. In fact, the word bored ...",
|
||||
"visible_link": "https://www.mydomaine.com/things-to-do-when-bored",
|
||||
"date": "29.12.2018 - ",
|
||||
"rank": 1
|
||||
},
|
||||
{
|
||||
"link": "https://www.thecrazytourist.com/25-best-things-frankfurt-germany/",
|
||||
"title": "25 Best Things to Do in Frankfurt (Germany) - The Crazy Touristhttps://www.thecrazytourist.com › Travel Guides › GermanyIm CacheDiese Seite übersetzen",
|
||||
"snippet": "Germany's big financial centre is a city of many sides. The central business district, Bankenviertel, captures your attention right away and has all ten of the tallest ...",
|
||||
"visible_link": "https://www.thecrazytourist.com › Travel Guides › Germany",
|
||||
"date": "",
|
||||
"rank": 2
|
||||
},
|
||||
{
|
||||
"link": "https://www.likealocalguide.com/frankfurt/things-to-do",
|
||||
"title": "Top 28 Things To Do in Frankfurt 2019 - Best Activities in Frankfurthttps://www.likealocalguide.com/frankfurt/things-to-doIm CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "Frankfurt city guide featuring 28 best local sights, things to do & tours recommended by Frankfurt locals. Skip the tourist traps & explore Frankfurt like a local.",
|
||||
"visible_link": "https://www.likealocalguide.com/frankfurt/things-to-do",
|
||||
"date": "",
|
||||
"rank": 3
|
||||
},
|
||||
{
|
||||
"link": "https://www.tripadvisor.com/Attractions-g187337-Activities-Frankfurt_Hesse.html",
|
||||
"title": "THE 15 BEST Things to Do in Frankfurt - 2019 (with Photos ...https://www.tripadvisor.com/Attractions-g187337-Activities-Fran...Im CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "Book your tickets online for the top things to do in Frankfurt, Germany on TripAdvisor: See 47213 traveler reviews and photos of Frankfurt tourist attractions.",
|
||||
"visible_link": "https://www.tripadvisor.com/Attractions-g187337-Activities-Fran...",
|
||||
"date": "",
|
||||
"rank": 4
|
||||
},
|
||||
{
|
||||
"link": "https://www.timeout.com/frankfurt/things-to-do/best-things-to-do-in-frankfurt",
|
||||
"title": "10 Best Things to do in Frankfurt for Locals and Tourists - Time Outhttps://www.timeout.com/.../things-to-do/best-things-to-do-in-fra...Im CacheDiese Seite übersetzen",
|
||||
"snippet": "09.07.2018 - Looking for the best things to do in Frankfurt? Check out our guide to local-approved restaurants, tours and more can't-miss activities in the ...",
|
||||
"visible_link": "https://www.timeout.com/.../things-to-do/best-things-to-do-in-fra...",
|
||||
"date": "09.07.2018 - ",
|
||||
"rank": 5
|
||||
},
|
||||
{
|
||||
"link": "https://www.lonelyplanet.com/germany/frankfurt-am-main/top-things-to-do/a/poi/1003203",
|
||||
"title": "Top things to do in Frankfurt am Main, Germany - Lonely Planethttps://www.lonelyplanet.com/germany/...things-to-do/.../100320...Im CacheÄhnliche SeitenDiese Seite übersetzen",
|
||||
"snippet": "Discover the best top things to do in Frankfurt am Main including Städel Museum, Kaiserdom, Senckenberg Museum.",
|
||||
"visible_link": "https://www.lonelyplanet.com/germany/...things-to-do/.../100320...",
|
||||
"date": "",
|
||||
"rank": 6
|
||||
},
|
||||
{
|
||||
"link": "https://www.atlasobscura.com/things-to-do/frankfurt-germany",
|
||||
"title": "9 Cool and Unusual Things to Do in Frankfurt - Atlas Obscurahttps://www.atlasobscura.com/things-to-do/frankfurt-germanyIm CacheDiese Seite übersetzen",
|
||||
"snippet": "Discover 9 hidden attractions, cool sights, and unusual things to do in Frankfurt, Germany from Pinkelbaum (Peeing Tree) to Henninger Turm.",
|
||||
"visible_link": "https://www.atlasobscura.com/things-to-do/frankfurt-germany",
|
||||
"date": "",
|
||||
"rank": 7
|
||||
},
|
||||
{
|
||||
"link": "https://lifehacks.io/what-to-do-when-your-bored/",
|
||||
"title": "23 [REALLY] Fun Things To Do When You Are Bored - Life Hackshttps://lifehacks.io/what-to-do-when-your-bored/Im CacheDiese Seite übersetzen",
|
||||
"snippet": "What to Do When You're Bored? Boredom could be real torture while there are people who yearn to feel that way because of their hectic schedules. It is kind of ...",
|
||||
"visible_link": "https://lifehacks.io/what-to-do-when-your-bored/",
|
||||
"date": "",
|
||||
"rank": 8
|
||||
},
|
||||
{
|
||||
"link": "https://www.planetware.com/tourist-attractions-/frankfurt-d-hs-fra.htm",
|
||||
"title": "12 Top-Rated Tourist Attractions in Frankfurt | PlanetWarehttps://www.planetware.com/tourist...-/frankfurt-d-hs-fra.htmIm CacheDiese Seite übersetzen",
|
||||
"snippet": "Considered a global city - it frequently ranks in the top ten best cities to live and do business - Frankfurt has also long been an important center for cultural and ...",
|
||||
"visible_link": "https://www.planetware.com/tourist...-/frankfurt-d-hs-fra.htm",
|
||||
"date": "",
|
||||
"rank": 9
|
||||
},
|
||||
{
|
||||
"link": "https://theculturetrip.com/europe/germany/articles/7-cool-and-unusual-things-to-do-in-frankfurt/",
|
||||
"title": "7 Cool and Unusual Things to Do in Frankfurt - Culture Triphttps://theculturetrip.com › Europe › GermanyIm CacheDiese Seite übersetzen",
|
||||
"snippet": "27.06.2018 - Frankfurt is the busiest airport in Germany, though unfortunately, not everyone realises it's worth stopping to spend time in the city. They don't ...",
|
||||
"visible_link": "https://theculturetrip.com › Europe › Germany",
|
||||
"date": "27.06.2018 - ",
|
||||
"rank": 10
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"incolumitas.com": {
|
||||
"1": {
|
||||
"time": "Thu, 28 Feb 2019 14:26:29 GMT",
|
||||
"num_results": "Ongeveer 86.900 resultaten (0,20 seconden) ",
|
||||
"no_results": false,
|
||||
"effective_query": "",
|
||||
"results": [
|
||||
{
|
||||
"link": "https://incolumitas.com/",
|
||||
"title": "incolumitas.comhttps://incolumitas.com/In cacheVertaal deze paginaContactArchivesIntroductionScrapeulous.comProjectsTagsMachine LearningSite NoticeGoogleScraperScraping",
|
||||
"snippet": "Nikolai Tschacher's ideas and projects around IT security and computer science.",
|
||||
"visible_link": "https://incolumitas.com/",
|
||||
"date": "",
|
||||
"rank": 1
|
||||
},
|
||||
{
|
||||
"link": "https://en.wiktionary.org/wiki/incolumitas",
|
||||
"title": "incolumitas - Wiktionaryhttps://en.wiktionary.org/wiki/incolumitasIn cacheVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "incolumitās f (genitive incolumitātis); third declension ... incolumitas in Charlton T. Lewis and Charles Short (1879) A Latin Dictionary , Oxford: Clarendon Press ...",
|
||||
"visible_link": "https://en.wiktionary.org/wiki/incolumitas",
|
||||
"date": "",
|
||||
"rank": 2
|
||||
},
|
||||
{
|
||||
"link": "https://www.linkedin.com/company/incolumitas",
|
||||
"title": "INCOLUMITAS | LinkedInhttps://www.linkedin.com/company/incolumitasVertaal deze pagina",
|
||||
"snippet": "Learn about working at INCOLUMITAS. Join LinkedIn today for free. See who you know at INCOLUMITAS, leverage your professional network, and get hired.",
|
||||
"visible_link": "https://www.linkedin.com/company/incolumitas",
|
||||
"date": "",
|
||||
"rank": 3
|
||||
},
|
||||
{
|
||||
"link": "https://books.google.nl/books?id=-5eGVvVSnGsC&pg=PA150&lpg=PA150&dq=incolumitas.com&source=bl&ots=u2qhYFQ3YS&sig=ACfU3U37dliQuuF7H_qBz0-9F0rBwW6OeQ&hl=nl&sa=X&ved=2ahUKEwiDlKmb0d7gAhUJ4KYKHfdNAe8Q6AEwEnoECAAQAQ",
|
||||
"title": "The Making of the Monastic Community of Fulda, C.744-c.900https://books.google.nl/books?isbn=1107002818Vertaal deze pagina",
|
||||
"snippet": "... too much ofa tangent to repeat their arguments here. to a large extent the argumentation hinges on one, single word: incolumitas, 'well-being', which can refer ...",
|
||||
"visible_link": "https://books.google.nl/books?isbn=1107002818",
|
||||
"date": "",
|
||||
"rank": 4
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"javascript is hard": {
|
||||
"1": {
|
||||
"time": "Thu, 28 Feb 2019 14:26:31 GMT",
|
||||
"num_results": "Ongeveer 1.550.000.000 resultaten (0,21 seconden) ",
|
||||
"no_results": false,
|
||||
"effective_query": "",
|
||||
"results": [
|
||||
{
|
||||
"link": "http://blog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"title": "Why is JavaScript So Hard To Learn? - Firehose Projectblog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"snippet": "",
|
||||
"visible_link": "blog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"date": "",
|
||||
"rank": 1
|
||||
},
|
||||
{
|
||||
"link": "http://blog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"title": "Why is JavaScript So Hard To Learn? - Firehose Projectblog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"snippet": "",
|
||||
"visible_link": "blog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"date": "",
|
||||
"rank": 2
|
||||
},
|
||||
{
|
||||
"link": "https://skillcrush.com/2018/06/27/how-hard-is-it-to-learn-javascript/",
|
||||
"title": "How Hard Is it to Learn JavaScript? The Pros Weigh In - Skillcrushhttps://skillcrush.com/2018/06/.../how-hard-is-it-to-learn-javascript...In cacheVertaal deze pagina",
|
||||
"snippet": "27 jun. 2018 - Are you thinking about learning JavaScript but concerned about how hard of a task that might be? Allow these developers with JavaScript ...",
|
||||
"visible_link": "https://skillcrush.com/2018/06/.../how-hard-is-it-to-learn-javascript...",
|
||||
"date": "27 jun. 2018 - ",
|
||||
"rank": 3
|
||||
},
|
||||
{
|
||||
"link": "http://blog.thefirehoseproject.com/posts/why-is-javascript-so-hard-to-learn/",
|
||||
"title": "Why is JavaScript So Hard To Learn? - Firehose Projectblog.thefirehoseproject.com/.../why-is-javascript-so-hard-to-learn/In cacheVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "29 aug. 2016 - JavaScript is so hard to learn because it's an asynchronous programming language. It's also single-threaded, which means it uses its asynchronous nature in a radically different way than most other programming languages.",
|
||||
"visible_link": "blog.thefirehoseproject.com/.../why-is-javascript-so-hard-to-learn/",
|
||||
"date": "29 aug. 2016 - ",
|
||||
"rank": 4
|
||||
},
|
||||
{
|
||||
"link": "https://www.quora.com/Is-JavaScript-hard-to-learn",
|
||||
"title": "Is JavaScript hard to learn? - Quorahttps://www.quora.com/Is-JavaScript-hard-to-learnVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "16 dec. 2015 - Suffice it to say that JavaScript, good JavaScript, is hard because there are many considerations outside of knowing how to code in it. Making sure that you ...",
|
||||
"visible_link": "https://www.quora.com/Is-JavaScript-hard-to-learn",
|
||||
"date": "16 dec. 2015 - ",
|
||||
"rank": 5
|
||||
},
|
||||
{
|
||||
"link": "https://www.thoughtco.com/how-hard-is-javascript-to-learn-2037676",
|
||||
"title": "How Hard Is JavaScript to Learn? HTML Comparison - ThoughtCohttps://www.thoughtco.com › ... › JavaScript ProgrammingIn cacheVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "28 jan. 2019 - In many ways, JavaScript is one of the easiest programming language to learn as your first language. The way that it functions as an interpreted language within the web browser means that you can easily write even the most complex code by writing it a small piece at a time and testing it in the web browser as you go.",
|
||||
"visible_link": "https://www.thoughtco.com › ... › JavaScript Programming",
|
||||
"date": "28 jan. 2019 - ",
|
||||
"rank": 6
|
||||
},
|
||||
{
|
||||
"link": "https://www.reddit.com/r/webdev/comments/80zcx1/javascript_is_hard/",
|
||||
"title": "Javascript IS hard. : webdev - Reddithttps://www.reddit.com/r/webdev/comments/.../javascript_is_hard/In cacheVertaal deze pagina",
|
||||
"snippet": "28 feb. 2018 - I'm sure some of you may have seen the disaster of the thread stating that Javascript isn't hard. I'm here to tell you the opposite. Javascript is...",
|
||||
"visible_link": "https://www.reddit.com/r/webdev/comments/.../javascript_is_hard/",
|
||||
"date": "28 feb. 2018 - ",
|
||||
"rank": 7
|
||||
},
|
||||
{
|
||||
"link": "https://develoger.com/why-is-javascript-so-hard-bd3648db51a5",
|
||||
"title": "Why is JavaScript so hard? – Develogerhttps://develoger.com/why-is-javascript-so-hard-bd3648db51a5In cacheVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "3 okt. 2016 - If you feel comfortable working with html but find it hard to experience ... Looking at the JavaScript and programming trough the eyes of CSS.",
|
||||
"visible_link": "https://develoger.com/why-is-javascript-so-hard-bd3648db51a5",
|
||||
"date": "3 okt. 2016 - ",
|
||||
"rank": 8
|
||||
},
|
||||
{
|
||||
"link": "https://teamtreehouse.com/community/is-learning-javascript-supposed-to-be-this-difficult-or-am-i-not-cut-out-for-this",
|
||||
"title": "Is learning JavaScript supposed to be this difficult or am I not cut out ...https://teamtreehouse.com/.../is-learning-javascript-supposed-to-be...In cacheVergelijkbaarVertaal deze pagina",
|
||||
"snippet": "3 dec. 2015 - I haven't been able to complete any of Dave McFarland's \"programming challenges\" like building quizzes etc. I have to just watch his solution ...",
|
||||
"visible_link": "https://teamtreehouse.com/.../is-learning-javascript-supposed-to-be...",
|
||||
"date": "3 dec. 2015 - ",
|
||||
"rank": 9
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
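The dump above follows a fixed shape: keyword → page number → page metadata plus a `results` array of ranked entries. A minimal sketch of consuming that structure (the `flattenResults` helper is made up for illustration and is not part of the se-scraper API):

```javascript
// Flatten se-scraper style output (keyword -> page -> { results: [...] })
// into one row per organic result, keeping keyword, page, rank and link.
function flattenResults(results) {
    const rows = [];
    for (const [keyword, pages] of Object.entries(results)) {
        for (const [page, data] of Object.entries(pages)) {
            for (const result of data.results) {
                rows.push({ keyword, page: Number(page), rank: result.rank, link: result.link });
            }
        }
    }
    return rows;
}

// Tiny sample mirroring the structure of the dump above.
const sample = {
    'scrapeulous.com': {
        '1': {
            no_results: false,
            results: [
                { link: 'https://scrapeulous.com/', rank: 1 },
                { link: 'https://scrapeulous.com/about/', rank: 2 },
            ],
        },
    },
};

const rows = flattenResults(sample);
```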
@@ -54,6 +54,7 @@ function read_items_from_file(fname) {
 
     const cluster = await Cluster.launch({
         monitor: true,
         timeout: 12 * 60 * 60 * 1000, // 12 hours in ms
         concurrency: Cluster.CONCURRENCY_BROWSER,
         maxConcurrency: perBrowserOptions.length,
         puppeteerOptions: {
Binary file not shown. (Before: 43 KiB)

34	index.js
@@ -1,15 +1,16 @@
 const { Cluster } = require('./src/puppeteer-cluster/dist/index.js');
 const handler = require('./src/node_scraper.js');
 var fs = require('fs');
 var os = require("os");
 
-exports.scrape = async function(config, callback) {
+exports.scrape = async function(user_config, callback) {
 
     // options for scraping
-    event = {
+    let config = {
         // the user agent to scrape with
         user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
         // if random_user_agent is set to True, a random user agent is chosen
-        random_user_agent: true,
+        random_user_agent: false,
         // whether to select manual settings in visible mode
         set_manual_settings: false,
         // log ip address data
@@ -18,7 +19,7 @@ exports.scrape = async function(config, callback) {
         log_http_headers: false,
         // how long to sleep between requests. a random sleep interval within the range [a,b]
         // is drawn before every request. empty string for no sleeping.
-        sleep_range: '[1,1]',
+        sleep_range: '',
         // which search engine to scrape
         search_engine: 'google',
         compress: false, // compress
@@ -48,22 +49,27 @@ exports.scrape = async function(config, callback) {
         // this is a quick test and should be used for debugging
         test_evasion: false,
         // settings for puppeteer-cluster
-        monitor: false,
+        puppeteer_cluster_config: {
+            timeout: 30 * 60 * 1000, // max timeout set to 30 minutes
+            monitor: false,
+            concurrency: Cluster.CONCURRENCY_BROWSER,
+            maxConcurrency: 2,
+        }
     };
 
     // overwrite default config
-    for (var key in config) {
-        event[key] = config[key];
+    for (var key in user_config) {
+        config[key] = user_config[key];
     }
 
-    if (fs.existsSync(event.keyword_file)) {
-        event.keywords = read_keywords_from_file(event.keyword_file);
+    if (fs.existsSync(config.keyword_file)) {
+        config.keywords = read_keywords_from_file(config.keyword_file);
     }
 
-    if (fs.existsSync(event.proxy_file)) {
-        event.proxies = read_keywords_from_file(event.proxy_file);
-        if (event.verbose) {
-            console.log(`${event.proxies.length} proxies loaded.`);
+    if (fs.existsSync(config.proxy_file)) {
+        config.proxies = read_keywords_from_file(config.proxy_file);
+        if (config.verbose) {
+            console.log(`${config.proxies.length} proxies loaded.`);
         }
     }
 
@@ -78,7 +84,7 @@ exports.scrape = async function(config, callback) {
     }
     }
 
-    await handler.handler(event, undefined, callback );
+    await handler.handler(config, undefined, callback );
 };
 
 function read_keywords_from_file(fname) {
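The index.js change above renames `event` to `config` and keeps the same default-merge pattern: user keys shallowly overwrite the built-in defaults. A standalone sketch of that pattern (the `mergeConfig` helper and the trimmed-down defaults are illustrative, not the module's real export):

```javascript
// Sketch of the default-config merge used in exports.scrape above:
// every key the user supplies replaces the corresponding default.
const defaults = {
    search_engine: 'google',
    random_user_agent: false,
    puppeteer_cluster_config: { timeout: 30 * 60 * 1000, maxConcurrency: 2 },
};

function mergeConfig(defaults, userConfig) {
    const config = Object.assign({}, defaults);
    for (const key in userConfig) {
        // NOTE: this is a shallow merge - nested objects such as
        // puppeteer_cluster_config are replaced wholesale, not merged key by key.
        config[key] = userConfig[key];
    }
    return config;
}

const merged = mergeConfig(defaults, { search_engine: 'bing' });
```

Because the merge is shallow, passing a partial `puppeteer_cluster_config` drops the default keys you did not repeat.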
@@ -1,7 +1,7 @@
 {
   "name": "se-scraper",
-  "version": "1.2.0",
-  "description": "A simple library using puppeteer to scrape several search engines such as Google, Duckduckgo and Bing.",
+  "version": "1.2.1",
+  "description": "A simple module using puppeteer to scrape several search engines such as Google, Duckduckgo and Bing.",
   "homepage": "https://scrapeulous.com/",
   "main": "index.js",
   "scripts": {
18	run.js
@@ -17,15 +17,15 @@ let config = {
     // this output is informational
     verbose: true,
     // an array of keywords to scrape
-    keywords: ['news', 'abc', 'good', 'bad', 'better', 'one more', 'time', 'we are going'],
+    keywords: ['scrapeulous.com', 'scraping search engines', 'scraping service scrapeulous', 'learn js'],
     // alternatively you can specify a keyword_file. this overwrites the keywords array
     keyword_file: '',
     // the number of pages to scrape for each keyword
-    num_pages: 1,
+    num_pages: 2,
     // whether to start the browser in headless mode
-    headless: false,
+    headless: true,
     // path to output file, data will be stored in JSON
-    output_file: 'data.json',
+    output_file: 'examples/results/advanced.json',
     // whether to prevent images, css, fonts from being loaded
     // will speed up scraping a great deal
     block_assets: true,
@@ -42,14 +42,20 @@ let config = {
     // a file with one proxy per line. Example:
     // socks5://78.94.172.42:1080
     // http://118.174.233.10:48400
-    proxy_file: '/home/nikolai/.proxies',
+    proxy_file: '',
     // check if headless chrome escapes common detection techniques
     // this is a quick test and should be used for debugging
     test_evasion: false,
     // log ip address data
-    log_ip_address: true,
+    log_ip_address: false,
     // log http headers
     log_http_headers: false,
+    puppeteer_cluster_config: {
+        timeout: 10 * 60 * 1000, // max timeout set to 10 minutes
+        monitor: false,
+        concurrency: 1, // one scraper per tab
+        maxConcurrency: 2, // scrape with 2 tabs
+    }
 };
 
 function callback(err, response) {
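The `callback(err, response)` signature in run.js follows the usual Node error-first convention. A hedged sketch of such a callback (the exact `response` shape se-scraper passes is not shown in this diff, so the `results` key here is an assumption for illustration):

```javascript
// Hypothetical callback for se_scraper.scrape(config, callback): on error,
// report it; on success, list the keywords that were scraped.
function handleResponse(err, response) {
    if (err) {
        return { ok: false, error: String(err) };
    }
    // Assumed shape: response.results maps each keyword to its scraped pages.
    return { ok: true, keywords: Object.keys(response.results) };
}
```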
@@ -75,11 +75,11 @@ class BingScraper extends Scraper {
 
     async wait_for_results() {
         await this.page.waitForSelector('#b_content', { timeout: 5000 });
-        await this.sleep(500);
+        await this.sleep(750);
     }
 
     async detected() {
-        // TODO: I was actually never detected by bing. those are good guys.
+        // TODO: I was actually never detected by bing. those are good boys.
     }
 }
@ -21,9 +21,10 @@ module.exports = class Scraper {
|
||||
this.config = config;
|
||||
this.context = context;
|
||||
|
||||
this.proxy = config.proxy;
|
||||
this.keywords = config.keywords;
|
||||
|
||||
this.STANDARD_TIMEOUT = 8000;
|
||||
this.STANDARD_TIMEOUT = 10000;
|
||||
// longer timeout when using proxies
|
||||
this.PROXY_TIMEOUT = 15000;
|
||||
this.SOLVE_CAPTCHA_TIME = 45000;
|
||||
@ -37,7 +38,6 @@ module.exports = class Scraper {
|
||||
}
|
||||
|
||||
async run({page, data}) {
|
||||
|
||||
this.page = page;
|
||||
|
||||
let do_continue = await this.load_search_engine();
|
||||
@@ -93,23 +93,23 @@ module.exports = class Scraper {
         }

         if (this.config.log_ip_address === true) {
-            this.metadata.ipinfo = await meta.get_ip_data(this.page);
-            console.log(this.metadata.ipinfo);
+            let ipinfo = await meta.get_ip_data(this.page);
+            this.metadata.ipinfo = ipinfo;
+            console.log(ipinfo);
         }

         // check that our proxy is working by confirming
         // that ipinfo.io sees the proxy IP address
-        if (this.config.proxy && this.config.log_ip_address === true) {
-            console.log(`${this.metadata.ipinfo} vs ${this.config.proxy}`);
+        if (this.proxy && this.config.log_ip_address === true) {
+            console.log(`${this.metadata.ipinfo.ip} vs ${this.proxy}`);

             try {
                 // if the ip returned by ipinfo is not a substring of our proxystring, get the heck outta here
-                if (!this.config.proxy.includes(this.metadata.ipinfo.ip)) {
+                if (!this.proxy.includes(this.metadata.ipinfo.ip)) {
                     console.error('Proxy not working properly.');
                     return false;
                 }
             } catch (exception) {

             }
         }
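The proxy check above boils down to a substring test: the IP address that ipinfo.io reports must appear inside the configured proxy string, otherwise traffic is not actually flowing through the proxy. A standalone sketch of that test (the helper name and proxy values are illustrative, not part of se-scraper):

```javascript
// Hypothetical standalone version of the proxy sanity check above:
// the externally reported IP must be a substring of the proxy string,
// e.g. "http://93.184.216.34:3128".
function proxyIsWorking(proxy, reportedIp) {
    if (!proxy || !reportedIp) return false;
    return proxy.includes(reportedIp);
}

console.log(proxyIsWorking('http://93.184.216.34:3128', '93.184.216.34')); // true
console.log(proxyIsWorking('http://93.184.216.34:3128', '203.0.113.7'));   // false
```

Note this only proves the exit IP differs from the home IP; it cannot distinguish a transparent proxy from an anonymous one.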
@@ -221,7 +221,7 @@ module.exports = class Scraper {
     async random_sleep() {
         const [min, max] = this.config.sleep_range;
         let rand = Math.floor(Math.random() * (max - min + 1) + min); //Generate Random number
-        if (this.config.debug === true) {
+        if (this.config.verbose === true) {
            console.log(`Sleeping for ${rand}s`);
        }
        await this.sleep(rand * 1000);
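The `random_sleep` above draws an integer uniformly from `sleep_range` (in seconds) and then sleeps that long. A self-contained sketch of the same formula (helper names are illustrative):

```javascript
// Uniform integer in [min, max], same formula as random_sleep above.
function randomSeconds([min, max]) {
    return Math.floor(Math.random() * (max - min + 1) + min);
}

// Promise-based sleep helper built on setTimeout.
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function randomSleep(sleepRange) {
    const secs = randomSeconds(sleepRange);
    await sleep(secs * 1000);
    return secs;
}
```

The `max - min + 1` term makes both endpoints reachable; `Math.random() * (max - min)` alone would never produce `max`.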
@@ -63,6 +63,8 @@ module.exports.handler = async function handler (event, context, callback) {
         console.log(config);
     }

+    console.log(`[se-scraper] started at [${(new Date()).toUTCString()}] and scrapes ${config.search_engine} with ${config.keywords.length} keywords on ${config.num_pages} pages each.`);
+
     var ADDITIONAL_CHROME_FLAGS = [
         '--disable-infobars',
         '--window-position=0,0',
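The startup banner added above interpolates the UTC time and a few config fields into one template literal. A sketch with a stand-in config object (field names mirror the handler's config, values are made up):

```javascript
// Stand-in config; in se-scraper these fields come from the user's scrape config.
const config = { search_engine: 'bing', keywords: ['a', 'b', 'c'], num_pages: 2 };

const banner = `[se-scraper] started at [${(new Date()).toUTCString()}] and scrapes ` +
    `${config.search_engine} with ${config.keywords.length} keywords on ${config.num_pages} pages each.`;

console.log(banner);
```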
@@ -108,100 +110,126 @@ module.exports.handler = async function handler (event, context, callback) {
         ignoreHTTPSErrors: true,
     };

     if (config.debug === true) {
         console.log("Chrome Args: ", launch_args);
     }

+    var results = {};
+    var num_requests = 0;
+    var metadata = {};
+
     if (pluggable.start_browser) {
         launch_args.config = config;
-        browser = await pluggable.start_browser(launch_args);
+        let browser = await pluggable.start_browser(launch_args);
         const page = await browser.newPage();
         let obj = getScraper(config.search_engine, {
             config: config,
             context: context,
             pluggable: pluggable,
         });
         results = obj.run(page);
         num_requests = obj.num_requests;

         if (pluggable.close_browser) {
             await pluggable.close_browser();
         } else {
             await browser.close();
         }
     } else {
         // if no custom start_browser functionality was given
         // use puppeteer-cluster for scraping
-        var numClusters = config.proxies.length + 1;
-
-        // the first browser config with home IP
-        let perBrowserOptions = [launch_args, ];
-
-        for (var proxy of config.proxies) {
-            perBrowserOptions.push({
-                headless: config.headless,
-                ignoreHTTPSErrors: true,
-                args: ADDITIONAL_CHROME_FLAGS.concat(`--proxy-server=${proxy}`)
-            })
-        }
+        var numClusters = config.puppeteer_cluster_config.maxConcurrency;
+        var perBrowserOptions = [];
+
+        // if we have at least one proxy, always use CONCURRENCY_BROWSER
+        // and set maxConcurrency to config.proxies.length + 1
+        // else use whatever configuration was passed
+        if (config.proxies.length > 0) {
+            config.puppeteer_cluster_config.concurrency = Cluster.CONCURRENCY_BROWSER;
+            config.puppeteer_cluster_config.maxConcurrency = config.proxies.length + 1;
+            numClusters = config.proxies.length + 1;
+
+            // the first browser config with home IP
+            perBrowserOptions = [launch_args, ];
+
+            for (var proxy of config.proxies) {
+                perBrowserOptions.push({
+                    headless: config.headless,
+                    ignoreHTTPSErrors: true,
+                    args: ADDITIONAL_CHROME_FLAGS.concat(`--proxy-server=${proxy}`)
+                })
+            }
+        }
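The per-browser option list built above gives the first browser the home IP and adds one extra browser per proxy, each with its own `--proxy-server` Chrome flag. A dependency-free sketch of that construction (the flag list and proxy addresses are stand-ins):

```javascript
// Stand-in values; in se-scraper these come from ADDITIONAL_CHROME_FLAGS
// and config.proxies in the handler above.
const baseFlags = ['--disable-infobars', '--window-position=0,0'];
const proxies = ['http://10.0.0.1:3128', 'http://10.0.0.2:3128'];

// First entry: plain launch args (home IP), then one entry per proxy.
const launchArgs = { headless: true, ignoreHTTPSErrors: true, args: baseFlags };
const perBrowserOptions = [launchArgs];

for (const proxy of proxies) {
    perBrowserOptions.push({
        headless: true,
        ignoreHTTPSErrors: true,
        args: baseFlags.concat(`--proxy-server=${proxy}`),
    });
}

console.log(perBrowserOptions.length); // 3: home IP + one browser per proxy
```

This is why `maxConcurrency` becomes `config.proxies.length + 1`: each concurrent browser consumes exactly one entry of this list.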
         var cluster = await Cluster.launch({
-            monitor: config.monitor,
-            timeout: 30 * 60 * 1000, // max timeout set to 30 minutes
-            concurrency: Cluster.CONCURRENCY_BROWSER,
-            maxConcurrency: numClusters,
+            monitor: config.puppeteer_cluster_config.monitor,
+            timeout: config.puppeteer_cluster_config.timeout, // max timeout set to 30 minutes
+            concurrency: config.puppeteer_cluster_config.concurrency,
+            maxConcurrency: config.puppeteer_cluster_config.maxConcurrency,
             puppeteerOptions: launch_args,
-            perBrowserOptions: perBrowserOptions
+            perBrowserOptions: perBrowserOptions,
         });

         cluster.on('taskerror', (err, data) => {
             console.log(`Error while scraping ${data}: ${err.message}`);
             console.log(err)
         });
-    }
-    let metadata = {};
-
-    // Each browser will get N/(K+1) keywords and will issue N/(K+1) * M total requests to the search engine.
-    // https://github.com/GoogleChrome/puppeteer/issues/678
-    // The question is: Is it possible to set proxies per Page? Per Browser?
-    // as far as I can see, puppeteer cluster uses the same puppeteerOptions
-    // for every browser instance. We will use our custom puppeteer-cluster version.
-    // https://www.npmjs.com/package/proxy-chain
-    // this answer looks nice: https://github.com/GoogleChrome/puppeteer/issues/678#issuecomment-389096077
-    let chunks = [];
-    for (var n = 0; n < numClusters; n++) {
-        chunks.push([]);
-    }
-    for (var k = 0; k < config.keywords.length; k++) {
-        chunks[k%numClusters].push(config.keywords[k]);
-    }
-    //console.log(`Generated ${chunks.length} chunks...`);
-
-    let execPromises = [];
-    let scraperInstances = [];
-    for (var c = 0; c < chunks.length; c++) {
-        config.keywords = chunks[c];
-        if (c>0) {
-            config.proxy = config.proxies[c];
-        }
-        obj = getScraper(config.search_engine, {
-            config: config,
-            context: context,
-            pluggable: pluggable,
-        });
-        var boundMethod = obj.run.bind(obj);
-        execPromises.push(cluster.execute({}, boundMethod));
-        scraperInstances.push(obj);
-    }
-
-    let results = await Promise.all(execPromises);
-    results = results[0]; // TODO: this is strange. fix that shit boy
+        // Each browser will get N/(K+1) keywords and will issue N/(K+1) * M total requests to the search engine.
+        // https://github.com/GoogleChrome/puppeteer/issues/678
+        // The question is: Is it possible to set proxies per Page? Per Browser?
+        // as far as I can see, puppeteer cluster uses the same puppeteerOptions
+        // for every browser instance. We will use our custom puppeteer-cluster version.
+        // https://www.npmjs.com/package/proxy-chain
+        // this answer looks nice: https://github.com/GoogleChrome/puppeteer/issues/678#issuecomment-389096077
+        let chunks = [];
+        for (var n = 0; n < numClusters; n++) {
+            chunks.push([]);
+        }
+        for (var k = 0; k < config.keywords.length; k++) {
+            chunks[k%numClusters].push(config.keywords[k]);
+        }
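The chunking loops above deal keywords out round-robin, so each of the `numClusters` browsers receives roughly `N / numClusters` keywords. A standalone sketch of that split (the function name is illustrative):

```javascript
// Round-robin split of keywords across numClusters workers,
// mirroring the two chunking loops in the handler above.
function chunkKeywords(keywords, numClusters) {
    const chunks = [];
    for (let n = 0; n < numClusters; n++) {
        chunks.push([]);
    }
    for (let k = 0; k < keywords.length; k++) {
        chunks[k % numClusters].push(keywords[k]);
    }
    return chunks;
}

console.log(chunkKeywords(['a', 'b', 'c', 'd', 'e'], 2));
// [ [ 'a', 'c', 'e' ], [ 'b', 'd' ] ]
```

Round-robin keeps the chunks balanced to within one keyword even when the keyword count is not a multiple of the worker count.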
+        let execPromises = [];
+        let scraperInstances = [];
+        for (var c = 0; c < chunks.length; c++) {
+            config.keywords = chunks[c];
+            // the first scraping config uses the home IP
+            if (c > 0) {
+                config.proxy = config.proxies[c-1];
+            }
+            var obj = getScraper(config.search_engine, {
+                config: config,
+                context: context,
+                pluggable: pluggable,
+            });
+
+            var boundMethod = obj.run.bind(obj);
+            execPromises.push(cluster.execute({}, boundMethod));
+            scraperInstances.push(obj);
+        }
+
+        let resolved = await Promise.all(execPromises);
+
+        for (var group of resolved) {
+            for (var key in group) {
+                results[key] = group[key];
+            }
+        }
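Instead of the old `results = results[0]` hack, which discarded every worker's output but the first, the new code merges each worker's keyword-to-data object into one combined map. A minimal sketch of that merge (function name and sample data are stand-ins):

```javascript
// Merge an array of per-worker result objects (keyword -> data)
// into one combined object, as the handler above now does.
function mergeResults(resolved) {
    const results = {};
    for (const group of resolved) {
        for (const key in group) {
            results[key] = group[key];
        }
    }
    return results;
}

const merged = mergeResults([
    { 'keyword one': { results: [] } },
    { 'keyword two': { results: [] } },
]);
console.log(Object.keys(merged)); // [ 'keyword one', 'keyword two' ]
```

Since each worker scrapes a disjoint keyword chunk, the keys never collide and a plain shallow copy is sufficient.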
         if (pluggable.close_browser) {
             await pluggable.close_browser();
         } else {
             await cluster.idle();
             await cluster.close();
         }

-    // count total requests among all scraper instances
-    let num_requests = 0;
-    for (var o of scraperInstances) {
-        num_requests += o.num_requests;
-    }
+        // count total requests among all scraper instances
+        for (var o of scraperInstances) {
+            num_requests += o.num_requests;
+        }
+    }

     let timeDelta = Date.now() - startTime;
     let ms_per_request = timeDelta/num_requests;

     if (config.verbose === true) {
-        console.log(`${numClusters} Scraper Workers took ${timeDelta}ms to perform ${num_requests} requests.`);
+        console.log(`se-scraper took ${timeDelta}ms to perform ${num_requests} requests.`);
         console.log(`On average ms/request: ${ms_per_request}ms/request`);
-        console.dir(results, {depth: null, colors: true});
+        //console.dir(results, {depth: null, colors: true});
     }

     if (config.compress === true) {
@@ -232,7 +260,7 @@ module.exports.handler = async function handler (event, context, callback) {
     }

     if (config.output_file) {
-        write_results(config.output_file, JSON.stringify(results));
+        write_results(config.output_file, JSON.stringify(results, null, 4));
     }

     let response = {
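The only change here is pretty-printing: passing `null` as the replacer and `4` as the space argument makes `JSON.stringify` indent nested output with four spaces instead of emitting one long line. For example:

```javascript
const results = { keyword: { page: 1 } };

// Compact form (old behaviour) vs. 4-space-indented form (new behaviour).
console.log(JSON.stringify(results));          // {"keyword":{"page":1}}
console.log(JSON.stringify(results, null, 4));
```

Indented output makes the written result file diff- and human-friendly at the cost of a somewhat larger file.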
@@ -17,7 +17,7 @@ async function normal_search_test() {
         keywords: normal_search_keywords,
         keyword_file: '',
         num_pages: 2,
-        headless: false,
+        headless: true,
         output_file: '',
         block_assets: true,
         user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',