fixed google SERP title, better docker support

Nikolai Tschacher 2019-09-23 16:46:22 +02:00
parent b25f7a4285
commit 07f3dceba1
3 changed files with 38 additions and 8 deletions

View File

@@ -6,10 +6,11 @@
 This node module allows you to scrape search engines concurrently with different proxies.
-If you don't have much technical experience or don't want to purchase proxies, you can use [my scraping service](https://scrapeulous.com/).
+If you don't have extensive technical experience or don't want to purchase proxies, you can use [my scraping service](https://scrapeulous.com/).
-##### Table of Contents
+#### Table of Contents
 - [Installation](#installation)
+- [Docker](#docker-support)
 - [Minimal Example](#minimal-example)
 - [Quickstart](#quickstart)
 - [Contribute](#contribute)
@@ -75,7 +76,7 @@ If you **don't** want puppeteer to download a complete chromium browser, add thi
 export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
 ```
-### Docker Image
+### Docker Support
 I will maintain a public docker image of se-scraper. Pull the docker image with the command:
@@ -83,7 +84,30 @@ I will maintain a public docker image of se-scraper. Pull the docker image with
 docker pull tschachn/se-scraper
 ```
-When the image is running, you may start scrape jobs via an HTTP API:
+Confirm that the docker image was correctly pulled:
+```bash
+docker image ls
+```
+It should show something like this:
+```
+tschachn/se-scraper secondtry 897e1aeeba78 21 minutes ago 1.29GB
+```
+You can check the [latest tag here](https://hub.docker.com/r/tschachn/se-scraper/tags). In the example below, the latest tag is **secondtry**. This will most likely change to **latest** in the future.
+Run the docker image and map the internal port 3000 to the external port 3000:
+```bash
+$ docker run -p 3000:3000 tschachn/se-scraper:secondtry
+Running on http://0.0.0.0:3000
+```
+When the image is running, you may start scrape jobs via an HTTP API:
 ```bash
 curl -XPOST http://0.0.0.0:3000 -H 'Content-Type: application/json' \

View File

@@ -1,6 +1,6 @@
 {
   "name": "se-scraper",
-  "version": "1.5.1",
+  "version": "1.5.2",
   "description": "A module using puppeteer to scrape several search engines such as Google, Bing and Duckduckgo",
   "homepage": "https://scrapeulous.com/",
   "main": "index.js",

View File

@@ -16,13 +16,19 @@ class GoogleScraper extends Scraper {
         const results = [];
         $('#center_col .g').each((i, link) => {
-            results.push({
+            let obj = {
                 link: $(link).find('.r a').attr('href'),
-                title: $(link).find('.r a').text(),
+                title: $(link).find('.r a h3').text(),
                 snippet: $(link).find('span.st').text(),
                 visible_link: $(link).find('.r cite').text(),
                 date: $(link).find('span.f').text() || '',
-            })
+            };
+            if (obj.date) {
+                obj.date = obj.date.replace(' - ', '');
+            }
+            results.push(obj);
         });
 
         // parse ads
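
The per-result logic in the hunk above can be sketched as a standalone snippet. This is a hypothetical helper (`buildResult` is not part of se-scraper): the cheerio lookups are replaced with plain field values so only the object construction and the date cleanup from the diff remain, and the sample input values are made up for illustration:

```javascript
// Hypothetical standalone sketch of the per-result handling added above.
// Field names match the diff; input is plain data instead of a cheerio node.
function buildResult({ link, title, snippet, visible_link, date }) {
    let obj = {
        link: link,
        title: title,
        snippet: snippet,
        visible_link: visible_link,
        date: date || '',
    };
    // Google appends ' - ' after the date inside span.f; strip it.
    if (obj.date) {
        obj.date = obj.date.replace(' - ', '');
    }
    return obj;
}

const result = buildResult({
    link: 'https://example.org/',
    title: 'Example Domain',
    snippet: 'Example snippet text',
    visible_link: 'example.org',
    date: '23.09.2019 - ',
});
console.log(result.date); // '23.09.2019'
```

The `date || ''` fallback mirrors the diff's handling of results without a `span.f` element, so the cleanup branch is skipped for empty dates.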