se-scraper/README.md
2019-01-27 15:54:56 +01:00

# Search Engine Scraper
This Node module supports scraping several search engines.
Scraping is currently supported for the following search engines:
* Google
* Google News
* Google News New (https://news.google.com)
* Google Image
* Bing
* Baidu
* YouTube
* Infospace
* DuckDuckGo
* Webcrawler

Additionally, **se-scraper** supports investment ticker search on the following sites:
* Bloomberg
* Reuters
* CNBC
* MarketWatch

This module uses puppeteer. It was created by the developer of https://github.com/NikolaiT/GoogleScraper, a module with 1800 stars on GitHub.
### Technical Notes
Scraping is done with a headless Chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
Multithreading is not supported yet: each `scrape()` call runs a single scraping worker.
If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at hire@incolumitas.com.
### Installation and Usage
Install with
```bash
npm install se-scraper
```
Use se-scraper by calling it with a script such as the one below.
```js
const se_scraper = require('se-scraper');
const resolve = require('path').resolve;

let config = {
    // the user agent to scrape with
    user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    // if random_user_agent is set to true, a random user agent is chosen
    random_user_agent: false,
    // get meta data of scraping in return object
    write_meta_data: false,
    // how long to sleep between requests. a random sleep interval within the range [a,b]
    // is drawn before every request. empty string for no sleeping.
    sleep_range: '',
    // which search engine to scrape
    search_engine: 'google',
    // whether debug information should be printed
    // debug info is useful for developers when debugging
    debug: false,
    // whether verbose program output should be printed
    // this output is informational
    verbose: false,
    // an array of keywords to scrape
    keywords: ['scrapeulous.com'],
    // alternatively you can specify a keyword_file. this overwrites the keywords array
    keyword_file: '',
    // whether to start the browser in headless mode
    headless: true,
    // path to output file, data will be stored in JSON
    output_file: 'data.json',
    // whether to prevent images, css, fonts from being loaded
    // will speed up scraping a great deal
    block_assets: true,
    // path to js module that extends functionality
    // this module should export the functions:
    // get_browser, handle_metadata, close_browser
    // must be an absolute path to the module
    //custom_func: resolve('examples/pluggable.js'),
    custom_func: '',
};

se_scraper.scrape(config, (err, response) => {
    if (err) {
        // stop here: response may not be usable when an error occurred
        console.error(err);
        return;
    }
    /* response object has the following properties:
        response.results - json object with the scraping results
        response.metadata - json object with metadata information
        response.statusCode - status code of the scraping process
     */
    console.dir(response.results, {depth: null, colors: true});
});
```
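Because `scrape()` takes a Node-style callback, it can be awkward to run several scrapes one after another. The following is a minimal sketch of one way to do that, not part of se-scraper itself: `scrapeAsync` and `scrapeEngines` are hypothetical helper names, and the only assumption about the library is the `scrape(config, (err, response) => ...)` call shape shown above.

```javascript
// Hypothetical helper (not part of se-scraper): wraps a callback-style
// function of the form scrapeFn(config, (err, response) => ...) in a Promise.
function scrapeAsync(scrapeFn, config) {
    return new Promise((resolve, reject) => {
        scrapeFn(config, (err, response) => {
            if (err) {
                reject(err);
            } else {
                resolve(response);
            }
        });
    });
}

// Hypothetical helper: runs one scrape per search engine, strictly one after
// another, and collects response.results keyed by engine name.
async function scrapeEngines(scrapeFn, baseConfig, engines) {
    const out = {};
    for (const engine of engines) {
        // copy the base config so the caller's object is not modified
        const config = Object.assign({}, baseConfig, { search_engine: engine });
        const response = await scrapeAsync(scrapeFn, config);
        out[engine] = response.results;
    }
    return out;
}
```

With the `config` from the script above, this could be called as `scrapeEngines((c, cb) => se_scraper.scrape(c, cb), config, ['google', 'bing'])`. Passing an arrow function avoids any assumption about whether `scrape` depends on `this`.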
Supported options for the `search_engine` config key:
```javascript
'google'
'google_news_old'
'google_news'
'google_image'
'bing'
'bing_news'
'infospace'
'webcrawler'
'baidu'
'youtube'
'duckduckgo_news'
'google_dr'
'yahoo_news'
// ticker search
'bloomberg'
'reuters'
'cnbc'
'marketwatch'
```
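A typo in `search_engine` is easy to make, so it can help to check the value before starting a browser. The sketch below is an illustration only: `SUPPORTED_ENGINES` and `checkSearchEngine` are hypothetical names, and the values are simply copied from the list above rather than read from the library.

```javascript
// Values copied by hand from the list of supported `search_engine` options.
const SUPPORTED_ENGINES = new Set([
    'google', 'google_news_old', 'google_news', 'google_image',
    'bing', 'bing_news', 'infospace', 'webcrawler', 'baidu', 'youtube',
    'duckduckgo_news', 'google_dr', 'yahoo_news',
    // ticker search
    'bloomberg', 'reuters', 'cnbc', 'marketwatch',
]);

// Hypothetical helper: returns the name unchanged if it is in the list,
// otherwise throws so the mistake surfaces before any scraping starts.
function checkSearchEngine(name) {
    if (!SUPPORTED_ENGINES.has(name)) {
        throw new Error('Unsupported search_engine: ' + JSON.stringify(name));
    }
    return name;
}
```

For example, `config.search_engine = checkSearchEngine('bing_news')` passes, while `checkSearchEngine('duckduckgo')` throws because only `'duckduckgo_news'` appears in the list.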
Example output on my laptop, from a run with the two keywords `incolumitas.com scraping` and `best scraping framework`:
```text
Scraper took 4295ms to scrape 2 keywords.
On average ms/keyword: 2147.5ms/keyword
{ 'incolumitas.com scraping':
{ time: 'Mon, 24 Dec 2018 13:07:43 GMT',
num_results: 'Ungefähr 2020 Ergebnisse (0.18 Sekunden) ',
no_results: false,
effective_query: '',
results:
[ { link:
'https://incolumitas.com/2018/10/29/youtube-puppeteer-scraping/',
title:
'Coding, Learning and Business Ideas Tutorial: Youtube scraping ...',
snippet:
'29.10.2018 - In this blog post I am going to show you how to scrape YouTube video data using the handy puppeteer library. Puppeteer is a Node library ...',
visible_link:
'https://incolumitas.com/2018/10/29/youtube-puppeteer-scraping/',
date: '29.10.2018 - ',
rank: 1 },
{ link: 'https://incolumitas.com/2018/09/05/googlescraper-tutorial/',
title:
'GoogleScraper Tutorial - How to scrape 1000 keywords with Google',
snippet:
'05.09.2018 - Tutorial that teaches how to use GoogleScraper to scrape 1000 keywords with 10 selenium browsers.',
visible_link: 'https://incolumitas.com/2018/09/05/googlescraper-tutorial/',
date: '05.09.2018 - ',
rank: 2 },
{ link: 'https://incolumitas.com/tag/scraping.html',
title: 'Coding, Learning and Business Ideas Tag Scraping',
snippet:
'Scraping Amazon Reviews using Headless Chrome Browser and Python3. Posted on Mi ... GoogleScraper Tutorial - How to scrape 1000 keywords with Google.',
visible_link: 'https://incolumitas.com/tag/scraping.html',
date: '',
rank: 3 },
{ link: 'https://incolumitas.com/category/scraping.html',
title: 'Coding, Learning and Business Ideas Category Scraping',
snippet:
'Nikolai Tschacher\'s ideas and projects around IT security and computer science.',
visible_link: 'https://incolumitas.com/category/scraping.html',
date: '',
rank: 4 },
{ link:
'https://github.com/NikolaiT/incolumitas/blob/master/content/Meta/scraping-and-extracting-links-from-any-major-search-engine-like-google-yandex-baidu-bing-and-duckduckgo.md',
title:
'incolumitas/scraping-and-extracting-links-from-any-major-search ...',
snippet:
'Title: Scraping and Extracting Links from any major Search Engine like Google, Yandex, Baidu, Bing and Duckduckgo Date: 2014-11-12 00:47 Author: Nikolai ...',
visible_link:
'https://github.com/.../incolumitas/.../scraping-and-extracting-links...',
date: '',
rank: 5 },
{ link:
'https://stackoverflow.com/questions/16955325/scraping-google-results-with-python',
title: 'Scraping Google Results with Python - Stack Overflow',
snippet:
'I found this. incolumitas.com/2013/01/06/… But the author claims it is not ported to 2.7 yet. user2351394 Jun 6 \'13 at 6:59 ...',
visible_link:
'https://stackoverflow.com/.../scraping-google-results-with-python',
date: '',
rank: 6 },
{ link: 'https://pypi.org/project/GoogleScraper/0.1.18/',
title: 'GoogleScraper · PyPI',
snippet:
'[5]: http://incolumitas.com/2014/11/12/scraping-and-extracting-links-from-any-major-search-engine-like-google-yandex-baidu-bing-and-duckduckgo/ ...',
visible_link: 'https://pypi.org/project/GoogleScraper/0.1.18/',
date: '',
rank: 7 },
{ link:
'https://www.reddit.com/r/Python/comments/2m0vyu/scraping_links_on_google_yandex_bing_duckduckgo/',
title:
'Scraping links on Google, Yandex, Bing, Duckduckgo, Baidu and ...',
snippet:
'12.11.2014 - Scraping links on Google, Yandex, Bing, Duckduckgo, Baidu and other search engines with Python ... submitted 4 years ago by incolumitas.',
visible_link:
'https://www.reddit.com/.../scraping_links_on_google_yandex_bi...',
date: '12.11.2014 - ',
rank: 9 },
{ link: 'https://twitter.com/incolumitas_?lang=de',
title: 'Nikolai Tschacher (@incolumitas_) | Twitter',
snippet:
'Embed Tweet. How to use GoogleScraper to scrape images and download them ... Learn how to scrape millions of url from yandex and google or bing with: ...',
visible_link: 'https://twitter.com/incolumitas_?lang=de',
date: '',
rank: 10 } ] },
'best scraping framework':
{ time: 'Mon, 24 Dec 2018 13:07:44 GMT',
num_results: 'Ungefähr 2820000 Ergebnisse (0.36 Sekunden) ',
no_results: false,
effective_query: '',
results:
[ { link:
'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
snippet: '',
visible_link:
'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
date: '',
rank: 1 },
{ link:
'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
snippet: '',
visible_link:
'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
date: '',
rank: 2 },
{ link:
'https://www.scrapehero.com/open-source-web-scraping-frameworks-and-tools/',
title:
'Best Open Source Web Scraping Frameworks and Tools - ScrapeHero',
snippet:
'05.06.2018 - List of Open Source Web Scraping Frameworks. Scrapy. MechanicalSoup. PySpider. Portia. Apify SDK. Nodecrawler. Selenium WebDriver. Puppeteer.',
visible_link:
'https://www.scrapehero.com/open-source-web-scraping-framewo...',
date: '05.06.2018 - ',
rank: 3 },
{ link:
'https://medium.com/datadriveninvestor/best-data-scraping-tools-for-2018-top-10-reviews-558cc5a4992f',
title:
'Best Data Scraping Tools for 2018 (Top 10 Reviews) Data Driven ...',
snippet:
'05.03.2018 - Pros: Octoparse is the best free data scraping tool I\'ve met. ... your Scrapy (a open-source data extraction framework) web spider\'s activities.',
visible_link:
'https://medium.com/.../best-data-scraping-tools-for-2018-top-10-...',
date: '05.03.2018 - ',
rank: 4 },
{ link:
'https://www.quora.com/What-is-the-best-web-scraping-open-source-tool',
title: 'What is the best web scraping open source tool? - Quora',
snippet:
'15.06.2015 - My personal favourite is Python Scrapy and it is an excellent framework for building a web data scraper. Why Scrapy? 1) It is an open source framework and cost ...',
visible_link:
'https://www.quora.com/What-is-the-best-web-scraping-open-sour...',
date: '15.06.2015 - ',
rank: 5 },
{ link:
'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
snippet:
'21.05.2018 - Top Web Scraping Frameworks and Libraries. Requests. Scrapy. Beautiful Soup. Selenium with Python. lxml. Webscraping with Selenium - part 1. Extracting data from websites with Scrapy. Scrapinghub.',
visible_link:
'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
date: '21.05.2018 - ',
rank: 6 },
{ link: 'https://scrapy.org/',
title:
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework',
snippet:
'An open source and collaborative framework for extracting the data you need from ... Spider): name = \'blogspider\' start_urls = [\'https://blog.scrapinghub.com\'] def ...',
visible_link: 'https://scrapy.org/',
date: '',
rank: 7 },
{ link:
'https://www.scraperapi.com/blog/the-10-best-web-scraping-tools',
title: 'The 10 Best Web Scraping Tools of 2018 - Scraper API',
snippet:
'19.07.2018 - The 10 Best Web Scraping Tools of 2018. ParseHub. Scrapy. Diffbot. Cheerio. Website: https://cheerio.js.org. Beautiful Soup. Website: https://www.crummy.com/software/BeautifulSoup/ Puppeteer. Website: https://github.com/GoogleChrome/puppeteer. Content Grabber. Website: http://www.contentgrabber.com/ Mozenda. Website: ...',
visible_link:
'https://www.scraperapi.com/blog/the-10-best-web-scraping-tools',
date: '19.07.2018 - ',
rank: 8 },
{ link: 'https://elitedatascience.com/python-web-scraping-libraries',
title: '5 Tasty Python Web Scraping Libraries - EliteDataScience',
snippet:
'03.02.2017 - We\'ve decided to feature the 5 Python libraries for web scraping that ... The good news is that you can swap out its parser with a faster one if ... Scrapy is technically not even a library… it\'s a complete web scraping framework.',
visible_link: 'https://elitedatascience.com/python-web-scraping-libraries',
date: '03.02.2017 - ',
rank: 9 },
{ link:
'https://blog.michaelyin.info/web-scraping-framework-review-scrapy-vs-selenium/',
title:
'Web Scraping Framework Review: Scrapy VS Selenium | MichaelYin ...',
snippet:
'01.10.2018 - In this Scrapy tutorial, I will cover the features of Scrapy and Selenium, and help you decide which one is better for your projects.',
visible_link:
'https://blog.michaelyin.info/web-scraping-framework-review-scr...',
date: '01.10.2018 - ',
rank: 10 },
{ link: 'https://github.com/lorien/awesome-web-scraping',
title:
'GitHub - lorien/awesome-web-scraping: List of libraries, tools and APIs ...',
snippet:
'List of libraries, tools and APIs for web scraping and data processing. ... golang.md · add dataflow kit framework, 2 months ago ... Make this list better!',
visible_link: 'https://github.com/lorien/awesome-web-scraping',
date: '',
rank: 11 },
{ link: 'https://www.import.io/post/best-web-scraping-tools-2018/',
title: 'Best Web Scraping Software Tools 2018 | Import.io',
snippet:
'07.08.2018 - List of Best Web Scraping SoftwareThere are hundreds of Web ... it is a fast high-level screen scraping and web crawling framework, used to ...',
visible_link: 'https://www.import.io/post/best-web-scraping-tools-2018/',
date: '07.08.2018 - ',
rank: 12 } ] } }
```
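The results object is keyed by keyword, with each entry holding a `results` array. For further processing it is often handier to have one flat list of rows. The following sketch assumes exactly the shape shown in the output above; `flattenResults` is a hypothetical helper, not a function provided by se-scraper.

```javascript
// Hypothetical helper: flattens { [keyword]: { results: [...] } } into an
// array of { keyword, rank, title, link } rows. Keywords without a
// `results` array contribute no rows.
function flattenResults(results) {
    const rows = [];
    for (const [keyword, page] of Object.entries(results)) {
        for (const item of page.results || []) {
            rows.push({
                keyword: keyword,
                rank: item.rank,
                title: item.title,
                link: item.link,
            });
        }
    }
    return rows;
}

// Small sample in the same shape as the output above (one entry taken from it).
const sample = {
    'best scraping framework': {
        no_results: false,
        results: [
            {
                link: 'https://scrapy.org/',
                title: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework',
                rank: 7,
            },
        ],
    },
};

console.log(flattenResults(sample));
```

In the callback from the usage script, the same helper could be applied as `flattenResults(response.results)`.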