se-scraper/README.md
2019-01-27 15:54:56 +01:00

14 KiB
Raw Blame History

Search Engine Scraper

This node module supports scraping several search engines.

Right now scraping the search engines

  • Google
  • Google News
  • Google News New (https://news.google.com)
  • Google Image
  • Bing
  • Baidu
  • Youtube
  • Infospace
  • Duckduckgo
  • Webcrawler

is supported.

Additionally se-scraper supports investment ticker search from the following sites:

  • Bloomberg
  • Reuters
  • cnbc
  • Marketwatch

This module uses puppeteer. It was created by the Developer of https://github.com/NikolaiT/GoogleScraper, a module with 1800 Stars on Github.

Technical Notes

Scraping is done with a headless chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

No multithreading is supported for now. Only one scraping worker per scrape() call.

If you need to deploy scraping to the cloud (AWS or Azure), you can contact me on hire@incolumitas.com

Installation and Usage

Install with

npm install se-scraper

Use se-scraper by calling it with a script such as the one below.

const se_scraper = require('se-scraper');
const resolve = require('path').resolve;

let config = {
    // the user agent to scrape with
    user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    // if random_user_agent is set to True, a random user agent is chosen
    random_user_agent: false,
    // get meta data of scraping in return object
    write_meta_data: false,
    // how long to sleep between requests. a random sleep interval within the range [a,b]
    // is drawn before every request. empty string for no sleeping.
    sleep_range: '',
    // which search engine to scrape
    search_engine: 'google',
    // whether debug information should be printed
    // debug info is useful for developers when debugging
    debug: false,
    // whether verbose program output should be printed
    // this output is informational
    verbose: false,
    // an array of keywords to scrape
    keywords: ['scrapeulous.com', ],
    // alternatively you can specify a keyword_file. this overwrites the keywords array
    keyword_file: '',
    // whether to start the browser in headless mode
    headless: true,
    // path to output file, data will be stored in JSON
    output_file: 'data.json',
    // whether to prevent images, css, fonts from being loaded
    // will speed up scraping a great deal
    block_assets: true,
    // path to js module that extends functionality
    // this module should export the functions:
    // get_browser, handle_metadata, close_browser
    // must be an absolute path to the module
    //custom_func: resolve('examples/pluggable.js'),
    custom_func: '',
};

se_scraper.scrape(config, (err, response) => {
    if (err) { console.error(err) }

    /* response object has the following properties:

        response.results - json object with the scraping results
        response.metadata - json object with metadata information
        response.statusCode - status code of the scraping process
     */

    console.dir(response.results, {depth: null, colors: true});
});

Supported options for the search_engine config key:

'google'
'google_news_old'
'google_news'
'google_image'
'bing'
'bing_news'
'infospace'
'webcrawler'
'baidu'
'youtube'
'duckduckgo_news'
'google_dr'
'yahoo_news'
// ticker search
'bloomberg'
'reuters'
'cnbc'
'marketwatch'

Output for the above script on my laptop:

Scraper took 4295ms to scrape 2 keywords.
On average ms/keyword: 2147.5ms/keyword
{ 'incolumitas.com scraping':
   { time: 'Mon, 24 Dec 2018 13:07:43 GMT',
     num_results: 'Ungefähr 2020 Ergebnisse (0.18 Sekunden) ',
     no_results: false,
     effective_query: '',
     results:
      [ { link:
           'https://incolumitas.com/2018/10/29/youtube-puppeteer-scraping/',
          title:
           'Coding, Learning and Business Ideas  Tutorial: Youtube scraping ...',
          snippet:
           '29.10.2018 - In this blog post I am going to show you how to scrape YouTube video data using the handy puppeteer library. Puppeteer is a Node library ...',
          visible_link:
           'https://incolumitas.com/2018/10/29/youtube-puppeteer-scraping/',
          date: '29.10.2018 - ',
          rank: 1 },
        { link: 'https://incolumitas.com/2018/09/05/googlescraper-tutorial/',
          title:
           'GoogleScraper Tutorial - How to scrape 1000 keywords with Google',
          snippet:
           '05.09.2018 - Tutorial that teaches how to use GoogleScraper to scrape 1000 keywords with 10 selenium browsers.',
          visible_link: 'https://incolumitas.com/2018/09/05/googlescraper-tutorial/',
          date: '05.09.2018 - ',
          rank: 2 },
        { link: 'https://incolumitas.com/tag/scraping.html',
          title: 'Coding, Learning and Business Ideas  Tag Scraping',
          snippet:
           'Scraping Amazon Reviews using Headless Chrome Browser and Python3. Posted on Mi ... GoogleScraper Tutorial - How to scrape 1000 keywords with Google.',
          visible_link: 'https://incolumitas.com/tag/scraping.html',
          date: '',
          rank: 3 },
        { link: 'https://incolumitas.com/category/scraping.html',
          title: 'Coding, Learning and Business Ideas  Category Scraping',
          snippet:
           'Nikolai Tschacher\'s ideas and projects around IT security and computer science.',
          visible_link: 'https://incolumitas.com/category/scraping.html',
          date: '',
          rank: 4 },
        { link:
           'https://github.com/NikolaiT/incolumitas/blob/master/content/Meta/scraping-and-extracting-links-from-any-major-search-engine-like-google-yandex-baidu-bing-and-duckduckgo.md',
          title:
           'incolumitas/scraping-and-extracting-links-from-any-major-search ...',
          snippet:
           'Title: Scraping and Extracting Links from any major Search Engine like Google, Yandex, Baidu, Bing and Duckduckgo Date: 2014-11-12 00:47 Author: Nikolai ...',
          visible_link:
           'https://github.com/.../incolumitas/.../scraping-and-extracting-links...',
          date: '',
          rank: 5 },
        { link:
           'https://stackoverflow.com/questions/16955325/scraping-google-results-with-python',
          title: 'Scraping Google Results with Python - Stack Overflow',
          snippet:
           'I found this. incolumitas.com/2013/01/06/… But the author claims it is not ported to 2.7 yet.  user2351394 Jun 6 \'13 at 6:59 ...',
          visible_link:
           'https://stackoverflow.com/.../scraping-google-results-with-python',
          date: '',
          rank: 6 },
        { link: 'https://pypi.org/project/GoogleScraper/0.1.18/',
          title: 'GoogleScraper · PyPI',
          snippet:
           '[5]: http://incolumitas.com/2014/11/12/scraping-and-extracting-links-from-any-major-search-engine-like-google-yandex-baidu-bing-and-duckduckgo/ ...',
          visible_link: 'https://pypi.org/project/GoogleScraper/0.1.18/',
          date: '',
          rank: 7 },
        { link:
           'https://www.reddit.com/r/Python/comments/2m0vyu/scraping_links_on_google_yandex_bing_duckduckgo/',
          title:
           'Scraping links on Google, Yandex, Bing, Duckduckgo, Baidu and ...',
          snippet:
           '12.11.2014 - Scraping links on Google, Yandex, Bing, Duckduckgo, Baidu and other search engines with Python ... submitted 4 years ago by incolumitas.',
          visible_link:
           'https://www.reddit.com/.../scraping_links_on_google_yandex_bi...',
          date: '12.11.2014 - ',
          rank: 9 },
        { link: 'https://twitter.com/incolumitas_?lang=de',
          title: 'Nikolai Tschacher (@incolumitas_) | Twitter',
          snippet:
           'Embed Tweet. How to use GoogleScraper to scrape images and download them ... Learn how to scrape millions of url from yandex and google or bing with: ...',
          visible_link: 'https://twitter.com/incolumitas_?lang=de',
          date: '',
          rank: 10 } ] },
  'best scraping framework':
   { time: 'Mon, 24 Dec 2018 13:07:44 GMT',
     num_results: 'Ungefähr 2820000 Ergebnisse (0.36 Sekunden) ',
     no_results: false,
     effective_query: '',
     results:
      [ { link:
           'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
          snippet: '',
          visible_link:
           'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          date: '',
          rank: 1 },
        { link:
           'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
          snippet: '',
          visible_link:
           'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          date: '',
          rank: 2 },
        { link:
           'https://www.scrapehero.com/open-source-web-scraping-frameworks-and-tools/',
          title:
           'Best Open Source Web Scraping Frameworks and Tools - ScrapeHero',
          snippet:
           '05.06.2018 - List of Open Source Web Scraping Frameworks. Scrapy. MechanicalSoup. PySpider. Portia. Apify SDK. Nodecrawler. Selenium WebDriver. Puppeteer.',
          visible_link:
           'https://www.scrapehero.com/open-source-web-scraping-framewo...',
          date: '05.06.2018 - ',
          rank: 3 },
        { link:
           'https://medium.com/datadriveninvestor/best-data-scraping-tools-for-2018-top-10-reviews-558cc5a4992f',
          title:
           'Best Data Scraping Tools for 2018 (Top 10 Reviews)  Data Driven ...',
          snippet:
           '05.03.2018 - Pros: Octoparse is the best free data scraping tool I\'ve met. ... your Scrapy (a open-source data extraction framework) web spider\'s activities.',
          visible_link:
           'https://medium.com/.../best-data-scraping-tools-for-2018-top-10-...',
          date: '05.03.2018 - ',
          rank: 4 },
        { link:
           'https://www.quora.com/What-is-the-best-web-scraping-open-source-tool',
          title: 'What is the best web scraping open source tool? - Quora',
          snippet:
           '15.06.2015 - My personal favourite is Python Scrapy and it is an excellent framework for building a web data scraper. Why Scrapy? 1) It is an open source framework and cost ...',
          visible_link:
           'https://www.quora.com/What-is-the-best-web-scraping-open-sour...',
          date: '15.06.2015 - ',
          rank: 5 },
        { link:
           'http://www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          title: 'Top Web Scraping Frameworks and Libraries - AI Optify',
          snippet:
           '21.05.2018 - Top Web Scraping Frameworks and Libraries. Requests. Scrapy. Beautiful Soup. Selenium with Python. lxml. Webscraping with Selenium - part 1. Extracting data from websites with Scrapy. Scrapinghub.',
          visible_link:
           'www.aioptify.com/top-web-scraping-frameworks-and-librares.php',
          date: '21.05.2018 - ',
          rank: 6 },
        { link: 'https://scrapy.org/',
          title:
           'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework',
          snippet:
           'An open source and collaborative framework for extracting the data you need from ... Spider): name = \'blogspider\' start_urls = [\'https://blog.scrapinghub.com\'] def ...',
          visible_link: 'https://scrapy.org/',
          date: '',
          rank: 7 },
        { link:
           'https://www.scraperapi.com/blog/the-10-best-web-scraping-tools',
          title: 'The 10 Best Web Scraping Tools of 2018 - Scraper API',
          snippet:
           '19.07.2018 - The 10 Best Web Scraping Tools of 2018. ParseHub. Scrapy. Diffbot. Cheerio. Website: https://cheerio.js.org. Beautiful Soup. Website: https://www.crummy.com/software/BeautifulSoup/ Puppeteer. Website: https://github.com/GoogleChrome/puppeteer. Content Grabber. Website: http://www.contentgrabber.com/ Mozenda. Website: ...',
          visible_link:
           'https://www.scraperapi.com/blog/the-10-best-web-scraping-tools',
          date: '19.07.2018 - ',
          rank: 8 },
        { link: 'https://elitedatascience.com/python-web-scraping-libraries',
          title: '5 Tasty Python Web Scraping Libraries - EliteDataScience',
          snippet:
           '03.02.2017 - We\'ve decided to feature the 5 Python libraries for web scraping that ... The good news is that you can swap out its parser with a faster one if ... Scrapy is technically not even a library… it\'s a complete web scraping framework.',
          visible_link: 'https://elitedatascience.com/python-web-scraping-libraries',
          date: '03.02.2017 - ',
          rank: 9 },
        { link:
           'https://blog.michaelyin.info/web-scraping-framework-review-scrapy-vs-selenium/',
          title:
           'Web Scraping Framework Review: Scrapy VS Selenium | MichaelYin ...',
          snippet:
           '01.10.2018 - In this Scrapy tutorial, I will cover the features of Scrapy and Selenium, and help you decide which one is better for your projects.',
          visible_link:
           'https://blog.michaelyin.info/web-scraping-framework-review-scr...',
          date: '01.10.2018 - ',
          rank: 10 },
        { link: 'https://github.com/lorien/awesome-web-scraping',
          title:
           'GitHub - lorien/awesome-web-scraping: List of libraries, tools and APIs ...',
          snippet:
           'List of libraries, tools and APIs for web scraping and data processing. ... golang.md · add dataflow kit framework, 2 months ago ... Make this list better!',
          visible_link: 'https://github.com/lorien/awesome-web-scraping',
          date: '',
          rank: 11 },
        { link: 'https://www.import.io/post/best-web-scraping-tools-2018/',
          title: 'Best Web Scraping Software Tools 2018 | Import.io',
          snippet:
           '07.08.2018 - List of Best Web Scraping SoftwareThere are hundreds of Web ... it is a fast high-level screen scraping and web crawling framework, used to ...',
          visible_link: 'https://www.import.io/post/best-web-scraping-tools-2018/',
          date: '07.08.2018 - ',
          rank: 12 } ] } }