Mirror of https://github.com/NikolaiT/se-scraper.git, synced 2025-02-26 13:30:58 +01:00

Commit 7572ebd314 (parent d5b147296e): added chrome detection evasion techniques

README.md (36 lines changed)
@@ -6,7 +6,7 @@ Right now scraping the search engines
 * Google
 * Google News
-* Google News New (https://news.google.com)
+* Google News App version (https://news.google.com)
 * Google Image
 * Bing
 * Baidu

@@ -65,9 +65,14 @@ se_scraper.scrape(config, callback);
 Scraping is done with a headless chromium browser using the automation library puppeteer. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

 No multithreading is supported for now. Only one scraping worker per `scrape()` call.

+We will soon support parallelization. **se-scraper** will support an architecture similar to:
+
+1. https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html
+2. https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html
+
-If you need to deploy scraping to the cloud (AWS or Azure), you can contact me on hire@incolumitas.com
+If you need to deploy scraping to the cloud (AWS or Azure), you can contact me at hire@incolumitas.com

 The chromium browser is started with the following flags to prevent
 scraping detection.

@@ -104,11 +109,32 @@ page.on('request', (req) => {
 });
 ```

-#### Making puppeteer and headless chrome undetectable
+### Making puppeteer and headless chrome undetectable

 Consider the following resources:

 * https://intoli.com/blog/making-chrome-headless-undetectable/
+* https://intoli.com/blog/not-possible-to-block-chrome-headless/
+* https://news.ycombinator.com/item?id=16179602
+
+**se-scraper** implements the countermeasures against headless chrome detection proposed on those sites.
+
+The most recent detection countermeasures can be found here:
+
+* https://github.com/paulirish/headless-cat-n-mouse/blob/master/apply-evasions.js
+
+**se-scraper** makes use of those anti-detection techniques.
+
+To check whether evasion works, you can test it by passing the `test_evasion` flag to the config:
+
+```js
+let config = {
+    // check if headless chrome escapes common detection techniques
+    test_evasion: true
+};
+```
+
+It will create a screenshot named `headless-test-result.png` in the directory where the scraper was started that shows whether all tests have passed.

 ### Advanced Usage

@@ -123,8 +149,6 @@ let config = {
     user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
     // if random_user_agent is set to True, a random user agent is chosen
     random_user_agent: true,
-    // get meta data of scraping in return object
-    write_meta_data: false,
     // how long to sleep between requests. a random sleep interval within the range [a,b]
     // is drawn before every request. empty string for no sleeping.
     sleep_range: '[1,2]',
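As a usage illustration (not part of this commit), a minimal end-to-end sketch of the new `test_evasion` option, assuming the exported `scrape(config, callback)` API and the config keys shown in this diff:

```js
// Hypothetical quick check that the evasion countermeasures work.
const se_scraper = require('se-scraper');

let config = {
    search_engine: 'google',
    keywords: ['news'],
    num_pages: 1,
    // run the headless-detection test page and save a screenshot named
    // headless-test-result.png in the directory the scraper was started from
    test_evasion: true,
};

se_scraper.scrape(config, (err, response) => {
    if (err) {
        console.error(err);
    } else {
        // the evasion screenshot is written as a side effect; results are returned as usual
        console.dir(response.results, { depth: null, colors: true });
    }
});
```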
TODO.txt (5 lines changed)

@@ -31,6 +31,9 @@
 - modify all scrapers to use the generic class where it makes sense
 - Bing, Baidu, Google, Duckduckgo

+7.2.2019
+- add num_requests to test cases
+
 TODO:
 - think about implementing ticker search for: https://quotes.wsj.com/MSFT?mod=searchresults_companyquotes
 - add proxy support

@@ -47,4 +50,4 @@ TODO:
 okay its fucking time to make a generic scraping class like in GoogleScraper [done]
 i feel like history repeats

-write good test case for google
+write good test case for google [done]
(deleted file, @@ -1 +0,0 @@: a one-line cached JSON search-result fixture for the keyword "scrapeulous.com", dated Thu, 31 Jan 2019)
examples/detection_checker.js (new file, 161 lines)

@@ -0,0 +1,161 @@
/*
 * See here for most recent detection avoidance: https://github.com/paulirish/headless-cat-n-mouse/blob/master/apply-evasions.js
 */

// We'll use Puppeteer as our browser automation framework.
const puppeteer = require('puppeteer');

// This is where we'll put the code to get around the tests.
const preparePageForTests = async (page) => {
    // Pass the User-Agent Test.
    const userAgent = 'Mozilla/5.0 (X11; Linux x86_64)' +
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
    await page.setUserAgent(userAgent);

    // Pass the Webdriver Test.
    await page.evaluateOnNewDocument(() => {
        const newProto = navigator.__proto__;
        delete newProto.webdriver;
        navigator.__proto__ = newProto;
    });

    // Pass the Chrome Test.
    await page.evaluateOnNewDocument(() => {
        // We can mock this in as much depth as we need for the test.
        const mockObj = {
            app: {
                isInstalled: false,
            },
            webstore: {
                onInstallStageChanged: {},
                onDownloadProgress: {},
            },
            runtime: {
                PlatformOs: {
                    MAC: 'mac',
                    WIN: 'win',
                    ANDROID: 'android',
                    CROS: 'cros',
                    LINUX: 'linux',
                    OPENBSD: 'openbsd',
                },
                PlatformArch: {
                    ARM: 'arm',
                    X86_32: 'x86-32',
                    X86_64: 'x86-64',
                },
                PlatformNaclArch: {
                    ARM: 'arm',
                    X86_32: 'x86-32',
                    X86_64: 'x86-64',
                },
                RequestUpdateCheckStatus: {
                    THROTTLED: 'throttled',
                    NO_UPDATE: 'no_update',
                    UPDATE_AVAILABLE: 'update_available',
                },
                OnInstalledReason: {
                    INSTALL: 'install',
                    UPDATE: 'update',
                    CHROME_UPDATE: 'chrome_update',
                    SHARED_MODULE_UPDATE: 'shared_module_update',
                },
                OnRestartRequiredReason: {
                    APP_UPDATE: 'app_update',
                    OS_UPDATE: 'os_update',
                    PERIODIC: 'periodic',
                },
            },
        };

        window.navigator.chrome = mockObj;
        window.chrome = mockObj;
    });

    // Pass the Permissions Test.
    await page.evaluateOnNewDocument(() => {
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.__proto__.query = parameters =>
            parameters.name === 'notifications'
                ? Promise.resolve({state: Notification.permission})
                : originalQuery(parameters);

        // Inspired by: https://github.com/ikarienator/phantomjs_hide_and_seek/blob/master/5.spoofFunctionBind.js
        const oldCall = Function.prototype.call;
        function call() {
            return oldCall.apply(this, arguments);
        }
        Function.prototype.call = call;

        const nativeToStringFunctionString = Error.toString().replace(/Error/g, "toString");
        const oldToString = Function.prototype.toString;

        function functionToString() {
            if (this === window.navigator.permissions.query) {
                return "function query() { [native code] }";
            }
            if (this === functionToString) {
                return nativeToStringFunctionString;
            }
            return oldCall.call(oldToString, this);
        }
        Function.prototype.toString = functionToString;
    });

    // Pass the Plugins Length Test.
    await page.evaluateOnNewDocument(() => {
        // Overwrite the `plugins` property to use a custom getter.
        Object.defineProperty(navigator, 'plugins', {
            // This just needs to have `length > 0` for the current test,
            // but we could mock the plugins too if necessary.
            get: () => [1, 2, 3, 4, 5]
        });
    });

    // Pass the Languages Test.
    await page.evaluateOnNewDocument(() => {
        // Overwrite the `languages` property to use a custom getter.
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
    });

    // Pass the iframe Test
    await page.evaluateOnNewDocument(() => {
        Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
            get: function() {
                return window;
            }
        });
    });

    // Pass toString test, though it breaks console.debug() from working
    await page.evaluateOnNewDocument(() => {
        window.console.debug = () => {
            return null;
        };
    });
};

(async () => {
    // Launch the browser in headless mode and set up a page.
    const browser = await puppeteer.launch({
        args: ['--no-sandbox'],
        headless: true,
    });
    const page = await browser.newPage();

    // Prepare the page for the detection tests.
    await preparePageForTests(page);

    // Navigate to the page that will perform the tests.
    const testUrl = 'https://intoli.com/blog/' +
        'not-possible-to-block-chrome-headless/chrome-headless-test.html';
    await page.goto(testUrl);

    // Save a screenshot of the results.
    await page.screenshot({path: 'headless-test-result.png'});

    // Clean up.
    await browser.close()
})();
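As a quick sanity check (not part of the commit), a line like the following could be placed inside the async IIFE above, after `preparePageForTests(page)` and before the `page.goto(...)`, to confirm the most basic evasion without the full screenshot round-trip:

```js
// Hypothetical snippet: evaluate in the page context on a freshly loaded page.
const webdriverVisible = await page.evaluate(() => navigator.webdriver);
console.log('navigator.webdriver is', webdriverVisible); // expected: undefined when the evasion worked
```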
examples/headless-test-result.png (new binary file, 42 KiB; not shown)
(deleted file, @@ -1 +0,0 @@: a one-line cached JSON search-result fixture for the keyword "scrapeulous.com", dated Thu, 31 Jan 2019)
headless-test-result.png (new binary file, 43 KiB; not shown)
index.js (11 lines changed)

@@ -3,6 +3,7 @@ var fs = require('fs');
 var os = require("os");

 exports.scrape = async function(config, callback) {

     // options for scraping
     event = {
         // the user agent to scrape with
@@ -11,8 +12,9 @@ exports.scrape = async function(config, callback) {
         random_user_agent: true,
         // whether to select manual settings in visible mode
         set_manual_settings: false,
-        // get meta data of scraping in return object
-        write_meta_data: false,
+        // log ip address data
+        log_ip_address: false,
+        // log http headers
         log_http_headers: false,
         // how long to sleep between requests. a random sleep interval within the range [a,b]
         // is drawn before every request. empty string for no sleeping.
@@ -25,6 +27,8 @@ exports.scrape = async function(config, callback) {
         keywords: ['scrapeulous.com'],
         // whether to start the browser in headless mode
         headless: true,
+        // the number of pages to scrape for each keyword
+        num_pages: 1,
         // path to output file, data will be stored in JSON
         output_file: '',
         // whether to prevent images, css, fonts and media from being loaded
@@ -39,6 +43,9 @@ exports.scrape = async function(config, callback) {
         // example: 'socks5://78.94.172.42:1080'
         // example: 'http://118.174.233.10:48400'
         proxy: '',
+        // check if headless chrome escapes common detection techniques
+        // this is a quick test and should be used for debugging
+        test_evasion: false,
     };

     // overwrite default config
package-lock.json (generated, 36 lines changed)

@@ -1,6 +1,6 @@
 {
   "name": "se-scraper",
-  "version": "1.1.7",
+  "version": "1.1.12",
   "lockfileVersion": 1,
   "requires": true,
   "dependencies": {
@@ -124,7 +124,7 @@
     },
     "concat-stream": {
       "version": "1.6.2",
-      "resolved": "http://registry.npmjs.org/concat-stream/-/concat-stream-1.6.2.tgz",
+      "resolved": "https://registry.npmjs.org/concat-stream/-/concat-stream-1.6.2.tgz",
       "integrity": "sha512-27HBghJxjiZtIk3Ycvn/4kbJk/1uZuJFfuPEns6LaEvpvG1f0hTea8lilrouyo9mVc2GWdcEZ8OLoGmSADlrCw==",
       "requires": {
         "buffer-from": "^1.0.0",
@@ -135,7 +135,7 @@
       "dependencies": {
         "readable-stream": {
           "version": "2.3.6",
-          "resolved": "http://registry.npmjs.org/readable-stream/-/readable-stream-2.3.6.tgz",
+          "resolved": "https://registry.npmjs.org/readable-stream/-/readable-stream-2.3.6.tgz",
           "integrity": "sha512-tQtKA9WIAhBF3+VLAseyMqZeBjW0AHJoxOtYqSUZNJxauErmLbVm2FW1y+J/YA9dUrAC39ITejlZWhVIwawkKw==",
           "requires": {
             "core-util-is": "~1.0.0",
@@ -149,7 +149,7 @@
         },
         "string_decoder": {
           "version": "1.1.1",
-          "resolved": "http://registry.npmjs.org/string_decoder/-/string_decoder-1.1.1.tgz",
+          "resolved": "https://registry.npmjs.org/string_decoder/-/string_decoder-1.1.1.tgz",
           "integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==",
           "requires": {
             "safe-buffer": "~5.1.0"
@@ -270,7 +270,7 @@
     },
     "es6-promisify": {
       "version": "5.0.0",
-      "resolved": "http://registry.npmjs.org/es6-promisify/-/es6-promisify-5.0.0.tgz",
+      "resolved": "https://registry.npmjs.org/es6-promisify/-/es6-promisify-5.0.0.tgz",
       "integrity": "sha1-UQnWLz5W6pZ8S2NQWu8IKRyKUgM=",
       "requires": {
         "es6-promise": "^4.0.3"
@@ -458,12 +458,12 @@
     },
     "minimist": {
       "version": "0.0.8",
-      "resolved": "http://registry.npmjs.org/minimist/-/minimist-0.0.8.tgz",
+      "resolved": "https://registry.npmjs.org/minimist/-/minimist-0.0.8.tgz",
       "integrity": "sha1-hX/Kv8M5fSYluCKCYuhqp6ARsF0="
     },
     "mkdirp": {
       "version": "0.5.1",
-      "resolved": "http://registry.npmjs.org/mkdirp/-/mkdirp-0.5.1.tgz",
+      "resolved": "https://registry.npmjs.org/mkdirp/-/mkdirp-0.5.1.tgz",
       "integrity": "sha1-MAV0OOrGz3+MR2fzhkjWaX11yQM=",
       "requires": {
         "minimist": "0.0.8"
@@ -510,7 +510,7 @@
     },
     "path-is-absolute": {
       "version": "1.0.1",
-      "resolved": "http://registry.npmjs.org/path-is-absolute/-/path-is-absolute-1.0.1.tgz",
+      "resolved": "https://registry.npmjs.org/path-is-absolute/-/path-is-absolute-1.0.1.tgz",
       "integrity": "sha1-F0uSaHNVNP+8es5r9TpanhtcX18="
     },
     "pathval": {
@@ -553,9 +553,9 @@
       }
     },
     "puppeteer": {
-      "version": "1.11.0",
-      "resolved": "https://registry.npmjs.org/puppeteer/-/puppeteer-1.11.0.tgz",
-      "integrity": "sha512-iG4iMOHixc2EpzqRV+pv7o3GgmU2dNYEMkvKwSaQO/vMZURakwSOn/EYJ6OIRFYOque1qorzIBvrytPIQB3YzQ==",
+      "version": "1.12.2",
+      "resolved": "https://registry.npmjs.org/puppeteer/-/puppeteer-1.12.2.tgz",
+      "integrity": "sha512-xWSyCeD6EazGlfnQweMpM+Hs6X6PhUYhNTHKFj/axNZDq4OmrVERf70isBf7HsnFgB3zOC1+23/8+wCAZYg+Pg==",
       "requires": {
         "debug": "^4.1.0",
         "extract-zip": "^1.6.6",
@@ -586,11 +586,11 @@
       }
     },
     "rimraf": {
-      "version": "2.6.2",
-      "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.6.2.tgz",
-      "integrity": "sha512-lreewLK/BlghmxtfH36YYVg1i8IAce4TI7oao75I1g245+6BctqTVQiBP3YUJ9C6DQOXJmkYR9X9fCLtCOJc5w==",
+      "version": "2.6.3",
+      "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.6.3.tgz",
+      "integrity": "sha512-mwqeW5XsA2qAejG46gYdENaxXjx9onRNCfn7L0duuP4hCuTIi/QO7PDK07KJfp1d+izWPrzEJDcSqBa0OZQriA==",
       "requires": {
-        "glob": "^7.0.5"
+        "glob": "^7.1.3"
       }
     },
     "safe-buffer": {
@@ -640,9 +640,9 @@
       "integrity": "sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8="
     },
     "ws": {
-      "version": "6.1.2",
-      "resolved": "https://registry.npmjs.org/ws/-/ws-6.1.2.tgz",
-      "integrity": "sha512-rfUqzvz0WxmSXtJpPMX2EeASXabOrSMk1ruMOV3JBTBjo4ac2lDjGGsbQSyxj8Odhw5fBib8ZKEjDNvgouNKYw==",
+      "version": "6.1.3",
+      "resolved": "https://registry.npmjs.org/ws/-/ws-6.1.3.tgz",
+      "integrity": "sha512-tbSxiT+qJI223AP4iLfQbkbxkwdFcneYinM2+x46Gx2wgvbaOMO36czfdfVUBRTHvzAMRhDd98sA5d/BuWbQdg==",
       "requires": {
         "async-limiter": "~1.0.0"
       }
package.json

@@ -1,7 +1,7 @@
 {
   "name": "se-scraper",
-  "version": "1.1.9",
-  "description": "A simple module which uses puppeteer to scrape several search engines.",
+  "version": "1.1.12",
+  "description": "A simple library using puppeteer to scrape several search engines such as Google, Duckduckgo and Bing.",
   "homepage": "https://scrapeulous.com/",
   "main": "index.js",
   "scripts": {
@@ -23,6 +23,6 @@
     "chai": "^4.2.0",
     "cheerio": "^1.0.0-rc.2",
     "got": "^9.6.0",
-    "puppeteer": "^1.9.0"
+    "puppeteer": "^1.12.2"
   }
 }
run.js (17 lines changed)

@@ -6,13 +6,11 @@ let config = {
     user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
     // if random_user_agent is set to True, a random user agent is chosen
     random_user_agent: true,
-    // get meta data of scraping in return object
-    write_meta_data: false,
    // how long to sleep between requests. a random sleep interval within the range [a,b]
    // is drawn before every request. empty string for no sleeping.
    sleep_range: '[1,2]',
    // which search engine to scrape
-    search_engine: 'marketwatch',
+    search_engine: 'google',
    // whether debug information should be printed
    // debug info is useful for developers when debugging
    debug: false,
@@ -20,15 +18,15 @@ let config = {
    // this output is informational
    verbose: true,
    // an array of keywords to scrape
-    keywords: ['MSFT', 'AAPL'],
+    keywords: ['news'],
    // alternatively you can specify a keyword_file. this overwrites the keywords array
    keyword_file: '',
    // the number of pages to scrape for each keyword
    num_pages: 1,
    // whether to start the browser in headless mode
-    headless: false,
+    headless: true,
    // path to output file, data will be stored in JSON
-    output_file: '',
+    output_file: 'data.json',
    // whether to prevent images, css, fonts from being loaded
    // will speed up scraping a great deal
    block_assets: true,
@@ -42,6 +40,13 @@ let config = {
    // example: 'socks5://78.94.172.42:1080'
    // example: 'http://118.174.233.10:48400'
    proxy: '',
+    // check if headless chrome escapes common detection techniques
+    // this is a quick test and should be used for debugging
+    test_evasion: false,
+    // log ip address data
+    log_ip_address: true,
+    // log http headers
+    log_http_headers: true,
 };

 function callback(err, response) {
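Since `log_ip_address` and `log_http_headers` are now enabled in this example config, a callback along the following lines could inspect the returned metadata. This is a sketch, not part of the commit, assuming the `response.metadata` fields introduced elsewhere in this diff (`num_requests`, `ipinfo`, `http_headers`):

```js
// Hypothetical callback for run.js; field names follow this commit's handler changes.
function callback(err, response) {
    if (err) {
        console.error(err);
        return;
    }
    // per-run request accounting added in this commit
    console.log('requests performed:', response.metadata.num_requests);
    // parsed JSON from https://ipinfo.io/json (log_ip_address: true)
    console.log('scraping from IP:', response.metadata.ipinfo.ip);
    // parsed JSON from https://httpbin.org/get (log_http_headers: true)
    console.dir(response.metadata.http_headers, { depth: null });
    // the scraped SERP data itself
    console.dir(response.results, { depth: null, colors: true });
}
```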
@@ -19,7 +19,7 @@ class DuckduckgoScraper extends Scraper {
             });
         });

-        let effective_query = $('#did_you_mean a.js-spelling-suggestion-link').attr('data-query') || '';
+        let effective_query = $('a.js-spelling-suggestion-link').attr('data-query') || '';

         const cleaned = [];
         for (var i=0; i < results.length; i++) {
@@ -68,6 +68,7 @@ class DuckduckgoScraper extends Scraper {

     async wait_for_results() {
         await this.page.waitForSelector('.serp__results', { timeout: 5000 });
+        await this.sleep(250);
     }

     async detected() {
@@ -239,14 +239,7 @@ class GoogleImageScraper extends Scraper {
     }

     async next_page() {
-        let next_page_link = await this.page.$('#pnnext', {timeout: 1000});
-        if (!next_page_link) {
-            return false;
-        }
-        await next_page_link.click();
-        await this.page.waitForNavigation();
-
-        return true;
+        return false;
     }

     async wait_for_results() {
@@ -1,12 +1,11 @@
 const cheerio = require('cheerio');

 module.exports = {
-    get_metadata: get_metadata,
+    get_ip_data: get_ip_data,
     get_http_headers: get_http_headers,
 };

-async function get_metadata(browser) {
-    let metadata = {};
+async function get_ip_data(browser) {
     const page = await browser.newPage();
     await page.goto('https://ipinfo.io/json', {
         waitLoad: true,
@@ -16,17 +15,19 @@ async function get_metadata(browser) {
         timeout: 20000
     });
     const $ = cheerio.load(json);
-    metadata.ipinfo = $('pre').text();
-    return metadata;
+    let ipinfo_text = $('pre').text();
+    return JSON.parse(ipinfo_text);
 }

 async function get_http_headers(browser) {
-    let metadata = {};
     const page = await browser.newPage();
     await page.goto('https://httpbin.org/get', {
         waitLoad: true,
         waitNetworkIdle: true // defaults to false
     });
     let headers = await page.content();
-    return headers;
+
+    const $ = cheerio.load(headers);
+    let headers_text = $('pre').text();
+    return JSON.parse(headers_text);
 }
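As a usage sketch (assumed, not shown in the diff), the renamed helpers now return parsed JSON rather than raw text, so callers can index fields directly; `ip` is the field the handler below checks against the proxy string:

```js
// Hypothetical caller, inside an async function; `browser` is an already-launched puppeteer browser.
const meta = require('./metadata.js'); // module path assumed

let ipinfo = await meta.get_ip_data(browser);
console.log(ipinfo.ip);          // used by the handler for the proxy sanity check

let headers = await meta.get_http_headers(browser);
console.dir(headers.headers);    // httpbin.org/get echoes the request headers under "headers"
```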
@@ -29,6 +29,8 @@ module.exports = class Scraper {
         this.SOLVE_CAPTCHA_TIME = 45000;

         this.results = {};
+        this.result_rank = 1;
+        this.num_requests = 0;
     }

     async run() {
@@ -55,6 +57,9 @@ module.exports = class Scraper {

         this.page = await this.browser.newPage();

+        // prevent detection by evading common detection techniques
+        await evadeChromeHeadlessDetection(this.page);
+
         // block some assets to speed up scraping
         if (this.config.block_assets === true) {
             await this.page.setRequestInterception(true);
@@ -69,6 +74,16 @@ module.exports = class Scraper {
             });
         }

+        if (this.config.test_evasion === true) {
+            // Navigate to the page that will perform the tests.
+            const testUrl = 'https://intoli.com/blog/' +
+                'not-possible-to-block-chrome-headless/chrome-headless-test.html';
+            await this.page.goto(testUrl);
+
+            // Save a screenshot of the results.
+            await this.page.screenshot({path: 'headless-test-result.png'});
+        }
+
         return await this.load_start_page();
     }

@@ -80,18 +95,16 @@ module.exports = class Scraper {
      * @returns {Promise<void>}
      */
     async scraping_loop() {
-        this.result_rank = 1;
-
         for (let keyword of this.config.keywords) {
             this.keyword = keyword;
             this.results[keyword] = {};
+            this.result_rank = 1;

             if (this.pluggable.before_keyword_scraped) {
                 await this.pluggable.before_keyword_scraped({
                     keyword: keyword,
                     page: this.page,
-                    event: this.config,
+                    config: this.config,
                     context: this.context,
                 });
             }
@@ -101,6 +114,9 @@ module.exports = class Scraper {
             try {

                 await this.search_keyword(keyword);
+                // when searching the keyword fails, num_requests will not
+                // be incremented.
+                this.num_requests++;

                 do {

@@ -110,7 +126,7 @@ module.exports = class Scraper {

                     await this.wait_for_results();

-                    if (event.sleep_range) {
+                    if (this.config.sleep_range) {
                         await this.random_sleep();
                     }

@@ -120,11 +136,20 @@ module.exports = class Scraper {

                     page_num += 1;

-                    if (await this.next_page() === false) {
-                        break;
+                    // only load the next page when we will pass the next iteration
+                    // step from the while loop
+                    if (page_num <= this.config.num_pages) {
+
+                        let next_page_loaded = await this.next_page();
+
+                        if (next_page_loaded === false) {
+                            break;
+                        } else {
+                            this.num_requests++;
+                        }
                     }

-                } while (page_num <= event.num_pages);
+                } while (page_num <= this.config.num_pages);

             } catch (e) {

@@ -230,4 +255,131 @@ module.exports = class Scraper {
     async detected() {

     }
 };
+
+// This is where we'll put the code to get around the tests.
+async function evadeChromeHeadlessDetection(page) {
+    // Pass the Webdriver Test.
+    await page.evaluateOnNewDocument(() => {
+        const newProto = navigator.__proto__;
+        delete newProto.webdriver;
+        navigator.__proto__ = newProto;
+    });
+
+    // Pass the Chrome Test.
+    await page.evaluateOnNewDocument(() => {
+        // We can mock this in as much depth as we need for the test.
+        const mockObj = {
+            app: {
+                isInstalled: false,
+            },
+            webstore: {
+                onInstallStageChanged: {},
+                onDownloadProgress: {},
+            },
+            runtime: {
+                PlatformOs: {
+                    MAC: 'mac',
+                    WIN: 'win',
+                    ANDROID: 'android',
+                    CROS: 'cros',
+                    LINUX: 'linux',
+                    OPENBSD: 'openbsd',
+                },
+                PlatformArch: {
+                    ARM: 'arm',
+                    X86_32: 'x86-32',
+                    X86_64: 'x86-64',
+                },
+                PlatformNaclArch: {
+                    ARM: 'arm',
+                    X86_32: 'x86-32',
+                    X86_64: 'x86-64',
+                },
+                RequestUpdateCheckStatus: {
+                    THROTTLED: 'throttled',
+                    NO_UPDATE: 'no_update',
+                    UPDATE_AVAILABLE: 'update_available',
+                },
+                OnInstalledReason: {
+                    INSTALL: 'install',
+                    UPDATE: 'update',
+                    CHROME_UPDATE: 'chrome_update',
+                    SHARED_MODULE_UPDATE: 'shared_module_update',
+                },
+                OnRestartRequiredReason: {
+                    APP_UPDATE: 'app_update',
+                    OS_UPDATE: 'os_update',
+                    PERIODIC: 'periodic',
+                },
+            },
+        };
+
+        window.navigator.chrome = mockObj;
+        window.chrome = mockObj;
+    });
+
+    // Pass the Permissions Test.
+    await page.evaluateOnNewDocument(() => {
+        const originalQuery = window.navigator.permissions.query;
+        window.navigator.permissions.__proto__.query = parameters =>
+            parameters.name === 'notifications'
+                ? Promise.resolve({state: Notification.permission})
+                : originalQuery(parameters);
+
+        // Inspired by: https://github.com/ikarienator/phantomjs_hide_and_seek/blob/master/5.spoofFunctionBind.js
+        const oldCall = Function.prototype.call;
+        function call() {
+            return oldCall.apply(this, arguments);
+        }
+        Function.prototype.call = call;
+
+        const nativeToStringFunctionString = Error.toString().replace(/Error/g, "toString");
+        const oldToString = Function.prototype.toString;
+
+        function functionToString() {
+            if (this === window.navigator.permissions.query) {
+                return "function query() { [native code] }";
+            }
+            if (this === functionToString) {
+                return nativeToStringFunctionString;
+            }
+            return oldCall.call(oldToString, this);
+        }
+        Function.prototype.toString = functionToString;
+    });
+
+    // Pass the Plugins Length Test.
+    await page.evaluateOnNewDocument(() => {
+        // Overwrite the `plugins` property to use a custom getter.
+        Object.defineProperty(navigator, 'plugins', {
+            // This just needs to have `length > 0` for the current test,
+            // but we could mock the plugins too if necessary.
+            get: () => [1, 2, 3, 4, 5]
+        });
+    });
+
+    // Pass the Languages Test.
+    await page.evaluateOnNewDocument(() => {
+        // Overwrite the `languages` property to use a custom getter.
+        Object.defineProperty(navigator, 'languages', {
+            get: () => ['en-US', 'en']
+        });
+    });
+
+    // Pass the iframe Test
+    await page.evaluateOnNewDocument(() => {
+        Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
+            get: function() {
+                return window;
+            }
+        });
+    });
+
+    // Pass toString test, though it breaks console.debug() from working
+    await page.evaluateOnNewDocument(() => {
+        window.console.debug = () => {
+            return null;
+        };
+    });
+}
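The request accounting introduced above means each keyword costs one request for the initial search plus one for every successfully loaded next page. A standalone sketch of that counting logic, with simplified, hypothetical names rather than the actual class internals:

```js
// Hypothetical illustration of the num_requests accounting for a single keyword.
async function countRequestsForKeyword(scraper, num_pages) {
    let num_requests = 0;
    let page_num = 1;

    await scraper.search_keyword();      // initial SERP load
    num_requests++;

    do {
        // ...parse results for page_num here...
        page_num += 1;
        // only try to load another page if the loop will run again
        if (page_num <= num_pages) {
            if (await scraper.next_page() === false) {
                break;                   // no further pages available
            }
            num_requests++;              // a next page was actually loaded
        }
    } while (page_num <= num_pages);

    return num_requests;
}
```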
@@ -100,26 +100,24 @@ module.exports.handler = async function handler (event, context, callback) {
         browser = await puppeteer.launch(launch_args);
     }

-    if (config.log_http_headers === true) {
-        headers = await meta.get_http_headers(browser);
-        console.dir(headers);
-    }
-
     let metadata = {};

-    if (config.write_meta_data === true) {
-        metadata = await meta.get_metadata(browser);
+    if (config.log_http_headers === true) {
+        metadata.http_headers = await meta.get_http_headers(browser);
+    }
+
+    if (config.log_ip_address === true) {
+        metadata.ipinfo = await meta.get_ip_data(browser);
     }

     // check that our proxy is working by confirming
     // that ipinfo.io sees the proxy IP address
-    if (config.proxy && config.write_meta_data === true) {
+    if (config.proxy && config.log_ip_address === true) {
         console.log(`${metadata.ipinfo} vs ${config.proxy}`);

         try {
-            let ipdata = JSON.parse(metadata.ipinfo);
             // if the ip returned by ipinfo is not a substring of our proxystring, get the heck outta here
-            if (!config.proxy.includes(ipdata.ip)) {
+            if (!config.proxy.includes(metadata.ipinfo.ip)) {
                 console.error('Proxy not working properly.');
                 await browser.close();
                 return;
@@ -153,13 +151,13 @@ module.exports.handler = async function handler (event, context, callback) {
         if (Scraper === undefined) {
             console.info('Currently not implemented search_engine: ', config.search_engine);
         } else {
-            let scraper = new Scraper({
+            scraperObj = new Scraper({
                 browser: browser,
                 config: config,
                 context: context,
                 pluggable: pluggable,
             });
-            var results = await scraper.run();
+            results = await scraperObj.run();
         }

         if (pluggable.close_browser) {
@@ -168,13 +166,13 @@ module.exports.handler = async function handler (event, context, callback) {
             await browser.close();
         }

-        let num_keywords = config.keywords.length || 0;
+        let num_requests = scraperObj.num_requests;
         let timeDelta = Date.now() - startTime;
-        let ms_per_keyword = timeDelta/num_keywords;
+        let ms_per_request = timeDelta/num_requests;

         if (config.verbose === true) {
-            console.log(`Scraper took ${timeDelta}ms to scrape ${num_keywords} keywords.`);
-            console.log(`On average ms/keyword: ${ms_per_keyword}ms/keyword`);
+            console.log(`Scraper took ${timeDelta}ms to perform ${num_requests} requests.`);
+            console.log(`On average ms/request: ${ms_per_request}ms/request`);
             console.dir(results, {depth: null, colors: true});
         }

@@ -191,19 +189,18 @@ module.exports.handler = async function handler (event, context, callback) {
         });
     }

-    if (config.write_meta_data === true) {
-        metadata.id = `${config.job_name} ${config.chunk_lines}`;
-        metadata.chunk_lines = config.chunk_lines;
-        metadata.elapsed_time = timeDelta.toString();
-        metadata.ms_per_keyword = ms_per_keyword.toString();
+    metadata.id = `${config.job_name} ${config.chunk_lines}`;
+    metadata.chunk_lines = config.chunk_lines;
+    metadata.elapsed_time = timeDelta.toString();
+    metadata.ms_per_keyword = ms_per_request.toString();
+    metadata.num_requests = num_requests;

     if (config.verbose === true) {
         console.log(metadata);
     }

     if (pluggable.handle_metadata) {
         await pluggable.handle_metadata({metadata: metadata, config: config});
-        }
     }

     if (config.output_file) {
@@ -249,8 +246,8 @@ function parseEventData(config) {
         config.upload_to_s3 = _bool(config.upload_to_s3);
     }

-    if (config.write_meta_data) {
-        config.write_meta_data = _bool(config.write_meta_data);
+    if (config.log_ip_address) {
+        config.log_ip_address = _bool(config.log_ip_address);
     }

     if (config.log_http_headers) {
@@ -36,11 +36,10 @@ function normal_search_test_case(err, response) {
     } else {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 6);

-        let total_rank = 1;
-
         for (let query in response.results) {
+            let total_rank = 1;
             assert.containsAllKeys(response.results, normal_search_keywords, 'not all keywords were scraped.');

             for (let page_number in response.results[query]) {
@@ -85,7 +84,7 @@ function normal_search_test_case(err, response) {
     }
 }

-const keywords_no_results = ['fgskl34440abJAksafkl34a44dsflkjaQQuBBdfk',];
+const keywords_no_results = ['2342kljp;fj9834u40abJAkasdlfkjsladfkjasfdas;lk3453-934023safkl34a44dsflkjaQQuBBdfk',];

 async function no_results_test() {
     let config = {
@@ -113,6 +112,8 @@ function test_case_no_results(err, response) {
     } else {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 1);

         results = response.results;
         for (let query in response.results) {
@@ -165,6 +166,7 @@ function test_case_effective_query(err, response) {

         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 1);

         results = response.results;
         for (let query in response.results) {
@@ -17,7 +17,7 @@ async function normal_search_test() {
         keywords: normal_search_keywords,
         keyword_file: '',
         num_pages: 2,
-        headless: true,
+        headless: false,
         output_file: '',
         block_assets: true,
         user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
@@ -36,10 +36,10 @@ function normal_search_test_case(err, response) {
     } else {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 4);

-        let total_rank = 1;
-
         for (let query in response.results) {
+            let total_rank = 1;

             assert.containsAllKeys(response.results, normal_search_keywords, 'not all keywords were scraped.');

@@ -112,6 +112,7 @@ function test_case_effective_query(err, response) {

         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 1);

         results = response.results;
         for (let query in response.results) {
@ -36,10 +36,10 @@ function normal_search_test_case(err, response) {
|
|||||||
} else {
|
} else {
|
||||||
assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
|
assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
|
||||||
assert.equal(response.statusCode, 200, 'status code must be 200');
|
assert.equal(response.statusCode, 200, 'status code must be 200');
|
||||||
|
assert.equal(response.metadata.num_requests, 6);
|
||||||
let total_rank = 1;
|
|
||||||
|
|
||||||
for (let query in response.results) {
|
for (let query in response.results) {
|
||||||
|
let total_rank = 1;
|
||||||
|
|
||||||
assert.containsAllKeys(response.results, normal_search_keywords, 'not all keywords were scraped.');
|
assert.containsAllKeys(response.results, normal_search_keywords, 'not all keywords were scraped.');
|
||||||
|
|
||||||
@@ -59,7 +59,7 @@ function normal_search_test_case(err, response) {
 
         for (let res of obj.results) {
 
-            assert.containsAllKeys(res, ['link', 'title', 'rank', 'visible_link', 'rank'], 'not all keys are in the SERP object');
+            assert.containsAllKeys(res, ['link', 'title', 'rank', 'visible_link'], 'not all keys are in the SERP object');
 
             assert.isOk(res.link, 'link must be ok');
             assert.typeOf(res.link, 'string', 'link must be string');
@@ -113,6 +113,8 @@ function test_case_no_results(err, response) {
     } else {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 1);
+
         results = response.results;
         for (let query in response.results) {
 
@@ -165,6 +167,7 @@ function test_case_effective_query(err, response) {
 
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 1);
 
         results = response.results;
         for (let query in response.results) {
85 test/test_googleimage.js Normal file
@@ -0,0 +1,85 @@
+const se_scraper = require('./../index.js');
+var assert = require('chai').assert;
+
+/*
+ * Use chai and mocha for tests.
+ * https://mochajs.org/#installation
+ */
+
+const normal_search_keywords = ['apple', 'rain'];
+
+async function normal_image_search_test() {
+    let config = {
+        search_engine: 'google_image',
+        compress: false,
+        debug: false,
+        verbose: false,
+        keywords: normal_search_keywords,
+        keyword_file: '',
+        num_pages: 2,
+        headless: true,
+        output_file: '',
+        block_assets: true,
+        user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
+        random_user_agent: false,
+    };
+
+    console.log('normal_image_search_test()');
+    await se_scraper.scrape(config, normal_image_search_test_case);
+}
+
+// we test with a callback function to our handler
+function normal_image_search_test_case(err, response) {
+
+    if (err) {
+        console.error(err);
+    } else {
+        assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
+        assert.equal(response.statusCode, 200, 'status code must be 200');
+        assert.equal(response.metadata.num_requests, 2);
+
+        for (let query in response.results) {
+
+            let total_rank = 1;
+
+            assert.containsAllKeys(response.results, normal_search_keywords, 'not all keywords were scraped.');
+
+            for (let page_number in response.results[query]) {
+
+                assert.isNumber(parseInt(page_number), 'page_number must be numeric');
+
+                let obj = response.results[query][page_number];
+
+                assert.containsAllKeys(obj, ['results', 'time', 'no_results', 'effective_query'], 'not all keys are in the object');
+
+                assert.isAtLeast(obj.results.length, 15, 'results must have at least 15 SERP objects');
+                assert.equal(obj.no_results, false, 'no results should be false');
+                assert.typeOf(Date.parse(obj.time), 'number', 'time should be a valid date');
+
+                for (let res of obj.results) {
+
+                    assert.containsAllKeys(res, ['link', 'snippet', 'rank', 'clean_link'], 'not all keys are in the SERP object');
+
+                    assert.isOk(res.link, 'link must be ok');
+                    assert.typeOf(res.link, 'string', 'link must be string');
+                    assert.isAtLeast(res.link.length, 5, 'link must have at least 5 chars');
+
+                    assert.isOk(res.clean_link, 'clean_link must be ok');
+                    assert.typeOf(res.clean_link, 'string', 'clean_link must be string');
+                    assert.isAtLeast(res.clean_link.length, 5, 'clean_link must have at least 5 chars');
+
+                    assert.isOk(res.snippet, 'snippet must be ok');
+                    assert.typeOf(res.snippet, 'string', 'snippet must be string');
+                    assert.isAtLeast(res.snippet.length, 10, 'snippet must have at least 10 chars');
+
+                    assert.isNumber(res.rank, 'rank must be integer');
+                    assert.equal(res.rank, total_rank++, 'rank is wrong');
+                }
+            }
+        }
+    }
+}
+
+(async () => {
+    await normal_image_search_test();
+})();
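For orientation, here is a minimal sketch (not part of the diff) of how the `google_image` engine exercised by this new test might be driven through the same `se_scraper.scrape(config, callback)` entry point the test calls; the result keys read in the callback are the ones the test asserts (`link`, `clean_link`, `snippet`, `rank`), and the package name `se-scraper` is assumed here instead of the test's relative `require('./../index.js')`:

```js
// Hypothetical standalone usage of the 'google_image' engine covered by test/test_googleimage.js.
// Config keys and result shape mirror the assertions in the test above; treat as a sketch.
const se_scraper = require('se-scraper');

let config = {
    search_engine: 'google_image',
    keywords: ['apple', 'rain'],
    num_pages: 1,
    headless: true,
};

function print_image_results(err, response) {
    if (err) {
        console.error(err);
        return;
    }
    // results are keyed by keyword, then by page number, as in the test assertions
    for (let query in response.results) {
        for (let page_number in response.results[query]) {
            for (let res of response.results[query][page_number].results) {
                // each SERP object carries link, clean_link, snippet and rank
                console.log(`${res.rank}. ${res.clean_link}`);
            }
        }
    }
}

se_scraper.scrape(config, print_image_results);
```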
@@ -37,10 +37,8 @@ function reuters_search_test_case(err, response) {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
 
-        let total_rank = 1;
-
         for (let query in response.results) {
+            let total_rank = 1;
             assert.containsAllKeys(response.results, quote_search_keywords, 'not all keywords were scraped.');
 
             for (let page_number in response.results[query]) {
@@ -108,10 +106,8 @@ function cnbc_search_test_case(err, response) {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
 
-        let total_rank = 1;
-
         for (let query in response.results) {
+            let total_rank = 1;
             assert.containsAllKeys(response.results, quote_search_keywords, 'not all keywords were scraped.');
 
             for (let page_number in response.results[query]) {
@@ -177,10 +173,8 @@ function marketwatch_search_test_case(err, response) {
         assert.equal(response.headers['Content-Type'], 'text/json', 'content type is not text/json');
         assert.equal(response.statusCode, 200, 'status code must be 200');
 
-        let total_rank = 1;
-
         for (let query in response.results) {
+            let total_rank = 1;
             assert.containsAllKeys(response.results, marketwatch_search_keywords, 'not all keywords were scraped.');
 
             for (let page_number in response.results[query]) {