Crawling website data using Node.js

In this article, we will learn techniques for crawling website data using DOM parsing in Node.js.

Photo by Carlos Muza on Unsplash

Perhaps you have heard of website crawling before. It is a fairly common technique; Googlebot, for example, is a form of crawler.

The crawler technique has many practical applications, such as building a news reader application by crawling data from major newspapers, or crawling recruitment information from Facebook, etc.

There are many ways to create a web crawler, and there are also plenty of frameworks to support it. Python, for example, has the well-known Scrapy framework.

#What is website crawling?
#Technical demo of a website crawler
Some requirements before implementation
#Building the website crawler
1. Install dependencies
2. Set up a web server with Express
3. Create request and response helpers
4. Create the crawler file for scotch.io data
5. Extract data from the website
#Summary

#What is website crawling?

To put it plainly, web crawling is a technique of gathering data from websites on the internet via given links. The crawler accesses each link, downloads the page's data, and also looks for further internal links to continue downloading from.

If, during the data collection process, you only filter out the information necessary for your needs, it is called web scraping.

The two concepts of web crawling and web scraping largely overlap; the difference is mainly whether you collect everything or only the data you need.

The technique of crawling data from a website

For example, with shopee.com, a web crawler will collect the entire content of the website (product names, descriptions, prices, manuals, reviews, comments, etc.). Web scraping, however, may only collect the specific information you need, such as product prices for a price-comparison application.

The crawled data can be stored in your database for analysis or other purposes.

Note: crawling data from a website may not be allowed by the owner of that website. Legally, you need to get permission from the website owner. I am not responsible for any problems that arise.

#Technical demo of a website crawler

To illustrate this crawling technique, I will guide you through building a bot to crawl data from Scotch.io (a site famous for teaching programming).

We will crawl the profile data of an author, as well as their posts. Then we will build a RESTful API that returns this data for later use in our app.

Below is a screenshot of the demo application generated based on the API we have built in this article.

Scotch.io

As I said, the crawling technique can be implemented in most modern programming languages that support HTTP and XML/DOM parsing, such as PHP, Python, Java, JavaScript, etc.

In this article, I will use JavaScript in the Node.js environment to perform the crawling. You need basic knowledge of JavaScript to follow along and practice crawling websites more easily.

Before starting to code, you also need Node.js and npm installed on your machine.

Also, I use some third-party libraries (dependencies) to support crawling, such as:

  • Cheerio — supports very simple DOM parsing. This library is lightweight, easy to use, and fast.
  • Axios — supports fetching webpage content through HTTP(S) requests.
  • Express — the popular web application framework. Perhaps there is no need to say anything more about it.
  • Lodash — a JavaScript utility library with many commonly used functions for arrays, numbers, objects, strings, etc.

#Building the website crawler

As always, I will try to write in as much detail as possible so that it is easy to read and follow. If anything is unclear, just ask!

First, create a new Node.js project and install the libraries needed for the project.

# Create a new directory
mkdir scotch-scraping
# cd into the new directory
cd scotch-scraping
# Initiate a new package and install app dependencies
npm init -y
npm install express morgan axios cheerio lodash

We will create a simple HTTP server using Express. Create a new server.js file in your project root directory, then add the following code:

/* server.js */
// Require dependencies
const logger = require('morgan');
const express = require('express');
// Create an Express application
const app = express();
// Configure the app port
const port = process.env.PORT || 3000;
app.set('port', port);
// Load middlewares
app.use(logger('dev'));
// Start the server and listen on the preconfigured port
app.listen(port, () => console.log(`App started on port ${port}.`));

Then, you edit the package.json file to make it easier to run the server. Add the following code:

"scripts": {
"start": "node server.js"
}

With the above code, from now on, instead of typing node server.js to launch the server, you only need to type npm start.

If that were all, adding scripts to package.json would not bring much benefit. However, when you later need extra tasks to run every time you start the server, such as copying configuration files or generating code before startup, you only have to configure them in this start script.
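For example, npm automatically runs a `prestart` script before `start`, so a pre-launch task could be wired in like this (the `generate-config.js` script here is purely hypothetical):

```json
"scripts": {
  "prestart": "node generate-config.js",
  "start": "node server.js"
}
```

With this, `npm start` would first execute the hypothetical config generator and then launch the server.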

In this section, we will create some functions for reuse throughout the application.
Create a new helpers.js file in the app directory.

/* app/helpers.js */
const _ = require('lodash');
const axios = require('axios');
const cheerio = require('cheerio');

In the code above, we import the libraries needed for the helper. Now is the time to write the content for the helper.

First, we'll create one that makes returning JSON data to the requester simpler.

/* app/helpers.js */
/**
 * Handles the request (Promise) when it is fulfilled
 * and sends a JSON response to the HTTP response stream (res).
 */
const sendResponse = res => async request => {
  return await request
    .then(data => res.json({ status: "success", data }))
    .catch(({ status: code = 500 }) =>
      res.status(code).json({ status: "failure", code, message: code == 404 ? 'Not found.' : 'Request failed.' })
    );
};

Examples of uses of this function are as follows:

app.get('/path', (req, res, next) => {
  const request = Promise.resolve([1, 2, 3, 4, 5]);
  sendResponse(res)(request);
});

Let me explain in more detail: when the server receives a GET /path request, suppose the handler does some work whose result is [1, 2, 3, 4, 5]. The sendResponse() function then takes care of returning JSON to the requester:

{
  "status": "success",
  "data": [1, 2, 3, 4, 5]
}
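For completeness, if the promise is rejected, say with a 404 status, the catch branch of sendResponse() shown earlier would respond with:

```json
{
  "status": "failure",
  "code": 404,
  "message": "Not found."
}
```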

Next is the function that fetches the HTML from any URL.

/**
 * Loads the html string returned for the given URL
 * and sends a Cheerio parser instance of the loaded HTML
 */
const fetchHtmlFromUrl = async url => {
  return await axios
    .get(enforceHttpsUrl(url))
    .then(response => cheerio.load(response.data))
    .catch(error => {
      error.status = (error.response && error.response.status) || 500;
      throw error;
    });
};

As its name suggests, when you call this function, the result is a Cheerio instance loaded with the entire HTML of the URL. From this "heap" of HTML, we will pick out the data we need.

This helpers.js file contains many other functions as well, but because the article is already long, I cannot list them all here. Grab the source code to use them; if there is anything you do not understand, leave a comment below.
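Two of those helpers, compose and enforceHttpsUrl, are used heavily later on. Here is a minimal sketch of what they might look like; this is my own reconstruction, not necessarily the exact code in the source:

```javascript
// Reconstructed sketch of two helpers referenced later in the article.
// Right-to-left function composition: compose(f, g)(x) === f(g(x))
const compose = (...fns) => arg => fns.reduceRight((acc, fn) => fn(acc), arg);

// Force a URL to use https://; non-strings become null
const enforceHttpsUrl = url =>
  typeof url === 'string' ? url.replace(/^(https?:)?\/\//, 'https://') : null;

console.log(enforceHttpsUrl('http://scotch.io')); // 'https://scotch.io'
console.log(enforceHttpsUrl('//scotch.io'));      // 'https://scotch.io'
console.log(compose(x => x + 1, x => x * 2)(3));  // 7
```
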

All the preparation needed for crawling the data is now done. It is time to write the functions that crawl and parse data from the website.

Create the scotch.js file in the app directory and add the following code:

/* app/scotch.js */
const _ = require('lodash');
// Import helper functions
const {
  compose,
  composeAsync,
  extractNumber,
  enforceHttpsUrl,
  fetchHtmlFromUrl,
  extractFromElems,
  fromPairsToObject,
  fetchElemInnerText,
  fetchElemAttribute,
  extractUrlAttribute
} = require("./helpers");
// scotch.io (Base URL)
const SCOTCH_BASE = "https://scotch.io";
///////////////////////////////////////////////////////////////////
// HELPER FUNCTIONS
///////////////////////////////////////////////////////////////////
/*
 * Resolves the url as relative to the base scotch url
 * and returns the full URL
 */
const scotchRelativeUrl = url =>
  _.isString(url) ? `${SCOTCH_BASE}${url.replace(/^\/*?/, "/")}` : null;
/**
 * A composed function that extracts a url from element attribute,
 * resolves it to the Scotch base url and returns the url with https
 */
const extractScotchUrlAttribute = attr =>
  compose(enforceHttpsUrl, scotchRelativeUrl, fetchElemAttribute(attr));

Pay particular attention to the scotchRelativeUrl() function: its purpose is to return the full URL for whatever URL fragment we pass in.

For example:

scotchRelativeUrl('tutorials');
// returns => 'https://scotch.io/tutorials'
scotchRelativeUrl('//tutorials');
// returns => 'https://scotch.io///tutorials'
scotchRelativeUrl('http://domain.com');
// returns => 'https://scotch.io/http://domain.com'

In this part, we will proceed to extract the necessary information, such as:

  • social links (Facebook, Twitter, GitHub, etc.)
  • profile (name, role, avatar, etc.)
  • stats (total views, total posts, etc.)
  • posts

However, because the article is already long, I will only explain the first part (getting an author's social links). For the rest, please refer to the source code.

To extract someone's social-link data on scotch.io, I will define an extractSocialUrl() function in the scotch.js file. Its purpose is to extract the social network name and the URL from an <a> tag.

Here is an example of such an <a> element from an author's profile on Scotch:

<a href="https://github.com/gladchinda" target="_blank" title="GitHub">
<span class = "icon-github-icon">
<svg xmlns = "http://www.w3.org/2000/svg" xmlns: xlink = "http://www.w3.org/1999/xlink" version = "1.1" id = "Capa_1" x = "0px" y = "0px" width = "50" height = "50" viewBox = "0 0 512 512" style = "enable-background: new 0 0 512 512;" xml: space = "preserve">
...
</svg>
</span>
</a>

When we call extractSocialUrl(), the result is an object like this:

{github: 'https://github.com/gladchinda'}

The complete code of the social-link extraction function is as follows:

/* app/scotch.js */
///////////////////////////////////////////////////////////////////
// EXTRACTION FUNCTIONS
///////////////////////////////////////////////////////////////////
/**
 * Extract a single social URL pair from container element
 */
const extractSocialUrl = elem => {
  // Find all social-icon <span> elements
  const icon = elem.find('span.icon');
  // Regex for social classes
  const regex = /^(?:icon|color)-(.+)$/;
  // Extracts only social classes from the class attribute
  const onlySocialClasses = regex => (classes = '') => classes
    .replace(/\s+/g, ' ')
    .split(' ')
    .filter(classname => regex.test(classname));
  // Gets the social network name from a class name
  const getSocialFromClasses = regex => classes => {
    let social = null;
    const [classname = null] = classes;
    if (_.isString(classname)) {
      const [, name = null] = classname.match(regex);
      social = name ? _.snakeCase(name) : null;
    }
    return social;
  };
  // Extract the href URL from the element
  const href = extractUrlAttribute('href')(elem);
  // Get the social-network name using a composed function
  const social = compose(
    getSocialFromClasses(regex),
    onlySocialClasses(regex),
    fetchElemAttribute('class')
  )(icon);
  // Return an object of social-network-name (key) and social-link (value)
  // Else return null if no social-network-name was found
  return social && { [social]: href };
};

I will explain a little bit:

  • First, I fetch the <span> elements with the class icon. I also define a regular expression to match the social class names.
  • The onlySocialClasses() function is responsible for extracting all the social-related classes.

A concrete example to make it easier to understand: this function returns the social-related classes.

const regex = /^(?:icon|color)-(.+)$/;
const extractSocial = onlySocialClasses(regex);
const classNames = 'first-class another-class color-twitter icon-github';
extractSocial(classNames);
// returns ['color-twitter', 'icon-github']

Next, to extract the social network name, we use the extractSocialName() function.

const regex = /^(?:icon|color)-(.+)$/;
const extractSocialName = getSocialFromClasses(regex);
const classNames = ['color-twitter', 'icon-github'];
extractSocialName(classNames); // returns 'twitter'
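The two snippets above can be combined into a self-contained, runnable sketch. Note that I substitute plain JavaScript for lodash's _.isString and _.snakeCase here to keep it dependency-free:

```javascript
// Self-contained sketch of the two class-filtering helpers,
// with plain JS standing in for the lodash calls used in the article.
const regex = /^(?:icon|color)-(.+)$/;

// Keep only class names that look like social classes (icon-* / color-*)
const onlySocialClasses = regex => (classes = '') => classes
  .replace(/\s+/g, ' ')
  .split(' ')
  .filter(classname => regex.test(classname));

// Pull the social-network name out of the first matching class
const getSocialFromClasses = regex => classes => {
  const [classname = null] = classes;
  if (typeof classname !== 'string') return null;
  const [, name = null] = classname.match(regex) || [];
  return name || null;
};

const socials = onlySocialClasses(regex)('first-class color-twitter icon-github');
console.log(socials);                              // ['color-twitter', 'icon-github']
console.log(getSocialFromClasses(regex)(socials)); // 'twitter'
```
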

Finally, we extract the URL from the href attribute. The result is as follows:

{twitter: 'https://twitter.com/gladchinda'}
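Each call to extractSocialUrl() yields a single pair like this, or null when no social class was found. To build the final social-links object, those pairs need to be merged. Here is a minimal sketch of how that combination could work; the actual fromPairsToObject helper in the source may differ:

```javascript
// Hypothetical sketch: merge an array of { name: url } pairs
// (skipping null entries) into a single object.
const mergeSocialPairs = pairs =>
  pairs.filter(Boolean).reduce((acc, pair) => Object.assign(acc, pair), {});

const merged = mergeSocialPairs([
  { github: 'https://github.com/gladchinda' },
  null, // an <a> tag with no recognizable social class
  { twitter: 'https://twitter.com/gladchinda' }
]);
console.log(merged);
// { github: 'https://github.com/gladchinda', twitter: 'https://twitter.com/gladchinda' }
```
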

#Summary

So, I have guided you step by step through crawling a website. Different sites will have different HTML structures, so you may need to adjust the extractors accordingly, but overall the approach stays the same.

You can download the full source code of the tutorial here: https://github.com/sirhappy/demo-crawler-website
