Crawl website data using NodeJS

Together we will learn techniques to crawl website data with DOM parsing in Node.js.


Perhaps you have heard about website crawling somewhere before. It is a fairly common technique; the Google bot, for example, is also a form of crawler.

The crawler technique has many practical applications, such as building a newspaper reader application by crawling data from major news websites, crawling recruitment information from Facebook, and so on.

There are many ways to create a web crawler, and plenty of frameworks to support it. Python, for example, has the famous Scrapy.

# What is website crawling?
# Technical demo of a website crawler
Some requirements before implementation
# Build a website crawler
1. Install dependencies
2. Set up a web server with Express
3. Create request and response helpers
4. Create the crawler file for scotch data
5. Extract data from the website

# What is website crawling?

To put it plainly, web crawling is a technique for gathering data from websites on the internet by following given links. The crawler visits a link, downloads all of its data, and also looks for more internal links to visit.

If, in the data collection process, you only filter the necessary information for your needs, it is called Web Scraping.

The two concepts, web crawling and web scraping, are largely the same; where they differ, the difference is small.

The technique of crawling data from a website

For example, web crawling will collect the entire content of a website (product names, descriptions, prices, manuals, reviews, comments, etc.). Web scraping, however, may only collect the information you actually need, such as product prices for a price-comparison application.

The crawled data can be stored in your database for analysis or for other purposes.

Note: crawling data from a website may not be allowed by its owner. Legally, you need to get permission from the website owner first. I'm not responsible for any problems that arise.

# Technical demo of a website crawler

To illustrate the crawling technique, I will guide you through building a bot to crawl data from scotch (a site famous for teaching programming).

We will crawl the profile data of an author, as well as his posts. Then, build a RESTful API to return that data for later use in our app.

Below is a screenshot of the demo application generated based on the API we have built in this article.

As I said, crawling can be done in most modern programming languages that support HTTP and XML/DOM parsing, such as PHP, Python, Java, JavaScript, etc.

In this article, I will use JavaScript in the Node.js environment to perform the crawling, so you need basic knowledge of JavaScript to read along and practice crawling websites more easily.

Before starting to code along with this article, you also need Node.js and npm installed on your machine.

Also, I use some third-party libraries (dependencies) to support crawling:

  • Cheerio — supports very simple, fast DOM parsing. This library is lightweight and easy to use.
  • Axios — supports fetching the content of a webpage through HTTP(S) requests.
  • Express — the popular web application framework. Perhaps there's no need to say anything more about it.
  • Lodash — a utility JavaScript library with many commonly used functions for arrays, numbers, objects, strings, etc.

# Build a website crawler

As always, to make this easy to read and follow, I will try to write in as much detail as possible. If anything is unclear, ask right away!

First, create a new Node.js project, then install the libraries needed for the project.
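For reference, the setup might look like this on the command line (the project name is just a placeholder):

```shell
# Create a new project directory and initialize it
mkdir node-crawler && cd node-crawler
npm init -y

# Install the dependencies used in this tutorial
npm install cheerio axios express lodash
```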

We will create a simple HTTP server using Express. Create a new server.js file in your project root directory, then add the following code:
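A minimal sketch of such a server might look like this (the route and port are placeholders, not necessarily the original code):

```javascript
// server.js - a minimal Express server (sketch)
const express = require('express');

const app = express();
const PORT = process.env.PORT || 3000;

// A simple route to verify the server is up
app.get('/', (req, res) => {
  res.json({ message: 'Crawler API is running' });
});

app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});
```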

Then, edit the package.json file to make running the server easier. Add the following code:
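The scripts section might look like this (merge it into your existing package.json rather than replacing the whole file):

```json
{
  "scripts": {
    "start": "node server.js"
  }
}
```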

With the above change, instead of typing node server.js to launch the server, you can type npm start.

If that were all, adding scripts to package.json wouldn't be of much benefit. However, later, when you need to do more tasks every time you run the server (such as copying configuration files or generating code before startup), you only have to configure them in this start script.

In this section, we will create some functions for reuse throughout the application.
Create a new helpers.js file in the project root directory.
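The file might start by pulling in the libraries the helpers rely on, something like:

```javascript
// helpers.js - import the third-party libraries the helpers rely on
const axios = require('axios');
const cheerio = require('cheerio');
const _ = require('lodash');
```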

In the code above, we import the libraries needed for the helpers. Now it's time to write the helper functions themselves.

First, we'll create one to make returning JSON data to the requester simpler.
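A sketch of such a sendResponse() helper is shown below. The exact shape of the original may differ; here it is a curried function that wraps an Express-style response object and resolves a promise before sending the result as JSON:

```javascript
// sendResponse(res) returns a function that awaits a promise ("fetcher")
// and sends its result as JSON, with a simple success/failure envelope.
const sendResponse = res => async (fetcher) => {
  try {
    const data = await fetcher;
    return res.json({ status: 'success', data });
  } catch (error) {
    return res.status(500).json({ status: 'failure', error: String(error) });
  }
};

module.exports = { sendResponse };
```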

Examples of uses of this function are as follows:
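A hypothetical route using the helper might look like this (the path and result are illustrative only):

```javascript
// Usage sketch: an Express route that delegates its JSON response
// to the sendResponse() helper from helpers.js
const express = require('express');
const { sendResponse } = require('./helpers');

const app = express();

app.get('/path', (req, res) => {
  // Do something (ABC XYZ) that eventually produces [1, 2, 3, 4, 5]...
  const result = Promise.resolve([1, 2, 3, 4, 5]);
  // ...then let the helper serialize it as JSON for the requester
  return sendResponse(res)(result);
});
```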

Let me explain in more detail: when the server receives a request, GET /path, assume we do something (ABC XYZ) and the result is [1, 2, 3, 4, 5]. The sendResponse() function then takes care of returning that JSON to the requester.

Next is a function to get the HTML of any URL.
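A sketch of such a helper is below; the function name is an assumption, and real code should add error handling around the request:

```javascript
// Fetch the raw HTML of a URL and load it into cheerio for DOM parsing.
const axios = require('axios');
const cheerio = require('cheerio');

const fetchHtmlFromUrl = async (url) => {
  const response = await axios.get(url); // GET the page over HTTP(S)
  return cheerio.load(response.data);    // return a cheerio instance ($)
};

module.exports = { fetchHtmlFromUrl };
```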

As its name suggests, when you call this function, the result will be the entire HTML of the URL. From this "heap" of HTML, we will pick out the necessary data.

There are several other functions in this helpers.js file, but since the article is already long, I can't list them all here. Grab the source code to use them; if there is anything you don't understand, leave a comment below.

All the preparation needed for crawling the data is done. Now it's time to write the functions that crawl and analyze data from the website.

Create the scotch.js file in the app directory and add the following code:
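The file might begin like this; the base URL https://scotch.io is an assumption, and the helper simply prefixes relative paths with it:

```javascript
// scotch.js - crawler functions for scotch data
// Assumption: the site's base URL is https://scotch.io
const SCOTCH_BASE = 'https://scotch.io';

// Return a full URL for a relative path (e.g. '/@author' or 'tutorials')
const scotchRelativeUrl = url =>
  typeof url === 'string'
    ? `${SCOTCH_BASE}${url.startsWith('/') ? '' : '/'}${url}`
    : null;

module.exports = { scotchRelativeUrl };
```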

In it, keep in mind the scotchRelativeUrl() function: its purpose is to return a full URL when we pass it a relative URL. For example, passing '/@author' would return 'https://scotch.io/@author' (assuming scotch.io as the base URL).

In this part, we will proceed to extract the necessary information, such as:

  • social links (Facebook, Twitter, GitHub, etc.)
  • profile (name, role, avatar, etc.)
  • stats (total views, total posts, etc.)
  • posts

However, because the article is already long, I will only explain the first part (getting an author's social links). For the rest, please refer to the source code.

To extract someone's social link data on scotch, I will define an extractSocialUrl() function in the scotch.js file. The purpose of this function is to extract the social network name and the URL from the <a> tag.

Here is an example of the DOM form of such an <a> tag in an author's profile on scotch.
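The markup might look roughly like this (the class names and URL are hypothetical, for illustration only):

```html
<!-- Hypothetical social link in an author's profile -->
<a href="https://github.com/author" target="_blank" title="GitHub">
  <span class="icon icon-github"></span>
</a>
```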

When we call extractSocialUrl(), the result is an object shaped like this:
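For a hypothetical GitHub link, the object might be:

```js
{ name: 'github', url: 'https://github.com/author' }
```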

The complete code of the function that extracts the social link is as follows:

I will explain a little bit:

  • First, I fetch the <span> tags that carry the icon class. I also define a regular expression to match the icon classes.
  • Then I define a function, onlySocialClasses(), that is responsible for extracting all the social-related classes.

A concrete example for easier understanding: given a class string, this function returns only the social-related classes.
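Restated on its own (still assuming the hypothetical "icon-&lt;network&gt;" class convention), it behaves like this:

```javascript
// Filter a space-separated class string down to its social icon classes
const SOCIAL_REGEX = /^icon-(\w+)$/;

const onlySocialClasses = regex => (classes = '') =>
  classes.split(' ').map(c => c.trim()).filter(c => regex.test(c));

// "icon" and "big-icon" are dropped; only "icon-twitter" survives
const socialClasses = onlySocialClasses(SOCIAL_REGEX)('icon icon-twitter big-icon');
```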

Next, to extract the social network name, use the extractSocialName() function.
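A minimal sketch, again assuming the "icon-&lt;network&gt;" convention, simply strips the prefix:

```javascript
// Turn a social class like "icon-facebook" into the name "facebook"
const extractSocialName = (socialClass) => {
  const match = socialClass.match(/^icon-(\w+)$/);
  return match ? match[1] : null;
};
```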

Finally, extract the URL from the href attribute. The result looks like this:
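Putting the pieces together, the hypothetical GitHub icon link would yield:

```js
{ name: 'github', url: 'https://github.com/author' }
```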


So, I have guided you step by step through crawling a website. Different sites will have different HTML structures, so you may need to update the extractors accordingly, but overall the approach is the same.

You can download the full source code of the tutorial here:

