Easy Guide to Setting Up a Proxy for Node.js Web Scraping

Node.js is a JavaScript runtime built on Chrome's V8 engine and is mainly used to build fast, scalable network applications. It is well suited to web scraping because its non-blocking, event-driven I/O model lets a scraper keep many requests in flight at once, which can greatly improve crawling efficiency.

Advantages of Node.js web scraping

Node.js offers several advantages for web scraping:

1. High performance and high concurrency

Node.js runs on the Chrome V8 engine and uses an event-driven, non-blocking I/O model. This lets it handle a large number of concurrent requests, so a scraper can fetch multiple web pages at the same time and capture data much more efficiently.

2. Asynchronous operations

Because Node.js is asynchronous, operations such as HTTP requests do not block the program: it can carry on with other tasks while waiting for a response, which improves overall throughput.

3. Rich third-party libraries

Node.js has a huge ecosystem with a large number of third-party libraries, such as axios and cheerio, that greatly simplify crawler development.

4. Seamless integration with web technologies

Node.js uses the same language as front-end JavaScript, which makes it easier for crawlers to handle complex web pages, including dynamically loaded content.
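
The concurrency and asynchronous points above can be illustrated with a minimal sketch. It assumes Node.js 18 or later, where fetch is built in; the names fetchAll and fetchPage are hypothetical and not part of any library. All requests are started at once, and Promise.allSettled collects every outcome so that one failed page does not discard the others.

```javascript
// Sketch: fetch several pages concurrently and report a result per URL.
// `fetchPage` is a hypothetical injectable function (url -> Promise<string>);
// the default below relies on the fetch built into Node.js 18+.
async function defaultFetchPage(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return response.text();
}

async function fetchAll(urls, fetchPage = defaultFetchPage) {
  // Start every request immediately; none waits for another to finish.
  const settled = await Promise.allSettled(urls.map((url) => fetchPage(url)));

  // Pair each outcome with its URL so a single failure stays visible
  // without hiding the pages that succeeded.
  return settled.map((outcome, index) => ({
    url: urls[index],
    ok: outcome.status === 'fulfilled',
    html: outcome.status === 'fulfilled' ? outcome.value : null,
    error: outcome.status === 'rejected' ? String(outcome.reason) : null,
  }));
}
```

Calling fetchAll with a list of URLs returns one entry per URL in the same order. For large lists you would normally cap how many requests run at once, which this sketch does not do.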

Node.js web scraping example

To do web scraping in Node.js, you usually use popular libraries such as axios for sending HTTP requests and cheerio for parsing HTML. The following is a simple example.
First, make sure you have installed axios and cheerio. If not, you can install them through npm:

npm install axios cheerio

Then, you can create a JavaScript file, say webScraper.js, and write the following code:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWebpage(url) {
  try {
    // Sending HTTP GET request
    const { data } = await axios.get(url);

    // Loading HTML using cheerio
    const $ = cheerio.load(data);

    // Extract web page title
    const title = $('title').text();

    // Suppose we want to crawl all the links on a web page
    const links = [];
    $('a').each((index, element) => {
      const href = $(element).attr('href');
      const text = $(element).text();
      links.push({ href, text });
    });

    // Returns the fetched data
    return {
      title,
      links
    };
  } catch (error) {
    // Log the failure and return null so callers can detect it
    console.error('Scraping error:', error.message);
    return null;
  }
}

// Usage example
scrapeWebpage('https://example.com').then(data => {
  if (data) {
    console.log('Scraped Data:', data);
  }
});

This code first defines an asynchronous function scrapeWebpage, which accepts a URL as a parameter. The function uses axios to send an HTTP GET request to get the webpage content, and then uses cheerio to load the content. Next, it extracts the title and all links of the webpage and returns this information as an object.

Finally, the code demonstrates how to use this function by calling the scrapeWebpage function and passing it an example URL. The scraped data will be printed in the console.

You can save this code to a file, such as webScraper.js, and then run node webScraper.js in the command line to execute it. Remember to replace https://example.com with the URL of the webpage you want to scrape.

How to deal with obstacles in Node.js web scraping

A Node.js scraper may run into obstacles when crawling the web, and there are several measures you can take to deal with them. The following are some common strategies:

1. Set reasonable request headers

Send the same request headers a normal browser would, such as User-Agent, Referer, and Accept-Language, to reduce the risk of the website identifying your scraper as a crawler.
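
As a rough illustration of the header strategy, the sketch below builds a browser-like header object. The function name buildBrowserHeaders and the specific header values are assumptions made for this example; in practice you would copy current values from your own browser's developer tools.

```javascript
// Sketch: browser-like request headers. The values are illustrative
// assumptions, not values required by any particular website.
function buildBrowserHeaders(referer) {
  const headers = {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
  };
  // Only send a Referer when the caller supplies one.
  if (referer) {
    headers['Referer'] = referer;
  }
  return headers;
}

// With axios, the object goes in the `headers` field of the request config:
//   const { data } = await axios.get(url, {
//     headers: buildBrowserHeaders('https://example.com/'),
//   });
```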

2. Use a proxy

Websites usually detect crawler behavior by watching for frequent requests from the same IP address. With a pool of proxies, each request can go out from a different IP address, which reduces the risk of being blocked by the website.
To scrape through a proxy in Node.js, you can again use the axios library to send HTTP requests and the cheerio library to parse HTML. The following simple example shows how to scrape web content through a proxy:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWebpageWithProxy(url, proxy) {
  try {
    // Configure axios to use a proxy
    const config = {
      proxy: {
        host: proxy.host,
        port: proxy.port,
        auth: {
          username: proxy.username,
          password: proxy.password,
        },
      },
    };

    // Sending HTTP GET request with proxy
    const { data } = await axios.get(url, config);

    // Parsing HTML with cheerio
    const $ = cheerio.load(data);

    // Extract and return the title of a web page
    return $('title').text();
  } catch (error) {
    console.error('Scraping error with proxy:', error);
  }
}

// Example proxy configuration (placeholder values)
const proxyConfig = {
  host: 'your-proxy-host',
  port: 8080, // placeholder: replace with your proxy's port number
  username: 'your-proxy-username',
  password: 'your-proxy-password',
};

// Usage Examples
scrapeWebpageWithProxy('https://example.com', proxyConfig).then(title => {
  console.log('Scraped Title:', title);
});

In this code, the scrapeWebpageWithProxy function receives a url and a proxy object as parameters. The proxy object contains the host, port, username, and password of the proxy server. Then, the function uses the axios library to send an HTTP GET request with the proxy configuration.

Be sure to replace the placeholders in proxyConfig with your actual proxy server information. If you don't need proxy authentication, you can remove the auth property from the config object.

Finally, call the scrapeWebpageWithProxy function, passing in the URL of the webpage you want to scrape and your proxy configuration, and then process the returned scraping results.
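
To use a different IP address for each request, one option is to rotate through a list of proxies. The following is a minimal round-robin sketch, assuming you already have several proxy servers available; createProxyRotator is a hypothetical helper, and it produces objects in the shape of the axios proxy option (protocol, host, port, auth).

```javascript
// Sketch: round-robin proxy rotation. `createProxyRotator` is a hypothetical
// helper; the proxy entries you pass in are your own servers.
function createProxyRotator(proxies) {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error('At least one proxy is required');
  }
  let next = 0;

  // Each call returns the next proxy in the list, wrapping around at the end.
  return function nextProxyConfig() {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;

    const config = {
      protocol: 'http',
      host: proxy.host,
      port: Number(proxy.port),
    };
    // Add credentials only for proxies that need authentication.
    if (proxy.username) {
      config.auth = { username: proxy.username, password: proxy.password };
    }
    return config;
  };
}

// With axios, pass a fresh proxy on every request:
//   const nextProxy = createProxyRotator(myProxies);
//   const { data } = await axios.get(url, { proxy: nextProxy() });
```

Round-robin is only one possible policy; picking a proxy at random or skipping proxies that recently failed would follow the same pattern.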

3. Limit access frequency

Simulate human browsing behavior by adding random time intervals between requests and avoiding overly frequent requests.
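
The random-interval idea can be sketched as follows. The helper names and the default delay range of 1 to 3 seconds are assumptions chosen for illustration; scrapeOne stands for whatever scraping function you already have, such as scrapeWebpage.

```javascript
// Sketch: scrape URLs one at a time with a random pause between requests.
// The function names and default delays are illustrative assumptions.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Returns a whole number of milliseconds between minMs and maxMs, inclusive.
function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

async function scrapeSequentially(urls, scrapeOne, minMs = 1000, maxMs = 3000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // Pause before every request except the first one.
    if (i > 0) {
      await sleep(randomDelay(minMs, maxMs));
    }
    results.push(await scrapeOne(urls[i]));
  }
  return results;
}
```

This trades speed for a lower request rate: pages are fetched strictly one after another, so the total time grows with the number of URLs.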

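4. Handle dynamic pages and JavaScript-generated content

For page content that is generated dynamically by JavaScript, use a headless-browser tool such as Puppeteer to simulate browser behavior, execute the page's JavaScript, and obtain the rendered content. Cheerio only parses the HTML it is given and does not execute JavaScript, so it can process the rendered HTML but cannot replace a browser.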

Saving scraped data in Node.js

Scraping the web and saving data in Node.js is a common application scenario. You can use various libraries to help you send HTTP requests, parse web content, and save the scraped data to files, databases, or other storage systems.

Here is a simple example showing how to use Node.js to scrape web data and save it to a JSON file:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

async function scrapeAndSaveData(url, filePath) {
  try {
    // Sending HTTP GET request
    const { data } = await axios.get(url);

    // Parsing HTML with cheerio
    const $ = cheerio.load(data);

    // Extract the data you need
    const title = $('title').text();
    const bodyText = $('body').text();

    // Create an object to hold the data
    const scrapedData = {
      title,
      bodyText,
    };

    // Convert data into JSON string and save to file
    const jsonData = JSON.stringify(scrapedData, null, 2);
    fs.writeFileSync(filePath, jsonData);

    console.log('Data saved successfully!');
  } catch (error) {
    console.error('Scraping or saving error:', error);
  }
}

// Usage Examples
scrapeAndSaveData('https://example.com', 'scrapedData.json');

In this example, the scrapeAndSaveData function receives a URL and a file path as parameters. It uses the axios library to send an HTTP GET request, and then uses the cheerio library to parse the returned HTML. Next, it extracts the title and body text of the webpage, and saves this data to a JSON file.

You can modify this function as needed to extract and save other data that interests you. For example, you can scrape links, images, metadata, etc. on a webpage, and save them to different files or a database.
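
As one example of saving to a different file format, the sketch below writes a list of scraped links to a CSV file. It assumes link objects of the form { href, text }, as produced by the earlier scrapeWebpage example; saveLinksAsCsv and toCsvField are hypothetical helper names.

```javascript
const fs = require('fs');

// Sketch: save scraped links as CSV. Helper names are illustrative.
// Quote a field when it contains a comma, a quote, or a line break,
// and double any quotes inside it, as CSV readers generally expect.
function toCsvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function saveLinksAsCsv(links, filePath) {
  // First row is the header; each link becomes one row after it.
  const rows = [['href', 'text'], ...links.map((link) => [link.href, link.text])];
  const csv = rows.map((row) => row.map(toCsvField).join(',')).join('\n') + '\n';
  fs.writeFileSync(filePath, csv, 'utf8');
}

// Example call, assuming `data` came from scrapeWebpage:
//   saveLinksAsCsv(data.links, 'links.csv');
```

Links without an href attribute end up with an empty first column rather than the text "undefined".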