How To Scrape Search Results With JavaScript

The necessary tools are now available for JavaScript, whether for a web or a mobile application. This post will describe how the dynamic Node.js environment enables you to scrape the web effectively and satisfy most of your requirements.

This article is aimed primarily at developers who already have some experience with JavaScript. However, it can also serve as a short introduction to the language if you have a solid grasp of web scraping but little prior JavaScript knowledge. Either way, expertise in the following areas will undoubtedly be helpful:

  1. Knowledge of JavaScript
  2. Experience extracting element selectors using the browser’s DevTools
  3. Some familiarity with JavaScript ES6

In essence, Node.js established JavaScript as a server-side language: it provides a standard JavaScript engine that runs free of the constraints of the typical browser sandbox and comes with a standard system library for networking and system access. To get powerful business insights powered by structured, actionable public Google Search data, you can use a Google Search scraper.

JavaScript Event Loop

What it retained, however, was the event loop. In contrast to how many languages handle concurrency with multi-threading, JavaScript has always used a single thread and performed blocking tasks asynchronously, relying primarily on callback functions.

Let’s briefly examine that with a straightforward web server example:

const http = require('http');
const port = 5000;

const server = http.createServer((req, res) => {
    res.statusCode = 200;
    res.setHeader('Content-Type', 'text/plain');
    res.end('Hello World');
});

server.listen(port, () => {
    console.log(`Server running at PORT:${port}/`);
});

In this case, we need to import the HTTP standard library. Next, createServer is used to make a server object, and an anonymous handler function is passed to it; the library will call this method for each incoming HTTP request. Finally, we just start listening to the designated port.

There are two noteworthy aspects here, both of which relate to JavaScript’s asynchronicity and the event loop:

First, we supply a handler function to createServer. Second, listen does not block; it returns right away. Most other programming languages offer a blocking accept function or method, which suspends our thread and returns the connection socket of the connecting client. To handle more than one connection at a time, we would then have to switch to multi-threading. In this scenario, however, callbacks and the event loop ensure that we never have to worry about thread management and always work in a single thread.

As mentioned, listen returns immediately, yet the program does not exit right away, even though there is no code after the listen call. That is because we still have a callback registered via createServer (the function we passed to it).

Whenever a client sends a request, Node.js parses it in the background and then calls our anonymous function, passing it the request object.
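To make this asynchronous, single-threaded model more tangible, here is a tiny standalone illustration (unrelated to the server above): the callback passed to setTimeout only runs once the current code has finished and the event loop picks it up.

console.log('first');

// This callback is queued and only executed once the current code has finished
setTimeout(() => {
    console.log('third');
}, 0);

console.log('second');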

Using HTTP Clients to Query the Web

HTTP clients are tools that let you send requests to a server and receive its responses. Almost every tool covered in this article uses an HTTP client under the hood to query the server of the website you plan to scrape.

1. Built-In HTTP Client

As seen in the server example above, Node.js ships with an HTTP library by default. That library also includes a built-in HTTP client.

const http = require('http');

const req = http.request('http://example.com', res => {
    const data = [];

    res.on('data', chunk => data.push(chunk))        // collect the response chunks
    res.on('end', () => console.log(data.join('')))  // stitch them together at the end
});

req.end();

As there are no third-party dependencies to install or manage, getting started is rather simple. However, as you can see from our example, the library does require some boilerplate: it delivers the response in chunks, and you eventually have to stitch them together yourself. Additionally, HTTPS URLs require a separate module.
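For HTTPS URLs, the separate built-in https module exposes the same interface. A minimal sketch, using https://example.com as a stand-in target:

const https = require('https');

const req = https.request('https://example.com', res => {
    const data = [];

    // Collect the chunks and concatenate them once the response has ended
    res.on('data', chunk => data.push(chunk));
    res.on('end', () => console.log(Buffer.concat(data).toString()));
});

req.end();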

2. Fetch API

The Fetch API is another built-in option. While browsers have supported fetch for some time, it took Node.js a little longer; as of release 18 it is supported. To be fair, it is still considered an experimental feature for the time being, so if you would rather err on the side of caution, you can use the polyfill/wrapper library node-fetch, which offers the same functionality.

async function fetch_demo() {
    const resp = await fetch('https://www.reddit.com/r/programming.json');
    console.log(await resp.json());
}

fetch_demo();

Because top-level await is not available in CommonJS scripts, we had to wrap our code in a function. Aside from that, we simply called fetch() with our URL, awaited the response while the Promise machinery did its work in the background, and used the json() method of the Response object.
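As a side note, if you run the code as an ES module (for example by using the .mjs file extension), top-level await does work and the wrapper function becomes unnecessary. A rough sketch, assuming Node.js 18 or newer with the built-in fetch:

// demo.mjs: top-level await is available in ES modules
const resp = await fetch('https://www.reddit.com/r/programming.json');

if (!resp.ok) {
    // Surface HTTP errors explicitly; fetch only rejects on network failures
    throw new Error(`Request failed with status ${resp.status}`);
}

console.log(await resp.json());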

3. Axios

Axios is fairly comparable to Fetch: it is a Promise-based HTTP client that works in both Node.js and the browser. TypeScript users will also appreciate its built-in type definitions.

npm install axios

Unlike the clients we have covered so far, it has one small downside: we have to install it first.

const axios = require('axios')

axios
    .get('https://www.reddit.com/r/programming.json')
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    });

Because Axios relies on Promises, we can of course use await again and make the whole thing less verbose. Let’s wrap it in a function once more:

async function getForum() {
    try {
        const response = await axios.get(
            'https://www.reddit.com/r/programming.json'
        )
        console.log(response)
    } catch (error) {
        console.error(error)
    }
}
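One detail worth knowing: logging the whole response object also prints headers, config, and other metadata. Axios parses JSON bodies automatically and puts the payload on response.data, so a version that only prints the data might look like this (the function name is just illustrative):

async function getForumData() {
    const response = await axios.get('https://www.reddit.com/r/programming.json')
    console.log(response.status)  // the HTTP status code, e.g. 200
    console.log(response.data)    // the parsed JSON payload
}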

4. SuperAgent

Much like Axios, SuperAgent is another capable HTTP client that supports Promises and the async/await syntactic sugar. It also offers a reasonably simple API; however, SuperAgent is less well known and pulls in more dependencies.

Whether you use callbacks, Promises, or async/await, making an HTTP request with SuperAgent looks like this:

const superagent = require("superagent")
const forumURL = "https://www.reddit.com/r/programming.json"

// callbacks
superagent
    .get(forumURL)
    .end((error, response) => {
        console.log(response)
    })

// promises
superagent
    .get(forumURL)
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    })

// promises with async/await
async function getForum() {
    try {
        const response = await superagent.get(forumURL)
        console.log(response)
    } catch (error) {
        console.error(error)
    }
}
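Similarly to Axios, SuperAgent parses JSON responses automatically; the parsed payload ends up on response.body. A short sketch (the function name is illustrative):

async function getForumBody() {
    const response = await superagent.get(forumURL)
    console.log(response.status)  // the HTTP status code
    console.log(response.body)    // the parsed JSON payload
}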

5. Request

Even though it is no longer being updated, Request is still a popular and widely used HTTP client in the JavaScript ecosystem. Making an HTTP request with Request is quite easy:

const request = require('request')
request('https://www.reddit.com/r/programming.json', function (
    error,
    response,
    body
) {
    console.error('error:', error)
    console.log('body:', body)
})

You have probably noticed that we used neither Promises nor await here. That is because Request sticks to the conventional callback approach, although a few wrapper libraries make it usable with await as well.
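If you prefer await but want to avoid an extra wrapper dependency, Node’s built-in util.promisify can adapt Request’s callback-style API. A minimal sketch; note that the promisified call resolves with the response object, whose body property holds the payload:

const util = require('util')
const request = require('request')

// promisify turns request(url, callback) into a Promise-returning function
const requestAsync = util.promisify(request)

async function getForum() {
    try {
        const response = await requestAsync('https://www.reddit.com/r/programming.json')
        console.log(response.body)
    } catch (error) {
        console.error(error)
    }
}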

Should you use Request? We included it on this list because it is still a popular option. That said, official development has ended and it is no longer actively maintained. This does not make it useless, and many libraries still depend on it, but the lack of maintenance should give us pause before adopting it in a brand-new project, especially given the abundance of viable alternatives and native fetch support.

JavaScript Data Extraction

Undoubtedly, obtaining the content of a website is a crucial first step in any scraping project; however, we also need to locate and extract the data. Next, we will look at how to handle an HTML document in JavaScript and how to find and select data for extraction.

The hard way: regular expressions. The simplest way to get started with web scraping without any dependencies is to apply a pile of regular expressions to the HTML you obtained from your HTTP client. But there is a significant trade-off.

Regular expressions are excellent in their own domain, but they are not well suited to parsing document structures like HTML, and for web scraping they can quickly become unwieldy. Having said that, let’s attempt it anyway.

Let’s say we want to extract the username from a <label> element that contains it. With regular expressions, you would need to do something like this:

const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>Username: (.+)<\/label>/)

console.log(result[1])

Output:

John Doe

Here, we’re using String.match(), which returns an array containing the results of evaluating our regular expression. Because we used a capturing group ((.+)), the second array element (result[1]) contains whatever that group matched.

While it clearly worked in our example, anything more complex will either fail or require a far more elaborate expression. Just imagine an HTML document that contains several <label> elements, as in the sketch below.
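With several <label> elements you already need a global pattern and an iteration over all matches, and every additional attribute or whitespace variation makes the expression more brittle. A rough sketch:

const htmlString = `
    <label>Username: John Doe</label>
    <label>Username: Jane Roe</label>
`

// The "g" flag is required for matchAll; each match exposes its capturing group at index 1
for (const match of htmlString.matchAll(/<label>Username: (.+)<\/label>/g)) {
    console.log(match[1])
}

Output:

John Doe
Jane Roe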
