There are a number of options for web scraping in Node.js, but my favorite library to work with is node-osmosis. Once you get the hang of it, it's super simple and allows complex scrapes with very little code. While I was learning it, I found a dearth of simple examples, so I wanted to put a few out there.

To start, we'll do a simple single-page scrape. We'll be working with this page on Wikipedia, which contains population information for U.S. states.

Simple Page Scrape

First, let's scrape some basic information from the page using basic selectors.

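const osmosis = require("osmosis");

// Assumed URL of the Wikipedia page described above; the later examples reuse it.
const url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population";

osmosis
  .get(url)
  .set({
    heading: "h1",
    title: "title",
  })
  .data(item => console.log(item));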

This returns:

{
  heading: 'List of U.S. states and territories by population',
  title: 'List of U.S. states and territories by population - Wikipedia'
}

Scrape Using Find

Next, let's say we want to get a list of all the states along with their populations. We can see that this data is in the first table on the page. To get at it, we'll introduce a new function, find, which sets the current context using selectors. Here, the selector picks the rows of the first table and tr:gt(0) skips the header row. Once osmosis has selected those rows, we can pick out the state and population values from the third and fourth cells (td[3] and td[4]) and pull them into an object.

osmosis
  .get(url)
  .find(".wikitable:first tr:gt(0)")
  .set({
    state: "td[3]",
    population: "td[4]",
  })
  .data(item => console.log(item));

This returns:

{ state: 'California', population: '39,250,017' }
{ state: 'Texas', population: '27,862,596' }
{ state: 'Florida', population: '20,612,439' }
{ state: 'New York', population: '19,745,289' }
{ state: 'Illinois', population: '12,801,539' }
{ state: 'Pennsylvania', population: '12,784,227' }
{ state: 'Ohio', population: '11,646,273' }
{ state: 'Georgia', population: '10,310,371' }
{ state: 'North Carolina', population: '10,146,788' }
{ state: 'Michigan', population: '9,928,301' }
...
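
Note that the population values come back as strings with thousands separators. If you need real numbers, a small helper can convert each row. This is only a sketch: parsePopulation and normalizeRow are hypothetical names of my own, not part of osmosis, and the helper assumes the number appears at the start of the cell text.

```javascript
// Hypothetical helper (not part of osmosis): turn "39,250,017" into 39250017.
function parsePopulation(text) {
  // Take the first run of digits and commas, ignoring anything after it
  // (for example a trailing footnote marker).
  const match = /\d[\d,]*/.exec(String(text));
  return match ? parseInt(match[0].replace(/,/g, ""), 10) : NaN;
}

// Convert one scraped row, leaving the state name as text.
function normalizeRow(row) {
  return {
    state: row.state.trim(),
    population: parsePopulation(row.population),
  };
}

console.log(normalizeRow({ state: "California", population: "39,250,017" }));
// prints: { state: 'California', population: 39250017 }
```

With osmosis, something like this could be called inside the data callback before logging or storing each item.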

Scrape Multiple Parts

We can also use multiple set calls to pull out different pieces of data. Anything set before find is carried along into every object produced afterwards.

osmosis
  .get(url)
  .set({
    title: "title",
  })
  .find(".wikitable:first tr:gt(0)")
  .set({
    state: "td[3]",
    population: "td[4]",
  })
  .data(item => console.log(item));

This returns:

{
  title: 'List of U.S. states and territories by population - Wikipedia',
  state: 'California',
  population: '39,250,017'
}
{
  title: 'List of U.S. states and territories by population - Wikipedia',
  state: 'Texas',
  population: '27,862,596'
}
{
  title: 'List of U.S. states and territories by population - Wikipedia',
  state: 'Florida',
  population: '20,612,439'
}
{
  title: 'List of U.S. states and territories by population - Wikipedia',
  state: 'New York',
  population: '19,745,289'
}
{
  title: 'List of U.S. states and territories by population - Wikipedia',
  state: 'Illinois',
  population: '12,801,539'
}
...
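
Since the title is repeated in every object, you may want to fold the rows back under their shared page-level value. Here is a minimal sketch of one way to do that after the scrape; groupByTitle is a hypothetical helper of my own, and it assumes each item has a title key like the output above.

```javascript
// Hypothetical helper: collapse items that share a title into one entry
// with the remaining per-row fields collected under `rows`.
function groupByTitle(items) {
  const pages = new Map();
  for (const { title, ...row } of items) {
    if (!pages.has(title)) {
      pages.set(title, { title, rows: [] });
    }
    pages.get(title).rows.push(row);
  }
  return [...pages.values()];
}

const grouped = groupByTitle([
  { title: "Example page", state: "California", population: "39,250,017" },
  { title: "Example page", state: "Texas", population: "27,862,596" },
]);

console.log(grouped.length);         // prints: 1
console.log(grouped[0].rows.length); // prints: 2
```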

Following Links

Now, let's say we want some information from each state's own page. Osmosis has a follow function that we can use to scrape each state page. Since the URL is inside the table that we're scraping, we can pass that URL to the follow function using the a@href selector inside the third column (td[3]). After following each link, we can use set again to pull data from each state page. In this example, we'll just pull out the longitude and latitude. The end result combines the elements from each set call.

osmosis
  .get(url)
  .find(".wikitable:first tr:gt(0)")
  .set({
    state: "td[3]",
    population: "td[4]",
  })
  .follow("td[3] a@href")
  .set({
    longitude: ".longitude",
    latitude: ".latitude",
  })
  .data(item => console.log(item));

This will return:

{
    state: 'California',
    population: '39,250,017',
    longitude: '119°21'19"W',
    latitude: '35°27'31"N'
}
{
    state: 'Illinois',
    population: '12,801,539',
    longitude: '88°22'49"W',
    latitude: '41°16'42"N'
}
...
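
The coordinates arrive as degrees/minutes/seconds strings. If you'd rather have decimal degrees, a conversion along these lines should work. Treat it as a sketch: dmsToDecimal is a hypothetical helper, not an osmosis feature, and it assumes the format shown in the output above (it also accepts the typographic prime characters, in case the page uses those).

```javascript
// Hypothetical helper: convert a degrees/minutes/seconds string such as
// 119°21'19"W into signed decimal degrees (south and west are negative).
function dmsToDecimal(text) {
  // Minutes and seconds are optional; accept ASCII quotes and primes.
  const pattern = /(\d+)°\s*(?:(\d+)['′]\s*)?(?:(\d+(?:\.\d+)?)["″]\s*)?([NSEW])/;
  const match = pattern.exec(text);
  if (!match) return NaN;

  // Unmatched optional groups are undefined, so the defaults apply.
  const [, deg, min = "0", sec = "0", hemisphere] = match;
  const value = Number(deg) + Number(min) / 60 + Number(sec) / 3600;
  return hemisphere === "S" || hemisphere === "W" ? -value : value;
}

console.log(dmsToDecimal(`119°21'19"W`).toFixed(4)); // prints: -119.3553
console.log(dmsToDecimal(`35°27'31"N`).toFixed(4));  // prints: 35.4586
```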

Promisification

In all the examples above, we're just printing out each object as it's handled by the data function. Often, we want to return all of the data at once in an array instead. To do that, we can wrap the call in a Promise and add each element to an array as it's processed.

function scrapePopulations() {
  return new Promise((resolve, reject) => {
    let results = [];
    osmosis
      .get(url)
      .find(".wikitable:first tr:gt(0)")
      .set({
        state: "td[3]",
        population: "td[4]",
      })
      .data(item => results.push(item))
      .done(() => resolve(results));
  });
}

scrapePopulations().then(data => console.log(data));

// We can also get the same results more simply by wrapping the nested scrape in .set([])
osmosis
  .get(url)
  .set([
    osmosis.find(".wikitable:first tr:gt(0)").set({
      state: "td[3]",
      population: "td[4]",
    }),
  ])
  .data(items => console.log(items));

This returns:

[
  { state: 'California', population: '39,250,017' },
  { state: 'Texas', population: '27,862,596' },
  { state: 'Florida', population: '20,612,439' },
  { state: 'New York', population: '19,745,289' },
  { state: 'Illinois', population: '12,801,539' },
  { state: 'Pennsylvania', population: '12,784,227' },
  { state: 'Ohio', population: '11,646,273' },
  { state: 'Georgia', population: '10,310,371' },
  { state: 'North Carolina', population: '10,146,788' },
  { state: 'Michigan', population: '9,928,301' },
  { state: 'New Jersey', population: '8,944,469' },
  { state: 'Virginia', population: '8,411,808' },
  { state: 'Washington', population: '7,288,000' },
...
]
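
If you find yourself writing the Promise wrapper more than once, the pattern can be pulled into a reusable function. The sketch below is one way to do it, under the assumption that the chain exposes chainable data and done methods, as the osmosis calls above do. Both collect and fakeChain are hypothetical names; fakeChain is only a stand-in so the example runs without osmosis or network access.

```javascript
// Hypothetical helper: gather everything a chain emits into one array.
// `chain` is assumed to have chainable .data(cb) and .done(cb) methods.
function collect(chain) {
  return new Promise(resolve => {
    const results = [];
    chain
      .data(item => results.push(item))
      .done(() => resolve(results));
  });
}

// Stand-in for a real scrape: emits the given items, then signals completion.
function fakeChain(items) {
  const chain = {
    data(callback) {
      chain.onData = callback;
      return chain;
    },
    done(callback) {
      // Emit asynchronously, the way a real scrape would.
      setImmediate(() => {
        items.forEach(item => chain.onData(item));
        callback();
      });
      return chain;
    },
  };
  return chain;
}

async function main() {
  const rows = await collect(
    fakeChain([{ state: "California" }, { state: "Texas" }])
  );
  console.log(rows.length); // prints: 2
}

main();
```

With a real scrape, you would pass the osmosis chain (everything up to, but not including, the data call) in place of fakeChain.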

These examples only scratch the surface of the types of scrapes that can be completed using osmosis. If you want to see any other examples or have questions, leave me a comment below.