
I wrote a crawler for the first time.

February 12, 2021

Early on in the pandemic, I decided that I wanted a way to track the moving average of cases per day in my state, Mississippi, since that wasn't something our Department of Health had a graph for at the time. I figured, "you know, this won't go on too long... I can definitely just do this for a few months," so I had been manually adding data for every single day until the end of January. But I would frequently forget, or just not want to look at the data, for a month or more at a time, and then I'd have to go back through the last month's data to update my graph. I realized I needed to automate the process. So, without overthinking it, I finally decided to write a crawler to get all that data from our state's Department of Health website.

The Crawler

For me, this was the easy part. I wanted to write a web crawler in a language I was comfortable with to get it up relatively quickly, so I decided on JavaScript. I took bits and pieces from various tutorials I had found and decided on using Axios to grab the data and Cheerio to parse it.

To start out, I added Axios and Cheerio to my site.

for yarn: yarn add axios cheerio
for npm: npm install axios cheerio

Then, I included them in the JavaScript file I used for my crawler code.

const axios = require('axios') 
const cheerio = require('cheerio')

You could also do it the ✨ES6 way✨:

import axios from 'axios' 
import cheerio from 'cheerio'

I also included my JSON file and Node's fs module so I could add the newest data to that JSON file.

const fs = require('fs') 
const data = require('../src/constants/covidData.json')

Then, I created a function to get the latest cases for the day off of the MSDH website. I fetched the data with Axios, loaded it into Cheerio, and then pulled the value out of the section of the DOM that contained the current day's data. I found this selector by going into the dev tools in the browser and looking for the section of the page that contained the daily case data. In this case, there was a data-description attribute on a p tag that helped me locate the correct HTML element. I removed any commas from the string it returned and made sure that it was getting saved as an integer so it would work with my charts.

const msdh = 'https://msdh.ms.gov/msdhsite/_static/14,0,420.html'

const getDailyCases = async () => {
  try {
    // Fetch the page HTML (destructured as `html` so it doesn't shadow the imported JSON `data`)
    const { data: html } = await axios.get(msdh)
    const $ = cheerio.load(html)
    // Pull the count out of the element with data-description="New cases",
    // strip the commas, and save it as an integer
    const dailyCases = parseInt($('[data-description="New cases"]').text().replace(/,/g, ''), 10)
    return dailyCases
  } catch (error) {
    console.log(error)
  }
}

I created a new date object. Since all the data is from the previous day, I set the date to the day before.

let today = new Date()
today.setDate(today.getDate() - 1)

Then I initialized my data object, which would eventually hold those two pieces of information to add to my JSON file.

let dailyCases = {
  newCases: 0,
  // getMonth() is zero-based, so add 1 to format the date the way I needed
  date: today.getFullYear() + '-' + (today.getMonth() + 1) + '-' + today.getDate()
}

Finally, I wrote another async function to call my getDailyCases function and, once it has that data, add it to my JSON file, as long as there are new cases and that date doesn't already exist in the file.

const getCovidData = async () => {
  dailyCases.newCases = await getDailyCases()
  // Only add the entry if we actually got a case count back and that date isn't already in the file
  const alreadyRecorded = data.data.some((entry) => entry.date === dailyCases.date)
  if (!alreadyRecorded && dailyCases.newCases) {
    data.data.push(dailyCases)
    fs.writeFile('src/constants/covidData.json', JSON.stringify(data), (error) => {
      if (error) {
        console.log(error)
      }
    })
  }
}

And, of course, call that function so that it'll actually run.

getCovidData()
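For reference, the covidData.json file the crawler reads and writes is just an object with a data array of daily entries. The real file isn't shown here, but its shape looks roughly like this (the numbers are made up for illustration):

{
  "data": [
    { "newCases": 1210, "date": "2021-1-30" },
    { "newCases": 987, "date": "2021-1-31" }
  ]
}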

That's all there is to the crawler! You can check out the full crawler file on my GitHub.

Getting it to run regularly

My first thought was to use a combination of Netlify Functions to run the web crawler and Zapier to schedule the daily deployment. I quickly realized this wasn't going to work. Since my database was just a JSON file in my GitHub repo, I needed to make sure that the data was getting added every day. When I tried using the Netlify/Zapier combination, it would run the crawler and "overwrite" the last entry daily, since that data wasn't getting pushed back to GitHub.

After that didn't pan out, I decided to try GitHub Actions, which I had never used before. (Spoiler, this is what I ended up using.)

I just jumped right into GitHub Actions without any real research or planning. Normally, that's not something I'd recommend. However, it worked out pretty well this time because of how well the default YAML file was commented. I used a lot of the default YAML file for the action.

To get the Action to run daily, I used POSIX cron syntax to set the interval.

on:
  schedule:
    - cron: "00 20 * * *"

Each of those space-separated places represents a unit of time, and together they determine how often your Action will run. You'll often see this written as five asterisks ("* * * * *"). The first field is the minute, the second is the hour (in UTC), the third is the day of the month, the fourth is the month (1-12 or JAN-DEC), and the fifth is the day of the week (0-6 or SUN-SAT). Leaving a field as an asterisk means it runs for every one of those units of time. In my case, I wanted my Action to run every day at 20:00 UTC (2 PM CST) so the Department of Health had time to publish that day's data, so I only filled in the minute and hour fields and left the rest as asterisks, as shown below.
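As a quick reference, here's that same schedule with the fields labeled (this is just the field order spelled out, not an extra piece of the workflow):

# minute   hour (UTC)   day of month   month   day of week
#   00        20              *           *          *
- cron: "00 20 * * *"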

Once I determined how often I needed it to run, I needed to define the actual job (with steps!) that I wanted it to run. So I set up Node.js, installed my dependencies (Axios and Cheerio), ran my crawler, and then pushed the changes to my repository.

jobs: 
  # This workflow contains a single job called "build" 
  build: 
    # The type of runner that the job will run on (I left it as the default) 
    runs-on: ubuntu-latest 
    # Steps represent a sequence of tasks that will be executed as part of the job
    steps: 
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it 
      - uses: actions/checkout@v2 
    
      - name: Setup Node.js environment 
        uses: actions/setup-node@v2.1.4 
      
      - name: Install axios and cheerio 
        run: | 
          npm install axios 
          npm install cheerio 
      
      - name: Get Covid Data 
        run: | 
          node functions/crawler.js 
    
      - name: Push changes 
        uses: actions-go/push@v1 
        with: 
          # The commit message used when changes need to be committed
          commit-message: "running daily COVID data crawler"
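For anyone following along, the schedule trigger and the job above live together in a single workflow file under .github/workflows/ in the repository. Roughly, the file is laid out like this (the file name and the name field here are just examples, not what's in my repo):

# .github/workflows/covid-crawler.yml (example file name)
name: Daily COVID data crawler
on:
  schedule:
    - cron: "00 20 * * *"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ...the steps shown above...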

That's all there is to it! Now the web crawler is running every day!

Senior-ish developers get intimidated too.

Writing a web crawler was something I put off for a LONG time in my career. It was probably the first thing I was asked to do as a developer (and I never did it). Quite honestly, it intimidated me a lot, and it took me around 9 years to get over that intimidation. I just assumed that I wouldn't be able to do it, and I let that consume me. Now, every single time I see that commit message "running daily COVID data crawler," I feel so proud. I've built many things throughout my career, but this may be the thing I'm most proud of, because I proved to myself that I could do it.

Let this be a lesson for new developers that sometimes things don't get less scary. You just get less afraid of failing.

Illustration from Undraw