TIL: Ferret and FQL

Where I explore Ferret, a cool declarative approach to web scraping!

The other day I was trying to extract the directory of a particular office in a Mexican university. The directory consists of a few simple HTML tables, but with over 80 total records it was more than I'm OK copy-pasting.

automate-meme

Python being my strongest programming language, I normally default to Scrapy or Beautiful Soup but this seemed like overkill. I wanted a simple tool that I could declaratively feed the URL and CSS selectors for the fields I was interested in, and then emits the inner text values as a nice CSV or JSON.

My initial search lead me to the webscraper.io browser extension. It was almost what I wished for, but I didn't like the UI too much and after 5 minutes trying to figure out how to combine multiple selectors in a single record, I gave up.

I then went to search for web scraping on GitHub, and one of the first results was ferret. Immediately, the home page mentions a declarative query language and this seems to be was I was looking for!

Ferret is a Go library, CLI and parser for its own query language (FQL).

The Ferret Query Language

Coming from a data and backend engineering background, I'm familiar with SQL and other query languages so the initial description of FQL looked interesting:

FQL is mainly a declarative language, meaning that a query expresses what result should be achieved but not how it should be achieved.

The syntax of FQL queries is different to SQL, even if some keywords overlap. Nevertheless, FQL should be easy to understand for anyone with an SQL background.

The language comes with a very complete standard library of functions. Some of the builtin functions are reminiscent of SQL: TRIM, DATE_ADD, etc. There are also math, http and other functions.

Executing FQL files with the CLI

After installing the CLI, I started playing with a few queries and was pleasantly surprised by how little code I had to type. The query that does what I needed is shorter and arguably more readable than the equivalent Python code would've been, especially for someone more familiar with a declarative language like SQL:

directorio.fql

/* Open an HTML page
   https://www.montferret.dev/docs/stdlib/html/#document
*/
LET doc = DOCUMENT("https://cbi.izt.uam.mx/index.php/cbi/directorio")

/* Iterate elements that match a CSS selector
   https://www.montferret.dev/docs/stdlib/html/#elements
*/
FOR section IN ELEMENTS(doc, "div.g-owlcarousel-item")
    /* Extract text from a heading and remove surrounding whitespace.
       https://www.montferret.dev/docs/stdlib/html/#elements
    */
    LET oficina = TRIM(INNER_TEXT(section, "h4"), "\n +")

    FOR row IN ELEMENTS(section, "table > tbody > tr:not(:first-child)")
        RETURN {
            oficina: oficina,
            cargo: INNER_TEXT(row, "td:nth-child(1)"),
            responsable: INNER_TEXT(row, "td:nth-child(2)"),
            cubiculo: INNER_TEXT(row, "td:nth-child(3)"),
            email: REGEX_SPLIT(TRIM(INNER_TEXT(row, "td:nth-child(4)")), "[\n\t ]+"),
        }

The output of executing the FQL from the CLI is a JSON array, so it's easy to save the results and process them downstream with other tools. For example, the following SQLite query counts the number of workers in each office:

directorio.sql

select
    json_extract(value, '$.oficina') as "Office",
    count(*) as "Workers"
from json_each(readfile('./directorio.json'))
group by 1
order by 2 desc;

The following end-to-end pipeline involves only two commands:

ferret exec directorio.fql > directorio.json
sqlite3 -markdown < directorio.sql

Which briefly explained

scrapes the directory
saves the results to a JSON file
reads the JSON array in SQLite
outputs the result as a Markdown table

The final result is

Office	Workers
Coordinadores de Posgrado	13
Coordinación Divisional de Docencia y Atención a Alumnos	11
Departamento de Química	10
Departamento de Física	10
Coordinadores de Licenciatura	10
Departamento de Matemáticas	9
Departamento de Ingeniería Eléctrica	7
Coordinadores de Laboratorios de Docencia	6
Departamento de Ingeniería de Procesos e Hidráulica	5
Coordinadores del Tronco General	5
Secretaría Académica	3
Dirección	3
Coordinadores de Cursos de Apoyo a otras Divisiones	2

Further exploration

There's more capabilities to Ferret that may be worth exploring in more complex tasks:

pyfer is a Python wrapper around Ferret's C library, so it might be worth seeing how well it integrates in larger Python projects.
Handling dynamic pages with Chromium in browser mode and the Ferret worker.