Web Scraping with TypeScript and Node.js

Apr 21, 2022

Sometimes you'll find yourself wanting to use a set of data from a website, but the data won't be available as an API or in a downloadable format like CSV. In these cases, you may have to write a web scraper to download individual web pages and extract the data you want from within their HTML. This guide will teach you the basics of writing a web scraper using TypeScript and Node.js, and will note several of the obstacles you might encounter during web scraping.

If you want to skip straight to the finished code example, check it out on GitHub.

Setup

First things first: we need to initialize our project and install the base dependencies. We'll be writing our web scraper in TypeScript and running it as Node.js scripts using ts-node. For simplicity, we'll create an index.ts file in the project root to work from. From the command line, run the following to get started:

mkdir my-web-scraper && cd my-web-scraper # create project directory
git init # initialize new git repository
echo "node_modules" >> .gitignore # do not track node_modules in git
npm init -y # initialize Node.js project
# install dependencies
npm install typescript ts-node
npm install --save-dev @types/node
touch index.ts # create an empty TypeScript file

Node.js doesn't run TypeScript files natively. Rather than use the TypeScript compiler to output new JavaScript files whenever we want to run the script, we'll use ts-node to run the TypeScript files directly. We'll go ahead and add this to our new package.json file as an npm script.

  "scripts": {
    "scrape": "ts-node ./index.ts"
  }

Now, we'll be able to run our scraper from index.ts with the command npm run scrape.

Fetching Websites

In our examples, we'll be using Axios to make our HTTP requests. If you'd prefer something else, such as Node Fetch, which mirrors the browser's Fetch API until a native version is ready in Node.js, that's fine too.

npm install axios

Let's create our first function for fetching a given URL and returning the HTML from that page.

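import axios, { AxiosError } from 'axios';

function fetchPage(url: string): Promise<string | undefined> {
  const HTMLData = axios
    .get<string>(url)
    .then(res => res.data)
    .catch((error: AxiosError) => {
      // config can be undefined on some errors, so read the URL defensively
      console.error(`There was an error with ${error.config?.url ?? url}.`);
      console.error(error.toJSON());
      // Resolve to undefined so callers can detect the failed fetch
      return undefined;
    });

  return HTMLData;
}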

This function will use Axios to create a promise to fetch a given URL, and return the HTML it gets back as a string. If there's an error, it will log that error to the console, and return undefined instead. Since you're probably going to be running this scraper from your command line throughout development, a healthy number of console.logs will help you make sure the script is running as expected.
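Because a failed fetch resolves to undefined rather than throwing, any code that calls the function has to check for that case. The sketch below shows one way to do so. The Fetcher type and the fetchAllPages helper are hypothetical names introduced for illustration, not part of the original code; the helper accepts any function with the same shape as fetchPage, which also makes it easy to try out without touching the network.

```typescript
// A fetcher has the same shape as fetchPage: it resolves to the page's HTML,
// or to undefined when the request failed.
type Fetcher = (url: string) => Promise<string | undefined>;

// Hypothetical helper: fetch URLs one at a time and keep only the pages
// that actually loaded.
async function fetchAllPages(
  urls: string[],
  fetcher: Fetcher,
): Promise<Map<string, string>> {
  const pages = new Map<string, string>();
  for (const url of urls) {
    const html = await fetcher(url);
    if (html === undefined) {
      // The fetcher already logged the error; skip this URL and move on
      console.warn(`Skipping ${url}: no HTML returned`);
      continue;
    }
    pages.set(url, html);
  }
  return pages;
}
```

In the scraper itself, fetchPage would be passed as the fetcher argument.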

Caching Scraped Pages

In the event that you're trying to scrape many, many static web pages in a single script, you might want to cache the pages locally as you download them. This will save you time and headache as you work on your scraper. You're much less likely to annoy the website you're scraping with high traffic and the bandwidth costs associated with it, and your scripts will probably run faster if they aren't limited by your Internet connection.

Let's go ahead and create a .cache folder in the project root. You probably won't want to keep cached files in your git history, so we'll want to add this folder to your .gitignore file.

mkdir .cache
echo ".cache" >> .gitignore

To cache our results, we'll first check if a cached version of the given page already exists. If so, we'll use that. If not, we'll fetch the page and save it to the .cache folder. For filenames, we're just going to base-64 encode the page's URL. If you prefer some other way to generate a unique filename, that's fine too. I've chosen the base-64 encoded URLs because it's easy and very obviously a temporary sort of file. We also have an optional function argument ignoreCache, in case you've built up your cache but want to scrape fresh data anyway.

import { existsSync, mkdirSync } from 'fs';
import { readFile, writeFile } from 'fs/promises';
import { resolve } from 'path';

async function fetchFromWebOrCache(url: string, ignoreCache = false) {
  const cacheDir = resolve(__dirname, '.cache');
  // If the cache folder doesn't exist, create it
  if (!existsSync(cacheDir)) {
    mkdirSync(cacheDir);
  }
  // Encode the URL as the filename. The 'base64url' variant avoids the '/'
  // that plain base-64 can produce, which would break the file path.
  const cachePath = resolve(
    cacheDir,
    `${Buffer.from(url).toString('base64url')}.html`,
  );
  console.log(`Getting data for ${url}...`);
  if (!ignoreCache && existsSync(cachePath)) {
    console.log(`I read ${url} from cache`);
    const HTMLData = await readFile(cachePath, { encoding: 'utf8' });
    return HTMLData;
  } else {
    console.log(`I fetched ${url} fresh`);
    const HTMLData = await fetchPage(url);
    if (!ignoreCache && HTMLData) {
      // Wait for the write to finish so that failures surface here
      await writeFile(cachePath, HTMLData, { encoding: 'utf8' });
    }
    return HTMLData;
  }
}
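The filename scheme described above can be pulled out into a small helper, which also makes it possible to recover the original URL from a cached file. This is a minimal sketch: cachePathFor and urlFromCacheFile are hypothetical names, and it assumes a Node.js version that supports the 'base64url' encoding (added in 14.18 and 15.7). That variant is used because plain base-64 output can contain '/', which the file system would treat as a directory separator.

```typescript
import { basename, join } from 'path';

// Hypothetical helper: derive a cache file path from a URL.
function cachePathFor(url: string, cacheDir = '.cache'): string {
  // 'base64url' only emits letters, digits, '-' and '_', so the result
  // is safe to use as a single filename.
  const encoded = Buffer.from(url, 'utf8').toString('base64url');
  return join(cacheDir, `${encoded}.html`);
}

// Reverse the encoding to find out which URL a cached file came from.
function urlFromCacheFile(filePath: string): string {
  const encoded = basename(filePath).replace(/\.html$/, '');
  return Buffer.from(encoded, 'base64url').toString('utf8');
}

const examplePath = cachePathFor('https://news.ycombinator.com/news?p=2');
console.log(examplePath);
console.log(urlFromCacheFile(examplePath));
```

Very long URLs can still exceed the file system's filename length limit, in which case hashing the URL is a common alternative.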

Extracting Data with jsdom

Now that we have HTML to work with, we want to extract the relevant data from it. To do this, we will use jsdom, a JavaScript implementation of the DOM. This will let us interact with the downloaded HTML in much the same way as if we were working in a browser's console, giving access to methods like querySelector.

(If you prefer a syntax more like jQuery's, Cheerio is also a popular option.)

npm install jsdom
npm install --save-dev @types/jsdom

Now let's import jsdom and use it to return the Document object of our HTML string. Just modify the previous fetchFromWebOrCache to turn HTMLData into a DOM object, and return its window.document.

import { JSDOM } from 'jsdom';

async function fetchFromWebOrCache(url: string, ignoreCache = false) {
  // Get the HTMLData from fetching or from cache
  const HTMLData = '<html>...</html>';
  const dom = new JSDOM(HTMLData);
  return dom.window.document;
}

Now that we're working with a Document instead of a string, we've got access to everything we'd have if we were working in the browser console. This makes it much easier to write code that extracts the pieces of a page that we want! For example, let's scrape whatever is on the front page of Hacker News right now. We'll write a function that accepts the Document of the Hacker News front page, finds all of the links, and gives us back the link text and URL as a JavaScript object.

Using your browser's developer tools, you can easily inspect an element on the page with desired data to figure out a selector path. In our example, we can right-click a link and choose Inspect to view it in DevTools. Then we right-click the DOM element, and choose "Copy > Copy selector" in Chrome or "Copy > CSS Selector" in Firefox, for example.

A copied selector will give you a string of text that selects only the element you copied it from in DevTools. And often that is useful! Just throw your selector into document.querySelector('selector'), and you're good to go. But in our case, we want all of the front page links. So we need a broader selector than copy-pasting from DevTools will give us. This is where you'll have to actually read through the HTML, classes, ids, etc., to figure out how to craft the right selector.

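Fortunately for us in this example, at the time of writing all of the links on the Hacker News feed share a class that nothing else on the page uses: titlelink. So we can use document.querySelectorAll('a.titlelink') to get all of them. Site markup changes over time, so confirm the class in DevTools before relying on it.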

// Pass the scraped Document from news.ycombinator.com to this
// function to extract data about front page links.
function extractData(document: Document) {
  const writingLinks: HTMLAnchorElement[] = Array.from(
    document.querySelectorAll('a.titlelink'),
  );
  return writingLinks.map(link => {
    return {
      title: link.text,
      url: link.href,
    };
  });
}

This function is only a simple example, and would be different depending on what you want to get out of a page. When working with jsdom, remember that you're not working with arrays and objects but with NodeLists and Elements. To get useful data out of your selections, you'll often have to do things like convert a NodeList into an array as shown above.
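To see why the conversion matters, here is a small sketch that runs without a DOM. The LinkLike interface and the nodeListLike object are stand-ins invented for illustration: like a NodeList, the object has indexed entries and a length but no map or filter methods, so it has to pass through Array.from first.

```typescript
// Stand-in for the parts of an anchor element used by extractData; in the
// scraper these would be real HTMLAnchorElement objects.
interface LinkLike {
  text: string;
  href: string;
}

// An array-like object: indexed entries plus a length, but no map or filter.
const nodeListLike: ArrayLike<LinkLike> = {
  length: 2,
  0: { text: 'First story', href: 'https://example.com/1' },
  1: { text: 'Second story', href: 'https://example.com/2' },
};

// Array.from copies the entries into a real array, so map becomes available.
const links = Array.from(nodeListLike).map(link => ({
  title: link.text,
  url: link.href,
}));

console.log(links);
```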

Sometimes you'll have to get creative with your selections. I recently tried to scrape the information from an HTML table on a page with varying numbers of tables and no classes. Because the number of tables was always different, I couldn't reliably select from a list of tables by which number table it was. I had to select every table present on a page, then filter them by the text in the first cell to get precisely the one table I needed!

// Sometimes, web scraping is just hard...
const table: HTMLTableElement = Array.from(
    data.querySelectorAll('table'),
  ).filter(t =>
    t.children[0].children[0].children[0].innerHTML.match(
      /Unique Text in First Cell which IDs the Table/,
    ),
  )[0];

Extracting Data with Regular Expressions

Unfortunately for us, not all pages on the Internet are well-structured and ready for scraping. Sometimes, they don't even try to use HTML tags properly. In these sad cases, you may need to turn to regular expressions (regex) to extract what you need. We won't need to resort to such extreme measures in our example of scraping Hacker News, but it's worth knowing that you might need to do this.

I'll give you a contrived example where you would need some regex, based on another site I recently scraped. Imagine the following badly-done HTML:

<div class="pokemon">
  Name: Pikachu<br />
  Number: 25<br />
  Type: Electric<br />
  Weakness: Ground
</div>

The various data attributes we care about aren't wrapped by their own HTML elements! Everything is just inside a div with some br tags to create line breaks. If I wanted to extract the data from this, I could use regex to find and match the text and patterns I expect to find. This can require trial and error, and I recommend using a tool like regex101 to test the regular expressions you come up with. In this example, we might write the following code:

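// querySelector returns an Element, so read its innerHTML to get a string
const rawPokemonHTML = document.querySelector('.pokemon')?.innerHTML ?? '';
// Index 1 holds the first capture group; index 0 is the entire match
const name = rawPokemonHTML.match(/Name: (\w+)/)?.[1];
const num = rawPokemonHTML.match(/Number: (\d+)/)?.[1];
// etc...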
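The same idea can be tried on a plain string, without fetching or parsing a page. In this sketch the markup from the contrived example is hard-coded, and extractField is a hypothetical helper that returns the first capture group for a given label. It assumes the labels are plain words, since they are interpolated into the pattern without escaping.

```typescript
// Sample markup from the example above; in the scraper this string would
// come from the innerHTML of the .pokemon element.
const pokemonHTML = `
  Name: Pikachu<br />
  Number: 25<br />
  Type: Electric<br />
  Weakness: Ground
`;

// Hypothetical helper: return the text captured after "Label: ", or
// undefined when the label is not present.
function extractField(html: string, label: string): string | undefined {
  const match = html.match(new RegExp(`${label}: (\\w+)`));
  // match[0] is the whole match; match[1] is the first capture group
  return match?.[1];
}

const pokemon = {
  name: extractField(pokemonHTML, 'Name'),
  number: Number(extractField(pokemonHTML, 'Number')),
  type: extractField(pokemonHTML, 'Type'),
  weakness: extractField(pokemonHTML, 'Weakness'),
};

console.log(pokemon);
```

Returning undefined for a missing label, rather than indexing into a null match, keeps one malformed page from crashing a long scraping run.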

Saving Data

Once we've extracted our data from the HTML, we'll want to save it. This is basically the same as when we created a cache for the downloaded HTML files.

import { existsSync, mkdirSync } from 'fs';
import { writeFile } from 'fs/promises';
import { resolve } from 'path';

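async function saveData(filename: string, data: unknown) {
  const dataDir = resolve(__dirname, 'data');
  // Create the data folder next to this script if it doesn't exist yet
  if (!existsSync(dataDir)) {
    mkdirSync(dataDir);
  }
  // Await the write so that errors reach the caller
  await writeFile(resolve(dataDir, `${filename}.json`), JSON.stringify(data), {
    encoding: 'utf8',
  });
}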
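As a quick check that saved data can be read back, the following sketch writes a sample record to a throwaway temporary directory and parses it again. The saveJson helper is a hypothetical synchronous variant written for this demonstration, not the function above. It passes recursive: true to mkdirSync, which creates missing parent folders and does not throw when the folder already exists, so no separate existence check is needed.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Hypothetical synchronous variant of saveData that returns the path it wrote.
function saveJson(dir: string, filename: string, data: unknown): string {
  // recursive: true creates missing parents and is a no-op if dir exists
  mkdirSync(dir, { recursive: true });
  const filePath = join(dir, `${filename}.json`);
  writeFileSync(filePath, JSON.stringify(data, null, 2), { encoding: 'utf8' });
  return filePath;
}

// Demo: write to a temporary directory and read the data back.
const demoDir = mkdtempSync(join(tmpdir(), 'scraper-demo-'));
const savedPath = saveJson(join(demoDir, 'data'), 'hacker-news-links', [
  { title: 'Example story', url: 'https://example.com/' },
]);
const roundTripped = JSON.parse(readFileSync(savedPath, 'utf8'));
console.log(roundTripped);
```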

Putting It All Together

Now that we've got all the necessary pieces, we're ready to build our JSON file of Hacker News front page stories. To see all of our code in one piece, check it out on GitHub.

async function getData() {
  const document = await fetchFromWebOrCache(
    'https://news.ycombinator.com/',
    true, // Hacker News is always changing, so ignore the cache!
  );
  const data = extractData(document);
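  // Wait for the file to be written before finishing
  await saveData('hacker-news-links', data);
}

// Log any failure instead of leaving an unhandled promise rejection
getData().catch(console.error);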

When we run our script from the command line, it will execute getData(). That function will fetch the HTML from Hacker News' front page, extract all of the links and their titles, and then save them to data/hacker-news-links.json. And while you probably don't need a list of links from Hacker News, this should be enough to get you started collecting data from the web that you do care about.

Tom VanAntwerp

@tvanantwerp


May 10, 2023

6 mins



Tom VanAntwerp

