After a Weave Cloud user reported a particularly egregious outage, we decided to add alerting to our front-end codebase. Alerting and metrics for Weave Cloud are handled by Prometheus, however, which is designed to monitor back-end services. Implementing alerting for front-end applications meant taking Prometheus into uncharted waters.

Classes of UI problems

There are several questions that we want our metrics/alerts to answer about the health of our UI. The current implementation of UI alerts seeks to answer these two questions:

Did our application make it to the user?

Since our front-end app has to travel across the network, we should track whether or not it makes it across the wire to the end user. We should also make sure that the application mounts to the DOM and at least renders something on the page.

Are all of the different pages rendering successfully?

Just because our app made it to the user and did an initial render doesn’t mean that all of the pages are working. Our application is a React single-page-application (SPA), which means that server round trips are not necessary when navigating between pages.

In a traditional server-rendered UI using Prometheus (e.g. Ruby on Rails), the server can catch errors that happen during page render and increment a counter; when the number of errors is high enough within a given time period, an alert is triggered.

Because all of the page rendering for our React SPA happens in the user’s browser, it is possible for render errors to happen without ever being tracked.

Design Constraints

To answer the question “Did my application load?”, we cannot rely on loading external JavaScript files, because in the event of a network outage (say, for example, an S3 outage), our alert code will never reach the page and thus never phone home to our metrics database.

Tracking page render errors is a much more difficult problem. A page render error in the browser will manifest itself as a JS runtime error and could be indistinguishable from other runtime errors that result from user interaction or network events. Luckily, each of our pages is expressed as a React component with a uniquely identifiable render function that we instrument via the unstable_handleError component lifecycle method. This method is experimental and subject to change in future versions of React, hence the unstable prefix.

Finally, all of our metrics must be expressed as a “rate,” meaning that we need to capture successful conditions as well as error conditions. For example, counting the number of page-render errors doesn’t do any good if we don’t know how many times the page rendered successfully.
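To make the “rate” idea concrete, here is a minimal sketch of how a failure ratio could be derived from a pair of counters. The function name failureRatio and the sample numbers are illustrative assumptions, not part of the Weave Cloud codebase.

```javascript
// Hypothetical helper: the share of failed events among all observed events.
// Returns null when nothing was observed, because a ratio is undefined
// without a denominator.
function failureRatio(successCount, failCount) {
  const total = successCount + failCount;
  if (total === 0) {
    return null;
  }
  return failCount / total;
}

// Five failed renders mean very different things depending on how many
// renders succeeded in the same period.
console.log(failureRatio(5, 5));   // 0.5
console.log(failureRatio(995, 5)); // 0.005
```

Without the success count, both cases above would look identical: five errors.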

Implementation

The task of determining whether or not the application made it to the users is accomplished by embedding a script within our index.html file. This script sets a timer that must be cancelled by the application code that arrives later:

(function setPageLoadTimer() {
  window.__WEAVEWORKS_PAGE_LOAD_TIMER = setTimeout(function() {
    var xhr = new XMLHttpRequest();
    // The HELP line must name the same metric as the TYPE and sample lines.
    var str = '# HELP ui_application_load Number of times service-ui loads\n';
    str += '# TYPE ui_application_load counter\n';
    str += 'ui_application_load{status="fail"} 1\n';

    xhr.open('POST', '/api/ui/metrics');
    xhr.setRequestHeader('Content-Type', 'text/plain');
    xhr.send(str);
  }, 3000);
})();

The application code has 3 seconds to cancel the timer; otherwise, a Prometheus-formatted message is sent to the /api/ui/metrics endpoint. Since we cannot rely on a library to render a Prometheus-formatted payload, we do it ourselves. Luckily, the payload is not very complex.

Our router.jsx component, which wraps our entire application, cancels the timer once it mounts:

componentDidMount() {
  if (window.__WEAVEWORKS_PAGE_LOAD_TIMER) {
    clearTimeout(window.__WEAVEWORKS_PAGE_LOAD_TIMER);
    trackAppLoadSuccess();
  }
}

For tracking page render health, we use a more nuanced approach that keeps these points in mind:

  • We don’t want to overload our server by reporting every time a page is rendered. We should report every 15 seconds instead.
  • Users might not stay on an error page long enough to hit our reporting interval. In order to avoid missing error events, report errors to the server immediately.
  • Prometheus best practices advise against metric labels with high cardinality, so we need to scrub user-specific data from our metrics. In our case, that means normalizing things like the orgId from page routes.
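The first two points can be sketched as a small reporter that batches successes and sends failures right away. This is an illustration under stated assumptions, not the code we ship: createRenderReporter, the injected send function, and the payload shape are all hypothetical, and paths are assumed to be normalized before they are passed in.

```javascript
// Hypothetical reporter: batches successful renders, sends failures at once.
// `send` is an injected function (for example, a wrapper that POSTs to
// /api/ui/metrics); the object it receives is an assumed shape.
function createRenderReporter(send, intervalMs = 15000) {
  const successes = new Map(); // normalized path -> count since last flush

  function flush() {
    for (const [path, count] of successes) {
      send({ path, status: 'success', value: count });
    }
    successes.clear();
  }

  const timer = setInterval(flush, intervalMs);

  return {
    // Cheap to call on every render; nothing is sent until the next flush.
    renderSucceeded(path) {
      successes.set(path, (successes.get(path) || 0) + 1);
    },
    // The user may leave an error page quickly, so do not wait for the timer.
    renderFailed(path) {
      send({ path, status: 'fail', value: 1 });
    },
    flush,
    // Stop the timer and send whatever is still pending.
    stop() {
      clearInterval(timer);
      flush();
    },
  };
}
```

Injecting send keeps the batching logic independent of the transport, which also makes it easy to exercise without a network.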

Each page shares the same render instrumentation logic, which makes this a solid use case for a React higher-order component. As you can guess from the name, a higher-order component accepts a component as an argument and returns a component with extra functionality. This is the preferred method for reusing component logic in React, according to the maintainers.

Our RenderCheck higher-order-component looks something like this:

export default function wrapWithRenderCheck(Component, Error = ErrorPage) {
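  class RenderCheck extends React.Component {
    constructor(props) {
      super(props);
      // Initialize state so that render() can safely destructure `error`.
      this.state = { error: null };
    }

    // Experimental lifecycle hook: called when a child throws during render.
    unstable_handleError(error) {
      this.setState({ error });
      trackPageRenderError(window.location.pathname);
    }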

    render() {
      const { error } = this.state;
      return (
        <div>
          { error && <Error error={error} {...this.props} /> }
          { !error && <Component {...this.props} /> }
        </div>
      );
    }
  }

  return RenderCheck;
}

This also gives us the chance to handle the error somewhat gracefully and show a formatted Error page with a stack trace. The real version also contains a helper function that scrubs user data from the URL and replaces it with a normalized token: /app/loud-breeze-77 -> /app/:orgId.
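The scrubbing helper is not shown here, but a minimal sketch could look like the following. The name normalizePath, the assumption that the org ID is always the segment after /app/, and the /monitor sub-route are all illustrative.

```javascript
// Hypothetical route normalizer. The pattern assumes that the org ID is
// always the path segment that directly follows /app/.
function normalizePath(pathname) {
  return pathname.replace(/^\/app\/[^/]+/, '/app/:orgId');
}

console.log(normalizePath('/app/loud-breeze-77'));         // /app/:orgId
console.log(normalizePath('/app/loud-breeze-77/monitor')); // /app/:orgId/monitor
console.log(normalizePath('/login'));                      // /login
```

Every organization then maps to the same label value, which keeps the number of distinct time series bounded by the number of routes rather than the number of users.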

Most of the Prometheus JavaScript client libraries only promise compatibility with Node.js. This is not surprising, as Prometheus is generally used for server-side metrics. These libraries collect many helpful default metrics out of the box, but those defaults rely on Node.js-specific APIs that do not exist in the browser.

There didn’t appear to be any JS Prometheus libraries that were both actively maintained and browser-compatible. So I had no choice but to roll my own!

This library currently only supports the Counter metric type and could use a good refactoring. It might also be useful to split this into its own repository and npm module so that it can be used in other projects (by both Weaveworks and the public). One key difference between this implementation and others is the ability to render one-off metrics without saving them to an instantiated registry:

export function trackPageRenderError(path, err) {
  const txt = prom().render(
    'counter',
    'ui_page_render',
    'Number of times a page fails to render',
    { value: 1 },
    { path, status: 'fail' }
  );
  postText('/api/ui/metrics', txt);
}

This is helpful when it is not possible to pass a singleton all the way through the front-end application.
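The internals of prom().render are not shown above. As a rough sketch of what a registry-free renderer with the same argument order might do, consider the following; renderMetric and escapeLabelValue are hypothetical names, and the only part taken from Prometheus itself is the text format (HELP and TYPE lines, then a sample line, with backslash, double quote, and newline escaped in label values).

```javascript
// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Hypothetical stand-in for a one-off render: builds the text for a single
// sample without reading from or writing to any registry.
function renderMetric(type, name, help, sample, labels = {}) {
  const pairs = Object.keys(labels)
    .map((key) => `${key}="${escapeLabelValue(labels[key])}"`)
    .join(',');
  const series = pairs ? `${name}{${pairs}}` : name;
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} ${type}`,
    `${series} ${sample.value}`,
    '', // trailing newline
  ].join('\n');
}

console.log(renderMetric(
  'counter',
  'ui_page_render',
  'Number of times a page fails to render',
  { value: 1 },
  { path: '/app/:orgId', status: 'fail' }
));
```

Because the function is pure, it can be called from anywhere in the application without threading a shared registry object through the component tree.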

prom-aggregation-gateway

To collect metrics from users’ browsers, we needed to expose a service that receives the metrics data and aggregates it for a Prometheus scrape. The prom-aggregation-gateway accepts POST requests containing Prometheus-formatted data and serves the aggregated metrics data via GET requests to the /metrics route.
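The aggregation step can be illustrated with a small in-memory sketch. This is not the gateway's actual code: createAggregator is a hypothetical name, the sketch only sums samples that share a name and label set, it drops HELP and TYPE lines for brevity, and it assumes samples carry no timestamp.

```javascript
// Hypothetical in-memory aggregator, for illustration only.
// Assumes each sample line is "<series> <value>" with no trailing timestamp,
// so the value is the text after the last space.
function createAggregator() {
  const totals = new Map(); // series (name plus labels) -> summed value

  return {
    // Add every sample line from one POSTed payload to the running totals.
    ingest(text) {
      for (const line of text.split('\n')) {
        if (line === '' || line.startsWith('#')) continue; // skip HELP/TYPE
        const split = line.lastIndexOf(' ');
        const series = line.slice(0, split);
        const value = Number(line.slice(split + 1));
        totals.set(series, (totals.get(series) || 0) + value);
      }
    },
    // Produce the sample lines that a scrape would receive.
    render() {
      return [...totals]
        .map(([series, value]) => `${series} ${value}\n`)
        .join('');
    },
  };
}
```

Two browsers that each POST ui_application_load{status="fail"} 1 would therefore show up as a single series with the value 2 on the next scrape.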

Alerts

The UI alerts are configured in the service-conf repo. They are currently configured to alert if the error rate exceeds 0.1 for 5 minutes.

We also have graphs for UI stats available on Grafana.

Well, did it work?

TLDR: Yes!

During a recent S3 outage, the UiAppLoadFailed alert did trigger properly. In this case, we were already aware that the UI (along with most of Weave Cloud) was failing, but the alerts did their job. Thanks to some quick thinking, we were at least able to restore the Weave Cloud UI, even though Cortex and Scope were both running with severely degraded performance.

Future improvements and closing thoughts

The current list of UI problems that can be accurately measured and subsequently alerted on is very limited. Here are some other things that can go (and have gone) wrong with our UI that would be great to add in the future:

  • Network requests: checking for errors and latency on network requests from the client could be interesting. We already have metrics for each of the services, so maybe we would alert when there is a discrepancy between what the client is reporting and what the server is reporting. Network metrics at the UI level would also give us greater insight into what the user is actually experiencing.
  • JS runtime errors: it is very difficult to add alerting around exceptions that are triggered by user input. For example, when a user clicks a button and triggers an “undefined is not a function” error, we can record the error, but there is no concept of a “rate” unless we also track how many times that button is clicked without an error. We may quickly run into cardinality problems if we try to track individual errors. It may be enough to count the number of errors in the last “n” minutes and alert on that. Also, these errors will not contain a stack trace to help with troubleshooting. We may be able to include the current page that the runtime error is happening on, however.
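As a rough idea of what counting runtime errors per page could look like, here is a speculative sketch. Nothing like this exists in our codebase yet: installRuntimeErrorCounter, the ui_runtime_error metric name, and the injected getPath and report functions are all hypothetical.

```javascript
// Hypothetical: count uncaught errors, labeled with the current page.
// `target` would be `window` in a browser; `getPath` and `report` are
// injected so that the sketch does not depend on browser globals.
function installRuntimeErrorCounter(target, getPath, report) {
  function onError() {
    // Only the (normalized) page is recorded, never the error message,
    // to keep label cardinality low.
    report({ name: 'ui_runtime_error', path: getPath(), value: 1 });
  }
  target.addEventListener('error', onError);
  // Return a function that removes the listener again.
  return () => target.removeEventListener('error', onError);
}
```

In a browser this might be wired up as installRuntimeErrorCounter(window, () => window.location.pathname, send), with the path passed through the same normalization as the page render metrics.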