Own Your Event Tracking

Event tracking is something every web app needs. Own your infrastructure; don't rent it. This post will show you how.

Mike Coughlin

Updated: January 02, 2022

Whether you're running a SaaS app, an e-commerce storefront, or a media publication, you need event tracking to help you make informed decisions. When I say "event" I mean everything from a simple page visit, to a completed purchase, to the general use of your application.

In this post, I'll walk you through setting up what I believe to be the best event tracking system available. There are many reasons I refer to this as the best but here are the top three:

1. Future Proof

This system uses an open-source, abstract library to collect events. Those events are then sent to your data warehouse as well as any third-party tool you like. This is the opposite of a common scenario where you allow third-party tools to collect your events then pass the data they collect to your data warehouse.

For example, when a visitor loads a page we will record a "page visited" event and tell Google Analytics about it rather than letting Google Analytics record it with their script. Because the canonical record stays in your hands, you can add, swap, or drop third-party tools later without losing your event history.

2. Greater Control & Flexibility

If you buy something like Segment you are giving up control and the ability to customize your pipeline (beyond what they expose in their interface). You will also be paying extra for seemingly trivial changes to data transport. For example, it's a premium feature to send data to some destinations but not others.

3. Cost

It's relatively low cost and a helluva lot cheaper than something like Segment (for the record, I think Segment is a great product and many of my clients love it).

Now that you're sold on owning your own, let's get started by getting more familiar with our new stack (shout-out to Jonathan Geggatt who taught me a lot about these components).

The Functionality We Need

Event Collection

We need to manage the triggering and recording of events (pageviews, purchases, app-specific behavior, etc.).

We also need the ability to tie pre-signup, anonymous behavior to post-signup user behavior. This is often referred to as "identity resolution".

Event Storage

Pretty straightforward, we need a place to store all of the collected events. This will be our data warehouse.

Event Transport

We need a way to transport the events collected on our website, app, etc. to the data warehouse.

The Components We'll Use

Get Analytics (Event Collection)

GetAnalytics will be responsible for our event collection and identity resolution.

For event collection and identity resolution, we'll be using an open-source JavaScript library by David Wells called, simply, "Analytics". To limit confusion, throughout this post I will refer to it as GetAnalytics (because the docs live at getanalytics.io).

In its own words "GetAnalytics is a lightweight abstraction library for tracking page views, custom events, & identifying visitors. It is pluggable & designed to work with any third-party analytics tool or your own backend."

This library will be present on all of our web assets (marketing website, web app, etc.). It will collect pageview events (similar to Google Analytics) out-of-the-box for us and we can use it to wire up custom events and actions.

Cost: Open-source (free). Buy David Wells a beer next time you're in San Francisco.

BigQuery (Event Storage)

BigQuery will be our DWH of choice

We'll be using a data warehouse (DWH) for event storage. In this post, I'm using BigQuery, but Stitch connects to all the most popular DWHs (Redshift, Azure, Snowflake, etc.). I chose BigQuery because, in my opinion, it's easy to set up and relatively cheap to maintain (from both a cost and a sysops perspective).

Cost:

The free tiers are very generous, and most projects I've worked on that go beyond them still come in under $100.

Stitch (Event Transport)

We'll use Stitch to extract our event payloads and load them into our DWH.

Stitch is a platform for transporting data. It will be one part of our two-part pipeline sending events collected by GetAnalytics to our data warehouse. We will be streaming our data to Stitch using their free webhook.

Cost: They have a generous free plan (5M rows per month) that will cover a lot of events. If you were to double that to 10M rows, it's $180/month.

AWS Lambda & AWS Gateway (Event Transport)

Our Lambda will secure our Stitch webhook and provide us with additional controls

We'll be using Stitch to load the events into our data warehouse but we will be adding a layer of security between Stitch and our web assets with an AWS Lambda/Gateway combo (credit to Pete Michel for suggesting this component).

The AWS Gateway provides us an API endpoint to send data to and also acts as the trigger that will cause our Lambda code to execute. The Lambda is where our code lives and is executed (which for this walkthrough will be solely to pass the data it receives on to the Stitch webhook).

We route the data collected by GetAnalytics through a Lambda before hitting Stitch for a couple of reasons: it keeps the sensitive Stitch webhook URL out of our client-side code, and it gives us a single place to add extra controls (validation, filtering, and the like) later.

Cost: Lambda pricing, Gateway pricing. If you manage to outgrow the free tiers, pricing will be the least of your worries.

The Way They All Work Together

Now that you are familiar with all of the components, let's talk about how they are all going to work together.

  1. GetAnalytics is the library we'll add to our website/app and it will be responsible for capturing events.
  2. When an event is fired, we will send (POST) its payload to our API endpoint (AWS Gateway).
  3. When the Gateway receives the payload it will trigger the AWS Lambda function.
  4. The Lambda function will parse the data and send it (POST) to our Stitch webhook.
  5. Stitch will receive the data via the webhook, extract it, and load it into our data warehouse (BigQuery).
  6. Data will land in our data warehouse (BigQuery) where we can query it for analysis, build reports, etc.
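The client side of steps 1–2 can be sketched as a small GetAnalytics custom plugin. This is a hedged illustration, not the exact script from this post: the plugin name and the injectable postFn are my own, and the URL is the same placeholder we'll replace later once the Gateway exists. GetAnalytics plugins do expose page/track/identify hooks that receive the event payload.

```javascript
// Sketch of a custom GetAnalytics plugin that forwards every event payload
// to our API Gateway endpoint. postFn is injectable purely so this sketch
// is easy to test without a network.
const GATEWAY_URL = 'https://will-add-our-api-endpoint-here/';

function postJson(url, payload) {
  // POST the payload as JSON (fetch is available in browsers and Node 18+).
  return fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
}

function gatewayForwarderPlugin(postFn = postJson) {
  return {
    name: 'gateway-forwarder',
    // Each hook receives the event payload (carrying anonymousId/userId)
    // and simply forwards it to the Gateway.
    page:     ({ payload }) => postFn(GATEWAY_URL, payload),
    track:    ({ payload }) => postFn(GATEWAY_URL, payload),
    identify: ({ payload }) => postFn(GATEWAY_URL, payload),
  };
}
```

You'd pass gatewayForwarderPlugin() into the plugins array when initializing the library on your site.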

Let's Build It

We're going to start with our Event Collection (GetAnalytics) then move onto setting up our Event Storage (BigQuery) and finally the two-part pipeline (AWS & Stitch) that will make up our Event Transport to connect them.

Event Collection

GetAnalytics

Here are the docs and the repo for this library. For this post, we're going to use the vanilla-html implementation, but feel free to add the library any way you like. You'll need to add it to all of the web assets you want to track events on, but for now let's just add it to a simple marketing site.

Now that we've added GetAnalytics to the site, let's use the console to test the three major functions: page, track, and identify.

When you open up your console you will see an error: POST https://will-add-our-api-endpoint-here/ net::ERR_NAME_NOT_RESOLVED. This is because we haven't added our API endpoint yet. Ignore it for now; we will update this after we've set up our AWS Gateway (which provides the API endpoint).

Let's start with page. The page function fires every time a page is loaded; you can think of it simply as recording a visit. By default, the script you've added to your footer calls analytics.page(); on every load, but let's run that manually in the console.

page being run in the console

After we run page, you can see a Promise object is returned; expand that and the nested PromiseResult and you'll see the payload that holds all the data we want. This payload is what we'll be passing along to our data warehouse. When looking at the payload, pay particular attention to the anonymousId and userId keys. Your anonymousId is a unique identifier for you, set as a cookie in your browser. The userId should currently be null because we have yet to identify ourselves; that's about to change.
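To make those keys concrete, here is roughly what a page payload looks like before identification. The values are made up and field names beyond type, properties, anonymousId, and userId may vary by library version; the keys that matter for this post are the two identifiers.

```javascript
// Illustrative shape of a `page` payload before the visitor is identified.
// Values are invented; the important keys are anonymousId and userId.
const examplePagePayload = {
  type: 'page',
  properties: {
    title: 'Home',
    url: 'https://example.com/',
    path: '/',
  },
  anonymousId: '0f2a9c1d-5b3e-4e6f-8a7b-1c2d3e4f5a6b', // set as a cookie
  userId: null, // stays null until analytics.identify() is called
};
```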

Let's try out the identify function by calling it and passing it a userId. Run analytics.identify('userid-123'); in your console.

identify function being run in the console

The identify function is used to identify a user after they've signed up for your application. The most common identifier to pass to the function is a user's id, but it can be anything unique to a user. We passed a string, userid-123. Let's look at the payload returned. Our anonymousId should be the same as before, but this time our userId should be set as userid-123. If you run analytics.page(); again you'll see that it now also has a userId value. We can now tie prior anonymous actions to a specific user, and every subsequent page visit or tracked action will carry both an anonymousId and a userId because we now have cookies for both (see below).

The two cookies that make it all possible

When you call identify in your application, you'll call it in all the places a user authenticates (mainly sign-up and sign-in).
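As a sketch of what that looks like in an app, here is a hypothetical auth callback. The handler name and user object are mine, and analyticsInstance stands in for whatever your page snippet exposes.

```javascript
// Hypothetical auth callback: identify the user at sign-up/sign-in so the
// anonymousId cookie history gets tied to their account id.
function handleAuthSuccess(analyticsInstance, user) {
  // Ids are often numeric in the database; identify expects a string.
  return analyticsInstance.identify(String(user.id));
}
```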

Finally, we'll try a track call. The track function is used to track any kind of custom event you wish; it could be a button click, a cart completion, or some other app-specific behavior. For this example, we'll fire track for a button click (normally you'd fire a track call upon the actual "click" of the button, but we are just going to execute it in the console). Run analytics.track('buttonClicked') in your console.

track function run in the console

Take a look at the payload returned. It should be familiar by now, but note that the anonymousId and userId are both present and unchanged.
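Outside the console, you'd fire track from the real click handler. A minimal hypothetical helper (the helper name and default event name are mine, not from the library):

```javascript
// Hypothetical helper: fire a track call whenever an element is clicked.
// Works with any object exposing addEventListener (e.g. a DOM element).
function trackClicks(el, analyticsInstance, eventName = 'buttonClicked') {
  el.addEventListener('click', () => analyticsInstance.track(eventName));
}
```

In the browser you'd call something like trackClicks(document.getElementById('signup-button'), analytics).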

We've fired all three of the major library functions successfully but those payloads aren't going to do us any good until we transport and store them in our data warehouse. Let's set up the warehouse next.

Event Storage

BigQuery (data warehouse)

There are a ton of different ways to configure your DWH. This post is just going to walk through a basic, 4-step setup.

1. If you don't have one, sign up for a Google Cloud account.

2. Head to your console and select BigQuery from the sidebar.

3. Click create project.

4. Name your project and click create.

You'll land here:

You now have a BigQuery data warehouse. We have just taken care of our Event Storage, now let's move on to Event Transport.

Event Transport

Stitch

There are a ton of different ways to configure Stitch. This post is just going to walk through a basic setup (8 steps).

1. Sign up for a free Stitch account.

2. We'll start by adding our first data source which Stitch calls an "Integration". Our first and only Integration will be a webhook we can pass our data to.

3. When you configure the webhook be aware that the name you choose will be what shows up in BigQuery (or your chosen DWH). I'm using a side project for this demo named Artemis so I'm calling mine "artemis_analytics". Primary key can be blank. Click save.

4. After you click save you'll be presented with your Webhook URL. Copy and paste this somewhere safe for now because you cannot retrieve it later (though you can revoke it and generate a new one if you need to). I've blocked mine out in the image below because this URL has sensitive information in it. This is one of the reasons we will be POSTing to it safely from our Lambda.

5. We have our first Integration set up; now let's finish by setting up a Destination. A Destination is where we want Stitch to load the data it receives via the Integration, and ours will be our data warehouse, BigQuery.

6. Follow the instructions on this page to create a service account and generate/upload your key file.

7. For our loading behavior, we want "Append".

8. Click "Check and Save" and Stitch will test the connection. Once it goes through, you're all set up. At this point we have a webhook we can send data to that will load directly into our data warehouse. However, as I mentioned a few steps prior, the Stitch webhook is not secure. We don't want to expose it in the client-side JavaScript we'll be using to capture events. With that in mind, we are now going to stand up some light AWS infrastructure to secure that webhook.

AWS Lambda & Gateway

Follow the 6 steps below to configure your AWS Lambda with an AWS Gateway as the trigger.

1. If you don't have one already, sign up for a free AWS account.

2. Because AWS is so expansive and horrifying to navigate, search for "Lambda".

3. Create a new function.

4. Choose "Author from Scratch", name your function (I called mine postToStitch) and choose the language you want to write your function in. If you want to use the code in this post, choose Ruby 2.7, otherwise choose whatever language you prefer. Click "Create function".

5. You'll land here which is the command center for your Lambda function. Let's set up our API Gateway (endpoint) that will trigger the Lambda code. Click "Add trigger".

6. Select "API Gateway" in the first dropdown, select "Create an API" in the second dropdown, under API type select REST API, and for now set the security to Open. Click "Add".

Now your Lambda command center should look like the image below. You can click on the API Gateway and Lambda blocks in the designer to access the configuration for each below. The API Gateway should be selected by default but if not, click on it. You should see your API endpoint at the bottom (if not, click "details" to expand the interface and expose your API endpoint). Click on your API endpoint.

By visiting your API endpoint you've triggered the code in your Lambda function. A new browser window should have opened and displayed the response "Hello from Lambda!".

This is the API endpoint that the GetAnalytics script needs to POST to (remember up above when we were setting up GetAnalytics and we were seeing an error in the console?). Let's update that now. Go back to your GetAnalytics script and replace both instances of https://will-add-our-api-endpoint-here with your API endpoint from your AWS Gateway.

Our last step will be to update our Lambda code.

Before proceeding, you should consider setting up CloudWatch logs for your API endpoint; this is a great, quick walkthrough. Logging will be particularly helpful in tweaking your Lambda code to suit the request that's being made from your website (and ultimately the GetAnalytics script). That said, if you are just following along to see how this all works first, the code provided should "just work".

Lambda Code

Now click on your Lambda function to expose the Lambda code. Let's update our Lambda code so that instead of just returning a response ("Hello from Lambda!"), it sends the data it receives to our Stitch webhook.

Our Lambda code needs to receive the POST request made to our Gateway API endpoint (from our site/app), parse out the body of the request and make a POST request to our Stitch webhook.

1. Copy/paste the code below into your Lambda (replacing all of the default code).

2. Replace https://your-stitch-webhook with the Stitch webhook URL you copied to a safe place when we set up our Stitch webhook above.

3. Hit "Save".
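The handler only needs to do three things: read the body out of the API Gateway proxy event, log it, and POST it on to Stitch. The post's original code isn't reproduced here, so treat this as a hedged Ruby 2.7 sketch of that shape — the constant name and parse_event_body helper are mine, and the webhook URL is the same placeholder you'll replace.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Replace with the Stitch webhook URL you saved earlier.
STITCH_WEBHOOK_URL = 'https://your-stitch-webhook'.freeze

# API Gateway (proxy integration) hands us the request body as a JSON string.
def parse_event_body(event)
  JSON.parse(event['body'])
end

def lambda_handler(event:, context:)
  payload = parse_event_body(event)
  puts payload.to_json # shows up in CloudWatch for debugging

  # Forward the event payload to Stitch.
  response = Net::HTTP.post(
    URI(STITCH_WEBHOOK_URL),
    payload.to_json,
    'Content-Type' => 'application/json'
  )

  { statusCode: response.code.to_i, body: JSON.generate(status: 'forwarded') }
end
```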

Test it all out

You should now go back to your site where you have GetAnalytics installed, open up the console in dev tools, and hit refresh. You should see the page payload (because we have console.log in our code for debugging).

Jump over to your Stitch dashboard for your webhook and you should see data being extracted/prepared/loaded into your DWH (it can sometimes take a few minutes to start).

Stitch extracting data hitting the webhook

After a few rows have been loaded, jump over to your DWH and take a look at your new tables.

Congratulations, you now own your own event tracking!

Now what?

A couple of cool things you could look into now: