PagerDuty Blog

All, None, and One: The SQL Left Join Trick

Introduction

PagerDuty released Multi-User Alerting in early 2014, which allowed notifying and assigning multiple people when an incident is triggered. In addition to assigning multiple users to an incident, multi-user alerting also makes it possible for an incident to have multiple acknowledgers. This post will demonstrate the changes we made to our data model to implement multi-user alerting and the resulting sophistication added to our SQL queries to maintain their performance.

Data Model Migrations

The logic and data model behind our incident process exists as part of a Rails app. Before the migration, our data model looked like:

class Incident < ActiveRecord::Base
  has_one :acknowledging_user
  has_one :assigned_user

  # Time objects
  acknowledged_at
  resolved_at
end

After multi-user alerting rolled out, the data model looked like:

class Incident < ActiveRecord::Base
  has_many :acknowledgements, class_name: :IncidentsAcknowledgement, dependent: :delete_all, inverse_of: :incident, order: :created_at
  has_many :acknowledging_users, through: :acknowledgements, source: :user

  has_many :assignments, class_name: :IncidentsAssignment, dependent: :destroy, inverse_of: :incident
  has_many :assigned_to_users, through: :assignments, source: :user

  # Time object, acknowledged_at moved to IncidentsAcknowledgement.created_at
  resolved_at
end

where IncidentsAcknowledgement and IncidentsAssignment are classic join tables:

class IncidentsAcknowledgement < ActiveRecord::Base
  belongs_to :incident, inverse_of: :acknowledgements
  belongs_to :user
end

class IncidentsAssignment < ActiveRecord::Base
  belongs_to :incident, inverse_of: :assignments
  belongs_to :user
end

At the surface, this type of migration seems straightforward: A 1:1 relationship becomes a 1:many relationship. A total cake walk straight out of Data Modeling 101, right?

Not really.

Some of the most common queries PagerDuty performs require finding all incidents for an account with a given state: triggered, acknowledged, or resolved. We’ll discuss how we migrated each of these queries, starting with the easiest.

Resolved Incidents

The query for resolved incidents did not change at all with Multi-User Alerting. An incident can only be resolved once, with or without Multi-User Alerting. That leads to the following scopes:

  scope :resolved, where.not(resolved_at: nil)
  scope :unresolved, where(resolved_at: nil)

Triggered Incidents

Before Multi-User Alerting, the triggered scope was:

  scope :triggered, unresolved.where(acknowledged_at: nil)

A triggered incident was merely one where both acknowledged_at and resolved_at were nil. Under Multi-User Alerting, an incident can have multiple acknowledgements, so this scope needed to be changed. We needed a query to find unresolved incidents that had exactly zero acknowledgements.

In SQL, how does one find if a join table has exactly zero associations? LEFT JOIN of course!

As a refresher, a left join guarantees that the result set contains all of the rows from the original table.

For example, consider the following two incidents, where one has some acknowledgements:

# Incidents
id | description     | resolved_at
----------------------------------
1  | CPU high        | NULL
2  | Server on fire  | NULL

# Acknowledgements
id   | incident_id  | user_id
-----------------------------
100  | 2            | 9875
101  | 2            | 9876

Consider the following SQL query that uses left join:

SELECT *
FROM incidents
LEFT OUTER JOIN acknowledgements on acknowledgements.incident_id = incidents.id

The result set of the above query is:

id | description     | resolved_at | id   | incident_id | user_id
-----------------------------------------------------------------
1  | CPU high        | NULL        | NULL | NULL        | NULL
2  | Server on fire  | NULL        | 100  | 2           | 9875
2  | Server on fire  | NULL        | 101  | 2           | 9876

Incident 1 (with zero acknowledgements) is the one where acknowledgements.incident_id IS NULL. That leads to the following straightforward scope definition:

  scope :triggered,
    arel_left_join(:acknowledgements).
    where(incidents_acknowledgements: {incident_id: nil}).
    unresolved

where arel_left_join is a helper defined as:

  def self.arel_join(association, join_type)
    incident_t = arel_table
    association_model = reflect_on_association(association).klass
    association_t = association_model.arel_table
    predicate = association_t[:incident_id].eq(incident_t[:id])
    joins(incident_t.join(association_t, join_type).on(predicate).join_sources)
  end

  def self.arel_inner_join(association)
    arel_join(association, Arel::Nodes::InnerJoin)
  end

  def self.arel_left_join(association)
    arel_join(association, Arel::Nodes::OuterJoin)
  end

This new triggered scope translates to the following SQL

SELECT `incidents.*`
FROM incidents
LEFT OUTER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (acknowledgements.incident_id IS NULL) AND (resolved_at IS NULL)

Acknowledged Incidents

Before Multi-User Alerting, the scope for acknowledged incidents was:

  scope :acknowledged, unresolved.where("acknowledged_at IS NOT NULL")

Extending this to support multiple acknowledgements was similar to how triggered was extended. In this case, we are looking for unresolved incidents with at least one entry in the acknowledgements table. Instead of using LEFT JOIN with acknowledgements.incident_id IS NOT NULL, we can simply use a plain old INNER JOIN.

Unfortunately, naively using INNER JOIN as we did with triggered incidents is incorrect. To illustrate, consider the following implementation of acknowledged:

  # Naive and very broken
  scope :acknowledged,
    arel_inner_join(:acknowledgements).
    unresolved

To see why this is incorrect, consider the generated SQL:

SELECT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)

as well as our sample incident data from before:

# Incidents
id | description     | resolved_at
----------------------------------
2  | Server on fire  | NULL

# Acknowledgements
id   | incident_id  | user_id
-----------------------------
100  | 2            | 9875
101  | 2            | 9876

The result set will be:

id | description     | resolved_at | id   | incident_id | user_id
-----------------------------------------------------------------
2  | Server on fire  | NULL        | 100  | 2           | 9875
2  | Server on fire  | NULL        | 101  | 2           | 9876

The result set has each incident occurring once for each acknowledgement. However, we want an incident to occur exactly once if it has at least one acknowledgement, so we need a way to remove these duplicate incidents. It turns out that efficiently removing duplicates in SQL is not as trivial as it seems.

SQL DISTINCT / Rails uniq

The most straightforward approach is to use .uniq in Rails:

  # Naive and very broken
  scope :acknowledged,
    arel_inner_join(:acknowledgements).
    unresolved.
    uniq

which translates to the following SQL:

SELECT DISTINCT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)

Unfortunately, the above query is not very performant. For accounts with a large number of acknowledged incidents, we found this query was 2 to 3 times slower than the straightforward (though incorrect) query. The explain plan for the above query showed Using temporary, which indicates that MySQL was using a temporary table to compute the distinct rows. We suspect that a large part of the slowdown is based on MySQL copying all of the columns of each row to the temporary table for deduplication.

Use GROUP BY

In suspecting that the slowdown was caused by copying all of the columns (DISTINCT incidents.*), we tried to find a way to have DISTINCT run on just one column. GROUP BY can be used exactly for this purpose, so we attempted the following query as a starting point:

SELECT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)
GROUP BY incident.id

For accounts with large numbers of acknowledged incidents, this query was significantly faster than using DISTINCT. Furthermore, the explain plan for this query did not contain Using temporary. Unfortunately, this query too had one problem due to the implementation of .count in Rails.

The Rails scope that generated the previous query is:

  # This works, but breaks other things
  scope :acknowledged,
    arel_inner_join(:acknowledgements).
    unresolved.
    group('incident.id')

GROUP BY affects the results of aggregate functions, specifically COUNT. In Rails, to count the total number of incidents from a scope, you will use count (e.g., Incident.acknowledged.count). However, with group, .count returns the count per group, not the total count. In other words,
instead of getting an integer returned:

Incident.acknowledged.count => 7

the return value is a map of incident id to count for that incident

Incident.acknowledged.count => {19 => 1, 20 => 1, 21 => 1}

Lots of places in our app assume that calling count returns an Integer, so changing that behaviour was not desirable. Wrapping the GROUP BY query in its own subquery restored the correct behaviour of count, but that killed the performance gains since a temporary table was again required. So close, yet so far.

Use a second LEFT JOIN

The crux of the initial problem of joining acknowledgements directly against incidents was that a single incident could have multiple acknowledgements, leading to duplicate incidents in the joined result.

What if there was a way to select a single exemplar acknowledgement for each incident? The MySQL manual describes a technique using LEFT JOIN for finding the group-wise maximum of a certain column. Using that trick with our query, we can find the acknowledgement with the max id for a given incident. We finally arrive at the following query:

SELECT incidents.* FROM incidents
INNER JOIN acknowledgements ON acknowledgements.incident_id = incidents.id
LEFT OUTER JOIN acknowledgements AS acknowledgements_uniq ON
  acknowledgements.incident_id = acknowledgements_uniq.incident_id AND
  acknowledgements.id < acknowledgements_uniq.id
WHERE (incidents.resolved_at IS NULL) AND (acknowledgements_uniq.id IS NULL)

To illustrate why this works, consider the following incident with four acknowledgements:

# Incidents
id | description     | resolved_at
----------------------------------
3  | Server inferno  | NULL

# Acknowledgements
id   | incident_id  | user_id
-----------------------------
200  | 3            | 9875
201  | 3            | 9876
202  | 3            | 9874
203  | 3            | 9873

and the following query without the acknowledgements_uniq.id IS NULL clause:

SELECT incident.id, acknowledgement.id, acknowledgements_uniq.id FROM incidents
INNER JOIN acknowledgements ON acknowledgements.incident_id = incidents.id
LEFT OUTER JOIN acknowledgements AS acknowledgements_uniq ON
  acknowledgements.incident_id = acknowledgements_uniq.incident_id AND
  acknowledgements.id < acknowledgements_uniq.id
WHERE (incidents.resolved_at IS NULL)

The result is:

incident.id | acknowledgement.id | acknowledgements_uniq.id |
-------------------------------------------------------------
3           | 200                | 201                      |
3           | 200                | 202                      |
3           | 200                | 203                      |
3           | 201                | 202                      |
3           | 201                | 203                      |
3           | 202                | 203                      |
3           | 203                | NULL                     |

From the MySQL manual, the LEFT OUTER JOIN works on the basis that when acknowledgements.id is at its maximum value, there is no acknowledgements_uniq.id with a greater value and the acknowledgements_uniq rows values will be NULL. Nick Kallen, author of Arel, calls this technique “one of the great mind-blowing SQL queries”.

The fantastic thing about this query is that it is just as performant as the GROUP BY version, and does not require the use of a temporary table. In addition, this query does not adversely affect count in Rails. Perfect!

The final scope for Incident.acknowledged is:

  def left_join_as_distinct(association)
    association_model = reflect_on_association(association).klass
    table = association_model.arel_table
    table_alias = table.alias("#{table.name}_uniq")

    predicate = table[:incident_id].eq(table_alias[:incident_id])
    predicate = predicate.and(table[:id].lt(table_alias[:id]))
    join_sources = table.join(table_alias, Arel::Nodes::OuterJoin).on(predicate).join_sources

    joins(join_sources).where(table_alias[:id].eq(nil))
  end

  scope :acknowledged,
    arel_inner_join(:acknowledgements).
    left_join_as_distinct(:acknowledgements).
    unresolved

Conclusion

SQL’s left join operator turned out to be a valuable tool in writing correct and efficient queries for multi-user alerting. Its usage was typical for the triggered incidents scope, which involved finding incidents with zero acknowledgements. Left join additionally proved versatile for the acknowledged incidents scope, where it was used as an efficient proxy for DISTINCT to find unique incidents with at least one acknowledgement.

Deriving efficient queries involved profiling the original SQL queries, understanding the behaviour of Percona’s query optimizer, and experimenting with alternative queries. If challenges problems like these excite you, PagerDuty is hiring.

Thanks to Steve Rice and Evan Gilman for reviewing drafts of this article.

Monitoring_Ebook_728_90