Brightcove: Providing Reliable Video Solutions With PagerDuty

Size: 250+

Industry: Technology

Location: Boston, MA

Customer Since: 2/2010

Brightcove is a leading global provider of cloud services for video. The company offers products that revolutionize the way organizations deliver video experiences, including Video Cloud, the market-leading online video platform, and Zencoder, a leading cloud-based media processing service and HTML5 video player technology provider. More than 6,300 customers in over 70 countries rely on Brightcove cloud content services to build and operate video experiences across PCs, smartphones, tablets and connected TVs.

Early Brightcove Challenges

Three years ago, Brightcove embraced a DevOps model to give their engineers ownership over the design, production and support phases of their code. Building high-quality software faster is the goal of DevOps, but most DevOps transitions fail due to a lack of cultural and technological change to support the shift. Brightcove took the first cultural step toward this model by expanding on-call to developers. It made sense that the people who built the code would be the ones to fix issues when they occurred. Brightcove’s strong team-oriented culture allowed employees to choose their own on-call rotations, but these schedules became difficult to manage as new team members were added and frequent changes were needed. The schedules also did not solve the problem of managing incident life-cycles. It was unclear who was working on an incident and what stage it had reached. As a result, issues were escalated prematurely to other teammates or managers.

“We lacked visibility into incident ownership, which impacted our efficiency.” – Brian Sensale, Senior Engineering Manager

All on-call engineers received alerts through BlackBerry devices that were synced with their monitoring tools. The devices were rotated among team members, but this approach was cumbersome and error-prone, and it limited the participation of those out of range. There was also no way to escalate an issue to another team member if an alert was missed. As their teams grew to span three global offices, they needed to figure out how to share on-call responsibility fairly.

“We needed a solution that would match our flexible work environment. Exchanging a physical device did not scale.”

Brightcove needed to make the logistics of on-call rotation simpler and move scheduling responsibility directly to the on-call engineers. After all, if engineers have to take on on-call duty, they should have a say in when they are on call and how they are notified. To fully transition to a DevOps model, they needed the technology to accompany their cultural shift.

Giving Control To On-call Engineers

Having a balanced lifestyle is important to Brightcove’s engineers. When team members want to go on vacation during a time they’re scheduled to be on-call, they work together to find substitutes. By using PagerDuty for on-call scheduling, changes can be made without any hassle.

“After we horse trade on-call duties, it’s a breeze to update schedules in PagerDuty.”

Thanks to PagerDuty, alerts missed because of unreliable BlackBerry notifications are no longer a problem at Brightcove. Engineers can now control how they are notified. Some teammates live in places with bad cell coverage, so they have added their home phone to their alert policy. All on-call engineers can now choose the alert settings that are most effective for them. If the primary on-call engineer misses an alert, the secondary engineer is alerted. With this natural flow of escalation, PagerDuty acts as a safety net for Brightcove incidents and ensures that problems are addressed quickly.
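The escalation flow described above can be sketched in a few lines of Python. This is only an illustrative model, not PagerDuty's API or Brightcove's configuration: the `Responder` class, the `escalate` function, the contact method names and the `acknowledges` flag are all hypothetical and exist only to show how an alert moves from the primary engineer to the secondary one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Responder:
    """Hypothetical on-call engineer with self-chosen contact methods."""
    name: str
    contact_methods: List[str]   # e.g. "sms", "home_phone"; chosen by the engineer
    acknowledges: bool = False   # simulates whether this person responds


def escalate(incident: str, chain: List[Responder]) -> Tuple[Optional[str], List[str]]:
    """Notify each responder in order and stop at the first acknowledgment."""
    log: List[str] = []
    for level, responder in enumerate(chain, start=1):
        # Try every contact method the responder has configured.
        for method in responder.contact_methods:
            log.append(f"level {level}: notify {responder.name} via {method}")
        if responder.acknowledges:
            log.append(f"{responder.name} acknowledged '{incident}'")
            return responder.name, log
    # Nobody in the chain responded.
    log.append(f"'{incident}' unacknowledged after {len(chain)} levels")
    return None, log


# The primary misses the alert; the secondary (reachable on a home phone) picks it up.
primary = Responder("primary", ["push", "sms"], acknowledges=False)
secondary = Responder("secondary", ["sms", "home_phone"], acknowledges=True)
owner, log = escalate("API latency high", [primary, secondary])
for line in log:
    print(line)
```

In this sketch the incident ends up owned by the secondary responder, which mirrors the safety-net behavior the team describes.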

“I can’t imagine life without PagerDuty. Having multiple alerting methods and escalations are no-brainers.”

To increase incident visibility, Brightcove has integrated PagerDuty with HipChat so everyone can track the lifecycle of an incident. There is no more scrambling to find out who is handling an incident and whether it has been resolved, and other teams can jump in to help if needed. Managers are no longer mistakenly alerted for low-severity issues, yet they can be brought in quickly when there is a larger issue.

“We have less of a fire drill with PagerDuty. We now know if an incident is being handled and by whom. It is a stress reliever.”

With PagerDuty, Brightcove has the technology to support their DevOps shift and to deliver a high-quality, highly reliable service for their customers.