PagerDuty Blog

Incident Management for Media and Entertainment

In the entertainment world, building enterprise apps involves many challenges, such as compatibility with numerous devices and large files like HD videos, along with streaming media to millions of users simultaneously. But today’s entertainment apps are possible only because of a modern approach to software delivery—DevOps brings greater efficiency across the development pipeline.

Of course, DevOps doesn’t solve all problems. The number of issues increases along with the frequency of releases and the number of devices supported. To meet this challenge, organizations need incident management. Incident management is not just for issues in production, but for every stage of the application lifecycle.

Let’s look at how some of the leading entertainment companies are pioneering the DevOps approach and integrating incident management into every stage of their operations.

Microservices Is More Complex

Netflix is a big proponent of the DevOps approach, which naturally means it has adopted a microservices architecture. The key advantages of microservices that Netflix cites are velocity and reliability. Its developers are able to work independently and ship releases much more frequently. Also, because they can fix most issues without an external team intervening, they can build reliability into the system.

That said, microservices have their own unique challenges. With so many services needing to communicate with each other, failures are common. It’s a challenge to manage dependencies, especially when it comes to keeping track of version changes across those dependencies. One solution that Netflix is trying is to follow a “monorepo model,” where the source code for the organization lives in a single repository. This brings consistency across teams and services.

However, the challenge is giving teams flexibility to innovate in unique ways. To deal with this, Netflix is attempting to provide faster feedback to publishers of services. Doing so would inform Netflix which downstream services have broken because of a release, thus bringing more responsibility to the owner of the code to fix issues faster. Netflix admits that it hasn’t figured it all out, but nonetheless, it is considering all aspects of the microservices model as it designs solutions that work for each team and for the organization as a whole.
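To make the idea of faster feedback concrete, here is a minimal sketch of how a publisher could work out which downstream services to check after a release. It is not Netflix's tooling; the service names and the `depends_on` mapping are hypothetical, and the sketch assumes the dependency graph is already known.

```python
from collections import defaultdict, deque


def downstream_of(released, depends_on):
    """Return every service that depends on `released`, directly or transitively.

    `depends_on` maps each service name to the set of services it calls.
    """
    # Invert the graph: for each dependency, record who consumes it.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk from the released service through its consumers.
    seen = set()
    queue = deque([released])
    while queue:
        current = queue.popleft()
        for consumer in dependents.get(current, ()):
            if consumer != released and consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


# Hypothetical dependency map, for illustration only.
depends_on = {
    "playback": {"catalog"},
    "recommendations": {"catalog", "profiles"},
    "homepage": {"recommendations"},
    "billing": {"profiles"},
}
affected = downstream_of("catalog", depends_on)
```

The owner of the released service can then run health checks or contract tests against just the services in `affected` and be notified of any that fail.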

Designing for Security

SoundCloud has an app for Xbox One, but the sign-in experience has been very cumbersome, as it’s hard to enter text using an Xbox One controller. To overcome this challenge, SoundCloud used an authentication process that generates a token on another device, such as a smartphone, and lets the Xbox One grant access to the user once this token is activated. This method is similar to two-factor authentication and is also used by Google to connect its apps on smart TVs and Chromecast devices.

SoundCloud had to take steps to ensure the process is bulletproof and not prone to attacks. This included using clearly recognizable language, as well as warnings if the token entered looked suspicious.
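A token-based sign-in like this can be sketched in a few lines. The example below is not SoundCloud's implementation. It assumes one common variant of the pattern, in which the service issues a short-lived code, the user confirms it from a device where they are already signed in, and the console polls until the code is activated. The class name, code length, alphabet, and five-minute lifetime are all illustrative assumptions.

```python
import secrets
import time

# Assumed settings: the alphabet omits look-alike characters such as 0/O and 1/I.
CODE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 6
CODE_TTL_SECONDS = 300


class PairingService:
    """In-memory stand-in for a server-side pairing store (illustrative only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._pending = {}  # code -> {"expires": timestamp, "user": user id or None}

    def issue_code(self):
        """Create a short-lived, unguessable code for one sign-in attempt."""
        code = "".join(secrets.choice(CODE_ALPHABET) for _ in range(CODE_LENGTH))
        self._pending[code] = {"expires": self._clock() + CODE_TTL_SECONDS, "user": None}
        return code

    def activate(self, code, user_id):
        """Called from the second device, where the user is already signed in."""
        entry = self._pending.get(code.strip().upper())
        if entry is None or entry["expires"] < self._clock():
            return False  # unknown or expired code: reject and warn the user
        entry["user"] = user_id
        return True

    def poll(self, code):
        """Called by the console; returns the user id once activated, else None."""
        entry = self._pending.get(code)
        if entry is None:
            return None
        if entry["expires"] < self._clock():
            del self._pending[code]
            return None
        if entry["user"] is None:
            return None
        del self._pending[code]  # codes are single use
        return entry["user"]
```

Short expiry and single-use codes limit how long a leaked or guessed code is useful, which is one way to reduce the attack surface the paragraph above describes.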

When building entertainment apps, there are many minute details to consider, such as the device type, its functionality, and how to build features that are both secure and easy to use. Apps on these devices are easy to orphan because they’re rarely the highest priority. However, if your app is built for a device or platform, you’ll need to go all the way to make sure it is fully compatible. That’s what SoundCloud demonstrates with its Xbox One app.

Proactive QA

Circling back to Netflix: it is one of the most prolific contributors to the open-source community. When it comes to user experience on its apps, Netflix creates unique solutions and open-sources them for the rest of the world to use. One recent example is Simone, a tool for simulated testing on various types of devices.

Netflix ships apps on many new devices like smart TVs and mobile devices, and its apps need to be certified to work with all of them. This translates into a lot of repetitive tasks to check for compatibility issues between the app and the device. Netflix uses Simone to run various simulations that are specific to each device the app is being tested on. It automates configuration, deployment, and execution of simulations.

Quality is vital to the success of an entertainment app. And to ensure quality, you need a mature QA process. Whether this is with a custom tool like Simone, or a more generic one like Selenium or Appium, testing your apps on the devices and platforms they run on is essential. QA and incident management go together, because the best time to catch a bug is early on before it makes it into production. You’d rather have a process that proactively stops incidents from occurring than be forced to firefight them post-release.

Phased Releases

Let’s take a look at Spotify, which relies heavily on Docker containers and uses its own container orchestrator, Helios, to manage them. However, the more Spotify depended on Docker, the more issues related to Docker containers affected the user experience, particularly when updating Docker from an older to a newer version.

Spotify found that many containers still held on to their original ports from the previous version, and traffic was not being routed to these containers. Spotify’s solution was to build a tool called Tsunami that could implement gradual rollouts over a span of two weeks. During this phase, Tsunami instructed all Docker containers on which version of Docker they should be running. Tsunami applied the rollout to a subset of users at a time, and as errors were found, the team could easily handle them during the two-week period.
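A gradual rollout of this kind can be sketched as follows. This is not how Tsunami itself works. It is a minimal illustration that assumes a linear ramp across the two-week window and a stable hash that decides when each target switches over. The function names and the notion of a `host_id` are hypothetical.

```python
import hashlib

ROLLOUT_DAYS = 14  # the two-week window described above


def rollout_fraction(days_elapsed, total_days=ROLLOUT_DAYS):
    """Fraction of targets that should be on the new version (linear ramp)."""
    if days_elapsed <= 0:
        return 0.0
    return min(1.0, days_elapsed / total_days)


def host_bucket(host_id):
    """Map a host to a stable value in [0, 1) so its rollout slot never changes."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Keep the top 53 bits so the result is exactly representable and below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def target_version(host_id, days_elapsed, old, new):
    """Version the given host should run on the given day of the rollout."""
    if host_bucket(host_id) < rollout_fraction(days_elapsed):
        return new
    return old
```

Because each bucket is derived from the host's identifier rather than chosen at random each day, a host that has moved to the new version stays there as the ramp advances. Pausing the rollout when an incident occurs is then just a matter of holding `days_elapsed` constant.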

Tsunami made rollouts a lot less stressful and improved the user experience during times of change. With continuous delivery, releases are much more frequent, which makes it all the more important to limit any damage they can cause. Phasing a release in over time is a great way to ensure high availability. During this phase, incident management plays a key role, as the transition needs to be monitored and any problems responded to. Incidents need to be fixed quickly for the transition to be successful.

Building applications in the entertainment space is full of challenges and obstacles at every stage. However, by adopting a DevOps approach and developing a strong incident management workflow, you can deliver an outstanding user experience and enjoy the process as much as your end users enjoy using your app.