With a core value of focusing on the customer, Nelnet provides innovative educational services in loan servicing, payment processing, education planning, and asset management. These products and services help students and families plan, prepare, and pay for their education while making administrative and financial processes more efficient for schools and financial organizations. Nelnet is headquartered in Lincoln, Nebraska, and has more than 3,400 associates who serve customers throughout the education life cycle.
Nelnet’s IT department has multiple service tiers: intake, on-call, escalation, and product owners and architects. Ryan Regnier is an IT manager at Nelnet and is responsible for the tier 2 team, which is on call, escalates issues, and responds to outages as they arise. Managing a team of that nature means handling a large number of critical incident alerts and determining how incidents are escalated to other service tiers. For these reasons, both Ryan and the company were in search of a solution that could help simplify these processes.
Overcoming manual processes to manage on-call scheduling and incident escalation
Nelnet monitors everything from web servers that process credit card payments to network devices that transmit traffic to web and database servers. The organization monitors 35,000 events at a given time, so alerts fire at all hours of the day. Before Nelnet implemented PagerDuty, managing on-call scheduling and escalations was a challenge because the existing processes were manual. If any app went down, Network Operations Center (NOC) team members had to sift through pages of spreadsheets to identify whom to contact. The spreadsheets outlined what to do when there was an incident alert, who to escalate the issue to, and how to react to each individual incident. This manual process didn’t scale easily, made it difficult for teams to work efficiently, and added time to each outage. That hurt both customers and Nelnet: if the core payment processing site was down, customers couldn’t make payments, resulting in lost revenue and customer dissatisfaction.
Knowing who to contact during an incident was also an issue; even with a 24/7 NOC team, the wrong people were being contacted, and at the wrong time. Not only did this create frustration, but there was also no way to automate or customize how alerts came through. All of these obstacles resulted in delayed incident resolution, customers unable to make payments, and decreased productivity due to the lengthy and complex manual process.
Increasing operational efficiency and reducing costs
Nelnet adopted PagerDuty to minimize the challenges around scheduling, alerting, and on-call escalations, and to lower costs. One area where the company reduced costs was the NOC team. With PagerDuty’s automated and reliable incident management platform, Nelnet no longer needed to pay for a 24/7 NOC environment. “Before we brought in PagerDuty, we were looking for ways to cut costs and improve our incident response management. The PagerDuty solution has proven to be the right one for Nelnet. PagerDuty makes life easy,” said Regnier.
An estimated 35,000 incidents are generated through Nelnet’s monitoring tools. These incidents, generated from file transfers and external websites, including those hosted on Amazon Web Services, are sent directly to PagerDuty. The on-call and escalation team typically handles issues that come from any of Nelnet’s servers or services. PagerDuty alerts those on call about the issue within seconds. This allows the on-call contacts to figure out what the problem is, escalate the issue if needed, and resolve it.
Currently, Nelnet has 80 escalation policies, which are used multiple times each day. In one example, a large incident arose that required help from multiple teams. The incident management team logged into PagerDuty to send an email alerting the appropriate people about the issue. The solution then allowed the people on call to contact those individuals directly rather than blasting the notice out to everyone on those teams. Everyone involved joined the incident call except for one person, who was called every 5 minutes until the escalation policy kicked in after 20 minutes. Thanks to the escalation feature, the backup responder was able to acknowledge the alert and help get the issue resolved.
PagerDuty gives Nelnet the flexibility to contact users in a number of ways, including the option to receive alerts via text or email. “PagerDuty makes my team’s lives easier and provides us with more structure. When finding a replacement for someone on-call, the solution provides that person with the option of being contacted in a variety of ways,” said Regnier. Nelnet is able to get services back up and running more quickly, so its customers can keep using the services and the business keeps moving. “During the day we have people on-call who can respond to a server that has gone down within minutes of it happening. Depending on the complexity or nature of the problem, we can have it back up in 10 minutes or less. We know about these alerts within seconds and can respond to them within minutes,” stated Regnier. With increased uptime and employee productivity, PagerDuty has saved Nelnet $650,000 annually.
Improving uptime, agility, and employee satisfaction
Before PagerDuty, Nelnet had little way of tracking outages. Now, the team has critical data at its fingertips: any incident or triggered item from up to a year back can be reviewed. “When we were evaluating PagerDuty, we found there weren’t other organizations that had such a complete product offering, or feature set, and they weren’t as easy to use,” said Regnier. PagerDuty helps Nelnet increase uptime and employee productivity, provide teams with flexibility, and ensure that incidents are always addressed.
“I would encourage everyone to consider PagerDuty. The cost savings can’t be overlooked. With PagerDuty, the person on-call is conveniently alerted with each incident. There is so much flexibility with scheduling and alerting the right people, it’s a simple decision to use PagerDuty.”