August 24, 2017
Hiring software engineers is hard. We all know this. If you get past the problem of sourcing and landing good candidates (which is hard in itself), the whole issue of “is this person I’m talking to ‘good enough’ to actually work here?” is a very difficult nut to crack. Again, we all know this. There has been much ink spilled, or, I suppose, many liquid crystal molecules twisted on the subject.
I’ve been thinking a lot about why hiring is so difficult, and to do so I’ve started to think about it as a machine learning problem. (Some background on myself: in a past life, I worked in Amazon.com’s Fraud department, and wrote some substantial pieces of the complex machine learning systems they use to pick the fraudulent/bad orders out of the millions of orders placed each day. I’ve also interviewed somewhere on the order of a few hundred people over the years, and am still actively doing so for PagerDuty.)
So why look at hiring as a machine learning problem? Well, it’s not that I (necessarily?) want to start using computers to help solve this hard problem for me. It’s just that ML is a very formal way of teaching computers (which are very formal systems) about how to become good at something; how to become an expert in a very narrow and focused field. Figuring out why hiring would be a hard machine learning problem will help me figure out why it’s such a hard human problem.
So let’s get started, and dive into some machine learning basics as we go along.
When interviewing software engineering candidates, we’re basically trying to sort applicants into two buckets: hirable engineers and non-hirable engineers. “Hirable” engineers are ones that we think have the skills, experience, and personality necessary to do well in a job at our company, and “non-hirable” engineers we don’t.
So, this is a classification problem, and one with only two output classes. We need to build a model or classifier – in our minds, I suppose – that will take in some sort of input about the candidate being interviewed, crunch some numbers and/or think really hard, and output whether or not we should hire this person. I’ll talk about this classifier a bit later, but first let’s talk about the inputs we need.
An extremely important part (really the most important part) of building a classifier with machine learning is to have good data to work with, and to be able to shape that data so it can be used as inputs to efficiently and unambiguously train your classifier. We gather these inputs (or sets of what are confusingly called “features” in the ML world) both for training our classifier, and then again later when actually evaluating a software engineering candidate that we’re interviewing.
So what are these “features” we are gathering? For hiring, they’re mostly just the answers to the interview questions we ask the candidate. Depending on the question, this could be the algorithm they’ve defined, or the code they’ve written, or the system they’ve architected, or the snippet of computer science minutiae they’ve drummed up out of the recesses of their minds in order to answer your question. So, by the end of a couple of phone screens and maybe 5 in-person interviews, we’re talking about somewhere on the scale of 20-50 questions or so (depending on how granular the questioning gets during the individual interviews, and what you count as an interview “question”). So maybe a few dozen features per candidate. Not a ton to go on, really.
These interview questions tend to get re-used a lot. This is bad, in some ways: they can get “stale” after a while, and over the years the chances increase that a candidate has already been asked and has answered this question or a similar question posed by someone else. (Or the topic of the question itself might even become less relevant over time!) But question re-use is important too: how do you know how discriminative a question is unless you’ve employed it against a number of different people? How else do you know whether or not a wrong answer to the question is predictive of a “bad” engineer? As humans, we can use a degree of intuition here where machines probably couldn’t, but I couldn’t tell you how many times I’ve heard the following in an after-interview debrief: “He didn’t do very well on this question, but I’ve never asked it before, so I’m not sure if it’s too hard/obscure a question.”
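To make "how discriminative a question is" a bit more concrete, here is a minimal sketch. It assumes we somehow already know who turned out to be a good hire, and it scores a question as the pass rate among good hires minus the pass rate among bad ones. The function name and the toy data are made up for illustration; this is one simple measure, not the only one.

```python
def discrimination(answers, labels):
    """Pass rate among good hires minus pass rate among bad hires.

    answers[i] is True if candidate i answered the question well.
    labels[i] is True if candidate i turned out to be a good hire.
    Returns None when one of the two groups is empty, i.e. when the
    question has not been asked of enough different people to judge it.
    """
    good = [a for a, is_good in zip(answers, labels) if is_good]
    bad = [a for a, is_good in zip(answers, labels) if not is_good]
    if not good or not bad:
        return None
    return sum(good) / len(good) - sum(bad) / len(bad)


# Hypothetical history for a question asked of eight past candidates.
answers = [True, True, True, False, True, False, False, False]
labels = [True, True, True, True, False, False, False, False]
old_question = discrimination(answers, labels)

# A brand-new question, asked once: there is nothing to compare against.
new_question = discrimination([False], [False])
```

A score near 1 means good hires almost always get the question right and bad hires almost never do. A score near 0 means the question tells you very little. The `None` case is the debrief situation above: a question asked for the first time can't be scored at all.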
Ok, we now know what data to gather (our features, above); so how do we build our classifier so that it can properly predict who we should hire and who we shouldn’t?
Now, when doing actual machine learning, the classification algorithm used is a fairly important and pretty damn interesting part of the process. But the algorithm is also the part that is most connected to and dependent on the fact that ML is, you know, done on computers. For the purposes of our thought exercise, we only have our brains to work with to build this model, so let’s just ignore the intricacies of the classification algorithms out there. Let’s just say we’re using some sort of Neural Network to build our model here. After they’re built, those are about as opaque as a human mind anyway. 🙂
So how do we actually build this mental classifier? Well, we train it using a Training Set. This is essentially composed of the list of people who have answered your interview questions in the past. Each person interviewed counts as one “observation” in the set.
For beginning interviewers, this training set can be as small as one person: the interviewer themselves. They’ll mentally say, “I can answer these questions, so a good engineering candidate should be able to give the same answers to these questions.” This is pretty much only good for hiring people who are identical to you. (In ML terms, you’re overfitting your model.) Also, are you sure you would have been able to answer this question as well as you think you would have, had you not come up with the question yourself and thought through it in detail beforehand?
Unfortunately, even for experienced interviewers, the training set is still very small. On the order of a few hundred samples, at best. And, as we’ll see below, there could be a lot of noise in these samples.
For each sample in our training set (i.e., in a given interviewer’s interviewing history), we need to assign a label: “would have been a good hire” or “would not have been a good hire”. This is very difficult. Specifically, there is a LOT of error in our mental data here.
How do we create our labels? Well, we start with the initial prediction that our classifier came up with at the time; i.e. whether or not we decided to hire the person. And then we need to retroactively adjust these labels by establishing a feedback loop that helps our model learn from its past mistakes. That is, we need to correct for all the errors that it has made in the past. We need to figure out which decisions were false positives, and which were false negatives. How do we do that?
False negatives are the people we said “no” to, but who would actually have made good employees nevertheless. After turning someone down, it is virtually impossible to know how well they would have done as an employee at your company. You have very little future contact with these people and don’t interact with them as you would a co-worker. So it is really difficult to feed any false negative information back into your mental model in order to adjust your labels.
The best way to do this, and many good companies do, is to always allow people that you turn down to enter back into the interview process in the not-too-distant future. Say, 6 months down the road. We do this at PagerDuty, and out of our 11 or so engineers, 2 were actually turned down the first time they were interviewed!
Still, this is only a trickle of false negative feedback, and there is quite a delay on being able to gather it.
False positives are the opposite type of error: people that we said “yes” to hiring, but we really shouldn’t have. Unless you are very lucky, you have met this sort of person at your work.
There are two categories of these false positives: those that ended up taking the job offer, and those that didn’t. For those that took the job offer, it is very easy to feed this information back into your mental hiring model, as you work with (and eventually fire) these people. For those that didn’t take the offer, however, you’re faced with the same problem as with false negatives: you don’t really know if they were a bad hiring decision or not.
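Here is a rough sketch of that feedback loop. Each past interview is reduced to two things: whether we made an offer, and what we later saw of the person on the job (or `None` if we never got to see it). The record layout, the function name and the sample history are all invented for illustration.

```python
from collections import Counter
from typing import Optional


def feedback_kind(offered: bool, worked_out: Optional[bool]) -> str:
    """Classify one past interview decision by what we later learned.

    worked_out is None when we never saw the person on the job: they
    declined our offer, or we turned them down and they never came back.
    """
    if worked_out is None:
        return "unknown"
    if offered:
        return "true positive" if worked_out else "false positive"
    # Turned down at first, but later re-interviewed and hired.
    return "false negative" if worked_out else "true negative"


# Hypothetical interviewing history: (offered, worked_out)
history = [
    (True, True), (True, True), (True, False),    # hired; we saw the outcome
    (True, None),                                 # offer declined
    (False, None), (False, None), (False, None),  # rejected, never seen again
    (False, True),                                # rejected, later hired, did well
]

summary = Counter(feedback_kind(offered, outcome) for offered, outcome in history)
```

In this toy history, half of the decisions end up as "unknown", and the only false negative we ever learn about is the one person who came back and re-interviewed.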
When setting your “hire” or “no hire” labels on your training set, it’s good practice to take into account the costs of making a wrong decision.
The costs of a false positive are quite large: hiring someone who isn’t a good fit for your company can have a large impact on the quality of your product, on the productivity of others on the team, on the timeline of projects, on morale, etc. It can be fairly disastrous in some cases.
The costs of a false negative are mostly opportunity costs: we missed out on hiring this great engineer. It is really hard to tell what the consequences of missing out on this hire would be: would this just delay us a couple weeks as we look for another good candidate, or did we just miss out on our Paul Buchheit?
Regardless of the magnitude of the costs of false negatives, they are certainly less visible than the costs of false positives. It is much easier to quantify the costs of false positives, and there is much less deviation in value. So interviewers almost universally tend to err on the side of caution, and favor a false negative over a false positive. We also hope that allowing candidates to re-enter the interview process down the line will help mitigate this problem.
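One way to see why asymmetric costs push us toward "no hire" is a small expected-cost calculation. Suppose (and this is a big assumption) that our classifier could output a probability p that the candidate is a good hire. Hiring costs us (1 - p) × cost_fp on average, and passing costs us p × cost_fn, so hiring is the cheaper choice only when p > cost_fp / (cost_fp + cost_fn). The function names and the cost figures below are illustrative, not measured.

```python
def hire_threshold(cost_fp: float, cost_fn: float) -> float:
    """Smallest probability of 'good hire' at which hiring beats passing.

    Derived from comparing expected costs:
        hire: (1 - p) * cost_fp    vs.    pass: p * cost_fn
    """
    return cost_fp / (cost_fp + cost_fn)


def should_hire(p_good: float, cost_fp: float, cost_fn: float) -> bool:
    """Hire only when the expected cost of hiring is strictly lower."""
    return p_good > hire_threshold(cost_fp, cost_fn)


# If a bad hire is assumed to cost four times as much as a missed good one,
# we need to be more than 80% sure before saying yes.
threshold = hire_threshold(cost_fp=4, cost_fn=1)
cautious_no = should_hire(0.7, cost_fp=4, cost_fn=1)
confident_yes = should_hire(0.9, cost_fp=4, cost_fn=1)
```

With equal costs the threshold is 0.5. The more a false positive costs relative to a false negative, the closer the threshold creeps to 1, which is the "err on the side of caution" behavior described above.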
Finally, there is our Testing Set. In machine learning, a testing set is a separate set of data (made up of both features and corresponding labels) that is completely independent of the training set. It is used after the classifier has been built (using that training set) to figure out if the constructed model is “good enough” to actually use. Using one keeps us from overfitting our model to match our training data too closely.
In practice, a machine learning scientist worth her salt would actually build several classifiers, and use the one that performs the best against the testing set. She would even employ a third set (a validation set) to make sure that, after the repeated iterations of creating + testing + refining models, she hasn’t actually overfit her model against the testing set too!
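As a sketch of what that three-way split might look like in code: shuffle the labeled observations once, then carve them into three non-overlapping pieces. The names follow the usage in this post (the testing set is for choosing among models, the validation set is the final check); be aware that many texts use those two names the other way around. The fractions and the function name are arbitrary choices for illustration.

```python
import random


def three_way_split(observations, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle once, then split into (training, testing, validation).

    Whatever is left after the training and testing slices becomes the
    validation set, so every observation lands in exactly one set.
    """
    items = list(observations)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = round(len(items) * train_frac)
    n_test = round(len(items) * test_frac)
    training = items[:n_train]
    testing = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return training, testing, validation


# Ten hypothetical past interviews, identified only by index.
training, testing, validation = three_way_split(range(10))
```

The important property is that the three sets never share an observation, so a model tuned on one set is always judged on data it has not seen.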
When creating our mental model of what makes a hirable engineer, a testing set is another luxury that we don’t have, and we end up with overfitting. In general, as we make more observations (give more interviews) we tend to just lump those observations into the training set.
So, going back to my original question: why is hiring so hard? Well, to sum it up: we’re gathering a small set of features, over an equally small number of observations in our training set, labeling our observations fairly inaccurately due to the difficulty in flagging false positives and false negatives, using all this to build an opaque classification model, overfitting like crazy due to the lack of a testing set, and in the end we lean kind of heavily towards not hiring the candidate due to the large costs of false positives.
Looks pretty bleak, doesn’t it?
So how do we end up doing a remotely decent job at hiring? Well, we have at least one thing going for us: we use a form of ensemble learning to make our final decision. With ensemble learning, you take a number of separate classifiers and use them together to make better predictions than you could have with any individual classifier. We do this in our debrief meetings after the interview loop has completed: the interviewers sit down together, go over all the questions and answers, and each makes an individual decision on whether or not to hire this candidate. Then we make the final hire/no-hire decision based on some sort of aggregation of all the votes.
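A simple majority vote is one such aggregation, and a little arithmetic shows why it helps. The sketch below makes a strong assumption: each interviewer is right with the same probability p, independently of the others. Real interviewers who talk to each other in a debrief are certainly not independent, so treat the number as an optimistic upper bound rather than a prediction. The function names are made up for this example.

```python
from math import comb


def majority_vote(votes):
    """Return True (hire) if strictly more than half the votes are True."""
    return sum(votes) > len(votes) / 2


def majority_accuracy(n: int, p: float) -> float:
    """Probability that a majority of n independent voters is correct,
    when each voter is correct with probability p (n should be odd)."""
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


decision = majority_vote([True, True, False, True, False])  # three of five say hire
solo = majority_accuracy(1, 0.7)   # one interviewer alone: 0.7
panel = majority_accuracy(5, 0.7)  # five-person loop: about 0.837
```

Under those assumptions, five interviewers who are each right 70% of the time reach the right decision about 84% of the time by majority vote, which is better than any one of them alone.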
 “Fraudulent orders” being those placed by criminals who have stolen someone else’s credit card, phished their way into someone else’s account, or are attempting, in some shape or form, to steal money indirectly from Amazon. In addition to this, in some countries, Amazon allows payments through non-guaranteed payment methods. This means that Amazon isn’t actually guaranteed to get paid after shipping goods ordered by some generally non-criminal, but broke, individual. So in some cases Amazon has to also create machine-learned models to determine the general credit-worthiness of its customers as well.
 For many ML problems, this might also be where you employ some sort of domain expert to help come up with relevant and predictive inputs. Unfortunately, when deciding whether or not to hire someone, we usually only have ourselves as the experts. In addition, at least one of these “software engineering experts” in a hiring loop is often very new to giving interviews, and sometimes even pretty new at software engineering! But they need to sharpen this hiring skill somewhere…
 As an aside, this is another reason why “programming portfolios” tend to be interesting but can’t really replace an interview. They can teach you a lot about a person’s programming style, etc., but due to the unstructured nature of, say, a person’s personal GitHub repo, you can’t tell (A) how they are at solving problems presented to them by others, or (B) how productive/fast they are, or (C) if there are any horrible logic errors in the code without spending some serious time delving into it, understanding the problem that it is trying to solve, etc. And even if you could tell the above by glancing at their past programming escapades, you’d often have a hard time fitting these inputs into your mental classifier of a “good hire”, as these features are ones you’ve only created for this candidate alone, and no others.