HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

sambellll

Comments (177)

dvt
An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says:
> temperature 0.1 - low, supposedly nudging the model toward deterministic outputs
This is not correct (and is briefly touched on later in the piece when he sets temperature to 0): temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky", but is still very much a distribution).
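A minimal sketch of dvt's point about temperature, assuming the usual softmax-with-temperature formulation. The logits below are made up for illustration and do not come from gemma3:4b or any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # T = 0 is usually special-cased as greedy argmax rather than divided by.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate score tokens (illustrative only).
logits = [2.0, 1.8, 1.0]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

With these made-up logits, the runner-up token still keeps roughly a 12% chance at temperature 0.1: the distribution is sharper, but it is still a distribution.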
ryukoposting
At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.
jerrythegerbil
> I fail 65% of the time. Same exact resume, different luck.
As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.
35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain-specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.
The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there are 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.
Referral bonuses exist for a reason.
CM30
I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.
For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.
And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.
bartread
The takeaway from this for me is that using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution.
Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.
I am sure there’s some reasonable rule-of-thumb estimate of the distribution that could be applied based on fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.
Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?
Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.
That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.
So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?
(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)
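The cost/confidence trade-off bartread describes can be sketched with a simulation. Everything here is an assumption: `noisy_score` is a hypothetical stand-in for one LLM scoring call, and the mean of 85 and spread of 8 are invented, not measured from the tool.

```python
import random
import statistics

def noisy_score(rng, true_mean=85.0, spread=8.0):
    """Hypothetical stand-in for one LLM scoring call: a noisy draw clamped to 0-100."""
    return max(0.0, min(100.0, rng.gauss(true_mean, spread)))

def averaged_score(rng, runs):
    """Score the same resume `runs` times; return the mean and its standard error."""
    scores = [noisy_score(rng) for _ in range(runs)]
    mean = statistics.fmean(scores)
    # The standard error needs at least two samples; report NaN for a single run.
    stderr = statistics.stdev(scores) / runs ** 0.5 if runs > 1 else float("nan")
    return mean, stderr

rng = random.Random(0)  # fixed seed so the simulation is repeatable
for runs in (1, 10, 100):
    mean, stderr = averaged_score(rng, runs)
    print(f"runs={runs:3d}  mean={mean:5.1f}  stderr={stderr:5.2f}")
```

The standard error shrinks only with the square root of the number of runs, so halving the uncertainty costs about four times as many calls, which is the budget problem raised above.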
sleepynoodle
I really don't understand this constant changing of numbers. I have tried a bunch of ATS reviewers and every time, on the same resume, I get different numbers. It's weird and unreliable. I understand the need for doing this to filter through thousands of CVs but maybe there is a better way. Like a take-home test at the beginning or a test of some kind.
robertlagrant
I tried this with my CV, and it somehow scored me bonus points for GSoC!
BONUS POINTS: 5.0
------------------------------
Google Summer of Code (GSoC) participation: +5
Even though I've never done this, and don't claim to have done it in my CV.
Aurornis
> The default model is gemma3:4b
That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.
This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.
seanieb
It's always amazed me that a tech company will pay $300,000+ for a good engineer, because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea about what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Google Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone in the engineering or data teams.
gs17
I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?
saidnooneever
Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.
kailpa1
From `resume_evaluation_system_message.jinja`
> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*
> - College, university, or educational institution name
> - CGPA, GPA, or academic grades
I don't understand why they would omit these factors from the evaluation.
dev_l1x_be
Did anyone try to prompt hack this setup?
davidpapermill
A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.
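One way to read davidpapermill's suggestion, as a sketch under assumptions: the judge is any callable that returns "A", "B", or "TIE" for a pair of CVs. Asking in both orders and keeping only consistent verdicts is an added guard against position bias, not something the comment specifies. `toy_judge` is a placeholder for a real LLM call.

```python
from itertools import combinations
from typing import Callable, Dict, List

Verdict = str  # one of "A", "B", "TIE"

def consistent_verdict(judge: Callable[[str, str], Verdict], a: str, b: str) -> Verdict:
    """Ask in both orders; keep a preference only if both orders agree on the same CV."""
    first = judge(a, b)
    second = judge(b, a)
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "TIE"  # disagreement or explicit tie: too close to call

def rank_by_wins(judge: Callable[[str, str], Verdict], resumes: List[str]) -> Dict[str, int]:
    """Count pairwise wins for every resume across all pairs."""
    wins = {r: 0 for r in resumes}
    for a, b in combinations(resumes, 2):
        verdict = consistent_verdict(judge, a, b)
        if verdict == "A":
            wins[a] += 1
        elif verdict == "B":
            wins[b] += 1
    return wins

def toy_judge(a: str, b: str) -> Verdict:
    """Placeholder judge (hypothetical): prefers the longer text, ties within 5 characters."""
    if abs(len(a) - len(b)) <= 5:
        return "TIE"
    return "A" if len(a) > len(b) else "B"

print(rank_by_wins(toy_judge, ["x" * 10, "y" * 30, "z" * 12]))
```

Pairwise comparison needs on the order of n² judge calls for n CVs, so in practice it would probably be applied to a shortlist rather than the full applicant pool.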
makeavish
Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.
tasuki
> Sometimes my projects “lack architectural complexity”
Well done you! It is difficult to avoid architectural complexity, but imho well worth it.
realty_geek
Why doesn't something like this exist for real estate? A popular open source AVM (automated valuation model) that helps home sellers get an idea of what their home will sell for. Right now it seems AVMs are mainly seen as just a way to capture leads. Every estate agent will tell you they have some magic recipe that makes their valuation better than anyone else's. I have had a bunch of ideas on how to approach this, but I really could do with a collaborator or two.
cs02rm0
I feel like hiring is all a bit broken. Roles get flooded with applications, it's chance whether your CV gets through, then there are hiring rounds that seem designed to make you quit the process before they have to filter you out.
Is it working for anyone, on any level?
speedgoose
Many em dashes and a "This is not, it is…" later, I think this article would have been a much better critique if it didn't use an LLM to (re)write some parts of it.
thrance
I cope by telling myself that I probably wouldn't want to work for a company that used an LLM to filter my resume out.
nullc
The true test of HackerRank is: can you set up a system that combines a document editing / paraphrasing LLM with gradient descent on the HackerRank LLM to turn your arbitrary resume into a reliable 120 out of 100?
One of the weird properties of other people using LLMs is the potential of having oracle access to your opponent. Even if you don't have their exact LLM a good guess at it may be a better model of the opponent than you ever had before.
bryanrasmussen
> If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck.
Your resume's reception is always affected by random factors, only now you are able to test, debug and technically critique the randomness.
dc3k
Disregarding the fact that this thing is completely broken, its grading rubric is ridiculous to begin with (as was mentioned in the article itself, but I must reiterate how completely stupid this is):
> 35 points for open source contributions
> 30 for personal projects
I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.
rkuska
This reminds me of my former CTO. He would take a bunch of CVs and randomly throw some of them in a bin. He didn’t want to work with “unlucky” people.
cemoktra
So sending my CV to every company three times should get me past the ATS?
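A quick worked check of cemoktra's question, taking the article's 65% failure rate at an 85-point cutoff at face value and assuming each submission is scored independently (an assumption; real pipelines may deduplicate repeat applications).

```python
def pass_at_least_once(fail_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries clears the cutoff."""
    if not 0.0 <= fail_rate <= 1.0 or attempts < 1:
        raise ValueError("fail_rate must be in [0, 1] and attempts must be >= 1")
    # All attempts fail with probability fail_rate ** attempts; take the complement.
    return 1.0 - fail_rate ** attempts

for attempts in (1, 2, 3):
    print(attempts, round(pass_at_least_once(0.65, attempts), 3))
```

Under those assumptions three submissions pass at least once about 72.5% of the time (1 - 0.65³), so better odds, but still no guarantee.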
YossarianFrPrez
Looking at the linked scoring prompt (resume_evaluation_criteria.jinja) [0], I immediately see several red flags that suggest the output won't be reliable. (I'm developing an LLM-intensive application where the stakes are high enough that I need the LLM output to be reasonably correct.)
[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...
In no particular order:
1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.
2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:
> SCORING CRITERIA
> Open Source (0-35 points)
> HIGH SCORES (25-35 points):
> - Contributions to popular open source projects (1000+ stars)
> - Significant contributions to well-known projects
> - Google Summer of Code (GSoC) participation
> - Substantial community involvement
Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...
3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...
- Are all contributions to open source projects with 1000+ stars equal?
- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.
- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?
Honestly at this point maybe someone should build a tool that scans prompts for adjectives...
4. This sort of thing is just asking for trouble:
> SCORES MUST NEVER DEPEND ON: Candidate's name, gender, or personal demographic information
Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)
5. Obviously this one depends on the LLM in question, but instead of writing things like:
> DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...
The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.
6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.
7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?
I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)
8. The prompt also seems unaware of what it's asking the LLM to do:
> LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores
This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.
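Points 2, 5 and 6 above (additive points, forced-choice output, traceability) could be sketched roughly like this. The category names, point values and JSON field names are all hypothetical; none of them come from the hiring-agent repo, and the validation step stands in for whatever structured-output feature a given LLM provider offers.

```python
from dataclasses import dataclass
from enum import Enum

class OpenSourceLevel(Enum):
    """Forced-choice categories the model must pick from (hypothetical labels)."""
    NONE = "none"
    MINOR = "minor"
    SIGNIFICANT = "significant"

# Hypothetical additive point table: the code, not the model, assigns numbers.
OPEN_SOURCE_POINTS = {
    OpenSourceLevel.NONE: 0,
    OpenSourceLevel.MINOR: 15,
    OpenSourceLevel.SIGNIFICANT: 30,
}
LIVE_DEMO_POINTS = 5

@dataclass(frozen=True)
class Evaluation:
    open_source: OpenSourceLevel
    has_live_demo: bool
    evidence: str  # resume text the model cites, kept for human review

def parse_evaluation(raw: dict) -> Evaluation:
    """Validate model output; any label outside the allowed set raises ValueError."""
    if not isinstance(raw["has_live_demo"], bool):
        raise ValueError("has_live_demo must be a JSON boolean")
    return Evaluation(
        open_source=OpenSourceLevel(raw["open_source"]),
        has_live_demo=raw["has_live_demo"],
        evidence=str(raw["evidence"]),
    )

def score(evaluation: Evaluation) -> int:
    """Deterministic, additive score computed from the categorical answers."""
    points = OPEN_SOURCE_POINTS[evaluation.open_source]
    if evaluation.has_live_demo:
        points += LIVE_DEMO_POINTS
    return points

raw = {"open_source": "significant", "has_live_demo": True, "evidence": "Merged 3 PRs"}
print(score(parse_evaluation(raw)))
```

The model can still pick the wrong category, but the arithmetic is fixed and the `evidence` field leaves something a human can audit.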
bhanu786
ATS resume checkers usually look at keywords, formatting, and spacing, and give a score accordingly, as if comparing against some reference format. Depending on how closely a resume follows that format, it might get a low score.
pu_pe
He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?
nnevatie
> An LLM is called
Hooray for incidental non-determinism.
0xpgm
With this kind of ATS system, is it still a thing to optimize for a one-page resume that is easy for a human reviewer to scan, or just include enough buzzwords and external links to try and please the LLM?
ChicagoDave
I was inspired by this. I made a Claude skill to take my resume and compare it to any job description to point out viability and gaps. Pretty cool skill. I'll post it somewhere.
neya
I wonder how this is even legal? The only useful job the HR departments are ever required to do, and they decide to automate it? Aside from being a daycare for adults, what exactly does HR accomplish? It's clearly NOT on the side of employees, but this seems like they're clearly NOT on the side of employers, either.
While resumes are being filtered left and right, they just make TikToks on the company's dime [1]. What a sad state of affairs.
[1] https://www.youtube.com/shorts/wSug80Vg5JU
swingboy
I’ve always assumed any LLM output that was some type of rating or score was bullshit. Unless the LLM writes a Python script to calculate the score (and even then…) then the score it outputs is just the next most likely token, taking into account temperature and whatnot.
You see a lot of frameworks for things like spec-driven development make use of scoring how good the spec/design/plan is and it’s like, uhhh…
padolsey
This is just the 'LLM judge', very badly implemented without any scientific prudence. What a joke. To be terse: you cannot rely on LLMs to provide standardized scores against arbitrary criteria. To get close to 'reliable' you would need highly tested rubrics, grounded in human decision-making, and you'd need to avoid all the measurement biases these things are riddled with... positional/order effects, anchoring on whatever numbers you stuffed into your own prompt, scale-format sensitivity (a 1–5 and an A–E scale give different answers for the same input), holistic-vs-isolated context effects, and lovely examples like where adding a "be unbiased" instruction makes it more biased. I've studied this at length. You cannot even _begin_ to approach this problem seriously without held-out validation, inter-rater agreement, and ground truth. This repo is just a quagmire of wishful vibes with random numbers littered throughout.
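For the inter-rater agreement padolsey mentions, one common statistic is Cohen's kappa, which corrects raw agreement for what two raters would agree on by chance. The comment does not name a specific statistic, so this is one possible choice, and the labels below are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of each rater's label frequency.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Illustrative labels only: a human screener versus an LLM screener on four resumes.
human = ["pass", "pass", "fail", "fail"]
llm = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(human, llm))
```

Here the raters agree on 3 of 4 resumes, but half of that is expected by chance, so kappa comes out at 0.5 rather than 0.75.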
steve_j_choi
This could be used as a good way to self-evaluate one's current position from the company's point of view. You would tweak the prompts and guidelines to match what you expect from the company and see how you score.
jdw64
It seems like the design is flawed, probably because the scoring structure and conditions are wrong. And originally, due to the nature of LLMs, even if the input is unstructured, when you design something like a RAG system, you usually need to create a verifiable evidence table. Even with that, the scores are still probabilistic by nature, but at least they stay within an error distribution that I can verify. But it doesn't seem like there are any such evaluation criteria here.
Typically, retrieval should be tied to evaluation metrics, evidence should be linked to scores, and you also need to account for parsing errors.
But personally, I'm weak to these kinds of ATS systems (ugly appearance, non-native English speaker, didn't go to a good university), so if this kind of filtering existed, I probably would have never had a job in my entire life. Come to think of it, even now I don't have a proper job: I just bid on projects at the lowest price and implement them. So maybe it doesn't really matter whether such a system exists or not.
cyberax
Ah... The AI learned the old HR trick: take 50% of resumes and throw them out without looking. Rationale: "we don't need unlucky losers".
brikym
So that's where the Windows XP file copy dialog author now works.
quink
"A computer can never be held accountable, therefore a computer must never make a management decision."
maxignol
Are many people using the HackerRank ATS?
diimdeep
They forgot to add "masterpiece" /s https://www.youtube.com/watch?v=mcYl70vq_Ns https://github.com/interviewstreet/hiring-agent/blob/main/pr...
carb
It's a good analysis but the AI slop writing makes me not trust you've reviewed this and I'm unable to finish or subscribe. I'm sure you're a great blogger but this is holding it back!
rvz
I see.
> LLM is called six times to extract structured information
Followed by
> The default model is gemma3:4b, running at temperature 0.1 - low, supposedly nudging the model toward deterministic outputs.
This is exactly why hiring is even more broken: because the people looking for candidates are also just as unqualified, if not more so.
Using much weaker LLMs to replace the person in charge of the final judgement call is the wrong solution as this is a plain old social problem.
Even if you wanted to use LLMs for this case, the default configuration and model choice are laughably flawed. This LLM can’t be trusted as it doesn’t even know what it is reading.
The correct solution is either advanced OCR with keyword ranking and a basic filter, or a far stronger LLM that excels at document / vision parsing benchmarks, with an experienced person making the final judgement call in case the technology misses a critical detail.
Rather than using this less accurate one that hallucinates its decision depending on a dice roll.
Traubenfuchs
This actually makes a lot of sense: it's testing the luck of the candidate through the RNG feeding the LLM. You wouldn't want to hire unlucky employees after all! Hiring managers of the past would solve this by throwing every second resume in the trash; now this is a built-in feature of the ATS.
mihaaly
That so many people are willing to take part in this kind of robotic practice in human employment makes me think that many are starting to consider it as unavoidable as global warming, and would rather play along: adapt their career (life) to it, sculpt it towards a specific look, do things that will give them points on some arbitrary test run. Which I feel is dangerous, leading to a superficial-minded workforce: people who are not good at something, including judging a problem and its solution, but good at manipulation.
Speculative thought only, of course.
zuzululu
this is why I don't feel sorry for working 3 remote jobs
glouwbug
I guess at least HR doesn’t have to read 1,000 resumes. Heck, to be frank, could they make sense of the first 10 resumes?
psychoslave
> You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck.
Hmm, well, maybe a bit with a nuance of elite class structure reproduction (that doesn’t prevent a few transclass to showcase in case anyone criticizes the perfect meritocracy at run), that’s basically what people get, so crude truth but truth nonetheless.
Oh don’t take it personally. Your own bespoke hand-tailored process of course is different, it does give the opportunity to everyone to reach the most accomplished version of themselves beyond what they ever dare to dream.
It won’t help though with the systematic failure of aiming to provide an accessible path to flourish for everyone and leaving no one behind.
Again, this is no fault of any specific player, but as long as a majority feel compelled to move within the frame of the game with few winners that merit all they got in contrast to a large stock of inept losers, the outcomes are no wonder.
yieldcrv
this will get patched, as in I'll optimize my resume for this and so will many other people, so that any edge disintegrates