My Scoring System Rated a Senior Engineer 1.8 Out of 10

The candidate gave 10 detailed answers. They described a custom ATT&CK matrix they built. They walked through a real detection project. They named specific tools. My scoring system looked at all of that and said: 1.8 out of 10. Junior.

It was wrong. But fixing it took three complete rebuilds in one day.

Today was the deepest technical sprint of this hackathon build. 55 hours left on the clock. The submission is not started. And somewhere in the middle of fixing 7 bugs in a live testing session, I had to tear out the entire evaluation engine and start over. Twice.

The Scoring Problem (Three Versions)

The original scorer matched keywords. I had a list of expected technical concepts for each question. The system counted how many matched what the candidate said. Simple, measurable, wrong.

The problem is that keyword matching has no semantic understanding. When someone says "forensic imaging tool," that is not the same string as "ftk_imager," so it did not count. When someone described a detection pipeline in precise architectural terms but used different vocabulary than I expected, the system returned a low match ratio. A real senior engineer, with a real transcript from a real TikTok interview, scored 1.8 out of 10.
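The failure mode is easy to see in a sketch. This is a hypothetical reconstruction of the version-one scorer, not the actual code: exact substring matching against an expected-concept list, so paraphrases score zero.

```typescript
// Hypothetical version-one scorer: count exact keyword matches
// against an expected-concept list. Names are illustrative.
function keywordScore(answer: string, expected: string[]): number {
  const text = answer.toLowerCase();
  const hits = expected.filter((kw) => text.includes(kw.toLowerCase()));
  return (hits.length / expected.length) * 10;
}

// A strong answer in different vocabulary scores zero:
const expected = ["ftk_imager", "write blocker", "chain of custody"];
const answer =
  "I used a forensic imaging tool with hardware write protection " +
  "and documented evidence handling end to end.";
// keywordScore(answer, expected) → 0
```

No amount of keyword-list tuning fixes this; the matcher has no way to know that "forensic imaging tool" and "ftk_imager" describe the same thing.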

Version two fixed the scoring ceiling problem. I removed artificial caps that prevented a brilliant answer to an early question from scoring above 6 out of 10. The same transcript improved to 5.0 out of 10. Better. Still wrong.

Version three threw out keyword matching entirely. The model that runs the interview already hears the full audio, understands the answer in context, and has detailed scoring guidance for every question. So I stopped building a second system to guess what the first one experienced. Now the interviewing model is the scorer. It provides a rating from 1 to 5, a technical depth rating, a specificity rating, a communication rating, and a list of what was strong and what to probe next. A separate model does the final report using the full transcript and all those per-question ratings.
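The shape of the version-three contract looks roughly like this. Field names and the roll-up math are my assumptions for illustration, not the actual schema:

```typescript
// Assumed per-question scoring contract: the interviewing model
// returns one of these per answer; a separate pass over the full
// transcript plus these ratings produces the final report.
interface QuestionScore {
  rating: 1 | 2 | 3 | 4 | 5;          // overall answer quality
  technicalDepth: 1 | 2 | 3 | 4 | 5;
  specificity: 1 | 2 | 3 | 4 | 5;
  communication: 1 | 2 | 3 | 4 | 5;
  strengths: string[];                 // what was strong
  probeNext: string[];                 // what to probe next
}

// Illustrative roll-up: average the 1–5 ratings onto a 10-point scale.
function overallScore(scores: QuestionScore[]): number {
  if (scores.length === 0) return 0;
  const avg = scores.reduce((sum, s) => sum + s.rating, 0) / scores.length;
  return Math.round(avg * 2 * 10) / 10;
}
```

The key design choice is that nothing here re-derives what the answer meant; the model that heard the answer assigns the ratings, and downstream code only aggregates.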

Same transcript. Version three: 8.0 out of 10. Senior. That is the correct answer.

The lesson here is not just about AI systems. It is about building evaluators. Every rigid rubric eventually punishes excellence that does not fit the rubric's vocabulary. The more I tried to quantify "correct," the more I drifted from "accurate."

The Audio Bug That Was Never Gemini's Fault

Two days ago I was convinced the audio stutter was coming from the model. The raw output from Gemini was getting fragmented somewhere between the API and the speaker. I spent significant time investigating the backend pipeline.

Today I pulled the raw audio bytes before any playback logic touched them and ran a transcription. The audio was clean. Every word, every pause, exactly right.

The stutter was mine. With AudioBufferSourceNode scheduling, a gap opens whenever one chunk finishes playing before the next one is queued. Under load, that gap becomes a click or a cut. The fix was to move to AudioWorklet, which runs on a separate audio thread with a ring buffer that the browser's audio process pulls from at a constant rate. Google's own reference implementation for Gemini Live uses this pattern. I had just not looked at it closely enough.
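The core of the pattern is the ring buffer itself, shown here stripped of browser APIs so the scheduling idea is visible (capacity and frame sizes are illustrative; in a real AudioWorkletProcessor, `pull` would run inside `process()` once per render quantum):

```typescript
// Ring buffer between the network side (pushes decoded samples as
// chunks arrive) and the audio thread (pulls fixed-size frames at a
// constant rate). Sketch only; sizes are illustrative.
class RingBuffer {
  private buf: Float32Array;
  private readPos = 0;
  private writePos = 0;
  private size = 0;

  constructor(capacity: number) {
    this.buf = new Float32Array(capacity);
  }

  push(samples: Float32Array): void {
    for (const s of samples) {
      this.buf[this.writePos] = s;
      this.writePos = (this.writePos + 1) % this.buf.length;
      if (this.size < this.buf.length) this.size++;
      else this.readPos = (this.readPos + 1) % this.buf.length; // overwrite oldest
    }
  }

  // On underrun, fill with silence instead of stopping playback —
  // an abrupt stop is exactly what produced the clicks.
  pull(frame: Float32Array): void {
    for (let i = 0; i < frame.length; i++) {
      if (this.size > 0) {
        frame[i] = this.buf[this.readPos];
        this.readPos = (this.readPos + 1) % this.buf.length;
        this.size--;
      } else {
        frame[i] = 0;
      }
    }
  }
}
```

Because the audio thread always gets a full frame, late network chunks degrade into brief silence rather than clicks.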

Seven bugs got fixed in the afternoon live testing session with Obadiah:

  • Behavioral mode was pulling from the technical question tree instead of the behavioral one
  • Session resumption was sending a static handle that Gemini Live does not support
  • A role level selector in the interface turned out to be purely cosmetic (it did not change any questions or scoring), so it got removed entirely
  • Context window compression was not supported on the native audio model and was crashing sessions
  • Transcript chunks were concatenating without spaces between words
  • Silence detection was set to 500 milliseconds, which punishes anyone who needs a second to think. Changed to 5,000 milliseconds.
  • When Gemini Live disconnects and reconnects, the new session had zero context. Now if there are previous exchanges, the last 5 get injected as a summary on reconnect.
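The reconnect fix from the last bullet is simple in shape. A sketch, with the exchange structure and summary format as my assumptions:

```typescript
// Assumed shape of one question/answer exchange in session history.
interface Exchange {
  question: string;
  answer: string;
}

// On reconnect, replay the last few exchanges as a compact text
// summary so the new session is not starting from zero context.
function reconnectSummary(history: Exchange[], limit = 5): string | null {
  if (history.length === 0) return null; // fresh session, nothing to inject
  return history
    .slice(-limit)
    .map((e, i) => `Q${i + 1}: ${e.question}\nA${i + 1}: ${e.answer}`)
    .join("\n\n");
}
```

The summary goes out as the first message of the new session, before the interviewer asks anything further.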

Most of these were caught because Obadiah tested it live and described exactly what he was seeing. That feedback loop is faster than any automated test I could write for audio behavior.

The Gemini Free Tier Reality

There is a gap between the headline capability of Gemini Live and what you actually get on the free tier during a hackathon. Free tier sessions disconnect after 2 to 4 minutes. The theoretical maximum is 10 minutes. The practical ceiling during active development is much shorter, especially when hitting the API repeatedly.

For a product demo, this matters. If the model disconnects mid-interview, the session feels broken even if the reconnect logic recovers cleanly. I have the reconnect logic working. The disconnect is still happening. I do not have a fix for that before Sunday.

This is the main risk going into submission. The scoring is accurate now. The audio plays without stutter. The question trees cover 6 technical domains and a behavioral mode with 11 questions. But if the demo video catches a mid-session disconnect, that is what judges will remember.

What Else Shipped Today

Three blog posts went live this morning. The pipeline that was 9 drafts deep and stuck is moving now.

The nightly builder agent shipped an SMS automation for retail shift replenishment while I was sleeping. Two workflows. Estimated ROI of $2,600 to $5,200 per client per year. Setup time under 30 minutes. That is the kind of thing that happens when you build systems instead of doing tasks.

Memory search broke mid-morning. The OpenAI embeddings key had expired. Switched the entire memory system over to Gemini embeddings. 676 out of 1,237 files indexed before hitting rate limits. All the important memory files are covered. Session transcripts are lower priority and will index over time.

55 Hours

The hackathon closes Sunday at 8 PM Eastern. I have a working interviewer with accurate scoring and stable audio. I do not have a public repository, an architecture diagram, a demo video, or a Devpost submission.

That is tomorrow's entire job.

One thing at a time.