// status: all checks passing

DEEP HALDER

I am _

AI ships fast and lies confidently. I've spent 8 years building the tests that catch it — hallucinating models, flaky agents, green checks that mean nothing — before a single user ever sees it.

01. about

Everyone's shipping AI. Someone has to ask it the hard questions first — that's me.

For 8 years I've worked at the intersection of AI and quality engineering. Today I'm a Senior AI SDET at Sentient Labs, leading AI Quality & Evaluation for AGI-powered agents: hallucination detection, reasoning validation, tool-use accuracy, prompt adherence, gold datasets, and the regression pipelines that turn "it seems fine" into measured confidence — plus agentic and crypto/Web3 workflows, from wallet integrations to on-chain checks.

Before this, I was the founding QA engineer at Level AI, building the quality function from zero — voice AI, conversational AI, STT accuracy, intent detection — back when testing AI meant inventing the playbook as you went.

📍 Bangalore, India · open to global remote

whoami.json
{
  "name": "Deep Halder",
  "role": "Senior AI SDET",
  "org": "Sentient Labs",
  "yrs_breaking_software": 8,
  "base": "Bangalore, IN",
  "obsessions": [
    "agent reliability",
    "evals that catch lies",
    "green checks that
     don't lie"
  ],
  "trusts_ai_blindly": false
}
0years making
software prove itself
0release-confidence
boost from my evals
0hallucinations
given a pass

02. evals // my skills, scored the way I score models

benchmark: ai_evaluation PASS

LLM evaluation · hallucination detection · agent & tool-call testing · gold dataset design · agentic + Web3 workflow validation

0%
benchmark: automation PASS

Playwright · Selenium · REST Assured · Postman · API / web / backend testing · CI/CD-integrated regression suites

0%
benchmark: engineering PASS

Python · Java · SQL · test strategy & planning · performance testing (K6 / JMeter / Grafana) · failure-mode & replay-based validation

0%
benchmark: cloud_devops TRAINING

GCP · AWS · Azure · Jenkins · CI/CD pipelines — solid and actively levelling up. (Honest evals only. That's the whole point.)

0%
benchmark: red_team ADVERSARIAL

I attack the system before anyone else can — jailbreak & prompt-injection probes, adversarial prompt suites, tool-abuse scenarios, data-leak attempts, failure injection. A model that's only been asked nice questions isn't tested; it's flattered.

0%

03. work

Jul 2025 — now · Bangalore

Senior AI SDET

Sentient Labs — AGI startup, autonomous AI agents

Leading AI Quality & Evaluation for AGI-powered agents. Built eval frameworks for hallucination detection, reasoning validation, tool-use accuracy, and prompt adherence. Created gold datasets & AI regression pipelines — improving release confidence by ~40% and cutting flaky validations by ~30%. Playwright web automation and Rest Assured / Postman API frameworks from scratch; tested agentic + crypto/Web3 workflows including wallets and on-chain / off-chain data checks.

  • llm evals
  • agent testing
  • gold datasets
  • web3
  • playwright
Oct 2021 — Jul 2025 · New Delhi

Senior SDET · Founding Engineer

Level AI — voice & conversational AI

Founding QA hire: built the QA function 0 → 1 — automation strategy, release gates, quality processes. Scalable UI + API frameworks (Selenium, Playwright, Rest Assured) improving regression coverage by ~70%, integrated into CI/CD with BDD. Led voice & conversational AI testing — STT accuracy, intent detection, entity extraction, call summarization, agent assist. Performance testing with K6 / JMeter; mentored SDETs.

  • 0→1 qa org
  • voice ai
  • ci/cd
  • k6 / jmeter
  • mentoring
Sep 2020 — Oct 2021 · Pune

SDET-I

Motifworks

Automated 300+ hybrid web + desktop test scenarios with Selenium, Java, C#, and SpecFlow — cutting regression effort by ~70%. Desktop automation via FlaUI, mobile coverage with Appium.

  • selenium
  • specflow
  • flaui
  • appium
Jun 2018 — Aug 2020 · Pune

QA Automation Engineer

Atos

Automated banking web apps with Selenium, Protractor (TypeScript), and Cucumber — BDD regression suites wired into CI pipelines.

  • banking
  • protractor
  • cucumber
  • bdd
2014 — 2018 · Bangalore

B.Tech, Computer Science & Engineering

Visvesvaraya Technological University

Where the habit of asking "but does it actually work?" began.

04. why hire me // reviewed like a pull request

⎇ candidate/deep-halder → your-team/main

hire: deep_halder #2026

✓ approved — strong hire
  • Trusted twice as the first quality hire. Two AI startups handed me a blank page; both got a working QA org. That trust isn't given for buzzwords.
  • Receipts, not vibes. ~40% lift in release confidence. ~30% fewer flaky validations. ~70% more regression coverage. I measure my own work the way I measure models.
  • I catch what AI reviewers rubber-stamp. An LLM will confidently approve the bug it can't trace. My harnesses are built to be harder to fool than I am.
  • "Not ready" is a complete sentence. Release gates only matter if someone is willing to hold them. I am — and I bring the data that ends the argument.
merge candidate ↵ no conflicts with your base branch

// how I think — the researcher's loop I run on every AI system

01

hypothesis

Every feature is a claim. "The agent handles refunds" isn't a fact — it's a hypothesis nobody has tried to break yet.

02

instrument

Build the thing that could prove it wrong: gold datasets, replayed real traffic, adversarial prompts, failure injection.

03

evidence

Numbers over adjectives. A pass rate with a denominator beats "seems to work" every single time.

04

verdict

Ship it, fix it, or kill it — and whichever it is, with receipts attached.

05. off the clock // proof I'm not a language model: I log off

🏔️

mountains & miles

Raised in Assam, wired for the hills. If I'm not replying, check the nearest mountain — the travel log lives on Instagram.

@deephalderr ↗
🏋️

the iron lab

Progressive overload is just regression testing for the body. I track my lifts the way I track eval metrics — honestly.

current experiment: consistency
📈

markets & chains

Personal-finance nerd, crypto-curious by profession now too. I read whitepapers for fun and stress-test my own portfolio.

position sizing > moon math

chai & ideas

Best test plans start as scribbles next to a cup of chai. Happy to share one — the chai's on you if we ever meet.

dependency: HIGH — wontfix

06. contact

Shipping an AI product and wondering
"…but does it actually work?"

That question is my favorite conversation. Let's talk.

// or interrogate the site itself — type help and hit enter

guest@deephalder.com — interactive
$