Deep Halder

01. about

Everyone's shipping AI. Someone has to ask it the hard questions first — that's me.

For 8 years I've worked at the intersection of AI and quality engineering. Today I'm a Senior AI SDET at Sentient Labs, leading AI Quality & Evaluation for AGI-powered agents: hallucination detection, reasoning validation, tool-use accuracy, prompt adherence, gold datasets, and the regression pipelines that turn "it seems fine" into measured confidence — plus agentic and crypto/Web3 workflows, from wallet integrations to on-chain checks.

Before this, I was the founding QA engineer at Level AI, building the quality function from zero — voice AI, conversational AI, STT accuracy, intent detection — back when testing AI meant inventing the playbook as you went.

📍 Bangalore, India · open to global remote

Deep Halder, Senior AI SDET, in a dark suit at a lake at sunset in Assam

whoami.json

{
  "name": "Deep Halder",
  "role": "Senior AI SDET",
  "org": "Sentient Labs",
  "yrs_breaking_software": 8,
  "base": "Bangalore, IN",
  "obsessions": [
    "agent reliability",
    "evals that catch lies",
    "green checks that don't lie"
  ],
  "trusts_ai_blindly": false
}

8 years making software prove itself

~40% release-confidence boost from my evals

300+ test scenarios automated & shipped

02. evals // my skills, scored the way I score models

benchmark: ai_evaluation PASS

LLM evaluation · hallucination detection · agent & tool-call testing · gold dataset design · agentic + Web3 workflow validation

benchmark: automation PASS

Playwright · Selenium · REST Assured · Postman · API / web / backend testing · CI/CD-integrated regression suites

benchmark: engineering PASS

Python · Java · SQL · test strategy & planning · performance testing (K6 / JMeter / Grafana) · failure-mode & replay-based validation

benchmark: cloud_devops TRAINING

GCP · AWS · Azure · Jenkins · CI/CD pipelines — solid and actively levelling up. (Honest evals only. That's the whole point.)

benchmark: red_team ADVERSARIAL

I attack the system before anyone else can — jailbreak & prompt-injection probes, adversarial prompt suites, tool-abuse scenarios, data-leak attempts, failure injection. A model that's only been asked nice questions isn't tested; it's flattered.

03. work

Jul 2025 — now · Bangalore

Senior AI SDET

Sentient Labs — AGI startup, autonomous AI agents

Leading AI Quality & Evaluation for AGI-powered agents. Built eval frameworks for hallucination detection, reasoning validation, tool-use accuracy, and prompt adherence. Created gold datasets & AI regression pipelines, improving release confidence by ~40% and cutting flaky validations by ~30%. Built Playwright web automation and REST Assured / Postman API frameworks from scratch; tested agentic + crypto/Web3 workflows including wallets and on-chain / off-chain data checks.

llm evals
agent testing
gold datasets
web3
playwright

Oct 2021 — Jul 2025 · New Delhi

Senior SDET · Founding Engineer

Level AI — voice & conversational AI

Founding QA hire: built the QA function 0 → 1, covering automation strategy, release gates, and quality processes. Built scalable UI + API frameworks (Selenium, Playwright, REST Assured) that improved regression coverage by ~70%, integrated into CI/CD with BDD. Led voice & conversational AI testing: STT accuracy, intent detection, entity extraction, call summarization, agent assist. Ran performance testing with K6 / JMeter; mentored SDETs.

0→1 qa org
voice ai
ci/cd
k6 / jmeter
mentoring

Sep 2020 — Oct 2021 · Pune

SDET-I

Motifworks

Automated 300+ hybrid web + desktop test scenarios with Selenium, Java, C#, and SpecFlow — cutting regression effort by ~70%. Desktop automation via FlaUI, mobile coverage with Appium.

selenium
specflow
flaui
appium

Jun 2018 — Aug 2020 · Pune

QA Automation Engineer

Atos

Automated banking web apps with Selenium, Protractor (TypeScript), and Cucumber — BDD regression suites wired into CI pipelines.

banking
protractor
cucumber
bdd

2014 — 2018 · Bangalore

B.Tech, Computer Science & Engineering

Visvesvaraya Technological University

Where the habit of asking "but does it actually work?" began.

04. why hire me // reviewed like a pull request

⎇ candidate/deep-halder → your-team/main

hire: deep_halder #2026

✓ approved — strong hire

✓ Trusted twice as the first quality hire. Two AI startups handed me a blank page; both got a working QA org. That trust isn't given for buzzwords.
✓ Receipts, not vibes. ~40% lift in release confidence. ~30% fewer flaky validations. ~70% more regression coverage. I measure my own work the way I measure models.
✓ I catch what AI reviewers rubber-stamp. An LLM will confidently approve the bug it can't trace. My harnesses are built to be harder to fool than I am.
✓ "Not ready" is a complete sentence. Release gates only matter if someone is willing to hold them. I am, and I bring the data that ends the argument.

merge candidate ↵ no conflicts with your base branch

// how I think — the researcher's loop I run on every AI system

01

hypothesis

Every feature is a claim. "The agent handles refunds" isn't a fact — it's a hypothesis nobody has tried to break yet.

02

instrument

Build the thing that could prove it wrong: gold datasets, replayed real traffic, adversarial prompts, failure injection.

03

evidence

Numbers over adjectives. A pass rate with a denominator beats "seems to work" every single time.

04

verdict

Ship it, fix it, or kill it — and whichever it is, with receipts attached.
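
// the loop above, sketched in code

A minimal Python sketch of the four steps, under loud assumptions: `GoldCase`, `toy_agent`, the substring check, and the 0.9 threshold are all hypothetical stand-ins chosen for illustration, not a real harness or a real agent. The point it tries to show is that the result is always reported as passed out of total, and that an empty dataset counts as no evidence rather than a pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldCase:
    """One row of a (hypothetical) gold dataset."""
    prompt: str
    expected: str  # substring the answer must contain to count as a pass


def pass_rate(agent: Callable[[str], str], cases: list[GoldCase]) -> tuple[int, int]:
    """Return (passed, total) so the denominator is never dropped."""
    passed = sum(
        1 for case in cases
        if case.expected.lower() in agent(case.prompt).lower()
    )
    return passed, len(cases)


def verdict(passed: int, total: int, threshold: float = 0.9) -> str:
    """Turn the evidence into a release decision (threshold is an assumption)."""
    if total == 0:
        return "not ready"  # no evidence is not a pass
    return "ship" if passed / total >= threshold else "not ready"


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for the system under test."""
    return "Refund issued" if "refund" in prompt.lower() else "I am not sure"


# Hypothesis: "the agent handles refunds and cancellations".
GOLD = [
    GoldCase("Please refund order 123", "refund issued"),
    GoldCase("Cancel my subscription", "cancelled"),
]

passed, total = pass_rate(toy_agent, GOLD)
print(f"{passed}/{total} passed -> {verdict(passed, total)}")
```

Run as-is, this prints `1/2 passed -> not ready`. A real check would likely replace the substring match with something stricter, but the shape stays the same: claim, instrument, count, decision.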

05. off the clock // proof I'm not a language model: I log off

remote office: wherever the wifi and the view are both decent.

Deep Halder at the HPCA cricket stadium in Dharamshala, Himalayas behind

mountains & miles

Raised in Assam, wired for the hills. If I'm not replying, check the nearest mountain — the travel log lives on Instagram.

@deephalderr ↗

the iron lab

Progressive overload is just regression testing for the body. I track my lifts the way I track eval metrics — honestly.

current experiment: consistency

markets & chains

Personal-finance nerd, crypto-curious by profession now too. I read whitepapers for fun and stress-test my own portfolio.

position sizing > moon math

chai & ideas

Best test plans start as scribbles next to a cup of chai. Happy to share one — the chai's on you if we ever meet.

dependency: HIGH — wontfix

06. contact

Shipping an AI product and wondering
"…but does it actually work?"

That question is my favorite conversation. Let's talk.

Based in IST · I flex across US / EU hours for remote roles.

email me · whatsapp (fastest, IST) 👋 · book a 1:1 on topmate ↗

linkedin ↗ · github ↗ · instagram ↗ · résumé ↓

Senior AI SDET · AI Evaluation · Agent Testing · Red-Teaming · open to global remote