Hive Financial Systems is HIRING A

Peach Pilot — Principal QA Engineer

📍 United States 🌐 Fully Remote · Full Time
POSTED April 14, 2026

Please mention you found this job on TestDev Jobs. It helps us get more people to hire on our site. Thanks and good luck!


95% of enterprise AI pilots fail — not because the technology is broken, but because users don't trust it. At Peach Pilot, we are building an enterprise AI operating system where trust is the product. That means every feature we ship must work exactly as the user expects, every time. One broken interaction at the wrong moment can undo months of adoption. You are the last line of defense before our platform reaches a CFO's desk.

We are a funded startup co-founded by Mario Montag (ex-McKinsey, founder of Predikto — acquired by a Fortune 50 company) and JP James (Georgia Tech alum, US patent holder in AI/ML, Senior Fellow at the National War College, four-time Atlanta 500 honoree).

The Role

This is a player-coach hire. You will build and own the QA function at Peach Pilot — writing test code, designing eval pipelines, and setting the quality bar — while also standing up and growing a QA team as the company scales. We are not looking for someone who manages spreadsheets and delegates everything. We are looking for someone who can do the work, knows what good looks like, and builds a team around that standard.

The Challenge: QA for AI is a Different Problem

Traditional QA assumes deterministic outputs. LLMs don't give you that. You will be building a quality function from scratch in an environment where:

  • Multi-model routing (Claude, GPT-4o, Grok, Gemini) means the same input can produce different outputs depending on which model handled it.
  • Agent orchestration and governance agents must maintain a structurally separate audit trail — any drift between execution and governance is a critical failure.
  • The file ingestion pipeline (Word, Excel, PowerPoint, PDF) must survive edge cases that enterprise clients will find within the first week of deployment.
  • Your users are CEOs and operations leaders who have never used a terminal. A confusing error state isn't a minor bug — it kills adoption.

This is not a ticket-closing role. This is a quality architecture and team leadership role.

What You Will Own & Build

Build the QA Foundation (First 90 Days)
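To make the non-determinism challenge concrete: a prompt regression test can't assert exact string equality, because the same prompt may yield different text on every call and across models. Instead it scores properties of the output against a gate. A minimal illustrative sketch — every function name, property check, and threshold here is hypothetical, not part of the Peach Pilot codebase:

```python
# Property-based scoring for a non-deterministic LLM output.
# Rather than diffing against one golden answer, we check that any
# acceptable answer satisfies a set of required properties.
import re

def score_summary(output: str) -> float:
    """Score an LLM-generated financial summary on required properties."""
    checks = [
        len(output.split()) <= 120,                # respects the length budget
        bool(re.search(r"\$[\d,]+", output)),      # cites at least one figure
        "revenue" in output.lower(),               # stays on the asked topic
        not re.search(r"(?i)as an ai", output),    # no assistant-persona leakage
    ]
    return sum(checks) / len(checks)

# Simulated outputs from two different models answering the same prompt:
candidate_a = "Q3 revenue rose to $4,200,000, up 8% quarter over quarter."
candidate_b = "As an AI, I think revenue may have changed somehow."

assert score_summary(candidate_a) == 1.0   # passes every property check
assert score_summary(candidate_b) == 0.5   # fails the regression gate
```

In practice the candidates would come from live model calls, and the gate would run across every provider in the routing mix so a model update that degrades one property fails the build.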
  • Establish the testing framework from zero: unit, integration, end-to-end, and LLM-specific evaluation pipelines.
  • Define quality standards, test coverage requirements, and documentation practices in partnership with the Lead Engineer.
  • Audit the existing platform and identify the highest-risk surfaces before the next major customer deployment.
  • Define the team structure you will need — onshore vs. offshore mix, roles, and a hiring roadmap — and begin executing against it.

Build and Lead the QA Team
  • Recruit, hire, and onboard QA engineers as the team grows, setting clear expectations, working standards, and a bar for technical excellence from day one.
  • Mentor junior and mid-level QA engineers — building their ability to own test domains independently rather than creating dependency on you.
  • Act as the quality culture carrier across the full engineering team — QA is not a department, it is everyone's responsibility, and you will make that real.
  • Report directly to the Lead Engineer and participate in product planning to ensure quality is designed in, not bolted on.

AI & Agent Testing
  • Design evaluation frameworks for non-deterministic LLM outputs — including prompt regression testing, model drift detection, and output quality scoring across Claude, GPT-4o, Grok, and Gemini.
  • Build automated test suites for the agent orchestration layer, including governance agent audit trail integrity and human-override behavior.
  • Validate the Enterprise Knowledge Graph (Neo4j + vector search) for data accuracy, retrieval quality, and failure modes under real enterprise data conditions.

Platform & Integration Testing
  • Own end-to-end testing of the file ingestion pipeline across document types (Word, Excel, PowerPoint, PDF) including encryption, formatting edge cases, and audit trail continuity.
  • Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow.
  • Test multi-model routing logic to confirm cost-optimized task allocation behaves correctly across LLM providers.

UX Quality & FDE Support
  • Partner with the Full-Stack Engineer to define and test trust-layer UX standards — onboarding flows, progressive disclosure, uncertainty states, and real-time document viewers.
  • Build reusable test playbooks for Forward Deployed Engineers to use in new customer deployments and agent configurations.
  • Act as the internal advocate for the non-technical enterprise user — if a CEO would be confused by it, it doesn't ship.

Who You Are
  • The Player-Coach: You have 7+ years of QA engineering experience, with at least 3 years in a lead or senior role where you both wrote test code and managed or mentored other engineers. You do not delegate the hard problems — you model how to solve them.
  • Team Builder: You have experience leading and growing QA teams. You know how to hire for technical excellence, set up engineers to own domains independently, and build accountability without micromanagement.
  • AI-Native Tester: You have hands-on experience testing LLM-powered applications — you understand prompt sensitivity, output variance, and how to build eval pipelines that catch regressions across model updates.
  • Automation-First: You write test code. Python is your primary tool. You have built and maintained CI/CD-integrated test suites and you don't wait for someone to file a bug to find one.
  • Integration Expert: You are comfortable testing complex API chains, async/streaming responses, and multi-service workflows. Document processing pipelines and knowledge graph outputs don't intimidate you.
  • 0-to-1 Mindset: You have built a QA function from the ground up in an early-stage environment. You know when to move fast and when to go deep, and you can make that call without being told.
  • Enterprise Empathy: You understand that your end users are not developers. You test for confusion and trust failure, not just broken functionality.

The Stack You'll Test Against
  • AI/LLM: Anthropic Claude, OpenAI GPT-4o, xAI Grok, Gemini, OpenClaw
  • Frontend: React/Next.js, TypeScript, Tailwind CSS
  • Backend: Python, Node.js/TypeScript (FastAPI/Express)
  • Data & Graph: Neo4j, Snowflake, Azure Cosmos DB, Azure AI Search
  • Infrastructure: Azure (Functions, Key Vault), CI/CD pipelines
  • Visualization: Plotly, D3, Recharts, Mermaid

Even Better If
  • You have experience with LLM evaluation frameworks (e.g., LangSmith, PromptFlow, or custom eval pipelines).
  • You have tested agent frameworks such as LangChain or CrewAI.
  • You have a background in enterprise software or regulated industries where audit trail integrity is non-negotiable.
  • You have worked alongside Forward Deployed or solutions engineering teams and understand field deployment risk.

Why This is Different
  • You are building the QA function and the team — not inheriting either. Your decisions will define how this company ships software for the next five years.
  • You will work directly with the founding engineering team. Your findings shape the roadmap, not a backlog queue.
  • Real enterprise data, real deployments, real consequences. No toy environments.
  • Meaningful equity as a founding-team engineering hire.

Compensation & Benefits
  • Base Salary: Broad range (Commensurate with experience)
  • Equity: Meaningful founding-team equity package
  • Benefits: Comprehensive medical, dental, and vision; 401(k); flexible PTO
  • Location: Fully Remote — US Based

The Clincher

Tell us about a quality failure — one you caught before it shipped, or one that got through. What did you build or change after it, and how did you make sure your team could catch the next one without you?