Principal Software Quality Engineer - AI Model Serving @ Red Hat

Job Summary:

Are you ready to take a technical leadership role in shaping the quality of an open-source, Kubernetes-native AI platform that’s redefining hybrid cloud?

The Red Hat OpenShift AI (RHOAI) team is seeking a Principal Quality Engineer with deep experience in Kubernetes-native application testing and a strong foundation in Python and PyTest to lead quality strategy across our AI Model Serving offerings. You’ll work in a highly collaborative team responsible for one of the core capabilities of OpenShift AI, contributing to open-source projects such as KServe, Kubeflow, and vLLM. Your work will directly impact enterprises running mission-critical AI workloads in hybrid and multi-cloud environments.

This role requires not only excellent engineering skills but also strategic vision, thought leadership, and the ability to mentor and influence others across multiple teams.

What You Will Do:

Lead the quality strategy and implementation for Kubernetes-native components in Model Serving, including Custom Resources, Controllers, and Operators.
Own and evolve automated test architecture with a focus on PyTest, CI/CD, integration testing, and end-to-end testing in Kubernetes environments.
Partner with engineering, product, and community teams to define testability requirements, ensure early validation, and prevent regressions.
Design tests that validate system-level properties including scalability, autoscaling, observability, and reliability for AI workloads.
Participate in and influence upstream communities (KServe, Kubeflow, ModelMesh, etc.), raising quality standards and sharing best practices.
Drive efforts to mock, simulate, and validate model serving use cases in hybrid cloud and on-prem environments.
Serve as a technical mentor and go-to expert for Python-based testing frameworks and Kubernetes-native validation strategies.
Take a lead role in debugging complex system-level issues, especially in multi-tenant, distributed AI systems.
Champion shift-left testing and early validation practices across the RHOAI stack.

What You Will Bring:

Proven expertise with Kubernetes API development and testing (CRs, Operators, Controllers). Experience working directly with Custom Resources and reconciliation logic is essential.
Strong programming and testing experience in Python, especially with PyTest in large, scalable codebases. Golang knowledge is a plus.
Deep understanding of Kubernetes internals, networking, and lifecycle hooks. Experience with OpenShift is a plus.
Extensive knowledge of CI/CD pipelines, especially in containerized or cloud-native ecosystems (e.g., GitHub Actions, Tekton, Jenkins).
Strong knowledge of test strategy for ML model serving systems, including considerations for runtime performance, isolation, and failure recovery.
Experience with troubleshooting distributed systems and validating observability via Prometheus, Grafana, OpenTelemetry, etc.
A proven ability to lead technical projects and mentor others across teams and time zones.
Excellent communication skills and comfort presenting to engineers, managers, and external stakeholders.

Preferred (Nice-to-Have):

Hands-on experience with KServe, ModelMesh, Ray, vLLM, or other model serving frameworks.
Familiarity with Red Hat Service Mesh, Istio, Knative, or similar serverless/K8s-native middleware stacks.
Experience with performance/load testing frameworks and chaos testing in Kubernetes.
Contribution history in open-source projects or technical leadership in community forums.

About Red Hat

Red Hat is the world’s leading provider of enterprise software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. Spread across 40+ countries, our associates work flexibly across work environments, from in-office, to office-flex, to fully remote, depending on the requirements of their role. Red Hatters are encouraged to bring their best ideas, no matter their title or tenure. We're a leader in open source because of our open and inclusive environment. We hire creative, passionate people ready to contribute their ideas, help solve complex problems, and make an impact.

Red Hat does not seek or accept unsolicited resumes or CVs from recruitment agencies. We are not responsible for, and will not pay, any fees, commissions, or any other payment related to unsolicited resumes or CVs except as required in a written contract between Red Hat and the recruitment agency or party requesting payment of a fee.

Red Hat supports individuals with disabilities and provides reasonable accommodations to job applicants. If you need assistance completing our online job application, email application-assistance@redhat.com. General inquiries, such as those regarding the status of a job application, will not receive a reply.
