NVIDIA is the world leader in GPU computing. The markets we are passionate about include gaming, automotive, vision, HPC, datacenters, and networking, in addition to our traditional OEM business. NVIDIA is also well positioned as the "AI Computing Company": NVIDIA GPUs are the brains powering deep learning software frameworks, analytics, and data centers, and driving autonomous vehicles. We have some of the most experienced and dedicated people in the world working for us. If working with dedicated, forward-thinking, and hard-working technical people across countries sounds exciting, this job is for you. NVIDIA is looking for an outstanding individual who thrives in a diverse work environment, has outstanding interpersonal skills, and possesses a strong sense of engagement and continuous process improvement. To join our platform SWQA team, the candidate must have experience in enterprise server integration, operating systems, reliability testing with various telemetries, scale-out clusters, test plan development, CI/CD, and DevOps.
What you'll be doing:
- Develop and execute NVIDIA HGX/DGX/MGX platform test plans for servers, OS, FW, and the CUDA SW stack, based on design documents.
- Install and test various system OSes, server firmware, and the SW stack.
- Drive root cause analysis of reliability and validation test failures to identify root cause(s) and achieve mitigation.
- Build, develop, and debug server- and OS-level automation frameworks (front end and back end) and tests.
- Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.
- Work in an agile software development team with very high production quality standards.
- Manage the bug lifecycle and collaborate across groups to drive solutions.
What we need to see:
- Bachelor's degree (or equivalent experience) in a STEM (Science, Technology, Engineering, Math, or Physics) field
- 5+ years of proven experience, or a Master's degree.
- Proven OS- and server-level automation experience using Python, shell, Ansible, Jenkins, C/C++, Java, and JavaScript
- Strong OS (Ubuntu, Red Hat, CentOS, SUSE, Fedora, Windows, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
- Ability to write test plans focusing on functional, performance, stress and negative testing.
- Experience developing CI/CD automation processes and contributing to DevOps, with a real passion for automation.
- Strong experience with FW, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, the UEFI spec, and Redfish is a huge plus
- Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, and Stack/Kubernetes/Docker is a huge plus
Ways to stand out from the crowd:
- Experience working with NVIDIA GPU hardware is a strong plus.
- Solid understanding of virtualization in Linux (KVM, Docker orchestrated with Kubernetes)
- Expertise in packaging software in Linux (RPM, deb)
- Background in parallel programming, ideally CUDA/OpenCL, is a plus
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.