Senior HPC and AI Cluster Administrator
AFS is looking for a Senior HPC and AI Cluster Administrator to support software and data solutions for our customers. We are integrating supercomputers and AI clusters based on existing technologies, and we need a system administrator who will play a key role in enabling artificial intelligence and GPU computing solutions.
You will work with scientific researchers, developers, and customers to improve workflows and develop unique solutions. You will collaborate with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.
Key Responsibilities:
- Design, deploy, and maintain HPC/AI clusters
- Manage AI job workflows using scheduling technologies such as Kubernetes
- Support and maintain continuous integration and delivery pipelines
- Troubleshoot and fix issues from the bottom up: bare metal, operating system, software stack, and application level
- Support research, development, and operational activities
Basic Qualifications:
- Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
- 5 years of experience in any of the following:
- Knowledge of HPC and AI solution technologies, including hardware, hypervisors, CPUs, and GPUs
- Experience with workload scheduling and orchestration tools such as Slurm and Kubernetes (K8s)
- Excellent knowledge of Linux (e.g., Red Hat, Ubuntu) internals and networking (routing, switching), ACLs and OS-level security protections, and common protocols such as TCP, DHCP, and DNS
- Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS, plus familiarity with newer and emerging storage technologies
- Experience with automation and configuration management using tools such as Python and Bash within GitOps workflows
- Knowledge of networking protocols such as InfiniBand and Ethernet
- Experience with private cloud platforms (for example VMware, Hyper-V, KVM)
- Familiarity with public cloud computing platforms (e.g. AWS, Azure)
- Must possess and maintain required DoD 8140 certifications.
Ways to stand out from the crowd:
- Knowledge of GPU architectures, time-slicing, and Multi-Instance GPU (MIG)
- Experience with container technologies such as Kubernetes and Docker
- Experience designing and deploying AI workflow technologies such as Apache Airflow, Prefect, and Dagster
- Background with RDMA (InfiniBand or RoCE) fabrics
- Experience working in regulated industries and applying compliance requirements (e.g., DISA STIG, CIS)
- NVIDIA Certifications (AI Infrastructure, AI Operations, AI Networking)
- VMware Certifications (Certified Professional / Advanced Professional)
Clearance:
- An active TS/SCI federal security clearance is required
As required by local law, Accenture Federal Services provides reasonable ranges of compensation for hired roles based on labor costs in the states of California, Colorado, Hawaii, Illinois, Maryland, Massachusetts, Minnesota, New Jersey, New York, Washington, Vermont, the District of Columbia, and the city of Cleveland. The base pay range for this position in these locations is shown below. Compensation for roles at Accenture Federal Services varies depending on a wide array of factors, including but not limited to office location, role, skill set, and level of experience. Accenture Federal Services offers a wide variety of benefits. You can find more information on benefits here. We accept applications on an ongoing basis, and there is no fixed deadline to apply.
The pay range for the states of California, Colorado, Hawaii, Illinois, Maryland, Massachusetts, Minnesota, New Jersey, New York, Washington, Vermont, the District of Columbia, and the city of Cleveland is: $118,300 to $195,100 USD