HPC Platform Engineer

Job description

Scope of Work

HPC Cluster Deployment:

• Automate HPC cluster deployment with CI/CD pipelines built on GitHub pipelines and AWS Systems Manager.

• Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.

• Set up and configure HPC clusters to meet specific requirements and workloads.

• Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.

• Conduct regression testing to verify the functionality and performance of non-GxP HPC clusters.

Workload Scheduler Management:

• Install and configure workload managers and schedulers such as LSF, Slurm, and PBS Pro.

• Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.

• Develop and manage resource policies and rules to optimize cluster performance.

• Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.

• Address and resolve issues related to schedulers, daemons, and license servers.

Network and High-Performance Connectivity Management:

• Install and configure HPC interconnect networks.

• Design and configure the network topology for HPC clusters.

• Maintain and monitor InfiniBand connectivity.

• Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.

Monitoring and Reports:

• Produce daily health check reports for the HPC cluster.

• Develop scripts to automate and streamline routine monitoring.

• Conduct periodic reviews of reports and audit trails.

OS Administration and Management:

• Install and configure operating systems for HPC clusters.

• Address OS-related issues such as high CPU, memory, and swap utilization, and perform application file system cleanup.

• Ensure application service continuity by performing pre- and post-checks at both the OS and application level during planned and unplanned outages.

Applications and Tools:

• Install HPC libraries and tools such as MPI and compilers.

• Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack.

• Apply patches and upgrades to HPC applications.

• Resolve issues related to HPC applications.

HPC Storage Management:

• Administer and configure HPC storage systems.

• Oversee the administration of HPC file systems.

• Monitor and troubleshoot HPC storage systems.

• Manage backup and tape library systems.

Below are the key responsibilities and essential skills of the resources we will deploy.

Key Responsibilities

• Cluster Management: Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE).

• Performance Tuning: Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.

• User Support: Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices.

• Software Management: Manage workload managers and schedulers (Slurm, LSF, OpenPBS), software licensing (FlexLM), containers (Singularity), and compilers.

• Infrastructure: Administer high-speed interconnects (InfiniBand), storage (Lustre, Ceph), and, where applicable, cloud or hybrid solutions.

• Monitoring & Orchestration: Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).

• Automation: Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.

• Security & Policy: Implement and enforce security policies, manage user access, and oversee lifecycle management.

Essential Skills & Qualifications

• Technical Expertise: Strong skills in Linux, Python, automation tooling (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.

The team will have knowledge of Gilead systems and AWS CI/CD pipelines.

• HPC Domain Knowledge: Experience with parallel file systems, workload management, and performance analysis tools.

• Problem Solving: Excellent analytical and debugging skills for complex distributed systems.

• Communication: Ability to explain complex technical issues to scientists and non-technical stakeholders.

• Experience: Hands-on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads.