Job description
Scope of Work: HPC Cluster Deployment
• Automate the deployment process of HPC clusters using CI/CD pipelines by utilizing GitHub pipeline and AWS Systems Manager.
• Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.
• Set up and configure HPC clusters to meet specific requirements and workloads.
• Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.
• Conduct regression testing to verify the functionality and performance of non-GXP HPC clusters.
Workload Scheduler Management:
• Install and configure workload managers and schedulers like LSF, SLURM, and PBS Pro.
• Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.
• Develop and manage resource policies and rules to optimize cluster performance.
• Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.
• Address and resolve issues related to schedulers, daemons, and license servers.
Network and High-Performance Connectivity Management:
• Install and configure HPC interconnect networks.
• Design and configure the network topology for HPC clusters.
• Ensure the maintenance and monitoring of InfiniBand connectivity.
• Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.
Monitoring and Reports:
• Produce daily health check reports for the HPC cluster.
• Automate monitoring scripts to streamline the monitoring process.
• Conduct periodic reviews of reports and audit trails.
OS Administration and Management:
• Install and configure operating systems for HPC clusters.
• Address OS-related issues such as CPU, memory, and SWAP utilization, and perform application file system cleanup.
• Ensure application service continuity by performing pre and post checks from both OS and application perspectives during planned and unplanned outages.
Applications and Tools:
• Install HPC libraries and tools such as MPI and compilers.
• Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack.
• Apply patches and upgrades to HPC applications.
• Resolve issues related to HPC applications.
HPC Storage Management:
• Administer and configure HPC storage systems.
• Oversee the administration of HPC file systems.
• Monitor and troubleshoot HPC storage systems.
• Manage backup and tape library systems.
Below is the key responsibility, essential skills of the resources we will deploy.
Key Responsibilities
• Cluster Management: Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS),
and interconnects (InfiniBand, RoCE).
• Performance Tuning: Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.
• User Support: Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices.
• Software Management: Manage workload managers (Slurm, LSF), schedulers, software licensing (FlexLM), OpenPBS,
containers (Singularity), and compilers.
• Infrastructure: Administer high-speed interconnects (InfiniBand), storage (Lustre, CEPH), and potentially cloud/hybrid solutions.
• Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).
• Automation: Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.
• Security & Policy: Implement and enforce security policies, manage user access, and oversee lifecycle management.
Essential Skills & Qualifications
• Technical Expertise: Strong Linux, Python, scripting (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.
Team will have knowledge of Gilead systems and AWS CICD pipelines.
• HPC Domain Knowledge: Experience with parallel file systems, workload management, and performance analysis tools.
• Problem Solving: Excellent analytical and debugging skills for complex distributed systems.
• Communication: Ability to explain complex technical issues to scientists and non-technical stakeholders.
Experience: Hands-on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads.

