HPC Engineering Leader


Principal Duties and Responsibilities

  • Participate in the design and implementation of Linux-based HPC, infrastructure, and parallel file system servers and clusters
  • Design and maintain a multi-petabyte distributed storage system 
  • Optimize resource utilization and job scheduling 
  • Analyze performance issues at scale 
  • Troubleshoot node-level issues, such as kernel panics and system hangs 
  • Propose new solutions and argue for their inclusion 
  • Monitor the installation of software releases, operating system patches, and third-party utilities, with emphasis on overall system security
  • Hire and retain staff, facilitate training and development, provide guidance and coaching, complete performance evaluations on schedule, and work with Human Resources to enforce organizational policies and procedures 
  • Establish a positive team environment through leadership and mentoring. Work with staff members to develop individualized professional development paths with clear goals and objectives
  • Provide the proper supervision, work environment and structure to ensure performance and well-being of employees. Provide clear channels of communication, delegation and accountability within the team for effective problem solving
  • Use the Partners HealthCare values to govern decisions, actions and behaviors. These values guide how we get our work done: Patients, Affordability, Accountability & Service Commitment, Decisiveness, Innovation & Thoughtful Risk; and how we treat each other: Diversity & Inclusion, Integrity & Respect, Learning, Continuous Improvement & Personal Growth, Teamwork & Collaboration 

Qualifications

  • Bachelor's degree or equivalent combination of education and experience required. Computer science, engineering, or equivalent undergraduate and graduate degrees are preferred
  • A minimum of 8 years of experience in data engineering with 3 years specifically in HPC 
  • At least 5 years of experience in Linux administration in a financial services or research environment
  • Previous experience in a supervisory position
  • Hands-on knowledge of distributed filesystems and storage, such as GPFS, Lustre and object storage, and knowledge of ZFS
  • Extensive experience with HPC or cloud scheduling, such as GridEngine, HTCondor, SLURM, Mesos and Nomad
  • Experience with configuration management tools such as Ansible, Chef, Puppet or Salt
  • Experience supporting cluster interconnects such as InfiniBand, 100GbE, or Omni-Path
  • Strong knowledge of local and distributed I/O performance tuning 
  • Experience with open source applications to build enterprise-level systems 
  • Previous working experience with x86 hardware testing and integration 
  • Fluency in bash and at least one other scripting language
  • Clear, demonstrable evidence of exceptional productivity and performance in competitive environments
  • Knowledge of software team management philosophies (e.g., Agile, Scrum) and various product management/software development tools (e.g., JIRA, Trello) is required

Skills/Abilities/Competencies Required

  • Strong sense of urgency and proactiveness 
  • Ability to function effectively and independently in a fast-paced environment, organize and prioritize work independently, and meet tight deadlines 
  • Self-motivated, with an entrepreneurial mindset and ability to learn quickly 
  • Excellent project management skills (ability to multitask and prioritize work requirements) with a strong commitment to customer service
  • Strong analytical, planning, organization and time management skills with a high attention to detail 
  • Excellent interpersonal skills to effectively communicate with technical teams, cross-functional teams, and staff at all levels of the organization including both technical and non-technical personnel 
  • Ability to successfully negotiate and collaborate with others of different skill sets, backgrounds and levels within and external to the organization 
  • Ability to relate to and gain insights from product end users 
  • Excellent and succinct written and oral communication skills 
  • Ability to effectively conduct meetings and lead and facilitate large working sessions with all levels of staff and across various stakeholder groups 
  • Ability to empathize with end users, understand and intuit customer needs, and gain insights from product end users
  • Strong decision-making skills, with the ability to negotiate and balance decisions and priorities across functions; comfort making hard decisions in a timely manner with incomplete data
  • Strong evidence of algorithmic and structured thinking, with an intuition for logic, pattern matching, what-if analysis, problem decomposition and synthesis
  • Demonstrated ability to organize and incorporate complex systems requirements into product features and prioritize features effectively