ML Systems Engineer, Infrastructure & Cloud
Company: Basis Research Institute
Location: New York City
Posted on: April 4, 2026
Job Description:
About Basis

Basis is a nonprofit applied AI research
organization with two mutually reinforcing goals. The first is to
understand and build intelligence. This means to establish the
mathematical principles of what it means to reason, to learn, to
make decisions, to understand, and to explain; and to construct
software that implements these principles. The second is to advance
society’s ability to solve intractable problems. This means
expanding the scale, complexity, and breadth of problems that we
can solve today, and even more importantly, accelerating our
ability to solve problems in the future. To achieve these goals,
we’re building both a new technological foundation that draws
inspiration from how humans reason, and a new kind of collaborative
organization that puts human values first.

About the Role

ML Systems Engineers at Basis ensure training and evaluation
infrastructure is fast, reliable, and scalable. You will own the
full stack from distributed training frameworks through cloud
administration, making it possible for researchers to iterate
quickly on complex models while managing computational resources
efficiently. We are looking for engineers who combine deep
understanding of ML systems with operational excellence. The ideal
ML Systems Engineer has experience with distributed training at
scale, understands the intricacies of debugging numerical
instabilities, and can manage cloud infrastructure that scales from
experiments to production. You will be the guardian of training
stability, the optimizer of compute costs, and the enabler of
reproducible research. This role spans traditional ML engineering
and cloud/DevOps responsibilities. You will manage GPU clusters,
optimize cloud spending, ensure security and compliance, and build
the infrastructure that lets researchers focus on algorithms rather
than operations. We seek individuals who aspire to build robust ML
infrastructure, maintain a “logbook culture” for documenting issues
and solutions, and treat operational excellence as a first-class
concern.

We expect you to:

- Have demonstrated expertise in ML systems engineering. Examples
  include:
  - Managing distributed training jobs across hundreds of GPUs
  - Debugging and fixing numerical instabilities in large-scale
    training
  - Building infrastructure for reproducible ML experiments
  - Optimizing training throughput and resource utilization
- Possess deep knowledge of distributed training frameworks,
  including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO),
  gradient accumulation, mixed precision training, and
  checkpoint/recovery systems.
- Have strong cloud administration skills, including AWS/GCP/Azure
  services, infrastructure as code (Terraform), Kubernetes
  orchestration, cost optimization, security best practices, and
  compliance requirements.
- Understand the full ML stack from hardware (GPUs, interconnects,
  storage) through frameworks (PyTorch, JAX) to high-level training
  loops and evaluation pipelines.
- Be skilled at debugging complex failures across the stack:
  GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient
  explosions, and convergence problems.
- Value documentation and knowledge sharing. You maintain
  comprehensive logs of issues encountered, solutions found, and
  lessons learned, building institutional knowledge.
- Progress with autonomy while coordinating closely with
  researchers. You can anticipate infrastructure needs, prevent
  problems before they occur, and respond quickly when issues arise.

In addition, the
following would be an advantage:

- Experience at organizations training large models (OpenAI,
  Anthropic, Google, Meta).
- Background in both ML research and production systems.
- Contributions to ML frameworks or distributed training libraries.
- Experience with on-premises GPU cluster management.
- Knowledge of optimization theory and numerical methods.
- Understanding of robotics-specific infrastructure requirements.

Responsibilities:

- Own distributed training infrastructure, including job launchers,
  checkpointing systems, recovery mechanisms, and monitoring that
  ensures experiments run reliably at scale.
- Debug and resolve training failures by diagnosing issues across
  GPUs, networking, numerics, and data pipelines, maintaining
  detailed logs of problems and solutions.
- Profile and optimize training performance by identifying
  bottlenecks in data loading, gradient computation, and
  communication overhead, and by implementing solutions that improve
  step time.
- Manage cloud infrastructure and costs, including capacity
  planning, spot instance strategies, storage optimization, and
  building tools that give researchers visibility into resource
  usage.
- Implement security and compliance measures, including access
  controls, data encryption, and audit logging, and ensure
  infrastructure meets requirements for handling sensitive data.
- Build evaluation and benchmarking infrastructure that enables
  consistent, reproducible measurement of model performance across
  different conditions and datasets.
- Develop monitoring and alerting systems that detect anomalies in
  training metrics, resource utilization, or system health, enabling
  rapid response to issues.
- Maintain development environments, including containerization,
  dependency management, and tools that ensure researchers can
  reproduce results across different systems.
- Document and share knowledge through runbooks, post-mortems, and
  training materials that help the team understand and operate ML
  infrastructure effectively.
- Collaborate with researchers to understand requirements, suggest
  infrastructure solutions, and ensure systems support rather than
  constrain research goals.

Role Details

Exceptional candidates who may not meet all of the following
criteria are still encouraged to apply.

- FT/PT: Full-time.
- In-person Policy: We are in the office four days a week. Be
  prepared to attend multi-day Basis-wide in-person events.
- Location: New York City or Cambridge, MA.
- Salary range: Competitive salary.

Privacy Notice

By submitting your application, you grant Basis
permission to use your materials for both hiring evaluation and
recruitment-related research and development purposes. Your
information may be processed in different countries, including the
US. You retain copyright while providing Basis a license to use
these materials for the stated purposes. Read our full Global Data
Privacy Notice here.