Senior sys admin

3 months ago


Berkeley, United States | Astera Institute | Full time
Job Description

About Astera

Jed McCaleb founded the Astera Institute as a non-profit dedicated to developing high-leverage technologies that can lead to massive returns for humanity.

About Obelisk

Obelisk is the Artificial General Intelligence (AGI) lab at Astera. Obelisk’s mission is to produce AGI in a safe, socially beneficial way. We focus on different problems and take different approaches than some other AGI efforts. In particular, we are focused on the following problems:

  • How does an agent continuously adapt to a changing environment and incorporate new information?

  • In a complicated stochastic environment with sparse rewards, how does an agent associate rewards with the correct set of actions that led to those rewards?

  • How does higher-level planning arise?

What we're looking for

We’re looking for a system administrator / site reliability engineer (SRE) who will be in charge of the low-level systems that we use to do our machine learning research. We use a large number of GPUs to run experiments of various sizes. We need someone to make that infrastructure performant, reliable, efficient, and secure.

We’re currently using the following technologies, but as our first and only SRE, you would be free to change most of this:

  • Bare-metal servers running Ubuntu, configured via Ansible.

  • Some of our servers are on-prem; others are rented from a specialty provider of GPU servers.

  • Clusters running Kubernetes, deployed via Ansible (Kubespray).

  • We run various services including self-hosted GitHub runners.

  • Our machine learning training uses Ray for multi-node jobs.

  • Tailscale for VPN / secure access.

  • Google Workspace for SSO.

Your Responsibilities:

  • Network administration: make it fast, easy, and secure for us to connect to our clusters.

  • Kubernetes cluster management: make sure our clusters and all the workloads we run on them are reliable and easy to use.

  • Information security: make sure everything we do is secure.

Basic Qualifications:

  • 5 years of relevant experience in domains such as Linux server administration, networking, information security, or Kubernetes administration.

Preferred Qualifications:

  • Experience running a bare-metal Kubernetes cluster

  • Deep knowledge of networking (TCP/IP, NAT, firewalls, VLANs)

  • Familiarity with Tailscale

Location

You will be required to be in the office in Berkeley, California at least once per week because we have our own hardware on-premises. Beyond that, most work can be done remotely, but you must be available during normal Pacific business hours.

Why work here?

• Plenty of funding and computers. 

• Trying to advance the state of the art in AI, which requires facing fascinating technical problems.

• Small focus. Other places are doing research into lots of problems simultaneously (e.g., DeepMind), or are doing research and building products (e.g., Anthropic). We are completely focused on a small set of problems.

• Small. This has benefits and disadvantages, but a huge advantage is less communication overhead and bureaucracy. This makes work faster and more fun. 

• No outside funding means there’s no pressure to chase trends or make products.

Compensation Range: $150K - $300K