Lead Reliability Engineer - Herndon, Virginia - LanceSoft
Lead Reliability Engineer
2 months ago
JOB OVERVIEW
Position:
Lead Reliability Engineer
Contract Type:
Long-term engagement (contracts are renewed biannually)
Remote Work:
YES
Team Composition:
Engineering/Infrastructure - consisting of 5 members
Role Summary:
This is a senior-level, advisory position that will act as the Subject Matter Expert (SME) for solutions related to observability, incident management, and foundational infrastructure.
We are looking for an individual who has successfully implemented solutions that enhance business operations as well as the essential infrastructure necessary for deploying and managing those solutions.
The ideal candidate will possess leadership and communication skills to guide and mentor peers. Critical thinking, a technology-focused background with expertise in Python scripting and API integrations, and a comprehensive understanding of system/application interdependencies are essential.
Work Schedule:
Weekdays, with an on-call rotation for nights and weekends
Compensation:
Competitive with traditional senior software engineering roles
Hiring Process:
Remote interview process: an initial half-hour interview with team leads, followed by a technical panel interview (details to be determined). Onboarding is also remote.
Imagine a vibrant environment where you collaborate with exceptional professionals daily, committed to developing technologies that enhance educational pathways for millions of students worldwide.
Visualize an organization where machine learning, distributed systems, artificial intelligence, networking, security, optimization, user experience, and user interface converge to create innovative solutions, with limitless potential for growth.
Are you a dedicated, high-energy, technology-driven engineer? We look forward to working with colleagues who share our passion for technology.
Recognized by industry leaders as one of the most innovative companies in education, LanceSoft is a mission-driven organization focused on enhancing educational opportunities and outcomes, particularly for underprivileged students, within a competitive business landscape.
As a Lead Reliability Engineer, you will research, design, and implement solutions to achieve high-quality process automation within the IT division and across various business units.
You will have experience in designing, developing, and implementing solutions that support business operations and the necessary infrastructure for their deployment.
Hands-on technical skills and experience with cloud services and continuous delivery systems are required. As an engineer, you must possess excellent written and verbal communication skills and be adaptable to the evolving needs of the department and organization.
Building and maintaining effective relationships with team members and stakeholders across multiple projects is crucial.
This role will leverage the knowledge acquired during your studies in Computer Science, Electrical Engineering, or a related engineering discipline.
KEY RESPONSIBILITIES
- Design, develop, and implement automated solutions based on established standards and processes to ensure consistency across the organization, minimizing risk and enhancing efficiency in alignment with organizational goals.
- Ensure the quality of your work; develop and implement quality criteria and validation methods to guarantee deliverables meet expected quality standards, utilizing quality management metrics to maintain quality levels and exploring new techniques for improvement.
- Continuously review observability products, both custom and commercial off-the-shelf (COTS), and implement industry best practices to enhance the efficiency, scalability, and quality of observability tools.
- Manage and resolve incidents, conduct incident reviews, and proactively address problems.
- Take on key response roles during major incidents and participate in an on-call rotation with team members. Engage in post-incident reviews for Root Cause Analysis (RCA).
- Contribute to system design consulting, AWS platform management, and capacity planning.
- Provide ongoing support (coaching and mentoring) for team members' work activities.
- Utilize product SLAs and enterprise metrics to ensure product availability and user experience quality, seeking innovative methods to enhance quality and analyzing the impact of changes on application performance and availability.
- Design and develop tools and processes to improve infrastructure reliability and facilitate monitoring and reporting.
- Write complex code, build infrastructure as code, work with serverless cloud environments, and develop the necessary automated toolsets to support continuous metric collection.
- Integrate COTS products into the continuous delivery pipeline to create a comprehensive automated system for the development, testing, and deployment of applications.
- Act as a hands-on engineer who leads by example, taking responsibility for creating design specifications, unit testing, and preparing technical documentation. Develop solutions from initial business concepts through to operational integrity.
- Support the establishment of observability standards by creating user-friendly templates to enhance adoption.
- Foster a community of practice for collective learning regarding observability tools and systems across all development teams.
- Participate in an on-call rotation to respond to incidents affecting availability and provide support for development team engineers with customer-related incidents.
- Leverage on-call experiences to analyze and prevent future incidents.
QUALIFICATIONS
- A bachelor's degree in Computer Science, Engineering, or Management Information Systems is preferred.
- 5-8 years of experience in software systems, programming, and infrastructure development and administration.
- Proven experience as a DevOps engineer in a scalable production environment, managing one or more of the Atlassian Suite products (Jira, Confluence, Bitbucket, Crowd).
- Ability to thrive in high-pressure environments, quickly troubleshoot complex issues, and effectively manage multiple priorities.
- Strong practical skills in Linux-based systems administration and scripting in a cloud environment.
- Experience with programming languages and their frameworks and design patterns.
- Familiarity with APIs and Microservices.
- Working knowledge of IP Networking, VPCs, DNS, Load Balancing, and Firewalls.
- Experience in building infrastructure as code using AWS CDK, CloudFormation, or similar tools.
- Experience managing production releases using AWS CodePipeline.
- Expertise with Git and Bitbucket, including branching workflows.
ADDITIONAL QUALITIES
- Excellent interpersonal and collaboration skills, with the ability to work effectively with a diverse range of colleagues.
- Strong decision-making, problem-solving, critical thinking, and testing skills.
- Self-motivated with the ability to prioritize, work independently, and achieve goals.
- A commitment to continuous improvement and a desire to learn new skills.
- Strong ability to grasp and internalize the broader implications of projects.