Infrastructure Engineer

We are looking for an experienced engineer to join our Infrastructure team and build and scale services in a cloud environment. You’ll be working with a high-energy, fast-paced team responsible for supporting initiatives and operations across Toptal.

This is a remote position that can be done from anywhere. Resumes and all communication must be in English.

Responsibilities:

Toptal services are deployed across hundreds of servers. You will be responsible for designing, building, deploying, and maintaining highly available production systems, sharing ownership with the development teams.

You’ll develop tooling and processes to improve the developers’ experience, implement monitoring for automated system health checks, develop procedures, and maintain documentation for system troubleshooting and maintenance. Lastly, rather than just administering clusters and cloud services, you’ll collaborate with engineering teams to improve the company’s engineering tools, systems, procedures, and data security.

We hold daily scrum standups (GMT-3 to GMT+5). Expect to pair program, take part in peer code reviews, and use collaboration tools like Slack and Zoom.

In the first week you will:



Join our boot camp team and begin your onboarding into Toptal.



Learn about our team’s processes and get familiar with the code that maintains our infrastructure resources.



In the first month you will:



Learn about our systems: why they are built the way they are and how to improve them.



Monitor system security, performance, and availability.



Begin to participate in a variety of roles in a wide range of Infrastructure projects.



Review procedures and documentation for system troubleshooting and maintenance.



In the first three months you will:



Perform regular system maintenance, including OS and application patches, driver updates, and performance monitoring.



Provide excellent customer service by seeking to understand and address the teams’ needs and expectations through effective communication and collaboration while learning about our infrastructure.



Deliver internal infrastructure and services, such as monitoring, logging, and data services, to our internal users.



Support the development of CI/CD pipelines.



In the first six months you will:



Support infrastructure design, architecture, and implementation. You may be involved in designing networks, identifying new technologies to support the business, and resolving infrastructure compatibility and performance problems as they arise.



Participate in the on-call rotation schedule (during and after business hours) to support all infrastructure-related systems.



Report any downtime or performance issues affecting our systems, drill down to find the root cause, and coordinate with the development teams to resolve them.



Resolve incidents yourself when a developer is not needed.



Participate in our disaster recovery, change control, and security standards initiatives.



In the first year you will:



Communicate with key stakeholders on project engagements.



Partner closely with our Engineering teams to develop infrastructure automation and management solutions on Google Cloud Platform, with a keen focus on scalability, observability, reliability, security, and quality.



Plan and coordinate testing of changes, upgrades, patches, new releases, and new services.



Participate in technology initiatives that enable developers to deliver their services to our customers with a minimal amount of friction and a high degree of quality.



Requirements:



Be well-versed in deploying automation with tools like Ansible and Terraform, as well as in version control.



Be eager to help your teammates, share your knowledge with them, and learn from them.



Previous experience managing infrastructure configuration and provisioning through code for large, distributed systems on public cloud platforms (AWS, GCP).



Solid understanding of Linux debugging, LAN and WAN networking, IP addressing, load balancing, VPNs, and routing.



A strong understanding of modern systems and service-related security best practices.



Hands-on experience with system and application metric collection and alerting services such as Graphite, Grafana, Prometheus, InfluxDB, Sensu, or others. A keen focus on what makes a system observable.



Proficiency in scripting languages such as Python, Bash, or Ruby.



Understanding of and experience with continuous integration and continuous deployment patterns and tools such as Jenkins and Travis CI.



Superior troubleshooting skills. Experience in resolving difficult problems through various troubleshooting protocols and processes.



Experience with Docker, Docker Compose, and creating optimized Dockerfiles.



Experience building, operating, and debugging Kubernetes is a plus.



Experience managing relational databases (RDBMS). PostgreSQL experience is an added advantage.



Willingness to participate in the on-call rotation schedule (during and after business hours) to support all infrastructure-related systems.







