Automation & Reliability Engineer: Reporting to the VP of Automation & Reliability Engineering, this position is critical to the mission of the Automation & Reliability Engineering team as part of the Global Infrastructure Services group. The core purpose of the role is to ensure that our complex technologies and systems are designed and operated for high efficiency and low risk.
Responsibilities of Automation & Reliability Engineer:
– Own or assist in the configuration of monitoring and alerting systems in support of observability goals for myriad systems and services, including our public-facing services.
– Evaluate incident response requirements and workflows, and contribute to continuous improvement through process refinement and automation.
– Gain deep knowledge of our complex applications
– Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth.
– Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring, and root cause analysis
– Participate in objective, no-blame post-incident analysis and review processes
– Lead capacity planning on behalf of operations, and help the team anticipate and prepare for growth
– Support the end-to-end availability and performance of mission-critical services. Build automation to prevent problem recurrence. Partner with specialists to build automated responses for non-exceptional service conditions.
– Develop reliability tools and frameworks for use by all operations teams
– Ensure all key services are measured and monitored, and raise alerts when needed
– Partner with specialists on automating the deployment and configuration processes
– Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX and Windows environment.
– Function well in a fast-paced, rapidly-changing environment.
– Be on-call when required to support our operations centers.
Requirements of Automation & Reliability Engineer:
– Bachelor’s degree in Computer Science, Information Technology, Mathematics, Software or Broadcast Engineering, or another technical discipline, or related practical experience.
– 3+ years of experience troubleshooting in Unix/Linux
– Experience in the Linux environment and a good understanding of its fundamentals and internals: filesystems and modern memory management, threads and processes, the user/kernel-space divide, etc.
– Background in configuration and management of large-scale platforms (virtualization, cloud, Unix, Linux, Java, SQL, Oracle).
– A good understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring, and storage systems.
– Working knowledge of the TCP/IP stack, internet routing, and load balancing
– Working exposure to linear and digital broadcasting and platforms preferred
– Knowledge of most of these: data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and related topics
– Previous experience working with geographically-distributed coworkers.
– Strong verbal, written, and interpersonal communication skills, strong customer service skills, and the ability to work well in a globally diverse, team-focused environment
– Good organizational and conceptual skills combined with proven critical-thinking, analytical, problem-solving, and decision-making abilities
– Ability to multi-task within related functions
– Positive attitude and can-do mentality
– Experience working for a media company or broadcaster is desirable but not essential
– Must have the legal right to work in the United States.