Job Summary
Roles and Responsibilities:
System Delivery and Deployment:
- Deploy and configure on-premise and cloud Linux and Windows servers, and related services.
- Set up and manage virtualized environments, including Proxmox and Hyper-V hypervisors.
- Install and configure monitoring and observability tools such as Grafana, Prometheus, ELK Stack, and Telegraf, and automation tooling such as SaltStack.
- Integrate data stores such as PostgreSQL, Mimir, and Elasticsearch to support the data infrastructure.
System Maintenance and Monitoring:
- Monitor infrastructure performance and availability, using observability tools to ensure continuous operation and health of all systems.
- Manage updates, patches, and lifecycle maintenance for Linux and Windows systems.
- Troubleshoot system and application issues, providing quick, accurate resolutions to maintain uptime.
System Improvement and Optimization:
- Continuously optimize systems and infrastructure for enhanced performance, reliability, and scalability.
- Develop and maintain automation scripts and configurations (using SaltStack, Terraform, and Ansible) to streamline system processes and reduce manual intervention.
- Analyze logs, metrics, and data trends to identify potential system enhancements.
Customer and Product Support:
- Serve as a technical point of contact for customers, providing Tier 2 and Tier 3 support as needed.
- Work with cross-functional teams to troubleshoot, escalate, and resolve issues impacting customer experience.
- Communicate proactively with customers about system improvements, updates, and troubleshooting processes.
Documentation and Knowledge Sharing:
- Create and update comprehensive documentation for system configurations.
- Participate in knowledge-sharing sessions to keep the team updated on best practices, new features, and changes.
Deliverables:
Operational Excellence:
- Uptime metrics for core services meet or exceed 99.9%.
- Timely completion of scheduled system maintenance with minimal disruptions.
- Rapid and accurate resolution of issues, tracked via support and incident metrics.
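As a rough illustration of what the 99.9% uptime target implies, the Python sketch below converts an uptime percentage into a downtime budget. The function name and the 30-day and 365-day periods are assumptions chosen for illustration; the measurement window is not specified here.

```python
def downtime_budget_minutes(target_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, allowed by an uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


if __name__ == "__main__":
    # Assumed measurement windows: a 30-day month and a 365-day year.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = downtime_budget_minutes(99.9, days)
        print(f"{label}: {budget:.1f} minutes")
    # prints:
    # 30-day month: 43.2 minutes
    # 365-day year: 525.6 minutes
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per month, which covers both planned maintenance and unplanned outages.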
Deployment and Configuration:
- Successful deployment of new systems or enhancements within project deadlines.
- Configuration management files/scripts stored securely and maintained in version control.
Documentation:
- Up-to-date system documentation, including configurations, deployment steps, and troubleshooting guides.
- Clear and comprehensive incident reports for major support cases or outages.
Customer Satisfaction:
- Positive customer feedback on system reliability and responsiveness to support requests.
- Timely delivery of customer-facing updates and notifications, and timely resolution of support queries.
Qualifications:
- Bachelor’s degree in Computer Science, Information Systems, or related field, or equivalent practical experience.
- 3+ years of experience in systems engineering, infrastructure, or observability solutions.
- Proficiency in Linux and Windows operating systems and virtualized environments.
- Experience with monitoring and observability tools (Grafana, Prometheus, ELK Stack) and automation tools (SaltStack).
- Strong knowledge of SQL and NoSQL databases (PostgreSQL, Elasticsearch).