Offer summary

Qualifications:

Proven experience in managing and monitoring large-scale environments.

Strong understanding of Linux OS and basic networking protocols.

Knowledge of Python and of tools such as Ansible and Kubernetes is a plus.

Excellent verbal and written communication skills.

Key responsibilities:

Responsible for the uptime and reliability of infrastructure for Quotient & NRS.

Develop and automate monitoring to improve mean time to detect and recover from issues.

Manage activities/projects related to Datacenter and Cloud Management.

Provide 24/7 monitoring and response to incidents, ensuring system health and reliability.

Job description

About Company:

Quotient, a subsidiary of Neptune Retail Solutions, is the leading digital media and promotions technology company that creates cohesive omnichannel brand-building and sales-driving opportunities to deliver valuable outcomes for advertisers, retailers, and consumers. The Quotient platform is powered by exclusive consumer spending data, location intelligence, and purchase intent data to reach millions of shoppers daily and deliver measurable, incremental sales. Quotient partners with leading advertisers and retailers, including Clorox, Procter & Gamble, General Mills, Unilever, CVS, Dollar General, and Peapod Digital Labs, a company of Ahold Delhaize USA. For more information, visit www.quotient.com.

Quotient is an equal opportunity employer. We celebrate diversity and do not unlawfully discriminate on the basis of race, color, national origin, ancestry, creed, sex, gender, sexual orientation, gender identity or expression, age (40 and over), religion, political affiliation, citizenship, disability, marital or registered domestic partner status, veteran status, legally protected medical conditions, or any protected category prohibited by local, state or federal laws.

Team Description:

As a Site Reliability team, we enjoy working on challenges that no one has solved yet. As the first line in handling any issue for the company, we partner with other Engineering and Product teams to have the right toolset to deliver the best customer experience on Quotient, Neptune Retail Solutions, and partner sites. Together, SREs manage a large-scale system made up of thousands of servers in on-prem data centers and the cloud, with request rates in the tens of thousands per minute, sub-millisecond SLAs, and data measured in terabytes. We respond to production issues on a 24/7 basis and support a technology stack comprising cloud (GCP & AWS), VMware, Ubuntu, Kubernetes, Java, MySQL, Cassandra, distributed event streaming, memory stores, network devices (switches, routers, firewalls, load balancers), and storage.

You are the right fit if you are passionate about operational excellence for large-scale platforms and distributed systems that underpin the company's Promotion, Media, and Analytics offerings. You will be right for this role if you bring a software engineering perspective to deliver quality operations at scale, driving automation in every aspect of the job. You need to bring

Responsibilities:

Responsible for the uptime and reliability of infrastructure of Quotient & NRS.
Developing and automating monitoring to help improve mean time to detect, mitigate, and recover.
Responsible for activities/projects involving Datacenter (On-prem/Cloud) Management.
Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
24x7x365 monitoring and response to events, incident management, problem management, change management activities, KPI reporting, and CMDB management.
Responsible for managing activities/projects for the Network team, DB team, BI, etc.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Use industry tools such as SolarWinds, Nagios, Splunk, ELK, PagerDuty, Grafana, Prometheus, Loki, and Zabbix.
Apply a systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive.
Provide input into process and procedure for increasing reliability, reducing procedural errors, and managing change within the datacenters.
Managing, provisioning, and servicing Datacenter and Cloud servers.
Responsible for identifying problem incidents and driving them to resolution.

Qualifications/Requirements:

Proven experience in managing and monitoring large-scale environments.
Ability to assess, prioritize, and escalate faults efficiently.
Strong understanding of Linux OS (e.g., configuring networks, LVM, and troubleshooting Linux performance issues).
Basic understanding of networking protocols and components (HTTP, DNS, TCP/IP, OSI Model, etc.).
Knowledge of Python and of tools such as Ansible, Kubernetes, Docker, and Terraform is a plus.
Familiarity with log aggregation and metrics tools (e.g., Splunk, ELK, Loki, Prometheus).
Experience with industry-standard monitoring and operations tools such as Nagios, CMDB, Splunk, PagerDuty, Grafana, Zabbix, Datadog, etc.
Willingness to work in a 24x7x365 monitoring and support environment.
Excellent verbal and written communication skills.