Service faults in an online production environment, such as resource overload and service logic errors, directly affect the normal use of services. When a fault occurs, operation and maintenance personnel must receive the alarm information about the error quickly so that service availability can be ensured. An alarm monitoring system is therefore essential for an online system. It monitors service health status, service interface response time, and key performance indicators. When a service problem occurs, it notifies the maintenance personnel as soon as possible and keeps the necessary error snapshot information to facilitate troubleshooting and recovery.
Our system monitoring aims to achieve the following goals:
Continuous system monitoring: The monitoring system monitors the service system in real time and reports alarms to operation and maintenance personnel if system exceptions occur.
Different dimensions of monitoring: Unexpected problems may occur while the system is running, such as network interruptions, abnormal system resource usage, and service logic errors. The system therefore needs to monitor both performance indicators and log information.
Complete alarm records: All alarms are persisted. When operation and maintenance personnel come back online to view the error information, they can trace an alarm caused by an abnormal performance indicator to its specific time period and alarm subject. (For example, if CPU usage is abnormal, they can find the machine that generated the error within the time range.) If the alarm is caused by a specific service logic error, they can query the error logs that triggered the exception and analyze the fault based on the troubleshooting manual.
Closed-loop troubleshooting: After receiving the alarm information, operation and maintenance personnel analyze the fault based on the troubleshooting manual and resolve it promptly, forming a complete closed loop of problem occurrence -> problem discovery -> problem analysis -> problem solving.
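To make the alarm record goal more concrete, the following is a minimal sketch of how alarms could be persisted and later traced by time range. It is not the implementation of the system described here: the Alarm fields, the AlarmStore class, the check_cpu helper, the SQLite storage, and the 90% CPU threshold are all assumptions chosen for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Alarm:
    ts: int       # Unix timestamp (seconds) at which the alarm fired
    subject: str  # hypothetical alarm subject: a host name or a service name
    kind: str     # "metric" for performance indicators, "log" for logic errors
    detail: str   # e.g. "cpu_usage=97.0" or the log line that triggered it


class AlarmStore:
    """Persists every alarm so it can be traced later (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS alarms "
            "(ts INTEGER, subject TEXT, kind TEXT, detail TEXT)"
        )

    def record(self, alarm):
        self.db.execute(
            "INSERT INTO alarms VALUES (?, ?, ?, ?)",
            (alarm.ts, alarm.subject, alarm.kind, alarm.detail),
        )
        self.db.commit()

    def query(self, start, end, kind=None):
        """Return alarms with start <= ts <= end, optionally filtered by kind."""
        sql = "SELECT ts, subject, kind, detail FROM alarms WHERE ts BETWEEN ? AND ?"
        params = [start, end]
        if kind is not None:
            sql += " AND kind = ?"
            params.append(kind)
        sql += " ORDER BY ts"
        return [Alarm(*row) for row in self.db.execute(sql, params)]


def check_cpu(store, ts, host, cpu_usage, threshold=90.0):
    """Record a metric alarm when CPU usage exceeds the assumed threshold."""
    if cpu_usage > threshold:
        store.record(Alarm(ts, host, "metric", f"cpu_usage={cpu_usage}"))
        return True
    return False


# Example: one CPU alarm and one service log alarm, then a time-range lookup.
store = AlarmStore()
check_cpu(store, 1000, "vm-01", 97.0)
check_cpu(store, 1060, "vm-02", 42.0)  # below the threshold, nothing is recorded
store.record(Alarm(1100, "order-service", "log", "error in createOrder"))
hits = store.query(900, 1050, kind="metric")
```

With this layout, a query over a time window returns the machine that raised the CPU alarm, while a query with kind="log" returns the error logs to check against the troubleshooting manual.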
2. Technology selection
The monitoring and alarm system is built by integrating open source systems currently available on the market. Telegraf + InfluxDB is used to monitor VM indicators, Prometheus + Grafana is used to monitor basic indicators, and fELK is used to monitor service logs.