Welcome to the Self-Healing Server project! This server management system autonomously monitors, diagnoses, and manages server health to maintain performance and high availability. It combines adaptive health checks, dynamic threshold adjustment, and machine learning-based anomaly detection to catch and correct problems before they cause downtime.
- Features
- Project Structure
- Installation
- Configuration
- Running the Server
- Running the Web Dashboard
- Unit Tests
- Development
- Contributing
- Future Improvements
- Troubleshooting
- License
Health Checks:
- Monitors vital server metrics including CPU usage, memory consumption, disk space, and response time.
- Provides a detailed view of server health and performance.
Adaptive Thresholds:
- Dynamically adjusts health check thresholds based on historical data and trends.
- Adapts to changes in server workload and usage patterns to reduce false positives.
Self-Healing Actions:
- Automatically performs corrective actions, such as restarting services or reallocating resources, when thresholds are exceeded.
- Ensures minimal downtime and consistent performance.
Distributed Monitoring:
- Enables monitoring of multiple servers or services from a central location.
- Aggregates data for a comprehensive view of the entire infrastructure.
Error Handling:
- Implements robust error handling mechanisms to capture, log, and diagnose issues effectively.
- Provides detailed error reports and stack traces for troubleshooting.
Alerts:
- Configurable alerting system for notifying administrators of critical issues, such as high resource usage or failed self-healing actions.
- Supports email notifications and integration with other messaging services.
Prometheus Integration:
- Exposes metrics in a format compatible with Prometheus for advanced monitoring and alerting.
- Facilitates the creation of custom dashboards and alerts in Prometheus.
Machine Learning:
- Utilizes machine learning models to detect anomalies and predict potential issues before they impact server performance.
- Continuously improves anomaly detection accuracy based on historical data.
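As a rough illustration of the adaptive-threshold idea above, the sketch below derives a limit from recent samples (rolling mean plus `k` standard deviations) and falls back to the static value from the configuration until enough history exists. The class name, window size, and `k` are assumptions made for this example; the project's `dynamic_thresholds.py` may work differently.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Hypothetical helper: threshold = rolling mean + k * standard deviation."""

    def __init__(self, baseline, window=60, k=3.0, min_samples=10, ceiling=100.0):
        self.baseline = baseline              # static fallback, e.g. cpu_threshold
        self.samples = deque(maxlen=window)   # only the most recent `window` values
        self.k = k
        self.min_samples = min_samples
        self.ceiling = ceiling                # never raise the limit above this

    def update(self, value):
        """Record a new observation."""
        self.samples.append(value)

    def current(self):
        """Return the threshold implied by recent history."""
        if len(self.samples) < self.min_samples:
            return self.baseline              # not enough history yet
        limit = mean(self.samples) + self.k * pstdev(self.samples)
        return min(self.ceiling, limit)

    def is_exceeded(self, value):
        return value > self.current()
```

A perfectly flat history gives a standard deviation of zero, so the limit collapses to the mean; a real implementation would probably add a minimum margin to avoid alerting on tiny fluctuations.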
The project is organized into the following modules:
.
├── config.json                  # Configuration file for monitoring settings
├── main.py                      # Main entry point to start the server
├── health_checks.py             # Functions to perform various health checks
├── anomaly_detection.py         # Machine learning models and functions for anomaly detection
├── alerts.py                    # Functions to manage and send alerts
├── service_manager.py           # Manages service restarts and recovery actions
├── logging_setup.py             # Configuration for logging and error reporting
├── prometheus_metrics.py        # Setup and management of Prometheus metrics
├── adaptive_health_checks.py    # Implements adaptive health check adjustments
├── error_handling.py            # Utilities for comprehensive error handling
├── dynamic_thresholds.py        # Logic for dynamically adjusting thresholds
├── distributed_monitoring.py    # Functions for monitoring multiple servers
└── web_dashboard
    ├── app.py                   # Flask application for the web dashboard
    ├── static                   # Static files (CSS, JS, images) for the dashboard
    └── templates                # HTML templates for the dashboard
        └── index.html           # Main page of the web dashboard
- Python 3.7 or higher.
- Virtual environment (recommended): keeps the project's dependencies isolated from the system Python installation.
- Clone the Repository:
    git clone https://github.com/tanm-sys/self-healing-server.git
    cd self-healing-server
- Create and Activate Virtual Environment:
    python -m venv venv
    venv\Scripts\activate        # On Windows
    source venv/bin/activate     # On macOS/Linux
- Install Dependencies:
    pip install -r requirements.txt
- Install Additional Development Tools (optional but recommended):
    pip install pytest pylint
Create a config.json file in the root directory with the following structure:
{
  "server_urls": ["http://localhost:8000/health"],
  "cpu_threshold": 80,
  "memory_threshold": 80,
  "disk_threshold": 90,
  "response_time_threshold": 2.0,
  "check_interval": 60,
  "alert_email": "[email protected]",
  "log_level": "INFO",
  "prometheus_port": 9090
}
- server_urls: List of URLs for distributed health checks.
- cpu_threshold: CPU usage percentage that triggers self-healing actions.
- memory_threshold: Memory usage percentage threshold for self-healing.
- disk_threshold: Disk usage percentage threshold for self-healing.
- response_time_threshold: Maximum acceptable response time in seconds.
- check_interval: Interval (in seconds) between health checks.
- alert_email: Email address for receiving alerts.
- log_level: Logging level (e.g., INFO, DEBUG).
- prometheus_port: Port for Prometheus metrics endpoint.
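A configuration mistake is easier to diagnose at startup than in the middle of a monitoring cycle. The following is a minimal sketch of loading and sanity-checking the file; `load_config` and `validate_config` are hypothetical names, and the specific range checks are assumptions rather than rules the project is known to enforce.

```python
import json

PERCENT_KEYS = ("cpu_threshold", "memory_threshold", "disk_threshold")
REQUIRED_KEYS = PERCENT_KEYS + (
    "server_urls",
    "response_time_threshold",
    "check_interval",
    "alert_email",
    "log_level",
    "prometheus_port",
)


def validate_config(config):
    """Raise ValueError if a key is missing or a value is out of range."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    for key in PERCENT_KEYS:
        if not 0 < config[key] <= 100:
            raise ValueError(f"{key} must be a percentage in (0, 100]")
    if config["check_interval"] <= 0:
        raise ValueError("check_interval must be positive")
    if config["response_time_threshold"] <= 0:
        raise ValueError("response_time_threshold must be positive")
    if not 1 <= config["prometheus_port"] <= 65535:
        raise ValueError("prometheus_port must be a valid TCP port")
    if not config["server_urls"]:
        raise ValueError("server_urls must list at least one URL")
    return config


def load_config(path="config.json"):
    """Read the JSON file and validate it before returning the dict."""
    with open(path, encoding="utf-8") as handle:
        return validate_config(json.load(handle))
```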
- Set Up Configuration: Ensure that config.json is correctly configured for your environment.
- Start the Server:
    python main.py
The server will initialize and start performing health checks based on the configuration. It will automatically take corrective actions if necessary.
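Conceptually, the behaviour described above is a loop: collect metrics, compare them with the configured thresholds, trigger a corrective action for each breach, then wait `check_interval` seconds. The sketch below shows that shape only; `find_breaches`, `run`, and the `collect` and `heal` callables are hypothetical and do not reflect the actual contents of `main.py`.

```python
import time


def find_breaches(metrics, config):
    """Return the names of metrics that exceed their configured threshold."""
    limits = {
        "cpu": config["cpu_threshold"],
        "memory": config["memory_threshold"],
        "disk": config["disk_threshold"],
        "response_time": config["response_time_threshold"],
    }
    # A metric missing from `metrics` is treated as 0, i.e. not a breach.
    return [name for name, limit in limits.items() if metrics.get(name, 0) > limit]


def run(config, collect, heal, max_cycles=None, sleep=time.sleep):
    """Poll `collect()` and call `heal(name)` for every breached metric.

    `max_cycles` and `sleep` exist so the loop can be exercised in tests;
    with the defaults it runs until the process is stopped.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for name in find_breaches(collect(), config):
            heal(name)
        cycles += 1
        sleep(config["check_interval"])
```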
- Navigate to the Web Dashboard Directory:
    cd web_dashboard
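- Activate Virtual Environment (if it is not already active; it was created in the project root, one level up):
    ..\venv\Scripts\activate        # On Windows
    source ../venv/bin/activate     # On macOS/Linux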
- Run the Flask Application:
    python app.py
- Access the Dashboard: Open a web browser and visit http://localhost:5000 to interact with the dashboard.
Unit tests are essential for validating the functionality of the system. To run tests:
- Run Unit Tests:
    python -m unittest discover tests
- For Advanced Testing (pytest):
    pip install pytest
    pytest
Ensure that all new features and bug fixes are accompanied by appropriate tests.
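For orientation, a test module in the `tests` directory might look roughly like the sketch below. `exceeds_threshold` is a hypothetical stand-in defined inline so the example is self-contained; a real test would import the function under test from the corresponding project module.

```python
import unittest


def exceeds_threshold(value, threshold):
    """Hypothetical stand-in for a helper that would live in health_checks.py."""
    return value > threshold


class ExceedsThresholdTests(unittest.TestCase):
    def test_value_above_threshold_is_flagged(self):
        self.assertTrue(exceeds_threshold(95.0, 90))

    def test_value_at_threshold_is_not_flagged(self):
        # The boundary case is worth pinning down explicitly.
        self.assertFalse(exceeds_threshold(90.0, 90))

    def test_value_below_threshold_is_not_flagged(self):
        self.assertFalse(exceeds_threshold(12.5, 90))
```

Saved as, say, `tests/test_thresholds.py` (an assumed file name), the module is picked up by `python -m unittest discover tests` and also by `pytest`.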
- Implement the Check: Add the new health check function in health_checks.py.
- Integrate with Adaptive Health Checks: Update adaptive_health_checks.py to incorporate the new check.
- Add Tests: Write unit tests for the new health check in the tests directory.
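To make the steps above concrete, here is one possible shape for a health check function, using disk usage purely as an illustration. `CheckResult`, `check_disk_usage`, and the injectable `usage_fn` parameter are assumptions for this sketch; the functions in `health_checks.py` may return a different structure.

```python
import shutil
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical result record shared by all checks."""
    name: str
    value: float
    threshold: float

    @property
    def healthy(self):
        return self.value <= self.threshold


def check_disk_usage(path="/", threshold=90.0, usage_fn=shutil.disk_usage):
    """Report the percentage of disk space used at `path`.

    `usage_fn` defaults to shutil.disk_usage and can be replaced in tests
    so the result does not depend on the machine running them.
    """
    usage = usage_fn(path)
    percent_used = usage.used / usage.total * 100
    return CheckResult("disk", round(percent_used, 1), threshold)
```

Passing the metric source as a parameter keeps the unit test for the new check deterministic, since a fake can return fixed numbers.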
- Update Alerts Module: Add new alerting functionality to alerts.py.
- Modify Health Checks: Adjust adaptive_health_checks.py to trigger new alerts as necessary.
- Test Alerts: Ensure the new alerts are tested and functioning correctly.
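As a sketch of what a new email alert could look like, the example below separates building the message from sending it, so the formatting can be tested without a mail server. The function names, the sender address, and the SMTP host and port are all assumptions; `alerts.py` may be organized differently.

```python
import smtplib
from email.message import EmailMessage


def build_alert(metric, value, threshold, recipient, sender="self-healing@localhost"):
    """Create the alert email; the sender address is a placeholder."""
    message = EmailMessage()
    message["Subject"] = f"[ALERT] {metric} at {value} (threshold {threshold})"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        f"Metric '{metric}' reported {value}, which exceeds the configured "
        f"threshold of {threshold}."
    )
    return message


def send_alert(message, host="localhost", port=25):
    """Deliver the message through an SMTP server (assumed to be reachable)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)
```

The recipient would typically come from the `alert_email` setting in config.json.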
Enhance logging capabilities by updating logging_setup.py:
- Configure Logging Levels: Define levels such as INFO, DEBUG, ERROR.
- Set Logging Formats: Specify formats for logs, such as JSON or plain text.
- Log Destinations: Set up log destinations, including files and remote logging services.
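The three points above (level, format, destination) map directly onto Python's standard `logging` module. The sketch below is one way to wire them together; `setup_logging`, `JsonFormatter`, and the logger name `self_healing` are hypothetical and not taken from the project's `logging_setup.py`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def setup_logging(level="INFO", log_file=None, json_format=False):
    """Configure and return the application logger.

    level: name of a logging level, e.g. the log_level value from config.json.
    log_file: write to this file if given, otherwise to stderr.
    """
    logger = logging.getLogger("self_healing")
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()  # avoid duplicate output if called twice
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    if json_format:
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    logger.addHandler(handler)
    return logger
```

Remote destinations can be added in the same way with the handlers in `logging.handlers`, such as `SysLogHandler` or `HTTPHandler`.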
- Refine ML Models: Improve models in anomaly_detection.py for better accuracy.
- Update Training Data: Incorporate new data to improve model performance.
- Validate Models: Test updated models to ensure they effectively identify anomalies.
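One way to approach the validation step is to score a detector against samples with known labels. In the sketch below a simple z-score rule stands in for the real model; it is not the algorithm used in `anomaly_detection.py`, and both function names are hypothetical. Precision and recall are computed the standard way from true positives, false positives, and false negatives.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Stand-in detector: flag values more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)  # no variation, nothing to flag
    return [abs(value - mu) / sigma > threshold for value in values]


def precision_recall(predicted, actual):
    """Compare predicted anomaly flags with ground-truth labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Tracking both numbers before and after a model change shows whether an update reduced false alarms (precision) at the cost of missed incidents (recall), or the reverse.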
We welcome contributions from the community! To contribute:
- Fork the Repository: Create a copy of the repository under your own GitHub account.
- Create a Feature Branch: Use descriptive names for your branches (git checkout -b feature/your-feature).
- Make Your Changes: Implement new features, bug fixes, or improvements.
- Commit Your Changes: Commit with clear and detailed messages (git commit -am 'Add feature X').
- Push Your Branch: Push your changes to your forked repository (git push origin feature/your-feature).
- Open a Pull Request: Submit a Pull Request to the main repository, describing your changes and their impact.
Please ensure your code adheres to the project's style guide and passes all tests before submitting a Pull Request.
- Enhanced Anomaly Detection: Develop advanced ML models for better anomaly detection.
- Improved UI/UX: Revamp the web dashboard for a more intuitive user experience.
- Extended Metrics: Add support for additional metrics and services.
- Scalability Enhancements: Optimize the system for better performance and scalability.
Service Not Restarting:
- Ensure that the restart_service function in service_manager.py is correctly implemented and has the necessary permissions.
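When debugging a failed restart, it helps if the function reports why the command failed instead of failing silently. The sketch below assumes a Linux host managed by systemd and uses `systemctl restart`; the signature and the injectable `runner` parameter are hypothetical, so adapt it to however `restart_service` is actually written.

```python
import subprocess


def restart_service(name, runner=subprocess.run):
    """Restart a systemd unit and return True on success.

    Assumes `systemctl` is available; `runner` can be swapped out in tests.
    """
    command = ["systemctl", "restart", name]
    try:
        result = runner(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers a missing systemctl binary, among other things.
        print(f"could not run {' '.join(command)}: {exc}")
        return False
    if result.returncode != 0:
        # Insufficient privileges typically surface here, with details on stderr.
        print(f"restart of {name} failed: {result.stderr.strip()}")
        return False
    return True
```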
High CPU Usage Alerts:
- Verify the CPU threshold settings in config.json or check the CPU monitoring logic in adaptive_health_checks.py.
Distributed Monitoring Issues:
- Confirm that server URLs in config.json are correct and that the monitored services are operational.
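A quick way to confirm that a URL from `server_urls` is reachable is to request it directly and look at the status code and response time. The helper below is a standalone sketch using only the standard library; `check_url` and its return format are assumptions and are not part of `distributed_monitoring.py`.

```python
import time
import urllib.error
import urllib.request


def check_url(url, timeout=2.0, opener=urllib.request.urlopen):
    """Request `url` and report whether it answered with a 2xx status.

    `opener` defaults to urllib.request.urlopen and can be replaced in tests.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, DNS failures, timeouts, and HTTP errors.
        return {"url": url, "ok": False, "error": str(exc)}
    elapsed = time.monotonic() - start
    return {
        "url": url,
        "ok": 200 <= status < 300,
        "status": status,
        "response_time": round(elapsed, 3),
    }
```

Calling it for each entry in `server_urls` and printing the results separates configuration problems (wrong URL) from service problems (endpoint down or slow).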
Prometheus Metrics:
- Check Prometheus configuration and ensure it is properly scraping metrics from the endpoint.
This project is licensed under the MIT License. See the LICENSE file for more details.
Thank you for your interest in the Self-Healing Server project! We appreciate your feedback and contributions. If you have any questions or need assistance, please open an issue or contact us via the repository.