GitHub - tanm-sys/self-healing-server

Welcome to the Self-Healing Server project! This server management system monitors, diagnoses, and manages server health autonomously, with the goal of keeping services available without manual intervention. It combines adaptive health checks, dynamic threshold adjustment, and machine learning-based anomaly detection to catch and correct problems before they affect users.

🚀 Features

🔍 Health Checks:
- Monitors vital server metrics including CPU usage, memory consumption, disk space, and response time.
- Provides a detailed view of server health and performance.
⚙️ Adaptive Thresholds:
- Dynamically adjusts health check thresholds based on historical data and trends.
- Adapts to changes in server workload and usage patterns to reduce false positives.
🔄 Self-Healing Actions:
- Automatically performs corrective actions, such as restarting services or reallocating resources, when thresholds are exceeded.
- Ensures minimal downtime and consistent performance.
🌐 Distributed Monitoring:
- Enables monitoring of multiple servers or services from a central location.
- Aggregates data for a comprehensive view of the entire infrastructure.
🛠️ Error Handling:
- Implements robust error handling mechanisms to capture, log, and diagnose issues effectively.
- Provides detailed error reports and stack traces for troubleshooting.
📧 Alerts:
- Configurable alerting system for notifying administrators of critical issues, such as high resource usage or failed self-healing actions.
- Supports email notifications and integration with other messaging services.
📊 Prometheus Integration:
- Exposes metrics in a format compatible with Prometheus for advanced monitoring and alerting.
- Facilitates the creation of custom dashboards and alerts in Prometheus.
🤖 Machine Learning:
- Utilizes machine learning models to detect anomalies and predict potential issues before they impact server performance.
- Continuously improves anomaly detection accuracy based on historical data.
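
To make the Adaptive Thresholds feature above more concrete, here is a minimal sketch of one common approach: a rolling threshold set at the mean plus `k` standard deviations of recent samples. The class name `AdaptiveThreshold` and its parameters are assumptions for illustration; the logic in dynamic_thresholds.py may work differently.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Rolling threshold: mean + k * standard deviation of recent samples.

    Hypothetical illustration, not the project's actual implementation.
    """

    def __init__(self, base: float, window: int = 60, k: float = 3.0,
                 min_samples: int = 10):
        self.base = base                # static fallback, e.g. cpu_threshold
        self.k = k
        self.min_samples = min_samples
        # A bounded deque discards the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> None:
        """Record a new metric sample."""
        self.samples.append(value)

    def current(self) -> float:
        """Return the threshold implied by the recorded history."""
        # Use the static threshold until enough history exists.
        if len(self.samples) < self.min_samples:
            return self.base
        return mean(self.samples) + self.k * pstdev(self.samples)

    def exceeded(self, value: float) -> bool:
        """True when the value is strictly above the current threshold."""
        return value > self.current()
```

One design choice worth considering: capping `current()` at the static value from config.json would prevent a slowly degrading server from raising its own threshold indefinitely.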

🏗️ Project Structure

The project is organized into the following modules:

```
.
├── config.json                # Configuration file for monitoring settings
├── main.py                    # Main entry point to start the server
├── health_checks.py           # Functions to perform various health checks
├── anomaly_detection.py       # Machine learning models and functions for anomaly detection
├── alerts.py                  # Functions to manage and send alerts
├── service_manager.py         # Manages service restarts and recovery actions
├── logging_setup.py           # Configuration for logging and error reporting
├── prometheus_metrics.py      # Setup and management of Prometheus metrics
├── adaptive_health_checks.py  # Implements adaptive health check adjustments
├── error_handling.py          # Utilities for comprehensive error handling
├── dynamic_thresholds.py      # Logic for dynamically adjusting thresholds
├── distributed_monitoring.py  # Functions for monitoring multiple servers
├── historical_data.py
├── requirements.txt           # Python dependencies installed during setup
├── tests                      # Unit tests
└── web_dashboard
    ├── app.py                 # Flask application for the web dashboard
    ├── static                 # Static files (CSS, JS, images) for the dashboard
    └── templates              # HTML templates for the dashboard
        └── index.html         # Main page of the web dashboard
```

🛠️ Installation

Prerequisites

Python 3.7 or higher.
Virtual environment (recommended): isolates the project's dependencies from the system Python installation.

Steps

Clone the Repository:

```
git clone https://github.com/tanm-sys/self-healing-server.git
cd self-healing-server
```

Create and Activate Virtual Environment:

```
python -m venv venv
venv\Scripts\activate     # On Windows
source venv/bin/activate  # On macOS/Linux
```

Install Dependencies:
```
pip install -r requirements.txt
```
Install Additional Development Tools (optional but recommended):
```
pip install pytest pylint
```

⚙️ Configuration

Create a config.json file in the root directory with the following structure:

```
{
  "server_urls": ["http://localhost:8000/health"],
  "cpu_threshold": 80,
  "memory_threshold": 80,
  "disk_threshold": 90,
  "response_time_threshold": 2.0,
  "check_interval": 60,
  "alert_email": "[email protected]",
  "log_level": "INFO",
  "prometheus_port": 9090
}
```

server_urls: List of URLs for distributed health checks.
cpu_threshold: CPU usage percentage that triggers self-healing actions.
memory_threshold: Memory usage percentage threshold for self-healing.
disk_threshold: Disk usage percentage threshold for self-healing.
response_time_threshold: Maximum acceptable response time in seconds.
check_interval: Interval (in seconds) between health checks.
alert_email: Email address for receiving alerts. The value shown in the example is a redacted placeholder; replace it with a real address.
log_level: Logging level (e.g., INFO, DEBUG).
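prometheus_port: Port on which the metrics endpoint is exposed for Prometheus to scrape. Port 9090 is also the default port of the Prometheus server itself, so choose a different value if both run on the same host.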
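
As a rough illustration of how these settings might be read and sanity-checked at startup, here is a minimal sketch using only the standard library. The function name `load_config`, the `DEFAULTS` table, and the validation rules are assumptions; main.py may load the file differently.

```python
import json
from pathlib import Path

# Defaults mirror the example values above; alert_email has no safe default.
DEFAULTS = {
    "server_urls": [],
    "cpu_threshold": 80,
    "memory_threshold": 80,
    "disk_threshold": 90,
    "response_time_threshold": 2.0,
    "check_interval": 60,
    "alert_email": None,
    "log_level": "INFO",
    "prometheus_port": 9090,
}


def load_config(path: str = "config.json") -> dict:
    """Load config.json, fill in defaults, and reject obviously bad values."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))

    # Catch misspelled keys early instead of silently ignoring them.
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")

    config = {**DEFAULTS, **raw}

    # Percentage thresholds must fall in (0, 100].
    for key in ("cpu_threshold", "memory_threshold", "disk_threshold"):
        if not 0 < config[key] <= 100:
            raise ValueError(f"{key} must be a percentage in (0, 100]")

    # Durations must be positive.
    for key in ("response_time_threshold", "check_interval"):
        if config[key] <= 0:
            raise ValueError(f"{key} must be positive")

    return config
```

Failing fast on an invalid file is usually preferable to discovering a bad threshold only when a self-healing action misfires.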

🏃 Running the Server

Set Up Configuration: Ensure that config.json is correctly configured according to your environment.
Start the Server:
```
python main.py
```

The server will initialize and start performing health checks based on the configuration. It will automatically take corrective actions if necessary.
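
The check-and-heal cycle described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the code in main.py: the names `run_checks_once` and `monitor`, and the idea of passing checks and the healing action in as callables, are hypothetical.

```python
import time
from typing import Callable, Dict, Optional


def run_checks_once(
    checks: Dict[str, Callable[[], float]],
    thresholds: Dict[str, float],
    heal: Callable[[str, float], None],
) -> Dict[str, float]:
    """Run every check once and call `heal` for each reading over its threshold."""
    readings = {}
    for name, check in checks.items():
        value = check()
        readings[name] = value
        if value > thresholds[name]:
            heal(name, value)
    return readings


def monitor(
    checks: Dict[str, Callable[[], float]],
    thresholds: Dict[str, float],
    heal: Callable[[str, float], None],
    interval: float,
    iterations: Optional[int] = None,
) -> None:
    """Repeat the checks every `interval` seconds.

    `iterations=None` loops forever; a number is handy for tests.
    """
    count = 0
    while iterations is None or count < iterations:
        run_checks_once(checks, thresholds, heal)
        count += 1
        time.sleep(interval)
```

Here `interval` would correspond to `check_interval` in config.json, and `heal` to whatever recovery action service_manager.py provides.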

🌐 Running the Web Dashboard

Navigate to the Web Dashboard Directory:
```
cd web_dashboard
```
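Activate Virtual Environment (skip this step if it is already active; the environment was created in the repository root, one level up):
```
..\venv\Scripts\activate  # On Windows
```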
Run the Flask Application:
```
python app.py
```
Access the Dashboard: Open a web browser and visit http://localhost:5000 to interact with the dashboard.

🧪 Unit Tests

Unit tests are essential for validating the functionality of the system. To run tests:

Run Unit Tests:
```
python -m unittest discover tests
```
Alternatively, install and run pytest, which also collects unittest-style test cases:
```
pip install pytest
pytest
```
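
For orientation, a test file that both commands above would pick up might look like the sketch below. The helper `is_over_threshold` and the file name `tests/test_thresholds.py` are hypothetical stand-ins; real tests would import functions from modules such as health_checks.py instead.

```python
import unittest


def is_over_threshold(value: float, threshold: float) -> bool:
    """Hypothetical helper standing in for a real function under test."""
    return value > threshold


class IsOverThresholdTest(unittest.TestCase):
    def test_value_above_threshold(self):
        self.assertTrue(is_over_threshold(95.0, 80.0))

    def test_value_equal_to_threshold_is_not_over(self):
        # The comparison is strict, so hitting the threshold exactly is fine.
        self.assertFalse(is_over_threshold(80.0, 80.0))
```

Both runners rely on the `test_` prefix for file and method names during discovery, so a file named `thresholds.py` would be skipped.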

Ensure that all new features and bug fixes are accompanied by appropriate tests.

🛠️ Development

Adding New Health Checks

Implement the Check: Add the new health check function in health_checks.py.
Integrate with Adaptive Health Checks: Update adaptive_health_checks.py to incorporate the new check.
Add Tests: Write unit tests for the new health check in the tests directory.
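
As an example of the first step, a new check could be as small as the sketch below, which reports disk usage with the standard library. The function name and return convention (a percentage as a float) are assumptions; match whatever convention the existing functions in health_checks.py follow.

```python
import shutil


def check_disk_usage(path: str = "/") -> float:
    """Return the percentage of space used on the filesystem containing `path`.

    Hypothetical example of a health check; not taken from health_checks.py.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100.0
```

The result can then be compared against `disk_threshold` from config.json. Note that this figure may differ slightly from tools such as `df`, which can account for reserved blocks differently.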

Adding New Alerts

Update Alerts Module: Add new alerting functionality to alerts.py.
Modify Health Checks: Adjust adaptive_health_checks.py to trigger new alerts as necessary.
Test Alerts: Ensure the new alerts are tested and functioning correctly.
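
A minimal sketch of an email alert built with the standard library follows. The function names, subject format, and SMTP parameters are assumptions for illustration; alerts.py may structure its messages and delivery differently.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender: str, recipient: str, metric: str,
                      value: float, threshold: float) -> EmailMessage:
    """Compose an alert message without sending it (easy to unit test)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = (
        f"[self-healing-server] {metric} at {value:.1f} "
        f"(threshold {threshold:.1f})"
    )
    msg.set_content(
        f"The {metric} check reported {value:.1f}, "
        f"which exceeds the configured threshold of {threshold:.1f}."
    )
    return msg


def send_alert(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Deliver the message through an SMTP server (host and port are assumed)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Separating composition from delivery means the "Test Alerts" step can verify message content without needing a mail server.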

Updating Logging

Enhance logging capabilities by updating logging_setup.py:

Configure Logging Levels: Define levels such as INFO, DEBUG, ERROR.
Set Logging Formats: Specify formats for logs, such as JSON or plain text.
Log Destinations: Set up log destinations, including files and remote logging services.
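
The three steps above might come together as in the following sketch, which uses only the standard `logging` module. The logger name, the `JsonFormatter` class, and the `setup_logging` signature are assumptions; logging_setup.py may be organized differently.

```python
import json
import logging
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include the stack trace when the record carries exception info.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def setup_logging(level: str = "INFO", json_format: bool = False,
                  logfile: Optional[str] = None) -> logging.Logger:
    """Configure and return the project logger (name is hypothetical)."""
    logger = logging.getLogger("self_healing_server")
    logger.setLevel(level.upper())   # accepts names such as "INFO" or "DEBUG"
    logger.handlers.clear()          # avoid duplicate output on repeated setup

    # Destination: a file when given, otherwise the console (stderr).
    handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler()
    plain = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handler.setFormatter(JsonFormatter() if json_format else plain)
    logger.addHandler(handler)
    return logger
```

The `level` argument would typically be fed from `log_level` in config.json.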

Enhancing Anomaly Detection

Refine ML Models: Improve models in anomaly_detection.py for better accuracy.
Update Training Data: Incorporate new data to improve model performance.
Validate Models: Test updated models to ensure they effectively identify anomalies.
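
When validating a model, it can help to compare it against a simple statistical baseline. The sketch below flags values by z-score; it is a plain statistical stand-in rather than a machine learning model, and the function name and default threshold are assumptions, not the contents of anomaly_detection.py.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], z_threshold: float = 3.0) -> List[int]:
    """Return indices of values more than `z_threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []                # stdev needs at least two samples
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []                # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]
```

A caveat of this baseline: a single outlier among n samples can reach a z-score of at most (n - 1) / sqrt(n), so very short windows will never cross a threshold of 3. A model that cannot beat this baseline on held-out data is probably not worth deploying.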

🤝 Contributing

We welcome contributions from the community! To contribute:

Fork the Repository: Create a copy of the repository under your own GitHub account.
Create a Feature Branch: Use descriptive names for your branches (git checkout -b feature/your-feature).
Make Your Changes: Implement new features, bug fixes, or improvements.
Commit Your Changes: Commit with clear and detailed messages (git commit -am 'Add feature X'). The -a flag stages only files Git already tracks, so run git add on any new files first.
Push Your Branch: Push your changes to your forked repository (git push origin feature/your-feature).
Open a Pull Request: Submit a Pull Request to the main repository, describing your changes and their impact.

Please ensure your code adheres to the project's style guide and passes all tests before submitting a Pull Request.

🚀 Future Improvements

Enhanced Anomaly Detection: Develop advanced ML models for better anomaly detection.
Improved UI/UX: Revamp the web dashboard for a more intuitive user experience.
Extended Metrics: Add support for additional metrics and services.
Scalability Enhancements: Optimize the system for better performance and scalability.

🛠️ Troubleshooting

Service Not Restarting:

Ensure that the restart_service function in service_manager.py is correctly implemented and has the necessary permissions.
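
For comparison while debugging, here is a minimal sketch of what a restart function that surfaces permission problems could look like. The signature, the default `systemctl restart` command, and the returned tuple are assumptions; the actual restart_service in service_manager.py may differ.

```python
import subprocess
from typing import Sequence, Tuple


def restart_service(
    name: str,
    command: Sequence[str] = ("systemctl", "restart"),
    timeout: float = 30.0,
) -> Tuple[bool, str]:
    """Run `<command> <name>` and return (succeeded, detail).

    Hypothetical sketch; assumes a systemd host by default.
    """
    try:
        result = subprocess.run(
            [*command, name],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except OSError as exc:
        # Covers a missing executable (FileNotFoundError) and PermissionError.
        return False, f"could not run command: {exc}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"

    if result.returncode != 0:
        # stderr usually explains the failure, e.g. insufficient privileges.
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, "restarted"
```

Returning the error detail instead of only a boolean makes it easier to tell a permissions problem from a misspelled service name in the logs.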

High CPU Usage Alerts:

Verify the CPU threshold settings in config.json or check the CPU monitoring logic in adaptive_health_checks.py.

Distributed Monitoring Issues:

Confirm that server URLs in config.json are correct and that the monitored services are operational.
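
To check the URLs independently of the project code, a small probe like the sketch below can help. The function names and the report layout are assumptions for illustration and are not taken from distributed_monitoring.py.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_servers(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, dict]:
    """Probe each URL and report status, error text, and elapsed time."""
    report = {}
    for url in urls:
        started = time.monotonic()
        status, error = None, None
        try:
            status = fetch(url)
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status such as 500.
            status, error = exc.code, str(exc)
        except OSError as exc:
            # Connection refused, DNS failure, timeout (URLError is an OSError).
            error = str(exc)
        report[url] = {
            "ok": status == 200,
            "status": status,
            "error": error,
            "response_time": time.monotonic() - started,
        }
    return report
```

Calling `check_servers(config["server_urls"])` and inspecting the `error` field distinguishes an unreachable host from a service that is up but unhealthy.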

Prometheus Metrics:

Check Prometheus configuration and ensure it is properly scraping metrics from the endpoint.
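
When inspecting the endpoint by hand, it helps to know what a scrapeable response looks like. The sketch below renders gauges in the Prometheus text exposition format; the function and the metric name are hypothetical, and a client library would normally produce this output rather than hand-written code.

```python
from typing import Dict, Tuple


def render_gauges(gauges: Dict[str, Tuple[str, float]]) -> str:
    """Render {metric_name: (help_text, value)} as Prometheus exposition text."""
    lines = []
    for name, (help_text, value) in gauges.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


example = render_gauges({
    "server_cpu_usage_percent": ("CPU usage in percent.", 42),
})
```

If the response from the configured `prometheus_port` does not contain `# TYPE` lines and samples of this shape, the problem lies with the exporter rather than with the Prometheus scrape configuration.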

📝 License

This project is licensed under the MIT License. See the LICENSE file for more details.

Thank you for your interest in the Self-Healing Server project! We appreciate your feedback and contributions. If you have any questions or need assistance, please open an issue or contact us via the repository.
