This repository contains a comprehensive collection of Python programming concepts, techniques, and libraries applied to a range of tasks, including data analysis and machine learning, alongside several projects on machine learning and statistical analysis. Each folder contains detailed information and code for a specific task (all developed in Jupyter Notebook), together with a .md file explaining every aspect of that folder.
All concepts and techniques covered can be found in the "Introduction, Python" folder, including definitions and descriptions, when to use each concept, typical output, and examples. Each concept is explored in more detail, with images, in the projects and the other folders.
- **Python Basics:**
  - Variables, data types (integers, floats, strings)
  - Input/output (printing, `input()` function)
  - Arithmetic operations
  - String manipulation
  - Conditional statements
  - Comparison operators
  - Basic mathematical operations
- **Arithmetic Operations:**
  - Addition, subtraction, multiplication, division
- **String Manipulation:**
  - String methods (e.g., the `lower()` function)
- **Conditional Statements:**
  - `if` statements
- **User Input:**
  - `input()` function for user interaction
- **Data Types:**
  - `float`, `str`
- **Comparison Operators:**
  - `>=` (greater than or equal to)
- **Basic Mathematical Operations:**
  - Performing calculations and printing results (a short example combining these basics follows).
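A minimal sketch combining the basics above (`input()`, type conversion, string methods, `>=`, and `if` statements); the prompts and the threshold are purely illustrative:

```python
# Combine input, type conversion, string methods, and conditionals
name = input("Enter your name: ")
age = float(input("Enter your age: "))

if age >= 18:
    print(name.lower(), "is an adult")
else:
    print(name.lower(), "is a minor")
```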
- **Data Loading and Handling:**
  - Reading CSV files using pandas (`pd.read_csv()`).
  - Checking columns and dataset head (`df.columns`, `df.head()`).
  - Handling missing data and duplicates.
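A minimal sketch of these loading and cleaning steps, assuming a hypothetical `data.csv`:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # "data.csv" is a placeholder file name

print(df.columns)   # column names
print(df.head())    # first five rows

df = df.dropna()            # drop rows with missing values
df = df.drop_duplicates()   # drop exact duplicate rows
```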
- **Data Exploration:**
  - Exploring dataset properties (columns, shape, etc.).
  - Descriptive statistics (`describe()`, `info()`).
  - Correlation analysis (`corr()`).
  - Skewness and kurtosis (`skew()`, `kurt()`).
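Continuing with the hypothetical `df` from the previous sketch, the exploration calls listed above look like this:

```python
print(df.shape)        # (rows, columns)
print(df.describe())   # descriptive statistics for numeric columns
df.info()              # dtypes and non-null counts

numeric = df.select_dtypes("number")
print(numeric.corr())  # pairwise correlations
print(numeric.skew())  # skewness per column
print(numeric.kurt())  # kurtosis per column
```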
- **Data Visualization:**
  - Matplotlib and Seaborn for plotting.
  - Various types of plots:
    - Scatter plots (`scatterplot()`).
    - Pair plots (`pairplot()`).
    - Histograms and density plots (`distplot()`, `histplot()`).
    - Box plots (`boxplot()`).
    - Violin plots (`violinplot()`).
    - Heatmaps (`heatmap()`).
    - Confusion matrix plots (`ConfusionMatrixDisplay`).
    - ROC curves (`roc_curve()`, `auc()`).
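A short plotting sketch using Seaborn's built-in iris dataset (chosen here only for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# Scatter plot of two features, colored by species
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")
plt.show()

# Correlation heatmap of the numeric columns
sns.heatmap(iris.drop(columns="species").corr(), annot=True, cmap="coolwarm")
plt.show()
```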
- **Statistical Analysis:**
  - Hypothesis testing (t-test, ANOVA).
  - Probability distributions.
  - Confidence intervals.
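A hypothesis-testing sketch with `scipy.stats`, again using the iris dataset purely for illustration:

```python
import scipy.stats as stats
import seaborn as sns

iris = sns.load_dataset("iris")
setosa = iris.loc[iris["species"] == "setosa", "petal_length"]
versicolor = iris.loc[iris["species"] == "versicolor", "petal_length"]
virginica = iris.loc[iris["species"] == "virginica", "petal_length"]

# Two-sample t-test: do two species differ in mean petal length?
t_stat, p_val = stats.ttest_ind(setosa, versicolor)

# One-way ANOVA across all three species
f_stat, p_anova = stats.f_oneway(setosa, versicolor, virginica)
print(t_stat, p_val, f_stat, p_anova)
```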
- **Machine Learning:**
  - **Classification:**
    - Logistic Regression (`LogisticRegression`).
    - Support Vector Machines (`SVC`).
    - K-Nearest Neighbors (`KNeighborsClassifier`).
    - Decision Trees (`DecisionTreeClassifier`).
    - Random Forests (`RandomForestClassifier`).
    - Multi-class classification.
    - Model evaluation metrics (accuracy, precision, recall, F1-score).
    - Confusion matrix (`confusion_matrix()`).
    - ROC curves (`roc_curve()`, `auc()`).
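A minimal classification sketch on scikit-learn's bundled iris data; any of the classifiers above could be swapped in for `LogisticRegression`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1-score
```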
- **Clustering:**
  - K-Means clustering (`KMeans`).
  - Hierarchical clustering.
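A K-Means sketch on scaled iris features; `n_clusters=3` is chosen here only because iris has three species:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(labels[:10], kmeans.inertia_)  # cluster assignments and within-cluster SSE
```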
- **Dimensionality Reduction:**
  - Principal Component Analysis (PCA).
  - t-Distributed Stochastic Neighbor Embedding (t-SNE).
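A PCA sketch projecting the four iris features onto two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance captured by each component
```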
- **Feature Engineering:**
  - Feature scaling (`StandardScaler`).
  - Feature selection techniques.
  - Handling categorical variables (encoding).
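A small sketch of scaling and one-hot encoding; the toy DataFrame is hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-type data for illustration
df = pd.DataFrame({"age": [25, 32, 47], "city": ["London", "Paris", "London"]})

# Scale the numeric feature
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# One-hot encode the categorical feature
df = pd.get_dummies(df, columns=["city"])
print(df)
```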
- **Model Evaluation:**
  - Cross-validation (`cross_val_score()`).
  - Train-test split (`train_test_split()`).
  - Hyperparameter tuning.
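Cross-validation and a simple grid search, sketched on the iris data (the parameter grid is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 5-fold cross-validation on the training set
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print(scores.mean())

# Hyperparameter tuning over the number of neighbors
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```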
- **Time Series Analysis:**
  - Handling time series data.
  - Time series decomposition (`seasonal_decompose()`).
  - Forecasting techniques.
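A decomposition sketch on a synthetic monthly series (a real project would use actual data, e.g. prices collected with yfinance):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a trend and yearly seasonality
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
```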
- **Natural Language Processing (NLP):**
  - Text preprocessing (tokenization, stemming, lemmatization).
  - Text vectorization (TF-IDF, CountVectorizer).
  - Sentiment analysis.
  - Topic modeling (LDA, NMF).
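A TF-IDF vectorization sketch on a two-document toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The movie was wonderful and moving",
    "The movie was dull and far too long",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(tfidf.toarray().round(2))            # TF-IDF weights per document
```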
- **Web Scraping:**
  - Using libraries like `requests` and `BeautifulSoup` for web scraping.
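A minimal scraping sketch; `example.com` is a placeholder URL, and a real target's robots.txt and terms of use should always be checked first:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)        # page title
for link in soup.find_all("a"):
    print(link.get("href"))     # every hyperlink on the page
```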
- **Geospatial Analysis:**
  - Visualizing geospatial data using `folium`.
  - Handling geospatial data (shapefiles, GeoJSON).
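A folium sketch; the coordinates (central London) are illustrative:

```python
import folium

m = folium.Map(location=[51.5074, -0.1278], zoom_start=12)
folium.Marker([51.5074, -0.1278], popup="Example marker").add_to(m)
m.save("map.html")  # open the saved HTML file in a browser
```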
- **Deep Learning:**
  - Using libraries like TensorFlow and PyTorch.
  - Building neural networks (CNNs, RNNs).
  - Transfer learning.
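A tiny Keras sketch of a feed-forward classifier; the layer sizes and the four-feature input are arbitrary placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # e.g. four numeric features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # e.g. three classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10)  # once training data is prepared
```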
- **Reinforcement Learning:**
  - Implementing algorithms (Q-learning, SARSA).
  - OpenAI Gym.
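A random-policy rollout sketch, assuming the classic (pre-0.26) `gym` API implied by `import gym`; newer Gymnasium releases return slightly different tuples from `reset()` and `step()`:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()          # random baseline policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple API
    total_reward += reward

print(total_reward)
```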
- **Model Deployment:**
  - Using Flask for web application deployment.
  - Containerization with Docker.
  - Cloud deployment (AWS, GCP, Azure).
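A skeletal Flask prediction endpoint; the model-loading line is a hypothetical placeholder:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
# model = pickle.load(open("model.pkl", "rb"))  # hypothetical trained model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]  # e.g. [5.1, 3.5, 1.4, 0.2]
    # prediction = model.predict([features])[0]
    prediction = "placeholder"           # stand-in until a model is wired up
    return jsonify({"prediction": str(prediction)})

if __name__ == "__main__":
    app.run(debug=True)
```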
- **Data Handling and Analysis:**
  - Pandas (`import pandas as pd`).
  - NumPy (`import numpy as np`).
  - SciPy (`import scipy.stats as stats`).
  - yfinance (for collecting historical stock data).
- **Data Visualization:**
  - Matplotlib (`import matplotlib.pyplot as plt`).
  - Seaborn (`import seaborn as sns`).
- **Machine Learning:**
  - Scikit-learn, for machine learning models and tools, including preprocessing, model selection, and evaluation (`from sklearn.model_selection import train_test_split, cross_val_score`, `from sklearn.metrics import ...`).
  - Statsmodels (`import statsmodels.api as sm`).
  - TensorFlow (`import tensorflow as tf`), PyTorch (`import torch`).
- **Natural Language Processing (NLP):**
  - NLTK (`import nltk`).
  - SpaCy (`import spacy`).
- **Web Scraping:**
  - Requests (`import requests`).
  - BeautifulSoup (`from bs4 import BeautifulSoup`).
- **Geospatial Analysis:**
  - Folium (`import folium`).
- **Deep Learning:**
  - TensorFlow (`import tensorflow as tf`).
  - PyTorch (`import torch`).
- **Reinforcement Learning:**
  - Gym (`import gym`).
- **Cloud and Deployment:**
  - Flask (`from flask import Flask, ...`).
  - Docker (`Dockerfile`).
  - AWS, GCP, Azure.
- **Additional Libraries:**
  - `datetime` for handling date and time data.
  - `re` for regular expressions.
  - `itertools` for combinatorial iterators.
  - `pickle` for object serialization.
  - statsmodels (for exploring data, estimating statistical models, and performing statistical tests).
- Loading and handling datasets from CSV files.
- Exploratory Data Analysis (EDA).
- Plotting various types of graphs and charts to visualize data distribution and relationships.
- Performing statistical analysis like skewness, kurtosis, and hypothesis testing.
- Applying machine learning algorithms for classification, clustering, and regression tasks.
- Evaluating model performance using various metrics and techniques.
- Creating visualizations like confusion matrices, ROC curves, and heatmaps.
- Implementing feature engineering techniques such as feature scaling, selection, and transformation.
- Applying time series analysis techniques such as decomposition and forecasting.
- Conducting web scraping using Python libraries.
- Stock Market Analysis
- Customer segmentation for E-commerce
- Performing geospatial analysis and visualization.
- Handling natural language processing tasks like text preprocessing, vectorization, sentiment analysis, and topic modeling.
- Data preprocessing tasks like outlier handling, imputation, and normalization.
- Model evaluation and selection using cross-validation and hyperparameter tuning.
- Implementing deep learning models using TensorFlow and PyTorch.
- Reinforcement learning tasks with OpenAI Gym.
- Deploying models using Flask and Docker, and deploying to cloud platforms like AWS, GCP, and Azure.
- **Python Skills:**
- Understanding of Python syntax and structure.
- Ability to handle data using pandas and numpy.
- Familiarity with plotting libraries like Matplotlib and Seaborn.
- Practical experience in machine learning and data analysis tasks.
- Experience with web scraping, geospatial analysis, and natural language processing.
- Familiarity with deep learning frameworks like TensorFlow and PyTorch.
- Experience with reinforcement learning using Gym.
- Knowledge of cloud deployment and containerization using Docker.
- Familiarity with Flask for web application deployment.
- **Advanced Python Concepts Covered:**
- Loops (for, while)
- Lists, tuples, dictionaries
- Functions
- **Error Handling:**
- Try-except blocks for handling exceptions.
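A try-except sketch handling two common failure modes of user input:

```python
try:
    value = int(input("Enter a number: "))
    result = 10 / value
except ValueError:
    print("That was not a valid number.")
except ZeroDivisionError:
    print("Cannot divide by zero.")
else:
    print(f"10 / {value} = {result}")
```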
- **Modules and Packages:**
  - Importing and using built-in modules (`math`, `random`, etc.)
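For example:

```python
import math
import random

print(math.sqrt(16))         # 4.0
print(random.randint(1, 6))  # simulated die roll
```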
- **File Handling:**
- Reading from and writing to files.
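A minimal read/write sketch; `notes.txt` is a throwaway file name:

```python
with open("notes.txt", "w") as f:
    f.write("First line\n")

with open("notes.txt") as f:
    print(f.read())
```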
- **Object-Oriented Programming (OOP):**
- Classes and objects.
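A toy class sketch showing attributes and a method; all names are hypothetical:

```python
class Flower:
    """Illustrative class for a flower measurement record."""

    def __init__(self, species, petal_length):
        self.species = species
        self.petal_length = petal_length

    def describe(self):
        return f"{self.species}: petal length {self.petal_length} cm"

print(Flower("setosa", 1.4).describe())
```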
Many images are available in the images folder to refer back to the results achieved throughout the various analyses.
Classifying Iris flower species with Logistic Regression, SVM, Decision Trees, and Random Forest; learned model evaluation and hyperparameter tuning.
Understanding Iris dataset through statistical analysis and visualizations; learned data exploration and hypothesis testing.
Reducing dimensionality of Iris dataset using PCA and t-SNE; learned feature scaling, dimensionality reduction, and data visualization.
Comparing Iris species using statistical tests and visualizations; learned statistical testing, effect size calculation, and data visualization.
Grouping Iris flowers into clusters using K-means clustering; learned clustering, feature preprocessing, and evaluation metrics.
Description: Conducted customer segmentation using K-means clustering based on purchasing behavior in an e-commerce dataset.
Key Objectives: Identified distinct customer segments to tailor marketing strategies and improve business decision-making.
Key Steps:
- Data Exploration and Preprocessing: Ensured data quality by handling missing values, duplicates, and formatting dates. Engineered features like TotalAmount for customer spending quantification.
- Customer Segmentation using Clustering: Implemented K-means clustering to group customers with similar purchasing patterns. Determined optimal clusters using the elbow method and visualized segment distributions.
- Customer Profiling and Insights: Analyzed segment characteristics, behavior patterns, and visualized metrics to derive actionable marketing insights.
Overall Result:
- Successfully segmented customers into meaningful groups based on their purchasing behavior, enabling personalized marketing strategies and business optimizations.
- Demonstrated proficiency in data preprocessing, clustering techniques, and deriving actionable insights from customer data in an e-commerce context.
Description: Built a machine learning model to forecast Apple Inc. (AAPL) stock prices for the next 30 days using historical data.
Key Objectives: Predicted future stock prices to provide insights for investment decisions and understand potential price trends.
Key Steps:
- Data Collection and Preprocessing: Gathered historical stock data using yfinance, cleaned, and engineered features like moving averages and Relative Strength Index (RSI).
- Model Selection and Training: Trained models including Linear Regression, RandomForestRegressor, and SVR. Evaluated model performance based on metrics like RMSE and R-squared.
- Prediction and Evaluation: Forecasted AAPL stock prices for January 2023 and visualized actual vs. predicted results to assess model accuracy.
Overall Result:
- Successfully developed a predictive model that effectively forecasted AAPL stock prices, demonstrating the application of machine learning in financial forecasting.
- Gained practical experience in data preprocessing, feature engineering, model selection, and evaluation for stock price prediction.
- Machine learning model selection, evaluation, and hyperparameter tuning.
- Data exploration, visualization techniques, and hypothesis testing.
- Dimensionality reduction using PCA and t-SNE.
- Statistical analysis including ANOVA, t-tests, and effect size calculations.
- Unsupervised learning techniques and cluster analysis using K-means.
- Python programming skills with pandas, scikit-learn, and numpy.
- Time Series Analysis: Ability to analyze and model temporal data, including trend analysis, seasonality decomposition, and forecasting techniques such as ARIMA and Prophet.
- Feature Engineering: Proficiency in creating new features from existing data to enhance predictive model performance, including domain-specific feature selection and transformation techniques.
- Deployment and Productionization: Understanding of deploying machine learning models into production environments, including containerization, API development (e.g., Flask or FastAPI), and model monitoring for performance and drift detection.
Navigate to each project folder for detailed code, results, and further documentation.
- Throughout this project, I expanded my knowledge across various domains of Python programming and data science.
- I gained proficiency in data handling, exploratory data analysis, machine learning, natural language processing, deep learning, and reinforcement learning.
- I learned to use libraries such as pandas, NumPy, Matplotlib, Seaborn, scikit-learn, TensorFlow, and PyTorch effectively, among many others.
- Acquired skills in cross-validation techniques to assess model generalization and performance metrics selection to measure model efficacy.
- Additionally, I gained experience in web scraping, geospatial analysis, and model deployment using Flask and Docker.
- This repository helped me strengthen my skills in Python programming, data analysis, machine learning model development, and much more.
- The most enjoyable aspect of this project was the application of machine learning algorithms and deep learning models.
- I particularly liked experimenting with different classification and clustering techniques to solve real-world problems.
- Building and optimizing neural networks using TensorFlow and PyTorch was both challenging and rewarding.
- Working on predictive models to forecast trends and make informed decisions based on data insights was highly satisfying.
- Moreover, the process of deploying models using Flask and Docker, and integrating them with cloud platforms, was another aspect that I found fascinating.
In conclusion, this repository provided me with comprehensive hands-on experience in Python programming and data science. I gained valuable insights into various machine learning, statistical analysis, and deep learning techniques, as well as practical skills in data visualization, natural language processing, and geospatial analysis. This project not only reinforced my understanding of core concepts but also equipped me with the skills necessary to tackle complex data-driven challenges. Moving forward, I plan to further enhance this project by integrating advanced techniques and exploring new areas of interest in data science.
- Clone the repository:
  ```bash
  git clone https://github.com/FarahIbrar/Programming-in-Python.git
  cd Programming-in-Python
  ```
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Running notebooks: navigate to the notebooks directory:
  ```bash
  cd notebooks
  ```
- Start Jupyter Notebook:
  ```bash
  jupyter notebook
  ```
- Open and run the desired notebook.
Contributions are welcome! Please fork this repository, make your changes, and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
Please follow these steps to contribute to the project:
- Fork the repository and clone it to your local machine.
- Create a new branch: `git checkout -b feature/my-new-feature`.
- Make your changes and test them thoroughly.
- Commit your changes: `git commit -am 'Add some feature'`.
- Push to the branch: `git push origin feature/my-new-feature`.
- Submit a pull request.
- I will review your changes and work with you to integrate them into the project.
This project is licensed under the MIT License. You are free to use, modify, and distribute the code, provided that appropriate credit is given to the original author.