diff --git a/_quarto.yml b/_quarto.yml
index 91fb064..ba07ba0 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -15,10 +15,9 @@ book:
     - chapters/02_floodmapping.qmd
     - chapters/references.qmd
   navbar:
-    logo: https://www.tuwien.at/index.php?eID=dumpFile&t=f&f=180458&token=fe76daebeea536ee4be1650979e817f6fc0a2ea2
+    logo: assets/images/tuw-geo-logo.svg
   sidebar:
-    logo: https://www.tuwien.at/index.php?eID=dumpFile&t=f&f=180458&token=fe76daebeea536ee4be1650979e817f6fc0a2ea2
-
+    logo: assets/images/tuw-geo-logo.svg
 bibliography: chapters/references.bib
 format:
diff --git a/assets/images/tuw-geo-logo.svg b/assets/images/tuw-geo-logo.svg
new file mode 100644
index 0000000..3ca1956
--- /dev/null
+++ b/assets/images/tuw-geo-logo.svg
@@ -0,0 +1,129 @@
+[129 lines of SVG markup; only the logo's embedded text survives in this view: "Research Groups Photogrammetry & Remote Sensing, Department for Geodesy and Geoinformation, Vienna University of Technology" / "Forschungsgruppen Photogrammetrie & Fernerkundung, Department für Geodäsie und Geoinformation, Technische Universität Wien"]
diff --git a/chapters/01_classification.qmd b/chapters/01_classification.qmd
index 8989b2f..cf04630 100644
--- a/chapters/01_classification.qmd
+++ b/chapters/01_classification.qmd
@@ -48,7 +48,7 @@ import matplotlib.colors as colors
 Before we start, we need to load the data. We will use ``odc-stac`` to obtain data from Earth Search by Element 84. Here we define the area of interest and the time frame, as well as the EPSG code and the resolution.
-### Searching Catalog
+### Searching in the Catalog
 The module ``odc-stac`` provides access to free, open-source satellite data. To retrieve the data, we must define several parameters that specify the location and time period for the satellite data. Additionally, we must specify the data collection we wish to access, as multiple collections are available. In this example, we will use multispectral imagery from the Sentinel-2 satellite.
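The search parameters described in the hunk above (area of interest, time frame, EPSG code, resolution, collection) can be sketched as plain values. Everything below is an illustrative stand-in, not the chapter's actual configuration; the commented-out `client.search(...)` call only indicates where a STAC client such as `pystac-client` would consume these values.

```python
# Illustrative STAC search parameters; all values are placeholders.
bounds = (16.0, 47.9, 16.9, 48.4)     # (min_lon, min_lat, max_lon, max_lat), roughly around Vienna
time_range = "2023-06-01/2023-06-30"  # STAC datetime interval syntax
collection = "sentinel-2-l2a"         # Sentinel-2 L2A collection on Earth Search
epsg, dx = 32633, 20                  # target CRS (UTM zone 33N) and pixel size in metres

# A catalog search would receive these parameters, e.g. (not executed here):
# items = client.search(collections=[collection], bbox=bounds,
#                       datetime=time_range).item_collection()
print(bounds, time_range, collection)
```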
 ```{python}
@@ -102,6 +102,7 @@ Now we will load the data directly into an ``xarray`` dataset, which we can use
 Here's how we can load the data using odc-stac and xarray:
 ```{python}
+#| output: false
 # define a geobox for my region
 geobox = GeoBox.from_bbox(bounds, crs=f"epsg:{epsg}", resolution=dx)
@@ -177,9 +178,10 @@ plt.show()
 ```
 ## Classification
+In this chapter, we will classify the satellite data to identify forested areas within the scene. By using supervised machine learning techniques, we can train classifiers to distinguish between forested and non-forested regions based on the training data we provide. We will explore two different classifiers and compare their performance in accurately identifying forest areas.
 ### Regions of Interest
-Since this is a supervised classification, we need to have some training data. Therefore we need to define areas or regions, which we are certain represent the feature which we are classifiying. In this case we are looking for forested areas and areas that are definitly not forested. We will use these to train our classifiers.
+Since this is a supervised classification, we need to have some training data. Therefore, we need to define areas or regions that we are certain represent the feature we are classifying. In this case, we are interested in forested areas and regions that are definitely not forested. These regions will be used to train our classifiers.
 ```{python}
 # Define Polygons
 forest_areas = {
@@ -218,7 +220,7 @@ plt.show()
 ### Data Preparation
-Additionally to the Regions of Interest we will extract the bands that we want to use for the classification from the loaded Dataset. With that we will create a Training and Testing Dataset, which we will train the classifier on.
+In addition to the Regions of Interest, we will extract the specific bands from the loaded dataset that we intend to use for the classification, which are the `red, green, blue` and `near-infrared` bands, although other bands can also be utilized. Using these bands, we will create both a training and a testing dataset. The training dataset will be used to train the classifier, while the testing dataset will be employed to evaluate its performance.
 ```{python}
 # Classifying dataset (only necessary bands)
 bands = ['red', 'green', 'blue', 'nir']
@@ -264,7 +266,7 @@ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_
 ```
-Now that we have the training and testing data, we create an Image array of the actual scene which we want to classify.
+Now that we have prepared the training and testing data, we will create an image array of the actual scene that we intend to classify. This array will serve as the input for our classification algorithms, allowing us to apply the trained classifiers to the entire scene and identify the forested and non-forested areas accurately.
 ```{python}
 image_data = ds_class[bands].to_array(dim='band').transpose('latitude', 'longitude', 'band')
@@ -276,9 +278,9 @@ X_image_data = image_data.values.reshape(num_of_pixels, num_of_bands)
 ```
 ### Classifying with Naive Bayes
-Now that we have prepared all the needed data, we can start to classify the image.
+Now that we have prepared all the needed data, we can begin the actual classification process.
-We will start with a _Naive Bayes_ classificator. We train the classificator on our Training data and apply it on the actual image.
+We will start with a _Naive Bayes_ classifier. First, we will train the classifier using our training dataset. Once trained, we will apply the classifier to the actual image to identify the forested and non-forested areas.
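The training and evaluation flow described in the sections above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions: the band values and forest/non-forest labels below are random stand-ins for the chapter's Sentinel-2 pixels and region-of-interest polygons, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for labelled training pixels: four band values
# (red, green, blue, nir) per sample, label 1 = forest, 0 = non-forest.
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = rng.integers(0, 2, size=200)

# Half of the labelled pixels train each classifier, half evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

nb = GaussianNB().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# The confusion matrix counts correct hits on the main diagonal and
# False Positives/Negatives on the secondary diagonal.
cm_nb = confusion_matrix(y_test, nb.predict(X_test))
cm_rf = confusion_matrix(y_test, rf.predict(X_test))

# The scene itself is flattened from (height, width, bands) to
# (pixels, bands), predicted per pixel, and reshaped back to the grid.
scene = rng.random((10, 10, 4))
flat = scene.reshape(-1, 4)
nb_img = nb.predict(flat).reshape(10, 10)
print(cm_nb.shape, nb_img.shape)
```

The same flatten-predict-reshape pattern is what lets a classifier fitted on labelled pixels be applied to every pixel of the full scene.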
 ```{python}
 # Naive Bayes initialization and training
@@ -295,7 +297,7 @@ ds_class['NB-forest'] = xr.DataArray(nb_predict_img, dims=['latitude', 'longitude
 ```
-To see how well the classification has worked we plot the image that has been predicted by the classifier. Furthermore we can have a look at the `Classification Report` and the `Confusion Matrix`.
+To evaluate the effectiveness of the classification, we will plot the image predicted by the classifier. Additionally, we will examine the ``Classification Report`` and the ``Confusion Matrix`` to gain further insights into the classifier's performance.
 ```{python}
 # Plot Naive Bayes
@@ -319,7 +321,7 @@ display(con_mat_nb)
 ```
 ### Classifying with Random Forest
-To not only rely on one classificator lets have a look at another one. Here we use the _Random Forest_ Classificator. The Procedure useing it is the same as before.
+To ensure our results are robust, we will explore an additional classifier. In this section, we will use the Random Forest classifier. The procedure for using this classifier is the same as before: we will train the classifier using our training dataset and then apply it to the actual image to classify the scene.
 ```{python}
 # Random Forest initialization and training
@@ -350,11 +352,11 @@ con_mat_rf = pd.DataFrame(confusion_matrix(y_test, rf_predict),
 display(con_mat_rf)
 ```
-We can already see from the `classification reports` and the `confusion matrices` that the _random forest_ classifier has performed better. This is for example indicated by the lower values in the secondary diagonal, which means that False Positvies and Negatives are only minimal. It seems that _Naive Bayes_ is more sensitive to False Positives.
+We can already see from the `classification reports` and the `confusion matrices` that the Random Forest classifier has outperformed the Naive Bayes classifier.
This is particularly evident from the lower values in the secondary diagonal, indicating minimal False Positives and False Negatives. It appears that the Naive Bayes classifier is more sensitive to False Positives, resulting in a higher rate of incorrect classifications.
 ### Comparison of the Classifiers
-To have a more in depth look at the performance of the classificators, we can compare them. Lets see what areas both classificators agree upon, and which areas then don't agree upon.
+To gain a more in-depth understanding of the classifiers' performance, we will compare their results. Specifically, we will identify the areas where both classifiers agree and the areas where they disagree. This comparison will provide valuable insights into the strengths and weaknesses of each classifier, allowing us to better assess their effectiveness in identifying forested and non-forested regions.
 ```{python}
 #| code-fold: true
@@ -374,7 +376,7 @@ ax.set_axis_off()
 plt.show()
 ```
-The areas that both agree upon are the bigger forests, like the _Nationalpark Donauauen_ and the _Leithagebirge_ also the urban areas of vienna have both rightfully not been classified.
+The areas where both classifiers agree include the larger forested regions, such as the _Nationalpark Donau-Auen_ and the _Leithagebirge_. Additionally, both classifiers accurately identified the urban areas of Vienna and correctly excluded them from being classified as forested.
 ```{python}
 #| code-fold: true
@@ -392,7 +394,11 @@ for i in range(4):
 plt.tight_layout()
 ```
-When plotting the areas, where classification has happend, individually we can see that the _random forest_ classifiyer falsly predicted the river _danube_ as a forest. On the other hand has the _naive bayes_ classifyer identified a lot of cropland as forest. Finally we can have a look at how big the percentage of forested areas in the scene are. We can see here that around 18% are forest and about 66% are not forest.
The remaining areas are not so clear to define, as waterbodies and cropland are both in the remaining categories.
+When plotting the classified areas individually, we observe that the Random Forest classifier mistakenly identified the Danube River as a forested area. Conversely, the Naive Bayes classifier erroneously classified a significant amount of cropland as forest.
+
+Finally, by analyzing the proportion of forested areas within the scene, we find that approximately 18% of the area is classified as forest, while around 66% is classified as non-forest. The remaining areas, which include water bodies and cropland, fall into less clearly defined categories.
+
+The accompanying bar chart illustrates the distribution of these classifications, highlighting the percentage of forested areas, non-forested areas, and regions classified by only one of the two classifiers. This visual representation helps to quantify the areas of agreement and disagreement between the classifiers, providing a clearer picture of their performance.
 ```{python}
 #| code-fold: true
@@ -409,4 +415,6 @@ ax = class_counts_df.plot.bar(x='Class', y='Percentage', rot=0, color='darkgreen
 for p in ax.patches:
     ax.annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 9), textcoords='offset points')
-```
\ No newline at end of file
+```
+## Conclusion
+In this chapter, we utilized machine learning to classify satellite imagery into forested and non-forested areas, comparing Naive Bayes and Random Forest classifiers. The Random Forest classifier generally outperformed Naive Bayes, with fewer errors in classification, although it misclassified the Danube River as forested, while Naive Bayes incorrectly identified cropland as forest. The analysis, supported by the bar chart, revealed that about 18% of the scene was classified as forest, 66% as non-forest, and the remainder included ambiguous categories.
This comparison highlights the strengths and limitations of each classifier, underscoring the need for careful selection and evaluation of classification methods.
\ No newline at end of file
diff --git a/notebooks/01_classification.ipynb b/notebooks/01_classification.ipynb
index f54fefa..8bd7ddb 100644
--- a/notebooks/01_classification.ipynb
+++ b/notebooks/01_classification.ipynb
@@ -64,7 +64,7 @@
    "source": [
     "Before we start, we need to load the data. We will use ``odc-stac`` to obtain data from Earth Search by Element 84. Here we define the area of interest and the time frame, as well as the EPSG code and the resolution.\n",
     "\n",
-    "### Searching Catalog\n",
+    "### Searching in the Catalog\n",
     "The module ``odc-stac`` provides access to free, open-source satellite data. To retrieve the data, we must define several parameters that specify the location and time period for the satellite data. Additionally, we must specify the data collection we wish to access, as multiple collections are available. In this example, we will use multispectral imagery from the Sentinel-2 satellite."
    ]
   },
@@ -132,6 +132,7 @@
    "cell_type": "code",
    "metadata": {},
    "source": [
+    "#| output: false\n",
     "# define a geobox for my region\n",
     "geobox = GeoBox.from_bbox(bounds, crs=f\"epsg:{epsg}\", resolution=dx)\n",
     "\n",
@@ -233,9 +234,10 @@
    "metadata": {},
    "source": [
     "## Classification \n",
+    "In this chapter, we will classify the satellite data to identify forested areas within the scene. By using supervised machine learning techniques, we can train classifiers to distinguish between forested and non-forested regions based on the training data we provide. We will explore two different classifiers and compare their performance in accurately identifying forest areas.\n",
     "\n",
     "### Regions of Interest\n",
-    "Since this is a supervised classification, we need to have some training data.
Therefore we need to define areas or regions, which we are certain represent the feature which we are classifiying. In this case we are looking for forested areas and areas that are definitly not forested. We will use these to train our classifiers. "
+    "Since this is a supervised classification, we need to have some training data. Therefore, we need to define areas or regions that we are certain represent the feature we are classifying. In this case, we are interested in forested areas and regions that are definitely not forested. These regions will be used to train our classifiers."
    ]
   },
   {
@@ -284,7 +286,7 @@
    "metadata": {},
    "source": [
     "### Data Preparation\n",
-    "Additionally to the Regions of Interest we will extract the bands that we want to use for the classification from the loaded Dataset. With that we will create a Training and Testing Dataset, which we will train the classifier on."
+    "In addition to the Regions of Interest, we will extract the specific bands from the loaded dataset that we intend to use for the classification, which are the `red, green, blue` and `near-infrared` bands, although other bands can also be utilized. Using these bands, we will create both a training and a testing dataset. The training dataset will be used to train the classifier, while the testing dataset will be employed to evaluate its performance."
    ]
   },
   {
@@ -340,7 +342,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now that we have the training and testing data, we create an Image array of the actual scene which we want to classify."
+    "Now that we have prepared the training and testing data, we will create an image array of the actual scene that we intend to classify. This array will serve as the input for our classification algorithms, allowing us to apply the trained classifiers to the entire scene and identify the forested and non-forested areas accurately."
    ]
   },
   {
@@ -362,9 +364,9 @@
    "metadata": {},
    "source": [
     "### Classifying with Naive Bayes\n",
-    "Now that we have prepared all the needed data, we can start to classify the image.\n",
+    "Now that we have prepared all the needed data, we can begin the actual classification process.\n",
     "\n",
-    "We will start with a _Naive Bayes_ classificator. We train the classificator on our Training data and apply it on the actual image.\n"
+    "We will start with a _Naive Bayes_ classifier. First, we will train the classifier using our training dataset. Once trained, we will apply the classifier to the actual image to identify the forested and non-forested areas.\n"
    ]
   },
   {
@@ -390,7 +392,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To see how well the classification has worked we plot the image that has been predicted by the classifier. Furthermore we can have a look at the `Classification Report` and the `Confusion Matrix`. "
+    "To evaluate the effectiveness of the classification, we will plot the image predicted by the classifier. Additionally, we will examine the ``Classification Report`` and the ``Confusion Matrix`` to gain further insights into the classifier's performance."
    ]
   },
   {
@@ -424,7 +426,7 @@
    "metadata": {},
    "source": [
     "### Classifying with Random Forest\n",
-    "To not only rely on one classificator lets have a look at another one. Here we use the _Random Forest_ Classificator. The Procedure useing it is the same as before."
+    "To ensure our results are robust, we will explore an additional classifier. In this section, we will use the Random Forest classifier. The procedure for using this classifier is the same as before: we will train the classifier using our training dataset and then apply it to the actual image to classify the scene."
    ]
   },
   {
@@ -465,11 +467,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can already see from the `classification reports` and the `confusion matrices` that the _random forest_ classifier has performed better. This is for example indicated by the lower values in the secondary diagonal, which means that False Positvies and Negatives are only minimal. It seems that _Naive Bayes_ is more sensitive to False Positives.\n",
+    "We can already see from the `classification reports` and the `confusion matrices` that the Random Forest classifier has outperformed the Naive Bayes classifier. This is particularly evident from the lower values in the secondary diagonal, indicating minimal False Positives and False Negatives. It appears that the Naive Bayes classifier is more sensitive to False Positives, resulting in a higher rate of incorrect classifications.\n",
     "\n",
     "### Comparison of the Classifiers\n",
     "\n",
-    "To have a more in depth look at the performance of the classificators, we can compare them. Lets see what areas both classificators agree upon, and which areas then don't agree upon."
+    "To gain a more in-depth understanding of the classifiers' performance, we will compare their results. Specifically, we will identify the areas where both classifiers agree and the areas where they disagree. This comparison will provide valuable insights into the strengths and weaknesses of each classifier, allowing us to better assess their effectiveness in identifying forested and non-forested regions."
    ]
   },
   {
@@ -499,7 +501,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The areas that both agree upon are the bigger forests, like the _Nationalpark Donauauen_ and the _Leithagebirge_ also the urban areas of vienna have both rightfully not been classified."
+    "The areas where both classifiers agree include the larger forested regions, such as the _Nationalpark Donau-Auen_ and the _Leithagebirge_.
Additionally, both classifiers accurately identified the urban areas of Vienna and correctly excluded them from being classified as forested."
    ]
   },
   {
@@ -527,7 +529,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "When plotting the areas, where classification has happend, individually we can see that the _random forest_ classifiyer falsly predicted the river _danube_ as a forest. On the other hand has the _naive bayes_ classifyer identified a lot of cropland as forest. Finally we can have a look at how big the percentage of forested areas in the scene are. We can see here that around 18% are forest and about 66% are not forest. The remaining areas are not so clear to define, as waterbodies and cropland are both in the remaining categories."
+    "When plotting the classified areas individually, we observe that the Random Forest classifier mistakenly identified the Danube River as a forested area. Conversely, the Naive Bayes classifier erroneously classified a significant amount of cropland as forest.\n",
+    "\n",
+    "Finally, by analyzing the proportion of forested areas within the scene, we find that approximately 18% of the area is classified as forest, while around 66% is classified as non-forest. The remaining areas, which include water bodies and cropland, fall into less clearly defined categories.\n",
+    "\n",
+    "The accompanying bar chart illustrates the distribution of these classifications, highlighting the percentage of forested areas, non-forested areas, and regions classified by only one of the two classifiers. This visual representation helps to quantify the areas of agreement and disagreement between the classifiers, providing a clearer picture of their performance."
    ]
   },
   {
@@ -551,6 +557,14 @@
    ],
    "execution_count": null,
    "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "In this chapter, we utilized machine learning to classify satellite imagery into forested and non-forested areas, comparing Naive Bayes and Random Forest classifiers. The Random Forest classifier generally outperformed Naive Bayes, with fewer errors in classification, although it misclassified the Danube River as forested, while Naive Bayes incorrectly identified cropland as forest. The analysis, supported by the bar chart, revealed that about 18% of the scene was classified as forest, 66% as non-forest, and the remainder included ambiguous categories. This comparison highlights the strengths and limitations of each classifier, underscoring the need for careful selection and evaluation of classification methods."
+   ]
+  }
 ],
 "metadata": {
diff --git a/notebooks/references.ipynb b/notebooks/references.ipynb
index 8d814cc..d0c13c4 100644
--- a/notebooks/references.ipynb
+++ b/notebooks/references.ipynb
@@ -13,10 +13,10 @@
  ],
  "metadata": {
   "kernelspec": {
-   "name": "01_classification",
+   "name": "python3",
    "language": "python",
-   "display_name": "01_classification",
-   "path": "/home/runner/.local/share/jupyter/kernels/01_classification"
+   "display_name": "Python 3 (ipykernel)",
+   "path": "/home/npikall/miniconda3/envs/dev/share/jupyter/kernels/python3"
   }
  },
 "nbformat": 4,
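Returning to the classifier comparison added in the chapter above: the agreement map and the percentages behind the bar chart reduce to combining two binary masks. A sketch with made-up predictions follows; the 0-3 encoding and the variable names are assumptions for illustration, not the chapter's actual implementation.

```python
import numpy as np

# Hypothetical binary forest masks from the two classifiers (1 = forest).
nb_pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 1]])
rf_pred = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]])

# Encode the four possible outcomes in a single map:
# 3 = both say forest, 2 = only Naive Bayes, 1 = only Random Forest,
# 0 = both say non-forest.
agreement = 2 * nb_pred + rf_pred

# Class shares, as in the chapter's bar chart.
values, counts = np.unique(agreement, return_counts=True)
percentages = 100 * counts / agreement.size
for v, p in zip(values, percentages):
    print(int(v), f"{p:.1f}%")
```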