
Evaluation Metrics & Process #47

Open
Ty4Code opened this issue Jan 26, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@Ty4Code

Ty4Code commented Jan 26, 2024

NOTE: This isn't really an issue, more of a discussion topic/idea that I wanted to raise and get feedback on, to see whether there is interest or value in it before I implement anything. Also, a huge thanks to everyone who has contributed to this project & OpenPointClass; lots of amazing work has already been done, so kudos!

Idea: What if we add an additional set of sub-task evaluation metrics that evaluate how well the point cloud classification is able to produce accurate DTMs?

My current understanding is that the evaluation so far focuses on per-point classification metrics for the point cloud, for example the metrics found in this PR (#46).

So models are currently evaluated on how accurately they classify points, which makes a lot of sense. The question I then have is: how accurate are the DTMs generated from those classified point clouds?

For example, I imagine that we could have two models, M1 and M2. It's quite possible that M1 might have worse point classification precision/recall/accuracy scores compared to M2, but could produce higher quality/more accurate DTMs from the classified point clouds.

For that reason, I thought it might be a good idea to add in a new 'subtask' evaluation routine that is run as follows:

  1. Use the ground-truth classified point clouds to produce a 'ground truth DTM' file for each ground truth cloud using the pc2dem.py script.
  2. Using the model under evaluation, run the point cloud classification as normal, then run the output through the pc2dem.py script to produce a 'predicted DTM' file.
  3. Finally, take the 'ground truth DTM' and 'predicted DTM' outputs from steps 1 & 2 and run some type of evaluation routine to compare the prediction against the ground truth.

This would produce a new set of 'DTM estimation metrics' that would be complementary to the current set of 'point cloud classification metrics'. I would like to hear what others think, does this seem like a useful addition that could be pulled/merged in, or does it not align with the current goals of the project & dataset?
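
Roughly, I'm picturing something like the Python script below. The paths, flags and invocations of pc2dem.py / pcclassify are placeholders to illustrate the flow, not their real command-line interfaces:

import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# 1. Ground-truth DTM from the ground-truth classified cloud
#    (flags are illustrative only)
run(["python3", "pc2dem.py", "ground_truth.laz", "--output", "gt_dtm.tif"])

# 2. Re-classify the cloud with the model under evaluation, then build its DTM
run(["pcclassify", "ground_truth.laz", "predicted.laz", "model.bin"])
run(["python3", "pc2dem.py", "predicted.laz", "--output", "pred_dtm.tif"])

# 3. Compare gt_dtm.tif against pred_dtm.tif with some DEM error metrics
#    (see the metric definitions later in this thread)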

@pierotofy
Member

It's an interesting idea, but how would that differ from comparing the metrics for the "ground" class?

@Ty4Code
Author

Ty4Code commented Jan 26, 2024

I think my example might have been a bit hard to follow with text, I wish I had some visualisations.

To summarise briefly: the current metrics for the "ground" class treat all points the same. If you mis-classify a point as 'ground' when it was a rooftop, a treetop, or a small bucket on the ground, each case contributes the same 'error' to the metric.

But when we care about generating a DTM, it is much 'worse' (has a larger cost/error) to mis-classify a treetop as ground than it is to mis-classify a small bucket on the ground. If we only look at the current "ground" class metrics, though, they would show no difference.

The current metrics are useful and should be kept, but there's an old saying that what is not measured cannot be improved. If we are not measuring the final DTM accuracy, who's to say that any new models trained/released are actually improving the DTM? Maybe a new model has better ground-class metrics but actually produces worse DTMs for ODM; we would never know unless we measure the DTM metrics.

@Ty4Code
Author

Ty4Code commented Jan 26, 2024

Actually, I just had another idea in a similar vein @pierotofy: LightGBM has an option to provide sample weights during training.

So you could add a per-point weight calculated from the ground-truth DEM, so that during training the model learns that it is worse to mis-classify points with large elevation deltas from the terrain/ground. The model could then pick up those patterns and might produce higher-quality DTMs without adding any new training data at all.
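
A very rough sketch of the idea is below. It assumes per-point features X, labels y, point elevations z, the ground-truth terrain height dtm_z sampled at each point, and num_classes are already available, and the weighting function itself is just a guess. OPC trains through LightGBM natively, so this Python snippet is only meant to illustrate the concept:

import numpy as np
import lightgbm as lgb

def elevation_weights(z, dtm_z, scale=1.0, cap=10.0):
    # Points far above/below the ground-truth terrain get larger weights,
    # so mis-classifying a treetop as ground costs more than mis-classifying
    # a small object sitting on the ground.
    delta = np.abs(z - dtm_z)
    return 1.0 + np.minimum(delta / scale, cap)

train_set = lgb.Dataset(X, label=y, weight=elevation_weights(z, dtm_z))
params = {"objective": "multiclass", "num_class": num_classes}
model = lgb.train(params, train_set, num_boost_round=100)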

I'm also curious: have you experimented at all with hyperparameter tuning? I noticed that the current learning rate and 'num_leaves' parameters are hard-coded, and I wonder if a search over those parameters could yield some easy performance gains. Not sure if you've already done this.
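
Even a small grid search would tell us whether there are gains to be had. A sketch, again purely illustrative, with arbitrary grids and assuming the same X/y training data and num_classes as above:

import itertools
import lightgbm as lgb

train_set = lgb.Dataset(X, label=y)
best = None
for lr, leaves in itertools.product([0.05, 0.1, 0.2], [31, 63, 127]):
    params = {"objective": "multiclass", "num_class": num_classes,
              "learning_rate": lr, "num_leaves": leaves, "verbosity": -1}
    cv = lgb.cv(params, train_set, num_boost_round=200, nfold=3, seed=42)
    # the result key name varies across LightGBM versions, so grab the '-mean' entry
    mean_key = [k for k in cv if k.endswith("-mean")][0]
    score = min(cv[mean_key])
    if best is None or score < best[0]:
        best = (score, lr, leaves)
print("best mean logloss %.4f at learning_rate=%s, num_leaves=%s" % best)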

@pierotofy added the enhancement (New feature or request) label on Jan 27, 2024
@pierotofy
Member

That makes sense, thanks for the explanation.

It could be an interesting addition.

I have not played with hyperparameters much. We'd welcome improvements in this area as well.

@Ty4Code
Author

Ty4Code commented Jan 30, 2024

Quick update! I put together a Python script that can be run with something like:

python3 evaluate_opc_dtm.py --input_point_cloud /data/ground_truth_pc.laz --input_opc_model /data/opc-v1.3_model.bin
It will:

  1. Load the input point cloud and generate a DTM and a DSM using pc2dem.py
  2. Run pcclassify with the input model on the point cloud and then generate a DTM from the re-classified cloud using pc2dem.py
  3. Run a bunch of evaluation metrics to compare the 'predicted DTM' to the 'ground truth DTM'.

As output, it can save a stats file with JSON 'DTM evaluation metrics' and also some graphs, which are helpful for debugging the DEMs and the errors your model is making.

Questions:

  • Does this make sense to add somewhere in the pipeline, and does it align with the project's goals?
  • If it does, do you have any recommendations on how best to integrate it? The biggest issue is that the pipeline requires pc2dem.py, which pulls in the ODM repository as a dependency, and that seems like it might be infeasible to add to this repo. Curious to hear your thoughts.

Adding some extra info below on the evaluation metrics I came up with, for anyone who might be interested and wants to discuss or provide suggestions. These were just my best initial guesses at metrics that would measure how 'good' or 'useful' a predicted DTM is compared to a ground-truth DTM.

Evaluating Predicted DTMs

NOTE: Skip this section if you're not interested in the evaluation metric definitions.

To compare the 'predicted DTM' against the 'ground truth DTM', the script first aligns and/or expands both rasters so that they cover the same extent and have the same shape.

The evaluation metrics produced:

  • MAE: The mean absolute error of each DEM cell compared between ground-truth DTM and predicted DTM. (e.g. MAE=0.5 means that for any location on the DTM, our prediction is 'off' by 50cm on average)
  • RMSE: The root mean squared error of each DEM cell compared between ground-truth DTM and predicted DTM.
  • q95AE: The 95th quantile of the absolute error of each DEM cell compared between ground-truth DTM and predicted DTM (e.g. q95AE=0.8 means that our prediction is 'off' by less than 80cm for 95% of the DTM surface area)
  • q99AE: Same as q95AE but for the 99th quantile
  • MAX_ERR: The maximum error for a single DEM cell compared between ground-truth DTM and predicted DTM

Finally, for each evaluation metric, I also re-computed it using the DSM as a baseline. For example, for MAE we treat the DSM as a 'predicted DTM' and calculate its MAE, let's say 4.5m, while our actual predicted DTM has an MAE of 0.9m. In that case the 'MAE_relative-dsm' is 1 - 0.9/4.5 = 80%.

On this 'relative-to-DSM' scale, every metric is computed as (1 - pred_metric / dsm_metric): 100% would mean our model's DTM is perfect, and 0% would mean it is no better than just using the DSM.

This relative-to-dsm metric seems helpful because it lets us compare across different point clouds which might have different scales or levels of difficulty.
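
In code, the core of it is roughly the following, assuming the ground-truth DTM, predicted DTM and DSM have already been loaded and aligned into same-shape numpy arrays with nodata cells masked out (gt_dtm, pred_dtm and dsm below):

import numpy as np

def dem_metrics(pred, truth):
    err = np.abs(pred - truth)
    return {
        "MAE": float(np.mean(err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "q95AE": float(np.quantile(err, 0.95)),
        "q99AE": float(np.quantile(err, 0.99)),
        "MAX_ERR": float(np.max(err)),
    }

def relative_to_dsm(pred_metrics, dsm_metrics):
    # 1.0 (100%) = perfect prediction, 0.0 = no better than using the DSM as-is
    return {k: 1.0 - pred_metrics[k] / dsm_metrics[k] for k in pred_metrics}

pred_metrics = dem_metrics(pred_dtm, gt_dtm)
dsm_metrics = dem_metrics(dsm, gt_dtm)   # DSM treated as a naive 'predicted DTM'
relative = relative_to_dsm(pred_metrics, dsm_metrics)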

Example metrics for OPC V1.3 run on odm_data_toledo.laz:
DTM Prediction Metrics

DEM Cell Count: 23.0M
Mean Absolute Error: 0.153016m
Root Mean Squared Error: 0.41m
Maximum Error: 13.35m

Prediction relative 'mae' is: 88.52% (model error 0.15m compared to DSM error 1.33m)
Prediction relative 'rmse' is: 89.37% (model error 0.41m compared to DSM error 3.86m)
Prediction relative 'max_error' is: 34.92% (model error 13.35m compared to DSM error 20.52m)
Prediction relative 'q95ae' is: 93.29% (model error 0.76m compared to DSM error 11.29m)
Prediction relative 'q99ae' is: 88.49% (model error 1.91m compared to DSM error 16.60m)

@pierotofy
Member

I think it might make sense for this to live as a separate effort (at least initially), due to the ODM dependency.

I would recommend publishing the script in a separate repo, then adding instructions on how to run it to the README here.
