[SEDONA-667] Getis Ord (#1652)

* Getis Ord * add matplotlib dependency * scipy/pysal compatability fix * rebase onto pre-commit changes * add missing paren --------- Co-authored-by: jameswillis <[email protected]>
apache · Oct 28, 2024 · 61bb000 · 61bb000
1 parent 3d0d54d
commit 61bb000
Show file tree

Hide file tree

Showing 14 changed files with 1,120 additions and 2 deletions.
diff --git a/docs/api/stats/sql.md b/docs/api/stats/sql.md
@@ -49,3 +49,70 @@ names in parentheses are python variable names
 - geometry - name of the geometry column
 - handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
 - useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+The output is the input DataFrame with the lof added to each row.
+
+## Using Getis-Ord Gi(*)
+
+The G Local function is provided at `org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal` in scala/java and `sedona.stats.hotspot_detection.getis_ord.g_local` in python.
+
+Performs the Gi or Gi* statistic on the x column of the dataframe.
+
+Weights should be the neighbors of this row. The members of the weights should be comprised
+of structs containing a value column and a neighbor column. The neighbor column should be the
+contents of the neighbors with the same types as the parent row (minus neighbors). Reference the _Using the Distance
+Weighting Function_ header for instructions on generating this column. To calculate the Gi*
+statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the
+weights column) and `star=true`. Significance is calculated with a z score.
+
+### Parameters
+
+- dataframe - the dataframe to perform the G statistic on
+- x - The column name we want to perform hotspot analysis on
+- weights - The column name containing the neighbors array. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use `Weighting` class functions to achieve this.
+- star - Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
+
+The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.
+
+## Using the Distance Weighting Function
+
+The Weighting functions are provided at `org.apache.sedona.stats.Weighting` in scala/java and `sedona.stats.weighting` in python.
+
+The function generates a column containing an array of structs containing a value column and a neighbor column.
+
+The generic `addDistanceBandColumn` (`add_distance_band_column` in python) function annotates a dataframe with a weights column containing the other records within the threshold and their weight.
+
+The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
+geometry column is present it will be used automatically. If two are present, the one named
+'geometry' will be used. If more than one are present and neither is named 'geometry', the
+column name must be provided. The new column will be named 'cluster'.
+
+### Parameters
+
+#### addDistanceBandColumn
+
+names in parentheses are python variable names
+
+- dataframe - DataFrame with geometry column
+- threshold - Distance threshold for considering neighbors
+- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
+- alpha - alpha to use for inverse distance weights ignored when binary is true
+- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
+- includeSelf (include_self) - whether to include self in the list of neighbors
+- selfWeight (self_weight) - the value to use for the self weight
+- geometry - name of the geometry column
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+#### addBinaryDistanceBandColumn
+
+names in parentheses are python variable names
+
+- dataframe - DataFrame with geometry column
+- threshold - Distance threshold for considering neighbors
+- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
+- includeSelf (include_self) - whether to include self in the list of neighbors
+- selfWeight (self_weight) - the value to use for the self weight
+- geometry - name of the geometry column
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+In both cases the output is the input DataFrame with the weights column added to each row.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
@@ -946,6 +946,67 @@ The output will look like this:
 +--------------------+------------------+
 ```
 
+## Perform Getis-Ord Gi(*) Hot Spot Analysis
+
+Sedona provides an implementation of the [Gi and Gi*](https://en.wikipedia.org/wiki/Getis%E2%80%93Ord_statistics) algorithms to identify local hotspots in spatial data
+
+The algorithm is available as a Scala and Python function called on a spatial dataframe. The returned dataframe has additional columns added containing G statistic, E[G], V[G], the Z score, and the p-value.
+
+Using Gi involves first generating the neighbors list for each record, then calling the g_local function.
+=== "Scala"
+
+	```scala
+	import org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn
+	import org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal
+
+	val distanceRadius = 1.0
+	val weightedDf = addBinaryDistanceBandColumn(df, distanceRadius)
+    gLocal(weightedDf, "val").show()
+	```
+
+=== "Java"
+
+	```java
+	import org.apache.sedona.stats.Weighting;
+	import org.apache.sedona.stats.hotspotDetection.GetisOrd;
+	import org.apache.spark.sql.DataFrame;
+
+	double distanceRadius = 1.0;
+	DataFrame weightedDf = Weighting.addBinaryDistanceBandColumn(df, distanceRadius);
+	GetisOrd.gLocal(weightedDf, "val").show();
+	```
+
+=== "Python"
+
+	```python
+	from sedona.stats.weighting import add_binary_distance_band_column
+	from sedona.stats.hotspot_detection.getis_ord import g_local
+
+	distance_radius = 1.0
+	weighted_df = addBinaryDistanceBandColumn(df, distance_radius)
+    g_local(weightedDf, "val").show()
+	```
+
+The output will look like this:
+
+```
++-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
+|   geometry|val|             weights|                  G|                 EG|                  VG|                   Z|                   P|
++-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
+|POINT (2 2)|0.9|[{{POINT (2 3), 1...| 0.4488188976377953|0.45454545454545453| 0.00356321373799772|-0.09593402008347063|  0.4617864875295957|
+|POINT (2 3)|1.2|[{{POINT (2 2), 0...|0.35433070866141736|0.36363636363636365|0.003325666155464539|-0.16136436037034918|  0.4359032175415549|
+|POINT (3 3)|1.2|[{{POINT (2 3), 1...|0.28346456692913385| 0.2727272727272727|0.002850570990398176| 0.20110780337013057| 0.42030714022155924|
+|POINT (3 2)|1.2|[{{POINT (2 2), 0...| 0.4488188976377953|0.45454545454545453| 0.00356321373799772|-0.09593402008347063|  0.4617864875295957|
+|POINT (3 1)|1.2|[{{POINT (3 2), 3...| 0.3622047244094489| 0.2727272727272727|0.002850570990398176|  1.6758983614177538| 0.04687905137429871|
+|POINT (2 1)|2.2|[{{POINT (2 2), 0...| 0.4330708661417323|0.36363636363636365|0.003325666155464539|  1.2040263812249166| 0.11428969105925013|
+|POINT (1 1)|1.2|[{{POINT (2 1), 5...| 0.2834645669291339| 0.2727272727272727|0.002850570990398176|  0.2011078033701316|  0.4203071402215588|
+|POINT (1 2)|0.2|[{{POINT (2 2), 0...|0.35433070866141736|0.45454545454545453| 0.00356321373799772|   -1.67884535146075|0.046591093685710794|
+|POINT (1 3)|1.2|[{{POINT (2 3), 1...| 0.2047244094488189| 0.2727272727272727|0.002850570990398176| -1.2736827546774914| 0.10138793530151635|
+|POINT (0 2)|1.0|[{{POINT (1 2), 7...|0.09448818897637795|0.18181818181818182|0.002137928242798632| -1.8887168824332323|0.029464887612748458|
+|POINT (4 2)|1.2|[{{POINT (3 2), 3...| 0.1889763779527559|0.18181818181818182|0.002137928242798632| 0.15481285921583854| 0.43848442662481324|
++-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
+```
+
 ## Run spatial queries
 
 After creating a Geometry type column, you are able to run spatial queries.

diff --git a/python/Pipfile b/python/Pipfile
@@ -11,6 +11,10 @@ mkdocs="*"
 pytest-cov = "*"
 
 scikit-learn = "*"
+esda = "*"
+libpysal = "*"
+matplotlib = "*" # implicit dependency of esda
+scipy = "<=1.10.0" # prevent incompatibility with pysal 4.7.0, which is what is resolved to when shapely >2 is specified
 
 [packages]
 pandas="<=1.5.3"

diff --git a/python/sedona/stats/hotspot_detection/__init__.py b/python/sedona/stats/hotspot_detection/__init__.py
@@ -0,0 +1,18 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+"""Detecting across a region where a variable's value is significantly different from other values nearby."""
diff --git a/python/sedona/stats/hotspot_detection/getis_ord.py b/python/sedona/stats/hotspot_detection/getis_ord.py
@@ -0,0 +1,65 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+"""Getis Ord functions. From the 1992 paper by Getis & Ord.
+
+Getis, A., & Ord, J. K. (1992). The analysis of spatial association by use of distance statistics.
+Geographical Analysis, 24(3), 189-206. https://doi.org/10.1111/j.1538-4632.1992.tb00261.x
+"""
+
+from pyspark.sql import Column, DataFrame, SparkSession
+
+# todo change weights and x type to string
+
+
+def g_local(
+    dataframe: DataFrame,
+    x: str,
+    weights: str = "weights",
+    permutations: int = 0,
+    star: bool = False,
+    island_weight: float = 0.0,
+) -> DataFrame:
+    """Performs the Gi or Gi* statistic on the x column of the dataframe.
+
+    Weights should be the neighbors of this row. The members of the weights should be comprised of structs containing a
+    value column and a neighbor column. The neighbor column should be the contents of the neighbors with the same types
+    as the parent row (minus neighbors). You can use `wherobots.weighing.add_distance_band_column` to achieve this. To
+    calculate the Gi* statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the weights
+    column) and `star=true`. Significance is calculated with a z score. Permutation tests are not yet implemented and
+    thus island weight does nothing. The following columns will be added: G, E[G], V[G], Z, P.
+
+    Args:
+        dataframe: the dataframe to perform the G statistic on
+        x: The column name we want to perform hotspot analysis on
+        weights: The column name containing the neighbors array. The neighbor column should be the contents of
+            the neighbors with the same types as the parent row (minus neighbors). You can use
+            `wherobots.weighing.add_distance_band_column` to achieve this.
+        permutations: Not used. Permutation tests are not supported yet. The number of permutations to use for the
+            significance test.
+        star: Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
+        island_weight: Not used. The weight for the simulated neighbor used for records without a neighbor in perm tests
+    Returns:
+        A dataframe with the original columns plus the columns G, E[G], V[G], Z, P.
+    """
+    sedona = SparkSession.getActiveSession()
+
+    result_df = sedona._jvm.org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal(
+        dataframe, x, weights, permutations, star, island_weight
+    )
+
+    return DataFrame(result_df, sedona)
diff --git a/python/sedona/stats/weighting.py b/python/sedona/stats/weighting.py
@@ -0,0 +1,110 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+"""Weighting functions for spatial data."""
+
+from typing import Optional
+
+from pyspark.sql import DataFrame, SparkSession
+
+
+def add_distance_band_column(
+    dataframe: DataFrame,
+    threshold: float,
+    binary: bool = True,
+    alpha: float = -1.0,
+    include_zero_distance_neighbors: bool = False,
+    include_self: bool = False,
+    self_weight: float = 1.0,
+    geometry: Optional[str] = None,
+    use_spheroid: bool = False,
+) -> DataFrame:
+    """Annotates a dataframe with a weights column containing the other records within the threshold and their weight.
+
+    The dataframe should contain at least one GeometryType column. Rows must be unique. If one
+    geometry column is present it will be used automatically. If two are present, the one named
+    'geometry' will be used. If more than one are present and neither is named 'geometry', the
+    column name must be provided. The new column will be named 'cluster'.
+
+    Args:
+        dataframe: DataFrame with geometry column
+        threshold: Distance threshold for considering neighbors
+        binary: whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
+        alpha: alpha to use for inverse distance weights ignored when binary is true
+        include_zero_distance_neighbors: whether to include neighbors that are 0 distance. If 0 distance neighbors are
+            included and binary is false, values are infinity as per the floating point spec (divide by 0)
+        include_self: whether to include self in the list of neighbors
+        self_weight: the value to use for the self weight
+        geometry: name of the geometry column
+        use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
+
+    Returns:
+        The input DataFrame with a weight column added containing neighbors and their weights added to each row.
+
+    """
+    sedona = SparkSession.getActiveSession()
+    return sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumn(
+        dataframe._jdf,
+        float(threshold),
+        binary,
+        float(alpha),
+        include_zero_distance_neighbors,
+        include_self,
+        float(self_weight),
+        geometry,
+        use_spheroid,
+    )
+
+
+def add_binary_distance_band_column(
+    dataframe: DataFrame,
+    threshold: float,
+    include_zero_distance_neighbors: bool = True,
+    include_self: bool = False,
+    geometry: Optional[str] = None,
+    use_spheroid: bool = False,
+) -> DataFrame:
+    """Annotates a dataframe with a weights column containing the other records within the threshold and their weight.
+
+    Weights will always be 1.0. The dataframe should contain at least one GeometryType column. Rows must be unique. If
+    one geometry column is present it will be used automatically. If two are present, the one named 'geometry' will be
+    used. If more than one are present and neither is named 'geometry', the column name must be provided. The new column
+    will be named 'cluster'.
+
+    Args:
+        dataframe: DataFrame with geometry column
+        threshold: Distance threshold for considering neighbors
+        include_zero_distance_neighbors: whether to include neighbors that are 0 distance. If 0 distance neighbors are
+            included and binary is false, values are infinity as per the floating point spec (divide by 0)
+        include_self: whether to include self in the list of neighbors
+        geometry: name of the geometry column
+        use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
+
+    Returns:
+        The input DataFrame with a weight column added containing neighbors and their weights (always 1) added to each
+        row.
+
+    """
+    sedona = SparkSession.getActiveSession()
+    return sedona._jvm.org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn(
+        dataframe._jdf,
+        float(threshold),
+        include_zero_distance_neighbors,
+        include_self,
+        geometry,
+        use_spheroid,
+    )