Skip to content

Commit

Permalink
[SEDONA-667] Getis Ord (#1652)
Browse files Browse the repository at this point in the history
* Getis Ord

* add matplotlib dependency

* scipy/pysal compatability fix

* rebase onto pre-commit changes

* add missing paren

---------

Co-authored-by: jameswillis <[email protected]>
  • Loading branch information
james-willis and jameswillis authored Oct 28, 2024
1 parent 3d0d54d commit 61bb000
Show file tree
Hide file tree
Showing 14 changed files with 1,120 additions and 2 deletions.
67 changes: 67 additions & 0 deletions docs/api/stats/sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,3 +49,70 @@ names in parentheses are python variable names
- geometry - name of the geometry column
- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false

The output is the input DataFrame with the lof added to each row.

## Using Getis-Ord Gi(*)

The G Local function is provided at `org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal` in scala/java and `sedona.stats.hotspot_detection.getis_ord.g_local` in python.

Performs the Gi or Gi* statistic on the x column of the dataframe.

Weights should be the neighbors of this row. The members of the weights should be comprised
of structs containing a value column and a neighbor column. The neighbor column should be the
contents of the neighbors with the same types as the parent row (minus neighbors). Reference the _Using the Distance
Weighting Function_ header for instructions on generating this column. To calculate the Gi*
statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the
weights column) and `star=true`. Significance is calculated with a z score.

### Parameters

- dataframe - the dataframe to perform the G statistic on
- x - The column name we want to perform hotspot analysis on
- weights - The column name containing the neighbors array. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use `Weighting` class functions to achieve this.
- star - Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi

The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.

## Using the Distance Weighting Function

The Weighting functions are provided at `org.apache.sedona.stats.Weighting` in scala/java and `sedona.stats.weighting` in python.

The function generates a column containing an array of structs containing a value column and a neighbor column.

The generic `addDistanceBandColumn` (`add_distance_band_column` in python) function annotates a dataframe with a weights column containing the other records within the threshold and their weight.

The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
geometry column is present it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one are present and neither is named 'geometry', the
column name must be provided. The new column will be named 'cluster'.

### Parameters

#### addDistanceBandColumn

names in parentheses are python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
- alpha - alpha to use for inverse distance weights ignored when binary is true
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false

#### addBinaryDistanceBandColumn

names in parentheses are python variable names

- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false

In both cases the output is the input DataFrame with the weights column added to each row.
61 changes: 61 additions & 0 deletions docs/tutorial/sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -946,6 +946,67 @@ The output will look like this:
+--------------------+------------------+
```

## Perform Getis-Ord Gi(*) Hot Spot Analysis

Sedona provides an implementation of the [Gi and Gi*](https://en.wikipedia.org/wiki/Getis%E2%80%93Ord_statistics) algorithms to identify local hotspots in spatial data

The algorithm is available as a Scala and Python function called on a spatial dataframe. The returned dataframe has additional columns added containing G statistic, E[G], V[G], the Z score, and the p-value.

Using Gi involves first generating the neighbors list for each record, then calling the g_local function.
=== "Scala"

```scala
import org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn
import org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal

val distanceRadius = 1.0
val weightedDf = addBinaryDistanceBandColumn(df, distanceRadius)
gLocal(weightedDf, "val").show()
```

=== "Java"

```java
import org.apache.sedona.stats.Weighting;
import org.apache.sedona.stats.hotspotDetection.GetisOrd;
import org.apache.spark.sql.DataFrame;

double distanceRadius = 1.0;
DataFrame weightedDf = Weighting.addBinaryDistanceBandColumn(df, distanceRadius);
GetisOrd.gLocal(weightedDf, "val").show();
```

=== "Python"

```python
from sedona.stats.weighting import add_binary_distance_band_column
from sedona.stats.hotspot_detection.getis_ord import g_local

distance_radius = 1.0
weighted_df = addBinaryDistanceBandColumn(df, distance_radius)
g_local(weightedDf, "val").show()
```

The output will look like this:

```
+-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
| geometry|val| weights| G| EG| VG| Z| P|
+-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
|POINT (2 2)|0.9|[{{POINT (2 3), 1...| 0.4488188976377953|0.45454545454545453| 0.00356321373799772|-0.09593402008347063| 0.4617864875295957|
|POINT (2 3)|1.2|[{{POINT (2 2), 0...|0.35433070866141736|0.36363636363636365|0.003325666155464539|-0.16136436037034918| 0.4359032175415549|
|POINT (3 3)|1.2|[{{POINT (2 3), 1...|0.28346456692913385| 0.2727272727272727|0.002850570990398176| 0.20110780337013057| 0.42030714022155924|
|POINT (3 2)|1.2|[{{POINT (2 2), 0...| 0.4488188976377953|0.45454545454545453| 0.00356321373799772|-0.09593402008347063| 0.4617864875295957|
|POINT (3 1)|1.2|[{{POINT (3 2), 3...| 0.3622047244094489| 0.2727272727272727|0.002850570990398176| 1.6758983614177538| 0.04687905137429871|
|POINT (2 1)|2.2|[{{POINT (2 2), 0...| 0.4330708661417323|0.36363636363636365|0.003325666155464539| 1.2040263812249166| 0.11428969105925013|
|POINT (1 1)|1.2|[{{POINT (2 1), 5...| 0.2834645669291339| 0.2727272727272727|0.002850570990398176| 0.2011078033701316| 0.4203071402215588|
|POINT (1 2)|0.2|[{{POINT (2 2), 0...|0.35433070866141736|0.45454545454545453| 0.00356321373799772| -1.67884535146075|0.046591093685710794|
|POINT (1 3)|1.2|[{{POINT (2 3), 1...| 0.2047244094488189| 0.2727272727272727|0.002850570990398176| -1.2736827546774914| 0.10138793530151635|
|POINT (0 2)|1.0|[{{POINT (1 2), 7...|0.09448818897637795|0.18181818181818182|0.002137928242798632| -1.8887168824332323|0.029464887612748458|
|POINT (4 2)|1.2|[{{POINT (3 2), 3...| 0.1889763779527559|0.18181818181818182|0.002137928242798632| 0.15481285921583854| 0.43848442662481324|
+-----------+---+--------------------+-------------------+-------------------+--------------------+--------------------+--------------------+
```

## Run spatial queries

After creating a Geometry type column, you are able to run spatial queries.
Expand Down
4 changes: 4 additions & 0 deletions python/Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,10 @@ mkdocs="*"
pytest-cov = "*"

scikit-learn = "*"
esda = "*"
libpysal = "*"
matplotlib = "*" # implicit dependency of esda
scipy = "<=1.10.0" # prevent incompatibility with pysal 4.7.0, which is what is resolved to when shapely >2 is specified

[packages]
pandas="<=1.5.3"
Expand Down
18 changes: 18 additions & 0 deletions python/sedona/stats/hotspot_detection/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Detecting across a region where a variable's value is significantly different from other values nearby."""
65 changes: 65 additions & 0 deletions python/sedona/stats/hotspot_detection/getis_ord.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Getis Ord functions. From the 1992 paper by Getis & Ord.
Getis, A., & Ord, J. K. (1992). The analysis of spatial association by use of distance statistics.
Geographical Analysis, 24(3), 189-206. https://doi.org/10.1111/j.1538-4632.1992.tb00261.x
"""

from pyspark.sql import Column, DataFrame, SparkSession

# todo change weights and x type to string


def g_local(
dataframe: DataFrame,
x: str,
weights: str = "weights",
permutations: int = 0,
star: bool = False,
island_weight: float = 0.0,
) -> DataFrame:
"""Performs the Gi or Gi* statistic on the x column of the dataframe.
Weights should be the neighbors of this row. The members of the weights should be comprised of structs containing a
value column and a neighbor column. The neighbor column should be the contents of the neighbors with the same types
as the parent row (minus neighbors). You can use `wherobots.weighing.add_distance_band_column` to achieve this. To
calculate the Gi* statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the weights
column) and `star=true`. Significance is calculated with a z score. Permutation tests are not yet implemented and
thus island weight does nothing. The following columns will be added: G, E[G], V[G], Z, P.
Args:
dataframe: the dataframe to perform the G statistic on
x: The column name we want to perform hotspot analysis on
weights: The column name containing the neighbors array. The neighbor column should be the contents of
the neighbors with the same types as the parent row (minus neighbors). You can use
`wherobots.weighing.add_distance_band_column` to achieve this.
permutations: Not used. Permutation tests are not supported yet. The number of permutations to use for the
significance test.
star: Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
island_weight: Not used. The weight for the simulated neighbor used for records without a neighbor in perm tests
Returns:
A dataframe with the original columns plus the columns G, E[G], V[G], Z, P.
"""
sedona = SparkSession.getActiveSession()

result_df = sedona._jvm.org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal(
dataframe, x, weights, permutations, star, island_weight
)

return DataFrame(result_df, sedona)
110 changes: 110 additions & 0 deletions python/sedona/stats/weighting.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Weighting functions for spatial data."""

from typing import Optional

from pyspark.sql import DataFrame, SparkSession


def add_distance_band_column(
dataframe: DataFrame,
threshold: float,
binary: bool = True,
alpha: float = -1.0,
include_zero_distance_neighbors: bool = False,
include_self: bool = False,
self_weight: float = 1.0,
geometry: Optional[str] = None,
use_spheroid: bool = False,
) -> DataFrame:
"""Annotates a dataframe with a weights column containing the other records within the threshold and their weight.
The dataframe should contain at least one GeometryType column. Rows must be unique. If one
geometry column is present it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one are present and neither is named 'geometry', the
column name must be provided. The new column will be named 'cluster'.
Args:
dataframe: DataFrame with geometry column
threshold: Distance threshold for considering neighbors
binary: whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
alpha: alpha to use for inverse distance weights ignored when binary is true
include_zero_distance_neighbors: whether to include neighbors that are 0 distance. If 0 distance neighbors are
included and binary is false, values are infinity as per the floating point spec (divide by 0)
include_self: whether to include self in the list of neighbors
self_weight: the value to use for the self weight
geometry: name of the geometry column
use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
Returns:
The input DataFrame with a weight column added containing neighbors and their weights added to each row.
"""
sedona = SparkSession.getActiveSession()
return sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumn(
dataframe._jdf,
float(threshold),
binary,
float(alpha),
include_zero_distance_neighbors,
include_self,
float(self_weight),
geometry,
use_spheroid,
)


def add_binary_distance_band_column(
dataframe: DataFrame,
threshold: float,
include_zero_distance_neighbors: bool = True,
include_self: bool = False,
geometry: Optional[str] = None,
use_spheroid: bool = False,
) -> DataFrame:
"""Annotates a dataframe with a weights column containing the other records within the threshold and their weight.
Weights will always be 1.0. The dataframe should contain at least one GeometryType column. Rows must be unique. If
one geometry column is present it will be used automatically. If two are present, the one named 'geometry' will be
used. If more than one are present and neither is named 'geometry', the column name must be provided. The new column
will be named 'cluster'.
Args:
dataframe: DataFrame with geometry column
threshold: Distance threshold for considering neighbors
include_zero_distance_neighbors: whether to include neighbors that are 0 distance. If 0 distance neighbors are
included and binary is false, values are infinity as per the floating point spec (divide by 0)
include_self: whether to include self in the list of neighbors
geometry: name of the geometry column
use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
Returns:
The input DataFrame with a weight column added containing neighbors and their weights (always 1) added to each
row.
"""
sedona = SparkSession.getActiveSession()
return sedona._jvm.org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn(
dataframe._jdf,
float(threshold),
include_zero_distance_neighbors,
include_self,
geometry,
use_spheroid,
)
Loading

0 comments on commit 61bb000

Please sign in to comment.