convert_excel_date, convert_matlab_date for Polars #1365

Merged on Jul 3, 2024 (110 commits)
Commits
862c7dd
add make_clean_names function that can be applied to polars
Apr 19, 2024
01531cc
add examples for make_clean_names
Apr 20, 2024
0fb440e
changelog
Apr 20, 2024
5e944b2
limit import location for polars
Apr 20, 2024
501d9c6
limit import location for polars
Apr 20, 2024
9506832
fix polars in environment-dev.yml
Apr 20, 2024
1ae8edd
install polars in doctest
Apr 20, 2024
3b1829b
limit polars imports - user should have polars already installed
Apr 20, 2024
52fd80c
use subprocess.run
Apr 20, 2024
2dce78b
add subprocess.devnull to docstrings
Apr 20, 2024
37b3feb
add subprocess.devnull to docstrings
Apr 20, 2024
0953f2d
add subprocess.devnull to docstrings
Apr 20, 2024
d7c71b6
add subprocess.devnull to docstrings
Apr 20, 2024
40b8502
add os.devnull
Apr 20, 2024
4f11d09
add polars as requirement for docs
Apr 20, 2024
54b179c
add polars to tests requirements
Apr 20, 2024
25b39b9
delete irrelevant folder
Apr 20, 2024
a09f34b
changelog
Apr 20, 2024
1b375f8
create submodule for polars
Apr 21, 2024
799532f
fix doctests
Apr 21, 2024
dbce4b9
fix tests; add polars to documentation
Apr 21, 2024
1c642e6
fix tests; add polars to documentation
Apr 21, 2024
407d21b
import janitor.polars
Apr 21, 2024
aedfc65
control docs output for polars submodule
Apr 21, 2024
db9b486
exclude functions in docs rendering
Apr 21, 2024
6a91e67
exclude functions in docs rendering
Apr 21, 2024
7a88078
show_submodules=true
Apr 21, 2024
6d7885e
fix docstring rendering for polars
Apr 21, 2024
944fa02
Expression -> expression
Apr 21, 2024
b9aefaa
Merge dev into samukweku/polars_clean_names
ericmjl Apr 23, 2024
e9c370a
rename functions.py
Apr 23, 2024
ee66d2a
pivot_longer implemented for polars
Apr 29, 2024
959b082
changelog
Apr 30, 2024
3177503
keep changes related only to pivot_longer
Apr 30, 2024
ee899b2
pd -> pl
Apr 30, 2024
8ea9b71
pd -> pl
Apr 30, 2024
d12ae1a
df.pivot_longer -> df.janitor.pivot_longer
Apr 30, 2024
652f3e3
df.pivot_longer -> df.janitor.pivot_longer
Apr 30, 2024
9b9c1a9
pd -> pl
Apr 30, 2024
69c273f
pd -> pl
Apr 30, 2024
b3391e8
add >>> df
Apr 30, 2024
4ffaac5
add >>> df
Apr 30, 2024
1de57bb
keep changes related only to polars pivot_longer
Apr 30, 2024
8097bb4
code update
May 3, 2024
1bcd1b4
names_transform if there is not dot value
May 4, 2024
a492b85
add support for lazyframe
May 4, 2024
193fdc7
fix column order in docs
May 4, 2024
cd5baeb
fix column order in docs
May 4, 2024
1ba2650
fix column order in docs
May 4, 2024
908d71d
fix column order in docs
May 4, 2024
0cc6a93
separate namespaces for lazyframe and eager dataframe
May 5, 2024
0d41284
separate namespaces for lazyframe and eager dataframe
May 5, 2024
e8e6baf
improve perf if no dot_value and values_to is a str
May 5, 2024
2886ea1
fix docs
May 6, 2024
beec9e2
handle deprecation warning for pd.unique
May 6, 2024
a8d97ff
Merge dev into samukweku/polars_pivot_longer
ericmjl May 6, 2024
302a39b
Merge branch 'dev' into samukweku/polars_pivot_longer
samukweku May 10, 2024
7e4cc00
added pivot_longer_spec
May 12, 2024
a2df26f
update docs
May 12, 2024
3635639
update docs
May 12, 2024
d799ef5
update docs
May 12, 2024
198886d
update docs
May 12, 2024
c8a89eb
update docs
May 12, 2024
f733cfd
update docs
May 12, 2024
765837f
update docs
May 12, 2024
0fbe05a
update docs
May 12, 2024
b51f803
reduce repetitiveness in function calls
May 12, 2024
4798b09
update docs
May 12, 2024
30a290b
update docs
May 12, 2024
0755e91
update docs
May 13, 2024
2eeba86
update docs
May 13, 2024
1efdc96
refactor code - create spec then unpivot
May 13, 2024
41f450d
refactor code - create spec then unpivot
May 13, 2024
e03c11a
update docs
May 13, 2024
ba6df7d
update docs
May 13, 2024
0aac3cb
update docs
May 13, 2024
7f689a7
update docs
May 13, 2024
56bca98
add check for spec dataframe
May 17, 2024
65ee3b8
fix doc fail
May 17, 2024
7a69b54
fix doc fail
May 17, 2024
3365f01
exclude non polars code
May 17, 2024
04a8444
Merge dev into samukweku/polars_pivot_longer
ericmjl May 19, 2024
2e67269
convert_excel_date and convert_matlab_date for polars
May 21, 2024
4ac9f7e
convert_excel_date and convert_matlab_date for polars
May 21, 2024
a0994b9
update docs
May 21, 2024
850d41f
update docs - convert_matlab_date
May 21, 2024
41bcbc7
update docs - convert_matlab_date
May 21, 2024
a00a5ac
Merge dev into samukweku/convert_excel_date_polars
ericmjl May 23, 2024
63a623b
Merge dev into samukweku/convert_excel_date_polars
ericmjl May 27, 2024
e9284f6
Merge branch 'dev' into samukweku/convert_excel_date_polars
samukweku Jun 3, 2024
3b63a9c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
43d742e
fix conflicts
Jun 3, 2024
24dcc36
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 4, 2024
f36fe60
Merge branch 'dev' into samukweku/convert_excel_date_polars
samukweku Jun 11, 2024
d233c3d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 11, 2024
98f530e
fix docs
Jun 11, 2024
ee38161
fix docs
Jun 11, 2024
5fda208
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 12, 2024
7f4de97
Merge branch 'dev' into samukweku/convert_excel_date_polars
samukweku Jun 14, 2024
77601a2
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 18, 2024
01954f7
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 18, 2024
7048457
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 18, 2024
94c90bf
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 19, 2024
aa873e2
fix tests
Jun 21, 2024
fed8bb0
fix tests
Jun 21, 2024
f141632
fix tests
Jun 21, 2024
1e28e22
fix docs
Jun 21, 2024
7b941b3
changelog
Jun 21, 2024
f5f219e
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 28, 2024
ce32626
Merge dev into samukweku/convert_excel_date_polars
ericmjl Jun 28, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

## [Unreleased]

- [ENH] Added `convert_excel_date` and `convert_matlab_date` methods for polars - Issue #1352
- [ENH] Added a `complete` method for polars. - Issue #1352 @samukweku
- [ENH] `read_commandline` function now supports polars - Issue #1352
- [ENH] Improved performance for non-equi joins when using numba - @samukweku PR #1341
86 changes: 47 additions & 39 deletions janitor/functions/convert_date.py
@@ -1,24 +1,22 @@
import datetime as dt
from typing import Hashable
from typing import Hashable, Union

import pandas as pd
import pandas_flavor as pf
from pandas.api.types import is_numeric_dtype
from pandas.errors import OutOfBoundsDatetime

from janitor.utils import deprecated_alias
from janitor.utils import deprecated_alias, refactored_function


@pf.register_dataframe_method
@deprecated_alias(column="column_name")
@deprecated_alias(column="column_names")
def convert_excel_date(
df: pd.DataFrame, column_name: Hashable
df: pd.DataFrame, column_names: Union[Hashable, list]
) -> pd.DataFrame:
"""Convert Excel's serial date format into Python datetime format.

This method mutates the original DataFrame.
This method does not mutate the original DataFrame.

Implementation is also from
Implementation is based on
[Stack Overflow](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).

Examples:
@@ -38,40 +36,36 @@ def convert_excel_date(

Args:
df: A pandas DataFrame.
column_name: A column name.

Raises:
ValueError: If there are non numeric values in the column.
column_names: A column name, or a list of column names.

Returns:
A pandas DataFrame with corrected dates.
""" # noqa: E501

if not is_numeric_dtype(df[column_name]):
raise ValueError(
"There are non-numeric values in the column. "
"All values must be numeric."
if not isinstance(column_names, list):
column_names = [column_names]
# https://stackoverflow.com/a/65460255/7175713
dictionary = {
column_name: pd.to_datetime(
df[column_name], unit="D", origin="1899-12-30"
)
for column_name in column_names
}

df[column_name] = pd.TimedeltaIndex(
df[column_name], unit="d"
) + dt.datetime(
1899, 12, 30
) # noqa: W503
return df
return df.assign(**dictionary)


@pf.register_dataframe_method
@deprecated_alias(column="column_name")
@deprecated_alias(column="column_names")
def convert_matlab_date(
df: pd.DataFrame, column_name: Hashable
df: pd.DataFrame, column_names: Union[Hashable, list]
) -> pd.DataFrame:
"""Convert Matlab's serial date number into Python datetime format.

Implementation is also from
Implementation is based on
[Stack Overflow](https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).

This method mutates the original DataFrame.
This method does not mutate the original DataFrame.

Examples:
>>> import pandas as pd
@@ -84,29 +78,38 @@ def convert_matlab_date(
2 737124.498500
3 737124.000000
>>> df.convert_matlab_date('date')
date
0 2018-03-06 00:00:00.000000
1 2018-03-05 19:34:50.563200
2 2018-03-05 11:57:50.399999
3 2018-03-05 00:00:00.000000
date
0 2018-03-06 00:00:00.000000000
1 2018-03-05 19:34:50.563199671
2 2018-03-05 11:57:50.399998876
3 2018-03-05 00:00:00.000000000

Args:
df: A pandas DataFrame.
column_name: A column name.
column_names: A column name, or a list of column names.

Returns:
A pandas DataFrame with corrected dates.
""" # noqa: E501
days = pd.Series([dt.timedelta(v % 1) for v in df[column_name]])
df[column_name] = (
df[column_name].astype(int).apply(dt.datetime.fromordinal)
+ days
- dt.timedelta(days=366)
)
return df
# https://stackoverflow.com/a/49135037/7175713
if not isinstance(column_names, list):
column_names = [column_names]
dictionary = {
column_name: pd.to_datetime(df[column_name] - 719529, unit="D")
for column_name in column_names
}

return df.assign(**dictionary)


@pf.register_dataframe_method
@pf.register_dataframe_method
@refactored_function(
message=(
"This function will be deprecated in a 1.x release. "
"Please use `pd.to_datetime` instead."
)
)
@deprecated_alias(column="column_name")
def convert_unix_date(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:
"""Convert unix epoch time into Python datetime format.
@@ -116,6 +119,11 @@ def convert_unix_date(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:

This method mutates the original DataFrame.

!!!note

This function will be deprecated in a 1.x release.
Please use `pd.to_datetime` instead.

Examples:
>>> import pandas as pd
>>> import janitor
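The rewritten pandas `convert_excel_date` above reduces the conversion to a day offset from the origin 1899-12-30. As a minimal standard-library sketch of that arithmetic (not the library's implementation), the snippet below applies the same origin to the serial numbers 39690 and 37118 that appear in this PR's docstring examples. The helper name `excel_serial_to_datetime` is hypothetical and exists only for illustration.

```python
import datetime as dt

# Origin taken from the diff above (origin="1899-12-30").
EXCEL_ORIGIN = dt.datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> dt.datetime:
    """Hypothetical helper: add `serial` days to the Excel origin."""
    return EXCEL_ORIGIN + dt.timedelta(days=serial)


print(excel_serial_to_datetime(39690))  # 2008-08-30 00:00:00
print(excel_serial_to_datetime(37118))  # 2001-08-15 00:00:00
```

The fractional part of a serial number is the time of day, so `0.5` lands on noon of the origin date.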
3 changes: 3 additions & 0 deletions janitor/polars/__init__.py
@@ -1,5 +1,6 @@
from .clean_names import clean_names, make_clean_names
from .complete import complete
from .dates_to_polars import convert_excel_date, convert_matlab_date
from .pivot_longer import pivot_longer, pivot_longer_spec
from .row_to_names import row_to_names

@@ -10,4 +11,6 @@
"make_clean_names",
"row_to_names",
"complete",
"convert_excel_date",
"convert_matlab_date",
]
112 changes: 112 additions & 0 deletions janitor/polars/dates_to_polars.py
@@ -0,0 +1,112 @@
from __future__ import annotations

from janitor.utils import import_message

from .polars_flavor import register_expr_method

try:
    import polars as pl
except ImportError:
    import_message(
        submodule="polars",
        package="polars",
        conda_channel="conda-forge",
        pip_install=True,
    )


@register_expr_method
def convert_excel_date(expr: pl.Expr) -> pl.Expr:
    """
    Convert Excel's serial date format into Python datetime format.

    Inspiration is from
    [Stack Overflow](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).

    Examples:
        >>> import polars as pl
        >>> import janitor.polars
        >>> df = pl.DataFrame({"date": [39690, 39690, 37118]})
        >>> df
        shape: (3, 1)
        ┌───────┐
        │ date  │
        │ ---   │
        │ i64   │
        ╞═══════╡
        │ 39690 │
        │ 39690 │
        │ 37118 │
        └───────┘
        >>> expression = pl.col('date').convert_excel_date().alias('date_')
        >>> df.with_columns(expression)
        shape: (3, 2)
        ┌───────┬────────────┐
        │ date  ┆ date_      │
        │ ---   ┆ ---        │
        │ i64   ┆ date       │
        ╞═══════╪════════════╡
        │ 39690 ┆ 2008-08-30 │
        │ 39690 ┆ 2008-08-30 │
        │ 37118 ┆ 2001-08-15 │
        └───────┴────────────┘

    !!! info "New in version 0.28.0"

    Returns:
        A polars Expression.
    """  # noqa: E501
    expression = pl.duration(days=expr)
    expression += pl.date(year=1899, month=12, day=30)
    return expression


@register_expr_method
def convert_matlab_date(expr: pl.Expr) -> pl.Expr:
    """
    Convert Matlab's serial date number into Python datetime format.

    Implementation is from
    [Stack Overflow](https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).


    Examples:
        >>> import polars as pl
        >>> import janitor.polars
        >>> df = pl.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
        >>> df
        shape: (4, 1)
        ┌───────────────┐
        │ date          │
        │ ---           │
        │ f64           │
        ╞═══════════════╡
        │ 737125.0      │
        │ 737124.815863 │
        │ 737124.4985   │
        │ 737124.0      │
        └───────────────┘
        >>> expression = pl.col('date').convert_matlab_date().alias('date_')
        >>> df.with_columns(expression)
        shape: (4, 2)
        ┌───────────────┬─────────────────────────┐
        │ date          ┆ date_                   │
        │ ---           ┆ ---                     │
        │ f64           ┆ datetime[μs]            │
        ╞═══════════════╪═════════════════════════╡
        │ 737125.0      ┆ 2018-03-06 00:00:00     │
        │ 737124.815863 ┆ 2018-03-05 19:34:50.563 │
        │ 737124.4985   ┆ 2018-03-05 11:57:50.399 │
        │ 737124.0      ┆ 2018-03-05 00:00:00     │
        └───────────────┴─────────────────────────┘

    !!! info "New in version 0.28.0"

    Returns:
        A polars Expression.
    """  # noqa: E501
    # https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python
    expression = expr.sub(719529).mul(86_400_000)
    expression = pl.duration(milliseconds=expression)
    expression += pl.datetime(year=1970, month=1, day=1)
    return expression
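The `convert_matlab_date` expression above subtracts 719529 (the Matlab datenum of 1970-01-01), scales the remaining days to milliseconds, and adds the result to the Unix epoch. A minimal standard-library sketch of the same arithmetic follows; it is an illustration under those assumptions, not the polars implementation, and the helper name `matlab_datenum_to_datetime` and the constant names are hypothetical.

```python
import datetime as dt

# Constants taken from the expression above.
MATLAB_DATENUM_OF_UNIX_EPOCH = 719_529  # Matlab datenum for 1970-01-01
MS_PER_DAY = 86_400_000
UNIX_EPOCH = dt.datetime(1970, 1, 1)


def matlab_datenum_to_datetime(datenum: float) -> dt.datetime:
    """Hypothetical helper mirroring the expression's arithmetic."""
    # Days since the Unix epoch, expressed in milliseconds.
    milliseconds = (datenum - MATLAB_DATENUM_OF_UNIX_EPOCH) * MS_PER_DAY
    return UNIX_EPOCH + dt.timedelta(milliseconds=milliseconds)


print(matlab_datenum_to_datetime(737125.0))  # 2018-03-06 00:00:00
print(matlab_datenum_to_datetime(737124.5))  # 2018-03-05 12:00:00
```

Inputs with long fractional parts, such as 737124.815863, are subject to floating-point rounding, which is why the pandas and polars docstrings in this PR show slightly different sub-second digits for the same value.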
12 changes: 0 additions & 12 deletions tests/functions/test_convert_excel_date.py
@@ -18,15 +18,3 @@ def test_convert_excel_date():
)

assert df["hire_date"].dtype == "M8[ns]"


@pytest.mark.functions
def test_convert_excel_date_with_string_data():
"""Raises ValueError if values of column are not numeric"""
df = pd.read_excel(
Path(pytest.EXAMPLES_DIR) / "notebooks" / "dirty_data.xlsx",
engine="openpyxl",
).clean_names()

with pytest.raises(ValueError):
df.convert_excel_date("certification")
11 changes: 11 additions & 0 deletions tests/polars/functions/test_convert_excel_date_polars.py
@@ -0,0 +1,11 @@
import polars as pl

import janitor.polars  # noqa: F401


def test_convert_excel_date():
    df = pl.DataFrame({"dates": [42580.3333333333]})

    expression = pl.col("dates").convert_excel_date().alias("dd")
    expression = df.with_columns(expression).get_column("dd")
    assert expression.dtype.is_temporal() is True
20 changes: 20 additions & 0 deletions tests/polars/functions/test_convert_matlab_date_polars.py
@@ -0,0 +1,20 @@
import polars as pl

import janitor.polars  # noqa: F401


def test_convert_matlab_date():
    df = pl.DataFrame(
        {
            "dates": [
                733_301.0,
                729_159.0,
                734_471.0,
                737_299.563_296_356_5,
                737_300.000_000_000_0,
            ]
        }
    )
    expression = pl.col("dates").convert_matlab_date().alias("dd")
    expression = df.with_columns(expression).get_column("dd")
    assert expression.dtype.is_temporal() is True