Unable to allocate 1.72 TiB #1597

Closed
3 tasks done
ibobak opened this issue May 29, 2024 · 1 comment

ibobak commented May 29, 2024

Current Behaviour

I have a dataset with 5.6 million rows.

It looks like you are doing something wrong with memory allocation. I have 256 GB of RAM and a 44-core CPU, yet this is not enough.

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[52], line 25
     23 pdf_hb_props_flat = df_hb_props_flat.toPandas()
     24 report = ProfileReport(pdf_hb_props_flat, title='Props')
---> 25 report.to_file(f"profiling/{app_code}.html")

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:379, in ProfileReport.to_file(self, output_file, silent)
    376         self.config.html.assets_prefix = str(output_file.stem) + "_assets"
    377     create_html_assets(self.config, output_file)
--> 379 data = self.to_html()
    381 if output_file.suffix != ".html":
    382     suffix = output_file.suffix

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:496, in ProfileReport.to_html(self)
    488 def to_html(self) -> str:
    489     """Generate and return complete template as lengthy string
    490         for using with frameworks.
    491 
   (...)
    494 
    495     """
--> 496     return self.html

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:292, in ProfileReport.html(self)
    289 @property
    290 def html(self) -> str:
    291     if self._html is None:
--> 292         self._html = self._render_html()
    293     return self._html

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:409, in ProfileReport._render_html(self)
    406 def _render_html(self) -> str:
    407     from ydata_profiling.report.presentation.flavours import HTMLReport
--> 409     report = self.report
    411     with tqdm(
    412         total=1, desc="Render HTML", disable=not self.config.progress_bar
    413     ) as pbar:
    414         html = HTMLReport(copy.deepcopy(report)).render(
    415             nav=self.config.html.navbar_show,
    416             offline=self.config.html.use_local_assets,
   (...)
    424             version=self.description_set.package["ydata_profiling_version"],
    425         )

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:286, in ProfileReport.report(self)
    283 @property
    284 def report(self) -> Root:
    285     if self._report is None:
--> 286         self._report = get_report_structure(self.config, self.description_set)
    287     return self._report

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/profile_report.py:268, in ProfileReport.description_set(self)
    265 @property
    266 def description_set(self) -> BaseDescription:
    267     if self._description_set is None:
--> 268         self._description_set = describe_df(
    269             self.config,
    270             self.df,
    271             self.summarizer,
    272             self.typeset,
    273             self._sample,
    274         )
    275     return self._description_set

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/describe.py:74, in describe(config, df, summarizer, typeset, sample)
     72 # Variable-specific
     73 pbar.total += len(df.columns)
---> 74 series_description = get_series_descriptions(
     75     config, df, summarizer, typeset, pbar
     76 )
     78 pbar.set_postfix_str("Get variable types")
     79 pbar.total += 1

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/multimethod/__init__.py:369, in multimethod.__call__(self, *args, **kwargs)
    367 func = self.dispatch(*args)
    368 try:
--> 369     return func(*args, **kwargs)
    370 except TypeError as ex:
    371     raise DispatchError(f"Function {func.__code__}") from ex

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/pandas/summary_pandas.py:99, in pandas_get_series_descriptions(config, df, summarizer, typeset, pbar)
     96 else:
     97     # TODO: use `Pool` for Linux-based systems
     98     with multiprocessing.pool.ThreadPool(pool_size) as executor:
---> 99         for i, (column, description) in enumerate(
    100             executor.imap_unordered(multiprocess_1d, args)
    101         ):
    102             pbar.set_postfix_str(f"Describe variable:{column}")
    103             series_description[column] = description

File /opt/anaconda3/envs/generic/lib/python3.10/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout)
    871 if success:
    872     return value
--> 873 raise value

File /opt/anaconda3/envs/generic/lib/python3.10/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/pandas/summary_pandas.py:79, in pandas_get_series_descriptions.<locals>.multiprocess_1d(args)
     69 """Wrapper to process series in parallel.
     70 
     71 Args:
   (...)
     76     A tuple with column and the series description.
     77 """
     78 column, series = args
---> 79 return column, describe_1d(config, series, summarizer, typeset)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/multimethod/__init__.py:369, in multimethod.__call__(self, *args, **kwargs)
    367 func = self.dispatch(*args)
    368 try:
--> 369     return func(*args, **kwargs)
    370 except TypeError as ex:
    371     raise DispatchError(f"Function {func.__code__}") from ex

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/pandas/summary_pandas.py:57, in pandas_describe_1d(config, series, summarizer, typeset)
     54     vtype = typeset.detect_type(series)
     56 typeset.type_schema[series.name] = vtype
---> 57 return summarizer.summarize(config, series, dtype=vtype)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/summarizer.py:42, in BaseSummarizer.summarize(self, config, series, dtype)
     34 def summarize(
     35     self, config: Settings, series: pd.Series, dtype: Type[VisionsBaseType]
     36 ) -> dict:
     37     """
     38 
     39     Returns:
     40         object:
     41     """
---> 42     _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
     43     return summary

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/handler.py:62, in Handler.handle(self, dtype, *args, **kwargs)
     60 funcs = self.mapping.get(dtype, [])
     61 op = compose(funcs)
---> 62 return op(*args)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/handler.py:21, in compose.<locals>.func.<locals>.func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/handler.py:21, in compose.<locals>.func.<locals>.func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/handler.py:21, in compose.<locals>.func.<locals>.func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/handler.py:17, in compose.<locals>.func.<locals>.func2(*x)
     16 def func2(*x) -> Any:
---> 17     res = g(*x)
     18     if type(res) == bool:
     19         return f(*x)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/multimethod/__init__.py:369, in multimethod.__call__(self, *args, **kwargs)
    367 func = self.dispatch(*args)
    368 try:
--> 369     return func(*args, **kwargs)
    370 except TypeError as ex:
    371     raise DispatchError(f"Function {func.__code__}") from ex

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/summary_algorithms.py:73, in series_hashable.<locals>.inner(config, series, summary)
     71 if not summary["hashable"]:
     72     return config, series, summary
---> 73 return fn(config, series, summary)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/summary_algorithms.py:90, in series_handle_nulls.<locals>.inner(config, series, summary)
     87 if series.hasnans:
     88     series = series.dropna()
---> 90 return fn(config, series, summary)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/pandas/describe_numeric_pandas.py:131, in pandas_describe_numeric_1d(config, series, summary)
    124 stats.update(
    125     {
    126         "mad": mad(present_values),
    127     }
    128 )
    130 if chi_squared_threshold > 0.0:
--> 131     stats["chi_squared"] = chi_square(finite_values)
    133 stats["range"] = stats["max"] - stats["min"]
    134 stats.update(
    135     {
    136         f"{percentile:.0%}": value
    137         for percentile, value in series.quantile(quantiles).to_dict().items()
    138     }
    139 )

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/ydata_profiling/model/summary_algorithms.py:57, in chi_square(values, histogram)
     53 def chi_square(
     54     values: Optional[np.ndarray] = None, histogram: Optional[np.ndarray] = None
     55 ) -> dict:
     56     if histogram is None:
---> 57         bins = np.histogram_bin_edges(values, bins="auto")
     58         histogram, _ = np.histogram(values, bins=bins)
     59     if len(histogram) == 0 or np.sum(histogram) == 0:

File <__array_function__ internals>:180, in histogram_bin_edges(*args, **kwargs)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/numpy/lib/histograms.py:669, in histogram_bin_edges(a, bins, range, weights)
    472 r"""
    473 Function to calculate only the edges of the bins used by the `histogram`
    474 function.
   (...)
    666 
    667 """
    668 a, weights = _ravel_and_check_weights(a, weights)
--> 669 bin_edges, _ = _get_bin_edges(a, bins, range, weights)
    670 return bin_edges

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/numpy/lib/histograms.py:446, in _get_bin_edges(a, bins, range, weights)
    443         bin_type = np.result_type(bin_type, float)
    445     # bin edges must be computed
--> 446     bin_edges = np.linspace(
    447         first_edge, last_edge, n_equal_bins + 1,
    448         endpoint=True, dtype=bin_type)
    449     return bin_edges, (first_edge, last_edge, n_equal_bins)
    450 else:

File <__array_function__ internals>:180, in linspace(*args, **kwargs)

File /opt/anaconda3/envs/generic/lib/python3.10/site-packages/numpy/core/function_base.py:135, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
    132     dtype = dt
    134 delta = stop - start
--> 135 y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
    136 # In-place multiplication y *= delta/div is faster, but prevents the multiplicant
    137 # from overriding what class is produced, and thus prevents, e.g. use of Quantities,
    138 # see gh-7142. Hence, we multiply in place only for standard scalar types.
    139 _mult_inplace = _nx.isscalar(delta)

MemoryError: Unable to allocate 1.72 TiB for an array with shape (235833781945,) and data type float64
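
The size in the error message follows directly from the reported array shape. The sketch below only redoes that arithmetic with the figures quoted in this issue (the shape from the `MemoryError` and the row count from the Data Description); the helper name `requested_tib` is made up for illustration.

```python
# Figures copied from this issue; nothing here is measured.
N_ELEMENTS = 235_833_781_945   # array shape reported in the MemoryError
BYTES_PER_FLOAT64 = 8          # a float64 value occupies 8 bytes
ROW_COUNT = 5_654_761          # row count given in the Data Description


def requested_tib(n_elements: int, itemsize: int = BYTES_PER_FLOAT64) -> float:
    """Return the size of a 1-D array in TiB (1 TiB = 2**40 bytes)."""
    return n_elements * itemsize / 2**40


size_tib = requested_tib(N_ELEMENTS)
edges_per_row = N_ELEMENTS / ROW_COUNT

print(f"requested allocation: {size_tib:.2f} TiB")
print(f"bin edges per data row: {edges_per_row:,.0f}")
```

In other words, the failing `np.linspace` call was asked for tens of thousands of histogram bin edges per data row, so the allocation size is driven by the bin count that `bins="auto"` computed, not by the size of the DataFrame itself.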

Expected Behaviour

The report should be generated without an out-of-memory error.

Data Description

row count: 5,654,761

root
|-- bl: double (nullable = true)
|-- bs: string (nullable = true)
|-- d: double (nullable = true)
|-- so: string (nullable = true)
|-- ta: double (nullable = true)
|-- tc: long (nullable = true)
|-- tnt: double (nullable = true)
|-- tog: double (nullable = true)
|-- tt: double (nullable = true)

Code that reproduces the bug

from ydata_profiling import ProfileReport

# APP_CODES, CI and the helpers load, change, count, get_props, sql,
# flattened_select and ps are defined elsewhere and are not shown in this report.
for app_code in APP_CODES:
    print("=" * 100)
    print(app_code)
    print("=" * 100)

    # Load the processed events for this app and keep only heartbeat events.
    df_game = load(CI, f"flyanalytics_processed/{app_code}", a_format="parquet", a_row_count=True, a_ps=True)
    df_hb = change(df_game, a_where="event_type = 'heartbeat'")
    count(df_hb)

    # Select the heartbeat event properties.
    fields = [f"event_properties.{p}" for p in get_props(app_code, "heartbeat")]
    fields_str = "\n\t, ".join(fields)
    query = f"""
        select
            {fields_str}
        from [0]
    """
    df_hb_props = sql(query, df_hb)

    df_hb_props_flat = flattened_select(df_hb_props, a_replacements={"event_properties": "ep"})
    ps(df_hb_props_flat)

    # Convert to a pandas DataFrame and profile it; to_file() is where the MemoryError is raised.
    pdf_hb_props_flat = df_hb_props_flat.toPandas()
    report = ProfileReport(pdf_hb_props_flat, title='Props')
    report.to_file(f"profiling/{app_code}.html")


pandas-profiling version

4.8.3

Dependencies

a2wsgi==1.10.4
aiohttp==3.9.5
aiosignal==1.3.1
alembic==1.13.1
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.7.1
apache-airflow-providers-common-sql==1.13.0
apache-airflow-providers-ftp==3.9.0
apache-airflow-providers-http==4.11.0
apache-airflow-providers-imap==3.6.0
apache-airflow-providers-sqlite==3.8.0
apispec==6.6.1
argcomplete==3.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arviz==0.16.1
asgiref==3.8.1
asn1crypto==1.5.1
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
backcall==0.2.0
beautifulsoup4==4.12.3
bleach==6.1.0
blinker==1.8.2
boto3==1.28.29
botocore==1.31.85
cachelib==0.9.0
cachetools==5.3.3
cattrs==23.2.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
colorlog==4.8.0
comm==0.2.2
ConfigUpdater==3.2
connexion==3.0.6
cons==0.4.6
contourpy==1.2.1
cron-descriptor==1.4.3
croniter==2.0.5
cryptography==42.0.7
cycler==0.12.1
dacite==1.8.1
databricks-cli==0.18.0
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker==6.1.3
docutils==0.21.2
email-validator==1.3.1
entrypoints==0.4
et-xmlfile==1.1.0
etuples==0.3.9
exceptiongroup==1.2.0
executing==2.0.1
fastjsonschema==2.19.1
fastprogress==1.0.3
filelock==3.14.0
Flask==2.2.5
Flask-AppBuilder==4.3.6
Flask-Babel==2.0.0
Flask-Caching==2.3.0
Flask-JWT-Extended==4.6.0
Flask-Limiter==3.7.0
Flask-Login==0.6.3
Flask-Session==0.8.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.2.1
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
gitdb==4.0.11
GitPython==3.1.43
google-re2==1.1.20240501
googleapis-common-protos==1.63.0
graphviz==0.20.3
greenlet==3.0.3
grpcio==1.64.0
gunicorn==20.1.0
h11==0.14.0
h5netcdf==1.3.0
h5py==3.11.0
htmlmin==0.1.12
httpcore==1.0.5
httpx==0.27.0
humanfriendly==10.0
idna==3.7
ImageHash==4.3.1
importlib-metadata==6.11.0
importlib_resources==6.4.0
inflection==0.5.1
ipykernel==6.19.2
ipynb-py-convert==0.4.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
isoduration==20.11.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-contrib-core==0.4.2
jupyter-contrib-nbextensions==0.7.0
jupyter-events==0.10.0
jupyter-highlight-selected-word==0.2.0
jupyter-lsp==2.2.5
jupyter-nbextensions-configurator==0.6.3
jupyter_client==7.4.4
jupyter_core==5.7.2
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-execute-time==3.1.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
lazy-object-proxy==1.10.0
lazyprofiler==0.1.1
limits==3.12.0
linkify-it-py==2.0.3
llvmlite==0.42.0
lockfile==0.12.2
logical-unification==0.4.6
lxml==5.2.2
Mako==1.3.5
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.2
marshmallow-oneofschema==3.1.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.8.4
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.1
mdurl==0.1.2
miniKanren==1.0.3
mistune==3.0.2
mlflow==2.5.0
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multimethod==1.11.2
multipledispatch==1.0.0
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
notebook==7.2.0
notebook_shim==0.2.4
numba==0.59.1
numpy==1.23.5
oauthlib==3.2.2
onnx==1.15.0
onnxconverter-common==1.14.0
onnxmltools==1.12.0
onnxruntime==1.17.1
openai==1.22.0
openpyxl==3.1.2
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
optuna==3.5.0
optuna-fast-fanova==0.0.4
ordered-set==4.1.0
overrides==7.7.0
packaging==23.2
pandas==2.2.2
pandas-datareader==0.10.0
pandasql==0.7.3
pandocfilters==1.5.1
parso==0.8.4
pathspec==0.12.1
patsy==0.5.6
pendulum==2.1.2
pexpect==4.9.0
pgcopy==1.6.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
prison==0.2.1
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==3.20.2
psutil==5.9.8
psycopg2==2.9.9
psycopg2-binary==2.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==12.0.1
pycountry==23.12.11
pycparser==2.22
pydantic==2.7.0
pydantic_core==2.18.1
Pygments==2.18.0
PyJWT==2.8.0
pymc==5.6.0
pyparsing==3.1.2
pytensor==2.12.3
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
python-multipart==0.0.9
python-nvd3==0.16.0
python-slugify==8.0.4
pytz==2023.4
pytzdata==2020.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
querystring-parser==1.2.4
redshift-connector==2.0.911
referencing==0.35.1
requests==2.32.2
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rich-argparse==1.4.0
rpds-py==0.18.1
s3transfer==0.6.2
scikit-learn==1.3.2
scipy==1.12.0
scramp==1.4.5
seaborn==0.12.2
Send2Trash==1.8.3
setproctitle==1.3.3
shap==0.42.1
six==1.16.0
skl2onnx==1.16.0
slicer==0.0.7
smart-open==6.3.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
spark_framework @ git+https://github.com/ibobak/spark_framework.git@8dcf0f5b29e71721d4d6069a76ae4fde1e7e7bde
SQLAlchemy==1.4.49
SQLAlchemy-JSONField==1.0.2
SQLAlchemy-Utils==0.41.2
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
statsmodels==0.14.2
sympy==1.12
tabulate==0.9.0
tenacity==8.0.1
termcolor==2.4.0
terminado==0.18.1
text-unidecode==1.3
threadpoolctl==3.5.0
tinycss2==1.3.0
tomli==2.0.1
toolz==0.12.1
tornado==6.2
tqdm==4.66.2
traitlets==5.9.0
typeguard==4.3.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.0
tzdata==2024.1
uc-micro-py==1.0.3
unicodecsv==0.14.1
uri-template==1.3.0
urllib3==2.0.7
visions==0.7.6
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.0.3
widgetsnbextension==3.5.2
wordcloud==1.9.3
wrapt==1.16.0
WTForms==3.1.2
xarray==2024.3.0
xarray-einstats==0.7.0
xgboost==2.0.2
yarl==1.9.4
ydata-profiling==4.8.3
zipp==3.18.2

OS

Ubuntu 22.04

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.

fabclmnt (Contributor) commented Jul 9, 2024

Hi @ibobak ,

If you are looking to work with datasets as large as the one you've mentioned, please check YData Fabric Data Catalog. ydata-profiling was not designed for production workloads, so this is not expected to be supported.

fabclmnt closed this as completed Jul 9, 2024