Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dependencies] List fugue as optional dependency #631

Open
sankarsiva123 opened this issue Sep 7, 2023 · 14 comments
Open

[Dependencies] List fugue as optional dependency #631

sankarsiva123 opened this issue Sep 7, 2023 · 14 comments

Comments

@sankarsiva123
Copy link

Description

I'm an actively using this package, I've noticed that the current list of dependencies includes the libraries Plotly and Fugue. While these libraries are undoubtedly powerful and useful, I believe it would be beneficial to make them optional dependencies rather than mandatory ones.

Use case

No response

@jmoralez
Copy link
Member

jmoralez commented Sep 7, 2023

Hey @sankarsiva123, thanks for using statsforecast. plotly was moved to be an optional dependency in #600, so I'll rename this to reference fugue instead.

I believe this issue is similar to #596 and we're working on reducing the total dependencies' size, hopefully it will be done for the next release.

@jmoralez jmoralez changed the title Would be nice if we were able to install Satsforecast pakage with plotly as optional dependency [Dependencies] List fugue as optional dependency Sep 7, 2023
@sankara-nps
Copy link

@jmoralez Thanks for the reply,
Even matplotlib made as optional dependency would nice .

@jmoralez
Copy link
Member

jmoralez commented Sep 8, 2023

Yes, we're going to make the package as lightweight as possible. We've moved the plotting to utilsforecast, which will be a dependency for this library but that only depends on numpy and pandas (matplotlib, polars, etc will be extras).

@vspinu
Copy link

vspinu commented Sep 14, 2023

We are no longer able to use statsforecast v1.5 on our spark cluster due to the dependency conflicts of the fugue dependencies. They require antlr4 higher than that what spark on java side is using. And that interferes with our cluster setup.

Given that fugue is still a pretty young project and doesn't seem to be required for the core functionality of statsforecast, adding it as a hard dependency seems a bit premature. Could we hope for a relatively quick "fix" on this one? Many thanks!!!

@jmoralez
Copy link
Member

We're working to keep only the minimum required dependencies in the next release. @kvnkho do you have any suggestions for @vspinu's problem?

@kvnkho
Copy link
Collaborator

kvnkho commented Sep 15, 2023

Hi @vspinu , it would require quite a bit of refactoring and would actually be a breaking change for many users as it breaks the core interface (will explain). The best we can do for you is lower the dependency on fugue-sql-antlr, which we'll do sometime. I talked to @goodwanghan . The rest of this comment is just more detail.

The reason Fugue is a core dependency is because it allows this syntax:

sf = StatsForecast(
    models=[AutoETS(season_length=7)],
    freq='D',
)
sf.forecast(df=df, h=horizon)

to work off the shelf with any DataFrame (Spark, Dask, Ray, Pandas). To support each of those without Fugue, there would have to be significant forking in the codebase. Think if-else and try statements and imports in random places. Fugue consolidates that. Note that removing Fugue would be the breaking change as a lot of people are already using this syntax on top of Spark specifically.

A bit of history, Fugue was made into an optional dependency at first. In order to parallelize Statsforecast, users would use code like this:

from statsforecast.distributed.utils import forecast
from statsforecast.distributed.fugue import FugueBackend

spark = SparkSession.builder.getOrCreate()
backend = FugueBackend(spark, {"fugue.spark.use_pandas_udf":True})

forecast(df, 
         [AutoARIMA()], 
         freq="D", 
         h=7, 
         parallel=backend).toPandas()

This introduced a lot more code for end users to write, and then other users did not want to learn about Fugue. They also could not use the same base StatsForecast class, instead having to call this utility function. To consolidate public-facing classes and keep the same interface for distributed execution and local execution, then Fugue needed to become a core dependency to make the experience invisible and seamless.

On size, Fugue is around 10MB so it's pretty trivial for installation.

Fugue utility functions are also used in the core code right now for things like identifying the input DataFrame type, conditional importing, and configuring the parallel backends. While not impossible to fork and create Fugue-specific code paths, it decreases code maintainability and quality and I am not sure it's worth it here.

In the Fugue issue, you mentioned you're aware it's a corner case because you guys actually use ANTLR on Python, which is quite uncommon. Given there are already a lot of statsforecast users using Spark with the base StatsForecast class, it's hard to do what would be a breaking change to support an edge case.

Thus, the best resolution for this is to lower the dependency on ANTLR.

@goodwanghan
Copy link
Contributor

goodwanghan commented Sep 15, 2023

@vspinu @jmoralez

I just want to reiterate some of the facts:

  1. Fugue is the core part of statsforecast to make the lib run seamlessly on different distributed environment
  2. Antler dependency is planned to be removed from Fugue's core dependency on 0.9.0
  3. This is an edge case only when you want consistency between Spark's Java legacy antlr requirement and Fugue's python requirement. Fugue + Spark is a very common combination, no other similar complaints so far.

Here is the resolution: fugue-project/fugue-sql-antlr#19

We will temporarily remove the lower constraint of antlr, and make this requirement into runtime. So as long as you don't use Fugue SQL, you have more flexibility on antlr version.

In fugue>=0.9.0, the core will only depend on pandas and arrow.

@goodwanghan
Copy link
Contributor

@sankarsiva123 @jmoralez In a few weeks, we will release fugue 0.9.0 whose core dependency (out of the fugue-project) is only pandas and pyarrow. Fugue integration is supposed to handle all dataframes other than pandas in a seamless and non-invasive way, with Fugue, statsforecast will expand the list of supported dataframes without any code change, that is why it is in the core dependency. In addition, the size of fugue project packages are very small and totally pythonic.

@mergenthaler
Copy link
Member

Thanks, @goodwanghan and @kvnkho. I also believe that Fugue is an essential aspect of this library and the Nixtlaverse. I think the 0.9.0 Fugue release will fully address the issues raised by the other contributors and make the dependencies of statsforecast even easier.

@vspinu
Copy link

vspinu commented Sep 15, 2023

Thanks @goodwanghan and @kvnkho for your impressive efforts and extensive explanation! Looking forward to 0.9.0.

@goodwanghan
Copy link
Contributor

Thanks @goodwanghan and @kvnkho for your impressive efforts and extensive explanation! Looking forward to 0.9.0.

@vspinu we have released the new version of fugue-sql-antlr yesterday, if you upgrade it to 0.1.7 the conflict should have been resolved, can you try?

@vspinu
Copy link

vspinu commented Sep 16, 2023

@goodwanghan It takes a few days for packages to propagate to our internal mirrors for security reasons. I will report back to the original issue once I verify it. I don't see why it shouldn't work though.

@goodwanghan
Copy link
Contributor

@vspinu have you verified whether the new package resolves the conflicts?

@vspinu
Copy link

vspinu commented Oct 18, 2023

Sorry for the late reply. It takes 2 weeks for us to propagate the package and then I got derailed to another project. The issue with antlr dependency still persists. I have provided more detail in the original.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants