Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoArchiver issues #6

Open
NicolaDonelli opened this issue Sep 6, 2021 · 4 comments
Open

MongoArchiver issues #6

NicolaDonelli opened this issue Sep 6, 2021 · 4 comments

Comments

@NicolaDonelli
Copy link
Contributor

NicolaDonelli commented Sep 6, 2021

There are a few possible issues with cgnal.data.layer.mongo.archivers.MongoArchiver class.

  1. the signature of "retrieve" method is not compatible with the one of Archiver, since Archiver.retrieve requires *args and **kwargs. I'd propose to change the signature of the ABC making it compatible with our requirements for Pandas and Mongo Archivers
  2. to make archiveOne and archiveMany signature coherent, so that archive could have a simpler output type, wouldn't it be better for this method to return self.archiveMany([obj]) (like we did with PandasArchiver)?
  3. 'archive' method's output type is not consistent with its' ancestor's return type (that should be 'MongoArchiver'). I'd propose to fix this returning self instead of those UpdateResults

@deusebio @memoclaudio @aiola

@NicolaDonelli NicolaDonelli added the good first issue Good for newcomers label Sep 7, 2021
@deusebio
Copy link
Contributor

Being the "archive" a side-effect probably makes also sense to return None (and this would also give the same output between archiveOne and archiveMany, which should always be None). Another possibility (but I would do it later in the future when we integrate Mondas nicely in the package) would be to return an Either, in order to also handle errors functionally.

@deusebio
Copy link
Contributor

deusebio commented Sep 14, 2021

about the arguments of retrieve it is actually quite a pain. Unfortunately we don't have an abstraction for "Query" yet, that is an object that represents the filters one wants to apply when querying persistence layers that is agnostic from the actual implementation (SQL wants strings, Pandas we could use functions, MongoDB we use dicts). I agree, this is really bothering me, but unfortunately, I don't quite have an elegant solution for handling this.

What we started to do in CIB News is to extend from the Archiver, implement methods that do the querying and call the retrieve with the generic output, e.g.

class DocumentsArchiver(Archiver):
    @abstractmethod
    def byPublishDate(publishDate: str): ...

class MongoDocumentsArchiver(MongoArchiver, DocumentsArchiver):
    
    def  byPublishDate(publishDate: str):
        return self.retrieve({"publishDate": publishDate})

class PandasDocumentsArchiver(PandasArchiver, DocumentsArchiver):
    
    def  byPublishDate(publishDate: str):
        return self.retrieve(lambda df: df[df["publishDate"] == publishDate])

In the code, we never use retrieve directly, but we generally use the custom retrieving methods, such as byPublishDate

But maybe, here, @nromeoarena can advise what it is best to do

@memoclaudio
Copy link

In my opinion, we should avoid function returning Node. For functions like archive, I do agree with @deusebio that Either could be the best solution. It will at least return the status of the operation in order to track errors.

@memoclaudio
Copy link

memoclaudio commented Oct 1, 2021

As an input object for the archive, I also agree with @deusebio and we should think more about a "Query" abstract object that will implement query in a specific language like QueryMongo, QueryPandas, QuerySQL. We can then create a factory that actually builds specific queries instead of putting the logic of a query inside the Archiver. As example

class ClientQueryBuilder(MongoQueryBuilder): def retrieveByDate(data :str) -> MongoQuery: def retrieveClientByRegion(region: str) -> MongoQuery:

the archiver will then contain only the archive method def archive(query: Query) -> Either and def retrieve(query: Query) -> Either, according to this refactoring the archiver will only contain the "connection to the database" logic.

Moreover if we centralize the querybulder/queryfactory we have a centralized point where we actually build the query and we do not need to write the same query in different part of the code without touching the archiver that will be finally independent form business logic.

@deusebio deusebio added avanced issue and removed good first issue Good for newcomers labels Jan 31, 2022
@NicolaDonelli NicolaDonelli transferred this issue from CGnal/cgnal-data-ingestion Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants