Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow callbacks as background jobs #1215

Closed
akostadinov opened this issue Jan 25, 2022 · 14 comments
Closed

allow callbacks as background jobs #1215

akostadinov opened this issue Jan 25, 2022 · 14 comments

Comments

@akostadinov
Copy link
Contributor

akostadinov commented Jan 25, 2022

Current project I'm working on tries to do index work as background jobs. Where index worker method is like this:

  def perform(model)
    callback = ThinkingSphinx::RealTime.callback_for(model.class.name.underscore)
    callback&.after_commit model
  end

This mostly works but not for models using STI (single-table inheritance). For models where index is defined on the base_class, then one has to use model.class.base_class and for models where index is on descendant, then #base_class shouldn't be used. So it is an error prone and ugly business.

Would be useful to be able to add callbacks like:

ThinkingSphinx::Callbacks.append(self, {behaviours: [:real_time], background: true})

Or some other way to allow that easily.

It also applies to ts:index. It would be useful to queue indexing in background job batches. Presently we use the above indexation worker and can't benefit from the new clean-up functionality.

== thinking-sphinx 5.4 on rails 5 (in process of upgrading).

/related to #1038/

@akostadinov
Copy link
Contributor Author

akostadinov commented Feb 3, 2022

FYI this is how I implemented it in the project:

class SphinxIndexationWorker < ApplicationJob
  def perform(model, id)
    indices_for_model(model).each do |index|
      instance = index.scope.find_by(model.primary_key => id)

      if instance
        reindex(index, instance)
      else
        delete_from_index(index, id)
      end
    end
  end

  private

  def indices_for_model(model)
    ThinkingSphinx::Configuration.instance.index_set_class.new(classes: [model])
  end

  def reindex(index, instance)
    # some indices are defined on model#base_class (*Plan) some on model itself (CMS::Page)
    callback = ThinkingSphinx::RealTime.callback_for(index.model.name.underscore)
    callback&.after_commit instance
  end

  def delete_from_index(index, *ids)
    # This is how deletion is performed by ThinkingSphinx::ActiveRecord::Callbacks::DeleteCallbacks#delete_from_sphinx
    ThinkingSphinx::Deletion.perform index, ids
  end
end

But it is good if it allows customization for different models. Because for accounts I have a little optimization to also soft delete dependent accounts:

    indices_for_model(model).each do |index|
      if account && index.scope.find_by(pk => id)
        reindex(index, account)
      elsif account && !account.master?
        delete_from_index(index, id, *account.buyers.pluck(:id))
      else
        delete_from_index(index, id)
      end
    end

@pat
Copy link
Owner

pat commented Feb 17, 2022

I have considered allowing real-time updates/deletions to occur via background jobs… but I've held off because I'm not sure if it's much of a performance benefit?

If it happens immediately:

  • Potentially run SQL to pull in associated data for the Sphinx index schema.
  • Send the update statement to Sphinx.

Compared to it happening via a background job:

  • Store the job details in a data-store (e.g. Redis)
  • Load the instance from the database again.
  • Potentially run SQL to pull in associated data for the Sphinx index schema.
  • Send the update statement to Sphinx.

I guess it's very much an app-specific consideration. There's also the challenge of providing appropriate integration with ActiveJob vs background job gems directly - for example, my understanding is that Sidekiq's performance is better when used directly rather than via ActiveJob.

This mostly works but not for models using STI (single-table inheritance). For models where index is defined on the base_class, then one has to use model.class.base_class and for models where index is on descendant, then #base_class shouldn't be used. So it is an error prone and ugly business.

This does sound painful - and I'm a little surprised, because I would hope that using the index_set_class with the classes option should navigate the hierarchy of inherited classes correctly. If you're finding that's not the case, then that sounds like a bug to me. Can you provide an example of where it's not working?

@pat
Copy link
Owner

pat commented Feb 17, 2022

And just to clarify: I'm not against background jobs for running callbacks in theory, but right now it feels like it involves adding too much code to Thinking Sphinx for what feels like an edge-case. I wonder if it could be done as a separate gem instead of adding to TS directly?

@akostadinov
Copy link
Contributor Author

akostadinov commented Feb 17, 2022

@pat, thank you for looking into it. I understand that it is hard to implement this in a way to fit everybody.

I don't know about ActiveJob vs direct Sidekiq performance penalties. In my current project we use both and handle a huge amount of jobs. I hope that later this year we will have better monitoring and data to see if there is any practical difference. In general I'm always in favor of using a standard instead of some implementation.

I think it is practical though, to first make it easy to implement a custom worker and then a common solution may more naturally occur.

Getting indices:

ThinkingSphinx::Configuration.instance.index_set_class.new(classes: [model])

(Re)index:

    callback = ThinkingSphinx::RealTime.callback_for(index.model.name.underscore)
    callback&.after_commit instance

Delete from index:

ThinkingSphinx::Deletion.perform index, ids

The above is messing up with library internals. If one can do

ThinkingSphinx::RealTime.index(instance)
ThinkingSphinx::Deletion.perform(model, id)

Or some equivalent easy way, that would be a huge improvement IMHO and allow indexing operations anywhere in object's lifecycle.

My issue with STI is most likely because I'm using library internal details to get the callbacks on the fly (as written above) and that's different from calling the actually registered callbacks on a model.

While on it, another benefit from doing indexing in the background is that this not only reduces request latency (Redis is supposedly fast) but also decouples requests from indexing so Sphinx errors would not fail it. Although Sphinx seems to be very stable.

pat added a commit that referenced this issue Feb 27, 2022
Prompted by discussion in #1215, this provides a direct interface to
perform index-related operations on an ActiveRecord index.

    processor = ThinkingSphinx::Processor.new(instance)
    processor.delete
    processor.upsert # real-time indices only
@pat
Copy link
Owner

pat commented Feb 27, 2022

This is a great suggestion. I've just added the following to the develop branch if you want to give it a shot:

processor = ThinkingSphinx::Processor.new(instance)

# to delete from all relevant indices:
processor.delete

# to add/update in all relevant real-time indices:
processor.upsert

While on it, another benefit from doing indexing in the background is that this not only reduces request latency (Redis is supposedly fast) but also decouples requests from indexing so Sphinx errors would not fail it. Although Sphinx seems to be very stable.

That's a good point about decoupling Sphinx errors - though yes, hopefully Sphinx is pretty stable with receiving updates! 😅 But perhaps this new processor class helps with your background job approach?

@akostadinov
Copy link
Contributor Author

In the background job I already don't have an instance. I only have an id. So it will not help for the delete case. Any suggestions?

pat added a commit that referenced this issue Jun 11, 2022
An evolution of f6657c9, as per feedback from @akostadinov in #1215. If
you’re dealing with a deleted record, then you may not have an instance
(and cannot retrieve it from the database, as it doesn’t exist). So, the
Processor class now allows for either an instance, or a model and id.

The model should be the actual class, not the string name of the class.
@pat
Copy link
Owner

pat commented Jun 11, 2022

Belatedly coming back to this, sorry for the delay! I've pushed a commit today to the develop branch that changes the possible arguments for ThinkingSphinx::Processor:

processor = ThinkingSphinx::Processor.new(instance: instance)
# or
processor = ThinkingSphinx::Processor.new(model: Model, id: 123)

Hopefully this covers what's needed for you with the delete case?

@akostadinov
Copy link
Contributor Author

I looked at the commit and it looks great. Thank you! I have some very high priority things to do at the moment but I hope to be able to try out this within 2 weeks.

@pat
Copy link
Owner

pat commented Dec 30, 2022

Half a year later, but finally pushed a new gem release that includes these changes (v5.5.0). Do let me know if you give it a shot and whether there's further evolutions that would be helpful!

In the meantime I'm going to close this issue, but happy to re-open later if required.

@pat pat closed this as completed Dec 30, 2022
@softbrada
Copy link

is there a chance something similar to be implemented for plain indices also ? so they can be reindexed in parallel?

@akostadinov
Copy link
Contributor Author

This fell through the cracks, sorry about that. I just now checked this better. It still not fully useful to upsert or delete depending on index scope 😬

Will you be open if I later submit a new method Processor#update that will check whether instance is part of the index scope and then decide to upsert or delete it from that index?

Presently I see that model.find is used which doesn't account for the scope.

@pat
Copy link
Owner

pat commented Jul 22, 2023

No worries about the delay @akostadinov - clearly I'm not being super prompt at responding to open-source things at the moment (life is extra busy this year).

I'm not entirely against a PR, I guess my concern is: is the behaviour of deleting if outside of a scope tied to your application, rather than a more general approach? (i.e. where deletions would only happen if the record is no longer in the database)

@pat
Copy link
Owner

pat commented Jul 22, 2023

@softbrada very sorry for the slow reply - perhaps you've found a different approach, but I just wanted to note that plain (SQL-backed) indices don't really handle parallelisation, because SQL-backed indices require rewriting files. If one index/file is being rewritten by one process, no other processes can/should touch it (and both Sphinx and Thinking Sphinx have some safeguards around this).

@akostadinov
Copy link
Contributor Author

@pat , please consider this pull request #1258

I think it is not application specific. If one specified an index scope, then most expected would be for that scope to filter which objects should be indexed and which - not. I found another issue that sounds similar or the same request - #1253 .

This obviously comes at an additional database roundtrip cost but since the new method is optional to use, then I think it should be fair.

The existing use cases should remain the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants