Tools to compute multiple indexes on SWH's raw contents:
- content:
- mimetype
- fossology-license
- metadata
- origin:
- metadata (intrinsic, using the content indexer; and extrinsic)
An indexer is in charge of:
- looking up objects
- extracting information from those objects
- store those information in the swh-indexer db
There are multiple indexers working on different object types:
- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers
Indexation procedure:
- receive batch of ids
- retrieve the associated data depending on object type
- compute for that object some index
- store the result to swh's storage
Current content indexers:
- mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype
- fossology-license (queue swh_indexer_fossology_license): compute the license
- metadata: translate file from an ecosystem-specific formats to JSON-LD (using schema.org/CodeMeta vocabulary)
Current origin indexers:
- metadata: translate file from an ecosystem-specific formats to JSON-LD (using schema.org/CodeMeta and ForgeFed vocabularies)