
Potter/improve v2 connector docs #3401

Open · wants to merge 21 commits into base: main
Conversation

potter-potter (Contributor)

No description provided.

- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic and based on the BLABLABLA.) In the case of the local source connector, it won't *download* files because they are already local, but for other source connectors there will be a `download` folder. Also note that the final file is named after the original file with a `.json` extension, since it has been partitioned. Not all output files will be named the same as the input file; for example, when a table is the source, the output is named based on a hash of BLABLABLA.
potter-potter (Contributor Author) · Jul 15, 2024

@rbiseck3 can you leave a comment here on the two BLABLAs

Contributor

The source connector chooses the filename of the downloaded content, but every cached file along the way is named with a hash of the current step combined with the previous step's hash.
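A minimal sketch of that hash-chaining scheme (a hypothetical helper for illustration, not the actual ingest code) might look like:

```python
import hashlib


def cached_filename(step_name: str, previous_hash: str, suffix: str = ".json") -> str:
    """Derive a deterministic cache filename by hashing the current step
    name together with the previous step's hash (hypothetical sketch)."""
    digest = hashlib.sha256(f"{step_name}:{previous_hash}".encode()).hexdigest()
    return digest[:12] + suffix


# Each step feeds its filename hash into the next, so reruns over the
# same inputs always produce the same cache filenames.
index_name = cached_filename("index", "")
partition_name = cached_filename("partition", index_name)
```

Because every name depends on the previous one, a change anywhere upstream changes every downstream filename, which is what makes the cache safe to reuse.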


https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py

If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
potter-potter (Contributor Author)

@rbiseck3 can you leave a comment on the BLABLABLA

Contributor

Suggested change
If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
If you look through the file you will notice these interfaces


* chroma_destination_entry - Registers the Chroma destination connector with the pipeline. (!!! LINK `unstructured/ingest/v2/processes/connectors/__init__.py`)

Note that the `chroma.py` file imports the official Chroma python package when it *creates* the client and not at the top of the file. This is so that BLABLABLA
potter-potter (Contributor Author)

@rbiseck3 can you leave a quick comment on the BLABLABLA

Contributor

Note that the `chroma.py` file imports the official Chroma python package when it *creates* the client and not at the top of the file. This allows the classes to be instantiated without error; a missing package only causes a runtime error when the client is created.
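The deferred-import pattern described here can be sketched as follows (placeholder module and class names for illustration, not the actual `chroma.py` code):

```python
class ChromaUploaderSketch:
    """Sketch of a connector class that defers its third-party import."""

    def __init__(self, collection_name: str):
        # No third-party import here: the class is safe to construct
        # even when the optional dependency is not installed.
        self.collection_name = collection_name
        self.client = None

    def create_client(self):
        # The optional dependency is imported only at connection time,
        # so a missing package surfaces as a runtime error here, not at
        # module import time.
        import chromadb_placeholder_pkg  # placeholder for the real package

        self.client = chromadb_placeholder_pkg.Client()


uploader = ChromaUploaderSketch("my-collection")  # succeeds without the package
```

Instantiation and registration work everywhere; only a pipeline that actually connects to the destination needs the package installed.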

@potter-potter potter-potter marked this pull request as ready for review August 13, 2024 14:52
@@ -0,0 +1,353 @@
# Developing V2 Connectors


Suggestion: Use "sentence case" for all headings instead of "Title Case."

## Intro
The Unstructured open source repo processes documents (artifacts) in a pipeline. The Source and Destination connectors sit at the front and back of the pipeline. For a visual example see the flow diagram at the bottom (link to bottom).


Hmm...The "repo" doesn't process them. Perhaps "library" instead? Also, coming very soon, the open-source library won't be able to process them either, based on the move that was begun over to unstructured-ingest. Perhaps "Unstructured Ingest" instead?


Suggestion: Use lowercase for names of things, here and throughout, unless it is a product name or feature name that we typically use uppercase for. For instance, "source" and "destination" instead of "Source" and "Destination."


Don't forget the link here. Also, instead of "see the flow diagram at the bottom," provide the actual target heading name, to help readers more easily spot it at the end of this page.


## Simplest Example of a Pipeline


Minor: Perhaps "basic" instead of "simple"? Something that's simple to us might not be perceived as simple by others.

The simplest example of a pipeline starts with a local source connector, followed by a partioner, and then ends with a local destination connector. Here is what the code to run this looks like:


"partioner" -> "partitioner"

```
You can run this with `python local.py` (Adjust the `input_path` and `output_dir` as appropriate.)

The result is a partitioned `fake-text.txt.json` file in the `local-output` directory.
Paul-Cornell · Aug 13, 2024

`fake-text.txt.json` or just `fake-text.json` (here and below)?


I don't see a local-output directory in this code example?


Notice that the pipeline runs the following:

* context - The ProcessorConfig runs the pipeline. The arguments are related to the overall pipeline. We added some optional args to make development easier.


Suggestion: Enclose the names of code symbols in backticks, here and throughout, for better readability.


We added some optional args to make development easier.

Not sure what this means: which are optional, and why do they make development easier?


* source_connection - Takes arguments needed to connect to the source. Local files don't need anything here. Other connectors will.
* indexer - Takes the files in the `input_path` and creates .json files that point the downloader step to the right files


Minor: All sentences should end in a period, here and throughout.

* downloader - This does the actual downloading of the raw files (for non-blob files it may do something different like create a .txt file for every row in a source table)
* partitioner - Partitions the downloaded file, provided it is a partitionable file type. ([link to file types supported](https://github.com/Unstructured-IO/unstructured/blob/0c562d80503f6ef96504c6e38f27cfd9da8761df/unstructured/file_utils/filetype.py))


Minor: "link to file types supported" -> use a "see" or "view" phrase: i.e. "See the list of supported file types."

* chunker/embedder - *Not represented here* but often needed to prepare files for upload to a vector database.


Minor: Should you add them to the code example, even if they're "empty" definitions?

* stager - *Not represented here* but is often used to prepare partitioned files for upload.


Minor: Should you add it to the code example, even if it's an "empty" definition?
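Taken together, the step order listed above can be sketched in miniature (a hypothetical function for illustration, not the real ingest API):

```python
def run_pipeline_sketch(input_paths):
    """Toy walk-through of the pipeline step order described above."""
    # indexer: record which files to process
    index = [{"path": p} for p in input_paths]
    # downloader: for the local connector this is a no-op passthrough
    downloaded = [record["path"] for record in index]
    # partitioner: each downloaded file becomes a .json output
    partitioned = [f"{path}.json" for path in downloaded]
    # chunker/embedder/stager would transform `partitioned` here before
    # the destination connector uploads the results
    return partitioned


print(run_pipeline_sketch(["fake-text.txt"]))  # -> ['fake-text.txt.json']
```

The point of the sketch is only the ordering: index, download, partition, then the optional chunk/embed/stage steps before upload.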

- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic, based on the hash of the current step along with the previous step's hash.) In the case of the local source connector, it won't *download* files because they are already local, but for other source connectors there will be a `download` folder. Also note that the final file is named after the original file with a `.json` extension, since it has been partitioned. Not all output files will be named the same as the input file; this is the case for database-like sources.


Minor: You surround .json here in backticks but not for other extensions previously. You might want to, for better consistency and readability.


You can see the source/destination connector file that it runs here:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py


Move over to unstructured-ingest?

Comment on lines +92 to +94
* local_source_entry - Used to register the source connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`

* local_destination_entry - Used to register the destination connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`
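Registration amounts to adding each entry to a lookup that the pipeline later consults by connector name. A hypothetical sketch of what that `__init__.py` does (placeholder names, not the real module):

```python
# Hypothetical registries; the real module keeps similar name-to-entry mappings.
source_registry: dict = {}
destination_registry: dict = {}


def add_source_entry(name, entry):
    source_registry[name] = entry


def add_destination_entry(name, entry):
    destination_registry[name] = entry


# Placeholder entries standing in for local_source_entry / local_destination_entry.
add_source_entry("local", {"indexer": "LocalIndexer", "downloader": "LocalDownloader"})
add_destination_entry("local", {"uploader": "LocalUploader"})
```

Once registered, a pipeline can resolve "local" to the right indexer, downloader, and uploader classes without importing connector modules directly.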


Move over to unstructured-ingest?

hubert-rutkowski85 (Contributor)

Hey @potter-potter, it's quite an old PR, but I see you recently made a commit here. It would be great to have stable and official docs to consult about connectors, since a few people from deepsense are starting to migrate them right now. Do you have an ETA for when it could be merged? Of course we can look into it right now and it will help, but I'm not sure which parts are up to date and which have become obsolete in the meantime.

potter-potter (Contributor Author)


@hubert-rutkowski85 yes. It's here:
https://github.com/Unstructured-IO/unstructured-ingest/tree/main/docs
