Potter/improve v2 connector docs #3401
base: main
unstructured/ingest/v2/README2.md
Outdated
- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic and based on the BLABLABLA) In the case of the local source connector, it won't *download* files because they are already local. But for other source connectors there will be a `download` folder. Also note that the final file is named based on the original file with a `.json` extension since it has been partitioned. Not all output files will be named the same as the input file. An example is a table as a source file, the output will be based on a hash of BLABLABLA.
@rbiseck3 can you leave a comment here on the two BLABLAs
The source connector chooses the filename of the downloaded content, but every cached file along the way is named with a hash of the current step combined with the previous step's hash.
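To make the naming scheme described in this comment concrete, here is a minimal sketch of chained hashing. It is not the library's implementation: the `step_hash` helper, the config keys, the example path, and the 12-character truncation are all assumptions made up for illustration.

```python
import hashlib
import json


def step_hash(step_name: str, step_config: dict, previous_hash: str = "") -> str:
    """Hash the current step (name + config) together with the previous step's hash.

    Hypothetical helper for illustration only; not part of unstructured.
    """
    payload = json.dumps(
        {"step": step_name, "config": step_config, "previous": previous_hash},
        sort_keys=True,  # stable key order keeps the hash deterministic
    )
    # Truncated to 12 hex characters purely for readability in this sketch.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Each step's cached file name depends on its own settings and on the step before it.
index_hash = step_hash("index", {"input_path": "example-docs/fake-text.txt"})
partition_hash = step_hash("partition", {"strategy": "auto"}, previous_hash=index_hash)

index_filename = f"{index_hash}.json"
partition_filename = f"{partition_hash}.json"
```

Because each hash folds in the previous one, rerunning the same pipeline on the same input yields the same file names, while changing an earlier step changes the names of everything downstream.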
unstructured/ingest/v2/README2.md
Outdated
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py

If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
@rbiseck3 can you leave a comment on the BLABLABLA
If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
If you look through the file you will notice these interfaces
unstructured/ingest/v2/README2.md
Outdated
* chroma_destination_entry - Registers the Chroma destination connector with the pipeline. (!!! LINK `unstructured/ingest/v2/processes/connectors/__init__.py`)

Note that the `chroma.py` file imports the official Chroma python package when it *creates* the client and not at the top of the file. This is so that BLABLABLA
@rbiseck3 can you leave a quick comment on the BLABLABLA
Note that the `chroma.py` file imports the official Chroma Python package when it creates the client and not at the top of the file. This allows the classes to be instantiated without error even when the package is missing; the missing import only causes a runtime error once the client is actually created.
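The deferred-import pattern described here can be sketched as follows. This is an illustration under stated assumptions, not the contents of `chroma.py`: the `LazyClientConfig` class and the package name `some_optional_vector_db_sdk` are hypothetical stand-ins.

```python
import importlib
from dataclasses import dataclass


@dataclass
class LazyClientConfig:
    """Hypothetical connection config; not the real Chroma connector class."""

    package_name: str = "some_optional_vector_db_sdk"  # hypothetical package name

    def create_client(self):
        # The optional dependency is imported here, not at module import time,
        # so merely constructing the config never touches the package.
        try:
            module = importlib.import_module(self.package_name)
        except ImportError as exc:
            raise ImportError(
                f"{self.package_name} is required to create a client; install it first."
            ) from exc
        return module


# Instantiation succeeds even though the optional package is not installed.
config = LazyClientConfig()

# The failure only surfaces when a client is actually requested.
try:
    config.create_client()
    client_created = True
except ImportError:
    client_created = False
```

The trade-off is that a missing dependency is reported later, at client creation, rather than when the module is first imported.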
@@ -0,0 +1,353 @@
# Developing V2 Connectors
Suggestion: Use "sentence case" for all headings instead of "Title Case."
unstructured/ingest/v2/README2.md
Outdated
@@ -0,0 +1,353 @@
# Developing V2 Connectors
## Intro
The Unstructured open source repo processes documents (artifacts) in a pipeline. The Source and Destination connectors sit at the front and back of the pipeline. For a visual example see the flow diagram at the bottom (link to bottom).
Hmm...The "repo" doesn't process them. Perhaps "library" instead? Also, coming very soon, the open-source library won't be able to process them either, based on the move that was begun over to `unstructured-ingest`. Perhaps "Unstructured Ingest" instead?
Suggestion: Use lowercase for names of things, here and throughout, unless it is a product name or feature name that we typically use uppercase for. For instance, "source" and "destination" instead of "Source" and "Destination."
Don't forget the link here. Also, instead of "see the flow diagram at the bottom," provide the actual target heading name, to help readers more easily spot it at the end of this page.
unstructured/ingest/v2/README2.md
Outdated
## Intro
The Unstructured open source repo processes documents (artifacts) in a pipeline. The Source and Destination connectors sit at the front and back of the pipeline. For a visual example see the flow diagram at the bottom (link to bottom).

## Simplest Example of a Pipeline
Minor: Perhaps "basic" instead of "simple?" Something that's simple to us might not be perceived as simple to others.
unstructured/ingest/v2/README2.md
Outdated
The Unstructured open source repo processes documents (artifacts) in a pipeline. The Source and Destination connectors sit at the front and back of the pipeline. For a visual example see the flow diagram at the bottom (link to bottom).

## Simplest Example of a Pipeline
The simplest example of a pipeline starts with a local source connector, followed by a partioner, and then ends with a local destination connector. Here is what the code to run this looks like:
"partioner" -> "partitioner"
```
You can run this with `python local.py` (Adjust the `input_path` and `output_dir` as appropriate.)

The result is a partitioned `fake-text.txt.json` file in the `local-output` directory.
`fake-text.txt.json` or just `fake-text.json` (here and below)?
I don't see a `local-output` directory in this code example?
unstructured/ingest/v2/README2.md
Outdated
Notice that the pipeline runs the following:

* context - The ProcessorConfig runs the pipeline. The arguments are related to the overall pipeline. We added some optional args to make development easier.
Suggestion: Enclose the names of code symbols in backticks, here and throughout, for better readability.
> We added some optional args to make development easier.

Not sure what this means: which are optional, and why do they make development easier?
unstructured/ingest/v2/README2.md
Outdated
* context - The ProcessorConfig runs the pipeline. The arguments are related to the overall pipeline. We added some optional args to make development easier.
* source_connection - Takes arguments needed to connect to the source. Local files don't need anything here. Other connectors will.
* indexer - Takes the files in the `input_path` and creates .json files that point the downloader step to the right files
Minor: All sentences should end in a period, here and throughout.
unstructured/ingest/v2/README2.md
Outdated
* source_connection - Takes arguments needed to connect to the source. Local files don't need anything here. Other connectors will.
* indexer - Takes the files in the `input_path` and creates .json files that point the downloader step to the right files
* downloader - This does the actual downloading of the raw files (for non-blob files it may do something different like create a .txt file for every row in a source table)
* partitioner - Partitions the downloaded file, provided it is a partionable file type. ([link to file types supported](https://github.com/Unstructured-IO/unstructured/blob/0c562d80503f6ef96504c6e38f27cfd9da8761df/unstructured/file_utils/filetype.py))
Minor: "link to file types supported" -> use a "see" or "view" phrase: i.e. "See the list of supported file types."
* indexer - Takes the files in the `input_path` and creates .json files that point the downloader step to the right files
* downloader - This does the actual downloading of the raw files (for non-blob files it may do something different like create a .txt file for every row in a source table)
* partitioner - Partitions the downloaded file, provided it is a partionable file type. ([link to file types supported](https://github.com/Unstructured-IO/unstructured/blob/0c562d80503f6ef96504c6e38f27cfd9da8761df/unstructured/file_utils/filetype.py))
* chunker/embedder - *Not represented here* but often needed to prepare files for upload to a vector database.
Minor: Should you add them to the code example, even if they're "empty" definitions?
* downloader - This does the actual downloading of the raw files (for non-blob files it may do something different like create a .txt file for every row in a source table)
* partitioner - Partitions the downloaded file, provided it is a partionable file type. ([link to file types supported](https://github.com/Unstructured-IO/unstructured/blob/0c562d80503f6ef96504c6e38f27cfd9da8761df/unstructured/file_utils/filetype.py))
* chunker/embedder - *Not represented here* but often needed to prepare files for upload to a vector database.
* stager - *Not represented here* but is often used to prepare partitioned files for upload.
Minor: Should you add it to the code example, even if it's an "empty" definition?
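One way to picture the "empty definition" idea raised in these comments is a toy step runner in which optional steps are present but set to `None`. This is a sketch only: `run_steps`, `tag`, and the step names are hypothetical, the ordering simply follows the bullet list quoted above, and none of it reflects the actual pipeline classes.

```python
from typing import Callable, Optional

Step = Callable[[str], str]


def run_steps(item: str, steps: list[tuple[str, Optional[Step]]]) -> tuple[str, list[str]]:
    """Run each configured step in order, skipping any step left as None."""
    executed = []
    for name, step in steps:
        if step is None:
            continue  # optional step not configured for this pipeline
        item = step(item)
        executed.append(name)
    return item, executed


def tag(label: str) -> Step:
    """Build a stand-in step that just appends a label to the item."""
    return lambda item: f"{item}>{label}"


# Optional steps (chunk, embed, stage) are listed but left empty.
steps = [
    ("index", tag("indexed")),
    ("download", tag("downloaded")),
    ("partition", tag("partitioned")),
    ("chunk", None),
    ("embed", None),
    ("stage", None),
    ("upload", tag("uploaded")),
]

result, executed = run_steps("fake-text.txt", steps)
```

Listing the unused steps explicitly keeps the full step order visible to readers even when a basic local-to-local example needs only four of them.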
- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic and based on the hash of the current step along with the previous step's hash.) In the case of the local source connector, it won't *download* files because they are already local. But for other source connectors there will be a `download` folder. Also note that the final file is named based on the original file with a `.json` extension since it has been partitioned. Not all output files will be named the same as the input file. This is the case for database like sources.
Minor: You surround `.json` here in backticks but not for other extensions previously. You might want to, for better consistency and readability.
You can see the source/destination connector file that it runs here:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py
Move over to `unstructured-ingest`?
* local_source_entry - Used to register the source connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`

* local_destination_entry - Used to register the destination connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`
Move over to `unstructured-ingest`?
Hey @potter-potter, this is quite an old PR, but I see you recently made a commit here. It would be great to have stable, official docs to consult about connectors, since a few people from deepsense are starting to migrate them right now. Do you have an ETA for when it could be merged? Of course we can look into it right now and it will help, but I'm not sure which parts are up to date and which have become obsolete in the meantime.
@hubert-rutkowski85 yes. It's here:
No description provided.