
Potter/improve v2 connector docs #3401

Open · wants to merge 21 commits into base: main
Conversation

potter-potter (Contributor)

No description provided.

- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic and based on the BLABLABLA.) In the case of the local source connector, it won't *download* files because they are already local, but for other source connectors there will be a `download` folder. Also note that the final file is named after the original file with a `.json` extension, since it has been partitioned. Not all output files will be named the same as the input file; for example, when a table is the source, the output is named based on a hash of BLABLABLA.
potter-potter (Contributor Author) · Jul 15, 2024

@rbiseck3 can you leave a comment here on the two BLABLAs

Contributor

The source connector chooses the filename of the downloaded content, but every cached file along the way is named with a hash of the current step combined with the previous step's hash.
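A minimal sketch of that hash-chaining scheme (a hypothetical helper for illustration, not the actual ingest code) might look like:

```python
import hashlib


def cached_filename(step_name: str, previous_hash: str, suffix: str = ".json") -> str:
    """Derive a deterministic cache filename by hashing the current step
    name together with the previous step's hash (hypothetical sketch)."""
    digest = hashlib.sha256(f"{step_name}:{previous_hash}".encode()).hexdigest()
    return digest[:12] + suffix


# Each step feeds its filename hash into the next, so reruns over the
# same inputs always produce the same cache filenames.
index_name = cached_filename("index", "")
partition_name = cached_filename("partition", index_name)
```

Because every name depends on the previous one, a change anywhere upstream changes every downstream filename, which is what makes the cache safe to reuse.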


https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py

If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
potter-potter (Contributor Author)

@rbiseck3 can you leave a comment on the BLABLABLA

Contributor

Suggested change
If you look through the file you will notice these Classes (actually @dataclasses because BLABLABLA) and functions
If you look through the file you will notice these interfaces


* chroma_destination_entry - Registers the Chroma destination connector with the pipeline. (!!! LINK `unstructured/ingest/v2/processes/connectors/__init__.py`)

Note that the `chroma.py` file imports the official Chroma python package when it *creates* the client and not at the top of the file. This is so that BLABLABLA
potter-potter (Contributor Author)

@rbiseck3 can you leave a quick comment on the BLABLABLA

Contributor

Note that the `chroma.py` file imports the official Chroma python package when it *creates* the client and not at the top of the file. This allows the classes to be instantiated without error; a missing package only causes a runtime error when the client is created.
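The deferred-import pattern described here can be sketched as follows (placeholder module and class names for illustration, not the actual `chroma.py` code):

```python
class ChromaUploaderSketch:
    """Sketch of a connector class that defers its third-party import."""

    def __init__(self, collection_name: str):
        # No third-party import here: the class is safe to construct
        # even when the optional dependency is not installed.
        self.collection_name = collection_name
        self.client = None

    def create_client(self):
        # The optional dependency is imported only at connection time,
        # so a missing package surfaces as a runtime error here, not at
        # module import time.
        import chromadb_placeholder_pkg  # placeholder for the real package

        self.client = chromadb_placeholder_pkg.Client()


uploader = ChromaUploaderSketch("my-collection")  # succeeds without the package
```

Instantiation and registration work everywhere; only a pipeline that actually connects to the destination needs the package installed.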

@potter-potter potter-potter marked this pull request as ready for review August 13, 2024 14:52
@@ -0,0 +1,353 @@
# Developing V2 Connectors


Suggestion: Use "sentence case" for all headings instead of "Title Case."

## Intro
The Unstructured open source repo processes documents (artifacts) in a pipeline. The Source and Destination connectors sit at the front and back of the pipeline. For a visual example see the flow diagram at the bottom (link to bottom).


Hmm...The "repo" doesn't process them. Perhaps "library" instead? Also, coming very soon, the open-source library won't be able to process them either, based on the move that was begun over to unstructured-ingest. Perhaps "Unstructured Ingest" instead?


Suggestion: Use lowercase for names of things, here and throughout, unless it is a product name or feature name that we typically use uppercase for. For instance, "source" and "destination" instead of "Source" and "Destination."


Don't forget the link here. Also, instead of "see the flow diagram at the bottom," provide the actual target heading name, to help readers more easily spot it at the end of this page.


## Simplest Example of a Pipeline


Minor: Perhaps "basic" instead of "simple"? Something that's simple to us might not be perceived as simple by others.

The simplest example of a pipeline starts with a local source connector, followed by a partioner, and then ends with a local destination connector. Here is what the code to run this looks like:


"partioner" -> "partitioner"

```
You can run this with `python local.py` (Adjust the `input_path` and `output_dir` as appropriate.)

The result is a partitioned `fake-text.txt.json` file in the `local-output` directory.
Paul-Cornell · Aug 13, 2024

`fake-text.txt.json` or just `fake-text.json` (here and below)?


I don't see a local-output directory in this code example?


Notice that the pipeline runs the following:

* context - The ProcessorConfig runs the pipeline. The arguments are related to the overall pipeline. We added some optional args to make development easier.


Suggestion: Enclose the names of code symbols in backticks, here and throughout, for better readability.


We added some optional args to make development easier.

Not sure what this means: which are optional, and why do they make development easier?


* source_connection - Takes arguments needed to connect to the source. Local files don't need anything here. Other connectors will.
* indexer - Takes the files in the `input_path` and creates .json files that point the downloader step to the right files


Minor: All sentences should end in a period, here and throughout.

* downloader - This does the actual downloading of the raw files (for non-blob files it may do something different like create a .txt file for every row in a source table)
* partitioner - Partitions the downloaded file, provided it is a partitionable file type. ([link to file types supported](https://github.com/Unstructured-IO/unstructured/blob/0c562d80503f6ef96504c6e38f27cfd9da8761df/unstructured/file_utils/filetype.py))


Minor: "link to file types supported" -> use a "see" or "view" phrase: i.e. "See the list of supported file types."

* chunker/embedder - *Not represented here* but often needed to prepare files for upload to a vector database.


Minor: Should you add them to the code example, even if they're "empty" definitions?

* stager - *Not represented here* but is often used to prepare partitioned files for upload.


Minor: Should you add it to the code example, even if it's an "empty" definition?
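Taken together, the step order listed above can be sketched in miniature (a hypothetical function for illustration, not the real ingest API):

```python
def run_pipeline_sketch(input_paths):
    """Toy walk-through of the pipeline step order described above."""
    # indexer: record which files to process
    index = [{"path": p} for p in input_paths]
    # downloader: for the local connector this is a no-op passthrough
    downloaded = [record["path"] for record in index]
    # partitioner: each downloaded file becomes a .json output
    partitioned = [f"{path}.json" for path in downloaded]
    # chunker/embedder/stager would transform `partitioned` here before
    # the destination connector uploads the results
    return partitioned


print(run_pipeline_sketch(["fake-text.txt"]))  # -> ['fake-text.txt.json']
```

The point of the sketch is only the ordering: index, download, partition, then the optional chunk/embed/stage steps before upload.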

- 36caa9b04378.json
```

(Note that the index and partition file names are deterministic, based on the hash of the current step along with the previous step's hash.) In the case of the local source connector, it won't *download* files because they are already local, but for other source connectors there will be a `download` folder. Also note that the final file is named after the original file with a `.json` extension, since it has been partitioned. Not all output files will be named the same as the input file; this is the case for database-like sources.


Minor: You surround .json here in backticks but not for other extensions previously. You might want to, for better consistency and readability.


You can see the source/destination connector file that it runs here:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/v2/processes/connectors/local.py


Move over to unstructured-ingest?

Comment on lines +92 to +94
* local_source_entry - Used to register the source connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`

* local_destination_entry - Used to register the destination connector here: `unstructured/ingest/v2/processes/connectors/__init__.py`
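Registration amounts to adding each entry to a lookup that the pipeline later consults by connector name. A hypothetical sketch of what that `__init__.py` does (placeholder names, not the real module):

```python
# Hypothetical registries; the real module keeps similar name-to-entry mappings.
source_registry: dict = {}
destination_registry: dict = {}


def add_source_entry(name, entry):
    source_registry[name] = entry


def add_destination_entry(name, entry):
    destination_registry[name] = entry


# Placeholder entries standing in for local_source_entry / local_destination_entry.
add_source_entry("local", {"indexer": "LocalIndexer", "downloader": "LocalDownloader"})
add_destination_entry("local", {"uploader": "LocalUploader"})
```

Once registered, a pipeline can resolve "local" to the right indexer, downloader, and uploader classes without importing connector modules directly.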


Move over to unstructured-ingest?

hubert-rutkowski85 (Contributor)

Hey @potter-potter, it's quite an old PR, but I see you recently made a commit here. It would be great to have stable and official docs to consult about connectors, since a few people from deepsense are starting to migrate them right now. Do you have an ETA for when it could be merged? Of course we can look into it right now and it will help, but I'm not sure which parts are up to date and which have become obsolete in the meantime.

potter-potter (Contributor Author)


@hubert-rutkowski85 yes. It's here:
https://github.com/Unstructured-IO/unstructured-ingest/tree/main/docs
