Redesign of the LocalDocs (v2 and v3) database #2980

SINAPSA-IC · 2024-09-22T10:08:45Z

SINAPSA-IC
Sep 22, 2024

The localdocs(_v2) database could be redesigned for ease-of-use and legibility, some ideas being:

0.1. Use of Views, for quicker access;
0.2. the integer in AutoIncrement for IDs, while quick and painless, could be replaced with a GUID/UUID-as-text, since the IDs are unique which makes AutoIncrement useless, old-fashioned and confusing - unlike a GUID which cannot be mistaken for another thing than an IDentifier and can also act as a PK.

table: "documents":

1.1. create a field for the ID of the collection where each document exists; the new field could be named "collection_id" and would contain the values of the field "id" in the table "collections".
motivation:
documents within a category could be retrieved easily after 1 JOIN only ON [documents].[collection_id] = [collections].[id]

1.2. the values in the field "document_path" could be renamed to "document_file"
motivation:
the file path of each "document file" is stored repeatedly for each record
suggestion:
the file path should be stored separately, in the "collections" table

1.3. a field, say "document_filesize" could be inserted, that would store each document's/file's size

this would be useful when analyzing/plotting the number of chunks from a file vs. the size of the respective file

table: "collections"

2.1. create a field for the path - or, better, for the ID- to/of the corresponding collection
motivation: see 1.2. - the path of each file is stored along with the filename, which makes it a redundant to store, tiresome to look at until you find the actual filename, and unnecessary in the "documents" table; a unique string along with the ID of a collection would account for ease-of-use when linking, selecting and retrieving info from the table "documents"

the structure would become (for collection path, unwieldy):
ID | name | collection_path | start_update_time | last_update_time | embedding_model ...
15 | _localdocs_wikipedia | H:\localdocs\pathtocollection | 1725384467458 | 1725787485458 | nomic-embed-text-v1.5 ...

2.2. the "folder_id" in table "collections_items" could also be inserted-moved into the table "collections", as both tables have the same number of records and the field "id" in both tables may be the ID of each collection (in which case it is also redundant - present in 2 tables)

table "folders"

3.1. it only contains two fields - "ID" and "path", with the "path" field certainly corresponding to a collection's ID and name
suggestion: see 1.2., the "path" field should be simply placed in the "collections" table, because it is directly referring to collections, instead of it occupying a separate table which seemingly has no other role but to be JOINed with others.

To retrieve the names of files in a collection - along with their paths, which are of NO interest in this case, with the /user having to Remove or Split afterwards the resulted strings - the over-the-top SELECT command would be:

SELECT collections.name, documents.document_path
FROM documents
JOIN collection_items JOIN collections
ON
(documents.folder_id= collection_items.folder_id
AND
collection_items.collection_id= collections.id)
WHERE collections.name="localdocs_astro"

The CREATEs below are the current ones +

fields renamed as to suggest their location (table), for instance "chunk_" precedes the name of a field in table "chunks"
fields inserted, like "collection_path text not null" in table "collections"

CREATE TABLE chunks(
chunk_id integer primary key autoincrement,
chunk_document_id integer not null,
chunk_collection_id integer not null, -- new field
chunk_text text not null,
chunk_file text not null,
chunk_title text,
chunk_author text,
chunk_subject text,
chunk_keywords text,
chunk_page integer,
chunk_line_from integer,
chunk_line_to integer,
chunk_words integer default 0 not null,
chunk_tokens integer default 0 not null,
foreign key(document_id) references documents(id)
)

CREATE TABLE collection_items(
collection_item_collection_id integer not null,
collection_item_folder_id integer not null,
foreign key(collection_id) references collections(id)
foreign key(folder_id) references folders(id),
unique(collection_id, folder_id)
)

CREATE TABLE collections(
collection_id integer primary key,
collection_name text unique not null,
collection_path text not null, -- new field
collection_folder_id integer not null, -- new field, moved here from table "folders"
collection_update_time_start integer,
collection_update_time_last integer,
collection_embedding_model text
)

CREATE TABLE documents(
document_id integer primary key,
document_folder_id integer not null,
collection_id integer not null, -- new field
document_time integer not null,
document_path text unique not null,
document_filesize integer not null, -- new field
foreign key(folder_id) references folders(id)
)

CREATE TABLE embeddings(
model text not null,
folder_id integer not null,
chunk_id integer not null,
embedding blob not null,
primary key(model, folder_id, chunk_id),
foreign key(folder_id) references folders(id),
foreign key(chunk_id) references chunks(id),
unique(model, chunk_id)
)

CREATE TABLE folders(
folder_id integer primary key autoincrement, -- moveable into table "collections"
folder_path text unique not null -- likewise^, folder path is the same as collection path
folder_collection_id integer not null, -- new field, if these^ 2 remain here
)

apage43 · 2024-09-25T17:33:39Z

apage43
Sep 25, 2024
Collaborator

some caveats with UUIDs: they are larger to store (especially as text, though not as bad for DBs with a native uuid type), and random UUIDs will cause worst-case behavior in ordered index structures like b-trees.

serial integer IDs have a desirable properties: they are compact and generally place items added around the same time near each other in index structures, and new items are always added to the -end- of an index, this generally results in much better insert/update/lookup performance than random IDs

UUIDs are generally used when there is no ability to coordinate ID generation and there is no better way to be sure independent processes don't generate the conflicting IDs - that is: it might be necessary to use UUIDs (or more index-friendly variants like ULIDs) if you need to combine data from multiple sources/users who are not always writing to the same database or otherwise don't have an always-available central point of coordination, but if you don't need to deal with a situation like that they are mostly a headache.

0 replies

SINAPSA-IC · 2024-09-25T19:12:36Z

SINAPSA-IC
Sep 25, 2024
Author

This is the current situation with the filename in table "chunks" being repeated time and again, increasing the database size, yes, there's enough space nowadays, but there are certain protocols to follow, and also, like, needs, when designing and accessing, respectively, a database.

(here, "file" means "filename" - one would presume that "file" means a BLOB-as-the-file-itself, but nope)

0 replies

SINAPSA-IC · 2024-09-25T20:04:33Z

SINAPSA-IC
Sep 25, 2024
Author

0 replies

SINAPSA-IC · 2024-10-15T10:27:17Z

SINAPSA-IC
Oct 15, 2024
Author

in v3.4.1, while the loclaldocs_v3 database was improved by implementing views for faster search and retrieval, the database is still bloated with strings repeated thousands of times over - filename, title, author.
And those 10^3 in the image below become 10^4 for Big books, like encyclopedias, with 10^4 snippets extracted from each such book. Truly amazing.

This increases the size of the database, and is downright unprofessional.

Because each such group describing the Source of a snippet is, well, unique, those strings - filename, title, author, should be stored separately as unique values - as individual records (one row with <-3 fields) identifiable by the value of an ID field.
This would be useful/beneficial in 2 ways:

no more wasted space and bloated database - by storing 1 one number in 1 one field, instead of storing a) repeatedly b) the whole list of 3 three strings in c) 3 three fields corresponding to exactly) 1 one "chunk_text" (which is presumably unique, too)
at the end of a reply, a JOIN would be performed on the value of that identifier in the two tables ("chunks" and the dedicated table), to show the filename, title, author

In this regard, of text repeated over and over, the database looks like it was designed in a rush or amateurish, becoming a box filled with text that should/could only be stored once; what's worse is that code is written around it, which makes it difficult to modify the database later because of the weak scaffolding from the beginning, v3 already having quite a big number of tables.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesign of the LocalDocs (v2 and v3) database #2980

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Redesign of the LocalDocs (v2 and v3) database #2980

SINAPSA-IC Sep 22, 2024

Replies: 4 comments

apage43 Sep 25, 2024 Collaborator

SINAPSA-IC Sep 25, 2024 Author

SINAPSA-IC Sep 25, 2024 Author

SINAPSA-IC Oct 15, 2024 Author

SINAPSA-IC
Sep 22, 2024

apage43
Sep 25, 2024
Collaborator

SINAPSA-IC
Sep 25, 2024
Author

SINAPSA-IC
Sep 25, 2024
Author

SINAPSA-IC
Oct 15, 2024
Author