Redesign of the LocalDocs (v2 and v3) database #2980
Replies: 4 comments
-
some caveats with UUIDs: they are larger to store (especially as text, though not as bad for DBs with a native uuid type), and random UUIDs will cause worst-case behavior in ordered index structures like b-trees. serial integer IDs have a desirable properties: they are compact and generally place items added around the same time near each other in index structures, and new items are always added to the -end- of an index, this generally results in much better insert/update/lookup performance than random IDs UUIDs are generally used when there is no ability to coordinate ID generation and there is no better way to be sure independent processes don't generate the conflicting IDs - that is: it might be necessary to use UUIDs (or more index-friendly variants like ULIDs) if you need to combine data from multiple sources/users who are not always writing to the same database or otherwise don't have an always-available central point of coordination, but if you don't need to deal with a situation like that they are mostly a headache. |
Beta Was this translation helpful? Give feedback.
-
This is the current situation with the filename in table "chunks" being repeated time and again, increasing the database size, yes, there's enough space nowadays, but there are certain protocols to follow, and also, like, needs, when designing and accessing, respectively, a database. (here, "file" means "filename" - one would presume that "file" means a BLOB-as-the-file-itself, but nope) |
Beta Was this translation helpful? Give feedback.
-
in v3.4.1, while the loclaldocs_v3 database was improved by implementing views for faster search and retrieval, the database is still bloated with strings repeated thousands of times over - filename, title, author. This increases the size of the database, and is downright unprofessional. Because each such group describing the Source of a snippet is, well, unique, those strings - filename, title, author, should be stored separately as unique values - as individual records (one row with <-3 fields) identifiable by the value of an ID field.
In this regard, of text repeated over and over, the database looks like it was designed in a rush or amateurish, becoming a box filled with text that should/could only be stored once; what's worse is that code is written around it, which makes it difficult to modify the database later because of the weak scaffolding from the beginning, v3 already having quite a big number of tables. |
Beta Was this translation helpful? Give feedback.
-
The localdocs(_v2) database could be redesigned for ease-of-use and legibility, some ideas being:
0.1. Use of Views, for quicker access;
0.2. the integer in AutoIncrement for IDs, while quick and painless, could be replaced with a GUID/UUID-as-text, since the IDs are unique which makes AutoIncrement useless, old-fashioned and confusing - unlike a GUID which cannot be mistaken for another thing than an IDentifier and can also act as a PK.
1.1. create a field for the ID of the collection where each document exists; the new field could be named "collection_id" and would contain the values of the field "id" in the table "collections".
motivation:
documents within a category could be retrieved easily after 1 JOIN only ON [documents].[collection_id] = [collections].[id]
1.2. the values in the field "document_path" could be renamed to "document_file"
motivation:
the file path of each "document file" is stored repeatedly for each record
suggestion:
the file path should be stored separately, in the "collections" table
1.3. a field, say "document_filesize" could be inserted, that would store each document's/file's size
2.1. create a field for the path - or, better, for the ID- to/of the corresponding collection
motivation: see 1.2. - the path of each file is stored along with the filename, which makes it a redundant to store, tiresome to look at until you find the actual filename, and unnecessary in the "documents" table; a unique string along with the ID of a collection would account for ease-of-use when linking, selecting and retrieving info from the table "documents"
ID | name | collection_path | start_update_time | last_update_time | embedding_model ...
15 | _localdocs_wikipedia | H:\localdocs\pathtocollection | 1725384467458 | 1725787485458 | nomic-embed-text-v1.5 ...
2.2. the "folder_id" in table "collections_items" could also be inserted-moved into the table "collections", as both tables have the same number of records and the field "id" in both tables may be the ID of each collection (in which case it is also redundant - present in 2 tables)
3.1. it only contains two fields - "ID" and "path", with the "path" field certainly corresponding to a collection's ID and name
suggestion: see 1.2., the "path" field should be simply placed in the "collections" table, because it is directly referring to collections, instead of it occupying a separate table which seemingly has no other role but to be JOINed with others.
To retrieve the names of files in a collection - along with their paths, which are of NO interest in this case, with the /user having to Remove or Split afterwards the resulted strings - the over-the-top SELECT command would be:
SELECT collections.name, documents.document_path
FROM documents
JOIN collection_items JOIN collections
ON
(documents.folder_id= collection_items.folder_id
AND
collection_items.collection_id= collections.id)
WHERE collections.name="localdocs_astro"
The CREATEs below are the current ones +
CREATE TABLE chunks(
chunk_id integer primary key autoincrement,
chunk_document_id integer not null,
chunk_collection_id integer not null, -- new field
chunk_text text not null,
chunk_file text not null,
chunk_title text,
chunk_author text,
chunk_subject text,
chunk_keywords text,
chunk_page integer,
chunk_line_from integer,
chunk_line_to integer,
chunk_words integer default 0 not null,
chunk_tokens integer default 0 not null,
foreign key(document_id) references documents(id)
)
CREATE TABLE collection_items(
collection_item_collection_id integer not null,
collection_item_folder_id integer not null,
foreign key(collection_id) references collections(id)
foreign key(folder_id) references folders(id),
unique(collection_id, folder_id)
)
CREATE TABLE collections(
collection_id integer primary key,
collection_name text unique not null,
collection_path text not null, -- new field
collection_folder_id integer not null, -- new field, moved here from table "folders"
collection_update_time_start integer,
collection_update_time_last integer,
collection_embedding_model text
)
CREATE TABLE documents(
document_id integer primary key,
document_folder_id integer not null,
collection_id integer not null, -- new field
document_time integer not null,
document_path text unique not null,
document_filesize integer not null, -- new field
foreign key(folder_id) references folders(id)
)
CREATE TABLE embeddings(
model text not null,
folder_id integer not null,
chunk_id integer not null,
embedding blob not null,
primary key(model, folder_id, chunk_id),
foreign key(folder_id) references folders(id),
foreign key(chunk_id) references chunks(id),
unique(model, chunk_id)
)
CREATE TABLE folders(
folder_id integer primary key autoincrement, -- moveable into table "collections"
folder_path text unique not null -- likewise^, folder path is the same as collection path
folder_collection_id integer not null, -- new field, if these^ 2 remain here
)
Beta Was this translation helpful? Give feedback.
All reactions