Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validity mask #16

Open
ludnic opened this issue Apr 18, 2023 · 15 comments
Open

validity mask #16

ludnic opened this issue Apr 18, 2023 · 15 comments

Comments

@ludnic
Copy link
Collaborator

ludnic commented Apr 18, 2023

the next steps in the data chunk test involve retrieving the validity mask of a vector:

do chunk_idx = 0, chunk_count - 1
        chunk = duckdb_result_get_chunk(result, chunk_idx)
        row_count = duckdb_data_chunk_get_size(chunk)
        do row_idx = 0, row_count - 1
          do col_idx = 0, col_count - 1
            ! Get the column
            vector = duckdb_data_chunk_get_vector(chunk, col_idx);
            validity = duckdb_vector_get_validity(vector)

and I was already looking earlier at the other vector functions. However from the api documentation I see:

If all values are valid, this function MIGHT return NULL!
The validity mask is a bitset that signifies null-ness within the data chunk. 
It is a series of uint64_t values, where each uint64_t value contains validity for 64 tuples. 
The bit is set to 1 if the value is valid (i.e. not NULL) or 0 if the value is invalid (i.e. NULL).

the validity mask is a bitset saved in int64 (possibly array?). I was looking for ways of getting it into fortran but not sure what is the right way of doing it. I was thinking it should be an allocatable int64 array but not sure how to get the size.

@ludnic ludnic mentioned this issue Apr 18, 2023
@freevryheid
Copy link
Owner

I suspect it is referring to the 64 bits in the c_int64_t value so we can use it as an integer rather than a bit array. If I remember there were some bit functions in the stdlib, which may be useful here as a quick check. I'll pull in the stdlib as a dev dep.

@ludnic
Copy link
Collaborator Author

ludnic commented Apr 19, 2023

great, I saw your changes. From the comment I thought the validity mask could be an array and wasn't sure how to handle it on the fortran side. but it seems to be a scalar (at least in the test)

@ludnic
Copy link
Collaborator Author

ludnic commented Apr 19, 2023

ah I see now. it is always a scalar and gets reassigned.

@ludnic ludnic closed this as completed Apr 19, 2023
@ludnic ludnic reopened this Apr 19, 2023
@ludnic
Copy link
Collaborator Author

ludnic commented Apr 19, 2023

actually reading this again. I'm still confused. validity seems to be an array in the general case. Looking at the alternative way that is suggested to check the validity of an entry:

idx_t entry_idx = row_idx / 64; 
idx_t idx_in_entry = row_idx % 64; 
bool is_valid = validity_mask[entry_idx] & (1 « idx_in_entry);

validity_mask should be an array and if the row index is greater than 64 entry_idx will point to the second element of the array (index 1). So it still seems we should get it in fortran as an array? In this test row_count is 40, so the entry_idx should be always 0.

@freevryheid
Copy link
Owner

latest fix correctly casts the validity pointer to a 64bit integer. The bit strings are now correct.

validity is a pointer to a variable in memory. In this case an int64, which in turn is an array of 64 bits.

In fortran, if you wanted to check the bit you could use:
btest(validity,·row_idx+1)

I've added this to the code to demo

@lnicotina-rms
Copy link

it seems to me that btest(validity, row_idx+1) (btw probably the bit counting should start at 0 in btest?) does the same as validity & (1 << idx_in_entry), but what happens if row_idx > 64?

@freevryheid
Copy link
Owner

Noted a potential issue that I've not considered but which was highlighted in the

duckdb_validity_set_row_validity(validity,·0,·.false.)

function.

The helper function casts validity from int64 to c_int64_t and in doing so changes the reference of validity. This requires a recast to update validity - another c interop gotcha.

I've updated https://github.com/freevryheid/duckdb/blob/main/src/duckdb.f90#L2357 to illustrate.

@ludnic
Copy link
Collaborator Author

ludnic commented May 2, 2023

I see the issue! Thanks for the note, I hadn't realised!

@freevryheid
Copy link
Owner

Makes me wonder if we should make the _ functions public as well. This would require the use of iso_c_binding but would negate the overhead of the helper functions.

@ludnic
Copy link
Collaborator Author

ludnic commented May 2, 2023

I quite like the current look of the API with the helper functions, it seems quite clean and better than having to deal with the iso_c_binding types. maybe it is me not being used to it, but having to deal with the casting every time you want to deal with duckdb seems ugly and error prone so I think it's better to do the painful helpers just once.

@freevryheid
Copy link
Owner

latest fix returns validity as a c_ptr that then needs to be casted as a c_int64_t. Trick was to forego using a helper function for this as the reference created by the helper function for the returned validity is not pointing to the chunks reference - we had the same issue with the duckdb_validity_set_row_validity as indicated previously. This fix requires changes to the previous tests checking validity, I commented these out for testing. I'll address when I find time later.

@freevryheid
Copy link
Owner

fixed the validity tests - the drawback of this approach is that we need to work with pointers and import iso_c_binding but then this is a c library after all.

@ludnic
Copy link
Collaborator Author

ludnic commented Jun 5, 2023

I have seen the changes in the validity tests and have used the approach for the chunk varchar test.

However I think there might be an issue with the validity mask still, since in this test the index of duckdb_validity_row_is_valid(vector_validity, i) goes higher than 64 so I'm not sure what happens now that the validity mask is an int64 pointer.

@freevryheid
Copy link
Owner

@ludnic you were right from the start all along. So I guess validity is better handled as a c_ptr. It is an int64 array but not sure how many elements it comprises. It should still be possible to convert to a int64 array as in c_f_pointer(c_ptr, f_int64, shape=[]).

@ludnic
Copy link
Collaborator Author

ludnic commented Jun 12, 2023

thanks for fixing it. just passing it on as a c_ptr and getting the right result from duckdb_validity_row_is_valid works very well and you don't really need to access the validity mask directly I guess.

I'm still not clear how to deal with the pointers to vectors since we don't know the size. There is another case that is not as straightforward in the data_chunk_varchar_result test I'm looking at. I will make a separate issue in case you have suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants