
fs.open returns ResourceNotFound (404) even though fs.exists finds the path #70

Open
poudro opened this issue May 19, 2020 · 10 comments · May be fixed by #71

Comments

@poudro

poudro commented May 19, 2020

I've been having a strange issue with fs.open on various files.

The code is basically:

    if not fs.exists(path_in_fs):
        return None

    try:
        with fs.open(path_in_fs, 'rb') as f:
            data = f.read()  # read the file contents
    except Exception as err:
        raise err  # this is where the ResourceNotFound error surfaces

Sometimes fs.exists will say that a file exists in S3, but the fs.open step will then raise a ResourceNotFound error.

I tracked the error in fs.open to this section of the code https://github.com/PyFilesystem/s3fs/blob/master/fs_s3fs/_s3fs.py#L433-L441, but I'm a bit at a loss as to why an exception is triggered here but not by fs.exists.

Finally, if I delete the file via other means (awscli or in s3 browser), and recreate the file (by copying a local copy to s3 via awscli), it will then work without triggering an error.

Any help on what I'm doing wrong would be greatly appreciated.

Because of this last point it's a bit hard to reproduce, but if you need more details please let me know.

@poudro
Author

poudro commented May 20, 2020

I was able to reproduce it; it seems this occurs when some other user wrote the file.

If I comment out line 439 it works as expected.

@poudro poudro linked a pull request May 20, 2020 that will close this issue
@shadiakiki1986

shadiakiki1986 commented Jul 29, 2021

I ran into this issue and investigated a bit.

TL;DR: This behavior is intended by s3fs and was introduced as a fix to issue #17 in PR #21, found through the git blame here and explained in the docs here

It turns out that it's related to how S3 handles the directories that represent the path to the file. S3 is not really a filesystem: it doesn't create the directory structure leading to the file, it only stores the full path as a key. So if you upload something like s3://bucket/folder/file.txt using the aws cli, then /folder doesn't exist in itself, whereas on a regular filesystem (say, in my ubuntu terminal) it would.

For s3fs in particular, since it is part of PyFilesystem, it has to comply with how other filesystems work, so it does some extra work that would not be done if you were using the aws cli. For example, if you upload a file with s3fs to s3://bucket/folder/file.txt, it creates both /folder/ and /folder/file.txt. Consequently, if you read that same path after having created it with s3fs, the above issue won't happen: the obj.load (that you commented out in PR #71) is verifying that the /folder/ path exists, as would make sense in a normal filesystem. However, if you read that path after having uploaded it with the aws cli, for example, then the above issue will happen, because the aws cli doesn't create /folder/.

The s3fs way to handle this would be to use the dir_path argument in the S3FS constructor. Here is a full example to illustrate in code what I blabbered about in the prose above:

from fs_s3fs import S3FS

s3fs = S3FS('dolphicom')
assert s3fs.exists("mobysound.org-mirror-v20210704/workshops/5th_DCL_data_bottlenose.zip")
assert not s3fs.exists("mobysound.org-mirror-v20210704/workshops/")  # expected since s3 is not a regular filesystem
try:
    s3fs.getinfo("5th_DCL_data_bottlenose.zip")  # raises exception
    assert False
except Exception as e:
    print(f"Got exception: {e}")  # Got exception: resource '5th_DCL_data_bottlenose.zip' not found

s3fs = S3FS('dolphicom', dir_path="mobysound.org-mirror-v20210704/workshops/")
assert s3fs.exists("5th_DCL_data_bottlenose.zip")
s3fs.getinfo("5th_DCL_data_bottlenose.zip")  # does not raise an exception

As a last note, I would suggest adding "s3" to the issue title so that others facing the same issue can find this, and closing this issue as well as the PR, because I don't think that s3fs will change this behaviour.

@poudro
Author

poudro commented Nov 18, 2021

The issue I mention only happens when a different user uploads the file, whether via s3fs or the aws cli, even when all permissions are ok.

If the user who is trying to read it via s3fs is the one that did the upload, via either s3fs or the aws cli, it works fine.

So your explanation doesn't hold, since it does in fact work, but only sometimes, depending on who did the upload with the aws cli.

I'm pretty sure this is not intended behavior on s3fs's part, as sometimes it works and sometimes it doesn't, depending on who uploaded the file, even when the permissions are ok.

@geoffjukes
Contributor

@poudro - The explanation does hold. Let me explain it differently....

In a POSIX filesystem, the full file path always exists, so you can 'stat' every part of it. Taking "/temp/file.ext" as an example: /temp exists, and so does file.ext.

In object storage (such as S3) it is possible for /temp/file.ext to exist WITHOUT /temp/ existing, because "folders" aren't a thing. Objects use keys that look like folder paths, but are not. For /temp/ to "exist" it has to be created: quite literally, an empty object with the key /temp/.
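To illustrate (the bucket name is hypothetical, and this uses plain boto3 rather than s3fs), uploading an object under a prefix does not create an object for the prefix itself:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-bucket", Key="temp/file.ext", Body=b"data")

    s3.head_object(Bucket="my-bucket", Key="temp/file.ext")  # succeeds
    try:
        s3.head_object(Bucket="my-bucket", Key="temp/")      # no marker object was created for the "folder"
    except ClientError as err:
        print(err.response["Error"]["Code"])                 # 404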

All that said - I have the same issue (and gripe) with S3FS, because not all S3 clients create the /foldername/ objects, and I don't always have control over the bucket. If I glob or walk a bucket with this issue, I get errors, and don't have any good way to solve it.

I wish there was a way to tell S3FS to not 'stat' dirs for specific mounts....

@poudro
Author

poudro commented Mar 2, 2022

In that case, why does it work as expected sometimes and doesn't work other times?

To be clear, I'm not talking about the semantics of POSIX vs Object storage here (which I am familiar with), I'm talking about the fact that when a program "sometimes works and sometimes doesn't" it heavily smells of a bug somewhere.

My solution works for me; it's maybe not the best approach, and if someone has a better way around it, that's fine.

So to summarize again:

  • s3fs works as expected when the same user uploads the file (whether via s3fs, the AWS web interface, or the aws cli)
  • s3fs raises an exception when a different user uploaded the file, even though all the read permissions are set properly and reading via the AWS web interface and the aws cli works

To me this is not proper behavior, and it has nothing to do with POSIX vs object storage.

@geoffjukes
Contributor

@poudro Are you saying that files uploaded by 2 users, with the same key prefix, result in an error? e.g. /some/place/user1.txt vs /some/place/user2.txt

Or are the key prefixes different? e.g. /user1/file.ext vs /user2/file.ext

Are the users using the same software to upload/create the objects? Or different software? I'd like to try and recreate your issue, so some specifics would help me to do that.

@poudro
Author

poudro commented Mar 2, 2022

@geoffjukes here's the more detailed scenario:

  • There is a single file foo that contains bar
  • There are two users U1 and U2
  • U1 uploads the file to s3://bucket/path/foo (via s3fs, the AWS web interface, or the aws cli; all three result in the same outcome) and sets read permissions so both U1 and U2 can read it

In this scenario U1 can read the file via s3fs, but U2 will get a ResourceNotFound when accessing it via s3fs, even though U2 can successfully access the file via the AWS web interface and the aws cli.

@geoffjukes
Contributor

geoffjukes commented Mar 2, 2022

@poudro s3://bucket/path/foo does not imply that the object s3://bucket/path/ exists, which is what causes the error you are seeing.

When U1 creates the object at s3://bucket/path/foo, are they first creating s3://bucket/path (such as with makedirs), or just creating s3://bucket/path/foo directly? I know that the AWS web interface and the AWS CLI allow for direct object creation, without creating the intermediate objects for "faking" the directories - which, again, is the source of the issue you (and I) are experiencing.

Note that "seeing" a "directory" in the web interface, does not imply an object exists with that key.

Edit: Re-reading - Just confirming - U1 can read the object with S3FS, U2 cannot, both using the exact same code, but different accounts? Then maybe the permission issue is on the object at /path/ and not /path/foo.

@poudro
Author

poudro commented Mar 2, 2022

The permissions in my experiment were actually set at the bucket level, so it seems unlikely to me that it could be linked to permissions at the path level.

@geoffjukes
Contributor

geoffjukes commented Mar 2, 2022

It happens. I've experienced it myself many times.

EDIT: https://docs.aws.amazon.com/AmazonS3/latest/userguide/managing-acls.html

Bucket and object permissions are independent of each other. An object does not inherit the permissions from its bucket. For example, if you create a bucket and grant write access to a user, you can't access that user’s objects unless the user explicitly grants you access.
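If you want to narrow this down, here is a small diagnostic sketch (the bucket name and AWS profile names are hypothetical) that checks whether both the file key and the /path/ marker object are visible to each user:

    import boto3
    from botocore.exceptions import ClientError

    def can_head(profile, key):
        s3 = boto3.Session(profile_name=profile).client("s3")
        try:
            s3.head_object(Bucket="bucket", Key=key)
            return True
        except ClientError:  # 404 (missing) or 403 (forbidden)
            return False

    for profile in ("u1", "u2"):
        print(profile, can_head(profile, "path/foo"), can_head(profile, "path/"))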
