Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement subqueries #4

Open
kashav opened this issue May 15, 2017 · 5 comments
Open

Implement subqueries #4

kashav opened this issue May 15, 2017 · 5 comments

Comments

@kashav
Copy link
Owner

kashav commented May 15, 2017

Introduce subqueries to be used in condition statements, example:

"SELECT name FROM foo WHERE name NOT IN (SELECT name FROM ../bar WHERE date < ...)"

This shouldn't be too hard to implement.

First step is to introduce an IN operator for strings/numbers. The operator will return true if current the value is in the provided list.

The subquery should be evaluated individually. It returns a []Result for all files that passed the query (a Result struct holds each of the applicable attributes for a single file). Now we can apply IN to this result list.

Idea from https://news.ycombinator.com/item?id=14344493.

@kashav
Copy link
Owner Author

kashav commented May 17, 2017

Work for this has begun on feature/subquery.

@kjhoerr
Copy link

kjhoerr commented May 17, 2017

Will this feature include the ability to reference values in superqueries? For instance,

"SELECT * FROM /foo AS foo WHERE name NOT IN (
  SELECT name FROM /bar AS bar WHERE foo.size < bar.size
)"

@kashav
Copy link
Owner Author

kashav commented May 17, 2017

@kjhoerr Just want to clarify that I've understood this correctly.

Since we traverse each "source" directory provided to the FROM clause, the reference will point to each file and/or subdirectory of the source that we're currently traversing.

Using the example that you provided, if the directory structure was:

.
├── foo
│   └── bar
│       └── baz
├── bar
│   └── baz.txt

The query would run the subquery three times, first with foo referring to ./foo, then with foo referring to ./foo/bar, and then with foo referring to ./foo/bar/baz.

And each subquery would traverse each file of ./bar and compare the size of said file (which'll be referred to as bar) with that of foo.

If this is indeed what you're suggesting, I like it and I don't mind including the feature (sounds like a fun implementation), but based on my initial ideas, this may be very expensive to run. That being said, there might be a better implementation that I haven't thought of yet.

@kjhoerr
Copy link

kjhoerr commented May 17, 2017

Yes, such that it's quite like the implementation of subqueries in other SQL languages. The subquery is run against each entry of the superquery, so naturally it's very expensive to run. Then you could run queries that could for instance check if any given file appears in all the given directories.

Here's a mockup of the query, since I'm not sure how you could reference the parent folder:

# check for files that appear in all user's directories - target random but specific user
SELECT *
  FROM /home/user1 AS u1
  WHERE NOT EXISTS (
    FROM /home/. AS u2 WHERE NOT EXISTS ( 
      FROM `u2`/* WHERE u1.name = u2.name
    )
  )

Where EXISTS just returns true or false depending on whether or not any rows are returned from the subquery. Logically, it checks if a home exists that does not contain that file - if so, the file isn't included in the query. (No homes that do not contain the file can exist.)

In that case, since it has to know each name of each file before running the subquery to check against other directories, and it does so for each file. So, yes, enourmously expensive 🤕 I'm not sure if there's any cheaper way to write this sort of query, though.

kashav added a commit that referenced this issue May 20, 2017
Subquery implementation
=======================

  - Evaluates subqueries as individual queries on parse step. Currently
    lets you do something like:

    SELECT
      all
    FROM
      ~/Desktop
    WHERE
      name IN (
        SELECT
          name
        FROM
          $GOPATH/src
        WHERE
          name = %.go) AND
      size > 2mb

    Some points for discussion:

      - No support for SELECTing and comparing more than one attribute in a
        subquery yet, I'm not really sure _how_ to implement this (or even
        if it's necessary).

  - Also implemented boilerplate for superquery/subquery references (as
    discussed in #4#issuecomment-302104954). Currently runs **terribly**
    though, not really sure if there's a way to improve performance. Also,
    haven't added support for substituting directory/file names yet.

IN operator (non-subquery)
==========================

  - Haven't decided if I'm going to keep this. Currently only works for `name`.
    Accepts a comma-separated list of strings and checks if filename is in list.

  - Example:

    SELECT
      all
    FROM
      .
    WHERE
      name IN [main.go, query.go]

Project layout changes
======================

  - Introduce individual packages for tokenizer and parser;
  - Move compare package to query.

Unit tests
==========

  - Adds tests for tokenizer and compare functions.
  - Some tokenizer tests are currently failing (reflect.DeepEqual is saying
    identical strings don't match, might be a type issue (interface{} vs.
    string)).

Bug fixes
=========

  - Minor: Fix less-than-equal raw text (was >= before, doesn't really change
    anything, since we're not using the raw text for anything).
@kashav
Copy link
Owner Author

kashav commented Jun 4, 2017

This was partially implemented in #27.

Basic subqueries are working as expected (see README.md#subqueries), e.g.

>>> SELECT all FROM . WHERE name IN (SELECT name FROM ../foo)

However, there's no support for referencing values between queries yet. The boilerplate is implemented though, I hope to invest more time in completing this in the coming days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants