⚠️ WIP: unfinished project. Shared for the value of the ideas presented here. Let us know if you're interested in teaming up or building on these ideas.
SQL All ⭐ Stars (or allstars) is a smart semantic layer that takes a new approach:
- mostly inferred: it looks at your physical schema and makes educated guesses about your semantics (joins, metrics, dimensions, hierarchies). Eventually it will also look at your usage patterns to infer more useful semantics.
- progressively enrichable: building on its inferences, you can curate, enrich and rewire things as needed at your own pace. It's useful on day 0, and becomes more useful as you extend the semantics.
- served as a virtual database: it exposes itself primarily as a simple database, as one or many collections of flat tables. This means it works with anything that can talk to a database. Under the hood, allstars takes your simple SQL against those wide tables and transpiles it into more complex SQL that does the joining, the union-ing and whatever else is needed to get to your data
- RESTful too: get all the JSONs you need here too, allowing deeper integrations beyond the virtual database
- 100% open: unlike a lot of semantic layers that are tightly coupled with some BI tool, this is 100% open and self-standing
```sql
SELECT customer_name, SUM(units_sold), SUM(pre_orders)
FROM ⭐
GROUP BY ⭐;
```
Something that's novel about SQL All ⭐ Stars is the idea of exposing this semantic layer primarily as a virtual database. This shows up as one or many wide tables exposing essentially all of the columns that you want to expose from your database. While you or your BI tool run very simple SQL against large tables, allstars transpiles the query into a more complex query against the underlying physical schema.
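To make that concrete, here's a sketch of the kind of rewrite that could happen (all physical table and column names below are made up for illustration):

```sql
-- what you or your BI tool write against the virtual table
SELECT customer_name, SUM(units_sold)
FROM ⭐
GROUP BY ⭐;

-- the kind of SQL allstars could generate against the physical schema
SELECT c.customer_name, SUM(s.units_sold)
FROM sales AS s
LEFT JOIN customers AS c ON c.customer_id = s.customer_id
GROUP BY c.customer_name;
```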
Without getting too deep into the mechanics, there's one large table to rule them all (the "superstar" ⭐ table), and one table per "query context", representing sets of tables that can be joined together to answer queries.
If you are familiar with dimensional modeling and star schemas, you can think of a "query context" as a specific star within a collection of star schemas. If you expose multiple star schemas within your semantic layer, you'll typically have one query context per star schema, and each one will be exposed as a virtual table.
Now about that "superstar" ⭐ table: what happens if you try to use metrics and dimensions that are unrelated (in different query contexts)? Well, SQL Allstars will simply return a clear error message like "The metric [inventory] cannot be joined with [customer.name]".
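For example (column names hypothetical), a query mixing objects from two unrelated query contexts fails fast instead of silently fanning out:

```sql
-- inventory metrics and customer attributes live in different query contexts
SELECT customer_name, SUM(inventory)
FROM ⭐
GROUP BY ⭐;
-- Error: The metric [inventory] cannot be joined with [customer.name]
```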
On top of these large tables, allstars exposes a set of metadata tables with all the semantics (metrics, dimensions, hierarchies, ...):
```sql
SELECT ⭐ FROM ⭐.metrics;
SELECT ⭐ FROM ⭐.dimensions;
SELECT ⭐ FROM ⭐.hierarchies;
SELECT ⭐ FROM ⭐.tables;
```
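To give a sense of the shape of these tables (column names here are purely illustrative, loosely mirroring the enrichment commands shown further down):

```sql
SELECT ⭐ FROM ⭐.metrics;
-- label         | expression               | relation
-- 'Order Count' | COUNT(DISTINCT order_id) | orders
-- 'Units Sold'  | SUM(units_sold)          | sales
```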
If you are curious as to how it works behind the scenes, the TLDR is that SQL Allstars is implemented as a Python dbapi driver that acts as a bit of a proxy in front of your database.
As a "transpiler with context" allstars can allow you to sprinkle some sugar in your SQL. Let's get ahead of ANSI SQL and implement things. Ideas to be implemented:
FROM ⭐
- straight from the superstar tableGROUP BY ⭐
- group by all dimensions referenced in the queryGROUP BY ⭐.DAY
- same as above, but truncate the one time column by day, WEEK, MONTH or other timeframe keywords- trailing commas aloud
- a nice hint system
- window function abstraction(?)
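A hypothetical query combining a few of these (dimension and metric names made up):

```sql
-- group by all referenced dimensions, truncating the time column to the day
SELECT
    order_date,
    customer_name,
    SUM(units_sold),  -- trailing comma allowed
FROM ⭐
GROUP BY ⭐.DAY;
```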
If the virtual database doesn't suit your needs, a RESTful API is also exposed, with comprehensive calls to extract and alter semantics, transpile SQL, sync different sources and targets, and much more.
allstars makes inferences based on your physical schema and naming conventions, and "compiles" all this to a static, deterministic model that can be pushed to the file system for source control, or to a database for driver retrieval.
Some expected workflows around creating, enriching and updating your semantics:
- first inference: where allstars extracts your physical models, infers your joins and populates dimensions and metrics
- mechanics enrichment: where a data engineer or analytics engineer adds new objects, adds join criteria that could not be detected, and creates some rich, relevant dimensions and metrics
- presentation enrichment: add nice labels and descriptions, and slap a folder structure around all your metrics, dimensions and hierarchies
- update: your schemas have evolved and new tables and columns were created; time to `allstars extract` the latest changes in your schema to keep things up-to-date
As you progressively make these changes, they are all "compiled", versioned and made static to serve predictable outcomes.
- the filesystem, as a set of human-readable yaml files
- API/DSL: if yaml is too verbose and you need something more dynamic, you can use the python objects directly, and compile to yaml
- a database, where the content is serialized into a `_meta` table
- a git repo+ref, so that the driver can load up directly from a URI
- a REST service
- s3://artifact.zip
In many cases you'll want the semantic layer to be deterministic and static. This is how semantic layers typically work.
Alternatively, you can run allstars in dynamic mode: point it to a database schema, let it learn the schema it's working with, and give it hints as to how to behave as it goes. In this mode, it'll look up the physical schema, infer possible joins, and even accept SQL-like commands to enrich the model, as in:
```sql
INSERT ('table1', 'table2', 'table1.id = table2.id') INTO ⭐.joins;
INSERT ('Order Count', 'COUNT(DISTINCT order_id)', 'orders') INTO ⭐.metrics;
```
allstars is NOT a service that lives somewhere in between your database and your application; it's a Python `dbapi` driver + `sqlalchemy` dialect. This means it's NOT an external service you need to launch, keep up and observe. It's meant to be installed and used as a simple database driver, and won't delay or buffer your queries.
What if I'm not in Python? Well, the hope is that a community develops and builds drivers for other languages that interact with the allstars REST service (which is written in Python).
It's also very likely that someone will want to host allstars as a proxy service to a database. There's probably an `allstar-dbproxy` project in the future of allstars.
As allstars becomes more popular, decoupling the specification from the Python implementation will become important, using something like protobufs to define the object schemas and share them across languages.
Opening up the hood, the semantic layer has the following internal representation ->
Relations are pointers to physical tables and views, including the underlying columns along with their data types. The semantic layer is essentially a fancy, rich map to get to these things, and SQL All ⭐ Stars keeps a full in-memory representation of all this to navigate it.
To complete the picture from the relations, we have a full map of which joins can be used against the underlying relations. Each join maps two relations, has a join type assigned (left, right, full outer, ...) and has its cardinality (one-to-one, one-to-many, many-to-one, ...).
When semantic layers get big, they become a large spaghetti plate of joins, and not all joins can be followed. The general goal is to start from a metric and join it to its dimensions while avoiding "fan traps", where the metric gets duplicated while following a one-to-many join.
Query contexts are effectively a collection of joins that can be used together safely to generate a query. In traditional dimensional modeling, where you have a collection of fact tables and dimension tables, you'll typically have one query context per fact table. In the case of a dimensional query where metrics from multiple fact tables are chosen along with shared dimensions, the semantic layer will resolve this by generating multiple queries, each against a single query context, and merging the results.
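A hypothetical sketch of that resolution, assuming `sales` and `inventory` are two fact tables sharing a `products` dimension (all names made up):

```sql
-- user query mixing metrics from two query contexts
SELECT product_name, SUM(units_sold), SUM(units_in_stock)
FROM ⭐
GROUP BY ⭐;

-- resolved as one query per query context, merged on the shared dimension
SELECT COALESCE(a.product_name, b.product_name) AS product_name,
       a.units_sold,
       b.units_in_stock
FROM (
    SELECT p.product_name, SUM(s.units_sold) AS units_sold
    FROM sales AS s
    JOIN products AS p ON p.product_id = s.product_id
    GROUP BY p.product_name
) AS a
FULL OUTER JOIN (
    SELECT p.product_name, SUM(i.units_in_stock) AS units_in_stock
    FROM inventory AS i
    JOIN products AS p ON p.product_id = i.product_id
    GROUP BY p.product_name
) AS b ON a.product_name = b.product_name;
```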
The semantic layer is essentially a menu of user-relatable objects that act as a map to a more complex, more abstract set of physical database objects. Like on a restaurant menu, all of these objects are clearly organized in sections and sub-sections (folders), have nice labels, and longer descriptions.
A metric is a simple aggregate SQL expression against one or many relations. It will typically point to a single relation but can in some cases span across relations and require a join to compute. Note that metrics can reference other metrics, in which case they become a more complex metric that inherits the underlying relations.
A dimension is a simple, non-aggregate SQL expression against one or many relations. Like metrics, they are typically built on top of a single relation, but not always.
Note that we made the decision to use the term dimension even though it can be confused with the idea of a dimension table in dimensional modeling, where it refers to a collection of attributes sharing the same atomicity. Here a dimension is a dimension of the metric, and typically represents a single "attribute", not a collection of attributes.
Hierarchies are linear collections of dimensions where we want to enable "drilling" or "zooming" operations in the BI tool. Think of a geospatial hierarchy pointing from `Customer Zone` -> `Customer Country` -> `Customer Region` -> `Customer City`.
Multiple alternate hierarchies can be defined for the same dimensions, and while they generally should be thought of as many-to-one as you go down the hierarchy, this is not strictly enforced in SQL All ⭐ Stars, as you may want to enable your users to drill "across" dimensions that don't have pure many-to-one relationships.
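What drilling could look like from the BI tool's side, assuming the geospatial hierarchy above (dimension names hypothetical):

```sql
-- start near the top of the hierarchy
SELECT customer_country, SUM(units_sold)
FROM ⭐
GROUP BY ⭐;

-- one level down: swap in the next dimension, filter on the selected member
SELECT customer_region, SUM(units_sold)
FROM ⭐
WHERE customer_country = 'Canada'
GROUP BY ⭐;
```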
Folders can be used to structure how the objects defined here are organized and presented to the user. Each folder has a key, label and description.
Filters are simple, reusable, labeled, documented filter expressions. This is not commonly used and is generally reserved for filters that involve complex logic (think subquery) and get used and reused.
Database schemas are full of semantics, and data modeling and naming conventions augment these semantics even more.
- if you have two tables in a given schema, and one of them has `customer_id` as its PRIMARY KEY and the other table happens to have a `customer_id` column, it's safe to infer you can join the two tables on that PK
- even better, if a `FOREIGN KEY` constraint is defined, you can even more safely infer that the join stipulated in the physical layer is safe to use
- less clearly, if one table called `customer` has an `id` column, and some other table has a `customer_id`, you can pretty much guess that `customer.id = other_table.customer_id` is valid
- looking at numerical column names, it may make sense to create some metrics like `SUM(amount)` or `SUM(quantity)` (see the sketch after this list)
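A small illustration of those heuristics on a made-up schema:

```sql
-- given a physical schema like this...
CREATE TABLE customer (
    id INT PRIMARY KEY,
    customer_name VARCHAR(255)
);
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customer (id),
    amount DECIMAL(10, 2),
    quantity INT
);

-- ...allstars could infer the join customer.id = orders.customer_id
-- (from the FOREIGN KEY, or from naming conventions alone), plus
-- candidate metrics like SUM(amount) and SUM(quantity)
```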
[optional but interesting, maybe a V4 kinda thing] Interestingly, if you have a corpus of SQL, say a database log of user queries or SQL generated by other BI tools, it could in theory be parsed to infer patterns that answer:
- which tables can be joined and how
- which metrics are commonly used (look for aggregate expressions!)
- commonly used filters
The tool could have a feature that reads through valid SQL and suggests augmentations of the semantic layer based on query patterns. Probably AI-based.
- Business metadata is not vital, and often takes a moment to settle; let's put in pretty labels, pretty descriptions and fancy folder structures when we get there
- Hierarchies are nice but we can do without as we start
- Things like "List of Values", informing how a dropdown for a given dimension should be populated for instance, are nice-to-haves; we don't need them on day 0
- Aggregate awareness semantics aren’t vital
- …
- the driver has the whole semantic model in-memory
- when querying a virtual table like `FROM ⭐`, we read the SQL and transpile it into proper SQL that can run against the physical model
- when querying metadata, we serve the model straight from memory
```sql
CREATE TABLE _meta (
    git_sha STRING,
    object_type STRING,
    object_id STRING,
    json_blob STRING
);
```
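A sketch of how the driver might retrieve a compiled model from that table (pinning by `git_sha` is an assumption here):

```sql
-- fetch every serialized object for a specific compiled version of the model
SELECT object_type, object_id, json_blob
FROM _meta
WHERE git_sha = 'abc1234';
```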
Proposed implementation is in Python, as a `dbapi` driver + `sqlalchemy` dialect.
- an in-memory cache of the whole semantic layer seems important, maybe it can bootstrap
- ability to transpile from simple ANSI SQL to any dialect (using something like the `sqlglot` library)
See some examples here -> https://github.com/preset-io/allstars/tree/master/examples/
Writing down some useful things
```bash
# read the physical schema, make inferences, merge the enrichments
# generate the compiled/ folder
$ semantic_layer_tool compile env=dev

# push the semantic layer to a given database
$ semantic_layer_tool push env=dev
```
- a way to auto-document the semantic layer, + automated ERD diagrams
- AI-powered inference tool
- List-of-values caching (or redirecting) to offload the expensive `SELECT DISTINCT` queries as BI tools populate their filter lists