
Question around VECTORIZE BY #79

Open
Benoss opened this issue Apr 22, 2020 · 1 comment

Comments

Benoss commented Apr 22, 2020

First, I think your project is brilliant. I am using ClickHouse for a few projects, and your idea fits really well with what I wanted to do!

My question is about vectorDelta: is there a way to get the value of a column from n rows back? Maybe a function vectorData(metric, -5) could return the value of metric 5 rows back, and then vectorDelta could just be calculated as:

select
    key,
    timestamp,
    metric,
    vectorData(metric, -5) as metric_delta
from default.vector_example
vectorize by (key, timestamp)

Also, maybe vectorFind(metric, timestamp=yesterday) could return the value of metric for a given timestamp in the vector?

Real-world problems I am trying to solve: compare a value with last week's value, with the average of last week, with the average of the last 3 days, etc.

@jason-godden
Contributor

Hey @Benoss,

Thanks for your patience here - it has been a very busy few weeks with COVID.

I think a vectorFind or Filter method sounds great. Perhaps the ClickHouse combiner concept would be the right way to go?

vectorWindowAggIf(
   avg, metric,
   toStartOfDay(timestamp) = toStartOfDay(now() - interval 1 day))
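To make the combiner idea concrete, here is a minimal pure-Python sketch of what an "aggregate-if" over a vector would compute. This is illustrative only - vectorWindowAggIf is a proposal in this thread, not a shipped function, and the names below are mine:

```python
# Sketch of an "aggregate-if" over a vector: apply the aggregate function only
# to the values whose timestamp satisfies a predicate. Illustrative only.
from datetime import date

def vector_agg_if(agg, rows, predicate):
    """Aggregate the values of (timestamp, value) rows whose timestamp matches."""
    selected = [v for t, v in rows if predicate(t)]
    return agg(selected) if selected else None

def avg(xs):
    return sum(xs) / len(xs)

# Average of metric for one specific day, analogous to filtering on
# toStartOfDay(timestamp) = toStartOfDay(now() - interval 1 day)
rows = [(date(2019, 1, 1), 10), (date(2019, 1, 1), 30), (date(2019, 1, 2), 5)]
print(vector_agg_if(avg, rows, lambda t: t == date(2019, 1, 1)))  # 20.0
```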

There are vectorLag and vectorLead functions that should behave like the proposed vectorData, but perhaps I should add a new function that can move either forward or back and aligns with ClickHouse's neighbor terminology - vectorNeighbor? Let me know what you think.
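For reference, the lag/lead semantics can be sketched in plain Python over one key's sorted series (this only illustrates what the functions compute; Vulkn evaluates them inside ClickHouse, and the helper name here is illustrative):

```python
# Pure-Python sketch of lag/lead over one key's series sorted by timestamp.
def vector_lag(values, offset, default=None):
    """Position i holds values[i - offset]; offset > 0 looks back (lag),
    offset < 0 looks forward (lead). Out-of-range positions get `default`."""
    return [values[i - offset] if 0 <= i - offset < len(values) else default
            for i in range(len(values))]

metrics = [10, 12, 9, 15, 20]
print(vector_lag(metrics, 1))   # [None, 10, 12, 9, 15] - previous row's value
print(vector_lag(metrics, -1))  # [12, 9, 15, 20, None] - next row's value
```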

I had a play with the problem you noted - I don't think this will do exactly what you want, but it roughly achieves the comparison with last week's value, with the previous week's average, and with the last 3 days' average.

# Generate some data
key = (ArrayVector
    .range(1,1).cast('String').map('concat', String('device-')).take(24*60*60)
    .cache().alias('key'))
timestamp = (ArrayVector
    .rand(DateTime('2019-01-07 00:00:00'), DateTime('2019-03-01 23:59:59'), 24*60*60)
    .cache().alias('timestamp'))
metric = ArrayVector.rand(1, 8192, 24*60*60).cache().alias('metric')

data = v.table.fromVector('default.vector_example', 
    (key, timestamp, metric), engines.Memory(), replace=True)

# Aggregate values to a date and return a single key/series for examples

dt = v.q("""
    SELECT 
        key, 
        toDate(toStartOfDay(timestamp)) AS t_d, 
        sum(metric) AS metric
    FROM default.vector_example
    WHERE t_d >= '2019-01-07'
    GROUP BY key, t_d
    ORDER BY key, t_d
    LIMIT 1 BY key, t_d
    LIMIT 100
""").cache()

# For the relative date - grab the value 1 week ago, calculate the moving average
# for the whole of the previous week, and calculate the moving average for the
# last 3 days.

# Next, compare the relative difference of the current value to the calculated
# moving stats. Of course, we can also compare relative to future stats as well,
# depending on the WindowAgg parameters.

(dt
    .vectorizeBy('key', 't_d')
    .select(
        'key', 't_d', 'metric',
        funcs.vector.vectorLag(c('metric'), -7).alias('lag_1week'),
        funcs.vector.vectorWindowAgg(c('avg'), c('metric'), -7, -1).alias('mv_avg_prev_week'),
        funcs.vector.vectorWindowAgg(c('avg'), c('metric'), -3, -1).alias('mv_avg_last_3days'))
    .select(
        'key', 't_d', c('metric').alias('current_value'),
        c('metric/lag_1week').alias('change_vs_relative_day_1week'),
        c('metric/mv_avg_prev_week').alias('change_vs_avg_1week'),
        c('metric/mv_avg_last_3days').alias('change_vs_avg_last3days')
    )
).s
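The window-aggregate step above can be sketched in plain Python to show what is being compared (the function name is illustrative, not the Vulkn API): a moving average over a relative window [lo, hi] around each row, then the ratio of the current value to that average.

```python
# Sketch of a moving window aggregate over a sorted daily series.
def window_avg(values, lo, hi):
    """Average of values[i+lo .. i+hi] for each position i, clipped to the
    series bounds; None where the window is empty. (lo, hi) = (-7, -1) is the
    previous week's average, (-3, -1) is the last 3 days' average."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i + lo): max(0, i + hi + 1)]
        out.append(sum(window) / len(window) if window else None)
    return out

daily = [100, 110, 90, 120, 130, 80, 150, 140]
prev3 = window_avg(daily, -3, -1)
change_vs_avg_last3days = [v / a if a else None for v, a in zip(daily, prev3)]
```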

And thanks for the feedback on the project. This is still in a POC phase (it needs tests and fixes to brittle code). The actual frontend SQL introspection and rewriting will move to a dedicated high-performance backend server in C with a Python interlock that will sit alongside ClickHouse, so the same SQL queries/extensions can be orchestrated by the backend for ODBC and other clients too. Have a full roadmap of ideas here...
