Influxdb: Mathematics across measurements

Created on 4 Aug 2015  ·  90Comments  ·  Source: influxdata/influxdb

Apologies if this is a duplicate, I had a look and couldn't see a relevant issue.

I can see from the documentation how to select from multiple measurements (although it calls them series, still, at https://influxdb.com/docs/v0.9/query_language/data_exploration.html )

For example, with data inserted by telegraf, you can do:
select * from disk_used,disk_total where host = 'myhostname' and path = '/'

How would you express that as a percentage? I've tried variations of the following, and none seem to work:

select disk_used.value/disk_total.value from disk_used, disk_total where host = 'myhostname' and path='/'

The "mydb"."retentionpolicy"."measurement" syntax doesn't work there, either.

Is it a good idea to add aggregation functions for cases like diff(value1, value2) from m1, m2 and divide(value, value) from m1, m2, or should the arithmetic operators be working?

Also, I noticed when experimenting that it's also not possible to divide one derivative by another. For example, if I have two counters, bytes transferred and api calls made - both of which are constantly going up - how would you calculate the mean bytes per api call?

1.x areinfluxql flutriaged kinfeature-request

Most helpful comment

Mathematics Across Measurements

A series is defined as a single field organized by a collection of tags
contained within a measurement. When you write multiple fields using the
line protocol, this writes multiple series to the measurement with the
same set of tags.

When selecting from a single measurement and selecting from multiple
fields, an implicit join is done on the time of each point in the
series for the fields that are requested.

Ideal Behavior

At the moment, when multiple measurements are used, they are not joined
together and are instead listed one by one. Ideally, measurements would
automatically be merged and look like this.

Ideally, measurements would automatically be joined on time for each of
these measurements. For example, with this, you might get:

cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20

> SELECT * FROM cpu, mem
name: cpu
time    cpu.value   mem.value
----    ---------   ---------
0       1           4
10      2           5
20      3           6

This is how it should look and how it should return. While the name is
included, it mostly doesn't mean anything when the query returns like
this.

Grouping Behavior

In the existing system, series can be grouped into multiple series
instead of just a single series by using GROUP BY <tag>. We will still
continue to support this and this will continue to be an important way
of organizing series.

So if I have the following:

cpu,host=server01,region=useast value=1 0
cpu,host=server02,region=useast value=2 0
mem,host=server01,region=useast value=3 0
mem,host=server02,region=uswest value=4 0

> SELECT * FROM cpu, mem GROUP BY host
name: cpu
tags: host=server01
time    cpu.value   mem.value
----    ---------   ---------
0       1           3

name: cpu
tags: host=server02
time    cpu.value   mem.value
----    ---------   ---------
0       2           4

The series that have the same host are grouped together. If you want to
match a differently named tag, then you may have issues since I don't
think we'll support that for the initial iteration of this feature. We
would need to provide some way to tell the query that you want to
interpret one of the tags as a different tag and it is likely not worth
trying to get that syntax correct if people don't need that
functionality.

As can be seen from this example, the measurements from 'server02' do
not match the same server. One is from region 'uswest' and one is from
'useast'. Because we are not grouping by region, these can be matched
together. If you wanted the query to match the region too, you would do
this and get the following output:

> SELECT * FROM cpu, mem GROUP BY host, region
name: cpu
tags: host=server01,region=useast
time    cpu.value   mem.value
----    ---------   ---------
0       1           3

name: cpu
tags: host=server02,region=useast
time    cpu.value   mem.value
----    ---------   ---------
0       2

name: cpu
tags: host=server02,region=uswest
time    cpu.value   mem.value
----    ---------   ---------
0                   4

Notice how the measurement name is still cpu. When using a join, the
measurement name is frankly irrelevant and should be discarded, but we
aren't going to because a measurement name is required and is used in
some locations.

Be aware that if you GROUP BY * it will include all possible tags.
That means if you have one measurement with 3 tags and one with 2 tags,
they may never match because the measurements with 2 tags will have an
empty value for the third tag and will never match.

Old Behavior

Instead of these being joined, they will be returned as separate series
and the measurement name is considered important. Using this paradigm,
it's not really possible to join the measurements in a way that makes
sense.

cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20

> SELECT * FROM cpu, mem
name: cpu
time    value
----    -----
0       1
10      2
20      3

name: mem
time    value
----    -----
0       4
10      5
20      6

Proposed Syntax

The new syntax will force the ideal behavior to occur, but we cannot
break backwards compatibility. If this proves to be a good output
format, we may consider using this format in the future.

For the initial implementation, we need to implement OUTER and INNER
joins, although it's unclear if we need to use those words. Kapacitor
implements them implicitly by using INNER if fill(none) is specified
and using OUTER for any other fill option. Since we default to
fill(null), I think we should default to an OUTER join and do the same
thing as Kapacitor.

SELECT * FROM cpu, mem JOIN MEASUREMENTS

JOIN would be a new token, but MEASUREMENTS is a currently existing
token. Rather than try to do something like this:

SELECT * FROM cpu JOIN mem

I am proposing we use something more friendly to the old syntax. This
way, it is more natural to include support for regexes and multiple
measurements. For three measurements, I think this is awkward:

SELECT * FROM cpu JOIN mem JOIN gpu

With a regex, it also doesn't match since the implication is that each
entry refers to a single measurement.

# Does this join cpu and gpu or not?
SELECT * FROM /[cg]pu]/ JOIN mem

The proposed syntax takes away that ambiguity by telling the query to
join all of the measurements within the measurements list.

Consequences to the Existing System

This will fix math for even within the same measurement. Right now,
there are no particular rules for how these measurements are joined
together to perform intra-measurement mathematics. That means if you try
to do math on multiple fields within the same measurement, they can get
out of sync _very_ easily and there is no way to control how these
behave.

This will also implement the feature requested in #7216 since
intra-measurement joins will begin to start using the JOIN functionality
for their behavior. This is an additive change and is not a breaking
change because the default fill option is null which is what it
currently does. This will just allow a person to customize this.

The existing return value for multiple measurements will not be changed
since that would be a breaking change. Users will have to use the new
syntax for telling InfluxDB to perform the join for measurements rather
than just within a measurement.

Additional Considerations

Tolerance Margins

Kapacitor implements tolerance margins so that if two points are close
enough in timestamps, they will be joined in the case that their
timestamps are within the tolerance rather than requiring the timestamps
to be exact. Additional points will just follow the same rules that were
set by fill.

Field name conflicts

When a field name conflicts between multiple measurements, which should
be used or should both be used? With standard SQL, the query will return
an error saying that it is unable to resolve the ambiguity between the
two. For conditions, this may be necessary since we don't want to guess
which value they intended in a situation like this:

SELECT * FROM cpu, mem WHERE value > 2

We could transform this to one of the following though:

SELECT * FROM cpu, mem WHERE cpu.value > 2 AND mem.value > 2
SELECT * FROM cpu, mem WHERE cpu.value > 2 OR mem.value > 2

I'm not sure either of these would be proper though and there wouldn't
really be a way for us to determine which of these the user intends. But
when we use a regex, we expand it similar to how we expand wildcards. We
also expand wildcards. If we have a conflict, it would be possible to
transform the following:

SELECT value FROM cpu, mem

Into this statement:

SELECT cpu.value, mem.value FROM cpu, mem

This would be the same as if we had done this with the original.

SELECT /^value$/ FROM cpu, mem

But is this a behavior that the user would expect? That's less clear.
And since this behavior would be different from SQL, it's not clear that
it's something a user would expect.

I would be tempted to make this an error for the initial implementation
and then we can expand it later, but some feedback on what others think
would be very helpful here.

Unused measurements

Testing with SQL, SQL does not seem to care if you use an implicit JOIN
and don't use one of the values inside of the JOIN. Because of that,
we're just not going to care either.

Example:

sqlite> .headers on
sqlite> CREATE TABLE a(id integer, host varchar(256));
sqlite> CREATE TABLE b(id integer, region varchar(256));
sqlite> INSERT INTO a VALUES(1, 'server01');
sqlite> INSERT INTO b VALUES(2, 'useast');
sqlite> SELECT a.id FROM a, b;
id
1
sqlite> SELECT * FROM a, b;
id|host|id|host
1|server01|2|useast

This seems like an easy enough pattern and, since it has precedent,
seems like something we should do.

Please leave any feedback you have about the proposed syntax and
behavior. If you like the proposal and think it works for your use case,
please leave a reaction. No :+1: comments since that will make
discussion harder.

All 90 comments

:+1:
I work with sensor networks and find this limitation frustrating. For example, I wish to compute weighted averages like this:
SELECT sum(oxygen_percentage.value * flow_rate.value) / sum(flow_rate.value) FROM oxygen_percentage, flow_rate WHERE site_id = '3'
But InfluxDB returns nothing. Even SELECT oxygen_percentage.value FROM oxygen_percentage doesn't work. Using 0.9.3-rc1 master (0163945).

Same here. I'd also like to calculate values across different series like:

select * from mysql_value where type='mysql_commands' and type_instance='show_tables' +
select * from mysql_value where type='mysql_commands' and type_instance='show_databases'

Cheers,
Szop

same as @hexluthor, I feel this is very limiting: if we need to correlate data coming from various sensors we currently have to write all data as fields in the same measurement... But would it be a good idea in terms of data structure to have a single measurement with more than 50 fields? Will it impact query performance?
And this sensor data does not always get logged with the same sampling frequency, so this is not always possible to combine data in the same measurement if we want to keep data with high sampling frequency.

I'm not comfortable with distorting the data structure (dropping natural data organization) because of technical limitations.
In the sysadmin world, it would be like putting all the cpu, ram, disk, and apache response time metrics in the same measurement for the sole purpose of being able to correlate apache response time with cpu, ram, or disk metrics.

Also, what are the actual technical issues that prevent InfluxDB to support queries with simple math operations across measurements?

This was recently changed to a "feature request" so that means it will be evaluated in future releases if we are going to add it or not. There are a couple work arounds right now, and that is to save a calculated field when you write data, such as storing another field for oxygen_percentage.value * flow_rate.value. I understand this isn't ideal, but it might get you moving forward.

Otherwise, I think these requests are sane, but they will take some work. I believe sum() / sum() is supposed to work already, but I thought I remember seeing a bug about math still not behaving properly.

@corylanou about the work around you're talking about: the oxygen_percentage.value * flow_rate.value field should be created when new points are created or is there a way to compute the calculation afterwards in a continuous query?

Yes, I believe you should be able to do that in a CQ and then you can select from that retention policy.

How can we do it in a continuous query? I thought the syntax of normal queries and continuous ones was the same, so if it's possible in one, it should be possible in the other.

instead of sum(value & value), you are doing a CQ with select val * val as newval and then you can select sum(newval) from your new data that was calculated from a CQ.

And that works across measurements? Using @bbinet's example, this would work?

select oxygen_percentage.value * flow_rate.value as newmeasurement from oxygen_percentage, flow_rate 

Hmm, it should, but I just tried this basic test and it crashed the server :cry:

> create database math
> use math
Using database math
> insert mul a=1,b=2
> select * from mul
name: mul
---------
time                            a       b
2015-09-21T12:17:36.377625368Z  1       2

> select a*b as c from mul
ERR: Get http://localhost:8086/query?db=math&q=select+a%2Ab+as+c+from+mul: EOF

I logged another issue here: https://github.com/influxdb/influxdb/issues/4183

and that was only from one measurement :)

Hopefully this is a central bug in our post-processing that when fixed will fix all of it. I'll see if I can fix it today. It appears to be just a bad reference while putting the math together, so it might be a quick fix.

Thanks @corylanou, but as @srfraser said in his previous comment, your example comes from the same measurement: is it supposed to work with multiple measurements?
I thought that queries running as continuous were the same as normal queries so if maths does not work across multiple measurements in a normal query, I thought it won't work neither in a continuous query.
Is it wrong?

Ah, yes, I keep forgetting we don't calculate across values. Although in a simple query we should support this. The biggest problem is type checking and overflow so that when you take an unsigned int and multiple it by a float, etc. that we are able to properly convert to a common type for the math, and not overflow either.

Ok, I see.
That would be great if cross measurements calculation could be possible at least for series which shares the same type (since no type conversion would be needed)

+1 We REALLY want this for our use-case!

+1, really missing this feature.

+1

+1

+1

:+1:

:+1:

:+1:

+1

+1

+1 thought I was going crazy, but this is a pretty substantial omission that might mean I've got to use another project instead of influx. Many times there is just no way to get correlated information into the same measurement. Even after an arduous journey with CQs, I only found that tags aren't included in CQ writes so there is no way to even fan-in with multiple CQs. Why were the MERGE and JOIN features from 0.8 dropped without there being a replacement? With the 0.9 documentation recommending the optimal way to structure things is to have many series and a single field named “value” (or some other key of your choice) used consistently across all series. and there apparently being no way to migrate from that kind of structure to the sort recommended at https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/ I'm worried we're left hanging.

A viable CQ approach would be OK, but it is a lot more work than simply joining time-grouped measurements at query time, the way that influx used to work.

@graphex not that it solves the main problem, but tags are in fact included in CQ writes if the CQs have something like group by time(30m), * in them.

+1

:+1:

+1

The proposed syntax above likely won't work since it conflicts with another potential query.

> insert cpu value.host=2
> select * from cpu
name: cpu
---------
time                    value.host
1460469115659777269     2

This seems to currently be a valid query. @pauldix any ideas what syntax we should use for this kind of feature?

I'm also looking for a way to do math between two measurements.

I've got one measurement for Volts, another for Amps. The data is being provided by two different pieces of equipment. I would like to multiply the Volts value with the Amps value (time correlated) to get a calculated Watts value.

+1

:+1:

+1 Really need this as not having this is great miss.

+1

+1

+1

+1

+1

+1

+1

Can I request that this issue be locked? I'd like to receive notifications about actual updates, and not just random "+1"s.

Done. If you are a person interested in this issue, please add a 👍 reaction to the top of the message instead of a +1 comment and then click "Subscribe".

I am unlocking this to continue meaningful discussion on the issue. _Please refrain from adding meaningless +1's_. This is a high traffic issue with a lot of subscribers. If you want to express your approval for the feature, please use a reaction at the top of the issue.

Unfortunately, locking an issue also locks GitHub reactions. I did not know that when I locked it.

Hi,

I am currently using Grafana and InfluxDB for monitoring purposes. I have two measurements.
Measurement 1 : Domain,Available Capacity, Threshold
Measurement 2 : Domain, Peak TPS
My use case is plot graph if the Peak TPS exceeds Threshold. Here i am dealing with two measurements. Can you please suggest how can i use data from two measurements to plot the graph when the condition is satisfies.(Peak TPS > Threshold)

I have a web service which calls multiple downstream services on each request (the services called may change based on the request). I have various timing measurements in my application's components: cache put/get, call time for each downstream service, etc.

The ability to perform mathematics across these measurements would enable rich and sophisticated graphs to expose data such as what % of the request is spent in each timing component while allowing the measurements to still be independent of each other (I don't want to include all timing metrics as fields in a "request_timings" measurement because some timings are independent of the web request - for example, Redis/cache timing metrics are not just used per request but by other application components).

More importantly this enables alarming on arbitrary "calculated" or "derived" measurements which is extremely useful for creating precise, unambiguous alarms.

This feature is extremely important in time series market data for stocks and in other financial systems. Consider this simple example, you have market data tick with say price and size. You want to derive notional value across all ticks something like (price * size). This seems infeasible in current setup. Also, joining on timestamps and tags, could be error prone.

The current schema seems to well only for independent measurements like sensors or cpu etc.

As others have mentioned, this is a highly requested and important feature. I use influx for recording metrics from my house and home server lab, of of the things I track is whole-house energy consumption. I capture wattage and Voltage coming off the meter, which does fluctuate. Currently in grafana, I'm simply dividing currant by 124 (average voltage rating I get) to get the currant in amps, but since voltage fluctuates between 120 and 125 or so, I'd like my current rating to be more accurate by being able to divide the two series. Graphite has this, can we please get this added?

IANAIE (Influx Employee) but IIUC this feature is especially tricky for Influx compared to most TSDBs because there's no defined time resolution. So if you're trying to join a series at 13-second resolution with one at 19-second resolution, You're Gonna Have a Bad Time (prime numbers chosen to accentuate pain).

Now, if it happens that you know for an ironclad fact that two of your series have exactly the same time granularity and cadence-- in this limited case it's safe and relatively easy to do a join. But Influx doesn't carry this kind of metadata, or constrain the points that you write to any particular resolution.

So, in the general case, in order to make progress you have to invent a whole DSL for deciding stuff like how to dedupe fields and combine values when two points snap to the same rounded timestamp.

I think a good way to start cutting into this would be to support joins between two members of the set { subqueries, measurements derived from a CQ } that have the same "group by" criteria (tags and, critically, time).

It would be nice if simple expressions could be used to adapt the tag values and paper over syntactic mismatches.

It would also be nice if you could collapse any aggregates that appear on both sides, e.g. "(foo.A + foo.B) as APlusB", "avg(foo.A, foo.B)", etc.

Mathematics Across Measurements

A series is defined as a single field organized by a collection of tags
contained within a measurement. When you write multiple fields using the
line protocol, this writes multiple series to the measurement with the
same set of tags.

When selecting from a single measurement and selecting from multiple
fields, an implicit join is done on the time of each point in the
series for the fields that are requested.

Ideal Behavior

At the moment, when multiple measurements are used, they are not joined
together and are instead listed one by one. Ideally, measurements would
automatically be merged and look like this.

Ideally, measurements would automatically be joined on time for each of
these measurements. For example, with this, you might get:

cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20

> SELECT * FROM cpu, mem
name: cpu
time    cpu.value   mem.value
----    ---------   ---------
0       1           4
10      2           5
20      3           6

This is how it should look and how it should return. While the name is
included, it mostly doesn't mean anything when the query returns like
this.

Grouping Behavior

In the existing system, series can be grouped into multiple series
instead of just a single series by using GROUP BY <tag>. We will still
continue to support this and this will continue to be an important way
of organizing series.

So if I have the following:

cpu,host=server01,region=useast value=1 0
cpu,host=server02,region=useast value=2 0
mem,host=server01,region=useast value=3 0
mem,host=server02,region=uswest value=4 0

> SELECT * FROM cpu, mem GROUP BY host
name: cpu
tags: host=server01
time    cpu.value   mem.value
----    ---------   ---------
0       1           3

name: cpu
tags: host=server02
time    cpu.value   mem.value
----    ---------   ---------
0       2           4

The series that have the same host are grouped together. If you want to
match a differently named tag, then you may have issues since I don't
think we'll support that for the initial iteration of this feature. We
would need to provide some way to tell the query that you want to
interpret one of the tags as a different tag and it is likely not worth
trying to get that syntax correct if people don't need that
functionality.

As can be seen from this example, the measurements from 'server02' do
not match the same server. One is from region 'uswest' and one is from
'useast'. Because we are not grouping by region, these can be matched
together. If you wanted the query to match the region too, you would do
this and get the following output:

> SELECT * FROM cpu, mem GROUP BY host, region
name: cpu
tags: host=server01,region=useast
time    cpu.value   mem.value
----    ---------   ---------
0       1           3

name: cpu
tags: host=server02,region=useast
time    cpu.value   mem.value
----    ---------   ---------
0       2

name: cpu
tags: host=server02,region=uswest
time    cpu.value   mem.value
----    ---------   ---------
0                   4

Notice how the measurement name is still cpu. When using a join, the
measurement name is frankly irrelevant and should be discarded, but we
aren't going to because a measurement name is required and is used in
some locations.

Be aware that if you GROUP BY * it will include all possible tags.
That means if you have one measurement with 3 tags and one with 2 tags,
they may never match because the measurements with 2 tags will have an
empty value for the third tag and will never match.

Old Behavior

Instead of these being joined, they will be returned as separate series
and the measurement name is considered important. Using this paradigm,
it's not really possible to join the measurements in a way that makes
sense.

cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20

> SELECT * FROM cpu, mem
name: cpu
time    value
----    -----
0       1
10      2
20      3

name: mem
time    value
----    -----
0       4
10      5
20      6

Proposed Syntax

The new syntax will force the ideal behavior to occur, but we cannot
break backwards compatibility. If this proves to be a good output
format, we may consider using this format in the future.

For the initial implementation, we need to implement OUTER and INNER
joins, although it's unclear if we need to use those words. Kapacitor
implements them implicitly by using INNER if fill(none) is specified
and using OUTER for any other fill option. Since we default to
fill(null), I think we should default to an OUTER join and do the same
thing as Kapacitor.

SELECT * FROM cpu, mem JOIN MEASUREMENTS

JOIN would be a new token, but MEASUREMENTS is a currently existing
token. Rather than try to do something like this:

SELECT * FROM cpu JOIN mem

I am proposing we use something more friendly to the old syntax. This
way, it is more natural to include support for regexes and multiple
measurements. For three measurements, I think this is awkward:

SELECT * FROM cpu JOIN mem JOIN gpu

With a regex, it also doesn't match since the implication is that each
entry refers to a single measurement.

# Does this join cpu and gpu or not?
SELECT * FROM /[cg]pu]/ JOIN mem

The proposed syntax takes away that ambiguity by telling the query to
join all of the measurements within the measurements list.

Consequences to the Existing System

This will fix math for even within the same measurement. Right now,
there are no particular rules for how these measurements are joined
together to perform intra-measurement mathematics. That means if you try
to do math on multiple fields within the same measurement, they can get
out of sync _very_ easily and there is no way to control how these
behave.

This will also implement the feature requested in #7216 since
intra-measurement joins will begin to start using the JOIN functionality
for their behavior. This is an additive change and is not a breaking
change because the default fill option is null which is what it
currently does. This will just allow a person to customize this.

The existing return value for multiple measurements will not be changed
since that would be a breaking change. Users will have to use the new
syntax for telling InfluxDB to perform the join for measurements rather
than just within a measurement.

Additional Considerations

Tolerance Margins

Kapacitor implements tolerance margins so that if two points are close
enough in timestamps, they will be joined in the case that their
timestamps are within the tolerance rather than requiring the timestamps
to be exact. Additional points will just follow the same rules that were
set by fill.

Field name conflicts

When a field name conflicts between multiple measurements, which should
be used or should both be used? With standard SQL, the query will return
an error saying that it is unable to resolve the ambiguity between the
two. For conditions, this may be necessary since we don't want to guess
which value they intended in a situation like this:

SELECT * FROM cpu, mem WHERE value > 2

We could transform this to one of the following though:

SELECT * FROM cpu, mem WHERE cpu.value > 2 AND mem.value > 2
SELECT * FROM cpu, mem WHERE cpu.value > 2 OR mem.value > 2

I'm not sure either of these would be proper though and there wouldn't
really be a way for us to determine which of these the user intends. But
when we use a regex, we expand it similar to how we expand wildcards. We
also expand wildcards. If we have a conflict, it would be possible to
transform the following:

SELECT value FROM cpu, mem

Into this statement:

SELECT cpu.value, mem.value FROM cpu, mem

This would be the same as if we had done this with the original.

SELECT /^value$/ FROM cpu, mem

But is this a behavior that the user would expect? That's less clear.
And since this behavior would be different from SQL, it's not clear that
it's something a user would expect.

I would be tempted to make this an error for the initial implementation
and then we can expand it later, but some feedback on what others think
would be very helpful here.

Unused measurements

Testing with SQL, SQL does not seem to care if you use an implicit JOIN
and don't use one of the values inside of the JOIN. Because of that,
we're just not going to care either.

Example:

sqlite> .headers on
sqlite> CREATE TABLE a(id integer, host varchar(256));
sqlite> CREATE TABLE b(id integer, region varchar(256));
sqlite> INSERT INTO a VALUES(1, 'server01');
sqlite> INSERT INTO b VALUES(2, 'useast');
sqlite> SELECT a.id FROM a, b;
id
1
sqlite> SELECT * FROM a, b;
id|host|id|host
1|server01|2|useast

This seems like an easy enough pattern and, since it has precedent,
seems like something we should do.

Please leave any feedback you have about the proposed syntax and
behavior. If you like the proposal and think it works for your use case,
please leave a reaction. No :+1: comments since that will make
discussion harder.

I updated the above comment to include information about how grouping works. This is important because it determines how to get similar series to match with each other which would be very important for this feature.

Notice how the measurement name is still cpu. When using a join, the
measurement name is frankly irrelevant and should be discarded, but we
aren't going to because a measurement name is required and is used in
some locations.

Perhaps we could create a new measurement name by concatenating the measurements from the SELECT clause? E.g.

> SELECT * FROM cpu, mem GROUP BY host
name: cpu|mem
tags: host=server01
time    cpu.value   mem.value
----    ---------   ---------
0       1           3

I'd advocate for using a rare character as the join, rather than - or _. That will make it somewhat less ambiguous

The proposed syntax takes away that ambiguity by telling the query to
join all of the measurements within the measurements list.

Works for me. Also removes any worry about OUTER vs INNER since the user never specifies those.

We could transform this to one of the following though:

SELECT * FROM cpu, mem WHERE cpu.value > 2 AND mem.value > 2
SELECT * FROM cpu, mem WHERE cpu.value > 2 OR mem.value > 2

In standard SQL users are quite used to using table.column to return values in queries with more than one table. I think for InfluxQL, a similar syntax works well. If a field or tag that is NOT part of the join is used, then the user must specify the parent measurement.

Tolerance Margins

I do think we should include these, but I would be fine with waiting for a future release. The initial implementation can join only on explicit timestamp matches. Let's debug the rest of the syntax and behavior before letting users create 1h tolerance joins on data.

Question: What does it mean to do something like:

SELECT * FROM cpu, cpu GROUP BY host

I personally don't like the idea of concatenating the measurement names. I'm not convinced measurement names should even exist anymore. I noticed Grafana will concatenate the measurement name and the column name and I think we could probably get that working so they ignored the measurement name if it was empty and just used the column. When we start returning the measurement name in the column output, I think that's probably preferable. Although there is a bit of a question. If I do this:

SELECT mean(cpu.value) AS mean_value FROM cpu

Should the column be cpu.mean_value or mean_value?

I think for something like:

SELECT * FROM cpu, cpu GROUP BY host

That would be rewritten to:

SELECT cpu.value, cpu.value FROM cpu GROUP BY host

Although we still don't have the best idea of what happens when you combine the same measurement from multiple databases/retention policies. It seems to me like the measurement should be tied to the field rather than being placed into the FROM clause. I'm not entirely sure how to do that syntactically, but having the FROM come at the end is syntactically ambiguous in these cases. I've been wondering if it should be rewritten like this:

SELECT db0.autogen.cpu.value FROM db0.autogen.cpu

I think it might be cleaner and a bit more flexible to use a slightly different syntax. Why not make join a function call? So you can do:

SELECT value
FROM join(0, mem, cpu), foo

-- or
SELECT value
FROM join(/foo.*/)

Then you could have multiple measurements returned. You'd also be able to specify the tolerance as the first argument. The function join could be an outer join and then later if we want to add inner we'd just add that as function inner_join.

I agree that if the user has something in the WHERE clause that matches against a field, it should return an error like SQL does if it's ambiguous.

I think the issue with doing it like that is you would then be mixing two different types of output for two queries that have to work completely differently. What would the output of this example return?

SELECT value FROM join(0, mem, cpu), foo

We have the tolerance in there, which should be an optional parameter since many people probably won't care too much about it, and then we have two measurements. Makes sense, but then we have another measurement attached to the end. We also have only one variable and it is ambiguous who it refers to for the join, but makes complete sense for foo.

So the mechanics of join isn't really that of a function. It's not a simple transformation of data within the query language, but it acts more like changing the functionality completely.

As an example syntax for tolerance, I was thinking something like this personally:

SELECT cpu.value, mem.value FROM cpu, mem JOIN MEASUREMENTS WITH tolerance(1s)

Most of these tokens are already reserved keywords. We could decide if tolerance() should be a function or not, but I think it reads decently well. I think it also sets us up to make JOIN MEASUREMENTS the default whenever we decide to make a breaking change.

The measurement on the end would be another result set just like it is now if you select FROM foo, bar. The variable for tolerance is between all series being joined. If you wanted to have different tolerances, then we'd need to support nesting like: join(0, foo, join(2, mem, cpu))

No breaking changes until 2.0 (which we're not even thinking about yet).

I guess it's not that big of a deal to be able to do join(0, mem, cpu), foo since you can get the same thing from doing a multi-statement query:

select value from mem, cpu JOIN MEASUREMENTS;
select value from foo;

I'm not that married to it since your last suggestion accomplishes much the same thing and it's already too late for me to shoehorn my functional dream into the query language. Thoughts from others?

Problem Statement

While attempting to implement this, I ran into another issue which I'll
describe here and I'll describe the solution that we determined would be
most appropriate. For those following this issue closely, my apologies
for not writing up the problem before we started talking internally so
anybody who wanted could contribute. I'm writing it up here for the sake
of transparency before we attempt to implement it.

The first thing to understand is how series are joined together
currently. When you have a single measurement with multiple fields like
the following, we need to join the values together in some way.

cpu,host=server01 idle=20,user=70,system=10 0
cpu,host=server02 idle=30,user=50,system=20 0

When we execute a query to retrieve all of the fields, we are querying 6
different series. There are 3 fields and 2 series keys for a total of 6
series, but we are only returning 2 rows at time zero. How do we
determine which series are joined together? According to the other
proposal, the series would just be joined together in a fairly random
order so you could intermix values from cpu,host=server01 with values
from cpu,host=server02. There are a few problems with this though.

  1. It's confusing for users. They expect values that have the same
    series key to be matched with each other even if that's not how the
    storage engine works. The current system matches them up so why
    wouldn't it continue to work?
  2. You would be no longer able to query tags because you would be unable
    to determine which series to take tag information from.

But, if we don't use this method of joining together series, joining
series between measurements becomes very complicated. Imagine that you
have the following in the mem measurement and join it with the above.

mem,host=server01,region=uswest free=80 0

Since this series does not have a series key that matches anything
above, the value would not be joined with any row in cpu above. While
we could force series keys to be the same across measurements for this
feature to work, we think that's too limiting.

Updated Proposal

We are proposing a new syntax and new functionality to join. When series
are being joined (either between multiple measurements or a single
measurement), the user will be able to specify which parts of the series
key should be considered while joining. Only tags that are mentioned in
the join can be used in the resulting query (including GROUP BY). The
join will happen before anything else so series will be joined into a
single row before they are grouped into different series. I think this
contrasts with Kapacitor which does the grouping first and then joins
the series. /cc @nathaniel

The syntax will be something like this:

SELECT cpu.idle, cpu.user, mem.free FROM cpu, mem JOIN SERIES ON host

JOIN MEASUREMENTS has been changed to JOIN SERIES to put the focus
on how we are joining series together. Series from different
measurements are not treated in a special way.

Series will be joined together with the cartesian product for any values
at the same time. So if you tell a query to join on host and have 2
series that match with 2 series in the other field, you will end up with
4 rows total. In particular, that means queries like this will work in a
way we think is most fitting.

cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0

> SELECT cpu.value, mem.value FROM cpu, mem JOIN SERIES ON host
time    cpu.value   mem.value
----    ---------   ---------
0       20          10
0       30          10

Feedback

  • When there is a tag selected in the query, is it reasonable to infer
    that series should be joined on that tag?
  • When a GROUP BY is used, is it reasonable to infer that it should be
    included in the JOIN so the user doesn't have to repeat it?

For now, we're going to keep things explicit and try to make this more
user friendly with feedback in the future. If you can think of any
problems or you think that this is the wrong way to deal with joining
measurements, please post a comment so we can address any concerns. This
is a difficult problem and we appreciate and welcome any community
involvement.

When there is a tag selected in the query, is it reasonable to infer that series should be joined on that tag?

As long as "Only tags that are mentioned in the join can be used in the resulting query" is true then I think this is reasonable and in fact necessary.

When a GROUP BY is used, is it reasonable to infer that it should be included in the JOIN so the user doesn't have to repeat it?

I think this is reasonable. I don't think a query would make sense if the GROUP BY tag weren't also part of the join.

Given the cross-product output, what would I get from the following query?

cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0

> SELECT COUNT(cpu.value), COUNT(mem.value) FROM cpu, mem JOIN SERIES ON host
???

I was originally thinking the answer would be 2 for both of those columns, but we had another discussion today and @pauldix is doing a writeup of how he thinks the mechanics for this should work. We've been throwing around the idea of making this only work for aggregates rather than allowing raw query math with join, but we're unsure if that would solve the main use case people have.

For all of those following this issue, we are having difficulty at the moment with scoping the issue and determining what _exact_ use cases people need. We understand one use case to be doing something like correlating the number of requests with the number of errors using aggregates like this:

SELECT count(errors.value) / count(requests.value)

But I would also like to hear other use cases. Specifically, the above one is more about performing aggregates on two different measurements rather than doing math between those measurements. I would like to hear more use cases involving math between measurements and why you are trying to do it that way so we can determine if we are working on the correct feature and designing it correctly.

Thanks.

We're using InfluxDB to store measurements gathered from the LUT Green Campus. We mainly gather measurements related to energy, which typically means turning voltage and current measurements to power. The electrical system has three phases (L1, L2, L3), which are used as fields in the schema. An example query to calculate active_power from measured voltage, current and phi for a single phase L1 would be something similar to

SELECT mean(voltage.L1) * mean(current.L1) * mean(cos(phi.L1)) GROUP BY time(5m), panel_group

Furthermore, the values for apparent_power and reactive_power can be derived from these three measurements.

As another data point on use cases we have exactly the same use case as @kipe at Flywheel Building Intelligence.

Any chance for those who just "+1"ed to remove their comments and +1 the OP instead?

It would make it easier to follow the discussion. Thanks.

Hey Guys,

Another use case which I got was calculating a percentage, where the denominator value is in one measurement and the numerator is in another, and I needed to do something like

SELECT (sum(a.value) / sum(b.value)) * 100

where a and b are separate measurements, which is similar to what you guys mentioned. Is there no roundabout way to do this for now?

@thealphaking01 for your example, is that supposed to be sum(a.value) / sum(b.value) or is the second value supposed to be the raw values?

@jsternberg
a.value is a subset of b.value. Didn't get what you meant by raw values? For a particular duration, sum(a.value) gives me the number satisfying a criteria, while sum(b.value) gives me all possible values. Hence the percentage calculation.

Raw values are the ones directly written into the system. So SELECT value FROM cpu returns the raw values for the value field in the cpu measurement. The reason why I ask is because I didn't knonw if your above example had a typo or not since a.value is inside of an aggregate and b.value is not.

Sorry, that was a typo. Fixed it now.

I'm not sure if this is the correct case to implement the issue of case #3525 for inequalities? I would like to do something along the lines of:

SELECT mean(value) > 40

where the result is returned as a full time series of boolean values. My specific use case for this is determining if a pump is operating based on the current draw it is using.

Hi Jonathan,
Apparently I received this email because it mentions "@nathaniel" and
nathaniel is my GitHub username; but I am not associated with this project
in any way.
Can I please be removed from this email chain? If you would remove me from
the CC line if/when you continue the conversation further, that would be
great.

Thanks,
Nathaniel Cunningham, Ph.D.
Lincoln, NE

On Tue, Sep 27, 2016 at 7:01 PM, Jonathan A. Sternberg <
[email protected]> wrote:

Problem Statement

While attempting to implement this, I ran into another issue which I'll
describe here and I'll describe the solution that we determined would be
most appropriate. For those following this issue closely, my apologies
for not writing up the problem before we started talking internally so
anybody who wanted could contribute. I'm writing it up here for the sake
of transparency before we attempt to implement it.

The first thing to understand is how series are joined together
currently. When you have a single measurement with multiple fields like
the following, we need to join the values together in some way.

cpu,host=server01 idle=20,user=70,system=10 0
cpu,host=server02 idle=30,user=50,system=20 0

When we execute a query to retrieve all of the fields, we are querying 6
different series. There are 3 fields and 2 series keys for a total of 6
series, but we are only returning 2 rows at time zero. How do we
determine which series are joined together? According to the other
proposal, the series would just be joined together in a fairly random
order so you could intermix values from cpu,host=server01 with values
from cpu,host=server02. There are a few problems with this though.

  1. It's confusing for users. They expect values that have the same
    series key to be matched with each other even if that's not how the storage
    engine works. The current system matches them up so why wouldn't it
    continue to work?
  2. You would be no longer able to query tags because you would be
    unable to determine which series to take tag information from.

But, if we don't use this method of joining together series, joining
series between measurements becomes very complicated. Imagine that you
have the following in the mem measurement and join it with the above.

mem,host=server01,region=uswest free=80 0

Since this series does not have a series key that matches anything
above, the value would not be joined with any row in cpu above. While
we could force series keys to be the same across measurements for this
feature to work, we think that's too limiting.
Updated Proposal

We are proposing a new syntax and new functionality to join. When series
are being joined (either between multiple measurements or a single
measurement), the user will be able to specify which parts of the series
key should be considered while joining. Only tags that are mentioned in
the join can be used in the resulting query (including GROUP BY). The
join will happen before anything else so series will be joined into a
single row before they are grouped into different series. I think this
contrasts with Kapacitor which does the grouping first and then joins
the series. /cc @nathaniel https://github.com/nathaniel

The syntax will be something like this:

SELECT cpu.idle, cpu.user, mem.free FROM cpu, mem JOIN SERIES ON host

JOIN MEASUREMENTS has been changed to JOIN SERIES to put the focus
on how we are joining series together. Series from different
measurements are not treated in a special way.

Series will be joined together with the cartesian product for any values
at the same time. So if you tell a query to join on host and have 2
series that match with 2 series in the other field, you will end up with
4 rows total. In particular, that means queries like this will work in a
way we think is most fitting.

cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0

SELECT cpu.value, mem.value FROM cpu, mem JOIN SERIES ON host
time cpu.value mem.value


0 20 10
0 30 10

Feedback

  • When there is a tag selected in the query, is it reasonable to infer
    that series should be joined on that tag?
  • When a GROUP BY is used, is it reasonable to infer that it should be
    included in the JOIN so the user doesn't have to repeat it?

For now, we're going to keep things explicit and try to make this more
user friendly with feedback in the future. If you can think of any
problems or you think that this is the wrong way to deal with joining
measurements, please post a comment so we can address any concerns. This
is a difficult problem and we appreciate and welcome any community
involvement.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/influxdata/influxdb/issues/3552#issuecomment-250033940,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEKc6ccxw_WrBp-7oTmJiFjXBO6DJkXks5qua5sgaJpZM4FljNk
.

Opening a feature request kicks off a discussion.
Requests may be closed if we're not actively planning to work on them.

Proposal: Implement SQL Joins in the Influx Query Language
Current behavior: Not supported.
Desired behavior: Supported.

Use case:

When requesting data from InfluxDB it would be very useful to combine metadata, configuration data, etc., from one "table" with the time-series data. For example:

assets (measurement)
time (actually not used in this table)
id (tag) = cryptic internal identifier
friendly (field) = character string
port (field) = number (TCP/IP port number used by the device)

i.e.,
insert assets,id=y268938rjnau3 friendly='Asset1',port=64200

rundata (measurement)
time
id (tag) = cryptic internal identifier
temperature (field)
pressure (field)

This works great:
select last(temperature) from rundata group by mac

But this is what I really want:

Timestamp Machine Temperature

10/05/2016 8:04:06pm C2475 1675.4
10/02/2016 9:02:11am C7524 850.5

That might be done like this:

select
c.time as "Timestamp", a.friendly as "Machine",last(c.temperature) as "Temperature"
from
rundata c,assets a
where
c.id = a.id
group by
c.mac

@nathaniel we don't have the capability to unsubscribe you, but if you click the link in the email notification and follow it to this issue's Github page, you can click the "Unsubscribe" button at the bottom of the right hand side. Sorry about the accidental mention.

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData * 4), $interval) from ibstats WHERE "interface" =~ /ib.*/ AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the PortXmitData * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.

I understand your issue… makes sense… the same issue in my opinion.
But I don’t see tx_bytes*4 in your query…

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData), $interval) from ibstats WHERE ("interface" =~ /ib._/ OR "interface" =~ /p._/) AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the tx_bytes * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.


You are receiving this because you commented.
Reply to this email directly, view it on GitHubhttps://github.com/influxdata/influxdb/issues/3552#issuecomment-257987465, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVlReR3OTMcZ_DcCy4OJy3ea4qU58p5Pks5q6PFqgaJpZM4FljNk.

@inselbuch my apologies, I fixed the example query.

The problem is that this works:
non_negative_derivative(last(PortXmitData), $interval)
And this does not:
non_negative_derivative(last(PortXmitData * 4), $interval)

+1

Edit: +1ed the OP

+1

I am currently in the process of trying to switch from graphite to influx with grafana as our front end.. I currently have queries like this for graphite but I havent found a way to convert them to influx yet..

asPercent(reporting1_dev.memory.used,reporting1_dev.memory.total)
asPercent(reporting1_dev.memory.buffers,reporting1_dev.memory.total)

Also like this

asPercent(nonNegativeDerivative(reporting1_dev.cpu.total.iowait),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))
asPercent(sumSeries(nonNegativeDerivative(reporting1_dev.cpu.total.{guest,iowait,nice,steal,irq,softirq})),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))

Any help would be great!!!
Thanks!
Josh

So I found that I if I am doing math all of the info has to be in the same table.

+1 we need to do some maths between measurements

This is a huge feature missing from InfluxDB, making it much less powerful than Prometheus.

+1

+1

+1

I think the point has been stated with all of the +1s... This thread should be locked until there is progress on it.

+1

I'm locking this again. While that means you won't be able to add your 👍 reaction, I think there are enough of them that we're well aware people want this feature. We do want to hear any resources that may be useful in terms of implementing this. You can look in the commit log to find my email. If you want to wait for comments regarding this issue, please use the "Subscribe" button on the issue instead of responding to the issue.

Thank you.

Was this page helpful?
0 / 5 - 0 ratings