Stoqs: Implement a function to get data pertaining to more than 2 parameters

Created on 30 Oct 2018 · 8 Comments · Source: stoqs/stoqs

Extend the function _getMeasuredPPData, implemented in https://github.com/stoqs/stoqs/blob/master/stoqs/contrib/analysis/__init__.py, which gets the measured data for two given parameters. Extending it to return all parameters, or a given list of parameters, for a given platform would make more data and features available when exploring and modeling the output data. This can be vital to improving the performance of a machine learning algorithm.
The goal is to get this data into a pandas DataFrame, or similar, as an easier base to work from when implementing further machine learning algorithms.
@MBARIMike, @bretstine, @markmocek and I will be exploring this issue further as part of Fall Capstone 2018.

All 8 comments

This will be an important addition to the STOQS code base!

I think what we'd like to enhance is the createLabels() function of the classify.py program, which calls a slightly different method in __init__.py: _getPPData(). When this method is called, only the MeasuredParameter IDs from its return value are used. If I interpret the intent of this Issue correctly, here is a list of functional requirements:

  • [ ] Produce a table of any number of MeasuredParameter data values from a Platform
  • [ ] Return MeasuredParameter IDs along with the data so that new labels can be added to the DB
  • [ ] Be able to constrain selection based on time or depth range
  • [ ] Be able to constrain selection based on the value range of a MeasuredParameter
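
The platform, time and depth constraints in the list above could be expressed as Django filter keyword arguments. The following is a minimal sketch, not existing STOQS code: the helper name build_mp_filter and its arguments are hypothetical, and the field lookups are assumed from the MeasuredParameter queries shown elsewhere in this thread. A value-range constraint on a single parameter is not covered here; it is easier to apply once the data is in a table.

```python
from datetime import datetime

# Field lookups as they appear in the queries in this thread
PLATFORM_LOOKUP = "measurement__instantpoint__activity__platform__name"
TIME_LOOKUP = "measurement__instantpoint__timevalue"
DEPTH_LOOKUP = "measurement__depth"


def build_mp_filter(platform, parameters=None, start=None, end=None,
                    min_depth=None, max_depth=None):
    """Return keyword arguments for MeasuredParameter.objects.filter().

    Hypothetical helper: only constraints that are actually given
    end up in the returned dictionary.
    """
    kwargs = {PLATFORM_LOOKUP: platform}
    if parameters:
        # Restrict to a list of parameter names, e.g. ['temperature']
        kwargs["parameter__name__in"] = list(parameters)
    optional = {
        f"{TIME_LOOKUP}__gte": start,
        f"{TIME_LOOKUP}__lte": end,
        f"{DEPTH_LOOKUP}__gte": min_depth,
        f"{DEPTH_LOOKUP}__lte": max_depth,
    }
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


filters = build_mp_filter(
    "dorado",
    parameters=["temperature", "salinity"],
    start=datetime(2013, 9, 17),
    max_depth=50.0,
)
# Inside shell_plus this could be used as:
#   MeasuredParameter.objects.using('stoqs_september2013_o').filter(**filters)
```

Building the dictionary separately keeps the constraint logic testable without a database connection.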

If I recall correctly, _getPPData() is a generalized improvement over _getMeasuredPPData() in that it reuses methods already developed for the UI and allows passage of a pvDict dictionary that holds any number of MeasuredParameter value constraints to the selection.

The already developed code for the UI constructs raw SQL statements that execute self-join statements in order to retrieve multiple Parameters for plotting in the Parameter-Parameter section of the UI. This code would be difficult to extend. Perhaps we can take a fresh approach to get the data in a suitable format for exploration and modeling using Machine Learning techniques.

Here's a start on a fresh approach, a Django query that gets the first 20 data values from dorado:

(venv-stoqs) [vagrant@localhost stoqsgit]$ stoqs/manage.py shell_plus
...

In [1]: mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(
   ...:                 measurement__instantpoint__activity__platform__name='dorado')
   ...:

In [2]: for i, mp in enumerate(mps[:20]):
   ...:     if i == 0:
   ...:         print("time, depth, latitude, longitude, parameter__name, measuredparameter__datavalue")
   ...:     print(f"{mp.measurement.instantpoint.timevalue}, {mp.measurement.depth:.2f},"
   ...:           f" {mp.measurement.geom.y:.6f}, {mp.measurement.geom.x:.6f},"
   ...:           f" {mp.parameter.name}, {mp.datavalue}")
   ...:
time, depth, latitude, longitude, parameter__name, measuredparameter__datavalue
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, sigmat, 25.1383576072121
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, spice, 0.830712889765499
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, altitude, 1395.68956636994
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, temperature, 13.9910522171992
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, salinity, 33.6403972259011
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, oxygen, 5.670288605996
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, nitrate, 0.21
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, bbp420, 0.00231458255927606
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, bbp700, 0.00228426640768986
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, fl700_uncorr, 0.000823624706576738
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, biolume, 194666664.695293
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, roll, -4.08951048388392
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, pitch, -0.105888989907026
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, yaw, 175.513420572358
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, sepCountList, None
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, mepCountList, None
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, sigmat, 25.1403727711047
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, spice, 0.829269194464183
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, altitude, 1395.49904668803
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, temperature, 13.9828055034561

Maybe there's a way to pivot an output like this to get the data in a format amenable to analysis in Pandas?
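
One possible way to do such a pivot is pandas' pivot_table(), which turns one row per (sample, parameter) into one row per sample with a column per parameter. The sketch below is only an illustration: it uses a handful of values copied from the listing above rather than a live query, and the column names are taken from the printed header.

```python
import pandas as pd

# Long-format records: a few values copied from the listing above
records = [
    ("2013-09-17 18:42:20", -0.03, 36.734970, -122.128144, "temperature", 13.9910522171992),
    ("2013-09-17 18:42:20", -0.03, 36.734970, -122.128144, "salinity", 33.6403972259011),
    ("2013-09-17 18:42:20", -0.03, 36.734970, -122.128144, "oxygen", 5.670288605996),
    ("2013-09-17 18:42:18", -0.04, 36.734989, -122.128162, "sigmat", 25.1403727711047),
    ("2013-09-17 18:42:18", -0.04, 36.734989, -122.128162, "temperature", 13.9828055034561),
]
columns = ["time", "depth", "latitude", "longitude", "parameter__name", "datavalue"]
long_df = pd.DataFrame.from_records(records, columns=columns)
long_df["time"] = pd.to_datetime(long_df["time"])

# One row per (time, depth, latitude, longitude), one column per parameter.
# aggfunc="first" keeps the stored value instead of averaging duplicates.
wide_df = long_df.pivot_table(
    index=["time", "depth", "latitude", "longitude"],
    columns="parameter__name",
    values="datavalue",
    aggfunc="first",
).reset_index()
wide_df.columns.name = None  # drop the leftover 'parameter__name' axis label
print(wide_df)
```

Parameters that were not measured at a given sample show up as NaN in the wide table, so they would need to be handled before fitting a model.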

@MBARIMike that definitely looks like the direction we were trying to go in. Maybe "extension" of an existing function was the wrong way to word things given we would be starting fresh. Thank you for making that clarification.

Also, Pandas has a DataFrame.from_records() method that will import Django query results into a data frame, e.g.:

In [1]: import pandas as pd

In [2]: mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(
   ...:                 measurement__instantpoint__activity__platform__name='dorado')
   ...:

In [3]: df = pd.DataFrame.from_records(mps.values(
   ...:     'measurement__instantpoint__timevalue', 'measurement__depth',
   ...:     'measurement__geom', 'parameter__name', 'datavalue', 'id'
   ...:     ))
   ...:

In [4]: df.head(20)
Out[4]:
       datavalue       id  measurement__depth                         measurement__geom measurement__instantpoint__timevalue parameter__name
0   2.476802e+01  5664562           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          sigmat
1   1.262683e+00  5673227           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49           spice
2   2.546787e+01  5690556           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49        altitude
3   1.582349e+01  5577911           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49     temperature
4   3.367453e+01  5629901           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49        salinity
5   6.593205e+00  5586576           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          oxygen
6   5.360300e+02  5595241           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49         nitrate
7   9.528316e-03  5603906           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          bbp420
8   6.610731e-03  5612571           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          bbp700
9   4.761394e-04  5621236           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    fl700_uncorr
10  9.728126e+09  5638566           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49         biolume
11 -1.292509e+01  5647231           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49            roll
12 -6.497791e+00  5655896           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49           pitch
13  5.802254e+01  5664561           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49             yaw
14           NaN  5690705           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    sepCountList
15           NaN  5691417           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    mepCountList
16  2.476093e+01  5664563           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47          sigmat
17  1.270436e+00  5673228           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47           spice
18  2.544076e+01  5690555           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47        altitude
19  1.585611e+01  5577910           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47     temperature
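
Because the id column travels with each data value, a frame like this could also serve the value-range and labeling requirements. The sketch below is an assumption-laden illustration: select_by_value_range is a hypothetical helper, the rows are copied from the output above as displayed (rounded), and a sample is identified by time and depth only because those are the columns available here. A measurement primary key, if it were added to the values() call, would likely be a more robust key.

```python
import pandas as pd

# Columns that identify one sample in this illustration (an assumption)
SAMPLE_KEYS = ["measurement__instantpoint__timevalue", "measurement__depth"]

# A few rows copied from the df.head(20) output above (values as displayed)
long_df = pd.DataFrame.from_records([
    {"datavalue": 15.82349, "id": 5577911, "measurement__depth": -0.055507,
     "measurement__instantpoint__timevalue": "2013-09-16 20:55:49",
     "parameter__name": "temperature"},
    {"datavalue": 33.67453, "id": 5629901, "measurement__depth": -0.055507,
     "measurement__instantpoint__timevalue": "2013-09-16 20:55:49",
     "parameter__name": "salinity"},
    {"datavalue": 24.76093, "id": 5664563, "measurement__depth": -0.082238,
     "measurement__instantpoint__timevalue": "2013-09-16 20:55:47",
     "parameter__name": "sigmat"},
    {"datavalue": 15.85611, "id": 5577910, "measurement__depth": -0.082238,
     "measurement__instantpoint__timevalue": "2013-09-16 20:55:47",
     "parameter__name": "temperature"},
])


def select_by_value_range(df, parameter, vmin, vmax):
    """Return every row of the samples where `parameter` lies in [vmin, vmax]."""
    is_param = df["parameter__name"] == parameter
    in_range = df["datavalue"].between(vmin, vmax)  # inclusive on both ends
    keys = df.loc[is_param & in_range, SAMPLE_KEYS].drop_duplicates()
    # Inner merge keeps all parameters measured at the matching samples
    return df.merge(keys, on=SAMPLE_KEYS, how="inner")


selected = select_by_value_range(long_df, "temperature", 15.85, 16.0)
# MeasuredParameter IDs that a label could be attached to
label_ids = sorted(selected["id"])
```

Selecting in the long format keeps the IDs as integers; pivoting them into a wide table alongside the data values would turn missing entries into NaN.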

So instead of manipulating x and y as loadLabeledData does, we could write a new function with this code and return the pandas data frame. Would you suggest adding this to classify.py or creating a new file?

I suggest creating a new file for now. Perhaps it could be a Jupyter Notebook that demonstrates an analysis.

So looking at classify.py, would we need to construct a process_command_line() function for this new file?

We'd need to understand the functional requirements better; perhaps a new option (or implementation of an aspirational option already in classify.py) is an approach. I'd like to see a Jupyter Notebook demonstration - that will help us decide.
