Runtime: DataFrame Discussion

Created on 5 Feb 2018 · 72 Comments · Source: dotnet/runtime

With work underway on Tensor, and the new perf optimizations available via Span, I think it's time the .NET team seriously considered adding a DataFrame type. Working with large amounts of data has become critical in many applications these days, and is the reason libraries like Pandas for Python have been so successful. It's time to bring similar capabilities to .NET, and it needs to start with the framework itself adding a DataFrame type.

There are DataFrame libraries available for .NET, but the problem is that each has its own implementation of DataFrames that is minimally compatible with the BCL or with any of the other libraries out there. From my experience, support for many of these libraries is also pretty weak.

I think the BCL team implementing its own DataFrame is the first, most important step to improving the state of working with data in .NET. All we have right now is DataTable, a powerful but ancient type that is not well optimized for many scenarios and which I've variously seen .NET team members refer to as "legacy".

I'm creating this issue to garner opinions on the topic. I don't have a specific API in mind to propose at this time, as gauging interest first is probably more important.

Let's make .NET as great for data analysis as any other platform.

api-suggestion area-Meta

All 72 comments

Great to get community interest.
cc @eerhardt @ericstj who have been thinking about this.

Could you define "DataFrame" for the purpose of this discussion? Because that is a vague, overloaded, and open-ended term - it would be good to know the intent as it relates to this discussion. For example, pandas defines it as:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List<T> or Span<T>...

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List or Span...

I think the scenarios come from a more "data scientist" and Machine Learning angle. Take a look at https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb starting with the

In [5]: housing = load_housing_data() line.

Imagine trying to explore a data set (loaded through .csv or otherwise) like this in .NET code, and how much more code you would need to write to accomplish something like:

housing["ocean_proximity"].value_counts()

that sounds like something that LINQ would excel at, with little more than a few helper extension methods - i.e.

var counts = housing.ValueCounts<string>("ocean_proximity");
// the <string> here just makes the return type stronger, could be done without

with an extension method something like below (completely untested). Note: I'm totally not a fan of DataSet for most purposes - I'm just trying to understand what problem this suggestion is trying to solve, and how it would be different to DataTable. I'm very cautious of the idea of adding another type that is basically the same as DataTable to represent ad-hoc data without a very good reason. If the reason is that DataTable is too heavy and has a lot of things (like change tracking, etc) that never apply, then that's great - I just think having a very clear statement of the problem is a prerequisite to having a good solution.

public static Dictionary<TKey, int> ValueCounts<TKey>(this DataTable table, string columnName)
{
    var col = table.Columns[columnName];
    return table.Rows.Cast<DataRow>().GroupBy(x => x.Field<TKey>(col)).ToDictionary(
        x => x.Key, x => x.Count());
}

Edit: personally, I'd much rather use a class HousingData POCO that looks like the shape of the data, i.e. has a Longitude, Latitude etc - along with great tools for loading that; but I acknowledge that not all scenarios fit the "we know the shape ahead of time" world.

@mgravell I think the Pandas definition is as good as any.

You're right, it is very similar to a DataTable. However, DataTables are heavy, not generic (and thus box any value types) and do not support many of the useful properties of DataFrames.

I love Linq, but it is a different type of abstraction. Common operations with DataFrames include aligning columns, performing operations between columns, joining frames, etc. Writing ad-hoc types or using Linq for this is much more verbose and time consuming. A key insight about DataFrames vs Linq is that you're working with columns or frames of data rather than individual items, as you do with Linq.
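
To make the distinction concrete, here is a minimal runnable sketch; the Col type and its operators are invented for illustration and are not an existing .NET API:

using System;
using System.Collections.Generic;
using System.Linq;

// Toy column type: a named double vector with element-wise operators,
// which is the essence of "frame-wise" thinking.
class Col
{
    public double[] Values;
    public Col(double[] v) => Values = v;
    public static Col operator *(Col c, double k) => new Col(c.Values.Select(x => x * k).ToArray());
    public static Col operator +(Col a, Col b) => new Col(a.Values.Zip(b.Values, (x, y) => x + y).ToArray());
}

class Demo
{
    static void Main()
    {
        var frame = new Dictionary<string, Col>
        {
            ["int1"] = new Col(new double[] { 1, 3, 5 }),
            ["int2"] = new Col(new double[] { 2, 4, 6 }),
        };

        // Frame-wise: one expression over whole columns appends a new column.
        frame["int3"] = frame["int1"] * 2 + frame["int2"];

        // Item-wise LINQ needs a new row shape for the same projection.
        var rows = frame["int1"].Values.Zip(frame["int2"].Values, (i1, i2) => new { i1, i2 });
        var projected = rows.Select(r => new { r.i1, r.i2, i3 = r.i1 * 2 + r.i2 }).ToList();

        Console.WriteLine(string.Join(", ", frame["int3"].Values)); // 4, 10, 16
    }
}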

I strongly suggest looking at some example usage in Pandas or checking out Deedle. (Deedle has a pretty good implementation of a DataFrame, but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages; it also appears to be essentially abandoned.)

but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages

Can you explain a bit what an F#-ism is and what those problems are? If I am relying on a NuGet package with a .NET assembly, I shouldn't care about the source language the IL was produced from. Their usage docs have C# examples: http://bluemountaincapital.github.io/Deedle/csharpseries.html

@kasper3 F# adds a lot of its own types and conventions on top of the .NET Framework. It's in FSharp.Core.dll. You absolutely can use it from C#, and it is very powerful, but you quickly run into a lot of annoying impedance mismatches.

The biggest one being that F# has an Option<T> type, so Deedle is written to work with Option<T> rather than nullables. This means for double?, you have three different ways to say null: Double.NaN, null, and Option<double?>.None. It makes it super annoying and error prone when doing analysis.
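
To illustrate (a minimal sketch assuming a project that references FSharp.Core):

using System;
using Microsoft.FSharp.Core;

class Demo
{
    static void Main()
    {
        // Three coexisting "missing value" encodings for a double cell.
        double a = double.NaN;                               // floating-point convention
        double? b = null;                                    // Nullable<T>
        FSharpOption<double> c = FSharpOption<double>.None;  // F# option

        // Every consumer has to remember to check all three.
        Console.WriteLine(double.IsNaN(a));        // True
        Console.WriteLine(b.HasValue);             // False
        Console.WriteLine(OptionModule.IsNone(c)); // True
    }
}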

Though I digress. Really I'm just trying to show why there is a need for a DataFrame type to be baked into the core framework.

I've been trying to build a UWP app which reads data from IOT sensors and I've been banging my head against this exact problem. I can't use pandas or R because I'm in UWP-land and C# doesn't have a decent DataFrame API to use MathNet with in the same way I'd use SciPy with Pandas. I'm pretty close to just abandoning .NET and building the whole thing in Python with the desktop bridge. Unfortunate but there's really a gap in this area where Python and R excel. This would obviously be most useful on the server-side ML pipelines, but wanted to highlight how this is also useful to client developers. Particularly when developers will want to pre-process data they're feeding into ONNX models for WindowsML.

@mattdot check out the issue I referenced; it seems there's a new dataframe-ish abstraction called IDataView in the just open-sourced ML.NET library. Maybe it can be useful to you?

I'm pretty close to just abandoning .NET and building the whole thing

If there is some API shape proposal that you have in mind that would help your use case, please do share. That will increase the chances of moving this discussion forward.

@MgSam So DataFrame is column oriented instead of row oriented? Do chime in to https://github.com/dotnet/machinelearning to share your views. Why not @mattdot also, maybe that's doable already.

@veikkoeeva I'm not sure column-oriented vs row-oriented is the right description. It's a tabular data structure. It definitely makes column-based operations easy and much more readable. This definition for an R data frame is pretty good.

As soon as I get some free time I intend to play around with the IDataView.

@glebuk @TomFinley

ML.NET has the concept of an "IDataView" (IDV). We are porting documentation for it in PR 173.
We hope it provides some context for this discussion.
It is similar to a DataFrame but has significant differences. A DataFrame is a fully materialized, schematized data table in memory that supports random access. An IDV is similar to a SQL view: it is also a schematized view of data, but it is accessed with cursors, which allows for streaming. While a DataFrame is very good for ad-hoc data exploration, the IDV was optimized to be a very efficient, scalable (infinite rows by billions of features) data plane for machine learning pipelines.
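
The access-pattern difference can be sketched in a few lines (hypothetical shapes for illustration only; these are not ML.NET's actual interfaces):

// Materialized table: random access into data that is fully in memory.
interface IDataFrameLike
{
    object this[long row, int column] { get; }
    long RowCount { get; }
}

// View: forward-only cursors, so the data behind it can be streamed.
interface IDataViewLike
{
    IRowCursorLike GetRowCursor();
}

interface IRowCursorLike
{
    bool MoveNext();            // advance to the next row
    T GetValue<T>(int column);  // read a cell of the current row
}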

Please excuse my extreme naivete, but does a .net core implementation of Apache Arrow relate to this?

Wes McKinney recently posted on his blog announcing Ursa Labs which would open .net up to a whole new crowd if there were collaboration.

@bryanrcarlson I mentioned Apache Arrow as a means for ML.NET to integrate with R and Python - check out https://github.com/dotnet/machinelearning/issues/69 if you're interested

I've been a C# programmer for 10 years, but in the last few years have learnt R and now Python. I love C#, but it is really lacking in this area when compared with Pandas in Python and dplyr/tidyr in R.

Those tools make it so easy to work with tabular data, and for people we want to move out of Excel for LoB data tools, they are a much easier sell than C# at present.

Both R and Python benefit here during the exploration stage by being dynamically typed. They also benefit from having a really good REPL story.

I think a neat C# implementation would start in the dynamic world using ExpandoObject or similar and would need good runtime IntelliSense in the REPL. It would then let coders generate the statically typed classes and code when they are ready to build an actual tool. The readr package in R applies a similar approach when reading CSV files: it can auto-detect the column data types when you first import a file, but then prints out the R code that you should put in to specify those types explicitly (and encourages you to do so).
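
A minimal runnable sketch of that readr-style flow (the inference rule and every name here are invented for illustration):

using System;
using System.Linq;

class Demo
{
    // Guess a column's C# type from sample values.
    static string Infer(string[] samples) =>
        samples.All(s => double.TryParse(s, out _)) ? "double" : "string";

    static void Main()
    {
        var columns = new (string Name, string[] Samples)[]
        {
            ("longitude", new[] { "-122.23", "-122.22" }),
            ("ocean_proximity", new[] { "NEAR BAY", "INLAND" }),
        };

        // Print the statically typed class the user can paste into their
        // project once the exploration phase is done.
        Console.WriteLine("public class HousingRow");
        Console.WriteLine("{");
        foreach (var (name, samples) in columns)
            Console.WriteLine($"    public {Infer(samples)} {name} {{ get; set; }}");
        Console.WriteLine("}");
    }
}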

The column vs. row thing is important for performance. Columnar data is generally the way things are going for performance reasons, e.g. the Parquet data format. A lot of the operations people do on data tables benefit from using SIMD along a column of contiguous memory data.
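
For instance, summing a contiguous column with System.Numerics processes several elements per instruction when hardware acceleration is available; a minimal sketch:

using System;
using System.Numerics;

class Demo
{
    // Sum a contiguous column several SIMD lanes at a time; this is the kind
    // of win that columnar layout enables and row-oriented layouts forfeit.
    static double SumColumn(double[] column)
    {
        var acc = Vector<double>.Zero;
        int i = 0;
        for (; i <= column.Length - Vector<double>.Count; i += Vector<double>.Count)
            acc += new Vector<double>(column, i);
        double sum = 0;
        for (int lane = 0; lane < Vector<double>.Count; lane++)
            sum += acc[lane];
        for (; i < column.Length; i++) // scalar tail
            sum += column[i];
        return sum;
    }

    static void Main()
    {
        var column = new double[1000];
        for (int i = 0; i < column.Length; i++) column[i] = i;
        Console.WriteLine(SumColumn(column)); // 499500
    }
}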

I agree that a DataFrame would be useful, especially in the Data Science and Machine Learning area.

It's basically an n-dimensional tensor (could be a scalar, vector, matrix, or higher level tensor) which has label-vectors for the individual dimensions.

The labels should be preserved as far as possible when mathematical operations are applied.

I guess a DataFrame type need not (but may anyway) be part of the core library. Using something like the Deedle NuGet package should do the job nicely, especially if you can convert from and to the Tensor type.

I arrived at this discussion as I too am searching for a suitable data analysis framework. Deedle is OK but suffers from very bad performance on even moderately large datasets. Extreme Optimisation has an implementation that is very fast from what I have played around with.
I am mainly using R for data analysis, particularly the data.table class, which has extremely good performance scaling into the hundreds of millions of data rows.
The problem I am finding is that I want the performance and flexibility of data.table, but in a language like C# that is much better suited to implementing financial instruments (from simple swaps to structured deal types): OO, extension methods, interfaces, abstract classes, etc. The lack of a suitable implementation in C# means I am forced into using S4 classes and OO in a functional language.
Another issue is the lack of a decent IDE. I still think nothing comes close to the abilities of the MS IDEs.

Adding to ML.NET "project" just to keep track (no decision implied but this is a good discussion)

I am extremely in favor of this! Here goes a long post; thank you to anyone who gets through it all.

API shape proposal-ish

To try to move the discussion forward, I'll put forward something that looks kind of like an API shape proposal. It's actually just a few core operations from pandas, but hopefully it gives a sense of what operations a real API shape proposal would have to cover. A real API shape proposal would need to be much more C#-like and cover a much wider set of scenarios. I'm using pandas here, but Matlab and R basically smell the same, the way that C/C++, Java, and C# all smell the same.

First I'll define a dataframe to make all of the examples actually runnable.

import pandas as pd
df = pd.DataFrame([[1, 2, 'a', pd.Timestamp('2016-01-01')],
                   [3, 4, 'b', pd.Timestamp('2016-01-02')],
                   [5, 6, 'c', pd.Timestamp('2016-01-03')],
                   [7, 8, 'a', pd.Timestamp('2016-01-04')],
                   [9, 10, 'b', pd.Timestamp('2016-01-05')],
                   [11, 12, 'c', pd.Timestamp('2016-01-06')]],
                  columns=['int1', 'int2', 'strs', 'dts'])

Here are the core operations that a dataframe needs to be able to support:

Selection:

# this selects just the rows where int1 is even (selecting a subset of columns isn't shown here)
df[df['int1'] % 2 == 0]

Indexes for fast selection:

# efficiently select just the rows between these two dates
df = df.set_index('dts')
df.loc[pd.Timestamp('2016-01-01'):pd.Timestamp('2016-01-04')]

Projection and Append:

# for each row, this multiplies int1 by 2 and adds int2 (i.e. a projection)
# and puts that new value in the int3 column (i.e. appending a column)
# appending a row is not shown
df['int3'] = (df['int1'] * 2) + df['int2']

Assignment:

# updates the int1 column in row 2 to be 100
df.loc[2, 'int1'] = 100

(this seems a little contrived, but this is the building block for common operations like replacing all NaNs with 0)

GroupBy:

df2 = df.groupby('strs', as_index=False)['int1'].sum()

> strs int1
> a       8
> b      12
> c      16

Row/column labels:

# columns and rows have names that can be changed
df2 = df2.rename(columns={'int1': 'int1_summed'})

Join:

pd.merge(df, df2, how='inner', on='strs')

Sort:

df.sort_values(by=['strs', 'int1'])

There are other things, like reading/writing CSVs and SQL tables, resampling and other time-related functions, rolling window calculations, more advanced selection, pivoting and transposing (e.g. see the table of contents at https://pandas.pydata.org/pandas-docs/stable/), but I think those will fall into place fairly easily once the core operations described above are defined. No comment here on dataframes being mutable/immutable (some of the syntax I showed here strongly suggests mutability because that's how pandas/Matlab/R work, but I don't think that has to be the case).
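
To give a flavor of what "more C#-like" could mean, here is a hedged strawman of just these core operations as a C# surface; every name below is hypothetical:

using System;
using System.Collections.Generic;

public enum JoinKind { Inner, Left, Right, Outer }

public interface IRow
{
    T Get<T>(string column);
}

public interface IDataFrame
{
    IDataFrame Where(Func<IRow, bool> predicate);                     // selection
    IDataFrame SetIndex(string column);                               // index for fast selection
    IDataFrame Loc<TKey>(TKey from, TKey to);                         // range query on the index
    IDataFrame AddColumn(string name, Func<IRow, object> projection); // projection + append
    object this[long row, string column] { get; set; }                // cell assignment
    IDataFrame GroupBySum(string keyColumn, string valueColumn);      // groupby + aggregate
    IDataFrame Rename(IDictionary<string, string> columns);           // row/column labels
    IDataFrame Join(IDataFrame other, string on, JoinKind how);       // join
    IDataFrame SortBy(params string[] columns);                       // sort
}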

How is this different from LINQ on a List<T> of POCOs?

  • Appending columns, selecting a subset of columns, or renaming column labels doesn't work well if you have a typed POCO. Even ExpandoObject only gets you part of the way: you can set members by name by casting it to IDictionary<string, object>, but every value you store is boxed and nothing is statically checked. So if you go down this road you have to choose between creating a new type for every stage of your operation and boxing value types.
  • Joining is hard (aka verbose). E.g. doing an outer join on xs and ys on on would require something like:
var ysDict = ys.ToDictionary(y => y[on], y => y);
var xsDict = xs.ToDictionary(x => x[on], x => x);
xs.Select(x => (x, ysDict.TryGetValue(x[on], out var y) ? y : null))
    .Concat(ys.Where(y => !xsDict.ContainsKey(y[on])).Select(y => (null, y)));

(and this hasn't even solved the appending columns problem!)

  • The performance benefits of having columnar data in memory that @mungojam mentions above is important when you have, say, 100 million rows.
  • Indexes for fast selection are not available, especially for a range query like I show above

How is this different from DataTable?

In addition to the good points above by @mgravell (heavy, i.e. change tracking) and @MgSam (box all value types), I'll also point out that DataTable doesn't have the required richness. I'm not super familiar with DataTable, so I might make some mistakes here, but I believe it doesn't support joins, renaming columns, indexes, assignment to a single cell, or any of the "bonus" things I mentioned above as non-core functionality, like resampling and other time-related functions, pivoting and transposing.

Why should dataframes be added to .NET?

  • Developers that love .NET are abandoning it for Python because of the lack of dataframes and the associated ecosystem. At my company, we've used solely .NET for years, and have recently had to add first-class support for Python in order to support researchers/scientists/data analysts that need tools that .NET doesn't have. To this point, it doesn't just end at DataFrames. In some ways, all of the comments asking how this is different from LINQ and DataTable etc. are right: if .NET just added DataFrames and stopped there, especially just the core operations that I proposed above, it wouldn't be an earth-shaking development in the history of .NET. But if everything else came together, either from the .NET team or from the open source community, like a fully rich dataframe API (i.e. feature parity with pandas/R/Matlab), a REPL (we're almost there but we need 64-bit support!), Jupyter notebook support, a plotting library on par with matplotlib/Matlab/R, a library of statistical functions on par with pandas+scipy/Matlab/R, and a machine learning framework on par with Tensorflow/PyTorch (i.e. ML.NET), then that would basically open up .NET to an entire new set of use cases. This obviously won't happen overnight, and there's a reasonable argument that the Python/pandas ecosystem has an insurmountable lead, but changing the world requires unreasonable people :).
  • There's an opportunity. Python/pandas currently holds the lead, I believe, because R and Matlab don't have a good story for general purpose programming, i.e., you can't write a web server in R/Matlab the way you can with Python/Django. But pandas is at a critical juncture, because datasets have grown to 100 million rows or more, and the current architecture of pandas is limiting: http://wesmckinney.com/blog/apache-arrow-pandas-internals/. I think if you look at the Arrow project and the problems it's trying to solve, that gives a pretty good description of the opportunity. I'll give my spin on it here:

    • Pandas is "slow" at manipulating very large datasets.

    • Another aspect of the opportunity is the IDataView thing, i.e., the ability to reference an underlying potentially massive datastore and perform lazy operations on it, and then materialize the data and continue working on it using the exact same API. This is related to the previous point that pandas wasn't built in a world of large datasets, and that means that it needs a new paradigm for loading large datasets, not just the data manipulation aspect. (I haven't dug deeply into IDataView much yet, so I might be misunderstanding this.)

    • There's a distributed computation aspect as well. This is what Spark dataframes offer, which is why pandas-Spark compatibility is an important angle of the Arrow project. DryadLinq anyone? :) Also see dask: https://dask.pydata.org/en/latest/

  • Finally, I think there's an opportunity to make dataframes optionally typed like @mungojam talks about above, which would be really amazing. Also see @tpetricek's paper http://tomasp.net/academic/papers/pivot/pivot-ecoop17.pdf, especially figure 4. It's always a tough and contentious argument to make, but I believe (mostly on faith) that being able to optionally specify types statically would make data scientists more productive the way that some people believe that static types make general purpose programmers more productive.

A full example of how dataframes are used in real life

I'll go through a real-ish example here in pandas to demonstrate how dataframes are used and how C#/.NET as it stands are inadequate for the use case. Let's say you have a financial dataset as a CSV that has an "id" column (which might represent a stock or a bond), a "date" column, and a "value1" and "value2" column. Let's say you have a second dataset that contains returns for "id"s and "date"s. The first thing you need to do is read the CSVs, but I'm going to start with some synthetic data for convenience:

import random
import string
import numpy as np
import pandas as pd
import sklearn.linear_model

def random_string(n):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(n))

ids = np.array([random_string(6) for _ in range(900)])
dates = pd.date_range('2013-01-01', '2018-06-01')
df1 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'value1': np.random.rand(len(ids) * len(dates)),
                    'value2': np.random.rand(len(ids) * len(dates))})

df2 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'returns': np.random.rand(len(ids) * len(dates))})

Here's what df1 looks like:

>>>               date      id    value1    value2
>>> 0       2013-01-01  132ZXV  0.769004  0.237304
>>> 1       2013-01-01  HLX7J7  0.241722  0.273988
>>>             ...     ...       ...       ...
>>> 1780198 2018-06-01  N5EIUS  0.655888  0.220536
>>> 1780199 2018-06-01  MF8YVT  0.458046  0.118847             

Let's say you want to replace any NaNs in the data with 0.0, multiply value1 and value2 together, take a 5-observation rolling sum of that per id, take the zscore of that for each date, join that against the returns, and do a rolling 250-day linear regression on each day. Here's what that would look like in pandas:

df1 = df1.fillna(0.0)
df1['value3'] = df1['value1'] * df1['value2']
# there's some ugly cruft here--no one said pandas is perfect!
df1['value3_rolling_sum'] = df1.groupby('id')['value3'].rolling(5).sum() \
    .reset_index().drop('id', axis=1).sort_values(by='level_1').set_index('level_1')['value3']
df1['value3_zscore'] = df1.groupby('date')['value3_rolling_sum'].apply(lambda gr: (gr - gr.mean()) / gr.std())
merged = pd.merge(df1, df2, how='left', on=['id', 'date'])
merged = merged.set_index('date')
models = []
for d in merged.index.unique():
    regression_inputs = merged.loc[d - pd.DateOffset(250):d]
    model = sklearn.linear_model.LinearRegression()
    models.append(model.fit(regression_inputs[['value3_rolling_sum']], regression_inputs['returns']))

This is very long, but I wrote out the roughly equivalent thing in C# to try to bring home my point:

// prepare synthetic data
var random = new Random();

var alphanumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string RandomString(int n)
    => new string(Enumerable.Range(0, n).Select(_ => alphanumeric[random.Next(alphanumeric.Length)]).ToArray());

var ids = Enumerable.Range(0, 900).Select(_ => RandomString(6)).ToList();
var dates = Enumerable.Range(0, 1978).Select(i => new DateTimeOffset(2013, 1, 1, 0, 0, 0, TimeSpan.Zero) + TimeSpan.FromDays(i)).ToList();

var df1 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Value1: random.NextDouble(), Value2: random.NextDouble()))).ToList();
var df2 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Returns: random.NextDouble()))).ToList();


// replace nans and multiply Value1 and Value2
var value3s = df1.Select(row => (double.IsNaN(row.Value1) ? 0.0 : row.Value1) * (double.IsNaN(row.Value2) ? 0.0 : row.Value2)); // note: == double.NaN is always false, so IsNaN is required
// Within each id, rolling 5-observation sum
var value3RollingSums = df1.Zip(value3s, (origRow, value3) => (origRow, value3))
    .GroupBy(row => row.origRow.Id, row => row.value3)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var result = new double[values.Count];
        for (var i = 0; i < values.Count; i++)
            for (var j = Math.Max(0, i - 4); j <= i; j++)
                result[i] += values[j];
        return result;
    });
// Within each date, compute a zscore
var value3Zscores = df1.Zip(value3RollingSums, (origRow, value3RollingSum) => (origRow, value3RollingSum))
    .GroupBy(row => row.origRow.Date, row => row.value3RollingSum)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var average = values.Average(); // LINQ built-in
        var standardDeviation = values.StandardDeviation(); // assume an implementation
        return values.Select(v => (v - average) / standardDeviation);
    });
// merge with df2
var df2Dict = df2.ToDictionary(row => (row.Id, row.Date));
var merged = df1.Zip(value3Zscores, (origRow, value3Zscore) => (origRow, value3Zscore, df2Row: df2Dict[(origRow.Id, origRow.Date)])).ToList();
var models = new List<LinearRegressionModel>();
// do linear regression
foreach (var date in merged.Select(row => row.origRow.Date).Distinct().OrderBy(date => date))
{
    var startDate = date - TimeSpan.FromDays(250);
    var data = merged.Where(row => row.origRow.Date >= startDate && row.origRow.Date <= date).ToList();
    // assume an implementation of LinearRegressionModel
    var model = LinearRegressionModel.Fit(data.Select(row => row.value3Zscore).ToArray(), data.Select(row => row.df2Row.Returns).ToArray());
    models.Add(model);
}

If anything, I would say that the C# version here looks better than it actually is in real life. In real life, this isn't the only computation we're doing. We're trying out other ways of normalizing the data, we're inspecting the dataframe after every step to double-check that our calculations are doing what we meant to do, we're trying other models like Lasso and Ridge and Random Forests, we have more than just one variable that we're trying to fit, we're trying rolling sums and regressions based on days rather than observations and vice versa. I would say that throwing all of these things into the mix makes it even harder to do in C# and only trivially more difficult to do in Python/pandas.

I think the choice of indexing will be interesting.
The df[df['int1'] % 2 == 0] style of indexing can, at best, be described as fugly. It smells a bit MatLab-ish.
The R data.table syntax of dt[int1 %% 2 == 0] is somewhat more elegant and a touch SQL/LINQ like. It also handles numeric references such as dt[1:20], column subsetting such as dt[1:20, list(int1, int2, int3)], and grouping within the brackets. Now some of these things may be a step too far, with a .group() method being more appropriate, for instance.
Ability to reference from the end of the dataframe much like MatLab's m[end-9:end] or an open request like m[end-9:] or m[-9:] would also be good.
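
For what it's worth, C# 8 Index/Range syntax already expresses this kind of end-relative slicing on arrays, so a DataFrame indexer accepting Range could offer the same surface; a small runnable sketch:

using System;

class Demo
{
    static void Main()
    {
        var column = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };

        var firstThree = column[..3];  // like dt[1:3]
        var lastNine = column[^9..];   // like m[end-8:end] or m[-9:]

        Console.WriteLine(string.Join(",", firstThree)); // 1,2,3
        Console.WriteLine(string.Join(",", lastNine));   // 4 through 12
    }
}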

What pandas does bring to the table are the built-in interpolation, periodicity-change, rolling, etc. calculations, whereas in R you resort to other libraries such as zoo.

I think that, as .Net is a little late here, it is worth trying to utilise the Apache Arrow and Parquet work as you don't want to be both late to the party and the odd one out. Both Python and R are headed down this road and, if .Net wants to be part of the data analysis / machine learning party, .Net should go that way too.

Sorry to spam, but any update on whether this is likely to happen? Our department might come back from the brink of python if a pandas-like library was on the horizon

cc @shauheen @chris-lauren

Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. LINQ and SQL Server are both very weak for timeseries. I don't know about ML, but there are lots of uses for Pandas outside of ML. Take finance for example. We need tools for working with timeseries and operations like joins in-memory because performance is critical, but we also need code to be stable and performant. It doesn't make sense to use Python for executing trades, capturing data from APIs, etc, so we end up passing data to/from Python. Oracle and timeseries-based databases like KDB can be helpful, but performing operations in code is often critical. Python.NET can bridge Pandas and .NET to a degree, but there is no efficient way to pass data to Python as a dataframe or to convert C# arrays or dictionaries to dataframes in Python. Time keeps going by and Microsoft keeps falling further behind, and some of the comments in this thread give the impression that Microsoft still fails to understand why Pandas is so popular.

This is a great conversation.

Has anyone started working on a spec for what this might look like as part of .NET?

We are beginning to dig into this but are still fairly early in the process. I can assure you all we love pandas. It's an amazing library and we do fully appreciate the need for an equivalent in .NET. We are simply resource constrained at the moment and would love contributions / help spec'ing out and implementing.

Awesome! Any existing resources you can point us to? Would love to help out

I'm interested in helping too. I'm not going to steer the ship but I can help with grunt work.

Some time has passed since my first comment in this thread. I'd seriously recommend anyone considering specifying this to look into the Apache Arrow project (an in-memory column store model), and others building on top of it (like https://rapids.ai/, also https://arrow.apache.org/powered_by/). It's a huge chance to catch up by plugging into the existing data science ecosystem. It's too late to try to beat the others (Python, R); the last chance is to join them.

In terms of the actual API design, .NET could again leverage Arrow implementers' experience and pick the best route. Being able to efficiently interop on large data sets with other processes would be a huge win for all data-processing programs. GPU-interop is another great boon.

To sum up, I think Apache Arrow .NET implementation could be a good foundation for .NET data frame.

As a complete layman, looking at Arrow is very appealing. It took a while to look through this thread for the Arrow material, so I gathered the few pieces here to save some trouble for others:

https://github.com/dotnet/corefx/issues/26845#issuecomment-391793741
https://github.com/dotnet/corefx/issues/26845#issuecomment-406106187

And the posts just below these.

Looks like someone just finished an initial C# implementation of Arrow

https://github.com/apache/arrow/tree/master/csharp

Good find, @nh43de.

Is this the way to go, then: build upon RecordBatch and implement an API on top (with array and index notation similar to pandas and/or data manipulation verbs similar to dplyr)? Or should we still consider defining a DataFrame type that can load in Arrow format?

PyArrow has a Table class that is a "tool to help with wrangling multiple record batches..." PyArrow Tables.

One thing to note is that a lot of data is in netCDF or geoTIFF format. By a lot I mean many of the scientific data sets holding large amounts of data. It would be great if the .NET story for working with these formats were better. If nothing else, maybe reading something like geoTIFF (these are used as "raster data sets") and storing the data in Arrow format for other work. Right now the tools are fairly non-existent, and even the ones wrapping tools like gdal_translate (to give search engine fodder) could benefit from more people improving them. I think this is closely related to issues with .NET and geographic types, for which there's some push currently.

(There's https://github.com/predictionmachines/SDSlite but it is abandoned.)

We're building the underlying libraries for machine learning, which implement ndarray and dataframe types. There is some discussion of SciSharp STACK's Pandas.NET at https://github.com/SciSharp/Pandas.NET/issues/13

@pgovind is now looking at this.

As a short heads-up: the Arrow C# implementation has been getting some momentum again over the last few days: https://github.com/apache/arrow/tree/master/csharp It would be nice to have some further eyes and contributors on this, as none of the core Arrow developers are (yet) .NET experts.

@xhochy - I've been looking at the Arrow C# implementation for the past couple of days, and I have a few proposed changes that I'm working on - mostly performance enhancements.

https://issues.apache.org/jira/browse/ARROW-4502
https://issues.apache.org/jira/browse/ARROW-4503

I hope to find more places I can help out there.

I used R and Python data frames for a long time. But being primarily a C# developer, I longed for two things: type safety and LINQ. So a few years ago I created a C# data frame that stores typed data and inherits from a generic list. Data wrangling is done with the familiar LINQ methods Where, GroupBy, Select, etc.

Is it more verbose than R or Python? Yes. Do you approach data analysis the same way? Possibly, but probably not. Does it handle large datasets well? It's OK... the eyeball test says R wins.

But I used this every day for a year to do data analysis and was pleased. Please tell me what you think; I'm interested in hearing thoughts / comments. If this has legs I might be interested in collaborating to improve it; I haven't worked on it in a while:

https://www.nuget.org/packages/Frames/

Here are some examples:

Creating frames

Frame employees = Frame.ReadCSV<string, DateTime, double, double, double, double, double>(@"http://www.spearing.com/files/Employees.csv");

 Frame frame = new Frame();

 frame["Names"] = new string[] { "Bob", "Mary", "Joe" };
 frame["StartDate"] = new DateTime[] {
    new DateTime(2016, 10, 1),
    new DateTime(2016, 6, 8),
    new DateTime(2017, 9, 2)
 };

 frame["Ages"] = col(41.0, 28.0, 35.0);
 frame["LowScore"] = col(78.0, 81.0, 85.0);

 frame["HighScore"] = new double[] { 90.0, 92.0, 87.0}.ToColumn();

Retrieving data:

double averageAge = frame["Ages"].As<double>().Average();
double totalScore = frame["HighScore"].As<double>().Sum();

double[] ages = frame["Ages"].As<double>().ToArray();

List<double> lowScores = frame["LowScore"].As<double>();

Column calculations:

 frame["ScoreDiff"] = frame["HighScore"].As<double>() - frame["LowScore"].As<double>();
 frame["HighPlus1"] = frame["HighScore"].As<double>() + 1.0;

 frame["Hours"] = new double[] { 25, 30, 38 };
 double[] hourlyRate = new double[] { 15, 20, 12 };

 frame["Pay"] = frame["Hours"].As<double>() * hourlyRate;

Generic collections:

 public class Employee
 {
     public string Name { get; set; }
     public int Age { get; set; }
     public double HighScore { get; set; }
 }

 List<Employee> employees = new List<Employee>()
 {
     new Employee() {Name = "Bob", Age = 40, HighScore = 90.0 },
     new Employee() {Name = "Mary", Age = 28, HighScore = 92.0 },
     new Employee() {Name = "Joe", Age = 35, HighScore = 87.0 }
 };

 Frame employeesFrame = employees.ToFrame();

Filtering:

 DateTime startDate = new DateTime(2016, 9, 1);
 Frame newEmployees = frame
     .Where(row => row.Get<DateTime>("StartDate") >= startDate)
     .ToFrame(); 

Grouping:

 Frame empYearSummary = frame
     .GroupBy(row => row.Get<DateTime>("StartDate").Year)
     .Select(grp => new
     {
         Year = grp.Key,
         AverageAge = grp.Average(row => row.Get<double>("Ages")),
         Count = grp.Count()
     })
     .ToFrame();

@jackimburgia that looks like just the sort of API that is needed and is really promising. Is it open source? I couldn't see any links. Does it store data as columns or rows?

@mungojam - thanks for the feedback. It stores data as columns (typed collections); but typically I interacted with the data in the context of a row (I hope that makes sense).

I haven't opened it up yet- I put it on Nuget about a year and half ago to gauge interest / to see if this is something the .Net community wants or needs.

FYI to anyone who is following this discussion. We've started a prototype of a .NET DataFrame over in corefxlab. https://github.com/dotnet/corefxlab/pull/2656. Feedback, contributions, help, etc. would be greatly appreciated.

@eerhardt Awesome! You're probably already aware of the work on Variant, but IMHO it seems like it could be a great fit for a DataFrame to avoid boxing in the common case when you're working with primitive value types.

Good luck with the new effort. I open sourced the Data Frames project I worked on for reference:

https://github.com/jackimburgia/Frames/tree/master/src

Wonder what the guys over at the MS Trill project think of this initiative?

cc @pgovind @cybertyche @YohDeadfall @eerhardt

(small puff of smoke) I HAVE BEEN SUMMONED

A few notes from what I see in the (wonderful) discussion above.

Typed and untyped need not be mutually exclusive

There appears to be a fairly even split between people who want the more concise, untyped DataFrame type and those who want a boxing-free, type-safe DataFrame. I would like to suggest that we may be able to achieve a superposition of having and eating cake here.

For instance, we may be able to have a situation wherein:
public sealed class DataFrame<T> : DataFrame

Alternatively, we could have:
public sealed class DataFrame : DataFrameBase
public sealed class DataFrame<T> : DataFrameBase

But the former would be far better. I think of it as a parallel to the relationship between LambdaExpression and Expression.
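
A compiling skeleton of the former shape, purely for illustration (none of this is a real API):

using System;

// The untyped base carries the string-indexed, boxing surface; the generic
// subclass adds a typed, boxing-free row view for callers that know the shape.
public class DataFrame
{
    public virtual object this[long row, string column]
    {
        get => throw new NotImplementedException();
        set => throw new NotImplementedException();
    }
}

public sealed class DataFrame<TRow> : DataFrame
{
    public TRow GetRow(long index) => throw new NotImplementedException();
}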

I would personally set an extremely high bar for having separate implementations for ML.Net and for DataFrame

The scenarios in ML.Net and those described above have significant overlap. ML.Net has not yet achieved a 1.0 release. As noted above already, there is already a DataTable class (one that has an extraordinary number of shortcomings relative to the needs stated by people here) but if we also have a DataFrame class that's aaaaaaaalmost IDataView but not quite then I think we have failed. To that end, I hope the corefx and ML.Net teams are talking like nowish about where this goes. It need not be that DataFrame and IDataView refer to literally the same class, as it could be like the relationship between DataTable and DataView in the Fx. But there's too much overlap in scenario here for there to be divergent evolution.

Trill could join this ecosystem as well if we had first-class time support

The main difference between an abstraction like DataFrame and IStreamable (Trill's basic abstraction) is support for the progression of time. ML.Net may also be looking to Trill for some of its capabilities such as windowing, so this is likely a good discussion to keep going.

Some of the more untyped-friendly operations can be done in the typed world with a little imagination and some code generation

Adding a column was brought up as a great example of an operation that can be ridiculous to do in a typed environment but much easier over an untyped set. One of the things we did in Trill was implement a set of overloads to the .Select() method that take operations like adding a column and take it from ridiculous down to maybe a mild laugh. It's not as concise as in untyped-world, but still better than otherwise. For adding a column, the result looks something like:

public static IStreamable<TKey, TNew> Select<TKey, TOld, TNew, TField1>(
    this IStreamable<TKey, TOld> source,
    Expression<Func<TNew>> initializer,
    Expression<Func<TNew, TField1>> fieldSelector1,
    Expression<Func<TOld, TField1>> fieldInitializer1) where TNew : new()

It allows you to do something like this:

IStreamable<TKey, OldType> stream = …;
var newStream = stream.Select(() => new NewType(), val => val.NewField, old => NewFieldValue(old));

Again, not perfect, but better than the alternative. Any field that is the same between OldType and NewType in the above is carried over automatically because MAGIC er I mean code generation. Trill's got a few methods that contain magic like that; for instance, Trill streams have Pivot() and Unpivot() methods.

Anyway, those are my first thoughts on the matter. Feel free to conjure me again. (disappears in a smaller puff of smoke)

Closing, as work is actively proceeding in https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data
and further discussion should happen in corefxlab.
cc: @eerhardt, @pgovind

@mungojam @hrichardlee I've been working on a DataFrame library for .NET which implement many of the ideas in this thread and I'd appreciate some feedback when you have time:
F#: https://notebooks.azure.com/allisterb/projects/sylvester/html/Sylvester.DataFrame.ipynb
C#: https://dotnetfiddle.net/1K9tdy

@allisterb This is great. Implementing IEnumerable (making your dataframe implementation LINQ-friendly) sets it apart from Deedle. I haven't tried the commercial dataframe products for .NET, so I don't know if they are LINQ-friendly, but it looks like the implementation in corefxlab doesn't implement IEnumerable yet.

it looks like the implementation in corefxlab doesn't implement IEnumerable yet.

Not yet, but we plan on it - see https://github.com/eerhardt/corefxlab/commit/5cf06f4afc23a98f282a7511993e2550784ed719. I just haven't had time to submit the PR yet, but will this week.

If there is anything I can do to contribute to this, or if you'd like me to be a reviewer on any work in this space, please let me know.

@joshnyce Thanks, I tried to address what I thought were some of the shortcomings of Deedle, and being able to use LINQ as the main query library was one of them, as .NET developers are already familiar with how LINQ works.

@cybertyche Hey thanks a lot! I guess the main thing I need help with is getting the word out. I really think .NET developers would prefer to stay in C# or F# for doing data analysis instead of using pandas or R, and I think F# + Sylvester's data frame + Azure Notebooks offers a comparable experience.

Wondering if Microsoft.Data will solve the same problems, or will we still need 3rd party frameworks?

We now have a preview of DataFrame out on Nuget here. We'd definitely appreciate any feedback we get! DataFrame implements IDataView, so it can be passed to ML.NET APIs easily. DataFrame is also based on Apache Arrow, so we can use DataFrames to write User Defined Functions (UDFs) in .NET for Spark.

Another exciting development is .NET Core support on Jupyter Notebooks. Check out https://devblogs.microsoft.com/dotnet/net-core-with-juypter-notebooks-is-here-preview-1/ to get started.

And finally, this is a sample showing Jupyter Notebook + .NET DataFrame + Charting! Click on https://github.com/dotnet/try/blob/master/Notebook.md to try it out on binder!

Can you give an example of some basic usage? I was unable to get it to work for a few basic cases.

``` C#
var a = new DataFrame();
a.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("DOB"));
a.Columns.Add(new StringDataFrameColumn("Name", 100)); // What does the length param refer to?
a.Columns.Add(new PrimitiveDataFrameColumn<int>("Age")); // This throws
a.Append(new object[] { DateTime.Parse("2017/01/01"), "Mary", 50 });
a.Append(new object[] { DateTime.Parse("2011/03/01"), "Sue", 15 });
a.Append(new object[] { DateTime.Parse("2015/05/01"), "John", 35 });
```

A few thoughts/questions:

  • How do you actually see what is in the DataFrame? I don't see a Rows property and it doesn't implement IEnumerable.
  • ToString() also threw every time I tried to use it.
  • I see a LoadCsv method in the Jupyter example but I don't see that on the DataFrame type.
  • Some documentation on what the various methods do would really be helpful. For example, what do Add, Subtract, Multiply, Divide do? Does this work on heterogeneous data?

Thanks for all the hard work!

@pgovind Trying my luck and hinting a bit, would you happen to have examples coming on how to read netCDF/HDF5 files? :)

Here's a quick example of DataFrame in action:

            PrimitiveDataFrameColumn<DateTime> dateTimes = new PrimitiveDataFrameColumn<DateTime>("DateTimes"); // Default length is 0.
            PrimitiveDataFrameColumn<int> ints = new PrimitiveDataFrameColumn<int>("Ints", 3); // Makes a column of length 3. Filled with nulls initially
            StringDataFrameColumn strings = new StringDataFrameColumn("Strings", 3); // Makes a column of length 3. Filled with nulls initially

            // Append 3 values to dateTimes
            dateTimes.Append(DateTime.Parse("2017/01/01"));
            dateTimes.Append(DateTime.Parse("2017/01/01"));
            dateTimes.Append(DateTime.Parse("2017/01/01"));

            DataFrame df = new DataFrame(new List<DataFrameColumn> { dateTimes, ints, strings }); // This will throw if the columns are of different lengths

            // To change a value directly through df
            df[0, 1] = null; // 0 is the rowIndex and 1 is the columnIndex. This sets the 0th value in the Ints column to null

            // Modify ints and strings columns by indexing
            ints[1] = 100;
            strings[1] = "Foo!";

            // Indexing can throw when types don't match.
            // ints[1] = "this will throw because I am a string";
            // DataType can be used to figure out the type of data in a column.
            Type intsType = ints.DataType; // returns System.Int32

            // Add 5 to ints in place
            ints.Add(5, inPlace: true);

            // Add 5 to ints through the DataFrame
            df["Ints"].Add(5, inPlace: true);

            // We can also use binary operators. Binary operators produce a copy, so assign it back to our Ints column 
            df["Ints"] = (ints / 5) * 100;

            // Fill nulls in our columns, if any. ints[0], ints[2], strings[0] and strings[2] are null
            df["Ints"].FillNulls(-1, inPlace: true);
            df["Strings"].FillNulls("Bar", inPlace: true);

            // To inspect the first row
            IList<object> row0 = df[0];

            // Filter rows based on equality
            DataFrame filtered = df.Filter(strings.ElementwiseNotEquals("Foo!"));

            // Sort our dataframe using the Ints column
            DataFrame sorted = df.Sort("Ints", ascending: true);

            // GroupBy 
            GroupBy groupBy = df.GroupBy("DateTimes");

            // Count of values in each group
            DataFrame grouped = groupBy.Count(); // Alternatively find the count in just the desired columns

Many of the APIs expose an "inPlace" parameter. The aim here is ease of use when inspecting values in a notebook.
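For instance, a minimal sketch of the two calling styles, following the copy-vs-in-place behavior described above:

``` C#
DataFrameColumn copy = df["Ints"].Add(5); // returns a new column; df is untouched
df["Ints"].Add(5, inPlace: true);         // mutates the column held by df
```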

@MgSam :

  • Your example threw when you tried to add the "Age" column because its length (0) did not match the "Name" column's length (100). The DataFrame's RowCount is updated the first time a column is added to it. We then expect every other column to have the same length. Technically, adding "Name" itself should've thrown because its length (100) does not match "DOB"'s length (0). I'm looking into why we didn't throw :)
  • Where did you try ToString()? On a? I've overridden it to output the first few rows in the DataFrame. If you can provide a sample that throws when it shouldn't, I'll file a bug.
  • LoadCsv is a static method. You can use it like DataFrame.LoadCsv(path/to/file).
  • Most of the methods have documentation on them. However, it may not be hooked up to intellisense in a notebook yet. Did you try to use the binary operator methods (Add, Subtract, etc.) in Visual Studio? You should be able to see the documentation there.
  • We will be adding a Rows property to DataFrame soon. It will still be loosely typed, though.
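Putting the first point into practice, a corrected version of the earlier snippet might look like this; the key is that column lengths must match (all start empty here) before rows are appended:

``` C#
var a = new DataFrame();
a.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("DOB"));
a.Columns.Add(new StringDataFrameColumn("Name", 0)); // length 0 to match "DOB"
a.Columns.Add(new PrimitiveDataFrameColumn<int>("Age"));
a.Append(new object[] { DateTime.Parse("2017/01/01"), "Mary", 50 });
a.Append(new object[] { DateTime.Parse("2011/03/01"), "Sue", 15 });
a.Append(new object[] { DateTime.Parse("2015/05/01"), "John", 35 });
```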

Suggestion: how about adding a tutorial notebook on dotnet/try with some dummy data, so that we can start to explore the library on binder directly? It would give testers a taste of all the functionality before the documentation is out.

That's the plan. I'll be adding a sample notebook and an accompanying blog post going over the major features in the preview soon.

@veikkoeeva : We only support reading CSV files into a DataFrame at the moment. Or, you can make a DataFrame from an Arrow RecordBatch. I opened https://github.com/dotnet/corefxlab/issues/2785

@pgovind
Thanks! Some more feedback:

  • The other column types seem to have a default value for the length parameter of 0, but StringDataFrameColumn does not. This caused some of my confusion.
  • I'm not seeing any XML docs on any methods or classes. Could this be an issue with the package? (My XML docs look fine for stuff outside this package)
  • How do you add/remove columns/rows for an already constructed DataFrame?
  • Are operations between two DataFrames supported? (Concatenation, mathematical, etc)?
  • Are you planning on having DataFrame implement IEnumerable? Given the indexer it has, it seems like IEnumerable<IList<object>> would make sense.
  • Can you insert spaces between the columns in DataFrame.ToString()?
  • A ToCsvString() method that creates a properly escaped and formatted CSV out of the entire data frame would be really useful. This is an extremely common thing people might want to do.
  • I think a Visual Studio Visualizer for DataFrame would also be really helpful. I actually built an extension which added a better visualizer for DataTable. Unfortunately it's not currently in the extension gallery because a subsequent VS update completely broke it and I haven't had time to update. If there's interest and I could get some support from MS, I'd be happy to update this extension to work for DataFrame and DataTable. (It supports searching the data, easy copy/paste, and displayed in a non-modal window)

I have not tried this yet, and it will take some weeks as I'm getting back from a business trip, but as quick feedback for fear of missing the window:

  • Does ToString take formatting cases into account?
  • And are functions like ToCsvString extension methods? While one can provide an extension method of one's liking, it seems worth considering not complicating the core types with something like that, since multiple implementations are available and people will probably keep finding edge cases in the data and libraries they work with.

Is there a design document that shows why the API is built the way it is? Also interested in performance and implementation concerns.

@MgSam

> How do you add/remove columns/rows for an already constructed DataFrame?

df.Columns returns a Collection<DataFrameColumn>. The usual Collection methods can be used to add/remove columns. Use df.Append to append a row to a DataFrame df. We don't have a way to remove a row in place from a DataFrame yet. If you really need to remove some rows, use df.Filter(rowIndices), where rowIndices is a PrimitiveDataFrameColumn<long>. The Filter method returns a new DataFrame containing only the rows in rowIndices. The "Getting Started" notebook will have an example :)
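For example, a minimal sketch of that workaround: keeping rows 0 and 2 of a three-row frame, thereby "removing" row 1.

``` C#
// Filter-based row removal, as described above.
var keep = new PrimitiveDataFrameColumn<long>("indices"); // starts at length 0
keep.Append(0);
keep.Append(2);
DataFrame withoutRow1 = df.Filter(keep); // new DataFrame with rows 0 and 2 only
```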

> Are operations between two DataFrames supported? (Concatenation, mathematical, etc)?

We've got Join and Merge. By mathematical, do you mean adding 2 DataFrames, for example? The 0.1.0 release doesn't support that, but it does support mathematical operations on columns. You could run a for loop over the columns of your DataFrames and call math operations on them. I'm sure future releases will add math operations on pairs of DataFrames.
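A rough sketch of that per-column loop, assuming two frames with identical shapes and assuming the column-level math operations accept another column as well as a scalar (df1 and df2 are illustrative names):

``` C#
// Element-wise "df1 + df2" approximated by looping over column pairs.
for (int i = 0; i < df1.Columns.Count; i++)
{
    df1.Columns[i] = df1.Columns[i].Add(df2.Columns[i]);
}
```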

> Are you planning on having DataFrame implement IEnumerable? Given the indexer it has, it seems like IEnumerable<IList<object>> would make sense.

The next release will add a Rows property that will implement IEnumerable :)
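Illustrative only, since the property hasn't shipped yet: loosely typed rows would presumably enumerate in something like the IList<object> shape the row indexer already returns.

``` C#
// Hypothetical usage of a future Rows property (row shape assumed).
foreach (IList<object> row in df.Rows)
{
    Console.WriteLine(string.Join(", ", row));
}
```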

> Can you insert spaces between the columns in DataFrame.ToString()?

Yup :). However, wait for the blog post and we might win you over to the notebook side, where ToString() is redundant.

> A ToCsvString() method that creates a properly escaped and formatted CSV out of the entire data frame would be really useful. This is an extremely common thing people might want to do.

Agreed. It will definitely make it into the next version.

> I think a Visual Studio Visualizer for DataFrame would also be really helpful. I actually built an extension which added a better visualizer for DataTable. Unfortunately it's not currently in the extension gallery because a subsequent VS update completely broke it and I haven't had time to update. If there's interest and I could get some support from MS, I'd be happy to update this extension to work for DataFrame and DataTable. (It supports searching the data, easy copy/paste, and displayed in a non-modal window)

That would be awesome. Maybe put up a PR and tag me? DataFrame.ToString is just a "quickly see my data" method. Not really meant for much else.

@veikkoeeva

> Does ToString take formatting cases into account?

Nope. But we can add the overload if you feel it's important. I suggest waiting for the first blog post to come out. It goes over how to use DataFrame + .NET notebook and adds custom formatting for the DataFrame that solves many pain points.

> And are functions like ToCsvString extension methods? While one can provide an extension method of one's liking, it seems worth considering not complicating the core types with something like that, since multiple implementations are available and people will probably keep finding edge cases in the data and libraries they work with.

We actually don't have a ToCsv method yet. What you stated is one of the reasons we don't have support for many formats in 0.1.0. If we had gone down a list of popular formats and tried to support all of them, IO would've taken up a bunch of our time. @eerhardt might have a better answer than me here. My gut feeling is that we can have a "DataFrame.IO.Extensions" package that deals with all the IO ops for popular formats. That way the core DataFrame methods are independent. The public methods already defined on DataFrame should make it easy enough to populate it with data.
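In the spirit of that extension-method packaging, here is a rough sketch (not the library's API) of what a ToCsvString extension could look like, using only the surface discussed in this thread: df.Columns, RowCount, and the row indexer returning IList<object>. The column Name property and the Microsoft.Data namespace are assumptions here.

``` C#
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Data; // assumed, per the corefxlab tree linked above

public static class DataFrameCsvExtensions
{
    // Sketch only: serialize a DataFrame to CSV, quoting fields that
    // contain the separator, quotes, or newlines.
    public static string ToCsvString(this DataFrame df, char separator = ',')
    {
        var sb = new StringBuilder();
        sb.AppendLine(string.Join(separator.ToString(),
            df.Columns.Select(c => Escape(c.Name, separator))));
        for (int i = 0; i < df.RowCount; i++)
        {
            IList<object> row = df[i]; // row indexer, as shown earlier in the thread
            sb.AppendLine(string.Join(separator.ToString(),
                row.Select(v => Escape(v?.ToString() ?? string.Empty, separator))));
        }
        return sb.ToString();
    }

    private static string Escape(string field, char separator)
    {
        bool needsQuotes = field.IndexOf(separator) >= 0
            || field.IndexOf('"') >= 0
            || field.IndexOf('\n') >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }
}
```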

> Is there a design document that shows why the API is built the way it is? Also interested in performance and implementation concerns.

No. My forthcoming blog post will go over some of the design and implementation considerations. It's a little early to talk about performance numbers IMO; I've done very little perf work for the 0.1.0 release, save for minor optimizations and profiling. Understanding the performance concerns requires going over the DataFrame design a little bit. The backing store is the Arrow format for primitive data types, i.e. the columns are stored as a collection of Memory<byte>. Three consequences arise from this:

  1. A column can have a length greater than int.MaxValue
  2. SIMD operations are possible because we store values column major
  3. Going from Memory<byte> to Span<T> has a cost. Therefore, if you index into a column to get/set values, you pay this cost, and if you do that in a for loop, it gets expensive, as you can imagine. This is where most of the perf issues lie at the moment. There are also costs associated with boxing value types that have shown up in my profiling. One way to get around this is to convert each Memory<byte> to Span<T> once; this is what some methods do, and it alleviates a lot of the perf issues. Unfortunately, not all of the algorithms can get away with this.

Finally, we haven't implemented any SIMD ops and/or multi-threading yet. So there's still loads of perf to gain :)
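To illustrate point 3 above, a small self-contained sketch (not library code) of the pattern described, where hoisting the Memory-to-Span conversion out of the loop pays the cost once instead of per element:

``` C#
using System;

static class SpanCostDemo
{
    // Pays the Memory<T> -> Span<T> conversion cost on every iteration.
    static long SumSlow(Memory<int> buffer)
    {
        long sum = 0;
        for (int i = 0; i < buffer.Length; i++)
            sum += buffer.Span[i];
        return sum;
    }

    // Converts once up front; the loop then indexes the Span directly.
    static long SumFast(Memory<int> buffer)
    {
        Span<int> span = buffer.Span;
        long sum = 0;
        for (int i = 0; i < span.Length; i++)
            sum += span[i];
        return sum;
    }
}
```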

Here's the promised blog post :)

Thanks for the blog post!

Some more feedback:
It would be great to add static factory methods for creating DataFrameColumns. That way we can benefit from type inference (which also makes refactoring easier). Example:
``` C#
class DataFrameColumn
{
    public static PrimitiveDataFrameColumn<T> Create<T>(string name, IEnumerable<T> values) where T : unmanaged
    {
        return new PrimitiveDataFrameColumn<T>(name, values);
    }

    public static StringDataFrameColumn Create(string name, IEnumerable<string> values)
    {
        return new StringDataFrameColumn(name, values);
    }

    // ... Also overloads for all the other constructors for PrimitiveDataFrameColumn and StringDataFrameColumn ...
}
```

Usage:

``` C#
var doubles = new[] { 3.0, 4.0, 5.0 };
var ints = new[] { 3, 4, 5 };
var strings = new[] { "foo", "bar", "baz" };

var col1 = DataFrameColumn.Create("doubles", doubles);
var col2 = DataFrameColumn.Create("ints", ints);
var col3 = DataFrameColumn.Create("strings", strings);
```

EDIT: Happy to try my hand at a PR for this if the API looks ok.

@MgSam : I agree. The API looks good to me. Feel free to tag me on the PR when you put it up and we can get it into the next preview :)

+1 for the ToCsv method! That would be so useful.
