Nested dictionaries are commonly emitted by web APIs that speak JSON. Getting this sort of data into pandas isn't very easy right now without manual data structure munging, as the dicts remain objects rather than being converted into a flat naming hierarchy.
Here's a common example of data:
In [95]: data=[dict(user=dict(uid=123,full_name='Alice'),followers=1),
...: dict(user=dict(uid=456,full_name='Bob'),followers=2)]
...: data
Out[95]:
[{'followers': 1, 'user': {'full_name': 'Alice', 'uid': 123}},
{'followers': 2, 'user': {'full_name': 'Bob', 'uid': 456}}]
Pandas keeps the dicts as objects:
In [96]: df=pd.DataFrame(data)
...: df
Out[96]:
   followers                                   user
0          1  {u'uid': 123, u'full_name': u'Alice'}
1          2    {u'uid': 456, u'full_name': u'Bob'}
But I'd love to see something along the lines of:
In [91]: df=pd.DataFrame(data,flatten_dicts=True)
...: df
Out[94]:
   followers  user.uid user.full_name
0          1       123          Alice
1          2       456            Bob
Well, this is in general a non-trivial problem, and that's why it's not done by default. You can do the below.
In [10]: from pandas.io.json import json_normalize
In [11]: json_normalize(data)
Out[11]:
   followers user.full_name  user.uid
0          1          Alice       123
1          2            Bob       456
see docs here: http://pandas-docs.github.io/pandas-docs-travis/io.html#normalization
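For reference, a self-contained version of the call above. This is a sketch that assumes pandas 1.0 or later, where the same function is also exposed at the top level as pd.json_normalize. The "_" separator is only an illustration of the sep argument, not something proposed in this thread.

```python
import pandas as pd

# Same sample records as in the report above.
data = [
    dict(user=dict(uid=123, full_name="Alice"), followers=1),
    dict(user=dict(uid=456, full_name="Bob"), followers=2),
]

# Assumes pandas >= 1.0: json_normalize is available as pd.json_normalize.
# Nested keys are joined with "." by default.
flat = pd.json_normalize(data)

# `sep` changes the separator used in the flattened column names.
flat_underscore = pd.json_normalize(data, sep="_")
```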
I suppose a 2 level dict could be examined and this done.
Don't want additional context-specific keywords. But you could try this type of solution and see if it breaks anything. If it works OK, would consider it.
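One way to prototype the "examine a 2-level dict" idea outside of pandas is sketched below. frame_from_records is a hypothetical helper, not a pandas API, and it assumes the input is a list of dicts and a pandas version where json_normalize accepts max_level (0.25 or later; the top-level pd.json_normalize name needs 1.0 or later).

```python
import pandas as pd


def frame_from_records(records):
    """Hypothetical helper (not part of pandas).

    Flattens one level of nesting when any record holds a dict value,
    otherwise falls back to the plain DataFrame constructor.
    """
    has_nested = any(
        isinstance(value, dict)
        for record in records
        for value in record.values()
    )
    if has_nested:
        # max_level=1 flattens only the first level of nesting,
        # i.e. the "2-level dict" case; deeper dicts stay as objects.
        return pd.json_normalize(records, max_level=1)
    return pd.DataFrame(records)


nested = frame_from_records([{"followers": 1, "user": {"uid": 123}}])
plain = frame_from_records([{"a": 1}, {"a": 2}])
```

Whether this kind of inference is safe inside the constructor itself is exactly the open question here; the sketch only makes it easy to try against real data.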
I don't understand, you don't want new keywords but would be ok with a flatten keyword?
In any case, json_normalize works fine for me - I just didn't know it was hidden there. Thanks!
no new keywords
but u maybe could inspect a 2-level dict and flatten
why don't u give it a shot
Wouldn't that break existing code? If no new keywords are allowed (I can see why), I think it's better to leave it as-is. json_normalize works fine for me, I'll add any extra features I need there.
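To make the compatibility concern concrete, here is a small illustration. It is an assumed example of what existing code might look like, not code taken from any real project, and it uses pd.json_normalize (pandas 1.0 or later) to stand in for a constructor that flattens automatically.

```python
import pandas as pd

data = [
    dict(user=dict(uid=123, full_name="Alice"), followers=1),
    dict(user=dict(uid=456, full_name="Bob"), followers=2),
]

# Current behavior: the nested dicts survive as objects in a "user" column,
# and existing code may index into them.
df = pd.DataFrame(data)
uids = df["user"].apply(lambda d: d["uid"])

# Stand-in for an auto-flattening constructor: the "user" column is gone,
# so the lookup above would raise a KeyError.
flat = pd.json_normalize(data)
user_column_survives = "user" in flat.columns
```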
@kay1793 here are a couple of things to try (and you can see what works best):
- have pd.read_json interpret this (it normally takes a string / file handle), and essentially call json_normalize if it's a nested dict-of-dicts (we might be bending the definition a bit though)
- have the DataFrame constructor deal with this and see if it can do unambiguous interpretation (e.g. you have a dict of dict / scalar mix; instead of the current behavior, actually call json_normalize) - might break things, but unknown until you try
- pd.JSON, which is essentially a wrapper around json_normalize (and is a more expandable way of doing things like this)
read_json data schema isn't wonderful but it is what it is; I don't think making it as mysterious and full of private cases as the DataFrame constructor is a good idea.
json_normalize works fine for me, and when it comes to API design I'm kind of conservative, so none of those options seem acceptable to me. I vote "do nothing".
Edit: clarified my 2nd point.
@kay1793 I think you missed my point on these three. I was suggesting you actually try it and see if it's a problem to try to infer it.
I did disagree with you, but I'm not sure what I've misunderstood. Can you be more specific?
@kay1793
I was trying to find a nice way to promote json_normalize; it's kind of buried.
That's why I suggested making read_json try it (e.g. if a string/filename is not presented). E.g. you have a 'json'-like object, and read_json can internally call json_normalize to try to figure it out.
Doing this by default is problematic on many levels in the DataFrame constructor (though I wanted you to try it and see, maybe it IS possible to infer these types of multi-level dicts).
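The read_json suggestion could be tried out with a thin wrapper like the one below. read_json_like is a hypothetical name used only for this sketch (nothing like it exists in pandas), and the dispatch rule (already-parsed dict or list goes to json_normalize, everything else goes to pd.read_json) is an assumption about what "a 'json'-like object" would mean in practice. It assumes pandas 1.0 or later for pd.json_normalize.

```python
import io

import pandas as pd


def read_json_like(obj, **kwargs):
    """Hypothetical wrapper (not a pandas API).

    Already-parsed JSON (a dict or a list of dicts) is flattened with
    json_normalize; paths and file handles are passed to pd.read_json.
    """
    if isinstance(obj, (dict, list)):
        return pd.json_normalize(obj)
    return pd.read_json(obj, **kwargs)


parsed = [
    {"followers": 1, "user": {"uid": 123, "full_name": "Alice"}},
    {"followers": 2, "user": {"uid": 456, "full_name": "Bob"}},
]
from_objects = read_json_like(parsed)

# A file handle still goes through the regular read_json path.
from_text = read_json_like(io.StringIO('[{"a": 1}, {"a": 2}]'))
```

Keeping the inference in a separate entry point like this leaves the DataFrame constructor untouched, which sidesteps the backward-compatibility worry raised earlier.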