Pandas: Include missing data count in pd.DataFrame.describe method

Created on 30 Jun 2018 · 6 Comments · Source: pandas-dev/pandas

Code Sample

import numpy as np
import pandas as pd

d = {'col1': [1, np.nan], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.describe()

Problem description

By default, the describe method includes only 8 summary statistics for numeric columns (count, mean, std, min, 25%, 50%, 75%, max). It does not report a missing count, which is very important in real-world data analysis.

To include the missing count, I have to use the following code:

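import numpy as np
import pandas as pd

d = {'col1': [1, np.nan], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
des1 = df.describe()  # the standard summary statistics
# Count missing values per column and turn them into a one-row frame labelled 'missing'
des2 = df.isnull().sum().to_frame(name='missing').T
pd.concat([des1, des2])  # append the 'missing' row below the describe() output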

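The workaround above could be wrapped in a small reusable function. This is only a sketch: `describe_with_missing` is a hypothetical name, not part of pandas, and it assumes the frame has at least one column that `describe()` can summarize.

```python
import numpy as np
import pandas as pd


def describe_with_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): describe() plus a 'missing' row."""
    summary = df.describe()
    # Count missing values only for the columns that describe() kept
    missing = df[summary.columns].isna().sum().to_frame(name="missing").T
    return pd.concat([summary, missing])


df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})
result = describe_with_missing(df)
print(result)
```

For the sample frame, the extra `missing` row should read 1 for col1 and 0 for col2, while `count` stays at 1 and 2 respectively.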

Expected Output

I expect the describe method to include the missing count.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Enhancement Missing-data

All 6 comments

@77QingLiu : Are you proposing that we mix together some of the output of DataFrame.info() (this gives you non-null info) and DataFrame.describe()?
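To make the comparison concrete, here is a minimal sketch of what each method reports on the sample frame from the issue. Capturing the `info()` text through its `buf` parameter is only done here so the output can be inspected; the exact layout of that text varies between pandas versions.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})

# info() writes its report to a buffer (stdout by default); it lists non-null counts
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() returns a DataFrame of summary statistics; 'count' is also non-null only
summary = df.describe()
print(summary)
```

Neither output states the number of missing values directly; both report only how many values are present.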

@gfyoung, yes, exactly.

cc @jreback @jorisvandenbossche

Including a missing data count in pd.DataFrame.describe() is definitely necessary.

count is the non-missing length,
so I guess you could add length (or size, which is what we call it).
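The relationship described here can be checked directly: the number of rows minus the non-missing count gives the missing count per column. A minimal sketch, using the sample frame from the issue:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})

count = df.count()      # non-missing values per column
length = len(df)        # total number of rows
missing = length - count

# The derived value matches a direct count of missing entries
print(missing.equals(df.isna().sum()))
```

So if describe() reported the length next to count, the missing count could be read off as the difference between the two.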

Agreed, this is the default behavior of R's summary(df) function, for obvious reasons. More useful than sd anyway.
