Statsmodels: STATA Styled Conditional Logit

Created on 3 Jul 2019  ·  5 Comments  ·  Source: statsmodels/statsmodels

Is your feature request related to a problem? Please describe

When fitting a conditional logit model, if some features have no within-group variance, statsmodels raises an error:

ValueError: need covariance of parameters for computing (unnormalized) covariances

However, STATA does not raise an error. Instead, it reports that variable A has no within-group variance and will be dropped. It then fits the model without that variable and notes the omission in the output. Example (look at the note and at the last variable in the results):

. clogit choice scaledtrip scaledairlines scaledflights scaledprice containslcc clustersizetomenusize clust
> ercounttomenusize clusterdisptomenudisp , group( individual )
string variables not allowed in varlist;
clusterdisptomenudisp is a string variable
r(109);

. clogit choice scaledtrip scaledairlines scaledflights scaledprice containslcc clustersizetomenusize clust
> ercounttomenusize, group( individual )
note: clustercounttomenusize omitted because of no within-group variance.

Iteration 0:   log likelihood = -20772.193  
Iteration 1:   log likelihood = -19968.164  
Iteration 2:   log likelihood = -19932.252  
Iteration 3:   log likelihood = -19932.117  
Iteration 4:   log likelihood = -19932.117  

Conditional (fixed-effects) logistic regression

                                                Number of obs     =    410,201
                                                LR chi2(6)        =   14317.91
                                                Prob > chi2       =     0.0000
Log likelihood = -19932.117                     Pseudo R2         =     0.2643

----------------------------------------------------------------------------------------
                choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
            scaledtrip |  -.4886174   .0232772   -20.99   0.000    -.5342399    -.442995
        scaledairlines |  -.7686357   .0365782   -21.01   0.000    -.8403276   -.6969439
         scaledflights |    -1.2399   .0327024   -37.91   0.000    -1.303996   -1.175805
           scaledprice |  -.5370699   .0134553   -39.92   0.000    -.5634418    -.510698
           containslcc |   .6804741   .0404264    16.83   0.000     .6012399    .7597083
 clustersizetomenusize |   -1.27629   .1263185   -10.10   0.000     -1.52387   -1.028711
clustercounttomenusize |          0  (omitted)
----------------------------------------------------------------------------------------

I wonder whether this could be implemented in statsmodels the way it is done in STATA.
How complex would that be?

type-enh

All 5 comments

The example below shows how this warning works.

Running the code below should produce a warning that some groups were dropped, something like this:

Dropped 3 groups and 30 observations for having no within-group variance

This warning is based on variation (or lack of variation) in the endog variable.

I think you are asking about lack of variation in an exog variable. In general, statsmodels does not check for design matrix singularity (I think R and Stata may do this always by default). There are tradeoffs here, like the cost of an extra QR decomposition on the design matrix. In statsmodels, given a singular design matrix, you may get NaN, Inf, zero, numerical warnings/errors, or any combination thereof.

In conditional logit, the situation is slightly more problematic, because you can have a non-singular exog where certain variables have no within-group variation. This is still a non-identified model. You will see below that when uncommenting the line that creates such a variable, at least in my setup you get a coefficient estimate of zero and a huge standard error. But there is no guarantee that this is always what will happen.

import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

n = 100  # number of groups
m = 10   # observations per group

x = np.random.normal(size=(n*m, 3))
grp = np.kron(np.arange(n), np.ones(m))
# Group-level effect, constant within each group
g = np.kron(np.random.normal(size=n), np.ones(m))

# Uncomment next line to produce a perfectly confounded variable
#x[:, 0] = np.kron(np.random.normal(size=n), np.ones(m))

lpr = x.sum(1) + g
pr = 1 / (1 + np.exp(-lpr))
# np.int was removed in NumPy 1.24; the builtin int is equivalent
y = (np.random.uniform(size=n*m) < pr).astype(int)

model = ConditionalLogit(y, x, groups=grp)
result = model.fit()
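
The point about a non-singular exog that is still non-identified can be checked numerically. The following is a minimal sketch using only NumPy, under the assumption that subtracting group means is a fair stand-in for the within-group information that conditional logit uses. The seed and variable names are illustrative and not taken from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 10  # n groups, m observations per group

x = rng.normal(size=(n * m, 3))
grp = np.repeat(np.arange(n), m)

# Make column 0 constant within each group (no within-group variation).
x[:, 0] = np.repeat(rng.normal(size=n), m)

# Subtract the group mean from every column.
group_means = np.zeros((n, x.shape[1]))
np.add.at(group_means, grp, x)
group_means /= m
x_within = x - group_means[grp]

# The raw design has full column rank, the within-group design does not.
rank_full = np.linalg.matrix_rank(x)
rank_within = np.linalg.matrix_rank(x_within)
print(rank_full, rank_within)
```

A rank check on exog alone would therefore not flag this design; the deficiency only appears after the group means are removed.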

I don't think the question of how to handle a singular exog (or non-identified parameterization) is unique to ConditionalLogit. It is not hard to check for this of course, but it seems to me that there should be a more uniform way to implement aggressive argument checking, e.g. a keyword argument like check_args that has levels "none", "shape", and "values" that would do progressively more expensive pre-checking of arguments.
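
To make the check_args idea concrete, here is a rough sketch of what leveled checking could look like. The function check_design and its behavior are hypothetical: statsmodels has no such function or keyword, and the particular checks at each level are only one possible choice.

```python
import numpy as np

def check_design(endog, exog, level="none"):
    """Hypothetical pre-check, not part of statsmodels."""
    if level not in ("none", "shape", "values"):
        raise ValueError("level must be 'none', 'shape' or 'values'")
    if level == "none":
        return

    endog = np.asarray(endog)
    exog = np.asarray(exog)

    # "shape": cheap checks that only look at dimensions.
    if exog.ndim != 2:
        raise ValueError("exog must be 2-dimensional")
    if endog.shape[0] != exog.shape[0]:
        raise ValueError("endog and exog must have the same number of rows")
    if level == "shape":
        return

    # "values": more expensive checks that look at the data itself.
    if not (np.isfinite(endog).all() and np.isfinite(exog).all()):
        raise ValueError("endog and exog must not contain NaN or Inf")
    rank = np.linalg.matrix_rank(exog)
    if rank < exog.shape[1]:
        raise ValueError(
            f"exog is rank deficient: rank {rank} < {exog.shape[1]} columns"
        )
```

With this layout the default costs nothing, and a user who suspects a problem can opt in to the rank computation with level="values".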

@kshedden, yes. I tried your example and saw the warning. I found that part of the code yesterday when I was looking through the source code.
I also agree about the aggressive argument checking. Maybe there could be a low-level check functionality that can then be extended per use case?
Also, the order of the checks seems important to me. Checking endog first and then exog every time does not seem reasonable to me (maybe I am wrong).
I am not very into software development, I am just thinking out loud here.

(I'm on vacation and don't check email often)

We need helper functions that can check for non-identified parameters in nontrivial cases.
In simple OLS it is essentially just matrix_rank or VIF that the user can run on exog to find a singular or near-singular design.
In more difficult cases non-identified parameters or non-existence of MLE can come from the interaction of endog and exog. The only case we have currently is perfect separation for binary endog models like Logit, GLM Binomial. There we check during fit.
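
For the simple OLS-style case, the rank and VIF checks can be run directly on exog. This is a minimal sketch in plain NumPy. The helper vif is written here for illustration and assumes exog has no constant column; it uses the fact that the variance inflation factors are the diagonal of the inverse correlation matrix of the regressors.

```python
import numpy as np

def vif(exog):
    """VIF per column of exog (no constant column), illustration only."""
    corr = np.corrcoef(np.asarray(exog, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
# Column 2 is almost a linear combination of columns 0 and 1.
x[:, 2] = x[:, 0] + x[:, 1] + 0.01 * rng.normal(size=500)

# The numerical rank is still full, so a rank check alone misses this.
print(np.linalg.matrix_rank(x))
# The VIFs are very large, which flags the near-singular design.
print(vif(x))
```

The rank check catches exactly singular designs, while large VIF values point at near-singular ones such as this example.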

Similar cases would be where we have only a few observations that determine a parameter but we use large sample inference, e.g. a cell in exog that has only a few observations. This is similar to the empty cell case (column of zeros) that patsy might create with interaction of categorical variables.

related #1908

Yes, @josef-pkt, I totally agree with you. We should have several helper functions for sanity-check purposes.
I think that in the conditional logit case the behavior should replicate STATA's, which is to drop such exog variables from the analysis.
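
As a rough idea of what the STATA-like behavior could look like on the user side, here is a sketch of a preprocessing helper. The function drop_no_within_variance, its names argument, and its tolerance are hypothetical and not part of statsmodels. It only detects single columns that are constant within every group, not linear combinations of columns.

```python
import numpy as np

def drop_no_within_variance(exog, groups, names=None, tol=1e-10):
    """Hypothetical helper: drop exog columns constant within every group."""
    exog = np.asarray(exog, dtype=float)
    _, codes = np.unique(np.asarray(groups), return_inverse=True)
    k = exog.shape[1]
    if names is None:
        names = [f"x{j}" for j in range(k)]

    # Group means per column, then deviations from the group mean.
    counts = np.bincount(codes)
    sums = np.zeros((counts.size, k))
    np.add.at(sums, codes, exog)
    within = exog - (sums / counts[:, None])[codes]

    # Keep a column only if some deviation exceeds a scale-relative tolerance.
    scale = np.maximum(np.abs(exog).max(axis=0), 1.0)
    keep = np.abs(within).max(axis=0) > tol * scale

    for name, kept in zip(names, keep):
        if not kept:
            print(f"note: {name} omitted because of no within-group variance.")
    kept_names = [nm for nm, kp in zip(names, keep) if kp]
    return exog[:, keep], kept_names

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(50), 4)
x = rng.normal(size=(200, 3))
x[:, 2] = np.repeat(rng.normal(size=50), 4)  # constant within each group

x_kept, kept = drop_no_within_variance(x, groups, names=["a", "b", "c"])
```

The reduced array x_kept could then be passed to ConditionalLogit in place of the original exog, with kept recording which variables remain.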

How complex should this solution be?
I am very interested in this, both in the problem itself and in solving it.
