Re: FW: [S] general statistical issue: weighting observations in logi - Relogit

11 Nov 2002

Thanks for your note.  The best way to fix the two forms of bias in rare
events logistic regression is to use one of the techniques we discussed in
the paper you cite.  Weighting the data so that events and non-events have
a 50:50 distribution would not help, since you'd have to weight them back
down (using what we call prior correction or weighting) to get unbiased
estimates.

The point that is easy to confuse is that there are two things going on.
First, with case-control data, you must correct for the fact that you're
sampling retrospectively.  Second, whether you have case-control data
or prospectively collected data, "rare events" is defined by the fraction
of events (y=1) in the population, not in your sample.  With rare events,
logistic regression (or logistic regression with a correction for
case-control sampling) underestimates Pr(Y=1).  The corrections we discuss
will fix this problem.
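
As an illustration of what "prior correction" and "weighting" mean in practice, here is a minimal Python sketch of the standard forms of the two corrections. The function names and the example values of tau (population share of events) and ybar (sample share of events) are hypothetical. The sketch covers only the correction for selecting on Y, not the separate rare-events adjustment to Pr(Y=1) mentioned above.

```python
import math

def prior_correction(intercept, tau, ybar):
    """Shift a logit intercept estimated on a sample with event share ybar
    back to a population with event share tau. Slopes are left untouched."""
    return intercept - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

def case_control_weights(tau, ybar):
    """Return (w1, w0): weights for events and non-events that restore the
    population event share tau in a sample with event share ybar."""
    return tau / ybar, (1 - tau) / (1 - ybar)

# Hypothetical example: a 50:50 sample drawn from a population with 1% events.
tau, ybar = 0.01, 0.5
w1, w0 = case_control_weights(tau, ybar)

# Weighted event share in the sample; equals tau up to rounding.
weighted_share = ybar * w1 / (ybar * w1 + (1 - ybar) * w0)

print("corrected intercept:", prior_correction(0.0, tau, ybar))
print("weights:", w1, w0, "weighted share:", weighted_share)
```

Note that the weights move the 50:50 sample back towards the population proportions, which is the "weighting it back down" described above.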

Gary King
David Florence Professor of Government 
Director, Harvard-MIT Data Center

http://GKing.Harvard.Edu, King(a)Harvard.Edu    
Direct (617) 495-2027, Assistant (617) 495-9271
Data Center (617) 495-4734, eFax (928) 832-7022

On Mon, 11 Nov 2002, Siwik, Thomas (DE - Duesseldorf) wrote:

...
 Please note first the information at the end of this email
 ----------------------------------------------------------------------------

 Dear Mr. King,

 On your homepage I found the interesting paper:
 Logistic Regression in Rare Events Data
 Gary King & Langche Zeng

 I have to evaluate a logistic regression in which
 the rare events are weighted equally to the frequent events
 to avoid the problem of biased coefficients.

 May I ask an expert in the field two simple questions:
 - Is 50%/50% weighting an appropriate method to avoid bias?
 - Aren't the drawbacks even more strongly biased betas and the risk
 of overfitting the rare events?

 I cannot find any reliable reference supporting this method.
 However, maybe it makes sense as an easy-to-apply rule of thumb.

 Below I have attached the short discussion I initiated
 on the s-news list.

 Thank you & Best regards
 Thomas Siwik
 ________________________________________________

   Dr. Thomas Siwik
   Deloitte & Touche
   Financial Risk Solutions
   Bahnstrasse 16
   40212 Duesseldorf 
   Germany

   eMail:   tsiwik(a)deloitte.de
   Fon:     ++49.(0)211.8772 - 147 / - 133
   Fax:     ++49.(0)211.8772 - 443
   http:     www.deloitte.de/Dienstl/Bran-fr1.htm
 ________________________________________________

 -----Original Message-----
 From: Siwik, Thomas (DE - Duesseldorf) [mailto:tsiwik@deloitte.de]
 Sent: Monday, 4 November 2002 20:24
 To: 'Hongjiew(a)aol.com'
 Cc: 's-news(a)lists.biostat.wustl.edu'
 Subject: Re: [S] general statistical issue: weighting observations in
 logi

 > [...] One of the nice properties of logistic
 > regression (not sure it carries over to general logit models)
 > is that if oversampling is done based on the response variable,
 > the coefficient estimates of the predictors are not changed.
 > Only the intercept needs to be adjusted. [...]

 I understand your point. If one gives weight lambda to the
 observations with Y=1, the odds ratio is, heuristically speaking,
 changed by lambda as well. That results in an adjustment of the
 intercept.

 However, couldn't it be an asymptotic property of the
 predictors? I tried to verify your assertion in the normal equation:
 lambda*Sum(P(Y=0|x_i)*x_i;i|y_i=1) = Sum(P(Y=1|x_i)*x_i;i|y_i=0)
 If this equation is rewritten with the adjusted P(Y), there still
 remains a term that vanishes only for T->infinity. Maybe
 I have it wrong, but I also ran an example in S+ that showed
 different estimates for all coefficients.
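
 The weighted normal equation above can be checked numerically. What follows is a minimal Python sketch, not the S+ run mentioned in the message. It assumes a simulated sample with a single predictor; the helper `fit_logit`, the true coefficients (-3 and 1), the sample size and the seed are all illustrative assumptions. It fits the model unweighted and with weight lambda on the Y=1 observations, then evaluates both sides of the normal equation (slope component) at the weighted solution.

```python
import math
import random

def fit_logit(xs, ys, ws, iters=25):
    """Weighted logit with intercept a and slope b, fitted by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, ws):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = w * (y - p)        # weighted score contribution
            v = w * p * (1.0 - p)  # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated rare-events sample (assumed true model: logit P(Y=1|x) = -3 + x).
random.seed(0)
n = 4000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(3.0 - x)) else 0 for x in xs]

n1 = sum(ys)
lam = (n - n1) / n1  # weight on Y=1 that gives a 50:50 weighted split

a_u, b_u = fit_logit(xs, ys, [1.0] * n)
a_w, b_w = fit_logit(xs, ys, [lam if y == 1 else 1.0 for y in ys])

# Both sides of the weighted normal equation, evaluated at the weighted fit.
p_w = [1.0 / (1.0 + math.exp(-(a_w + b_w * x))) for x in xs]
lhs = lam * sum((1.0 - p) * x for p, x, y in zip(p_w, xs, ys) if y == 1)
rhs = sum(p * x for p, x, y in zip(p_w, xs, ys) if y == 0)

print("unweighted:", a_u, b_u)
print("weighted:  ", a_w, b_w, "log(lambda) =", math.log(lam))
print("normal equation:", lhs, rhs)
```

 If the model is correctly specified, one would expect the two slopes to be close but not identical, and the intercepts to differ by roughly log(lambda), with the finite-sample gap shrinking as the sample grows.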

 My intuition tells me that overweighting the rare events
 causes overfitting and high sensitivity to rare events. It
 does not seem to me to be an appropriate method for reducing a
 possible bias of predictors of rare events.

 I found a very readable paper illustrating the problem of
 rare events:
 Logistic Regression in Rare Events Data
 Gary King & Langche Zeng
 http://gking.harvard.edu

 Thomas

  -----Original Message-----
 From: Hongjiew(a)aol.com [mailto:Hongjiew@aol.com]
 Sent: Thursday, 31 October 2002 17:48
 To: fharrell(a)virginia.edu; dcts(a)dcts.de
 Cc: s-news(a)lists.biostat.wustl.edu
 Subject: Re: [S] general statistical issue: weighting observations in
 logit-regression

 I agree that if any weighted sampling is used, the parameters
 need to be adjusted to make valid inferences about the original
 population. But there might be practical reasons to use
 response-based sampling. In the area I am in (database
 marketing), we frequently encounter situations where the
 occurrences of events are very "rare", for example
 Pr(Y=1) <= 0.01. There will be computational problems associated
 with the prediction; see "Predictive performance of the binary
 logit model in unbalanced samples" by J. S. Cramer in The
 Statistician (1999). One of the nice properties of logistic
 regression (not sure it carries over to general logit models)
 is that if oversampling is done based on the response variable,
 the coefficient estimates of the predictors are not changed.
 Only the intercept needs to be adjusted. There is research
 comparing the statistical efficiency of different
 sampling schemes in the logistic regression setting. For
 example, suppose N (overall population) = 100K, of which 10% have
 Y=1. One may take a response-based sample (so #y=1 is close to
 #y=0), or one may take a true random sample. The first sample
 can usually be much smaller than the second and still generate
 comparable estimates (see "The effect of sample size and
 proportion of buyers in the sample on the performance of list
 segmentation equations generated by regression analysis" by
 Berger and Magliozzi in the Journal of Direct Marketing).
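
 The claim that response-based sampling leaves the slope coefficients unchanged and shifts only the intercept can be sketched as follows. The notation is an assumption made for illustration: $s_1$ and $s_0$ are the probabilities that a population unit with $Y=1$ or $Y=0$ enters the sample (assumed not to depend on $x$ once $Y$ is given), and $S=1$ marks a sampled unit.

```latex
\begin{align*}
\Pr(Y=1 \mid x, S=1)
  &= \frac{s_1 \Pr(Y=1 \mid x)}
          {s_1 \Pr(Y=1 \mid x) + s_0 \Pr(Y=0 \mid x)}, \\
\operatorname{logit} \Pr(Y=1 \mid x, S=1)
  &= \operatorname{logit} \Pr(Y=1 \mid x) + \ln\frac{s_1}{s_0}
   = \Bigl(\beta_0 + \ln\frac{s_1}{s_0}\Bigr) + x^\top \beta .
\end{align*}
```

 This is a statement about the model that holds in the selected sample. It says nothing about how the estimates behave in a finite sample.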

 In a message dated 10/30/2002 8:27:15 PM Eastern Standard 
 Time, fharrell(a)virginia.edu writes:

 On Wed, 30 Oct 2002 23:22:05 +0100
 DCTS <dcts(a)dcts.de> wrote:

 >
 > I am confronted with a Logit-regression, in which y=0 is much less frequent
 > than y=1. It is argued that the less frequent observations with y=0 should
 > receive higher weights in the regression, such that the proportion is
 > balanced between Ys being 0 and 1.

 Who argues that?  No, you don't want to distort the data.  If your
 sample is a random sample from the population to which you want to
 infer, then rely on maximum likelihood to give good parameter
 estimates.  You weight observations if you oversampled a segment of
 the population and you want to represent the original population
 [even then don't always weight as this reduces efficiency when
 compared with covariate adjustment for oversampling factors].

 Frank Harrell

 >
 > To my knowledge there are usually two motivations to use weights other
 > than unity:
 > - prior knowledge of the probability of y=0
 > - optimisation of a cost function (in the example above y=0 is much more
 >   expensive and should be predicted with higher attention)
 >
 > In my limited econometric library and on the internet I wasn't able to
 > find a discussion of the issue of weighting observations. If someone has
 > a good hint to a source or could sketch the consequences, pros and
 > cons, I would be very pleased.
 >
 >
 > Thank you,
 > Thomas

 -- 
 Frank E Harrell Jr              Prof. of Biostatistics & Statistics
 Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
 U. Virginia School of Medicine   
 http://hesweb1.med.virginia.edu/biostat
 > --------------------------------------------------------------------  
 ----------------------------------------------------------------------------

 This message (including any attachments) contains confidential information
 intended for a specific individual or entity as the intended recipient. 
 If you are not the intended recipient, you are hereby notified that any
 distribution, any copying of this message in part or in whole, or any taking
 of action based on it, is strictly prohibited by law and may cause
 liability. In case you have received this message due to an error in
 transmission, we ask you to notify the sender immediately. 
 Safety warning: Please note that the Internet is not a safe means of
 communication or form of media. Although we are continuously increasing our
 due care of preventing virus attacks as a part of our Quality Management, we
 are not able to fully prevent virus attacks as a result of the nature of the
 Internet. 
 ----------------------------------------------------------------------------

-
relogit mailing list served by Harvard-MIT Data Center
List Address: relogit(a)latte.harvard.edu
Subscribe/Unsubscribe: http://lists.hmdc.harvard.edu/?info=relogit