Thanks for your note. The best way to fix the two forms of bias in rare
events logistic regression is to use one of the techniques we discussed in
the paper you cite. Weighting the observations so that events and
non-events have a 50:50 distribution would not help, since you'd have to
weight them back down (using what we call prior correction or weighting)
to get unbiased estimates.
The point that is easy to confuse is that there are two things going on.
First, in case-control data, you must correct for the fact that you're
sampling retrospectively. Second, whether you have case-control data
or prospectively collected data, "rare events" is defined by the fraction
of y's equal to 1 in the population, not in your sample. With rare events,
logistic regression (or logistic regression with a correction for
case-control sampling) underestimates Pr(Y=1). The corrections we discuss
will fix this problem.
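As a rough illustration of the "prior correction or weighting" step mentioned above, here is a minimal Python sketch. It assumes the standard forms of the two sampling corrections: subtracting ln[((1 - tau)/tau) * (ybar/(1 - ybar))] from the fitted intercept, or weighting events by tau/ybar and non-events by (1 - tau)/(1 - ybar), where tau is the population fraction of events and ybar the sample fraction. The function names, the fitted intercept of -0.3, and the values of tau and ybar are all hypothetical. The sketch covers only the correction for sampling on the response; it does not implement the additional finite-sample rare-events correction.

```python
import math

def prior_corrected_intercept(b0_hat, tau, ybar):
    """Shift a fitted intercept back to the population scale.

    b0_hat: intercept fitted on the response-based sample (hypothetical input)
    tau:    assumed fraction of Y=1 in the population
    ybar:   fraction of Y=1 in the sample
    """
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))

def case_control_weights(tau, ybar):
    """Return (w1, w0), the weights for Y=1 and Y=0 observations."""
    return tau / ybar, (1.0 - tau) / (1.0 - ybar)

# Hypothetical numbers: events are 1% of the population but 50% of the sample.
tau, ybar = 0.01, 0.5
b0 = prior_corrected_intercept(-0.3, tau, ybar)
w1, w0 = case_control_weights(tau, ybar)

# With these weights the weighted share of events equals tau again.
weighted_share = (w1 * ybar) / (w1 * ybar + w0 * (1.0 - ybar))
print(b0, w1, w0, weighted_share)
```

Note that the weights move a 50:50 sample back toward the population proportions, which is the opposite direction from weighting a sample up to 50:50.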
Gary King
David Florence Professor of Government
Director, Harvard-MIT Data Center
King(a)Harvard.Edu
Direct (617) 495-2027, Assistant (617) 495-9271
Data Center (617) 495-4734, eFax (928) 832-7022
On Mon, 11 Nov 2002, Siwik, Thomas (DE - Duesseldorf) wrote:
Please first note the information at the end of this email.
----------------------------------------------------------------------------
Dear Mr. King,
I found the following interesting paper on your homepage:
Logistic Regression in Rare Events Data
Gary King & Langche Zeng
I have to evaluate a logistic regression in which
the rare events are weighted equally to the frequent events
to avoid the problem of biased coefficients.
May I ask an expert in the field two simple questions?
- Is a 50%/50% weighting an appropriate method to avoid bias?
- Aren't the drawbacks even more strongly biased betas and the risk
of overfitting the rare events?
I cannot find any reliable reference supporting this method.
However, maybe it makes sense as an easy-to-apply rule of thumb.
Below I have attached the short discussion I initiated
on the s-news list.
Thank you & Best regards
Thomas Siwik
________________________________________________
Dr. Thomas Siwik
Deloitte & Touche
Financial Risk Solutions
Bahnstrasse 16
40212 Duesseldorf
Germany
eMail: tsiwik(a)deloitte.de
Fon: ++49.(0)211.8772 - 147 / - 133
Fax: ++49.(0)211.8772 - 443
http://www.deloitte.de/Dienstl/Bran-fr1.htm
________________________________________________
-----Original Message-----
From: Siwik, Thomas (DE - Duesseldorf) [mailto:tsiwik@deloitte.de]
Sent: Monday, 4 November 2002 20:24
To: 'Hongjiew(a)aol.com'
Cc: 's-news(a)lists.biostat.wustl.edu'
Subject: Re: [S] general statistical issue: weighting observations in
logi
> One of the nice properties of logistic
> regression (not sure it carries over to general logit models)
> is that if oversampling is done based on the response variable,
> the coefficient estimates of the predictors are not changed.
> Only the intercept needs to be adjusted.
I understand your point. If one gives weight lambda to the
observations with Y=1 the odds-ratio is - heuristically speaking -
changed by lambda as well. That results in an adjustment of the
intercept.
However, couldn't it be only an asymptotic property of the
predictors? I tried to verify your assertion using the normal equation:
lambda*Sum(P(Y=0|x_i)*x_i;i|y_i=1) = Sum(P(Y=1|x_i)*x_i;i|y_i=0)
If this equation is rewritten with the adjusted P(Y), there
still remains a term that vanishes only as T->infinity. Maybe
I have this wrong, but I also ran an example in S+ that showed
different estimates for all coefficients.
My intuition tells me that overweighting rare events
causes overfitting and high sensitivity to rare events. It
does not seem to me to be an appropriate method for reducing a
possible bias in the predictors of rare events.
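The finite-sample point above can be tried on a toy data set. The following is a minimal Python sketch, not the S+ example referred to above: it fits a one-predictor logit by Newton-Raphson with and without a weight lambda on the Y=1 observations, then checks the x-component of the weighted normal equation quoted above at the weighted fit. The data, the value lam = 5, and all function names are hypothetical choices made only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logit(x, y, w, iters=50):
    """Newton-Raphson for logit(P(Y=1|x)) = b0 + b1*x with observation weights w."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = sigmoid(b0 + b1 * xi)
            r = wi * (yi - p)            # weighted score contribution
            g0 += r
            g1 += r * xi
            v = wi * p * (1.0 - p)       # weighted information contribution
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data (hypothetical): 5 events among 30 observations.
x = [-1.0] * 10 + [0.0] * 10 + [1.0] * 10
y = [1] + [0] * 9 + [1] + [0] * 9 + [1] * 3 + [0] * 7
lam = 5.0  # weight on Y=1 so that weighted events (5*5) balance the 25 non-events

unweighted = fit_weighted_logit(x, y, [1.0] * len(y))
weighted = fit_weighted_logit(x, y, [lam if yi == 1 else 1.0 for yi in y])

# lambda*Sum(P(Y=0|x_i)*x_i; y_i=1) should equal Sum(P(Y=1|x_i)*x_i; y_i=0).
b0, b1 = weighted
lhs = lam * sum((1.0 - sigmoid(b0 + b1 * xi)) * xi for xi, yi in zip(x, y) if yi == 1)
rhs = sum(sigmoid(b0 + b1 * xi) * xi for xi, yi in zip(x, y) if yi == 0)
print("unweighted (b0, b1):", unweighted)
print("weighted   (b0, b1):", weighted)
print("normal equation residual:", lhs - rhs)
```

On this small sample the weighted and unweighted slopes are not identical, which is in line with the observation that the two fits agree on the slope only approximately in finite samples. One toy data set does not say anything about the size or direction of the difference in general.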
I found a very readable paper illustrating the problem of
rare events:
Logistic Regression in Rare Events Data
Gary King & Langche Zeng
http://gking.harvard.edu
Thomas
-----Original Message-----
From: Hongjiew(a)aol.com [mailto:Hongjiew@aol.com]
Sent: Thursday, 31 October 2002 17:48
To: fharrell(a)virginia.edu; dcts(a)dcts.de
Cc: s-news(a)lists.biostat.wustl.edu
Subject: Re: [S] general statistical issue: weighting observations in
logit-regression
I agree that if any weighted sampling is used, the parameters
need to be adjusted to make valid inferences about the original
population. But there might be practical reasons to use
response-based sampling. In the area I work in (database
marketing), we frequently encounter situations where the
occurrence of events is very "rare", for example P(Y=1) <= 0.01.
There will be computational problems associated with
the prediction; see "Predictive performance of the binary
logit model in unbalanced samples" by J. S. Cramer in The
Statistician (1999). One of the nice properties of logistic
regression (not sure it carries over to general logit models)
is that if oversampling is done based on the response variable,
the coefficient estimates of the predictors are not changed.
Only the intercept needs to be adjusted. There is research
comparing the statistical efficiency of different
sampling schemes in the logistic regression setting. For
example, if N (overall population) = 100K, of which 10% have
Y=1, one may take a response-based sample (so that #y=1 is
close to #y=0) or a true random sample. The first sample
can usually be much smaller than the second and still generate
comparable estimates (see "The effect of sample size and
proportion of buyers in the sample on the performance of list
segmentation equations generated by regression analysis" by
Berger and Magliozzi in the Journal of Direct Marketing).
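A short derivation may help show why sampling on the response affects only the intercept in the population model. The notation is an assumption introduced here: S=1 marks inclusion in the sample, and s_1 and s_0 are the inclusion probabilities for units with Y=1 and Y=0, taken to depend on Y only and not on x. The sampling fractions in the last line (all events kept, one non-event in nine) are one hypothetical way to realise the 100K example above.

```latex
% Bayes' rule applied to the sampled units (S = 1):
\frac{P(Y=1 \mid x, S=1)}{P(Y=0 \mid x, S=1)}
  = \frac{s_1}{s_0} \cdot \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}

% With logit P(Y=1 | x) = \beta_0 + x'\beta in the population:
\operatorname{logit} P(Y=1 \mid x, S=1)
  = \Bigl(\beta_0 + \ln\frac{s_1}{s_0}\Bigr) + x'\beta

% Example: 10{,}000 events and 90{,}000 non-events, s_1 = 1, s_0 = 1/9:
\ln\frac{s_1}{s_0} = \ln 9 \approx 2.197
```

This is a statement about the model that holds in the sampled population. It does not by itself say that estimates from a finite sample have identical slopes.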
In a message dated 10/30/2002 8:27:15 PM Eastern Standard
Time, fharrell(a)virginia.edu writes:
On Wed, 30 Oct 2002 23:22:05 +0100
DCTS <dcts(a)dcts.de> wrote:
>
> I am confronted with a Logit-regression, in which y=0 is much less frequent
> than y=1. It is argued that the less frequent observations with y=0 should
> receive higher weights in the regression, such that the proportion is
> balanced between Ys being 0 and 1.
Who argues that? No, you don't want to distort the data.
If your sample is a random sample from the population to
which you want to infer, then rely on maximum likelihood to
give good parameter estimates. You weight observations if
you oversampled a segment of the population and you want to
represent the original population [even then don't always
weight as this reduces efficiency when compared with
covariate adjustment for oversampling factors].
Frank Harrell
>
> To my knowledge there are usually two motivations to use weights other
> than unity:
> - prior knowledge of the probability of y=0
> - optimisation of a cost function (in the example above y=0 is much more
> expensive and should be predicted with higher attention)
>
> In my limited econometric library and on the internet I wasn't able to find
> a discussion on the issue of weighting observations. If someone has a good
> hint to a source or could sketch the ideas of consequences, pros and cons I
> would be very pleased.
>
> Thank you,
> Thomas
--
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine
http://hesweb1.med.virginia.edu/biostat
----------------------------------------------------------------------------
This message (including any attachments) contains confidential information
intended for a specific individual or entity as the intended recipient.
If you are not the intended recipient, you are hereby notified that any
distribution, any copying of this message in part or in whole, or any taking
of action based on it, is strictly prohibited by law and may cause
liability. In case you have received this message due to an error in
transmission, we ask you to notify the sender immediately.
Safety warning: Please note that the Internet is not a safe means of
communication or form of media. Although we are continuously increasing our
due care of preventing virus attacks as a part of our Quality Management, we
are not able to fully prevent virus attacks as a result of the nature of the
Internet.
----------------------------------------------------------------------------
-
relogit mailing list served by Harvard-MIT Data Center
List Address: relogit(a)latte.harvard.edu
Subscribe/Unsubscribe: