thanks for your note. pls see below...
On Wed, 27 Jul 2005, Parry Clarke wrote:
Dear Professor King,
My name is Parry Clarke and I am currently writing up my PhD on baboon
intersexual conflict. I'm sorry to bother you but I was wondering if you
could answer some questions regarding Logistic Regression of Rare events.
I am currently attempting to model the occurrence of male aggression
directed at oestrous females. The behaviour itself is fairly rare and so
I have binary coded its occurrence. In total I have 1745 sampling units,
or rows of data, with only 21 of these coded 1 and 1724 coded 0.
Initially I carried out standard logistic regression, but found that
when it came to diagnosis all my influential points were my 1s. So
that, if deletion diagnostics were performed, I was left with no variance.
In addition, my intercepts did not seem very convincing.
However, having now discovered your work on the subject and the package you
have written for R, things are looking up, but I just have a couple of
questions:
1) Are there diagnostics unique to the rare event analysis?
2) Is influential point examination redundant in rare event logistic
regression?
3) How do I get deviance estimates of the final model?
relogit estimates the same coefficients (the same quantities) as logit from
the same model. the estimates do differ, though, in order to get better
properties. it's like weighted least squares: wls will give different
answers -- and will fit the data less well than least squares -- but we
generally prefer wls to ls when there are weights available. so in both
relogit and wls, you have the same issue of how to deal with diagnostics.
there are no special diagnostics, but in both (and lots of other methods)
the issue is that you can't really treat all the observations equally, and
an outlier in one observation isn't the same as an outlier in another. so
in relogit, for example, an extra 1 inadvertently included in the dataset
will be much more consequential than an extra 0.
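a minimal sketch of that last point, in python rather than R and not using relogit itself: it assumes an intercept-only logit, where the maximum likelihood intercept is simply the log odds of the observed 1s, and uses the counts from the question (21 ones, 1724 zeros). the function name is made up for illustration.

```python
import math

def intercept_only_logit(ones, zeros):
    """MLE of the intercept in a logit with no covariates: log(ones / zeros)."""
    return math.log(ones / zeros)

ONES, ZEROS = 21, 1724  # counts reported in the question

base = intercept_only_logit(ONES, ZEROS)

# Shift in the intercept when one stray observation is added to the data.
shift_extra_one = intercept_only_logit(ONES + 1, ZEROS) - base
shift_extra_zero = intercept_only_logit(ONES, ZEROS + 1) - base

print(f"baseline intercept:      {base:.4f}")
print(f"shift from one extra 1:  {shift_extra_one:+.4f}")
print(f"shift from one extra 0:  {shift_extra_zero:+.4f}")
print(f"ratio of magnitudes:     {abs(shift_extra_one / shift_extra_zero):.1f}")
```

with so few 1s, a single stray 1 moves the intercept far more than a single stray 0 does, which is why the 1s dominate any deletion diagnostic.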
4) Part of my analysis has been trying to relate other forms of male
aggression to oestrous female-directed aggression. However, these other forms
are also rare and so when I enter them into the analysis as a dichotomous
explanatory variable I get answers that are not really supported by the
observed data: For example: overlap between the 1s in the response and
explanatory variables may only be 2 or 3 data points but the final model
suggests that a large amount of the deviance is explained and the
coefficients are highly significant: Is this simply an artifact of the rarity
of both the dependent and independent variable and am I, therefore, better
off excluding them from the multivariate analysis and doing separate
contingency table analysis with them?
rare events in explanatory variables are a different issue involving the
sensitivity of the estimates to the coding of X. since almost all
relevant models are conditional on X, you have little choice but to run
the analysis as is unless you reconceptualize the project (such as having
both variables being dependent variables).
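a rough sketch of that sensitivity, in python and under loudly stated assumptions: with a single binary X, the logit slope equals the log odds ratio of the 2x2 table of Y against X. the totals (1745 rows, 21 ones in Y) and the overlap of 3 come from the question; the number of 1s in X (30) is HYPOTHETICAL and chosen only for illustration, as are the function names.

```python
import math

N, Y_ONES = 1745, 21   # from the question
X_ONES = 30            # HYPOTHETICAL count of 1s in the explanatory variable
OVERLAP = 3            # rows where both Y and X are 1 ("2 or 3 data points")

def table(overlap, x_ones):
    """Cell counts of the 2x2 table: (both, y_only, x_only, neither)."""
    both = overlap
    y_only = Y_ONES - overlap
    x_only = x_ones - overlap
    neither = N - both - y_only - x_only
    return both, y_only, x_only, neither

def log_odds_ratio(both, y_only, x_only, neither):
    """Log odds ratio of a 2x2 table; the logit slope for one binary X."""
    return math.log((both * neither) / (y_only * x_only))

# Slope with the data as coded.
as_coded = log_odds_ratio(*table(OVERLAP, X_ONES))

# Slope after recoding X from 1 to 0 for a single overlapping row.
recoded = log_odds_ratio(*table(OVERLAP - 1, X_ONES - 1))

print(f"slope as coded:            {as_coded:.3f}")
print(f"slope after one recoding:  {recoded:.3f}")
print(f"change from one row:       {as_coded - recoded:.3f}")
```

under these made-up counts, recoding one row out of 1745 moves the slope noticeably, so an apparently strong coefficient can rest on two or three observations.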
5) In a number of your articles (e.g. Explaining Rare Events in
International Relations) you talk at length about sampling strategies
and database trimming to create a favourable ratio of 0s to 1s and to cut
down on costs. In a situation such as mine, where the data is collected
and the database fixed, is there any need to trim and subset? If so, how
do you suggest I go about that, and is there a command in R for randomly
selecting subsets of data?
if you have the data, you should use it. no reason to subsample at that
point. we do it to demonstrate what would happen if you couldn't afford
to collect all the data, as is the case in many fields. but more data are
better generally and here too.
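on the question of a command: base R's sample() draws random subsets. purely to show what subsampling would involve if the full data were not available, here is a minimal python sketch, assuming an intercept-only logit and the prior correction for the intercept described in the rare events articles. it keeps all 21 ones and a random draw of zeros; the number of zeros kept (105) and the seed are HYPOTHETICAL choices.

```python
import math
import random

ONES, ZEROS = 21, 1724   # counts reported in the question
ZEROS_KEPT = 105         # HYPOTHETICAL: five zeros kept per one

# Randomly choose which zero rows to keep (sampling without replacement).
rng = random.Random(0)   # fixed seed only so the draw is repeatable
kept_zero_rows = rng.sample(range(ZEROS), ZEROS_KEPT)

tau = ONES / (ONES + ZEROS)                  # fraction of 1s in the full data
ybar = ONES / (ONES + len(kept_zero_rows))   # fraction of 1s in the subsample

# Intercept-only logit on the subsample: log odds of the sampled 1s.
naive = math.log(ybar / (1 - ybar))

# Prior correction: subtract ln[((1 - tau) / tau) * (ybar / (1 - ybar))].
corrected = naive - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

full = math.log(ONES / ZEROS)                # intercept from all 1745 rows

print(f"subsample, uncorrected: {naive:.4f}")
print(f"subsample, corrected:   {corrected:.4f}")
print(f"full data:              {full:.4f}")
```

in this no-covariate case the corrected intercept matches the full-data one, which is the sense in which subsampling on the zeros loses little; it still gains nothing when all the data are already in hand.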
Once again I am sorry to bother you with such a lengthy email and I hope it
is not too much of an imposition.
best of luck with your research,
Gary King
Yours sincerely,
Parry Clarke.