By default, sim() does 1,000 simulations, split over the number of MI data
sets. Since you have 1,000 MI data sets, that is 1 simulation per data set.
I would increase the number of simulations to something reasonable per
data set OR (much preferred) reduce the number of MI data sets.
With 1,000 MI data sets, you are averaging over the simulation uncertainty
that the MI procedure is supposed to introduce. More is not better in this
case. I'd go with something like 5 or 10 data sets.
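For example (a sketch only -- 'mydata' and the formula are placeholders, and the
exact interface for passing imputed data sets varies by Zelig version):

```r
library(Amelia)  # multiple imputation
library(Zelig)

# Impute a small number of data sets (m = 5); sim() then splits its
# default 1,000 simulations as 200 per imputed data set.
a.out <- amelia(mydata, m = 5)            # 'mydata' is a placeholder
z.out <- zelig(y ~ x1 + x2, model = "ls",
               data = a.out$imputations)
s.out <- sim(z.out, x = setx(z.out))
summary(s.out)
```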
If you have too few observations, I would also suggest bootstrapping (as a
precursor to using MI) to see how unstable the model is.
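A minimal sketch of that stability check in base R, with synthetic stand-in data
(the variable names only mirror yours; the real check would refit your actual
model on your data):

```r
# Nonparametric bootstrap of the outcome model to gauge stability.
# Synthetic data stand in for the real file; names are illustrative.
set.seed(1)
n <- 156
dat <- data.frame(ZMath1011 = rnorm(n), ZRead1011 = rnorm(n),
                  ZWrite1011 = rnorm(n))
dat$ZMath1112 <- 0.5 * dat$ZMath1011 + rnorm(n)

boot.coefs <- replicate(500, {
  idx <- sample(n, replace = TRUE)   # resample rows with replacement
  coef(lm(ZMath1112 ~ ZMath1011 + ZRead1011 + ZWrite1011, data = dat[idx, ]))
})
apply(boot.coefs, 1, sd)             # bootstrap SE per coefficient;
                                     # large values flag instability
```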
Best, Olivia
On Fri, Sep 21, 2012 at 12:22 PM, <jrickles(a)ucla.edu> wrote:
Zelig experts,
[I apologize in advance for the long email.]
I am working with a colleague to figure out a good way to place a
confidence interval around an average treatment effect for the treated
(ATT) when using Zelig's sim() function and the sample size is not
particularly large. It would be great to get some advice about whether we
are using sim() correctly and whether our alternative approach makes
sense.
My (cursory) understanding of sim() is that the ATT point estimate and
confidence interval are based on the posterior distribution of the 1,000
conditional expected values for the counterfactual. One concern is that
this can produce an appropriate confidence interval asymptotically but may
be too narrow in finite samples.
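Concretely, that interval comes from quantiles of the simulated ATT draws; a
self-contained sketch, with synthetic numbers standing in for the real
s.out$qi$ev matrix and outcomes:

```r
# Each row of 'ev' is one simulation of the counterfactual expected values
# for the treated units; the ATT draw is mean(Y - EV) per simulation.
# (Synthetic values stand in for the real s.out$qi$ev and observed scores.)
set.seed(1)
ev <- matrix(rnorm(1000 * 156, sd = 0.8), nrow = 1000)  # sims x treated units
y.obs <- rnorm(156, mean = 0.05)                        # observed outcomes
att.sims <- apply(ev, 1, function(e) mean(y.obs - e))   # one ATT per simulation
c(mean = mean(att.sims), quantile(att.sims, c(0.025, 0.975)))
```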
As context, we were asked to estimate the effect of attending a magnet
school versus a comparison school on a test score (for simplicity, assume
strong ignorability, even though it probably doesn't hold, and set aside
the within-school nesting). We tried to equate the treatment & control
groups using inverse probability of treatment weighting (IPTW). After running
x.out1 <- setx(z.out, data = data.t, cond = TRUE)
s.out <- sim(z.out, x = x.out1)
We get the following:
summary(s.out)
Model: ls
Number of simulations: 1000
Mean Values of Observed Data (n = 156)
(Intercept) ZMath1011 ZRead1011 ZWrite1011
1.00000000 0.06322742 0.05521872 -0.28610638
Pooled Expected Values: E(Y|X)
mean sd 2.5% 97.5%
0.03321294 0.80429472 -1.56758254 1.44453357
Pooled Average Treatment Effect for the Treated: Y - EV
mean sd 2.5% 97.5%
0.05244464 0.01945088 0.01379341 0.08863216
We're worried that the sd of 0.02 reflects only between-imputation variance
and not within-sample variance, so we pulled out the expected-value matrix,
recalculated the ATT & standard error treating the 1,000 expected values as
1,000 multiply imputed data sets, and used the Little & Rubin combination
rules to get the total variance:
## Merge Expected Values with Main Treatment-Unit Data File ##
> id <- data.t$SASID                 # vector of student ids
> ev <- s.out$qi$ev                  # matrix of expected values (simulations x students)
> datar <- NULL
> for (i in 1:ncol(ev)) {            # loop over each treatment student
+   tmp <- cbind(1:nrow(ev), rep(id[i], nrow(ev)), ev[, i])
+   datar <- rbind(datar, tmp)
+ }
> datar <- data.frame(datar)
> names(datar) <- c("m", "SASID", "EV")
> datat <- merge(datar, data.t, by = "SASID")    # merge with main data set
> ## Calculate ATT ##
> datat$ATT <- datat$ZMath1112 - datat$EV        # individual-level effect
> att.m <- aggregate(datat$ATT, by = list(datat$m), mean)   # mean ATT per imputation
> att.v <- aggregate(datat$ATT, by = list(datat$m), var)    # variance of ATT per imputation
> W <- mean(att.v$x)                                        # average within variance
> B <- sum((att.m$x - mean(att.m$x))^2) / (nrow(ev) - 1)    # between variance
> T <- sqrt(W/ncol(ev)) + (1 + (1/nrow(ev)))*B              # total standard error
> # ATT point estimate & standard error #
> mean(att.m$x); T
[1] 0.05244464
[1] 0.02913942
> # ATT confidence interval #
> mean(att.m$x) - 2*T; mean(att.m$x) + 2*T
[1] -0.005834205
[1] 0.1107235
So using this approach returns the same point estimate, but a somewhat
larger standard error (0.029 vs. 0.019). As a point of reference, if you
just run a regression on the full sample (weighted by IPTW) you get
ATT=0.053 (se=0.031).
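For reference, the Little & Rubin combining rule we are trying to apply can be
stated compactly: total variance T = Wbar + (1 + 1/m)*B, with pooled SE =
sqrt(T), where Wbar is the average within-imputation variance of the estimate
(in our setup, att.v$x divided by the number of treated units). A small
self-contained sketch (the helper name is illustrative):

```r
# Rubin's rules: pool m point estimates and their within-imputation variances.
# T = Wbar + (1 + 1/m) * B ; pooled SE = sqrt(T).
pool_rubin <- function(est, var_within) {
  m <- length(est)
  Wbar <- mean(var_within)   # average within-imputation variance of the estimate
  B <- var(est)              # between-imputation variance
  total <- Wbar + (1 + 1/m) * B
  c(estimate = mean(est), se = sqrt(total))
}

# Toy call with made-up numbers (not our data):
pool_rubin(c(0.04, 0.05, 0.06), c(0.0004, 0.0005, 0.0004))
```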
We would like to estimate the ATT for different subgroups as well as the
overall ATT, and sample size will really become an issue for some
subgroups. Our main question is whether you think our approach is
appropriate or whether we should stick with the sd & confidence interval
produced by sim(z.out) ... or if there's something better we should do.
Thank you,
Jordan Rickles
--
Zelig Mailing List, served by HUIT
Send messages: zelig(a)lists.gking.harvard.edu
[un]subscribe Options: http://lists.gking.harvard.edu/mailman/listinfo/zelig
Zelig program information: http://gking.harvard.edu/zelig/