I have a few questions about using Zelig to fit a logistic model for rare events. In one of
our data sets, Y=1 for 10% of the data.
1) What is the difference between zelig (model = "relogit") and glm with weights? More
specifically, glm with a weight of 9 on the Y=1 cases, given that Y=1 for 10% of the data.
2) Is there a way to generate predicted values for every observation so that I can check the
predictive scores? With glm, one can use
data$response <- predict(logit, type="response")
to get the predicted probability for each data point. Is there something similar in zelig?
3) I need to implement the model in Java. Which set of coefficients should I use to compute
1 / (1 + e^(-z)): the one from summary(logit) or the one from summary(s.out)?
Thank you so much.
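For the Java side of question 3, here is a minimal sketch of the scoring step I have in mind, assuming the intercept and coefficients are copied out of R by hand and that the feature values arrive in the same order as the coefficients. The class name LogitScorer, its method names, and the numbers in main are hypothetical and only illustrate the arithmetic; which set of fitted coefficients belongs there is the open question.

```java
/** Scores one observation with a logistic model: p = 1 / (1 + e^(-z)). */
final class LogitScorer {

    /** Logistic (inverse-logit) function. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Linear predictor: z = intercept + sum_i coefficients[i] * features[i]. */
    static double linearPredictor(double intercept, double[] coefficients, double[] features) {
        if (coefficients.length != features.length) {
            throw new IllegalArgumentException(
                "Expected " + coefficients.length + " features, got " + features.length);
        }
        double z = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            z += coefficients[i] * features[i];
        }
        return z;
    }

    /** Predicted probability that Y = 1 for one observation. */
    static double probability(double intercept, double[] coefficients, double[] features) {
        return sigmoid(linearPredictor(intercept, coefficients, features));
    }

    public static void main(String[] args) {
        // Hypothetical two-feature model, NOT the fitted model from this post.
        double intercept = -1.0;
        double[] coefficients = {0.5, -0.25};
        double[] features = {2.0, 4.0};
        System.out.println(probability(intercept, coefficients, features));
    }
}
```

For the real model the arrays would hold the 63 slope coefficients and the matching feature values, in the order they appear in the Coefficients table.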
summary(logit)
Call:
zelig(formula = label ~ query_length + user_query_word_count +
user_query_noun_word_count + user_query_noun_length_ratio +
user_query_noun_token_ratio + query_title_wordoverlap_count +
query_title_unique_wordoverlap_count + query_title_jaccard +
query_desc_noun_cosine + query_desc_noun_bigram_cosine +
query_desc_wordoverlap_count + query_desc_unique_wordoverlap_count +
query_desc_bigramoverlap_count + query_desc_triigramoverlap_count +
query_desc_jaccard + query_desc_idx + query_title_wordoverlap_count_ratio +
query_title_unique_wordoverlap_count_ratio + title_length +
desc_length + stem_query_title_noun_cosine + stem_query_title_wordoverlap_count +
stem_query_title_unique_wordoverlap_count + query_title_noun_order +
query_title_overlap_coefficient + query_sdesc_idx + query_sdesc_idx_ratio +
query_sdesc_noun_order_ratio + query_sdesc_overlap_coefficient +
stitle_word_count + stitle_noun_word_count + query_stitle_order_ratio +
query_stitle_noun_order_ratio + query_stitle_overlap_coefficient +
unique_user_query_ratio + unique_user_query_token_ratio +
unique_user_query_noun_token_ratio + avg_token_length + bad_query +
duplicated_title_word + query_topic_size + top_query_prob +
top_title_prob + nonstop_user_query_word_count + nonstop_user_query_length +
nonstop_user_query_word_count_ratio + nonstop_user_query_length_ratio +
nonstop_query_title_cosine + nonstop_query_title_noun_cosine +
nonstopboth_query_title_wordoverlap_count +
nonstopboth_query_title_unique_wordoverlap_count +
nonstopboth_query_title_bigramoverlap_count +
nonstopboth_query_title_trigramoverlap_count +
wo_query_word_count + wo_user_query_length + wo_user_query_word_count_ratio +
wo_user_query_length_ratio + woboth_query_title_cosine +
woboth_query_title_wordoverlap_count + woboth_query_title_unique_wordoverlap_count +
woboth_query_title_bigramoverlap_count + woboth_query_title_trigramoverlap_count +
woboth_query_title_jaccard, model = "relogit", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8505 -0.4359 -0.3178 -0.2324 3.5587
Coefficients:
                                                    Estimate Std. Error z value Pr(>|z|)
(Intercept)                                        2.259e+00  1.301e+00   1.736 0.082497 .
query_length                                       1.090e-01  6.115e-02   1.783 0.074618 .
user_query_word_count                             -1.199e+00  3.841e-01  -3.122 0.001798 **
user_query_noun_word_count                         3.099e-01  9.121e-02   3.397 0.000680 ***
user_query_noun_length_ratio                       6.851e-01  3.952e-01   1.733 0.083029 .
user_query_noun_token_ratio                       -2.225e+00  4.932e-01  -4.512 6.42e-06 ***
query_title_wordoverlap_count                      3.799e+00  1.364e+00   2.786 0.005339 **
query_title_unique_wordoverlap_count              -3.738e+00  1.374e+00  -2.721 0.006500 **
query_title_jaccard                                9.996e-01  8.823e-01   1.133 0.257239
query_desc_noun_cosine                             9.275e-01  1.933e-01   4.797 1.61e-06 ***
query_desc_noun_bigram_cosine                     -4.158e-01  2.505e-01  -1.660 0.096932 .
query_desc_wordoverlap_count                      -5.828e-01  2.825e-01  -2.063 0.039095 *
query_desc_unique_wordoverlap_count                6.378e-01  2.877e-01   2.217 0.026622 *
query_desc_bigramoverlap_count                     3.466e-01  7.529e-02   4.604 4.15e-06 ***
query_desc_triigramoverlap_count                  -1.438e-01  9.850e-02  -1.460 0.144189
query_desc_jaccard                                -8.027e-01  5.086e-01  -1.578 0.114521
query_desc_idx                                     6.954e-03  4.408e-03   1.578 0.114616
query_title_wordoverlap_count_ratio               -3.529e+00  2.419e+00  -1.459 0.144676
query_title_unique_wordoverlap_count_ratio         2.736e+00  2.354e+00   1.162 0.245042
title_length                                       7.499e-04  1.867e-03   0.402 0.687959
desc_length                                        7.714e-04  5.498e-04   1.403 0.160639
stem_query_title_noun_cosine                       8.778e-01  2.929e-01   2.997 0.002731 **
stem_query_title_wordoverlap_count                -1.116e+00  4.617e-01  -2.417 0.015669 *
stem_query_title_unique_wordoverlap_count          7.958e-01  4.669e-01   1.704 0.088333 .
query_title_noun_order                             2.054e-01  6.864e-02   2.992 0.002770 **
query_title_overlap_coefficient                    6.650e-01  7.297e-01   0.911 0.362143
query_sdesc_idx                                   -5.445e-02  1.104e-02  -4.932 8.14e-07 ***
query_sdesc_idx_ratio                              5.256e-01  1.209e-01   4.347 1.38e-05 ***
query_sdesc_noun_order_ratio                      -4.208e-01  1.036e-01  -4.061 4.88e-05 ***
query_sdesc_overlap_coefficient                    3.283e-01  1.375e-01   2.387 0.016991 *
stitle_word_count                                 -5.913e-02  2.296e-02  -2.575 0.010020 *
stitle_noun_word_count                             3.630e-02  2.160e-02   1.680 0.092881 .
query_stitle_order_ratio                           6.025e-02  1.386e-01   0.435 0.663817
query_stitle_noun_order_ratio                     -5.695e-01  1.597e-01  -3.566 0.000362 ***
query_stitle_overlap_coefficient                   7.291e-01  2.798e-01   2.606 0.009174 **
unique_user_query_ratio                           -1.176e+01  3.785e+00  -3.106 0.001893 **
unique_user_query_token_ratio                      7.769e+00  4.176e+00   1.860 0.062863 .
unique_user_query_noun_token_ratio                 1.099e+00  2.248e-01   4.889 1.01e-06 ***
avg_token_length                                   1.845e-01  3.198e-02   5.770 7.94e-09 ***
bad_query                                         -1.675e+00  3.761e-01  -4.453 8.45e-06 ***
duplicated_title_word                             -4.118e-02  6.963e-02  -0.591 0.554238
query_topic_size                                   4.438e-04  4.115e-04   1.079 0.280735
top_query_prob                                     2.678e-05  1.724e-05   1.553 0.120324
top_title_prob                                     7.116e-04  3.666e-04   1.941 0.052221 .
nonstop_user_query_word_count                     -3.467e-01  2.986e-01  -1.161 0.245578
nonstop_user_query_length                          9.078e-02  4.886e-02   1.858 0.063166 .
nonstop_user_query_word_count_ratio               -5.344e-01  6.490e-01  -0.823 0.410257
nonstop_user_query_length_ratio                    4.750e-01  2.510e-01   1.892 0.058471 .
nonstop_query_title_cosine                        -4.957e-01  9.418e-01  -0.526 0.598631
nonstop_query_title_noun_cosine                    4.872e-01  2.363e-01   2.062 0.039224 *
nonstopboth_query_title_wordoverlap_count         -1.175e+00  6.102e-01  -1.925 0.054193 .
nonstopboth_query_title_unique_wordoverlap_count   1.370e+00  6.346e-01   2.158 0.030916 *
nonstopboth_query_title_bigramoverlap_count        2.005e-01  1.364e-01   1.470 0.141630
nonstopboth_query_title_trigramoverlap_count      -2.466e-01  1.386e-01  -1.779 0.075197 .
wo_query_word_count                                1.608e+00  2.784e-01   5.777 7.59e-09 ***
wo_user_query_length                              -2.232e-01  3.581e-02  -6.234 4.56e-10 ***
wo_user_query_word_count_ratio                    -3.705e+00  4.107e-01  -9.021  < 2e-16 ***
wo_user_query_length_ratio                         6.501e-01  2.450e-01   2.653 0.007975 **
woboth_query_title_cosine                         -7.155e-01  8.576e-01  -0.834 0.404152
woboth_query_title_wordoverlap_count              -2.636e+00  9.711e-01  -2.714 0.006648 **
woboth_query_title_unique_wordoverlap_count        2.668e+00  9.796e-01   2.724 0.006452 **
woboth_query_title_bigramoverlap_count            -2.325e-01  1.374e-01  -1.692 0.090635 .
woboth_query_title_trigramoverlap_count            2.379e-02  1.474e-01   0.161 0.871803
woboth_query_title_jaccard                         1.016e+00  8.140e-01   1.248 0.211879
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 15902 on 27023 degrees of freedom
Residual deviance: 13840 on 26960 degrees of freedom
AIC: 13968
Number of Fisher Scoring iterations: 6
Rare events bias correction performed
x.out <- setx(logit)
s.out <- sim(logit, x=x.out)
summary(s.out)
Model: relogit
Number of simulations: 1000
Values of X
(Intercept)                                       1
query_length                                      14.04448
user_query_word_count                             2.38958
user_query_noun_word_count                        1.846174
user_query_noun_length_ratio                      0.799601
user_query_noun_token_ratio                       0.8041626
query_title_wordoverlap_count                     1.393502
query_title_unique_wordoverlap_count              1.383474
query_title_jaccard                               0.1707578
query_desc_noun_cosine                            0.2172732
query_desc_noun_bigram_cosine                     0.03953082
query_desc_wordoverlap_count                      1.341363
query_desc_unique_wordoverlap_count               1.332482
query_desc_bigramoverlap_count                    0.2646906
query_desc_triigramoverlap_count                  0.04965956
query_desc_jaccard                                0.05158569
query_desc_idx                                    15.36238
query_title_wordoverlap_count_ratio               0.6300092
query_title_unique_wordoverlap_count_ratio        0.6277077
title_length                                      50.22062
desc_length                                       194.9923
stem_query_title_noun_cosine                      0.2768097
stem_query_title_wordoverlap_count                0.8819568
stem_query_title_unique_wordoverlap_count         0.8771832
query_title_noun_order                            0.7967732
query_title_overlap_coefficient                   0.6319689
query_sdesc_idx                                   9.485235
query_sdesc_idx_ratio                             0.6333273
query_sdesc_noun_order_ratio                      0.2978386
query_sdesc_overlap_coefficient                   0.4154294
stitle_word_count                                 7.703523
stitle_noun_word_count                            4.863714
query_stitle_order_ratio                          0.4901117
query_stitle_noun_order_ratio                     0.4502644
query_stitle_overlap_coefficient                  0.6018532
unique_user_query_ratio                           0.9962895
unique_user_query_token_ratio                     0.9963705
unique_user_query_noun_token_ratio                0.9465786
avg_token_length                                  5.820267
bad_query                                         0.017836
duplicated_title_word                             0.1809133
query_topic_size                                  41.078
top_query_prob                                    29.34992
top_title_prob                                    4.350128
nonstop_user_query_word_count                     2.293406
nonstop_user_query_length                         13.69819
nonstop_user_query_word_count_ratio               0.9667078
nonstop_user_query_length_ratio                   0.9168887
nonstop_query_title_cosine                        0.3212199
nonstop_query_title_noun_cosine                   0.3147863
nonstopboth_query_title_wordoverlap_count         1.35091
nonstopboth_query_title_unique_wordoverlap_count  1.341844
nonstopboth_query_title_bigramoverlap_count       0.3052102
nonstopboth_query_title_trigramoverlap_count      0.06283304
wo_query_word_count                               2.289483
wo_user_query_length                              13.67166
wo_user_query_word_count_ratio                    0.9583803
wo_user_query_length_ratio                        0.9118931
woboth_query_title_cosine                         0.3395925
woboth_query_title_wordoverlap_count              1.342621
woboth_query_title_unique_wordoverlap_count       1.332889
woboth_query_title_bigramoverlap_count            0.3096137
woboth_query_title_trigramoverlap_count           0.06571936
woboth_query_title_jaccard                        0.1856693
Expected Values: E(Y|X)
mean sd 2.5% 97.5%
1 0.06205613 0.00174499 0.05877185 0.06553267
Predicted Values: Y|X
0 1
1 0.917 0.083