In an earlier article I discussed how to do a cross-tabulation in SPSS. But what if you do not have a data set with the values of the two variables of interest? This happens, for example, when you critically appraise a published study and only have proportions and denominators.
This article demonstrates how SPSS can produce a cross table and run a Chi-square test in both situations. You will see that the results are exactly the same.
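The idea of going from proportions and denominators back to a cross table can be sketched outside SPSS as well. The Python snippet below is only an illustration, not the SPSS procedure the article goes on to describe. The figures (30% of 200 versus 20% of 200) and the function names are made up for the example, and the Chi-square is the plain Pearson statistic without continuity correction.

```python
import math


def counts_from_summary(proportion, denominator):
    """Turn a reported proportion and its denominator into (events, non-events)."""
    events = round(proportion * denominator)
    return events, denominator - events


def pearson_chi_square(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail p-value has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical published summary: 30% of 200 in one group, 20% of 200 in the other.
table = [
    list(counts_from_summary(0.30, 200)),
    list(counts_from_summary(0.20, 200)),
]
chi2, p_value = pearson_chi_square(table)
print(table, round(chi2, 2), round(p_value, 3))
```

Because the test only needs the four cell counts, it makes no difference whether those counts come from a case-level data set or are rebuilt from a published summary.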
‘Normal’ dataset
If you want to test if there is an association between two nominal variables, you do a Chi-square test.
In SPSS you just indicate that one variable (the independent one) should come in the row, (more…)
There are many effect size statistics for ANOVA and regression, and as you may have noticed, journal editors now require you to include one.
Unfortunately, the one your editor wants, or the one most appropriate to your research, may not be the one your software makes available (SPSS, for example, reports Partial Eta Squared only, although it labels it Eta Squared in early versions).
Luckily, all the effect size measures are relatively easy to calculate from information in the ANOVA table on your output. Here are a few common ones: (more…)
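As a rough illustration of how little is needed, here is a Python sketch of three widely used measures (eta squared, partial eta squared and omega squared) computed from sums of squares, degrees of freedom and the error mean square. The numbers are invented, the function names are my own, and these are not necessarily the measures the full article lists. The omega squared formula shown is the usual one for fixed-effects, between-subjects designs.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of total variation attributed to the effect."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """Effect variation relative to effect plus error variation only."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Less biased estimate of the population proportion of variance."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Made-up ANOVA table values, for illustration only.
ss_effect, df_effect = 20.0, 2
ss_error, df_error = 60.0, 30
ss_total = 100.0  # includes a second effect with a sum of squares of 20
ms_error = ss_error / df_error

print(round(eta_squared(ss_effect, ss_total), 3))
print(round(partial_eta_squared(ss_effect, ss_error), 3))
print(round(omega_squared(ss_effect, df_effect, ms_error, ss_total), 3))
```

With more than one effect in the model, eta squared and partial eta squared differ, because the partial version leaves the other effects out of the denominator.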
If you’re like most researchers, your statistical training focused on Regression or ANOVA, but not both. It all depends on whether your field focuses more on experimental data (Biology, Psychology) or observed data (Sociology, Economics). Maybe one class covered a bit of the other, but most people are comfortable in one and not the other.
This, in my opinion, is a shame. (Okay, I was going to say tragedy, but let’s be real. Tsunami that kills thousands=tragedy. Different scale here).
First of all, the distinction between ANOVA and linear regression is arbitrary. They’re really the same model with different outfits on.
Second, regardless of which one you normally use, you’re going to occasionally have to use the other kind of predictor variables–categorical or continuous. And we can come up with nice names for these models–a regression with dummy variables or an Analysis of Covariance.
But real understanding of the relationships among variables comes only when you dispense with the names and focus on analyzing and interpreting the model using the kinds of variables you have.
There are other examples, but today I’m going to focus on an ANOVA model with a continuous covariate.
A common model is one in which one predictor is categorical (we’ll use 4 categories) and the other is continuous. Here is an example of a scatterplot of just such a model:
There are four groups, each of which received a different training. The continuous moderator is Age, and the outcome is OverallPost, which is the post-training test score to see how well they learned the material in each training program.
As you can see, the effect of the training program is moderated by age. Another way to say that is there is a significant interaction between Age and Training Group. The effect of the training depends on the trainee’s age.
One way to interpret this significant interaction is to compare the slopes of the four lines, which is easily done with any regression coefficient table. (Okay, not always easily done, but easily found in…)
But this doesn’t make much sense when Age is really a moderator: a predictor we want to control for, and whose influence on the relationship between the independent variable (IV) and the dependent variable (DV) we want to see, but not the IV we’re really interested in.
A better way to do it in this situation is to compare the means among groups at a low value of Age, say 20, and again at a high value of Age, say 50. You can get p-values, adjusted for multiple comparisons, using either SAS or SPSS GLM.
SAS Proc GLM uses the LSMeans statement and SPSS GLM uses EMMeans. They do the same thing–calculate the mean of Y for each group, at a specific value of the covariate.
If you use the menus in SPSS, you can only get those EMMeans at the Covariate’s mean, which in this example is about 25, where the vertical black line is. This isn’t very useful for our purposes. But we can change the value of the covariate at which to compare the means using syntax.
So it would tell us that at a young age of say 20, the three treatment groups (green, tan, and purple lines) all have means higher than the control (blue). Young people learned more in all three treatment groups.
But at an older age, say 50, the means of the purple and tan groups were not significantly different from the control group’s (blue), and the green (EIQ group) did worse!
In SPSS GLM, the syntax would be:
UNIANOVA OverallPost BY group WITH NEWAGE
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(group) WITH(NEWAGE=MEAN) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=45) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=20) COMPARE ADJ(SIDAK)
/PRINT=PARAMETER
/CRITERIA=ALPHA(.05)
/DESIGN=NEWAGE group NEWAGE*group.
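To make concrete what those EMMEANS lines estimate, here is a small Python sketch. In a model with one factor, one covariate and their interaction, the estimated mean of a group at a given covariate value is the fitted value of that group's own regression line at that value. The data below are invented and deliberately tidy, the group labels and function names are hypothetical, and the sketch gives point estimates only: it does not reproduce the standard errors, pairwise comparisons or Sidak adjustment that SPSS reports.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def means_at(data, age):
    """Fitted OverallPost for each group at one chosen value of NEWAGE."""
    result = {}
    for group, (ages, scores) in data.items():
        intercept, slope = fit_line(ages, scores)
        result[group] = intercept + slope * age
    return result


# Invented data: (NEWAGE values, OverallPost values) per group.
data = {
    "control": ([20, 30, 40, 50], [50, 52, 54, 56]),
    "treatment": ([20, 30, 40, 50], [70, 64, 58, 52]),
}

for age in (20, 45):
    print(age, {g: round(m, 1) for g, m in means_at(data, age).items()})
```

In this made-up example the treatment group is well ahead at age 20 and level with the control group at age 45, which is the kind of pattern that comparing means at several covariate values is meant to reveal.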
Assume you have just done a cohort study. How do you actually do the cross-tabulation to calculate the cumulative incidence in both groups?
It is best to always put the outcome variable (disease yes/no) in the columns and the exposure variable in the rows. In other words, put the dependent variable–the one that describes the problem under study–in the columns. And put the independent variable–the factor assumed to cause the problem–in the rows.
Take as an example a cohort study used to see whether there is a causal relationship between the use of a certain water source and the incidence of diarrhea among children under five in a village with different water sources. In this case, the variable diarrhea (yes/no) should be in the columns. The variable water source (suspected/other) should be in the rows.
SPSS will put the lowest value of the variable in the first column or row. So in order to get those with diarrhea in the first column you should label ‘diarrhea’ as 1 and ‘no diarrhea’ as 2. The same is true for the exposure variable: label the ‘suspected water source’ as 1 and the ‘other water source’ as 2.
You will then be able to calculate the cumulative incidence (risk of developing the disease) among those with the exposure: a / (a + b) and among those without the exposure: c / (c + d).
In the case of the diarrhea study (Table 1), you could calculate the cumulative incidence of diarrhea among those exposed to the suspected water source, which would be (78 / 1,500 =) 5.2%.
You can also do this for those exposed to other water sources, which would be (50 / 1,000 =) 5.0%.
SPSS can give you these percentages immediately (in cells ‘a’ and ‘c’ respectively) when you ask it to display row percentages in the Cells option (Table 2).
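The same arithmetic can be checked by hand. The Python sketch below uses the counts quoted above for the diarrhea study (78 of 1,500 and 50 of 1,000) and the cell labels a, b, c and d from the formulas. The variable and function names are my own, and the snippet is only a check on the calculation, not a replacement for the SPSS output.

```python
# 2x2 table: rows = exposure, columns = outcome (diarrhea, no diarrhea).
a, b = 78, 1500 - 78    # suspected water source
c, d = 50, 1000 - 50    # other water source


def cumulative_incidence(cases, non_cases):
    """Risk of developing the disease within one exposure group."""
    return cases / (cases + non_cases)


risk_exposed = cumulative_incidence(a, b)      # a / (a + b)
risk_unexposed = cumulative_incidence(c, d)    # c / (c + d)

print(f"Suspected water source: {risk_exposed:.1%}")
print(f"Other water source:     {risk_unexposed:.1%}")
```

These are exactly the row percentages in the diarrhea column, which is why the layout with exposure in the rows and outcome in the columns is convenient for cohort data.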
Cross-tabulation in Case-Control Studies
If you used a case-control design for the diarrhea study, the actual cross-tabulation is quite similar; only “presence of diarrhea yes/no” is now changed into “cases” and “controls”.
Label the cases as 1, and the controls as 2. Be aware that row percentages have no meaning in terms of occurrence of disease in case-control studies. This is because in case-control studies the researcher determines how many patients and how many controls are included.
The ratio between the number of patients and controls (e.g. 2 : 1 or 4 : 1) influences the row percentages. So in a case-control study, the cumulative incidence cannot be calculated.
After conducting a case-control study, you can ask SPSS to display column percentages instead. That gives you the proportion of those exposed to the suspected water source among the cases (in cell ‘a’) and among the controls (in cell ‘b’).
Table 3 gives the SPSS output for the same diarrhea study, assuming that it had a case-control design. Using the data provided and requesting column percentages, (78 / 128 =) 60.9% of the cases were exposed to the suspected water source, compared with (1,422 / 2,372 =) 59.9% of the controls.
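A short Python sketch may help show why column percentages are the ones to read here. It uses the counts quoted above; the doubling of the control group is a hypothetical scenario added purely for illustration, and the function names are my own.

```python
def column_percentages(a, b, c, d):
    """Proportion exposed among cases (a, c) and among controls (b, d)."""
    return a / (a + c), b / (b + d)


def row_percentage_exposed(a, b):
    """Share of cases within the exposed row; depends on the case:control ratio."""
    return a / (a + b)


# Counts from the diarrhea example: cases in the first column, controls in the second.
a, b = 78, 1422   # exposed to the suspected water source
c, d = 50, 950    # exposed to other water sources

exposed_cases, exposed_controls = column_percentages(a, b, c, d)
print(f"Exposed among cases:    {exposed_cases:.1%}")
print(f"Exposed among controls: {exposed_controls:.1%}")

# Hypothetical: recruit twice as many controls with the same exposure pattern.
doubled = column_percentages(a, 2 * b, c, 2 * d)
print(f"Row % before: {row_percentage_exposed(a, b):.1%}, "
      f"after doubling controls: {row_percentage_exposed(a, 2 * b):.1%}")
```

The column percentages stay the same when the number of controls changes, while the row percentage shifts with the researcher's choice of how many controls to include.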
Another article will be devoted to measures of association: How do you actually compare cumulative incidence rates in cohort studies? And what measure of association can be used in case-control studies?
About the Author: With expertise in epidemiology, biostatistics and quantitative research projects, Annette Gerritsen, Ph.D. provides services to her clients focussing on the methodological soundness of each phase of an epidemiological study to ensure getting valid answers to the proposed research questions. She is the founder of Epi Result.