Data Analysis and Surveying 101:
Basic research methods and biostatistics as they apply to the
Theresa Jackson Hughes, MPH American College Health Association December 2006
What we will cover today
Research Methods
Sampling Frame and Sampling
Generalizability
Bias
Reliability and Validity
Levels of measurement
Biostatistics
Statistical significance
Other key terms
Appropriate statistical tests
Fun examples from the Spring 2005 dataset!
Get excited! It's data time!
Research Methods
To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.
Arthur Schawlow
That's the nature of research - you don't know what in hell you're doing.
Harold "Doc" Edgerton
If we knew what it was we were doing, it would not be called research, would it?
Albert Einstein
What exactly is research?
Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.
Kerlinger, 1986
Research is an organized and systematic way of finding answers to questions
Important Components of Empirical Research
Problem statement, research questions, purposes, benefits
Theory, assumptions, background literature
Variables and hypotheses
Operational definitions and measurement
Research design and methodology
Instrumentation, sampling
Data analysis
Conclusions, interpretations, recommendations
Sampling
What is your population of interest?
To whom do you want to generalize your results?
All students (18 and over)
Undergraduates only
Greeks
Athletes
Other
Can you sample the entire population?
Sampling
A sample is a smaller (but hopefully representative) collection of units from a population used to determine truths about that population (Field, 2005)
Why sample?
Resources (time, money) and workload
Gives results with known accuracy that can be calculated mathematically
The sampling frame is the list from which the potential respondents are drawn
Registrar's office
Class rosters
Must assess sampling frame errors
Types of Samples
Probability (Random) Samples
Simple random sample
Systematic random sample
Stratified random sample (proportionate or disproportionate)
Cluster sample
Non-Probability Samples
Convenience sample
Purposive sample
Quota sample
Sample Size
Depends on expected response rate
Average 85% for paper
FINAL SAMPLE DESIRED / .85 = SAMPLE
Average 25% for web
FINAL SAMPLE DESIRED / .25 = SAMPLE
Size of Campus     Final Desired N
<600               All students
600-2,999          600
3,000-9,999        700
10,000-19,999      800
20,000-29,000      900
30,000 or more     1,000
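The response-rate arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name invitations_needed is made up for this example, and the 85% and 25% rates are the averages quoted on the slide, not guarantees for any particular campus.

```python
import math


def invitations_needed(final_n: int, response_rate: float) -> int:
    """Students to sample so that roughly final_n of them respond.

    Implements FINAL SAMPLE DESIRED / response rate = SAMPLE,
    rounded up because you cannot invite a fraction of a student.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be between 0 and 1")
    return math.ceil(final_n / response_rate)


# A campus of 3,000-9,999 students wants a final N of 700.
paper = invitations_needed(700, 0.85)  # paper survey, 85% average response
web = invitations_needed(700, 0.25)    # web survey, 25% average response
print(paper, web)
```

Rounding up is a deliberate choice here: rounding down would leave the expected number of respondents just short of the target.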
Bias and Error
Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
Error tends to go in one direction
Examples: Selection, Recall, Social desirability
Random Error
Unrelated to true measures
Example: Momentary fatigue
Reliability and Validity
Reliability
The extent to which a test is repeatable and yields consistent scores
Affected by random error/bias
Validity
The extent to which a test measures what it is supposed to measure
A subjective judgment made on the basis of experience and empirical indicators
Asks "Is the test measuring what you think it's measuring?"
Affected by systematic error/bias
Reliability vs. Validity
In order to be valid, a test must be reliable; but reliability does not guarantee validity.
Levels of Measurement
Nominal
Gender: Male, Female
Vaccinations: Yes, No, Unsure
Ordinal
Personal health status: Excellent, Very good, Good, Fair, Poor
Last 30 days: Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days
Interval
Body Mass Index (BMI)
Ratio
Number of drinks
Number of sexual partners
Perception percentages
Blood alcohol concentration (BAC)
Biostatistics
It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.
R. Hooke
Torture numbers, and they'll confess to anything.
Gregg Easterbrook
98% of all statistics are made up.
Author Unknown
Types of Statistics
Descriptive statistics
Describe the basic features of data in a study Provide summaries about the sample and measures
Inferential statistics
Investigate questions, models, and hypotheses Infer population characteristics based on sample Make judgments about what we observe
Descriptive Statistics
Frequency
Central Tendency: Mode, Median, Mean
Variation: Range, Variance, Standard Deviation
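As a rough illustration of these measures, the sketch below computes each one with Python's standard statistics module. The drinks list is invented for the example and is not taken from the Spring 2005 dataset.

```python
import statistics

# Hypothetical answers to "How many drinks?" from seven students.
drinks = [0, 2, 3, 3, 4, 5, 8]

summary = {
    # Central tendency
    "mode": statistics.mode(drinks),
    "median": statistics.median(drinks),
    "mean": statistics.mean(drinks),
    # Variation
    "range": max(drinks) - min(drinks),
    "variance": statistics.variance(drinks),  # sample variance (n - 1)
    "std_dev": statistics.stdev(drinks),      # square root of the variance
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The sample (n - 1) versions of variance and standard deviation are used because survey data are a sample from the population of interest.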
Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)
Q1 Gen health
                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid    1 excellent    9145       16.9     17.0           17.0
         2 very good    23767      43.9     44.2           61.2
         3 good         16442      30.4     30.6           91.8
         4 fair         3737       6.9      6.9            98.7
         5 poor         565        1.0      1.1            99.8
         6 don't know   132        .2       .2             100.0
         Total          53788      99.4     100.0
Missing  System         323        .6
Total                   54111      100.0
Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)
Q49 Year in school * Q46 Sex Crosstabulation
                              Q46 Sex: 1 female     2 male                Total
Q49 Year in school            Count   % of Total    Count   % of Total    Count   % of Total
1 1st year undergrad          7366    14.5%         4154    8.2%          11520   22.7%
2 2nd year under              6755    13.3%         3678    7.2%          10433   20.6%
3 3rd year under              6195    12.2%         3333    6.6%          9528    18.8%
4 4th year under              5192    10.2%         2676    5.3%          7868    15.5%
5 5th year or more under      1380    2.7%          985     1.9%          2365    4.7%
6 graduate                    5088    10.0%         3246    6.4%          8334    16.4%
7 adult special               203     .4%           105     .2%           308     .6%
8 other                       266     .5%           145     .3%           411     .8%
Total                         32445   63.9%         18322   36.1%         50767   100.0%
Descriptive Statistics Examples
Continuous Variables (Interval/Ratio)
Descriptive Statistics
                            N      Range  Minimum  Maximum  Mean     Std. Deviation  Variance
Q48 Weight in pounds        51935  534    52       586      153.16   35.791          1281.031
HT_INCH Height in Inches    52017  56.00  48.00    104.00   67.2035  4.01241         16.099
Q13 How many drinks         53374  88     0        88       4.42     4.401           19.370
Q12 Hours alcohol           53326  65     0        65       2.99     2.726           7.430
BAC Blood Alcohol Content   50604  2.47   .00      2.47     .0731    .08357          .007
Valid N (listwise)          50218
Hypotheses
Null hypotheses
Presumed true until statistical evidence in the form of a hypothesis test indicates otherwise
There is no effect/relationship
There is no difference in means
Alternative hypotheses
Tested using inferential statistics
There is an effect/relationship
There is a difference in means
Alpha, Beta, Power, Effect Size
Alpha: probability of making a Type I error
Reject null when null is true
Level of significance (the threshold the p value is compared against)
Beta: probability of making a Type II error
Fail to reject null when null is false
Power: probability of correctly rejecting null
1 - Beta
Effect Size
Measure of the strength of the relationship between two variables

                      Null is true                        Null is false
Reject null           Alpha (Type I error)                1 - Beta (Power): CORRECT REJECTION
Fail to reject null   1 - Alpha: CORRECT NONREJECTION     Beta (Type II error)
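To see how alpha, beta, power, and effect size fit together, here is a minimal sketch for a two-sided one-sample z-test. The z-test, the function name z_test_power, and the numbers used are assumptions chosen for illustration; they do not come from the slides or the dataset.

```python
from statistics import NormalDist


def z_test_power(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test.

    effect is the true difference between the population mean and the
    null value; sd is the population standard deviation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # rejection cutoff set by alpha
    shift = abs(effect) * n ** 0.5 / sd    # effect size in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power_small_n = z_test_power(effect=0.5, sd=1.0, n=25)
power_large_n = z_test_power(effect=0.5, sd=1.0, n=100)
beta_large_n = 1 - power_large_n           # Type II error probability
print(power_small_n, power_large_n, beta_large_n)
```

When the true effect is zero, the function returns alpha itself, which is the Type I error cell of the table above.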
Let's test some hypotheses!
Test of the mean of one continuous variable
College students report drinking an average of 5 drinks the last time they partied/socialized
Hypotheses
Ho: μ = 5
HA: μ ≠ 5
Test: Two-tailed t-test
Result: Reject null

One-Sample Statistics
                  N      Mean  Std. Deviation  Std. Error Mean
How many drinks   53374  4.42  4.401           .019

One-Sample Test (Test Value = 5)
                                                                       95% Confidence Interval of the Difference
                  t        df     Sig. (2-tailed)  Mean Difference     Lower   Upper
How many drinks   -30.352  53373  .000             -.578               -.62    -.54
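The one-sample t statistic can be roughly reproduced from the rounded summary values in the output (N, mean, standard deviation). This is a sketch, not the SPSS procedure itself: because the mean is rounded to 4.42, the result only approximates the reported t of -30.352, and the p value uses a normal approximation that is reasonable only because the degrees of freedom are so large.

```python
import math
from statistics import NormalDist

# Rounded summary values from the One-Sample Statistics table.
n, mean, sd = 53374, 4.42, 4.401
test_value = 5  # hypothesized population mean under Ho

std_error = sd / math.sqrt(n)               # standard error of the mean
t_stat = (mean - test_value) / std_error    # one-sample t statistic
df = n - 1

# With df this large the t distribution is practically standard normal,
# so the normal CDF stands in for the t CDF (an approximation).
p_two_tailed = 2 * NormalDist().cdf(-abs(t_stat))

print(f"SE = {std_error:.3f}, t = {t_stat:.2f}, df = {df}, p = {p_two_tailed:.3g}")
```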
Test of a single proportion of one categorical variable
20% of college students report their health is excellent
Hypotheses
Ho: p = .20
HA: p < .20 (one-tailed)
Test: Z-test for a single proportion
Result: Reject null

Binomial Test
                       Category  N      Observed Prop.  Test Prop.  Asymp. Sig. (1-tailed)
Gen health   Group 1   <= 1      9145   .170            .2          .000 a,b
             Group 2   > 1       44643  .830
             Total               53788  1.000
a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.
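A z-test for a single proportion can be sketched directly from the counts in the Binomial Test output. The variable names are made up for this example, and no continuity correction is applied, so the z value may differ slightly from what SPSS computes internally.

```python
import math
from statistics import NormalDist

excellent = 9145   # respondents reporting excellent health (Group 1)
n = 53788          # valid responses
p0 = 0.20          # proportion claimed under Ho

p_hat = excellent / n                        # observed proportion
std_error = math.sqrt(p0 * (1 - p0) / n)     # standard error assuming Ho is true
z_stat = (p_hat - p0) / std_error

# One-tailed test: the alternative says the true proportion is below .20.
p_one_tailed = NormalDist().cdf(z_stat)

print(f"p_hat = {p_hat:.3f}, z = {z_stat:.2f}, p = {p_one_tailed:.3g}")
```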
Test of a relationship between two continuous variables
There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year
Hypotheses
Ho: ρ = 0
HA: ρ ≠ 0
Test: Pearson Product Moment Correlation
Result: Reject null

Correlations
                                        How many drinks  Partners you had
How many drinks    Pearson Correlation  1                .238**
                   Sig. (2-tailed)                       .000
                   N                    53374            52576
Partners you had   Pearson Correlation  .238**           1
                   Sig. (2-tailed)      .000
                   N                    52576            52896
**. Correlation is significant at the 0.01 level (2-tailed).
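For readers who want to see what the Pearson coefficient measures, here is a small sketch. The helper names pearson_r and r_to_t and the two short lists are hypothetical; only r = .238 and N = 52576 come from the output above.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def r_to_t(r, n):
    """t statistic (df = n - 2) for testing Ho: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical data for six students (not from the survey).
drinks = [0, 1, 2, 4, 6, 8]
partners = [0, 1, 1, 2, 1, 3]
print(round(pearson_r(drinks, partners), 3))

# Reported values: r = .238 with N = 52576 pairs.
t_reported = r_to_t(0.238, 52576)
print(round(t_reported, 1))
```

With more than 50,000 pairs even a modest r produces a very large t, which is why the correlation is flagged as significant although .238 is not a strong relationship.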
Test of the difference between two means
Men and women report significantly different numbers of sexual partners over the past 12 months
Hypotheses
Ho: μ1 = μ2
HA: μ1 ≠ μ2
Test: Independent Samples t-test OR One-way ANOVA
Result: Reject null

Group Statistics
                   Sex     N      Mean  Std. Deviation  Std. Error Mean
Partners you had   female  32687  1.34  2.017           .011
                   male    18474  1.82  3.627           .027

Independent Samples Test (Partners you had)
                              Levene's Test for
                              Equality of Variances   t-test for Equality of Means
                                                                                                                        95% Confidence Interval of the Difference
                              F        Sig.           t        df         Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower   Upper
Equal variances assumed       867.978  .000           -19.360  51159      .000             -.483            .025                   -.532   -.434
Equal variances not assumed                           -16.704  25065.988  .000             -.483            .029                   -.540   -.426
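Because Levene's test rejects equal variances here, the "equal variances not assumed" row is the one to read. That row corresponds to Welch's t-test, which can be approximated from the rounded Group Statistics values as sketched below. The variable names are made up, and rounding in the inputs means the results only approximate the reported t of -16.704 and df of 25065.988.

```python
import math

# Rounded values from the Group Statistics table.
n_f, mean_f, sd_f = 32687, 1.34, 2.017   # female
n_m, mean_m, sd_m = 18474, 1.82, 3.627   # male

# Squared standard error of each group mean.
var_f = sd_f ** 2 / n_f
var_m = sd_m ** 2 / n_m

std_error = math.sqrt(var_f + var_m)      # standard error of the difference
t_stat = (mean_f - mean_m) / std_error    # Welch t statistic

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (var_f + var_m) ** 2 / (var_f ** 2 / (n_f - 1) + var_m ** 2 / (n_m - 1))

print(f"SE = {std_error:.3f}, t = {t_stat:.2f}, df = {df:.0f}")
```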
Test of the difference between two or more means
Mean BAC reported differs across student residences
Hypotheses
Ho: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
HA: μi ≠ μj for at least one pair i, j
Test: One-way ANOVA
Result: Reject null

Descriptives: Blood Alcohol Content
                                                                        95% Confidence Interval for Mean
                           N      Mean   Std. Deviation  Std. Error     Lower Bound  Upper Bound  Minimum  Maximum
residence hall             21285  .0741  .08215          .00056         .0730        .0752        .00      1.27
frat/sorority house        781    .1127  .09278          .00332         .1062        .1193        .00      .75
other university housing   3620   .0622  .07357          .00122         .0598        .0646        .00      1.41
off campus                 18151  .0773  .08539          .00063         .0760        .0785        .00      2.47
with parents               4279   .0606  .08490          .00130         .0581        .0631        .00      1.17
other                      2266   .0579  .08296          .00174         .0545        .0613        .00      1.26
Total                      50382  .0731  .08357          .00037         .0724        .0738        .00      2.47

ANOVA: Blood Alcohol Content
                 Sum of Squares  df     Mean Square  F       Sig.
Between Groups   3.188           5      .638         92.123  .000
Within Groups    348.695         50376  .007
Total            351.884         50381
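The F statistic in the ANOVA table is just the ratio of the two mean squares, which the sketch below recomputes from the rounded sums of squares. Eta squared is added here as one common effect size for ANOVA; it is not part of the original output, and the variable names are made up.

```python
# Rounded values from the ANOVA table.
ss_between, df_between = 3.188, 5
ss_within, df_within = 348.695, 50376

ms_between = ss_between / df_between   # variation between residence groups
ms_within = ss_within / df_within      # variation within residence groups
f_stat = ms_between / ms_within        # close to the reported 92.123

# Eta squared: share of total BAC variation explained by residence.
# (An addition for illustration, not shown in the SPSS output.)
eta_squared = ss_between / (ss_between + ss_within)

print(f"MS between = {ms_between:.3f}, F = {f_stat:.2f}, eta squared = {eta_squared:.4f}")
```

An eta squared under 1% alongside a highly significant F is a concrete case of statistical significance differing from practical significance when the sample is very large.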
Test of the difference between two or more means
Multiple Comparisons
Dependent Variable: Blood Alcohol Content (Games-Howell)
                                                                                                    95% Confidence Interval
(I) Currently live         (J) Currently live         Mean Difference (I-J)  Std. Error  Sig.       Lower Bound  Upper Bound
residence hall             frat/sorority house        -.03865*               .00337      .000       -.0483       -.0290
                           other university housing   .01190*                .00135      .000       .0081        .0157
                           off campus                 -.00316*               .00085      .003       -.0056       -.0007
                           with parents               .01350*                .00141      .000       .0095        .0175
                           other                      .01623*                .00183      .000       .0110        .0215
frat/sorority house        residence hall             .03865*                .00337      .000       .0290        .0483
                           other university housing   .05055*                .00354      .000       .0404        .0606
                           off campus                 .03548*                .00338      .000       .0258        .0451
                           with parents               .05215*                .00356      .000       .0420        .0623
                           other                      .05488*                .00375      .000       .0442        .0656
other university housing   residence hall             -.01190*               .00135      .000       -.0157       -.0081
                           frat/sorority house        -.05055*               .00354      .000       -.0606       -.0404
                           off campus                 -.01506*               .00138      .000       -.0190       -.0111
                           with parents               .00160                 .00178      .947       -.0035       .0067
                           other                      .00433                 .00213      .323       -.0017       .0104
off campus                 residence hall             .00316*                .00085      .003       .0007        .0056
                           frat/sorority house        -.03548*               .00338      .000       -.0451       -.0258
                           other university housing   .01506*                .00138      .000       .0111        .0190
                           with parents               .01667*                .00144      .000       .0125        .0208
                           other                      .01940*                .00185      .000       .0141        .0247
with parents               residence hall             -.01350*               .00141      .000       -.0175       -.0095
                           frat/sorority house        -.05215*               .00356      .000       -.0623       -.0420
                           other university housing   -.00160                .00178      .947       -.0067       .0035
                           off campus                 -.01667*               .00144      .000       -.0208       -.0125
                           other                      .00273                 .00217      .809       -.0035       .0089
other                      residence hall             -.01623*               .00183      .000       -.0215       -.0110
                           frat/sorority house        -.05488*               .00375      .000       -.0656       -.0442
                           other university housing   -.00433                .00213      .323       -.0104       .0017
                           off campus                 -.01940*               .00185      .000       -.0247       -.0141
                           with parents               -.00273                .00217      .809       -.0089       .0035
*. The mean difference is significant at the .05 level.
Test for a relationship between two categorical variables
Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
Hypotheses
Ho: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression. HA: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.
Test: Chi-square test for independence Result: Fail to reject null
Test for relationship between two categorical variables
Ever - Depression * Frat or sorority? Crosstabulation
                                          Frat or sorority?
                                          yes      no        Total
Ever - Depression   yes  Count            681      7692      8373
                         Expected Count   715.6    7657.4    8373.0
                    no   Count            3744     39657     43401
                         Expected Count   3709.4   39691.6   43401.0
Total                    Count            4425     47349     51774
                         Expected Count   4425.0   47349.0   51774.0

Chi-Square Tests
                               Value     df  Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square             2.185 b   1   .139
Continuity Correction a        2.122     1   .145
Likelihood Ratio               2.211     1   .137
Fisher's Exact Test                                                 .141                  .073
Linear-by-Linear Association   2.185     1   .139
N of Valid Cases               51774
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
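The Pearson chi-square value can be recomputed from the observed counts in the crosstabulation. This sketch builds the expected counts from the row and column totals and sums (observed - expected)² / expected over the four cells. The p value shortcut using math.erfc is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

# Observed counts: rows = ever diagnosed with depression (yes, no),
# columns = fraternity/sorority member (yes, no).
observed = [[681, 7692],
            [3744, 39657]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# For df = 1, chi-square is a squared standard normal, so the upper-tail
# probability can be written with the complementary error function.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

A p value above .05 matches the slide's conclusion: fail to reject the null of no association.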
Important Points to Remember
A significant association does not indicate causation
Statistical significance is not always the same as practical significance
Multiple factors contribute to whether your results are significant
It gets easier and easier as you practice!
Questions???