1.1. How many records are there in the dataset (“frmgham2.csv”)? How many participants
are there in the teaching dataset? Explain why these two numbers are different.
The database is a subset of the data collected as part of the Framingham study and includes a set of variables related to the medical history and adjudicated event data on 4,434 participants.
Each participant has 1 to 3 observations depending on the number of exams the subject attended, and as a result there are 11,627 observations on the 4,434 participants.
1.2. Create a file that contains only records related to the First Examination (“PERIOD=1”) and call it “DATASET1”. How many subjects undergo the first examination? Note that, all the questions below (from (Q1.2) onward) will be based on “DATASET1” only.
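I used IBM SPSS to process the dataset. Of the 11,627 records, 4,434 belong to the first examination (PERIOD = 1), one per participant, so 4,434 subjects underwent the first examination.
The code: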
DATASET COPY DATASET1.sav.
DATASET ACTIVATE DATASET1.sav.
FILTER OFF.
USE ALL.
SELECT IF (PERIOD = 1).
EXECUTE.
DATASET ACTIVATE DataSet1.
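The same counting and filtering logic can be sketched outside SPSS. The snippet below is a minimal illustration in Python, not the procedure used in this report; the identifier column name RANDID and the six sample rows are assumptions made only for the example.

```python
import csv
import io

def count_records_and_participants(rows, id_col="RANDID"):
    """Return (number of records, number of distinct participants)."""
    ids = {row[id_col] for row in rows}
    return len(rows), len(ids)

def first_exam(rows, period_col="PERIOD"):
    """Keep only the records from the first examination (PERIOD = 1)."""
    return [row for row in rows if int(row[period_col]) == 1]

# Hypothetical sample: three participants attending 2, 3 and 1 exams.
sample_csv = """RANDID,PERIOD,BMI
101,1,26.97
101,3,27.10
102,1,28.73
102,2,29.43
102,3,28.50
103,1,25.34
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
n_records, n_participants = count_records_and_participants(rows)
dataset1 = first_exam(rows)
print(n_records, n_participants, len(dataset1))
```

In the sample, the number of records exceeds the number of participants because some participants contribute more than one exam, which is the same reason 11,627 differs from 4,434 in the full file.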
1.3. Design a table with relevant summary/descriptive statistics, stratified by gender (“SEX”), to describe the population enrolled in the first examination. Write a paragraph (no more than 250 words) to describe the data and any findings from this table. This table should contain the following variables : “AGE”, “BMI”,”SYSBP”, “DIABP”, “CURSMOKE”, “CIGPDAY”, ”DIABETES”.
The variables were labeled as indicated in the Framingham Heart Study Longitudinal Data Documentation. The summary is displayed in Table 1.
The table was produced through the menu sequence Analyze > Descriptive Statistics > Explore, after which the variables were selected.
The code used:
EXAMINE VARIABLES=AGE BMI SYSBP DIABP CURSMOKE CIGPDAY DIABETES BY SEX
/PLOT BOXPLOT STEMLEAF HISTOGRAM
/COMPARE GROUPS
/PERCENTILES (5,10,25,50,75,90,95) HAVERAGE
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
The first examination included 1,944 men and 2,490 women (Table 1); of these, 1,923 men and 2,460 women entered the Explore analysis, presumably because /MISSING LISTWISE drops cases with any missing value. The mean age of men (49.79 years) is slightly lower than that of women (50.02 years). A quarter of the men are 42 or younger and a quarter are 57 or older; a quarter of the women are 43 or younger and a quarter are 57 or older. Mean Body Mass Index (BMI) is slightly higher in men than in women, but the relative variability is higher in women: the coefficient of variation is 17.78% for women and 13.07% for men. A quarter of the men have a BMI of 23.96 or less, while a quarter of the women have a BMI of 22.54 or less. The quartiles of systolic blood pressure suggest a high prevalence of arterial hypertension, since 25% of men have values of 141.50 mmHg or more and 25% of women have values of 146.38 mmHg or more. The greatest relative variability is found in the number of cigarettes smoked per day, with a coefficient of variation of 104.3% in men and 158.0% in women. Cigarette smoking is common: 60.4% of men and 40.4% of women are current smokers. Diabetes is uncommon (3.0% of men and 2.5% of women).
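The coefficients of variation quoted above follow from the means and standard deviations in Table 1 (CV = 100 × SD / mean). A minimal Python check, using only the values reported in the table, could look like this:

```python
def coefficient_of_variation(mean, sd):
    """Coefficient of variation in percent: 100 * SD / mean."""
    return 100.0 * sd / mean

# (mean, SD) pairs copied from Table 1.
table1 = {
    ("BMI", "men"): (26.17, 3.42),
    ("BMI", "women"): (25.59, 4.55),
    ("CIGPDAY", "men"): (13.22, 13.79),
    ("CIGPDAY", "women"): (5.67, 8.96),
}

cv = {key: coefficient_of_variation(m, s) for key, (m, s) in table1.items()}
for (variable, sex), value in cv.items():
    print(f"{variable:8s} {sex:6s} CV = {value:.2f}%")
```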
Table 1. Summary statistics of the population enrolled in the first examination, by gender

| Characteristic | Statistic | Men (n = 1,944) | Women (n = 2,490) |
|---|---|---|---|
| Age at exam (years) | Mean ± SD | 49.79 ± 8.72 | 50.02 ± 8.64 |
| | Q1 | 42.00 | 43.00 |
| | Q3 | 57.00 | 57.00 |
| Systolic Blood Pressure (mmHg) | Mean ± SD | 131.77 ± 19.33 | 133.74 ± 24.36 |
| | Q1 | 118.00 | 116.00 |
| | Q3 | 141.50 | 146.38 |
| Body Mass Index | Mean ± SD | 26.17 ± 3.42 | 25.59 ± 4.55 |
| | Q1 | 23.96 | 22.54 |
| | Q3 | 28.34 | 27.82 |
| Diastolic Blood Pressure (mmHg) | Mean ± SD | 83.75 ± 11.46 | 82.56 ± 12.41 |
| | Q1 | 76.00 | 74.00 |
| | Q3 | 90.00 | 89.00 |
| Current cigarette smoking (%) | | 60.4 | 40.4 |
| Number of cigarettes smoked each day | Mean ± SD | 13.22 ± 13.79 | 5.67 ± 8.96 |
| | Q1 | 0.00 | 0.00 |
| | Q3 | 20.00 | 10.00 |
| Diabetes (%) | | 3.0 | 2.5 |
[Question 2]
2.1. Examine the distribution of the variable “BMI” for those who have Hypertension and those who are Hypertension-free using histogram. Describe the distribution of this variable.
The BMI distribution for the hypertension-free group is approximately symmetrical, except for its right tail, which contains atypical and extreme values.
The BMI distribution for the group with hypertension (Figure 2) is also bell-shaped, but with greater dispersion than in the previous case: both the overall range of the data and the interval in which most observations are concentrated are wider.
2.2. Generate a box and whisker plot of “BMI” for those who have Hypertension and those who are Hypertension-free at First Examination . Include the boxplots in your answer. Describe different aspects of the box and whisker plot.
The box-and-whisker plot confirms the presence of numerous atypical and extreme values in both Prevalent Hypertensive groups. Both distributions are positively skewed, since the data extend towards the higher BMI values. Within the interquartile range, however, each distribution is more symmetrical: the median lies at a similar distance from the first and third quartiles in each group. The median, first quartile, and third quartile of BMI are all higher in the group of people with hypertension.
2.3. Using the histogram from (2.1) and boxplots from (2.2), state the relationship between Hypertension and BMI?
People with hypertension tend to have a higher BMI than people without hypertension, so hypertension appears to be associated with higher BMI values.
2.4 Conduct a statistical analysis to study the relationship between having Hypertension and BMI. State your hypotheses and write up your conclusions.
Figure 4- Histograms of Body Mass Index by Prevalent Hypertensive
The histograms of BMI by Prevalent Hypertensive status show that both series are approximately normally distributed, and the two samples come from independent populations, so a parametric test for the difference of means can be applied.
Table 2. Group statistics for BMI by Prevalent Hypertensive

| | Prevalent Hypertensive | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Body Mass Index | Free of disease | 2,992 | 25.0044 | 3.55539 | .06500 |
| | Prevalent disease | 1,423 | 27.6161 | 4.58385 | .12151 |
Table 3. Independent Samples Test for BMI (Levene's Test for Equality of Variances and t-test for Equality of Means)

| BMI | Levene's F | Levene's Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|
| Equal variances assumed | 61.975 | .000 | -20.70 | 4,413 | .000 | -2.61175 | .12612 | -2.8590 | -2.3645 |
| Equal variances not assumed | | | -18.95 | 2,264.035 | .000 | -2.61175 | .13781 | -2.8819 | -2.3415 |
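Hypotheses: H0: the mean BMI is the same in the two groups (μ1 = μ2); H1: the mean BMI differs between the two groups (μ1 ≠ μ2). Levene's test rejects the equality of variances (F = 61.98, p < 0.001), so the results with equal variances not assumed are used. Based on the two-sample t-test for independent samples, the null hypothesis that the mean BMI is equal in the two groups is rejected (t = -18.95, df = 2264.04, p < 0.001). Mean BMI is significantly higher in the hypertensive group, by 2.61 units (95% CI 2.34 to 2.88).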
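As a cross-check, the unequal-variances (Welch) t statistic and its degrees of freedom can be recomputed from the group statistics in Table 2 alone. The following Python sketch uses the standard Welch formulas; it is an illustration, not the SPSS routine itself, and small differences from the SPSS output are due to rounding of the inputs.

```python
import math

def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic, standard error and Satterthwaite df from summary data."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    se = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, df

# Values from Table 2: free of disease vs. prevalent disease.
t_stat, se_diff, df_welch = welch_t_from_summary(
    25.0044, 3.55539, 2992,
    27.6161, 4.58385, 1423,
)
print(f"t = {t_stat:.2f}, SE = {se_diff:.5f}, df = {df_welch:.1f}")
```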
2.5. From your statistical investigation of the relationship between BMI and Hypertension above, can you conclude that Hypertension (or no-hypertension) is induced by high (or low) BMI? Explain your reason.
We can conclude that the mean BMI is higher in the Prevalent disease group, but this result does not prove that hypertension is induced by high BMI. The hypothesis test only allows conclusions about the parameters of the two distributions analyzed; it does not establish causal relations. In cross-sectional observational data such as these, the association could equally reflect confounding or reverse causation.
[Question 3] In Q3 we are interested in the First Examination data only (“PERIOD=1” or “DATASET1”). Using the values of BMI, one can categorize a subject into “underweight”, “normal”, “overweight”, “obese” (4 groups). Create a new variable “BMIGP” using the definitions below…..
BMI was recoded into a new variable with this code:
RECODE BMI (Lowest thru 18.49=1) (18.5 thru 24.99=2) (25 thru 29.99=3) (30 thru Highest=4)
  INTO BMIGP.
VARIABLE LABELS BMIGP 'BMI Groups'.
EXECUTE.
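The grouping rule can also be written as a small function. The Python sketch below mirrors the cut points of the RECODE command above, with one assumption: it uses half-open intervals (below 18.5, below 25, below 30), so a value such as 18.495, which would fall in the gap between RECODE ranges, is still assigned a group. The function name and label dictionary are illustrative.

```python
BMI_LABELS = {1: "Underweight", 2: "Normal", 3: "Overweight", 4: "Obese"}

def bmi_group(bmi):
    """Return the BMIGP code (1-4) for a BMI value, or None if BMI is missing."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return 1
    if bmi < 25:
        return 2
    if bmi < 30:
        return 3
    return 4

for value in (17.9, 18.5, 24.99, 25.0, 29.99, 30.0, None):
    code = bmi_group(value)
    print(value, code, BMI_LABELS.get(code, "Missing"))
```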
3.1. Display the frequency table of the new variable “BMIGP”. Include any missing values on your table if there any.
According to the frequency distribution of the BMI groups (Table 4), 57 people (1.3% of the valid data) are underweight and 577 people (13.1% of the valid data) are obese. BMI is missing for 19 participants (0.4%).
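Table 4. Frequency distribution of BMIGP

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | Underweight | 57 | 1.3 | 1.3 | 1.3 |
| | Normal | 1,936 | 43.7 | 43.9 | 45.1 |
| | Overweight | 1,845 | 41.6 | 41.8 | 86.9 |
| | Obese | 577 | 13.0 | 13.1 | 100.0 |
| | Total | 4,415 | 99.6 | 100.0 | |
| Missing | System | 19 | .4 | | |
| Total | | 4,434 | 100.0 | | |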
3.2. Cross-tabulate “BMIGP” with smoking status (“CURSMOKE”). What is the prevalence of smoking in each BMI group? What can you observe from these results in terms of the relationship between Smoking status and BMI?
As seen from Table 5, the prevalence of smoking is 66.7% in the underweight group, 57.5% in the normal-weight group, 44.6% in the overweight group, and 34.8% in the obese group. The prevalence of smoking therefore decreases as the BMI category increases.
Table 5. Crosstab of BMI Groups by Current cigarette smoking at exam

| BMI Groups | | Not current smoker | Current smoker | Total |
|---|---|---|---|---|
| Underweight | Count | 19 | 38 | 57 |
| | % within BMI Groups | 33.3% | 66.7% | 100.0% |
| Normal | Count | 823 | 1,113 | 1,936 |
| | % within BMI Groups | 42.5% | 57.5% | 100.0% |
| Overweight | Count | 1,023 | 822 | 1,845 |
| | % within BMI Groups | 55.4% | 44.6% | 100.0% |
| Obese | Count | 376 | 201 | 577 |
| | % within BMI Groups | 65.2% | 34.8% | 100.0% |
| Total | Count | 2,241 | 2,174 | 4,415 |
| | % within BMI Groups | 50.8% | 49.2% | 100.0% |
3.3. Conduct an analysis to study the relationship between smoking status and BMI groups. State your hypotheses and interpret your findings.
To examine the relationship between smoking status and BMI groups, we use the Chi-square test of independence.
As Daniel and Cross (2013) indicate, “perhaps the most frequent, use of the Chi-square distribution is to test the null hypothesis that two criteria of classification, when applied to the same set of entities, are independent. We say that two criteria of classification are independent if the distribution of one criterion is the same no matter what the distribution of the other criterion.”
Table 6. Chi-Square tests of BMI Groups by Current cigarette smoking at exam

| | Value | df | Asymp. Sig. (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | 123.759 (a) | 3 | .000 |
| Likelihood Ratio | 124.906 | 3 | .000 |
| Linear-by-Linear Association | 122.690 | 1 | .000 |
| N of Valid Cases | 4,415 | | |

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 28.07.
Hypotheses:
H0: BMI Groups and Current cigarette smoking are independent.
H1: The two variables are not independent.
α = 0.05
Based on the Chi-square test, the null hypothesis of no relationship between the two variables is rejected (χ² = 123.76, df = 3, p < 0.001). There is therefore a statistically significant association between Body Mass Index group and current cigarette smoking.
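The Pearson Chi-square statistic can be reproduced from the observed counts in Table 5. The Python sketch below computes the expected counts under independence (row total × column total / N) and sums (observed − expected)² / expected over the eight cells; it is a hand-rolled illustration rather than the SPSS procedure.

```python
def pearson_chi_square(table):
    """Return (chi-square, df, minimum expected count) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected

# Observed counts from Table 5: [not current smoker, current smoker].
observed = [
    [19, 38],      # Underweight
    [823, 1113],   # Normal
    [1023, 822],   # Overweight
    [376, 201],    # Obese
]

chi2, df, min_expected = pearson_chi_square(observed)
print(f"chi-square = {chi2:.3f}, df = {df}, min expected = {min_expected:.2f}")
```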
[Question 4]
4.1. Use any software to generate a scatterplot of “SYSBP” (on Y-axis) and “BMI” (on X-axis). Include the plot in the report. Calculate Pearson correlation coefficient. Use visual inspection as well as Pearson coefficient to describe the relationship between BMI and SYSBP.
Using SPSS, I generated the scatterplot (Figure 5). A large number of points are concentrated in the bottom left of the plot, and the shape of the point cloud suggests a weak positive linear correlation between BMI and Systolic Blood Pressure.
Figure 5. Scatterplot of Systolic Blood Pressure (mmHg) vs. Body Mass index
Table 7. Correlation coefficient between Systolic Blood Pressure (mmHg) and Body Mass Index

| | | Systolic Blood Pressure (mmHg) | Body Mass Index |
|---|---|---|---|
| Systolic Blood Pressure (mmHg) | Pearson Correlation | 1 | .328** |
| | Sig. (2-tailed) | | .000 |
| | N | 4,434 | 4,415 |
| Body Mass Index | Pearson Correlation | .328** | 1 |
| | Sig. (2-tailed) | .000 | |
| | N | 4,415 | 4,415 |

**. Correlation is significant at the 0.01 level (2-tailed).
The Pearson correlation coefficient (PCC) is 0.328: as BMI increases, Systolic Blood Pressure tends to increase.
Based on the t-test for the correlation coefficient, the null hypothesis that the population correlation between BMI and Systolic Blood Pressure is zero is rejected (t = 23.07, df = 4,413, p < 0.001; the degrees of freedom are N − 2 with N = 4,415 valid pairs). We can therefore conclude that there is a positive linear correlation between BMI and Systolic Blood Pressure in the population.
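The t statistic for a correlation coefficient is t = r·sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. A short Python check using the values in Table 7 (r = 0.328 and n = 4,415 valid pairs) is sketched below; because r is rounded to three decimals, the result agrees with the SPSS regression output only to about two decimals.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing H0: rho = 0."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r ** 2)
    return t, df

t_corr, df_corr = correlation_t(0.328, 4415)
print(f"t = {t_corr:.2f}, df = {df_corr}")
```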
4.2. Assume Simple Linear Regression (SLR) analysis is to be used to study the relationship between Systolic Blood Pressure and BMI. Treat “BMI” as the independent variable and “SYSBP” as the dependent variable.
4.2.1. Write down the model/formula of SLR in term of “BMI” and “SYSBP”
The simple linear regression (SLR) model is:
SYSBP = β0 + β1 · BMI + ε,
where ε is the error term, which is the difference between the observed value of SYSBP and the value of SYSBP estimated by the model.
Gujarati (2002) points out that this model is called linear because the parameters to be estimated (β0 and β1) are raised only to the first power.
Therefore, it is a model that is linear in the parameters. In addition, the model fits a straight line to the point cloud. The parameter β0 is the intercept, while the parameter β1 is the slope of that straight line.
4.2.2. State the estimation method used to find the best linear fit of the data.
I used the method of least squares; the resulting line is called the least-squares line (Daniel & Cross, 2013).
The method consists of minimizing the sum of the squared deviations of the observed values of the dependent variable from the values estimated by the regression line. In other words, it minimizes the sum of the squared vertical distances from each point to the fitted line.
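For a single predictor, the least-squares estimates have a closed form: the slope is Sxy / Sxx and the intercept is mean(y) − slope × mean(x). The Python sketch below applies these formulas to a small made-up data set (the four points are hypothetical and chosen so that the fit is exact); it illustrates the method and is not the SPSS computation.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares_line(x_demo, y_demo)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```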
4.2.3. List the assumptions used behind SLR for SLR to be a valid model for data analysis.
Simple linear regression rests on the following assumptions:
Linear relationship: the relationship between the dependent and the independent variable is linear.
Normality: the variables are normally distributed (strictly, the assumption applies to the error term).
No autocorrelation: there is little or no autocorrelation in the errors.
Homoscedasticity: the variation around the regression line is the same for all values of the independent variable.
4.2.4. Examine all the assumptions listed in (4.2.3) using “SYSBP/BMI” in “DATASET1”. Report whether each assumption has been satisfied or not and justify your answer.
The scatterplot in Figure 5 shows a weak positive linear correlation between the two variables, so the first assumption is satisfied.
To evaluate the normality of the variables, histograms with the normal curve superimposed are shown in Figures 6 and 7. The SYSBP variable is not exactly normally distributed, but it approximates that distribution. BMI is approximately normally distributed.
Figure 6. Histogram of Systolic Blood Pressure
Figure 7. Histogram of Body Mass Index
Figure 8. Boxplot of Systolic Blood Pressure by BMI groups
The boxplot of Systolic Blood Pressure by BMI group (Figure 8) shows that, except for the values identified as atypical or extreme in each group, Systolic Blood Pressure has a similar variability in the four groups, and the intervals over which it varies overlap. On this basis, the assumption of equality of variances is considered satisfied.
The absence of autocorrelation can be checked in Table 8, where the model summary reports the Durbin-Watson d statistic. Since d = 1.947 is very close to 2, there is no evidence of serial correlation. Gujarati (2002) states that “As a rule of thumb, if an application finds that d is equal to 2, it can be assumed that there is no first order autocorrelation, either positive or negative.”
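The Durbin-Watson statistic cited above is defined as the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The Python sketch below implements that definition on two hypothetical residual series (not the residuals of this model) to show the extremes: identical consecutive residuals give d = 0, while alternating signs push d above 2.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, for illustration only.
d_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])     # strong positive autocorrelation
d_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])   # strong negative autocorrelation
print(d_positive, d_negative)
```

Values near 2, such as the 1.947 reported in Table 8, lie between these two cases.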
4.2.5. Conduct SLR using any computer software. Write out the estimate regression line. Is it the relationship between BMI and Systolic BP statistically significant at the 5% level? Justify your answer. Use the estimated coefficient(s) to explain the size of effect between BMI and Systolic BP.
Table 10 shows that the estimated regression equation is: SYSBP = 86.667 + 1.789 * BMI
The intercept has no practical interpretation, since a BMI equal to zero is not possible for a person. The slope indicates that when BMI increases by one unit, Systolic Blood Pressure increases by 1.789 mmHg on average.
The model summary shows that the standard error of the estimate is 21.1249 mmHg, which measures the dispersion of the observed values around the regression line. The model explains about 10.8% of the total variability of Systolic Blood Pressure (R² = 0.108; adjusted R² = 0.107).
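To make the size of the effect concrete, the fitted equation SYSBP = 86.667 + 1.789 * BMI can be evaluated at a few BMI values. The Python sketch below does this for BMI 25 and 30, which are example values chosen for illustration; the function name is hypothetical.

```python
INTERCEPT = 86.667  # estimated constant, mmHg
SLOPE = 1.789       # estimated change in SYSBP (mmHg) per unit of BMI

def predict_sysbp(bmi):
    """Predicted Systolic Blood Pressure (mmHg) from the fitted line."""
    return INTERCEPT + SLOPE * bmi

pred_25 = predict_sysbp(25.0)
pred_30 = predict_sysbp(30.0)
print(f"BMI 25: {pred_25:.1f} mmHg, BMI 30: {pred_30:.1f} mmHg, "
      f"difference: {pred_30 - pred_25:.1f} mmHg")
```

Under this model, a difference of five BMI units corresponds to a difference of about 8.9 mmHg in predicted Systolic Blood Pressure.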
Table 8. Model summary of regression (b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | .328 (a) | .108 | .107 | 21.1249 | 1.947 |

a. Predictors: (Constant), Body Mass Index
b. Dependent Variable: Systolic Blood Pressure (mmHg)
Table 9. ANOVA (a)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 237,559.912 | 1 | 237,559.912 | 532.334 | .000 (b) |
| | Residual | 1,969,348.985 | 4,413 | 446.261 | | |
| | Total | 2,206,908.896 | 4,414 | | | |

a. Dependent Variable: Systolic Blood Pressure (mmHg)
b. Predictors: (Constant), Body Mass Index
As seen from the ANOVA results (Table 9), the null hypothesis that the regression slope is equal to zero is rejected (F = 532.33; df = 1, 4,413; p < 0.001; R² = 0.108). BMI therefore explains a statistically significant part of the variation in Systolic Blood Pressure, and the relationship is significant at the 5% level.
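The F statistic and R² follow directly from the sums of squares in Table 9: F is the regression mean square divided by the residual mean square, and R² is the regression sum of squares divided by the total sum of squares. A minimal Python check with the table values:

```python
# Sums of squares and degrees of freedom copied from Table 9.
ss_regression = 237559.912
ss_residual = 1969348.985
ss_total = 2206908.896
df_regression = 1
df_residual = 4413

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(f"MS residual = {ms_residual:.3f}")
print(f"F = {f_statistic:.2f} on ({df_regression}, {df_residual}) df")
print(f"R squared = {r_squared:.3f}")
```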
Table 10. Coefficients of linear regression model (a)

| Model | | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 86.667 | 2.029 | | 42.722 | .000 | 82.690 | 90.644 |
| | Body Mass Index | 1.789 | .078 | .328 | 23.072 | .000 | 1.637 | 1.940 |

a. Dependent Variable: Systolic Blood Pressure (mmHg)
Based on the regression results (Table 10), the null hypothesis that each regression coefficient is equal to zero is rejected (t = 42.72, p < 0.001 for the constant; t = 23.07, p < 0.001 for BMI).
The constant and the coefficient of the independent variable are both different from zero and should be included in the model to predict Systolic Blood Pressure.
References
Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences (10th ed.). Wiley.
Gujarati, D. (2002). Basic econometrics: With software disk package. New York: McGraw-Hill.