*Q1. Explore the data. Plot and produce summary statistics to identify the key characteristics of the data and produce a report of your findings. Ideally, I would expect between 5 and 10 tables or figures accompanied by a description of your main findings. Among the topics that you might choose to discuss: identification of possible outliers or mistakes in the data, distribution of the variables provided, correlations and relationships between variables?*

The housing dataset contains historical data on housing in the Boston area. We use R to analyse this data. The main purpose of this project is to apply different statistical methods and techniques to interpret the data.

This dataset is a data frame with 506 rows and 14 variables. MEDV is our dependent variable.

After loading the data in R, we used the output of head() to check that the data had been imported correctly.

The output of str() gives us the relevant information about our data frame, such as the number of observations, the number of variables, the name and class of each column, and sample values from each column.

The str() output shows that only CR01, CHAS and RAD are categorical; the remaining 11 variables are numerical.

To get more detailed statistical information on each column, the summary() function is used. It shows the minimum, maximum, median, mean, and the 1st and 3rd quartile values for each column in our dataset. It also reports the number of missing values, if any are present. We see that there are no missing values in any of our variables.
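The checks described above can be sketched on a small made-up data frame. This is only an illustration: the `toy` data frame and its values are hypothetical stand-ins (the real housing.csv is not reproduced here), and one missing value is injected on purpose so that the missing-value check has something to report.

```r
# Hypothetical stand-in for the housing data; values are made up.
set.seed(1)
toy <- data.frame(
  MEDV = round(rnorm(20, mean = 22, sd = 9), 1),
  RM   = round(rnorm(20, mean = 6.3, sd = 0.7), 2),
  CHAS = sample(0:1, 20, replace = TRUE)
)
toy$RM[3] <- NA  # inject one missing value for the demonstration

str(toy)       # number of rows and columns, column classes, sample values
summary(toy)   # min, quartiles, median, mean, max, and NA counts per column

# Explicit count of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

On the real data the same `colSums(is.na(...))` call would be expected to return zeros for every column, in line with the finding above.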

A scatterplot matrix of all the variables is produced to show the relationships between them.

Since we have many variables, the scatterplot matrix is too dense to read on an ordinary screen, so we also use a correlation matrix to represent the data.

Correlations:

Correlation measures the strength of the linear relationship between two variables. Its value lies in the range of -1 to 1. A value of 0 shows that the variables are linearly unrelated. As the value moves away from 0 towards 1 or -1, the correlation gets stronger. A positive correlation means that an increase in one variable is associated with an increase in the other, and a negative correlation means that an increase in one variable is associated with a decrease in the other.
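A minimal sketch of these three cases, using synthetic vectors rather than the housing variables (all names below are made up for illustration):

```r
set.seed(42)
x      <- rnorm(200)
y_pos  <-  2 * x + rnorm(200, sd = 0.5)  # tends to rise with x
y_neg  <- -2 * x + rnorm(200, sd = 0.5)  # tends to fall as x rises
y_none <- rnorm(200)                     # generated independently of x

# Pearson correlation coefficients; each lies between -1 and 1
r <- c(positive = cor(x, y_pos),
       negative = cor(x, y_neg),
       none     = cor(x, y_none))
print(round(r, 2))
```

The first coefficient comes out close to 1, the second close to -1, and the third close to 0.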

We use a correlation matrix to visualise the correlations. Let’s look at the correlations of each variable in turn.

CRIM: CRIM, per capita crime rate by town, has a strong positive correlation with RAD, index of accessibility to radial highways. This suggests that crime is higher in areas with better highway access, perhaps because they are not vigilantly patrolled. CRIM has a negative correlation with MEDV, median value of owner-occupied homes in $1000s: few people want to live in a neighbourhood with a high crime rate, so house prices in such neighbourhoods tend to be lower.

ZN: ZN, proportion of residential land zoned for lots over 25,000 sq. ft, shows its highest positive correlation with DIS, weighted distances to five Boston employment centers, which means that as the distance from employment centers increases, the share of land zoned for large residential lots increases. These areas are located away from the city. ZN has its most negative correlation with NOX, nitric oxide concentration, which means there is less pollution in areas zoned for large lots, which tend to be far from employment centers.

INDUS: INDUS, proportion of non-retail business acres per town, has its strongest positive correlation with NOX, nitric oxide concentration, which tells us that industrial areas have a high amount of nitric oxide in the air. INDUS has a negative correlation with DIS: the proportion of industry decreases as you move away from the employment centers.

NOX: NOX, nitric oxide concentration, has a strong positive correlation with INDUS, which tells us that the amount of nitric oxide in the air is high in industrial areas. It has a negative correlation with ZN, proportion of residential land zoned for lots over 25,000 sq. ft, which shows that areas zoned for large residential lots have less nitric oxide, given that they are located away from industrial areas.

RM: RM, average number of rooms per dwelling, has a strong positive correlation with MEDV, median value of owner-occupied homes in $1000s, which tells us that a house with more rooms tends to have a higher median value. RM has a strong negative correlation with LSTAT, percentage of lower status of the population. This shows that the number of rooms in a dwelling increases as the percentage of lower-status population decreases.

AGE: AGE, proportion of owner-occupied units built prior to 1940, has a strong positive correlation with NOX, nitric oxide concentration. This means that the older owner-occupied units are located in areas with a high amount of nitric oxide. AGE has a negative correlation with DIS, weighted distances to five Boston employment centers, which means that as AGE increases the distance from employment centers decreases, implying that the older units are located close to where the employment centers are now.

DIS: DIS, weighted distances to five Boston employment centers, has a positive correlation with ZN, proportion of residential land zoned for lots over 25,000 sq. ft, which means that as the share of large residential lots increases, their distance from employment centers increases. DIS has a negative correlation with AGE, INDUS and NOX, which means that the greater the distance to the employment centers, the lower the nitric oxide concentration, the lower the proportion of industry, and the lower the proportion of owner-occupied units built prior to 1940.

RAD: RAD, index of accessibility to radial highways, has a strong positive correlation with TAX, full-value property tax rate per $10,000, which means properties with better highway access tend to have higher tax rates. RAD has a negative correlation with DIS, weighted distances to five Boston employment centers, which indicates that highway accessibility, and with it the tax rate, tends to be lower farther from the employment centers.

TAX: TAX, full-value property tax rate per $10,000, has a high positive correlation with RAD, index of accessibility to radial highways, which means that the better a property's highway access, the higher the tax rate on that property. It has a negative correlation with DIS, weighted distances to five Boston employment centers, which indicates that the shorter the distance from the employment centers, the higher TAX will be.

LSTAT: LSTAT, percentage of lower status of the population, has a positive correlation with INDUS, AGE and NOX. This means LSTAT increases as we go to areas where nitric oxide is high or where the proportion of owner-occupied units built prior to 1940 is high. It has a negative correlation with MEDV, which shows that as the median value of houses increases, the percentage of lower-status population falls.

MEDV: MEDV, median value of owner-occupied homes in $1000s, has a positive correlation with RM, average number of rooms per dwelling, which means that the median value of houses increases as the average number of rooms in a house increases. MEDV has a negative correlation with LSTAT, which shows that as the median value of houses increases, the percentage of lower-status population falls.

We may have to choose a few significant variables, so we first look at the correlation of every variable with MEDV (the dependent variable).

The table validates our analysis of MEDV in the correlation matrix: RM, average number of rooms, has the strongest positive correlation with MEDV, while LSTAT, the percentage of lower status population, and PTRATIO, the pupil-teacher ratio by town, have strong negative correlations, along with NOX, nitric oxide concentration.

Zero Variance Check:

The output ( [1] 0 ) of the near-zero variance check shows that no variable has zero or near-zero variance.
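To make the idea behind this check concrete, here is a rough base-R sketch of a near-zero-variance screen. It is not the caret implementation used in the report: the function `near_zero_var` and the `toy` data frame are hypothetical, and the two thresholds are assumed to mirror caret's documented defaults (a frequency ratio of 95/5 and a unique-value cut of 10 percent).

```r
# A column is flagged when one value dominates (high frequency ratio)
# and the column has few distinct values relative to its length.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)          # constant column: zero variance
  freq_ratio     <- counts[[1]] / counts[[2]]   # most common vs. second most common value
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Made-up columns to exercise the three cases
toy <- data.frame(
  constant = rep(1, 100),
  rare     = c(rep(0, 99), 1),   # a single value dominates
  varied   = seq_len(100)
)

flags <- vapply(toy, near_zero_var, logical(1))
print(flags)
sum(flags)   # number of flagged columns
```

A result of 0 from the final sum, as reported for the housing data, would mean that no column is constant or dominated by a single value.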

Histogram:

We use histograms to examine the frequency and overall density of the data. The histograms can be seen below. We observed that CRIM, DIS and ZN have a positively skewed distribution, whereas PTRATIO and AGE have a left-skewed distribution. We also observed that NOX has a bimodal distribution, while INDUS, CRIM, PTRATIO and ZN have unimodal distributions. RM has a symmetric distribution. The values of ZN, CRIM and PTRATIO lie mostly in a small range.

ggplot2:

We used ggplot2 to visualize the distribution and density of MEDV. The black curve represents the density. We also plotted the boxplot. We observe that the median value of owner-occupied homes is skewed to the right, with a number of outliers to the right. We may transform the ‘MEDV’ column using a function such as the natural logarithm when building the regression model.
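The effect of a log transform on a right-skewed variable can be sketched as follows. The data are synthetic draws from a log-normal distribution, chosen only because they are right-skewed; `medv_like` and the `skewness` helper are hypothetical names and not part of the original analysis.

```r
# Sample skewness as the mean cubed z-score (a simple moment-based estimate)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(7)
medv_like <- rlnorm(500, meanlog = 3, sdlog = 0.4)  # synthetic, right-skewed values

skew_raw <- skewness(medv_like)        # clearly positive: long right tail
skew_log <- skewness(log(medv_like))   # close to zero after the transform

print(round(c(raw = skew_raw, log = skew_log), 2))
```

Whether the same transform helps for MEDV itself would need to be checked on the real data, for instance by comparing the two histograms.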

Box Plot

To identify the outliers in the data we use boxplots, because they plot extreme data points individually. From the plots we observe that the variables CRIM, ZN, PTRATIO, DIS, LSTAT and MEDV have outliers.

Outliers can be dealt with in a number of ways. Observations with outliers can be removed if they are not significant, but instead of removing them we use the median rather than the mean for our analysis. CRIM and ZN both have a mean greater than the median, so their boxplots look similar. AGE and PTRATIO have a mean smaller than the median. Since outliers do not influence the interquartile range, we can use it to measure the spread of the data.
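A small sketch of how a boxplot flags outliers and why the median and interquartile range are less affected by them than the mean. The vector `x` is made up for illustration, with one deliberately extreme value.

```r
# Made-up values with one extreme observation
x <- c(2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 3.5, 3.7, 25.0)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws
stats <- boxplot.stats(x)
print(stats$out)   # points beyond the whiskers

# The mean is pulled towards the extreme value; the median and IQR are not
spread <- c(mean = mean(x), median = median(x), IQR = IQR(x))
print(spread)
```

Here only the value 25 falls outside the whiskers, and the mean sits well above the median, the same pattern described above for CRIM and ZN.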

The variables NOX, AGE and INDUS have the largest interquartile ranges, indicating that the middle half of their values is the most widely spread.

*Q2. Develop a regression model to predict MEDV from one or more of the other variables. Discuss your methodology including, for example, variable selection, goodness of fit, performance. Consider both linear and nonlinear models. Produce a report of your findings supported by plots and statistical analysis. (35 marks)*

First, we run a linear regression to predict MEDV (the dependent variable) using 10 of the 13 explanatory variables; the categorical variables CR01, CHAS and RAD are removed. summary(Linear_Model1) reports a multiple R-squared of 0.7167. The results show several variables with high p-values that are insignificant, so we expect that removing them will simplify the model with little loss in R-squared. We therefore remove INDUS, AGE and TAX, which had zero stars and the highest p-values. This leaves 7 explanatory variables, and summary(Linear_Model2) gives a multiple R-squared of 0.7163, almost the same as before.

We repeat the procedure and remove the variables with fewer stars, ZN and CRIM. With 5 significant explanatory variables, all with low p-values and many stars, summary(Linear_Model3) gives a multiple R-squared of 0.7081.

Trying different combinations, we see that some variables that show low significance alongside all the other variables become significant when placed with only a few. To check the fit of our models, in Linear_Model4 NOX and PTRATIO are replaced by CRIM and ZN; R-squared decreases to 0.6737. In Linear_Model5 we remove CRIM, DIS and ZN, keeping only two variables, LSTAT and RM. R-squared and adjusted R-squared decrease again, to 0.6386 and 0.6371 respectively.

Finally, we check the impact of a single variable, LSTAT, on MEDV. summary(Linear_Model6) gives an R-squared of 0.5441 and an adjusted R-squared of 0.5432. Both values are well below those of the previous model, yet LSTAT alone still explains more than half of the variance in MEDV, which shows that LSTAT has the strongest negative impact on MEDV, as shown by the graph below.
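The comparison of nested models by R-squared and adjusted R-squared can be sketched on R's built-in mtcars data. This is only a stand-in: the variables (mpg, wt, hp, qsec, drat) have nothing to do with the housing data, and the three model names are hypothetical.

```r
# Three nested linear models on the built-in mtcars data (a stand-in dataset)
fit_full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_small <- lm(mpg ~ wt + hp, data = mtcars)
fit_one   <- lm(mpg ~ wt, data = mtcars)

# Collect multiple and adjusted R-squared for each model
compare <- sapply(
  list(full = fit_full, small = fit_small, one = fit_one),
  function(fit) {
    s <- summary(fit)
    c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
  }
)
print(round(compare, 4))
```

Dropping a predictor can never raise the multiple R-squared of a nested model, so the useful question is how little is lost; adjusted R-squared penalises the extra terms and is the fairer basis for comparing models of different size.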

For further accuracy we consider other models for the prediction of MEDV. We built 3 polynomial regression models. The first model, ‘Polynomial’, contains only one explanatory variable, LSTAT. The second model, ‘Polynomial2’, is built on the fitted values of Linear_Model3, whose variables are NOX, RM, DIS, PTRATIO and LSTAT. The third model, ‘Polynomial3’, is built on the fitted values of Linear_Model4, whose variables are CRIM, ZN, RM, DIS and LSTAT. To choose the polynomial degree, we automated in R a comparison of the test-set mean squared error over degrees 1 to 10; the degree we selected is 10. Running the regressions at that degree, we observe that Polynomial2 has the highest multiple R-squared, 0.7725. The graphs of all models are shown below:
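As an aside on the degree selection described above, the idea can be sketched with a hold-out set and synthetic data. Everything here is assumed for illustration: the data are generated from a quadratic curve with noise, and the names `dat`, `test_mse` and `best_degree` are hypothetical.

```r
set.seed(123)
n <- 300
x <- runif(n, 1, 38)
y <- 40 - 2.2 * x + 0.04 * x^2 + rnorm(n, sd = 3)  # synthetic curved relationship
dat <- data.frame(x = x, y = y)

# Hold out 100 rows for testing
test_idx <- sample(n, 100)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Fit polynomials of increasing degree on the training rows
# and record the mean squared error on the held-out rows
degrees <- 1:6
test_mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  mean((predict(fit, newdata = test) - test$y)^2)
})

best_degree <- degrees[which.min(test_mse)]
plot(degrees, test_mse, type = "b")
```

In a sketch like this the test error typically drops sharply up to the true degree and then flattens, so a high degree chosen by a small margin on a single random split is worth re-checking with a different seed.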

We perform cross validation to assess the accuracy of the models and determine which one is best, keeping the model specifications the same.

We first apply cross validation to the single-variable model with LSTAT. On the training data, multiple R-squared is 0.5326 and adjusted R-squared is 0.5314; on the test data, multiple R-squared is 0.5897 and adjusted R-squared is 0.5855. From the adjusted R-squared values on both datasets we conclude that Linear_Model6 does not predict MEDV effectively.

Now we apply cross validation on a multiple variable model (Linear_Model3). Multiple R-squared is 0.7202, Adjusted R-squared is 0.7167 on Train data, and for the Test data Multiple R-squared is 0.6846, Adjusted R-squared is 0.6678. By looking at the adjusted R squared values of both datasets we conclude that Linear_Model3 is predicting MEDV effectively and accurately.

Lastly, cross validation is applied on Linear_Model4. Multiple R-squared is 0.6845, Adjusted R-squared is 0.6806 on Train data, and for the Test data Multiple R-squared is 0.6969, Adjusted R-squared is 0.6808. By looking at the adjusted R squared values of both datasets we conclude that Linear_Model4 is predicting MEDV effectively and accurately.
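The comparison above fits each model separately on the training and test subsets. A different and more common variant, k-fold cross validation, fits on k-1 folds and measures prediction error on the held-out fold. The sketch below shows that variant on the built-in mtcars data as a stand-in; the helper `cv_rmse` and the choice of k = 5 are assumptions for illustration, not the procedure used in this report.

```r
set.seed(2024)
k <- 5
# Assign every row of the stand-in dataset to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Average out-of-fold root mean squared error for a given model formula
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  errs <- sapply(sort(unique(folds)), function(f) {
    fit  <- lm(formula, data = data[folds != f, ])        # train on the other folds
    pred <- predict(fit, newdata = data[folds == f, ])    # predict the held-out fold
    sqrt(mean((data[folds == f, response] - pred)^2))
  })
  mean(errs)
}

rmse_one <- cv_rmse(mpg ~ wt, mtcars, folds)
rmse_two <- cv_rmse(mpg ~ wt + hp, mtcars, folds)
print(c(one_variable = rmse_one, two_variables = rmse_two))
```

Because the error is always computed on rows the model has not seen, the model with the lower average value is the one that generalises better under this criterion.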

*Q3. Develop a classification model to predict whether a neighbourhood has high (CR01=1) or low (CR01=0) per capita crime rate. Explore various subsets of predictors and discuss the performance of your model. (15 marks)*

In our dataset CR01 is a dummy variable which measures whether a neighbourhood has high crime rate (CR01=1) or low crime rate (CR01=0) per capita. We will run logistic regression using all the variables.

We eliminate all the insignificant variables, those with few or no asterisks; a greater number of stars indicates a lower p-value and higher significance. So we remove INDUS, CHAS, RM, AGE and LSTAT. We then run the logistic regression on the remaining variables (TAX, MEDV, DIS, NOX, RAD, PTRATIO, ZN) and still see a few insignificant variables with higher p-values, so we eliminate further and keep 3 variables: TAX, NOX and RAD. With these, the regression results show that all variables are significant. This model is named Cr01.leg in the code. We use it to predict the probabilities, which are stored as ‘testprob’.

To test the model, we classify the data and create a confusion matrix, which gives the rate of correct classification along with the error. For this the data is split into training and test sets of 253 observations each.

We computed the predicted classes (testpred) and the actual classes (testval) and then plotted the confusion matrix as shown below:

The error rate is measured as (confmat[1,2] + confmat[2,1])/253, the share of off-diagonal entries of the confusion matrix. The value we get is 0.1264822. The true positive rate is measured as confmat[2,2]/(confmat[1,2] + confmat[2,2]). The result is 0.8064516, which shows that our logistic model correctly identifies about 81 percent of the high-crime neighbourhoods.

We use KNN for further analysis. KNN classifies each observation by its nearest neighbours. The confusion matrix gives an error rate of 0.07509881 and a positive rate of 0.9268293, which shows that KNN predicts about 93 percent of the high-crime neighbourhoods correctly. The figures below show the ROC curve and the corresponding area under the curve.
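The error-rate and true-positive-rate formulas above can be sketched end to end on synthetic data. All names and numbers here are hypothetical: `nox` and `high_crime` are simulated, not taken from the housing data, and the 0.5 threshold follows the report's choice.

```r
set.seed(99)
n <- 400
nox <- runif(n, 0.4, 0.9)
# Simulated binary outcome whose probability rises with nox
high_crime <- rbinom(n, 1, plogis(-12 + 20 * nox))
dat <- data.frame(nox = nox, high_crime = high_crime)

# Split into equal training and test halves
test_idx <- sample(n, n / 2)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

fit  <- glm(high_crime ~ nox, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Rows are predictions, columns are actual classes; fixed levels keep the table 2 x 2
confmat <- table(
  predicted = factor(pred, levels = c(0, 1)),
  actual    = factor(test$high_crime, levels = c(0, 1))
)
print(confmat)

error_rate         <- (confmat[1, 2] + confmat[2, 1]) / sum(confmat)
true_positive_rate <- confmat[2, 2] / (confmat[1, 2] + confmat[2, 2])
print(c(error = error_rate, tpr = true_positive_rate))
```

Fixing the factor levels is a small design choice that avoids an indexing error when a threshold happens to produce predictions of only one class.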

R-CODE:

# Import dataset

getwd()
setwd("/Users/Usama/Desktop/QMUL/Semester 1/Data Analytics/Coursework")
housing = read.csv("housing.csv")

# Display the type of data
class(housing)

#Display the first few values of each variable

head(housing)

# Display the structure of the housing dataset

str(housing)

# Note that CR01, CHAS, RAD are categorical data.

# plot

pairs(housing)

# Print the summary
summary(housing)

# Attach housing table
attach(housing)

# Get the correlation of the other variables against MEDV
housing.cor = cor(housing, MEDV)
housing.cor

housing.reg = lm(MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + TAX + PTRATIO + LSTAT)
summary(housing.reg)

# DIS, NOX, RM, PTRATIO and LSTAT are significant, so use these variables for another linear model
pairs(MEDV ~ INDUS + DIS + RM + NOX + LSTAT + PTRATIO)

# Correlation matrix
housing.mcor = cor(housing)
library(corrplot)
corrplot(housing.mcor)

# Load library caret along with lattice and ggplot2
library(caret)
library(ggplot2)

# Calculate near zero variance
nzv = nearZeroVar(housing, saveMetrics = TRUE)
sum(nzv$nzv)

# Histograms for distribution
par(mfrow = c(2, 4))
hist(CRIM, breaks = 10, col = "grey")
hist(INDUS, breaks = 10, col = "grey")
hist(NOX, breaks = 10, col = "grey")
hist(LSTAT, breaks = 10, col = "grey")
hist(PTRATIO, breaks = 10, col = "grey")
hist(AGE, breaks = 10, col = "grey")
hist(ZN, breaks = 50, col = "grey")
hist(RM, breaks = 50, col = "grey")
hist(DIS, breaks = 50, col = "grey")
hist(MEDV, breaks = 50, main = "Distribution of MEDV", ylab = "Count", col = "grey")

# Histogram of MEDV with a normal curve scaled to the counts
h = hist(MEDV, breaks = 50, main = "Distribution of MEDV", ylab = "Count", col = "grey")
xfit = seq(min(MEDV), max(MEDV), length = 40)
yfit = dnorm(xfit, mean = mean(MEDV), sd = sd(MEDV))
yfit = yfit * diff(h$mids[1:2]) * length(MEDV)
lines(xfit, yfit)

# Density of MEDV (skewed)
ggplot(data = housing, aes(MEDV)) + geom_density(fill = "grey")

# Convert CR01 into a factor
CR01.F = as.factor(CR01)
table(CR01.F)
plot(MEDV ~ CR01.F)
plot(MEDV ~ CR01)

# Boxplots
par(mfrow = c(1, 4))
boxplot(CRIM, main = "CRIM")
boxplot(ZN, main = "ZN")
boxplot(INDUS, main = "INDUS")
boxplot(PTRATIO, main = "PTRATIO")
boxplot(RM, main = "RM")
boxplot(NOX, main = "NOX")
boxplot(DIS, main = "DIS")
boxplot(AGE, main = "AGE")
boxplot(TAX, main = "TAX")
boxplot(RAD, main = "RAD")
boxplot(LSTAT, main = "LSTAT")
boxplot(MEDV, main = "MEDV")

# Question Number 2

# After removing the categorical features RAD, CHAS and CR01
Linear_Model1 = lm(MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + PTRATIO + LSTAT + TAX)
summary(Linear_Model1)

# Removing the features with higher p-values, i.e. zero stars
Linear_Model2 = lm(MEDV ~ CRIM + ZN + NOX + RM + DIS + PTRATIO + LSTAT)
summary(Linear_Model2)

# Removing the features with higher p-values, i.e. 2 stars
Linear_Model3 = lm(MEDV ~ NOX + RM + DIS + PTRATIO + LSTAT)
summary(Linear_Model3)

Linear_Model4 = lm(MEDV ~ CRIM + ZN + RM + DIS + LSTAT)
summary(Linear_Model4)

# Linear model with only two variables, RM and LSTAT
Linear_Model5 = lm(MEDV ~ RM + LSTAT)
summary(Linear_Model5)

# Linear model with one variable, LSTAT
Linear_Model6 = lm(MEDV ~ LSTAT)
summary(Linear_Model6)
plot(LSTAT, MEDV)
abline(34.55384, -0.95005, col = "red")

# Fitted values of Linear_Model2 against MEDV
fits1 <- fitted(Linear_Model2)
plot(fits1, MEDV)

# Fitted values of Linear_Model3 against MEDV
fits2 <- fitted(Linear_Model3)
plot(fits2, MEDV)

# Fitted values of Linear_Model4 against MEDV
fits3 <- fitted(Linear_Model4)
plot(fits3, MEDV)

# Polynomial fit on LSTAT (degree 10 in red, degree 2 in blue)
plot(LSTAT, MEDV, main = "Polynomial fit")
Polynomial = lm(MEDV ~ poly(LSTAT, 10))
summary(Polynomial)
range(LSTAT)
xvals = seq(1, 38, length = 30)
yvals = predict(Polynomial, data.frame(LSTAT = xvals))
lines(xvals, yvals, col = "red", lwd = 2)

Polynomial = lm(MEDV ~ poly(LSTAT, 2))
yvals = predict(Polynomial, data.frame(LSTAT = xvals))
lines(xvals, yvals, col = "blue", lwd = 2)

# Polynomial fit on the fitted values of Linear_Model3
plot(fits2, MEDV, main = "Polynomial fit")
Polynomial2 = lm(MEDV ~ poly(fits2, 10))
summary(Polynomial2) # 0.7736

# Polynomial fit on the fitted values of Linear_Model4
plot(fits3, MEDV, main = "Polynomial fit")
Polynomial3 = lm(MEDV ~ poly(fits3, 10))
summary(Polynomial3)

# Automation of degrees: hold out 100 rows and compare the test MSE for degrees 1 to n
testindex = sample(1:506, 100)
MEDV.train = housing[-testindex, ]
MEDV.test = housing[testindex, ]
n = 10
msevals = numeric(n)
for (i in 1:n) {
  yreg = MEDV.train$MEDV
  xreg = MEDV.train$LSTAT
  Polynomial = lm(yreg ~ poly(xreg, i))
  msevals[i] = sum((predict(Polynomial, data.frame(xreg = MEDV.test$LSTAT)) - MEDV.test$MEDV)^2) / 100
}
plot(1:n, msevals)

# Cross validation on a single variable.
Linear_Model_Cross_Train = lm(MEDV ~ LSTAT, data = MEDV.train)
summary(Linear_Model_Cross_Train)
Linear_Model_Cross_TEST = lm(MEDV ~ LSTAT, data = MEDV.test)
summary(Linear_Model_Cross_TEST)

# Cross validation on multiple variables. Best model
Linear_Model_Cross_Train = lm(MEDV ~ NOX + RM + DIS + PTRATIO + LSTAT, data = MEDV.train)
summary(Linear_Model_Cross_Train)
Linear_Model_Cross_TEST = lm(MEDV ~ NOX + RM + DIS + PTRATIO + LSTAT, data = MEDV.test)
summary(Linear_Model_Cross_TEST)

# Cross validation on the multiple-variable model Linear_Model4.
Linear_Model_Cross_Train = lm(MEDV ~ CRIM + ZN + RM + DIS + LSTAT, data = MEDV.train)
summary(Linear_Model_Cross_Train)
Linear_Model_Cross_TEST = lm(MEDV ~ CRIM + ZN + RM + DIS + LSTAT, data = MEDV.test)
summary(Linear_Model_Cross_TEST)

# Cross validation of the polynomial model on LSTAT
Polynomial = lm(MEDV ~ poly(LSTAT, 10), data = MEDV.train)
summary(Polynomial)
PolynomialTEST = lm(MEDV ~ poly(LSTAT, 10), data = MEDV.test)
summary(PolynomialTEST)

# Question 3 course work

# Running logistic regression
CR01.leg = glm(CR01 ~ MEDV + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + LSTAT, family = binomial)
summary(CR01.leg)

# Removing the features with zero stars on the basis of p-values
# and running logistic regression again on the remaining ones.
Cr01.leg = glm(CR01 ~ MEDV + ZN + NOX + DIS + RAD + TAX + PTRATIO, family = binomial)
summary(Cr01.leg)

# Again removing the features with one star on the basis of p-values
# and running logistic regression again on the remaining ones.
Cr01.leg = glm(CR01 ~ NOX + RAD + TAX, family = binomial)
summary(Cr01.leg)

head(Cr01.leg)
testprob = predict(Cr01.leg, data.frame(NOX, RAD, TAX), type = "response")
head(testprob)

# To generate the confusion matrix on the test set
testindex = sample(1:506, 253)
CR01.train = housing[-testindex, ]
CR01.test = housing[testindex, ]
CR01.lreg = glm(CR01 ~ NOX + RAD + TAX, family = binomial, data = CR01.train)
testprob = predict(CR01.lreg, CR01.test, type = "response")
head(testprob)
head(CR01.test)
head(testindex)

# Classification: based on these probabilities we classify each observation as zero or one.
# First generate a vector of only zeros.
testpred = rep(0, 253)

# Check the values in testpred
testpred

# Then set to one every entry of testpred whose probability in testprob is greater than 0.5.
testpred[testprob > 0.5] = 1

# Check the values in testpred
testpred

# Same steps for the actual values: where CR01 is one in the test set, put one in testval.
testval = rep(0, 253)
testval
testval[CR01.test$CR01 == "1"] = 1
testval

table(testpred, testval)
confmat = table(testpred, testval)
plot(confmat)
(confmat[1, 2] + confmat[2, 1]) / 253 # 0.1699605
confmat[2, 2] / (confmat[1, 2] + confmat[2, 2]) # 0.8253968

# KNN classifier
library(class)

# We want the training and test sets together, and the labels (Y categories) in a separate vector.
head(CR01.train)

# Columns 2 and 5 are removed from the predictors because they do not contain any distance
CR01.train.X = CR01.train[, -c(2, 5)]
CR01.train.Y = CR01.train[, 2]
CR01.test.X = CR01.test[, -c(2, 5)]
CR01.test.Y = CR01.test[, 2]

CR01.knn = knn(CR01.train.X, CR01.test.X, CR01.train.Y, 1)
head(CR01.knn)
table(CR01.knn, CR01.test.Y)
confmat = table(CR01.knn, CR01.test.Y)
plot(confmat)

# To find out the error rate
(confmat[1, 2] + confmat[2, 1]) / 253 # 0.06719368
# Share of predicted positives that are actual positives (second row of the table)
confmat[2, 2] / (confmat[2, 1] + confmat[2, 2])
summary(CR01.knn)

# For standardization the data is rescaled to mean zero and variance 1
CR01.train.standX = scale(CR01.train.X)
CR01.test.standX = scale(CR01.test.X)

# To find the optimal K value we have to use cross validation.
n = 10
terror = rep(0, n)
for (i in 1:n) {
  CR01.knn = knn(CR01.train.standX, CR01.test.standX, CR01.train.Y, i)
  confmat = table(CR01.knn, CR01.test.Y)
  terror[i] = (confmat[1, 2] + confmat[2, 1]) / 253
}
kval = seq(1, n)
plot(kval, terror)

# Here is the code to plot the ROC curve
n = 10 # specifies the number of threshold values
truepos = rep(1, n + 1) # initialize values for the true positive and false positive rates
falsepos = rep(1, n + 1)
for (i in 1:n) {
  s = i / (n + 1) # sets the value of the threshold s, which varies between 0 and 1 as i goes through 1 to n.
  # Using n+1 instead of n prevents the exact value s=1. In this case the confusion matrix has
  # only one row and the code below does not work. However, for larger n the same will happen due to
  # the limited size of the dataset.
  testpred = rep(0, 253)
  testpred[testprob > s] = 1
  # for every value s the confusion matrix is calculated and the false and true positive values recorded
  confmat = table(testpred, testval)
  truepos[i] = confmat[2, 2] / (confmat[1, 2] + confmat[2, 2])
  # false positive rate: false positives over all actual negatives
  falsepos[i] = confmat[2, 1] / (confmat[2, 1] + confmat[1, 1])
  # terror[i] = (confmat[1, 2] + confmat[2, 1]) / 253
}
plot(falsepos, truepos)

# Build training and validation samples from the high- and low-crime neighbourhoods
Crimerate <- housing[housing$CR01 == 1, ]
head(Crimerate)
nrow(Crimerate)
Nocrimerate <- housing[housing$CR01 == 0, ]
nrow(Nocrimerate)
crimetrain.yes <- Crimerate[sample(nrow(Crimerate), 126), ]
head(crimetrain.yes)
crimetrain.No <- Nocrimerate[sample(nrow(Nocrimerate), 126), ]
crimetran <- rbind(crimetrain.yes, crimetrain.No)
head(crimetran)
nrow(crimetran)
crimeval.yes <- Crimerate[sample(nrow(Crimerate), 126), ]
crimeval.No <- Nocrimerate[sample(nrow(Nocrimerate), 150), ]
crimevali <- rbind(crimeval.yes, crimeval.No)
head(crimevali)
nrow(crimevali)

Cr01.leg = glm(CR01 ~ NOX + RAD + TAX, data = crimetran, family = binomial(link = "logit"))
summary(Cr01.leg)
library("rpart")
# install.packages("ROSE")
library("ROSE")
log.predict <- predict(Cr01.leg, newdata = crimevali)
roc.curve(crimevali$CR01, log.predict)
