Identifying Key Locations for Pam and Susan Stores: A Multiple Regression Analysis

Introduction

This report details the analysis used to identify key locations for Pam and Susan stores. Using multiple regression on the available data, we identified the site expected to give our firm the greater sales. I present our findings with supporting data, along with ways in which multiple regression models can improve our estimates of sales potential.

Data

The results (see Results and Discussion) are based on historical data for 250 stores. We examined a considerable number of variables, including the following:

• Store size (both gross sq. ft. and selling sq. ft.)

• Competitive Groups (1-7)

• Population (percentage Black and percentage Spanish-speaking)

• Family income (in $1,000s): 0-10, 10-14, 14-20, 20-30, 30-50, 50-100, 100+

• Median yearly income

• Median rent per month

• Median home value

• Percentage of population that are homeowners

• Percentage of population with no cars

• Percentage of population with 1 car

• Percentage of population with a TV

• Percentage of population with a washer

• Percentage of population with a dryer

• Percentage of population with a dishwasher

• Percentage of population with air conditioners

• Percentage of population with a freezer

• Percentage of population with a second home

• Education (in years): 0-8, 9-11, 12, and more than 12

• Total Population

• Average Family Size

All monetary values are in $1,000s, and every variable was recorded for all 250 stores. We also examined sales at each location.

Results and Discussion

The ultimate purpose of this project is to determine which of two sites (Site A or Site B) we should use for a new Pam and Susan store. Because the “competitive type” scoring method is subjective, we first wanted to determine which of the seven competitive types (comtype) are statistically significant predictors of sales. Plotting sales against competitive type showed that types 1, 2, and 7 stand apart from the rest: types 1 and 2 had the highest sales, while type 7 had the lowest. This is detailed in the scatterplot below:

As the scatterplot above shows, comtype1, comtype2, and comtype7 stand out, while types 3, 4, 5, and 6 have broadly similar sales. As noted above, the purpose of this project is to find the variables that best separate high-sales stores from low-sales stores. Using these three competitive types therefore lets us test whether each one is associated with higher or lower sales.

The next step was to determine which of the 20 variables were most strongly correlated with sales. After examining the correlations, we cut the list to ten. The selected variables and their correlations with sales are listed below:

Variable      Correlation
%owners       -0.689846284
%dryers       -0.657330236
%freezer      -0.639445706
%washers      -0.56225519
%sch0-8       0.48621676
%spanishsp    0.54742695
population    0.599956754
%inc10-14     0.614049545
%inc0-10      0.61505374
%nocars       0.700939351
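The screening step described above can be sketched in Python. This is only an illustration, not the workbook used for the analysis: the helper names `pearson` and `top_by_abs_correlation` and the toy numbers are assumptions made for the example, and ranking by absolute correlation is one reasonable way to pick the strongest positive and negative relationships.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def top_by_abs_correlation(columns, sales, k=10):
    """Rank candidate variables by |correlation with sales| and keep the top k."""
    scored = [(name, pearson(values, sales)) for name, values in columns.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


# Toy data for five hypothetical stores (not the Pam and Susan data set).
sales = [10.0, 12.0, 15.0, 19.0, 24.0]
columns = {
    "%nocars": [5.0, 7.0, 9.0, 12.0, 16.0],     # rises with sales
    "%owners": [80.0, 74.0, 70.0, 61.0, 55.0],  # falls as sales rise
    "%tv": [95.0, 96.0, 94.0, 95.0, 96.0],      # only weakly related to sales
}

for name, r in top_by_abs_correlation(columns, sales, k=2):
    print(f"{name}: {r:+.3f}")
```

Sorting on the absolute value keeps strongly negative variables such as %owners alongside strongly positive ones such as %nocars.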

The table above lists the ten variables most strongly correlated with overall sales across all 250 stores: the four with the strongest negative correlations and the six with the strongest positive correlations. With this list in hand, I moved on to fitting multiple regression models. It took seven successive models to arrive at the variables that best explain our sales estimates. A variable was removed if its p-value was greater than 0.05. The eliminated variables were as follows:

• %inc10-14

• %inc0-10

• %owners

• %nocars

• %washers

• population
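The elimination procedure can be sketched as a simple loop. This is a minimal sketch under two assumptions: that one variable was dropped per round (the least significant one), and that the model was refitted after each removal. The `fit_pvalues` callable and the toy p-values are hypothetical stand-ins for whatever regression tool is actually used.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Drop the least significant variable and refit until all p-values <= alpha.

    variables   -- list of candidate predictor names
    fit_pvalues -- callable that fits a regression on the given names and
                   returns {name: p_value}; supplied by the caller
    Returns (kept variables, variables removed in order).
    """
    kept = list(variables)
    removed = []
    while kept:
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


# Stand-in for a real regression fit: fixed, made-up p-values that do not
# change between refits. A real fit would recompute them each round.
TOY_PVALUES = {"x1": 0.001, "x2": 0.40, "x3": 0.03, "x4": 0.12}


def toy_fit(names):
    return {name: TOY_PVALUES[name] for name in names}


kept, removed = backward_eliminate(["x1", "x2", "x3", "x4"], toy_fit)
print(kept)     # ['x1', 'x3']
print(removed)  # ['x2', 'x4']
```

Removing one variable at a time matters because p-values can shift after each refit, so a variable that looks insignificant in one model may become significant in the next.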

I therefore determined that the most statistically significant variables were:

• comtype 1, 2, and 7

• %freezer

• %dryers

• %spanishsp

• %sch0-8

Below is the detailed information on their specific coefficients:

Variable      Coefficients    Standard Error   t Stat          P-value
Intercept     16020.78118     1157.292551      13.84332869     1.71384E-32
comtype1      9393.822289     862.7770647      10.8878906      9.93008E-23
comtype2      3802.264422     518.7601526      7.329522907     3.4437E-12
comtype7      -3123.244615    562.5492303      -5.551948962    7.41169E-08
%freezer      -112.4801661    39.36157831      -2.857613209    0.004639989
%dryers       -44.16538363    16.10174208      -2.742894738    0.00654549
%spanishsp    149.1517501     55.74166215      2.675767896     0.007964955
%sch0-8       -79.84654916    30.30415684      -2.634838171    0.008961101
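As a quick consistency check on the table above, each t statistic should equal its coefficient divided by its standard error. The short Python sketch below recomputes them from the tabulated values; the `rows` dictionary and the `t_stat` helper are names introduced only for this illustration.

```python
# (coefficient, standard error) pairs copied from the table above.
rows = {
    "Intercept": (16020.78118, 1157.292551),
    "comtype1": (9393.822289, 862.7770647),
    "comtype2": (3802.264422, 518.7601526),
    "comtype7": (-3123.244615, 562.5492303),
    "%freezer": (-112.4801661, 39.36157831),
    "%dryers": (-44.16538363, 16.10174208),
    "%spanishsp": (149.1517501, 55.74166215),
    "%sch0-8": (-79.84654916, 30.30415684),
}


def t_stat(coefficient, standard_error):
    """t statistic for testing whether a coefficient differs from zero."""
    return coefficient / standard_error


for name, (coef, se) in rows.items():
    print(f"{name:<11} t = {t_stat(coef, se):8.3f}")
```

Every recomputed |t| is above roughly 1.97, the usual two-sided 5% cutoff for a sample of this size, which agrees with all p-values in the table being below 0.05.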

From the final model, I obtained the regression equation for estimated sales (in $1,000s). It is as follows:

Estimated Sales = 16020.78118 - (112.48*%freezer) - (44.1654*%dryers) + (149.152*%spanishsp) - (79.85*%sch0-8)

*Note that when a site falls into one of the modeled competitive types (as Site A does), the corresponding comtype coefficient is added to the equation, multiplied by 1. When it does not (as with Site B), the comtype terms are multiplied by 0 and drop out. The computed estimated sales for each site are listed below; I inserted each site’s values (Table B) into the regression equation. Most importantly, Site A has considerably higher estimated sales than Site B, which makes Site A the more viable of the two options. Site A’s estimate is raised by its competitive type, because the comtype1 coefficient enters the equation:

Estimated Sales of Site A = 16020.78118 - (112.48*6.1%) - (44.1654*9%) + (149.152*10.8%) - (79.85*37.4%) + (9393.822289*1)

For Site B, the comtype terms were multiplied by 0 because the site is classified as comtype5, which is not in the model.

Variable                        Site A         Site B
comtype1
comtype2
comtype7
%freezer                        6.1
%dryers                                        12.2
%spanishsp                      10.80%         6.60%
%sch0-8                         37.40%         40.10%
Estimated Sales (in $1000s)     26544.19193    17163.86217
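The way a competitive type enters the regression equation above, multiplied by 1 when it applies and by 0 when it does not, can be sketched as a small Python function. The coefficients are taken from the regression output; the function name `estimated_sales`, the `DUMMY_TYPES` tuple, and the all-zero demographics are assumptions used only for illustration.

```python
# Coefficients from the regression output (sales in $1,000s).
COEFFICIENTS = {
    "intercept": 16020.78118,
    "comtype1": 9393.822289,
    "comtype2": 3802.264422,
    "comtype7": -3123.244615,
    "%freezer": -112.4801661,
    "%dryers": -44.16538363,
    "%spanishsp": 149.1517501,
    "%sch0-8": -79.84654916,
}
DUMMY_TYPES = (1, 2, 7)  # competitive types that have their own coefficient


def estimated_sales(comtype, demographics):
    """Estimated sales for one site.

    comtype      -- competitive type, 1 through 7
    demographics -- dict with keys %freezer, %dryers, %spanishsp, %sch0-8,
                    on the same scale as the data the model was fitted on
    """
    total = COEFFICIENTS["intercept"]
    for t in DUMMY_TYPES:
        indicator = 1 if comtype == t else 0  # the "multiply by 1 or 0" step
        total += COEFFICIENTS[f"comtype{t}"] * indicator
    for name, value in demographics.items():
        total += COEFFICIENTS[name] * value
    return total


# Hypothetical demographics, used only to isolate the effect of the indicator.
demo = {"%freezer": 0.0, "%dryers": 0.0, "%spanishsp": 0.0, "%sch0-8": 0.0}
print(estimated_sales(1, demo) - estimated_sales(5, demo))  # the comtype1 coefficient
```

With identical demographics, a comtype1 site is estimated to sell about 9,394 (in $1,000s) more than a site of type 3, 4, 5, or 6. Whether a percentage is entered as 6.1 or 0.061 changes the result substantially, so inputs should use the same scale as the data used to fit the model.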

To confirm that the regression assumptions hold and that the results can be used, I created a scatterplot and a histogram of the residuals. As demonstrated below, the scatterplot shows little to no discernible pattern, particularly at the lower end of the x-axis. The histogram also follows a clear normal curve. I do not see any severe skewness, so I conclude that the technical assumptions are sound and the results can be used. Both plots were created from the residual output of the regression.
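The visual checks described above can be complemented with simple numbers. The sketch below computes the sample skewness of a set of residuals and the bin counts behind a histogram. The residual values are hypothetical and the helper names `skewness` and `histogram_counts` are introduced for this example; this is not the residual output of the actual regression.

```python
def skewness(values):
    """Moment-based sample skewness; near 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def histogram_counts(values, bins=5):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # put the max in the last bin
        counts[index] += 1
    return counts


# Hypothetical residuals (actual minus predicted sales), not the report's data.
residuals = [-900, -400, -350, -100, -50, 0, 50, 100, 350, 400, 900]

print(round(skewness(residuals), 3))
print(histogram_counts(residuals))
```

A skewness close to zero and bin counts that peak in the middle and taper symmetrically are consistent with roughly normal residuals; a long tail on one side would show up as a clearly positive or negative skewness.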

Question 1

As a possible alternative to the competitive-type classification, we can forecast sales to some extent using the demographic variables. Store size and percentage of hard goods do not play a critical role in estimating sales. As we saw, the key demographics that help forecast sales are the percentages of the population who own freezers, own dryers, are Spanish-speaking, and have 0-8 years of schooling. However, the model reveals that site location is very important. In other words, the “competitive types” (which carry information about a site’s location) are very important to achieving higher sales.

Question 2

The competitive type is a decent classification method for predicting sales. I would assert that store size and percentage of hard goods are not good predictors of sales. According to the data above, a competitive type should be taken into account only when it is significantly associated with higher sales (comtype1, comtype2) or lower sales (comtype7). To simplify the categories, I would use only these three. As we saw with Site A and Site B, Site B had smaller estimated sales because its type, comtype5, is not statistically significant and so adds nothing to the estimate. In other words, use a competitive type only when it has a statistically significant positive or negative association with sales.

Question 3

As calculated above, I would recommend Site A, because the multiple regression equation estimates its sales to be considerably higher than Site B’s. I would suggest this forecasting approach because it provides a closer estimate of potential sales than the subjective competitive types alone. In other words, I would forecast with multiple regression rather than rely on the competitive-type classification by itself.

Question 4

To determine how strongly these variables relate to sales, I examined their correlations: square footage of the selling area had a correlation of 0.35 with sales, and percentage of hard goods had a correlation of 0.016. This leads me to conclude that the size of the store has some relationship with sales (weaker than that of the variables used above) and that the relationship for percentage of hard goods is negligible. In other words, a line fitted to sales against percentage of hard goods has almost zero slope. Given this very weak correlation, the margins of hard goods versus soft goods would have little to no impact on sales.

Question 5 – Technical

As noted above, the scatterplot of residuals shows no discernible pattern. The vast majority of the points fall around the 10,000 mark on the x-axis. There are a few outliers, but not enough to distort the fitted relationships for the independent variables. The histogram above also follows a normal curve, which is consistent with the scatterplot (most residuals cluster in the same area). This tells us that the technical assumptions are satisfied.

Conclusion

Ultimately, I believe it is important that the company use fewer competitive types when forecasting sales. The only types that should be used are those with a statistically significant positive or negative association with sales. Using the right competitive types together with the right historical demographic data, we can obtain a reasonably clear estimate of sales for a particular location. One shortcoming of the analysis lies in the data we used: with more historical demographic data, we would have a better opportunity to identify which variables have a statistically significant impact on sales.
