Statistical techniques associated with survey sampling & how to interpret outputs

Survey sampling describes the process of selecting a sample of elements from a target population in order to conduct a survey. In general, the objective of a sample survey is to make inferences about a population from information contained in a sample selected from that population [see 1, Mendenhall, Ott and Scheaffer]. The most widely used tool for this purpose is a questionnaire that measures the characteristics and/or attitudes of people. The inference often takes the form of estimating a population mean (e.g. mean income per person) or proportion (e.g. the proportion of voters favouring Brexit). However, information costs money, so the experimenter must determine how much information he or she needs. Too little information prevents the experimenter from making good estimates, whereas too much results in a waste of money. The purpose of sampling is therefore to reduce the cost and/or the amount of data collection that surveying the entire target population would require [see 2, “Survey Sampling”, para. 1].
According to Graham Kalton [see 3], sample surveys are nowadays widely accepted as a means of providing statistical data on an extensive range of subjects for both research and administrative purposes. Businesses and researchers conduct surveys for several reasons. First, the main purpose is to uncover answers about the subject under investigation by gathering meaningful opinions, comments, and feedback. Secondly, a survey evokes discussion: it gives respondents an opportunity to discuss key topics and helps the experimenter to dig deeper and raise related topics from a broader perspective. Finally, surveys allow business decisions to be based on unbiased information and results to be compared, so that well-founded conclusions can be drawn about the target population.
One of the most common real-life examples is that governments make considerable use of surveys to inform themselves about the conditions of their populations in terms of employment and unemployment, income and expenditure, housing conditions, education, nutrition, health, travel patterns, and many other subjects. They also conduct surveys of organisations such as manufacturers, retail outlets, farms, schools, and hospitals. Local governments equally make use of surveys for local planning purposes. Surveys are also used in many other fields such as sociology, political science, education and public health [see 3, Kalton 1983].
This report describes the statistical techniques associated with survey sampling, the typical outputs generated by each of them, and how to interpret those outputs. It also addresses some real-life examples and gives a critical evaluation of the topic.

How to select a good sample size

There are several ways to choose a sample size. First of all, it should be borne in mind that the experience of the survey conductor plays a major role in determining a sample size when items are readily available or convenient to collect. However, there are more scientific ways to determine a sample size. For instance, the experimenter could set a target for the power of the statistical test to be applied once the sample is collected, or he or she could choose a confidence level and a desired interval width, which together determine how precise the result will be.
A good sample size should be based on these scientific approaches. One of them uses the standard error of the sample mean, given by the formula:
σ/√n, where n is the sample size and σ² is the corresponding variance
In addition, an approximate 95% confidence interval takes the form:
(x̅ − 2σ/√n, x̅ + 2σ/√n), where x̅ is the sample mean, which is approximately Normally distributed by the Central Limit Theorem. Therefore, if we wish to have a confidence interval that is W units in width, we set W = 4σ/√n and solve for n, which gives n = 16σ²/W². This means that the narrower the interval we require, the larger the sample must be. For example, if we are interested in estimating the amount by which a drug lowers a subject’s blood pressure with a confidence interval that is ten units wide, and we know that the standard deviation of blood pressure in the population is 20, then the required sample size is 64 [see 4, “Sample Size Determination”, para. 7].
A special case of estimating a mean is the estimation of a proportion. The estimator of a proportion is p̂ = X/n, where X is the number of ‘positive’ observations. The corresponding confidence interval is (p̂ − 2√(0.25/n), p̂ + 2√(0.25/n)). Consequently, if we wish to have a confidence interval that is W units in width, we should calculate n = 4/W² = 1/B², where B = W/2 is the error bound on the estimate. For instance, for B = 10% the required sample size is 100. This special case is often used for opinion polls [see 4, “Sample Size Determination”, para. 6].
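The width formula above can be turned into a small calculation. The following Python sketch is an illustration only: the function name `sample_size_for_mean` is an assumption of this example and does not come from the cited source. It rounds up, since a sample size must be a whole number.

```python
import math


def sample_size_for_mean(sigma, width):
    """Smallest n for which the approximate 95% interval
    (mean - 2*sigma/sqrt(n), mean + 2*sigma/sqrt(n)) is at most `width` wide."""
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # Interval width W = 4*sigma/sqrt(n)  =>  n = 16*sigma^2 / W^2
    return math.ceil(16 * sigma ** 2 / width ** 2)


# Blood-pressure example from the text: sigma = 20, W = 10
print(sample_size_for_mean(20, 10))  # 64
```

Halving the required width to 5 units quadruples the sample size, which is the practical meaning of the inverse-square relationship.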
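The step from the interval to the sample size can be written out as follows. This is a sketch assuming the usual conservative bound p(1 − p) ≤ 1/4 on the variance of a single yes/no observation, which is where the 0.25 inside the square root comes from.

```latex
\begin{aligned}
W &= 2 \cdot 2\sqrt{\frac{0.25}{n}} = \frac{2}{\sqrt{n}}
  \quad\Longrightarrow\quad n = \frac{4}{W^{2}}, \\
B &= \frac{W}{2} = \frac{1}{\sqrt{n}}
  \quad\Longrightarrow\quad n = \frac{1}{B^{2}}, \\
B &= 0.10
  \quad\Longrightarrow\quad n = \frac{1}{0.10^{2}} = 100 .
\end{aligned}
```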

Source of errors

There are many potential types of error in survey sampling. According to Groves (1989) [see 1], survey errors can be divided into two major groups: errors of nonobservation, where the sampled elements make up only part of the target population, and errors of observation, where the recorded data deviate from the truth. Errors of nonobservation can be ascribed to sampling, coverage or nonresponse; nonresponse is analysed in a later part of this report. Errors of observation, on the other hand, can be attributed to the interviewer, the respondent or the method of data collection. Both sources of error can seriously affect the accuracy of a survey. These errors cannot be eliminated from a survey, but their effects can be reduced by careful adherence to a sound sampling plan. Some ways to reduce them are: callbacks (where the interviewer contacts the nonrespondents again), rewards and incentives to encourage responses, better training of interviewers, checking the questionnaires to be sure that they have been filled in correctly, and careful questionnaire construction.

Types of probability samples

3.1 Simple Random Sampling

Simple random sampling provides a natural starting point for a discussion of probability sampling methods, not because it is widely used, but because it is the simplest method and it underlies many of the more complex methods (Kalton 1983) [see 3]. By definition, a simple random sample is a subset of individuals chosen from a population, with each individual chosen randomly and entirely by chance. As a result, every individual has the same probability of being chosen at any stage of the sampling process, and every subset of a given size has the same probability of being selected. For example, suppose N elderly people want to get a ticket for a concert, but there are only X<N tickets for them, so they decide to find a fair way to decide who gets to go. Every elderly person is given a number in the range between 0 and N-1, and random numbers are generated, either electronically or from a table of random numbers. The first X distinct numbers drawn identify the lucky ticket winners. Simple random sampling is commonly carried out without replacement in both small and large populations. For a small sample drawn from a large population, sampling with replacement gives approximately the same results, because the probability of drawing the same person twice is very small. The advantages of this method are that it is free of classification error, it requires minimal advance knowledge of the population other than the frame, and it allows one to draw externally valid conclusions about the entire population. Nevertheless, the survey conductor should be careful to make an unbiased random selection of individuals so that, if a large number of samples were drawn, the average sample would accurately represent the population.
Generally, this method is appropriate because its simplicity makes data collected in this manner relatively easy to interpret. It best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items, or where the cost of sampling is small enough to make efficiency less important than simplicity. If these conditions do not hold, other methods may be a better choice [see 5, “Simple Random Sample”, para. 6].
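The concert-ticket example can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `draw_ticket_winners` and the `seed` parameter are invented for this example, and people are identified by the numbers 0 to N-1 as in the text. The standard-library call `random.Random.sample` draws without replacement, so no person can win twice.

```python
import random


def draw_ticket_winners(n_people, n_tickets, seed=None):
    """Simple random sample without replacement from the labels 0..n_people-1."""
    if not 0 <= n_tickets <= n_people:
        raise ValueError("n_tickets must be between 0 and n_people")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(range(n_people), n_tickets)


# 50 people, 5 tickets: every subset of 5 people is equally likely
winners = draw_ticket_winners(n_people=50, n_tickets=5, seed=1)
print(sorted(winners))
```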

3.2 Systematic Sampling

Like simple random sampling, systematic sampling gives each element in the population the same chance of being selected for the sample. It differs, however, from simple random sampling in that the probabilities of different sets of elements being included in the sample are not all equal (Kalton 1983) [see 3]. For this method, sampling starts by selecting an element from the list at random, and then every kth element in the frame is selected, where k is the sampling interval. It is calculated as k = N/n, where n is the sample size and N is the population size [see 6, “Systematic Sampling”, para. 1]. For example, assume that a teacher wants to sample 200 students from a school with 2000 students. The sampling interval is 2000/200 = 10, so every 10th student is chosen after a random starting point between 1 and 10. If the random starting point is 3, then the students selected are 3, 13, 23, 33, 43, 53, …, 1993. As an aside, if the list has a periodic pattern that coincides with the interval (for example, if every 10th student on the list is a foreigner), the sample could consist entirely of such students, or contain none of them, and would no longer be representative. There are also situations where the sampling interval is not an integer (e.g. 2150/200 = 10.75). In these situations, the random starting point should be selected as a noninteger between 0 and 10.75 to ensure that every student has an equal chance of being selected. Furthermore, each noninteger obtained should be rounded down to the preceding integer. For instance, if the random starting point is 3.6, then adding 10.75 repeatedly to 3.6 gives 14.35, 25.1, 35.85 and so on. The selections are therefore the third, fourteenth, twenty-fifth, thirty-fifth, etc., students. The interval between selected students is sometimes 10 and sometimes 11. In general, systematic sampling is appropriate only if the given population is logically homogeneous, because the units of a systematic sample are spread uniformly over the population. It is inappropriate when the sampling interval coincides with a hidden pattern in the list, which may compromise randomness.
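The selection rule above, including the fractional-interval case, can be sketched in Python. The function names are assumptions of this example. Note one further assumption: `systematic_sample` treats positions as 0-based list indices, so a start in [0, k) rounded down always lands on a valid index; the text's own examples use 1-based student numbers.

```python
import math
import random


def systematic_positions(start, interval, n):
    """Positions start, start + k, start + 2k, ..., each rounded down."""
    return [math.floor(start + j * interval) for j in range(n)]


def systematic_sample(items, n, rng=None):
    """Systematic sample of n items using 0-based positions (sketch assumption)."""
    rng = rng or random.Random()
    interval = len(items) / n        # sampling interval k = N / n
    start = rng.random() * interval  # random start in [0, k)
    last = len(items) - 1            # guard against floating-point edge cases
    return [items[min(p, last)] for p in systematic_positions(start, interval, n)]


# Fractional example from the text: start 3.6, interval 10.75
print(systematic_positions(3.6, 10.75, 4))  # [3, 14, 25, 35]

# Integer example from the text: start 3, interval 10, 200 students
print(systematic_positions(3, 10, 200)[-1])  # 1993
```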

3.3 Stratified Sampling

A stratified random sample is one obtained by separating the population elements into nonoverlapping groups, called strata, and then selecting a simple random sample from each stratum (Scheaffer et al., 2012) [see 1]. More generally, stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive, and every element in the population must be assigned to exactly one stratum [see 7, “Stratified Sampling”, para. 1]. Two conditions need to be fulfilled for the choice of strata: first, the population proportions in the strata need to be known, and second, it has to be possible to draw separate samples from each stratum (Kalton 1983) [see 3]. Stratified sampling can be divided into two strategies: Proportionate Stratification and Disproportionate Stratification (optimum allocation). Proportionate Stratification uses the same sampling fraction in every stratum, so that each stratum’s share of the sample equals its share of the total population. For instance, if the population consists of X total individuals, m of which are male and f female (where m + f = X), then the relative sizes of the two samples (a fraction x1 = m/X of males and x2 = f/X of females) should reflect this proportion [see 7, “Stratified Sampling”, para. 3]. More generally, whenever a population can be divided into categories by a categorical variable (e.g. grades of students, hours in a day, days in a week), the sample sizes for the categories should reflect their proportions in the population. Disproportionate Stratification is used to achieve an allocation that maximises the precision of the estimator of the population mean within the available resources. The optimum allocation for this purpose is to make the sampling fraction in a stratum proportional to the element standard deviation in that stratum and inversely proportional to the square root of the cost of including an element from that stratum in the sample (Kalton 1983) [see 3].
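The two allocation strategies can be compared with a short Python sketch. The function names and the example figures are assumptions made for illustration. Because the sampling fraction n_i/N_i is proportional to σ_i/√c_i under the optimum rule, the stratum sample size n_i is proportional to N_i·σ_i/√c_i. The functions return unrounded allocations; in practice these would be rounded to whole numbers.

```python
import math


def proportional_allocation(stratum_sizes, n):
    """Allocate n so that each stratum's share of the sample equals N_i / N."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


def optimum_allocation(stratum_sizes, std_devs, unit_costs, n):
    """Allocate n with n_i proportional to N_i * sigma_i / sqrt(c_i)."""
    weights = [
        size * sd / math.sqrt(cost)
        for size, sd, cost in zip(stratum_sizes, std_devs, unit_costs)
    ]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical population: 600 males and 400 females, total sample of 100
print(proportional_allocation([600, 400], 100))  # [60.0, 40.0]

# Two equally sized strata with equal costs; the second is twice as variable
print(optimum_allocation([500, 500], [10, 20], [1, 1], 90))  # [30.0, 60.0]
```

When all strata have the same standard deviation and the same unit cost, the optimum rule reduces to proportionate allocation.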
The estimators for this type of sampling are given by:

Estimator of the population mean μ: $\bar{y}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \bar{y}_i$
Estimated variance of $\bar{y}_{st}$: $\hat{V}(\bar{y}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \left(1 - \frac{n_i}{N_i}\right) \frac{s_i^2}{n_i}$
where
$\bar{y}_{st}$ = sample mean for the stratified random sample
$N$ = size of the entire population, equal to the sum of all stratum sizes
$N_i$ = size of stratum i
$\bar{y}_i$ = sample mean of stratum i
$n_i$ = number of observations sampled from stratum i
$L$ = number of strata
$s_i^2$ = sample variance of stratum i
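The estimator of the mean and its estimated variance can be computed directly from the stratum sizes and the sampled values. The following Python sketch uses only the standard library; the function names and the data are hypothetical and serve only to show how the terms of the formulas map onto a calculation. Here `statistics.variance` is the sample variance (divisor n_i − 1), which is what s_i² denotes in the variance formula.

```python
from statistics import mean, variance


def stratified_mean(stratum_sizes, samples):
    """y_st = (1/N) * sum of N_i * ybar_i over the strata."""
    N = sum(stratum_sizes)
    return sum(N_i * mean(y) for N_i, y in zip(stratum_sizes, samples)) / N


def stratified_variance(stratum_sizes, samples):
    """V(y_st) = (1/N^2) * sum of N_i^2 * (1 - n_i/N_i) * s_i^2 / n_i."""
    N = sum(stratum_sizes)
    total = 0.0
    for N_i, y in zip(stratum_sizes, samples):
        n_i = len(y)
        # (1 - n_i/N_i) is the finite population correction for stratum i
        total += N_i ** 2 * (1 - n_i / N_i) * variance(y) / n_i
    return total / N ** 2


# Hypothetical data: two strata of sizes 100 and 300
sizes = [100, 300]
data = [[10, 12, 14], [20, 22, 24, 26]]
print(stratified_mean(sizes, data))      # 20.25
print(stratified_variance(sizes, data))  # approximately 1.0058
```

The square root of the estimated variance is the standard error of the stratified mean, and roughly two standard errors on either side of the estimate give an approximate 95% bound on the error of estimation.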

A real-world example of stratified sampling is a political survey. If the respondents need to reflect the diversity of the population, the researcher would specifically seek to include participants from various groups, for example groups defined by race or religion, in proportion to their share of the total population. In summary, according to Scheaffer (2012) [see 1] it is appropriate to use stratified random sampling rather than simple random sampling because: 1) Stratification may produce a smaller bound on the error of estimation than would be produced by a simple random sample of the same size. This result is particularly true if measurements within strata are homogeneous. 2) The cost per observation in the survey may be reduced by stratification of the population elements into convenient groups. 3) Estimates of population parameters may be desired for subgroups of the population. These groups then are identifiable strata. However, this type of sampling is inappropriate when the population cannot be exhaustively partitioned into disjoint subgroups. It would also be a misapplication of the technique to make the subgroups’ sample sizes proportional to the amount of data available from the subgroups, rather than scaling sample sizes to subgroup sizes.

3.4 Cluster Sampling

A cluster sample is a probability sample in which each sampling unit is a collection or cluster of elements (Scheaffer et al., 2012) [see 1]. Ideally, each cluster should be a small-scale representation of the total population. The clusters should be mutually exclusive and collectively exhaustive. For example, suppose we wish to estimate the average income per household in a large city. A simple random sample of households can be very costly or even impossible to obtain, because we would need a frame listing all households (elements) in the city. So, rather than drawing a simple random sample of elements, we could divide the city into regions such as blocks (clusters of elements) and select a simple random sample of blocks from the population. Then, the income of every household within each sampled block could be measured. This is referred to as a “one-stage” cluster design. If, instead, a simple random subsample of elements is selected within each sampled block, the design is referred to as “two-stage” or multistage sampling [see 8, “Cluster Sampling”, para. 1]. Two-stage cluster sampling aims at minimising survey costs while at the same time controlling the uncertainty of the estimates of interest. It is often quicker and cheaper to carry out than other methods, which is why it is now used frequently. Cluster sampling is appropriate because it can be cheaper than other methods, since data are collected only within the selected clusters, and it can handle large populations, since the population is divided into clusters and only a sample of them is surveyed. However, the disadvantage of this method is a higher sampling error: if a chosen cluster holds a biased opinion, then the entire population is inferred to hold the same opinion, which may not be the case.
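The difference between the one-stage and two-stage designs can be sketched in Python. Everything here is illustrative: the function names, the block labels and the household incomes are made up for this example, and clusters are represented as a dictionary mapping each block to the list of its household incomes.

```python
import random


def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Select blocks by simple random sampling, then keep every household in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {block: list(clusters[block]) for block in chosen}


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Select blocks at random, then a simple random subsample within each block."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        block: rng.sample(clusters[block], min(n_per_cluster, len(clusters[block])))
        for block in chosen
    }


# Hypothetical city with four blocks of household incomes
city = {
    "block A": [21000, 25000, 30000],
    "block B": [40000, 42000],
    "block C": [18000, 22000, 26000, 31000],
    "block D": [55000],
}
rng = random.Random(0)
print(one_stage_cluster_sample(city, 2, rng))
print(two_stage_cluster_sample(city, 2, 2, rng))
```

Only a list of blocks is needed to draw the first stage, which is why a frame listing every household in the city is not required.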

Nonresponse

The cause of concern about nonresponse is the risk that nonrespondents will differ from respondents with regard to the survey variables, in which case the survey estimates based on the respondents alone will be biased estimates of the overall population parameters (Kalton 1983) [see 3]. For example, if one selects a sample of 1000 managers in a field and polls them about their workload, the managers with a high workload may not answer the survey because they do not have enough time to answer it, and/or those with a low workload may decline to respond for fear that their supervisors or colleagues will perceive them as unnecessary. Therefore, non-response bias may make the measured value for the workload too low, too high, or, if the effects of the above biases happen to offset each other, “right for the wrong reasons” [see 9, “Non-Response Bias”, para. 1].
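The size of this bias can be made explicit with a standard decomposition. The notation is introduced here for illustration and is an assumption of this sketch: $\bar{Y}$ is the population mean, $\bar{Y}_r$ and $\bar{Y}_m$ are the means among those who would respond and those who would not, and $W_r$ and $W_m = 1 - W_r$ are their shares of the population.

```latex
\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m
\quad\Longrightarrow\quad
\bar{Y}_r - \bar{Y} = W_m \left( \bar{Y}_r - \bar{Y}_m \right)
```

The bias of a respondent-only mean therefore grows with both the share of nonrespondents and the difference between the two groups. It vanishes if either is zero or, as in the workload example, if opposing differences happen to cancel.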

5.1 Causes of nonresponse

Nonresponse can have different causes, and it is a good idea to distinguish between them because different types of nonresponse have different effects on estimators. The process of obtaining a response starts with an attempt to contact the sampled person. If contact cannot be made, this is a case of nonresponse. If contact is made, it can be established whether the person belongs to the target population of the survey; if not, he or she is excluded from the survey. If the person belongs to the target population, he or she has to be persuaded to co-operate. If the person refuses, this is again a case of nonresponse. Even if there is contact and the person is willing to co-operate, other circumstances (language, illness) may make participation difficult. Therefore, a response is obtained only when a person who belongs to the target population can be contacted and is willing and able to participate.
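The sequence of checks described above can be written as a small decision function. This Python sketch is only one way to encode the paragraph: the names `Outcome` and `classify_case` and the four boolean inputs are assumptions of this example, not part of any survey software.

```python
from enum import Enum


class Outcome(Enum):
    RESPONSE = "response"
    NONRESPONSE = "nonresponse"
    OUT_OF_SCOPE = "not in target population"


def classify_case(contacted, in_target_population, willing, able):
    """Follow the contact, eligibility, co-operation and ability checks in order."""
    if not contacted:
        return Outcome.NONRESPONSE   # no contact could be made
    if not in_target_population:
        return Outcome.OUT_OF_SCOPE  # excluded, not counted as nonresponse
    if not willing:
        return Outcome.NONRESPONSE   # refusal
    if not able:
        return Outcome.NONRESPONSE   # e.g. language barrier or illness
    return Outcome.RESPONSE


print(classify_case(True, True, True, True))    # Outcome.RESPONSE
print(classify_case(True, False, True, True))   # Outcome.OUT_OF_SCOPE
print(classify_case(False, True, True, True))   # Outcome.NONRESPONSE
```

Keeping the reasons for nonresponse separate, rather than collapsing them into one count, makes it possible to see which stage of the process loses the most cases.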

5.2 Ways to avoid nonresponse

Nonresponse bias is almost impossible to eliminate completely, but there are a few ways to reduce it as much as possible [see 11, “How to Avoid Nonresponse Error”, para. 8]. Obviously, a professional, well-structured and well-designed survey will help to achieve higher completion rates, but there are also some practical ways to lower nonresponse bias. To begin with, the survey conductor should be aware of the different communication software and devices used by the sample and check that the survey is compatible with them. He or she should also check how long the survey takes to load, because people are more likely to ignore it if it takes too long. Secondly, the conductor should avoid keeping the survey live for only a short period, because that period may not fit the times at which respondents are able to answer. Extending the data collection period can therefore significantly reduce the level of nonresponse bias. Moreover, the conductor should send a few reminders to potential respondents throughout the collection period. However, a reminder at the very start of the collection period might not make a difference, and care must be taken not to send reminders to those who have already completed the survey. In addition, the survey conductor should assure respondents that the information they provide will be kept completely confidential, in order to persuade them to respond. Last but not least, the conductor should offer incentives, such as rewards, or keep the survey relatively short, in order to avoid nonresponse.
29.11.2016
