Chapter 3
Methodology
3.0: Introduction
This chapter discusses the scope of the data, the data processing, and the actuarial modeling process, together with the methods adopted for the study. The study focused on pensioners and the occurrence of deaths.
3.1: Scope of Data
The study focuses on the occurrence of death among Ghanaian female pensioners who retired at SSNIT from 1990 to 2005. These pensioners include those who retired voluntarily between 54 and 60 years of age and those who retired between 60 and 65 years of age. All the pensioners were exposed to investigation from the day of retirement to 2010. Each pensioner was observed from the age of retirement to 80 years, and the occurrence of death was recorded over each one-year period. The investigation ended at age 80 because too few deaths were recorded after that age, and including them would distort the output of the analysis.
Secondary data from the Social Security and National Insurance Trust (SSNIT), consisting of 2,178 female pensioners, were sampled at five-year intervals of pension year: 1990, 1995, 2000 and 2005. The total number of deaths recorded across these cohorts was 424.
The table below describes the selection of the cohort groups.
Table 3.1: Sample data for the five-year-interval cohort groups, 1990 to 2005
Cohort group    Pensioners    Occurrence of death
1990                    61                     28
1995                   358                    162
2000                   670                    168
2005                 1,089                     66
Total                2,178                    424
3.2: Data collection
The study uses a quantitative approach to model the occurrence of death among Ghanaian female pensioners. Secondary data were obtained from SSNIT, covering female pensioners aged 55 to 80 years for the period 1990 to 2010. The data contain each pensioner's date of birth, date of death (if any) and year of retirement. Life certificates had been updated as at 02/06/2014, when the data were retrieved for the study. Any pensioner whose life certificate had not been updated by that date was assumed dead until proved otherwise.
The general pensioner population includes invalidity pensioners, hazardous-work pensioners, old-age pensioners and early retirees. For this study, the target population comprises both old-age pensioners and valid early retirees. Old-age retirees are individuals who retire at the normal retirement age of 60 years, while early retirees are individuals who retire voluntarily between the ages of 55 and 59 years.
In order to obtain a homogeneous group of early retirees and old-age pensioners for the pension years 1990, 1995, 2000 and 2005, taken at five-year intervals, purposive sampling was employed to select individuals from the general pensioner population to form the cohort groups.
The resulting sample consists of 2,178 female pensioners, grouped by year of retirement and observed from age 55 to 80 years: 61 pensioners for pension year 1990, 358 for 1995, 670 for 2000 and 1,089 for 2005.
For the purposes of this study, records with major inconsistencies were discarded. These inconsistencies include inaccurate or blank dates of birth, retirement or death, and very late entry into the pensioner category. On average, about 10% of the total population was excluded before arriving at the stated sample size. The general population was about 120,000 pensioners, male and female combined. At the time of the study, pensioners who had not renewed their life certificates and whose pension payments had ceased were assumed dead at the date of last update. Of the 2,178 female pensioners selected from the general population, 424 were recorded as dead.
For confidentiality, member identification numbers were removed and the data were regrouped to retain three essential details: date of retirement, date of death or last update, and current age if still alive. The data were further sorted and regrouped to obtain, for each target year, the age at pension, the number of deaths at each age and the exposed-to-risk at each age. Pensioners were exposed to investigation from the pension year to June 2014 and were observed from ages 55 to 80 years. The investigation stopped at age 80 because reported deaths beyond that age were very scanty and would have produced distorted or misleading results.
3.3: Methodology
Secondary data were used for the research. For each age $x$, the data give the number of workers who retired at age $x$ to $x+1$, taken as the exposed to risk ($E_x$) within the year, and the number of pensioners who died in that year ($d_x$). The crude mortality rates ($q_x$) obtained for a particular year are discrete and not smooth across ages. Graduation with a Poisson model was therefore carried out to smooth the crude rates into a continuous curve. However, the female mortality data contain excess zeros, which the Poisson model did not fit, so a zero-inflated Poisson (ZIP) model with a logit component was proposed.
Exposure-to-risk (Ex):
$E_x$ denotes the number of person-years lived during the year by people aged $x$ at the start of the year. Assuming that people who die during a year have, on average, been alive for half of it, the exposed-to-risk can be approximated by the number of survivors plus half the number of deaths in the group (Pitacco et al., 2009). The count model accounts for differences in observation periods by including the log of the exposure variable as a term whose coefficient is constrained to one. Because this approach works with the correct probability distribution for counts, it is in many cases superior to analysing rates directly as response variables. The exposure adjusts the counts on the response side only, so various kinds of rates, indexes or per capita measures can still be used as predictors.
Central Exposed-to-risk ($E_x^c$):
The central exposed-to-risk is more versatile than the initial exposed-to-risk and is simpler to calculate from the available data.
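The approximation described above (survivors plus half the deaths) can be sketched in a few lines of Python. This is only an illustration: the function name and the counts are hypothetical and are not taken from the SSNIT data.

```python
def approx_exposed_to_risk(survivors: int, deaths: int) -> float:
    """Approximate person-years lived in one year of age.

    Assumes that those who die are, on average, alive for half the year,
    so each death contributes half a person-year.
    """
    if survivors < 0 or deaths < 0:
        raise ValueError("counts must be non-negative")
    return survivors + 0.5 * deaths


# Hypothetical age group: 95 pensioners survive the year and 10 die during it.
exposure = approx_exposed_to_risk(survivors=95, deaths=10)
print(exposure)  # 100.0
```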
Production of Crude Mortality Rates for 1990, 1995, 2000 and 2005
The crude mortality rate for a given age for any given year is the probability that a person at age x dies that year. Crude mortality rates are usually calculated by simply dividing the relevant number of deaths by the number of life-years that were exposed to the risk of death over that period. The crude mortality rates for each plan year 1990, 1995, 2000 and 2005 were developed accordingly.
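As a minimal sketch of the calculation just described, the crude rate at each age is the number of deaths divided by the life-years exposed to risk. The ages, deaths and exposures below are made-up values for illustration only, and `crude_rates` is a hypothetical helper name.

```python
def crude_rates(deaths: dict, exposure: dict) -> dict:
    """Return deaths / exposure for every age that has positive exposure."""
    rates = {}
    for age, d in deaths.items():
        e = exposure[age]
        if e > 0:
            rates[age] = d / e
    return rates


# Hypothetical cells for three ages (not study data).
deaths = {60: 0, 61: 2, 62: 1}
exposure = {60: 50.0, 61: 80.0, 62: 40.0}

rates = crude_rates(deaths, exposure)
for age in sorted(rates):
    print(age, round(rates[age], 4))
```

Ages with no deaths give a crude rate of exactly zero, which is one way the excess zeros discussed later show up in the data.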
Description of Female Pension Data
Pension data take the form of the number of deaths and the number of living pensioners exposed to the risk of death, arranged in cells by year of death and age at death. The study focuses on the occurrence of death within a year, which gives a count (discrete) outcome variable. A total of 424 deaths occurred across the five-year-interval cohorts between ages 55 and 80 years. The data were cleaned by discarding all pensioners aged over 80 years, since few records were available at those ages. The R software was then used to compute descriptive statistics for each cohort group. The output showed an excess of zeros with large variation, and this informed the choice of models. Two models were proposed for the data: the zero-inflated negative binomial and the negative binomial. Before discussing them, the Poisson regression model and the zero-inflated Poisson model are considered. The response variable, y, is the number of deaths that occurred in the year, and the predictor variable, x, is the age at which death occurred.
MODELS
Poisson regression model
The Poisson regression model is used to model count data. The Poisson distribution is a discrete probability distribution for the number of events occurring within a given time interval. Poisson regression models the log of the expected count as a linear function of the observed covariates, which gives the generalized linear model with Poisson response and log link.
If the number of occurrences is a random variable $Y$ that has a Poisson distribution with parameter $\mu$ and takes the integer values $y = 0, 1, 2, \dots$, then its probability distribution is given by
\[ P(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, \qquad \mu > 0 \tag{3.1} \]
where $\mu$ is the rate parameter, which indicates the average number of events in the given time interval.
The mean and the variance of the Poisson distribution are equal:
\[ E(Y) = \operatorname{Var}(Y) = \mu \]
Because the mean is equal to the variance, any factor that affects one will also affect the other. The Poisson distribution can only be applied under the following assumptions:
1. the event is something that can be counted in whole numbers;
2. occurrences are independent, so that one occurrence neither diminishes nor increases the chance of another;
3. the average frequency of occurrence for the time period in question is known.
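The equality of the Poisson mean and variance can be checked numerically. The sketch below evaluates the probability function directly and sums over a truncated support; the rate value 2.5 is an arbitrary choice for illustration.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson distribution with rate mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)


mu = 2.5                 # arbitrary illustrative rate
support = range(0, 60)   # truncation point; the tail beyond it is negligible here

mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)

print(round(mean, 6), round(var, 6))  # both are 2.5 up to rounding
```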
Under the Poisson model, we use the central exposed-to-risk and assume that the force of mortality $\mu$ is constant between integer ages, so that the number of deaths $D$ has a Poisson distribution with mean $\mu E_x^c$, i.e. $D \sim \text{Poisson}(\mu E_x^c)$. Then:
\[ P(D = d) = \frac{e^{-\mu E_x^c} (\mu E_x^c)^{d}}{d!} \tag{3.2} \]
The maximum likelihood estimator of $\mu$ is:
\[ \tilde{\mu} = \frac{D}{E_x^c} \]
This estimator is asymptotically normally distributed with mean and variance:
\[ E(\tilde{\mu}) = \mu, \qquad \operatorname{Var}(\tilde{\mu}) = \frac{\mu}{E_x^c} \]
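A short sketch of this estimator follows. It assumes a single age cell with hypothetical deaths and central exposure, and it plugs the estimate into the asymptotic variance to obtain an approximate standard error; that plug-in step is a common convention and is an assumption here, not something stated in the text.

```python
import math


def force_of_mortality_mle(deaths: int, central_exposure: float):
    """Return the estimate D / E_x^c and an approximate standard error.

    The standard error uses Var = mu / E_x^c with mu replaced by its estimate.
    """
    if central_exposure <= 0:
        raise ValueError("central exposure must be positive")
    mu_hat = deaths / central_exposure
    std_err = math.sqrt(mu_hat / central_exposure)
    return mu_hat, std_err


# Hypothetical cell: 8 deaths over 400 person-years of central exposure.
mu_hat, std_err = force_of_mortality_mle(deaths=8, central_exposure=400.0)
print(mu_hat, round(std_err, 5))
```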
Poisson regression was used to model the log of the expected number of deaths as a linear function of age. The model is given by
\[ \log(\mu) = \beta_0 + \beta_1 x \tag{3.3} \]
where $x$ denotes the explanatory variable (in general, a vector of explanatory variables) and $\beta$ the vector of regression parameters.
However, this model did not fit the study data: although the data are counts, the mean is not equal to the variance. This was due to the excess zeros in the data, which were genuine outcomes rather than sampling error. A zero-inflated Poisson model was therefore proposed.
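One informal way to see this kind of misfit is to compare the sample mean with the sample variance and the observed share of zeros with the share a Poisson distribution with the same mean would predict. The sketch below uses a made-up vector of death counts, not the study data, and `zero_excess_summary` is a hypothetical helper name.

```python
import math


def zero_excess_summary(counts):
    """Compare sample mean/variance and observed vs Poisson-expected zeros."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    observed_zero = sum(1 for c in counts if c == 0) / n
    expected_zero = math.exp(-mean)  # P(Y = 0) under Poisson(mean)
    return {
        "mean": mean,
        "variance": variance,
        "observed_zero": observed_zero,
        "expected_zero": expected_zero,
    }


# Hypothetical deaths per age cell.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 6, 0, 2, 0, 0]
summary = zero_excess_summary(counts)
for key, value in summary.items():
    print(key, round(value, 3))
```

In this toy example the variance exceeds the mean and the observed share of zeros is well above the Poisson prediction, which is the pattern that motivates a zero-inflated model.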
Zero-Inflated Poisson (ZIP)
Data with an excess of zero counts are modeled by the ZIP regression model. Theory suggests that the excess zeros are generated by a process separate from the one that generates the counts, so the excess zeros can be modeled independently. The ZIP model has two parts: the first uses a Poisson model for the counts and the second uses a logit model to predict the excess zeros. Zero-inflated models estimate the two equations simultaneously, one for the count model and one for the excess zeros.
\[ \Pr(Y_i = 0) = \pi + (1-\pi) e^{-\mu} \tag{3.4} \]
\[ \Pr(Y_i = y_i) = (1-\pi) \frac{\mu^{y_i} e^{-\mu}}{y_i!}, \qquad y_i > 0 \tag{3.5} \]
where $y_i$ is the outcome variable taking any non-zero value, $\mu$ is the expected Poisson count for the $i$th individual and $\pi$ is the probability of an extra zero. The ZIP regression model has mean $(1-\pi)\mu$ and variance $\mu(1-\pi)(1+\mu\pi)$. This model fits best when the counts, apart from the excess zeros, are not over-dispersed, that is, when the remaining variance does not greatly exceed the mean.
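The ZIP probabilities in (3.4) and (3.5) and the stated mean and variance can be verified numerically. The sketch below uses arbitrary illustrative values for the Poisson mean and the extra-zero probability.

```python
import math


def zip_pmf(y: int, mu: float, pi: float) -> float:
    """Zero-inflated Poisson probability of observing the count y."""
    poisson = math.exp(-mu) * mu**y / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson   # extra zeros plus Poisson zeros
    return (1 - pi) * poisson            # non-zero counts come from the Poisson part


mu, pi = 2.0, 0.3        # arbitrary illustrative values
support = range(0, 80)   # truncated support; the tail is negligible here

total = sum(zip_pmf(y, mu, pi) for y in support)
mean = sum(y * zip_pmf(y, mu, pi) for y in support)
var = sum((y - mean) ** 2 * zip_pmf(y, mu, pi) for y in support)

print(round(total, 6))                                   # probabilities sum to 1
print(round(mean, 6), (1 - pi) * mu)                     # numerical vs formula mean
print(round(var, 6), mu * (1 - pi) * (1 + mu * pi))      # numerical vs formula variance
```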
Logistic regression model
Logistic regression measures the relationship between a dependent variable and one or more independent variables by using the logistic function to estimate probabilities. The logistic regression model, also known as the logit model, is used to model dichotomous outcome variables.
Logistic regression is also a special case of the generalized linear model. In the logit model, the log-odds of the outcome are modeled as a linear combination of the predictor variables. The logistic function takes input values from negative to positive infinity, but its output values lie between zero and one. The logistic function is given by
\[ \sigma(t) = \frac{e^{t}}{e^{t}+1} = \frac{1}{1+e^{-t}} \tag{3.6} \]
With $t$ a linear function of an explanatory variable $x$,
\[ t = \beta_0 + \beta_1 x \tag{3.7} \]
the logistic function can now be written as:
\[ F(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}} \]
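A small sketch of the logistic function in (3.6), combined with the linear predictor in (3.7), is given here. The coefficient values are hypothetical and serve only to show how an age is mapped to a probability between zero and one.

```python
import math


def logistic(t: float) -> float:
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def probability_at_age(age: float, beta0: float, beta1: float) -> float:
    """Apply the logistic function to the linear predictor beta0 + beta1 * age."""
    return logistic(beta0 + beta1 * age)


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = 6.0, -0.08
for age in (55, 65, 75):
    print(age, round(probability_at_age(age, beta0, beta1), 3))
```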
Negative Binomial Regression Model (NB)
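The negative binomial regression model is a parametric model that is more dispersed than the Poisson and can therefore handle over-dispersion in the data. Let $y$ be the response variable, the number of deaths occurring in a year, and suppose that $y$ follows a Poisson distribution with mean $\mu$, where $\mu$ is itself a random variable with a gamma distribution. That is,
\[ y \mid \mu \sim \text{Poisson}(\mu) \quad \text{and} \quad \mu \sim \text{Gamma}(\alpha, \beta), \]
where the gamma distribution has mean $\alpha\beta$ and variance $\alpha\beta^{2}$, with probability density
\[ P(\mu) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, \mu^{\alpha-1} \exp(-\mu/\beta), \qquad \mu > 0 \tag{3.8} \]
Then the unconditional distribution of $y$ is the negative binomial
\[ P(y) = \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)\, y!} \left(\frac{\beta}{1+\beta}\right)^{y} \left(\frac{1}{1+\beta}\right)^{\alpha}, \qquad y = 0, 1, 2, \dots \tag{3.9} \]
This distribution has mean
\[ E(y) = E[E(y \mid \mu)] = E(\mu) = \alpha\beta \]
and variance
\[ \operatorname{Var}(y) = E[\operatorname{Var}(y \mid \mu)] + \operatorname{Var}[E(y \mid \mu)] = E(\mu) + \operatorname{Var}(\mu) = \alpha\beta + \alpha\beta^{2} \tag{3.10} \]
Expressing the negative binomial distribution in terms of the parameters $\mu = \alpha\beta$ and $k = 1/\alpha$ gives $E(y) = \mu$ and $\operatorname{Var}(y) = \mu + k\mu^{2}$, so the variance is a quadratic function of the mean.
Therefore the distribution of $y$ is given by
\[ P(y) = \frac{\Gamma(k^{-1}+y)}{\Gamma(k^{-1})\, y!} \left(\frac{k\mu}{1+k\mu}\right)^{y} \left(\frac{1}{1+k\mu}\right)^{1/k} \tag{3.11} \]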
Note the following about the negative binomial distribution:
the negative binomial distribution has two parameters, μ and k;
the over-dispersion parameter is k;
the negative binomial distribution becomes the same as the Poisson distribution when there is no over-dispersion (k → 0);
the expected value of the distribution is μ.
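The properties listed above can be checked with a short numerical sketch of the probability function in the (μ, k) parameterization. The parameter values are arbitrary, and log-gamma functions are used only to keep the computation stable.

```python
import math


def nb_pmf(y: int, mu: float, k: float) -> float:
    """Negative binomial probability with mean mu and dispersion k (k > 0)."""
    r = 1.0 / k
    log_p = (
        math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
        + y * math.log(k * mu / (1.0 + k * mu))
        - r * math.log(1.0 + k * mu)
    )
    return math.exp(log_p)


mu, k = 3.0, 0.5          # arbitrary illustrative values
support = range(0, 400)   # truncated support; the tail is negligible here

mean = sum(y * nb_pmf(y, mu, k) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, k) for y in support)
print(round(mean, 6), round(var, 6))  # compare with mu and mu + k * mu**2

# With k close to zero the probabilities approach the Poisson ones.
poisson_at_2 = math.exp(-mu) * mu**2 / math.factorial(2)
print(round(nb_pmf(2, mu, 1e-6), 4), round(poisson_at_2, 4))
```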
To model the negative binomial, let $y_i \sim \text{NB}(\mu_i, k)$ with the log link, so that
\[ \log \mu_i = \beta_0 + \beta_1 x_1 + \dots \quad \text{(plus the offset)} \tag{3.12} \]
The likelihood function for the negative binomial model is given by
\[ L(\beta \mid y, X) = \prod_{i=1}^{N} \Pr(y_i \mid x_i) = \prod_{i=1}^{N} \frac{\Gamma(k^{-1}+y_i)}{\Gamma(k^{-1})\, y_i!} \left(\frac{k\mu_i}{1+k\mu_i}\right)^{y_i} \left(\frac{1}{1+k\mu_i}\right)^{1/k}, \]
where $\mu_i = E(y_i \mid x_i) = \exp(x_i \beta)$.
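The likelihood above is usually maximized on the log scale. The sketch below evaluates the negative binomial log-likelihood for a single covariate (age) and treats the offset as the log of an exposure value, which is an assumption made for illustration. The data and coefficient values are hypothetical, and no optimizer is included.

```python
import math


def nb_log_likelihood(beta0, beta1, k, ages, deaths, exposure):
    """Sum of log NB probabilities with log(mu_i) = beta0 + beta1 * age_i + log(exposure_i)."""
    r = 1.0 / k
    total = 0.0
    for x, y, e in zip(ages, deaths, exposure):
        mu = e * math.exp(beta0 + beta1 * x)  # exposure enters as an offset
        total += (
            math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
            + y * math.log(k * mu / (1.0 + k * mu))
            - r * math.log(1.0 + k * mu)
        )
    return total


# Hypothetical cells: age, deaths and exposure (not study data).
ages = [60, 65, 70, 75]
deaths = [0, 1, 3, 4]
exposure = [120.0, 100.0, 80.0, 50.0]

ll = nb_log_likelihood(beta0=-9.0, beta1=0.08, k=0.5,
                       ages=ages, deaths=deaths, exposure=exposure)
print(round(ll, 3))
```

In practice a numerical optimizer would search over the coefficients and k for the values that make this quantity as large as possible.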
Zero-Inflated Negative Binomial Regression
The zero-inflated model for data with excess zeros assumes that the zeros arise from two different processes. The study data concern the occurrence of death among Ghanaian female pensioners, and the occurrence has two processes: in the first, no death occurs, which gives an outcome of zero; in the second, deaths occur, which gives a count outcome. The first process, the zeros, is modeled by the logit model, whereas the second process, the counts, is modeled by the negative binomial model. The expected count is expressed as a combination of the two processes:
\[ E(\text{number of deaths}) = P(\text{no death}) \times 0 + P(\text{death}) \times E(y \mid \text{death}) \]
The zero-inflated negative binomial distribution is a mixture distribution that assigns a mass of $p$ to the extra zeros and a mass of $(1-p)$ to a negative binomial distribution, where $0 \le p \le 1$. The negative binomial component is itself a continuous mixture of Poisson distributions whose mean $\mu$ is gamma distributed, which models the over-dispersion. For a better understanding of the zero-inflated negative binomial regression, recall the negative binomial model:
\[ P(Y = y) = \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)\, y!} \left(1+\frac{\mu}{\alpha}\right)^{-\alpha} \left(1+\frac{\alpha}{\mu}\right)^{-y}, \qquad y = 0, 1, 2, \dots;\ \mu, \alpha > 0 \tag{3.13} \]
where $\mu = E(Y)$, $\alpha$ is the shape parameter that governs the amount of over-dispersion, $Y$ is the response variable of interest and the variance of $Y$ is $\mu + \mu^{2}/\alpha$. The ZINB distribution is given by
\[ P(Y = y) = \begin{cases} p + (1-p)\left(1+\dfrac{\mu}{\alpha}\right)^{-\alpha}, & y = 0 \\[2ex] (1-p)\, \dfrac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)} \left(1+\dfrac{\mu}{\alpha}\right)^{-\alpha} \left(1+\dfrac{\alpha}{\mu}\right)^{-y}, & y = 1, 2, \dots \end{cases} \tag{3.14} \]
The zero-inflated negative binomial distribution has mean $E(Y) = (1-p)\mu$ and variance $\operatorname{Var}(Y) = (1-p)\mu\,(1 + p\mu + \mu/\alpha)$. Note that it reduces to the Poisson distribution when both $1/\alpha$ and $p$ are approximately 0.
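The ZINB probabilities and the stated mean and variance can be checked numerically in the same way as for the ZIP model. The sketch below uses arbitrary illustrative parameter values and the (μ, α) form of the negative binomial given above.

```python
import math


def zinb_pmf(y: int, mu: float, alpha: float, p: float) -> float:
    """Zero-inflated negative binomial probability of observing the count y."""
    log_nb = (
        math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
        - alpha * math.log(1.0 + mu / alpha)
        - y * math.log(1.0 + alpha / mu)
    )
    nb = math.exp(log_nb)
    if y == 0:
        return p + (1 - p) * nb   # extra zeros plus negative binomial zeros
    return (1 - p) * nb


mu, alpha, p = 3.0, 2.0, 0.25   # arbitrary illustrative values
support = range(0, 400)          # truncated support; the tail is negligible here

total = sum(zinb_pmf(y, mu, alpha, p) for y in support)
mean = sum(y * zinb_pmf(y, mu, alpha, p) for y in support)
var = sum((y - mean) ** 2 * zinb_pmf(y, mu, alpha, p) for y in support)

print(round(total, 6))
print(round(mean, 6), (1 - p) * mu)
print(round(var, 6), (1 - p) * mu * (1 + p * mu + mu / alpha))
```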
Model selection
To compare the two models and select the one that best fits the study data, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were used. The model with the lowest AIC and BIC is selected as the best fit.
Likelihood function
The likelihood of a set of parameter values $\theta$, given observed outcomes $x$, is the probability of those observed outcomes given the parameter values:
\[ L(\theta \mid x) = P(x \mid \theta) \]
Given a parameterized family of probability functions (probability mass functions in the discrete case),
\[ x \mapsto f(x \mid \theta), \]
where $\theta$ is the parameter, the likelihood function is
\[ \theta \mapsto f(x \mid \theta), \]
written
\[ L(\theta \mid x) = f(x \mid \theta), \]
with $x$ being the observed outcome of the data. In other words, when $f(x \mid \theta)$ is viewed as a function of $x$ with $\theta$ fixed, it is a probability density function, and when viewed as a function of $\theta$ with $x$ fixed, it is a likelihood function.
From a geometric standpoint, if we consider f (x, θ) as a function of two variables then the family of probability distributions can be viewed as a family of curves parallel to the x-axis, while the family of likelihood functions are the orthogonal curves parallel to the θ-axis.
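To make the distinction concrete, the sketch below holds a small set of hypothetical observations fixed and evaluates the Poisson log-likelihood over a grid of parameter values. The grid search is only an illustration of viewing the function in θ; it is not the estimation method used in the study.

```python
import math


def poisson_log_likelihood(theta: float, observations) -> float:
    """Log-likelihood of a Poisson rate theta for fixed observed counts."""
    return sum(
        -theta + x * math.log(theta) - math.lgamma(x + 1)
        for x in observations
    )


observations = [0, 2, 1, 0, 3]            # hypothetical counts
grid = [i / 100 for i in range(1, 501)]   # candidate theta values 0.01 .. 5.00

best_theta = max(grid, key=lambda t: poisson_log_likelihood(t, observations))
sample_mean = sum(observations) / len(observations)
print(best_theta, sample_mean)
```

For the Poisson family the maximizing value coincides with the sample mean, which the grid search recovers here.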
Akaike Information Criterion (AIC)
One way of selecting a model from a set of candidate models is to use the Akaike Information Criterion (AIC). The criterion chooses the model that minimizes the Kullback-Leibler distance between the model and the truth; it seeks a model that fits the truth well with few parameters. It is defined as:
AIC = -2 ( ln ( likelihood )) + 2 K
where likelihood is the probability of the data given a model and K is the number of free parameters in the model. AIC scores are often shown as ∆AIC scores, or difference between the best model (smallest AIC) and each model (so the best model has a ∆AIC of zero).
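The AIC formula and the ∆AIC comparison can be sketched as follows. The log-likelihood values and parameter counts below are invented for illustration and are not results from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = -2 ln(likelihood) + 2K."""
    return -2.0 * log_likelihood + 2.0 * n_params


def delta_aic(scores: dict) -> dict:
    """Difference between each model's AIC and the smallest AIC."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# Hypothetical fitted models: (maximized log-likelihood, number of free parameters).
fits = {"NB": (-210.4, 3), "ZINB": (-205.1, 5)}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
deltas = delta_aic(scores)
best_model = min(scores, key=scores.get)

for name in fits:
    print(name, round(scores[name], 1), round(deltas[name], 1))
print("lowest AIC:", best_model)
```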
Abstract
The occurrence of death among pensioners is a count outcome that is either zero or non-zero. Because of the high life expectancy of females, these count data contain an excess of zeros. The appropriate model for analysing such data is the zero-inflated regression model. However, the non-zero observations may be over-dispersed relative to the Poisson distribution, which leads to underestimated standard errors and biased parameter estimates. The models considered for the study data are therefore the zero-inflated negative binomial regression model and the negative binomial model, which account for these characteristics better than the zero-inflated Poisson. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were used to compare the two models and select the one that best fits the study data.
Problem Statement
In pre-modern Ghanaian society, women were seen as bearers of children, retailers of fish and farmers; having no formal education, they could not work in the public sector. From the twentieth century onwards, women's level of education has improved steadily, which has enabled women to work in the public sector. Although women form the majority of the nation's population, at about 52%, women working in the public sector are a minority of about 13%.
According to the World Bank, the adult female mortality rate in Ghana (per 1,000 female adults) was 233.49 in 2013. The adult mortality rate is the probability of dying between the ages of 15 and 60, that is, the probability that a 15-year-old dies before reaching age 60 if subject to current age-specific mortality rates between those ages. This shows that the life expectancy of women is high, so the few women who go on retirement tend to live long after pension before death occurs.
SSNIT pays monthly benefits to pensioners from the date of retirement until death. Deaths are not always reported; at times a death is assumed when a pensioner fails to renew the pension certificate. In addition, most deaths at age 73 and above are not reported.
The occurrence of death, like mortality data in general, is a count outcome variable that can be zero or non-zero. The zero counts can arise from two distinct processes: true (structural) zeros, which are inevitable and can be regarded as the result of a dichotomous process, and sampling zeros, which occur by chance as part of the counting process. The main problem is choosing the model that gives the best fit between the observed data and the predicted values.