In a post I published a few month ago (in French, here, based on some old paper,there), I mentioned a statistical procedure to test if the underlying distribution of an i.i.d. sample had a finite mean (based on extreme value results). Since I just used it on a small dataset (yes, with real data), I decided to post the R code, since it is rather simple to use. But instead of working on that dataset, let us see what happens on simulated samples. Consider =200 observations generated from a Pareto distribution

(here =2, as a start)

> a=2 > X=runif(200)^(-1/a)

Here, we will use the package developped by Mathieu Ribatet,

> library(RFA)

- Using Generalized Pareto Distribution (and LR test)

A first idea is to fit a GPD distribution on observations that exceed a threshold >1.

Since we would like to test (against the assumption that the expected value is finite, i.e. ), it is natural to consider the likelihood ratio

Under the null hypothesis, the distribution of should be chi square distribution with one degree of freedom. As mentioned here, the significance level is attained with a higher accuracy by employing Bartlett correction (there). But let us make it as simple as possible for the blog, and use the chi-square distribution to derive the *p*-value.

Since it is rather difficult to select an appropriate threshold, it can be natural (as in Hill’s estimator) to consider , and thus, to fit a GPD on the largest values. And then to plot all that on a graph (like the Hill plot)

> Xs=rev(sort(X)) > s=0;G=rep(NA,length(Xs)-14);Gsd=G;LR=G;pLR=G > for(i in length(X):15){ + s=s+1 + FG=fitgpd(X,Xs[i],method="mle") + FGc=fitgpd(X,Xs[i],method="mle",shape=1) + G[s]=FG$estimate[2] + Gsd[s]=FG$std.err[2] + FGc$fixed + LR[s]=FGc$deviance-FG$deviance + pLR[s]=1-pchisq(LR[s],df=1) + }

Here we keep the estimated value of the tail index, and the associated standard deviation of the estimator, to draw some confidence interval (assuming that the maximum likelihood estimate is Gaussian, which is correct only when n is extremely large). We also calculate the deviance of the model, the deviance of the constrained model (), and the difference, which is the log likelihood ratio. Then we calculate the *p*-value (since under the likelihood ratio statistics has a chi-square distribution).

If =2, we have the following graph, with on top the *p*-value (which is almost null here), the estimation of the tail index he largest values (and a confidence interval for the estimator),

If =1.5 (finite mean, but infinite variance), we have

i.e. for those two models, we clearly reject the assumption of infinite mean (even if gets too close from 1, we should consider thresholds large enough). On the other hand, if =1 (i.e. infinite mean) we clearly accept the assumption of infinite mean (whatever the threshold),

- Using Hill’s estimator

An alternative could be to use Hill’s estimator (with Alexander McNeil’s package). See here for more details on that estimator. The test is simply based on the confidence interval derived from the (asymptotic) normal distribution of Hill’s estimator,

> library(evir) > Xs=rev(sort(X)) > HILL=hill(X) > G=rev(HILL$y) > Gsd=rev(G/sqrt(HILL$x)) > pLR=1-pnorm(rep(1,length(G)),mean=G,sd=Gsd)

Again, if =2, we clearly rejct the assumption of infinite mean,

and similarly, if =1.5 (finite mean, but infinite variance)

Here the test is more robust than the one based on the GPD. And if =1 (i.e. infinite mean), again we accept ,

Note that if =0.7, it is still possible to run the procedure, and hopefully, it suggests that the underlying distribution has infinite mean,

(here without any doubt). Now you need to wait a few days to see some practical applications of the idea (there was on in the paper mentioned above actually, on business interruption insurance losses).