Chapter 1
Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference: Revisiting Stein's Paradox and Admissibility

The primary objective of this paper is to make a case that R.A. Fisher's objections to the decision-theoretic framing of frequentist inference are not without merit. It is argued that this framing is congruent with the Bayesian but incongruent with the frequentist approach: it provides the former with a theory of optimal inference but misrepresents the optimality theory of the latter. Decision-theoretic and Bayesian rules are considered optimal when they minimize the expected loss "for all possible values of θ in Θ" [∀θ ∈ Θ], irrespective of what the true value θ* [the state of Nature] happens to be, i.e., the value that gave rise to the data. In contrast, the theory of optimal frequentist inference is framed entirely in terms of the capacity of the procedure to pinpoint θ*. The inappropriateness of the quantifier ∀θ ∈ Θ calls into question the relevance of admissibility as a minimal property for frequentist estimators. As a result, the pertinence of Stein's paradox, as it relates to the capacity of frequentist estimators to pinpoint θ*, needs to be reassessed. The paper also contrasts loss-based errors with traditional frequentist errors, arguing that the former are attached to θ, but the latter to the inference procedure itself.


The decision-theoretic framework is widely regarded as providing a broader, neutral framing of inference that brings into focus the common features of the frequentist and Bayesian approaches and tones down their differences; see Refs. [2-4]. Historically, Wald [5] proposed the original variant of the decision-theoretic framework with a view to unifying Neyman's [6] rendering of frequentist interval estimation and testing:

"The problem in this formulation is very general.It contains the problems of testing hypotheses and of statistical estimation treated in the literature."(p.340) Among the frequentist pioneers, Jerzy Neyman accepted enthusiastically this broader perspective, primarily because the concepts of decision rules and action spaces seemed to provide a better framing for his behavioristic interpretation of Neyman-Pearson (N-P) testing based on the accept/reject rules; see Refs.[7,8].Neyman's attitude towards Wald's [1] framing was also adopted wholeheartedly by some of his most influential students/colleagues at Berkeley, including [9,10].In a foreword of a collection of Neyman's early papers, his students/editors described the Wald's framing as ( [11], p. vii):

"A natural but far reaching extension of their [N-P formulation] scope can be found in Abraham Wald's theory of statistical decision functions."

At the other end of the argument, Fisher [12] rejected Wald's framing on the grounds that it seriously distorts his rendering of frequentist statistics:

"The attempt to reinterpret the common tests of significance used in s ure and led to "decisions" in Wald's sense, originated in several misapprehensions and has led, apparently, to several more."(p.69) With a few exceptions, such as Refs.[13][14 [15], Fisher's [12] viewpoint has been inadequately discussed and evaluated by the subsequent statistics literature.The primary aim of this paper is to revisit Fisher's minority view by taking a closer look at the decision-theoretic framework with a view to reevaluate the claim that it provides a neutral framework for comparing the frequentist and Bayesian approaches.I is argued that Fisher's view that the decision theoretic framing is germane to "acceptance sampling," but misrepresents frequentist inference, is not without merit.

The key argument of the discussion that follows is that the decision-theoretic notions of loss function and admissibility are congruent with the Bayesian approach, but incongruent with both the primary objective and the underlying reasoning of the frequentist approach.

Section 2 introduces the basic elements of the decision-theoretic set-up with a view to bringing out its links to the Bayesian and frequentist approaches, calling into question the conventional wisdom concerning its neutrality. Section 3 takes a closer look at the Bayesian approach and argues that had the decision-theoretic apparatus not existed, Bayesians would have been forced to invent it in order to establish a theory of optimal Bayesian inference. Section 4 discusses critically the notions of loss functions and admissibility, focusing primarily on their role in giving rise to Stein's paradox and their incompatibility with the frequentist approach. It is argued that the frequentist dimension of the notions of a loss function and admissibility is more apparent than real. Section 5 makes a case that the decision-theoretic framework misrepresents both the primary objective and the underlying reasoning of the frequentist approach. Section 6 revisits the notion of a loss function and its dependence on "information other than the data." It is argued that loss-based errors are both different from and incompatible with the traditional frequentist errors, because they are attached to the unknown parameters instead of to the inference procedures themselves, as is the case with the traditional frequentist errors (type I, type II, and coverage).


The decision-theoretic set-up

Basic elements of the decision-theoretic framing

The current decision-theoretic set-up has three basic elements:

1. A prespecified (parametric) statistical model M_θ(x), generically specified by:
M_θ(x) = {f(x; θ), θ ∈ Θ}, x ∈ R^n_X, for θ ∈ Θ ⊂ R^m, m ≪ n, (1)
where f(x; θ) denotes the (joint) distribution of the sample X := (X_1, ..., X_n), R^n_X denotes the sample space, Θ the parameter space, and x_0 := (x_1, ..., x_n) the observed data.

2. A decision space D containing all mappings d(·): R^n_X → A, where A denotes the set of all actions available to the statistician.

3. A loss function L(·,·): D × Θ → R, representing the numerical loss if the statistician takes action a ∈ A when the state of Nature is θ ∈ Θ; see Refs. [2, 16-18].

The basic idea is that when the decision-maker selects action a, he/she does not know the "true" state of Nature, represented by θ*. However, contingent on each action a ∈ A, the decision-maker "knows" the loss L(d, θ) for each (d, θ) ∈ D × Θ. The decision-maker observes data x_0, which provides some information about θ*, and then maps each x ∈ R^n_X to a certain action a ∈ A guided solely by L(d, θ).
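To make these three elements concrete, the following minimal Python sketch (not part of the original formulation; the Normal model, the sample-mean rule, the squared-error loss, and the value θ* = 2 are assumptions chosen purely for illustration) spells out a statistical model, a decision rule, and a loss function, and shows that the loss can only be evaluated inside a simulation where θ* is set by us.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# 1. Statistical model M_theta(x): X_k ~ N(theta, 1), k = 1, ..., n (assumed for illustration)
def simulate_sample(theta, n):
    return rng.normal(loc=theta, scale=1.0, size=n)

# 2. A decision rule d(.): R^n -> A, here the sample mean used as a point estimator
def decision_rule(x):
    return x.mean()

# 3. A loss function L(d, theta): squared-error loss
def loss(d, theta):
    return (d - theta) ** 2

theta_star = 2.0                      # "state of Nature", unknown to the decision-maker
x0 = simulate_sample(theta_star, n=50)
action = decision_rule(x0)
# The loss below is computable only because the simulation, not the decision-maker, knows theta*
print(action, loss(action, theta_star))
```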


The original Wald framing

It is important to bring out the fact that the original Wald [5] framing was much narrower than the above basic elements 2 and 3, due to its original objective of formalizing the Neyman-Pearson (N-P) approach; see Ref. [19]. What were the key differences?


i. The decision (action) space: for estimation, D is the set of all singleton points of Θ, and for testing, D := (Θ_0, Θ_1), the null and alternative regions, respectively.

ii. The original loss (weight) function was a zero-positive function, with zero loss at θ = θ*:
L_{0−c}(θ, θ̂(X)) = { 0 if θ̂(X) = θ*;  c_θ > 0 if θ̂(X) = θ ≠ θ* },  θ ∈ Θ, (2)
where θ* is the true value of θ in Θ. For the discussion that follows, it is important to note that Eq. (2) is nonoperational in practice because θ* is unknown.
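A two-line sketch (with hypothetical numerical values; the estimate 1.97 and the true value 2.0 are made up) makes the nonoperational nature of Eq. (2) explicit: the loss of an observed estimate cannot be computed without already knowing θ*.

```python
def zero_positive_loss(theta_hat, theta_star, c=1.0):
    # L_{0-c}: zero loss only when the estimate hits the true value theta*
    return 0.0 if theta_hat == theta_star else c

# In practice theta* is unknown, so zero_positive_loss(theta_hat, theta_star=?) is not computable;
# it can be evaluated only inside a simulation where theta* is set by us:
print(zero_positive_loss(theta_hat=1.97, theta_star=2.0))   # -> 1.0
```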

The more general framing, introduced by Wald ([1, 20]) and broadened by Le Cam [21], extended the scope of the original set-up by generalizing the notions of loss functions and decision spaces. In what follows, it is argued that these extensions created serious incompatibilities with both the objective and the underlying reasoning of frequentist inference.

In addition, it is both of historical and methodological interest to note that Wald [5] introduced the notion of a prior distribution, π(θ), ∀θ ∈ Θ, into the original decision-theoretic machinery reluctantly, and justified it as being a useful tool for proving certain technical results:

"The situation regarding the introduction of an a priori probability distribution of θ is entirely different.First, the objection can be made against it, as Neyman has pointed out, that θ is merely an unknown constant a d not a variate, hence it makes no sense to speak of the probability distribution of θ.Second, even if we may assume that θ is a variate, we have in general no possibility of determining the distribution of θ and any assumptions regarding this distribution are of hypothetical character.The reason why we introduce here a hypothetical probability distribution of θ is simply that it proves to be useful in deducing certain theorems and in the calculation of the best system of re

ons of acceptance."(p.
02)


A shared neutral framework?

The frequentist, Bayesian, and decision-theoretic approaches share the notion of a statistical model by viewing data x_0 := (x_1, ..., x_n) as a realization of a sample X := (X_1, ..., X_n) from Eq. (1).

The key differences between the three approaches are as follows:

a. The frequentist approach relies exclusively on M_θ(x).

b. The Bayesian approach adds a prior distribution, π(θ), ∀θ ∈ Θ (for all θ in Θ).

c. The decision-theoretic framing revolves around a loss (gain or utility) function:
L(d(x), θ), ∀θ ∈ Θ, ∀x ∈ R^n_X. (3)
The loss function is often assumed to be an even, differentiable, and convex function of (d(x) − θ) and can take numerous functional forms; see Refs. [17, 18] inter alia.

The claim that the decision-theoretic perspective provides a neutral ground is often justified [3] on account of the loss function being a function of the sample and parameter spaces through the two universal quantifiers:

(i) "∀x ∈ R n X ," associated with the distribution of the sample:
frequentist : f ðx; θÞ; ∀x ∈ R n X ;ð4Þ
(ii)"∀θ ∈ Θ" associated with the posterior distribution:
Bayesian : πðθjx 0 Þ ¼ πðθÞ Á f ðx 0 jθÞ ð θ ∈ Θ πðθÞ Á f ðx 0 jθÞdθ ; ∀θ ∈ Θ:ð5Þ
The idea is that allowing for all values of x in R^n_X goes beyond the Bayesian perspective, which relies exclusively on a single point x_0. What is not obvious is whether that is sufficient to do justice to the frequentist approach. Closer scrutiny suggests that frequentist inference is misrepresented by the way both quantifiers are employed in the decision-theoretic framing of inference.

First, the quantifier ∀x ∈ R^n_X plays only a minor role, in transforming a loss function, say L(θ, θ̂(x)), into a risk function:
R(θ, θ̂) = E_X[L(θ, θ̂(X))] = ∫_{x∈R^n_X} L(θ, θ̂(x)) f(x; θ) dx, ∀θ ∈ Θ. (6)
This is the only place where the distribution of the sample, f(x; θ), ∀x ∈ R^n_X, enters the decision-theoretic framing, and the only relevant aspect of the behavior of θ̂(X) is how it affects the risk function for different values of θ in Θ. In frequentist inference, however, the distribution of the sample takes center stage in the theory of optimal inference. It determines the sampling distribution of any statistic Y_n = g(X) (estimator, test, or predictor) through:
F(y; θ) := P(Y_n ≤ y; θ) = ∫···∫_{{x: g(x) ≤ y, x ∈ R^n_X}} f(x; θ) dx, (7)
and that, in turn, yields the relevant error probabilities that determine optimal inference procedures.
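The contrast can be made tangible with a short simulation (a sketch only; the sample size, the true values of θ used in the loop, and the evaluation point y = 0.2 are assumptions for illustration): it approximates the risk function in Eq. (6) for the sample mean under squared-error loss, and the sampling distribution in Eq. (7) of the same statistic.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, reps = 20, 100_000

def risk_sample_mean(theta):
    # Monte Carlo approximation of R(theta, theta_hat) = E_X[(Xbar_n - theta)^2], Eq. (6)
    samples = rng.normal(theta, 1.0, size=(reps, n))
    return np.mean((samples.mean(axis=1) - theta) ** 2)

# Under squared-error loss the risk of Xbar_n is flat in theta (approximately 1/n)
for theta in [-3.0, 0.0, 5.0]:
    print(theta, risk_sample_mean(theta), 1 / n)

# The sampling distribution F(y; theta) = P(Xbar_n <= y; theta) of Eq. (7), by simulation
theta = 0.0
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
print(np.mean(xbar <= 0.2))        # compare with the Normal cdf value Phi(0.2 * sqrt(n))
```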

Second, the decision-theoretic notion of optimality revolves around the universal quantifier "∀θ ∈ Θ," rendering it congruent with the Bayesian but incongruent with the frequentist approach. To be more specific, since different risk functions often intersect over Θ, an optimal rule is usually selected after the risk function is reduced to a scalar. Two such choices of risk are:
Maximum risk: R_max(θ̂) = sup_{θ∈Θ} R(θ, θ̂);
Bayes risk: R_B(θ̂) = ∫_{θ∈Θ} R(θ, θ̂) π(θ) dθ. (8)
Hence, an obvious way to choose among different rules is to find the one that minimizes the relevant risk with respect to all possible estimators θ̂(x). In the case of Eq. (8), this gives rise to two corresponding decision rules:
Minimax rule: inf_{θ̂(x)} R_max(θ̂) = inf_{θ̂(x)} [sup_{θ∈Θ} R(θ, θ̂)];
Bayes rule: inf_{θ̂(x)} R_B(θ̂) = inf_{θ̂(x)} ∫_{θ∈Θ} R(θ, θ̂) π(θ) dθ. (9)
In this sense, a decision-theoretic or Bayes rule θ̂(x) is considered optimal when it minimizes the relevant risk, no matter what the true state of Nature θ* happens to be. The last clause, "irrespective of θ*," constitutes a crucial caveat that is often ignored in discussions of these approaches. When viewed as a game against Nature, the decision-maker selects action a from A irrespective of what value θ* Nature has chosen. That is, θ* plays no role in selecting the optimal rules, since the latter have nothing to do with the true value θ* of θ. To avoid any misreading of this line of reasoning, it is important to emphasize that "the true value θ*" is shorthand for saying that "data x_0 constitute a typical realization of the sample X with distribution f(x; θ*)." In frequentist inference, what matters is the capacity of the inference procedure to inform the modeler about θ*; no other value is relevant; see Reid [23].

The Bayesian approach

To shed further light on the affinity between the decision-theoretic framework and the Bayesian approach, let us take a closer look at the latter.


Bayesian inference and its primary objective

A key argument in favor of the Bayesian approach is often its simplicity, in the sense that all forms of inference revolve around a single function, the posterior distribution: π(θ|x_0) ∝ π(θ)·f(x_0|θ), ∀θ ∈ Θ. Hence, an outsider looking at the Bayesian approach might naturally surmise that its primary objective is to yield "a probabilistic ranking" (ordering) of all values of θ in Θ. According to this view, the aim of "the Bayesian method is to derive from it suitable inference statements. … After seeing the data x_0, what do we now know about the parameter θ? The only answer to this question is to present the entire posterior distribution." (p. 6)

The idea is that the modeling begins with an a priori probabilistic ranking based on π(θ), ∀θ ∈ Θ, which is then revised in light of the data x_0 to derive π(θ|x_0), ∀θ ∈ Θ; hence the key role of the quantifier ∀θ ∈ Θ. O'Hagan [4], echoing earlier views in [24, 25], contrasts frequentist (classical) inferences with Bayesian inference, arguing:

"Classical inference theory is very concerned with constructing good inference rules.The primary concern of Bayesian inference, …, is entirely different.The objective is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries.There are two criteria in this process.The first is to identify interesting features of the posterior distribut

n.…
The second criterion is good communication.Summaries should be chosen to convey clearly and succi ctly all the features of interest.… In Bayesian terms, therefore, a good inference is one which contributes effectively to appropriating the information about θ which is conveyed y the posterior distribution."(p.14)

Clearly, O'Hagan's [4] attempt to define what a "good" Bayesian inference is begs the question: what does "effective appropriation of information about θ" mean, beyond the probabilistic ranking? That is, the issue of optimality is inextricably bound up with what the primary objective of Bayesian inference is. If the primary objective of Bayesian inference is not the revised probabilistic ranking, what is it? The answer is that the ranking is only half the story. The other half is concerned with optimality for Bayesian inference, which cannot be framed exclusively in terms of the posterior distribution. The decision-theoretic perspective provides the Bayesian approach with a theory of optimal inference as well as a primary objective: minimize the expected loss for all values of θ in Θ.

In his attempt to defend the claim that the posterior distribution is the inference, O'Hagan [4] argues that such stylized Bayesian inferences are only parasitical on the Bayesian approach and enter the picture through the decision-theoretic perspective:

"… a study of decision theory has two potential benefits.First, it provides a link to c assical inference.It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation.Second, it helps identify suitable summaries to give Bayes an answers to stylized inference questions which classical theory addresses."(p.14)

Both of the above-mentioned potential benefits to the Bayesian approach are questionable, for two reasons. First, the link between decision-theoretic and classical (frequentist) inference is more apparent than real because it is fraught with misleading definitions and unclarities pertaining to the reasoning and objectives of the latter. As argued in the sequel, the quantifier "∀θ ∈ Θ" used to define "optimal" decision-theoretic or Bayes rules is at odds with, and misrepresents, frequentist inference. Second, the claim concerning Bayesian answers to frequentist questions of interest is misplaced because the former provides no real answers to the frequentist primary question of interest, which pertains to learning about θ*. An optimal Bayes rule offers very little, if anything, relevant for learning about the value θ* that gave rise to x_0. Let us unpack this answer in some more detail.


Optimality for Bayesian inference

What does minimizing the Bayes risk amount to? Substituting the risk function in Eq. (6) into the Bayes risk in Eq. (8), one can show that:
R_B(θ̂) = ∫_{θ∈Θ} [∫_{x∈R^n_X} L(θ, θ̂(x)) f(x; θ) dx] π(θ) dθ = ∫_{x∈R^n_X} [∫_{θ∈Θ} L(θ, θ̂(x)) f(x|θ) π(θ) dθ] dx = ∫_{x∈R^n_X} {∫_{θ∈Θ} L(θ, θ̂(x)) π(θ|x) dθ} m(x) dx, (10)
where m(x) = ∫_{θ∈Θ} f(x; θ) dθ; see Ref. [18]. The second and third equalities presume that one can reverse the order of integration (a technical issue) and treat f(x; θ) as the joint distribution of X and θ, so that the following equalities hold:
f(x; θ) = f(x|θ) π(θ) = π(θ|x) m(x). (11)
In this case, these equalities are questionable due to the blurring of the distinction between x, a generic value in R^n_X, and the particular value x_0; see Ref. [26].


In light of Eq. (10), a Bayesian estimate is "optimal" relative to a particular loss function L(θ̂(X), θ) when it minimizes R_B(θ̂), or equivalently ∫_{θ∈Θ} L(θ, θ̂(x)) π(θ|x) dθ. This makes it clear that what constitutes an "optimal" Bayesian estimate is primarily determined by L(θ̂(X), θ) [27]:
i. When L_2(θ̂, θ) = (θ̂ − θ)², the Bayes estimate θ̂ is the mean of π(θ|x_0).

ii. When L_1(θ̂, θ) = |θ̂ − θ|, the Bayes estimate θ̂ is the median of π(θ|x_0).

iii. When L_{0−1}(θ̂, θ) = δ(θ̂, θ) = {0 for |θ̂ − θ| < ε; 1 for |θ̂ − θ| ≥ ε}, for ε > 0, the Bayes estimate θ̂ is the mode of π(θ|x_0).
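A quick numerical check of (i)-(iii) is given below (a sketch only: the simple Normal likelihood, the skewed Gamma prior, and the simulated data are assumptions, chosen so that the three posterior summaries differ). It computes the posterior on a grid and minimizes the posterior expected loss directly.

```python
import numpy as np
from scipy import stats

# Simple Normal model X_k ~ N(theta, 1) with an assumed skewed Gamma prior for theta > 0
rng = np.random.default_rng(seed=3)
x0 = rng.normal(1.0, 1.0, size=5)                    # assumed observed data

theta = np.linspace(1e-6, 6, 2001)
dtheta = theta[1] - theta[0]
prior = stats.gamma.pdf(theta, a=2.0, scale=1.0)
like = np.prod(stats.norm.pdf(x0[:, None], loc=theta, scale=1.0), axis=0)
post = prior * like
post /= (post * dtheta).sum()                        # posterior pi(theta | x0) on the grid

def bayes_estimate(loss):
    # Minimize the posterior expected loss over the grid of candidate decisions d
    exp_loss = [(loss(theta, d) * post * dtheta).sum() for d in theta]
    return theta[int(np.argmin(exp_loss))]

mean_est   = bayes_estimate(lambda t, d: (t - d) ** 2)       # L_2  -> posterior mean
median_est = bayes_estimate(lambda t, d: np.abs(t - d))      # L_1  -> posterior median
mode_est   = theta[int(np.argmax(post))]                     # 0-1 loss (eps -> 0) -> posterior mode
print(mean_est, median_est, mode_est)
```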
The most widely used loss function is the square:
L_2(θ̂(X), θ) = (θ̂(X) − θ)², ∀θ ∈ Θ, (12)
whose risk function is the decision-theoretic Mean Square Error:
MSE_1(θ̂(X); θ) = E(θ̂(X) − θ)², ∀θ ∈ Θ. (13)
This definition of the MSE, denoted by MSE_1, is different from the frequentist MSE, which is defined by:
MSE(θ̂_n(X); θ*) = E(θ̂_n(X) − θ*)². (14)
The key difference is that Eq. (14) is defined at the point θ = θ*, as opposed to ∀θ ∈ Θ.

Unfortunately, statistics textbooks adopt one of the two definitions of the MSE (either at θ = θ* or ∀θ ∈ Θ) and ignore, or seem unaware of, the other. At first sight, this difference might appear pedantic, but it turns out to have very serious implications for the relevant theory of optimality for frequentist vs. Bayesian inference procedures. Indeed, reliance on ∀θ ∈ Θ completely undermines the relevance of admissibility as a minimal property for estimators in frequentist inference.

An estimator θ̂(X) is said to be inadmissible with respect to a loss function L(θ, θ̂) if there exists another estimator θ̃(X) such that:
R(θ, θ̃) ≤ R(θ, θ̂), ∀θ ∈ Θ, (15)
and the strict inequality (<) holds for at least one value of θ. Otherwise, θ̂(X) is said to be admissible with respect to the loss function L(θ, θ̂).

The objective of minimizing losses weighted by π(θ|x_0) for all values of θ in Θ is in direct contrast to the frequentist primary objective, which is to learn from data about the true value θ* underlying the generation of x_0. Hence, the question that naturally arises is: what does an optimal Bayes rule, stemming from Eq. (17), convey about the underlying data-generating mechanism in Eq. (1)? It is not obvious why the highest ranked value θ̂(x_0) (the mode), or some other feature of the posterior distribution, has any value in pinpointing θ*, knowing that θ̂(x_0) is selected irrespective of θ*, the true state of Nature.


The duality between loss functions and priors

The derivation in Eq. (10) brings out the built-in affinity between the decision-theoretic framing of inference and the Bayesian approach. As shown above, minimizing the Bayes risk:
R_B(θ̂) = ∫_{θ∈Θ} R(θ̂, θ) π(θ) dθ, (16)
is equivalent to minimizing the integral:
∫_{θ∈Θ} L(θ̂(X), θ) π(θ|x) dθ. (17)
This result brings out two important features of optimal Bayesian inference. The first is the minor role played by the quantifier ∀x ∈ R^n_X in both the Bayesian and decision-theoretic optimality theory of inference.

Second, it indicates that L(θ, θ̂) and π(θ) are perfect substitutes with respect to any weighting, in the sense that modifying the loss function or the prior yields the same result:

"… the problem of estimating θ with a modified (weighted) loss function is identical to the problem with a simple loss but with modified hyperparameters of the prior distribution while … does not change." ([28], p. 522)

This implies that in practice a Bayesian could derive a particular Bayes rule by attaching the relevant weighting to either the loss function or the prior distribution, depending on which derivation is easier; see Refs. [18, 28].


Revisiting the complete class theorem

The issue of contention is the built-in tension between the frequentist and Bayesian approaches to optimality, which in turn undermines several important results, including the complete class theorem, first proved in Ref. [20]:

"Wald showed that under fairly general conditions the class of Bayes decision functions forms an essentially complete class; in other words, for any decision function that is not Bayesian, there exists one that is Bayes and is at least as good no matter what the true state of Nature may be." ([19], p. 341)

As argued in the sequel, it should come as no surprise to learn that Bayes rules dominate all other rules when admissibility is given center stage. The key result is that a Bayes rule θ̂_B(x) with respect to a prior distribution π(θ) is:

i. Admissible, under certain regularity conditions, including when θ̂_B(x) is unique up to equivalence relative to the same risk function R(θ, θ̂_B).


ii. Minimax, when R(θ, θ̂_B) = c < ∞.

iii. An admissible estimator θ̂(x), relative to a risk function R(θ, θ̂), is either a Bayes rule θ̂_B(x) or the limit of a sequence of Bayes rules; see Refs. [2, 17, 28].

Ignoring the contrasting objectives, these results have been interpreted as evidence for the superiority of the Bayesian perspective, and have led to the intimation that an effective way to generate optimal frequentist procedures is to find the Bayes solution using a reasonable prior and then examine its frequentist properties to see whether it is satisfactory from the latter viewpoint; see Refs. [29, 30].

As argued next, even if one were to agree that Bayes rules and admissible estimators largely coincide, the importance of such a result hinges on the relevance of admissibility as a key property for frequentist estimators.


Loss functions and admissibility revisited

The claim to be discussed in this section is that the notions of a "loss function" and "admissibility" are incompatible with the optimal theory of frequentist estimation as framed by Fisher; see Ref. [31].


Admissibility as a minimal property

The following example brings out the inappropriateness of admissibility as a minimal property for optimal frequentist estimators.

Example. In the context of the simple Normal model:
X_k ~ NIID(θ, 1), k = 1, 2, ..., n, for n > 2, (18)
consider the decision-theoretic notion of MSE_1 in Eq. (13) to compare two estimators of θ:

i. The maximum likelihood estimator (MLE): X̄_n = (1/n) Σ_{k=1}^n X_k;

ii. The "crystal ball" estimator: θ̂_cb = 7405926, ∀x ∈ R^n_X.
When compared on admissibility grounds, both estimators are admissible and thus equally acceptable. Common sense, however, suggests that a criterion of optimality that cannot distinguish between X̄_n [a strongly consistent, unbiased, fully efficient, and sufficient estimator] and θ̂_cb, an arbitrarily chosen real number that ignores the data altogether, is not much of a minimal property.

A moment's reflection suggests that the inappropriateness of admissibility stems from its reliance on the quantifier "∀θ ∈ Θ." The admissibility of θ̂_cb arises from the fact that for certain values of θ close enough to θ_cb:
MSE_1(X̄_n; θ) = 1/n > MSE_1(θ̂_cb; θ) = (θ_cb − θ)² ≤ λ²/n, for θ ∈ [θ_cb − λ/√n, θ_cb + λ/√n], 0 < λ < 1. (19)

Given that the primary objective of a frequentist estimator is to pinpoint θ*, the result in Eq. (19) seems totally irrelevant as a gauge of its capacity to achieve that! This example indicates that admissibility is totally ineffective as a minimal property because it does not filter out θ̂_cb, the worst possible estimator! Instead, it excludes potentially good estimators like the sample median; see Ref. [32]. This highlights the "extreme relativism" of admissibility to the particular loss function, L_2(θ̂(X), θ) in this case. For the absolute loss function L_1(θ̂(X), θ) = |θ̂(X) − θ|, however, the sample median would be admissible. Despite his embrace of the decision-theoretic framing, Lehmann [33] warned statisticians about the perils of arbitrary loss functions:

"It is argued that the choice of a loss function, while less crucia than that of the model, exerts an important influence on the nature of the solution of a statistical decision problem, and that an arbitrary choice such as squared error may be baldly misleading as to the relative desirability of the competing procedures."(p.425)

A strong case can be made that the key minimal property (necessary but not sufficient) for frequentist estimation is consistency, an extension of the Law of Large Numbers (LLN) to estimators more generally. For instance, consistency would have eliminated θ̂_cb from consideration because it is inconsistent. This makes intuitive sense because if an estimator θ̂(X) cannot pinpoint θ* with infinite data information, it should be considered irrelevant for learning about θ*. Indeed, there is nothing in the notion of admissibility that advances learning from data about θ*.
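A short numerical sketch drives the point home (the sample size n = 100 and the true value θ* = 3 used in the simulation are assumptions for illustration): the "crystal ball" estimator beats X̄_n on MSE_1 only inside a window of width of order 1/√n around θ_cb, cf. Eq. (19), and, unlike X̄_n, it never improves as the sample grows.

```python
import numpy as np

theta_cb = 7405926.0            # the arbitrary "crystal ball" constant from the example
n = 100

def mse_mean(theta):
    # exact MSE_1 of the sample mean Xbar_n under X_k ~ N(theta, 1): variance 1/n, zero bias
    return 1.0 / n

def mse_cb(theta):
    # exact MSE_1 of the crystal-ball estimator: zero variance, squared bias (theta_cb - theta)^2
    return (theta_cb - theta) ** 2

# The crystal ball "wins" only for theta inside theta_cb +/- 1/sqrt(n), cf. Eq. (19)
for theta in [theta_cb, theta_cb + 0.05, theta_cb + 1.0, 0.0]:
    print(theta, mse_mean(theta), mse_cb(theta), mse_cb(theta) < mse_mean(theta))

# Consistency: Xbar_n zeroes in on the (simulated) true value theta*; the crystal ball never does
rng = np.random.default_rng(seed=5)
theta_star = 3.0                # an assumed true value, known only to the simulation
for n_big in [10, 1_000, 100_000]:
    xbar = rng.normal(theta_star, 1.0, size=n_big).mean()
    print(n_big, abs(xbar - theta_star), abs(theta_cb - theta_star))
```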

In addition to consistency, the pertinent measure of finite-sample precision for frequentist estimators is full efficiency, which is defined relative to the assumed statistical model in Eq. (1), rather than relative to a particular loss function.


Stein's paradox and admissibility

The quintessential example that has bolstered the appeal of the Bayesian claims concerning admissibility is the James-Stein estimator [34], which gave rise to an extensive literature on shrinkage estimators; see Ref. [35].

Let X := (X_1, X_2, ..., X_m) be an independent sample from the Normal distribution:
X_k ~ NI(θ_k, σ²), k = 1, 2, ..., m, (20)
where σ² is known. Using the notation θ := (θ_1, θ_2, ..., θ_m) and I_m := diag(1, 1, ..., 1), this can be denoted by X ~ N(θ, σ²I_m). The problem is to find an optimal estimator θ̂(X) of θ with respect to the square "overall" loss function:
L_2(θ, θ̂(X)) = ‖θ̂(X) − θ‖² = Σ_{k=1}^m (θ̂_k(X) − θ_k)². (21)
Stein stunned the statistical world by showing that for m > 2 the least-squares (LS) estimator θ̂_LS(X) := X is inadmissible. Indeed, James and Stein [37] were able to come up with a nonlinear estimator:
θ̂_JS(X) = [1 − (m − 2)σ²/‖X‖²] X, (22)
that became known as the James-Stein estimator, by showing that:
MSE_1(θ̂_JS(X); θ) < MSE_1(θ̂_LS(X); θ), ∀θ ∈ R^m. (23)
It turns out that θ̂_JS(X) is itself inadmissible for m > 2, being dominated by the modified (positive-part) James-Stein estimator:
θ̂_JS+(X) = [1 − (m − 2)σ²/‖X‖²]_+ X, (24)
where (z)_+ = max(0, z); see Ref. [17].

The traditional interpretation of this result is that for the Normal, independent model in Eq. (20), the James-Stein estimator in Eq. (22) of θ := (θ_1, θ_2, ..., θ_m), for m > 2, reduces the overall MSE_1 in Eq. (21). This result seems to imply that one will "do better" (in terms of the overall MSE_1) using a combined nonlinear (shrinkage) estimator, instead of estimating these means separately. What is surprising about this result is that there is no statistical reason (due to independence) to connect the inferences pertaining to the different individual means, and yet the obvious estimator (LS) is inadmissible.
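The risk dominance in Eq. (23), and the trade-off it involves, can be checked by simulation. The sketch below (m, σ², and the particular θ vector with one large component are assumptions chosen for illustration) estimates the overall MSE_1 of the LS and James-Stein estimators as well as the componentwise MSE of a single coordinate, showing that the overall gain can coexist with worse accuracy for an individual θ_k.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
m, sigma2, reps = 10, 1.0, 200_000
theta = np.zeros(m)
theta[0] = 4.0                      # assumed true mean vector: one component far from zero

# One observation per mean, as in Eq. (20): X ~ N(theta, sigma2 * I_m)
X = rng.normal(theta, np.sqrt(sigma2), size=(reps, m))

theta_ls = X                                                         # least-squares estimator
shrink = 1.0 - (m - 2) * sigma2 / np.sum(X**2, axis=1, keepdims=True)
theta_js = shrink * X                                                # James-Stein estimator, Eq. (22)

def overall_mse(est):
    return np.mean(np.sum((est - theta) ** 2, axis=1))               # overall MSE_1, cf. Eq. (21)

def component_mse(est, k=0):
    return np.mean((est[:, k] - theta[k]) ** 2)                      # MSE for a single theta_k

print("overall:  LS =", overall_mse(theta_ls), " JS =", overall_mse(theta_js))
print("theta_1:  LS =", component_mse(theta_ls), " JS =", component_mse(theta_js))
```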

As argued next, this result calls into question the appropriateness of the notion of admissibility with respect to a particular loss function, and not the judiciousness of frequentist estimation.


Frequentist inference and learning from data

The objectives and underlying reasoning of frequentist inference are inadequately discussed in the statistics literature. As a result, some of its key differences with Bayesian inference remain beclouded.


Frequentist approach: primary objective and reasoning

All forms of parametric frequentist inference begin with a prespecified statistical model M_θ(x) = {f(x; θ), θ ∈ Θ}, x ∈ R^n_X. This model is chosen from the set of all possible models that could have given rise to data x_0 := (x_1, ..., x_n) by selecting the probabilistic structure of the underlying stochastic process {X_t, t ∈ N := (1, 2, ..., n, ...)} in such a way as to render the observed data x_0 a "typical" realization thereof. In light of the fact that each value of θ ∈ Θ represents a different element of the family of models represented by M_θ(x), the primary objective of frequentist inference is to learn from data about the "true" model:
M*(x) = {f(x; θ*)}, x ∈ R^n_X, (25)
where θ* denotes the true value of θ in Θ. The "typicality" is testable vis-à-vis the data x_0 using misspecification testing; see Ref. [38].

The frequentist approach relies on two modes of reasoning for inference purposes:
Factual (estimation, prediction): f(x; θ*), ∀x ∈ R^n_X;
Hypothetical (hypothesis testing): f(x; θ_0), f(x; θ_1), ∀x ∈ R^n_X, (26)
where θ* denotes the true value of θ in Θ, and θ_i, i = 0, 1, denote hypothesized values of θ associated with the hypotheses H_0: θ_0 ∈ Θ_0 vs. H_1: θ_1 ∈ Θ_1, where Θ_0 and Θ_1 constitute a partition of Θ.

A frequentist estimator θ̂ aims to pinpoint θ*, and its optimality is evaluated by how effectively it achieves that. Similarly, a test statistic usually compares a good estimator θ̂ of θ with a prespecified value θ_0, but behind θ̂ lies the value θ* assumed to have generated data x_0. Hence, hypothetical reasoning is used in testing to learn about θ*, and has nothing to do with all possible values of θ in Θ.

This contradicts misleading claims by Bayesian textbooks ([3], p. 61):

"… estimators and, if possible, to select the best estimator, the reasoning being that estimators are evaluated on their long-run performance for all possible values of the parameter θ."

Contrary to this claim, the only relevant value of θ in evaluating the "optimality" of θ̂ is θ*. Such misleading claims stem from an apparent confusion between the existential and universal quantifiers in framing certain inferential assertions.

The existence of θ* can be formally defined using the existential quantifier:
∃θ* ∈ Θ: there exists a value θ* in Θ such that data x_0 constitute a typical realization of f(x; θ*). (27)
This introduces a potential conflict between the existential and the universal quantifier "∀θ ∈ Θ," because neither the decision-theoretic nor the Bayesian approach explicitly invokes θ*. Decision-theoretic and Bayesian rules are considered optimal when they minimize the expected loss ∀θ ∈ Θ, no matter what θ* happens to be. Any attempt to explain away the crucial differences between the two quantifiers can be easily scotched using elementary logic. The two quantifiers could not be more different since, using the logical connective for negation (¬), the equivalence between the two involves double negations:
(i) ∃θ* ∈ Θ ⇔ ¬(∀θ* ∉ Θ); (ii) ∀θ ∈ Θ ⇔ ¬(∃θ ∉ Θ). (28)
Similarly, invoking intuition to justify the quantifier ∀θ ∈ Θ as innocuous and natural, on the grounds that one should care about the behavior of an estimator θ̂ for all possible values of θ, is highly misleading. The behavior of θ̂ for all θ ∈ Θ, although relevant, is not what determines how effective a frequentist estimator is at pinpointing θ*; what matters is its sampling behavior around θ*. Assessing its effectiveness calls for evaluating (deductively) the sampling distribution of θ̂ under the factual θ = θ*, or under hypothetical values θ_0 and θ_1, and not for all possible values of θ in Θ. Let us unpack the details of this claim.


Frequentist estimation

The underlying reasoning for frequentist estimation is factual, in the sense that the optimality of an estimator θ̂_n(X) is appraised in terms of its generic capacity to zero-in on (pinpoint) the true value θ*, whatever the sample realization x_0 happens to be. Optimal properties like consistency, unbiasedness, full efficiency, and sufficiency calibrate this generic capacity using its sampling distribution evaluated under θ = θ*, i.e., in terms of f(θ̂_n(x); θ*), for x ∈ R^n_X. For instance, strong consistency asserts that as n → ∞, θ̂_n(X) will zero-in on θ* almost surely:
P(lim_{n→∞} θ̂_n(X) = θ*) = 1. (29)
Similarly, unbiasedness asserts that the mean of θ̂_n(X) is the true value θ*:
E(θ̂_n(X)) = θ*. (30)
In this sense, both of these optimal properties are defined at the point θ = θ*. This is achieved by using factual reasoning, i.e., evaluating the sampling distribution of θ̂_n(X) under the true state of Nature (θ = θ*), without having to know θ*. This is in contrast to using loss functions, such as Eq. (2), which are nonoperational without knowing θ*.

Example. In the case of the simple Normal model in Eq. (18), the point estimator X̄_n is consistent, unbiased, fully efficient, and sufficient, with sampling distribution:
X̄_n ~ N(θ, 1/n). (31)
What is not usually appreciated sufficiently is that the evaluation of that distribution is factual, i.e., under θ = θ*, and should formally be denoted by:
X̄_n ~[θ=θ*] N(θ*, 1/n). (32)
When X̄_n is standardized, it yields the pivotal function:
d(X; θ) := √n(X̄_n − θ*) ~[θ=θ*] N(0, 1), (33)
whose distribution holds only for the true value θ*, and no other value. This provides the basis for constructing a (1 − α) confidence interval (CI):
P(X̄_n − c_{α/2}(1/√n) ≤ θ ≤ X̄_n + c_{α/2}(1/√n); θ = θ*) = 1 − α, (34)
which asserts that the random interval [X̄_n − c_{α/2}(1/√n), X̄_n + c_{α/2}(1/√n)] will cover (overlay) the true mean θ*, whatever that happens to be, with probability (1 − α); equivalently, the error of coverage is α. Hence, the frequentist evaluation of the coverage error probability depends only on the sampling distribution of X̄_n and is attached to the random interval itself, not to particular values θ ≠ θ*, without requiring one to know θ*.
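The factual nature of the coverage claim in Eq. (34) can be illustrated by simulation (the values of θ*, n, and α are assumptions, with θ* known only to the simulation): the coverage probability is a property of the random interval as a procedure, and the modeler never needs to know θ* to construct it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
theta_star, n, alpha, reps = 1.3, 50, 0.05, 100_000    # theta* is set only inside the simulation
c = stats.norm.ppf(1 - alpha / 2)                      # c_{alpha/2}

X = rng.normal(theta_star, 1.0, size=(reps, n))
xbar = X.mean(axis=1)
lower = xbar - c / np.sqrt(n)
upper = xbar + c / np.sqrt(n)

coverage = np.mean((lower <= theta_star) & (theta_star <= upper))
print(coverage)      # close to 1 - alpha = 0.95: an error probability attached to the procedure
```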

The evaluation at θ = θ* calls into question the decision-theoretic definition of unbiasedness:
E_1(θ̂_n(X)) = θ, ∀θ ∈ Θ, (35)
in the context of frequentist estimation, since this assertion makes sense only when defined at θ = θ*. Similarly, the appropriate frequentist definition of the MSE for an estimator, initially proposed by Fisher [39], is defined at the point θ = θ*:
MSE(θ̂_n(X); θ*) = E(θ̂_n(X) − θ*)², for θ* in Θ. (36)
Indeed, the well-known decomposition:
MSE(θ̂_n(X); θ*) = Var(θ̂_n(X)) + [E(θ̂_n(X)) − θ*]², for θ* in Θ, (37)
is meaningful only when defined at the point θ = θ* (the true mean), since by definition:
Var(θ̂_n(X)) = E[θ̂_n(X) − θ_m]², θ_m = E(θ̂_n(X)); Bias(θ̂_n(X); θ*) = E(θ̂_n(X)) − θ*, (38)
and thus the variance and the bias involve only two values of θ in Θ, θ_m and θ*; when θ_m = θ* the estimator is unbiased. This implies that the affinity between MSE_1 defined in Eq. (13) and the variance of an estimator is more apparent than real, because the latter makes frequentist sense only when θ_m = E(θ̂_n(X)) is a single point.


James-Stein estimator from a frequentist perspective

For a proper frequentist evaluation of the above James-Stein result, it is important to bring out the conflict between the overall MSE based on Eq. (21) and the factual reasoning underlying frequentist estimation. From the latter perspective, the James-Stein estimator raises several issues of concern.

First, both the least-squares θ̂_LS(X) and the James-Stein θ̂_JS(X) estimators are inconsistent estimators of θ, since the underlying model suffers from the incidental parameter problem: there is essentially one observation (X_k) for each unknown parameter (θ_k), and as m → ∞ the number of unknown parameters increases at the same rate. To bring out the futility of comparing these two estimators more clearly, consider the following simpler example.

Example. Let X := (X_1, X_2, ..., X_n) be a sample from the simple Normal model in Eq. (18). Comparing the two estimators θ̂_1 = X_n and θ̂_2 = ½(X_1 + X_n), and inferring that θ̂_2 is relatively more efficient than θ̂_1 relative to a square loss function, i.e.,
MSE(θ̂_2(X); θ) = 1/2 < MSE(θ̂_1(X); θ) = 1, ∀θ ∈ R, (39)
is totally uninteresting because both estimators are inconsistent! Second, to be able to discuss the role of admissibility in the Stein [37] result, we need to consider a consistent James-Stein estimator, by extending the original data to panel (longitudinal) data, where the sample is X_t := (X_{1t}, X_{2t}, ..., X_{mt}), t = 1, 2, ..., n. In this case, the consistent least-squares and James-Stein estimators are:
θ̂_LS(X) = (X̄_1, X̄_2, ..., X̄_m), where X̄_k = (1/n) Σ_{t=1}^n X_{kt}, k = 1, 2, ..., m;
θ̂_JS+(X) = [1 − (m − 2)σ²/‖X̄‖²]_+ X̄, where X̄ := (X̄_1, X̄_2, ..., X̄_m). (40)
This enables us to evaluate the notion of "relatively better" more objectively.

Admissibility relative to the overall loss function in Eq. (21) introduces a trade-off between the accuracy of the estimators of the individual parameters θ := (θ_1, θ_2, ..., θ_m) and the "overall" expected loss. The question is: in what sense does the overall MSE among a group of mean estimates provide a better measure of "error" in learning about the true values θ* := (θ*_1, θ*_2, ..., θ*_m)? The short answer is: it does not. Indeed, the overall MSE will be irrelevant when the primary objective of estimation is to learn from data about θ*. This is because the particular loss function penalizes the estimator's capacity to pinpoint θ* by trading an increase in bias for a decrease in the overall MSE in Eq. (21), when the latter is misleadingly evaluated over all θ in Θ := R^m. That is, the James-Stein estimator flouts the primary objective of pinpointing θ* in favor of reducing the overall MSE ∀θ ∈ Θ.

In summary, the above discussion suggests that there is nothing paradoxical about Stein's [37] original result. What is problematic is not the least-squares estimator, but the choice of "better" in terms of admissibility relative to an overall MSE in evaluating the accuracy of the estimators of θ.


Frequentist hypothesis testing

Another frequentist inference procedure one can employ to learn from data about θ* is hypothesis testing, where the question posed is whether θ* is close enough to some prespecified value θ_0. In contrast to estimation, the reasoning underlying frequentist testing is hypothetical in nature.


Legitimate frequentist error probabilities

For testing the hypotheses:

H_0: θ ≤ θ_0 vs. H_1: θ > θ_0,
where θ_0 is a prespecified value, one utilizes the same sampling distribution X̄_n ~ N(θ, 1/n), but transforms the pivot d(X; θ) := √n(X̄_n − θ*) by replacing θ* with the prespecified value θ_0, yielding the test statistic d(X) := √n(X̄_n − θ_0). However, instead of evaluating it under the factual θ = θ*, it is now evaluated under various hypothetical scenarios associated with H_0 and H_1, to yield two types of (hypothetical) sampling distributions:
(I) d(X) := √n(X̄_n − θ_0) ~[θ=θ_0] N(0, 1);
(II) d(X) := √n(X̄_n − θ_0) ~[θ=θ_1] N(δ_1, 1), δ_1 = √n(θ_1 − θ_0), for θ_1 > θ_0.
In both cases, (I) and (II), the underlying reasoning is hypothetical, in the sense that the factual value in Eq. (33) is replaced by hypothesized values of θ, and the test statistic d(X) provides a standardized distance between the hypothesized values (θ_0 or θ_1) and θ*, the true θ assumed to underlie the generation of d(x_0). Using the sampling distribution in (I), one can define the following legitimate error probabilities:
significance level: P(d(X) > c_α; H_0) = α; p-value: P(d(X) > d(x_0); H_0) = p(x_0). (41)
Using the sampling distribution in (II), one can define:
type II error probability: P(d(X) ≤ c_α; θ = θ_1) = β(θ_1), for θ_1 > θ_0;
power: P(d(X) > c_α; θ = θ_1) = ρ(θ_1), for θ_1 > θ_0. (42)
It can be shown that the test T_α, defined by the test statistic d(X) and the rejection region C_1(α) = {x: d(x) > c_α}, constitutes a uniformly most powerful (UMP) test at significance level α; see Ref. [9]. The type I [II] error probability is associated with test T_α erroneously rejecting [accepting] H_0. The type I and II error probabilities evaluate the generic capacity [whatever the sample realization x ∈ R^n_X] of a test to reach correct inferences. Contrary to Bayesian claims, these error probabilities have nothing to do with the temporal or the physical dimension of the long-run metaphor associated with repeated samples. The relevant feature of the long-run metaphor is the repeatability (in principle) of the data-generating mechanism represented by M_θ(x); this feature can easily be operationalized using computer simulation; see Ref. [40].

The key difference between the significance level α and the p-value is that the former is a predata and the latter a post-data error probability. Indeed, the p-value can be viewed as the smallest significance level α at which H_0 would have been rejected with data x_0. The legitimacy of post-data error probabilities underlying the hypothetical reasoning can be used to go beyond the N-P accept/reject rules and provide an evidential interpretation pertaining to the discrepancy γ from the null warranted by data x_0; see Ref. [41].
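The error probabilities in Eqs. (41)-(42), and the p-value as a post-data error probability, can be computed directly from the two sampling distributions (I) and (II). The sketch below uses assumed values of n, θ_0, θ_1, and the observed x̄_n, chosen only for illustration.

```python
import numpy as np
from scipy import stats

n, alpha = 25, 0.05
theta0, theta1 = 0.0, 0.5          # null value and an assumed alternative value
xbar_obs = 0.31                    # assumed observed sample mean

c_alpha = stats.norm.ppf(1 - alpha)                 # rejection region: d(x) > c_alpha
d_obs = np.sqrt(n) * (xbar_obs - theta0)            # observed test statistic d(x0)

p_value = 1 - stats.norm.cdf(d_obs)                 # P(d(X) > d(x0); H0), Eq. (41)
delta1 = np.sqrt(n) * (theta1 - theta0)             # noncentrality under theta = theta1
type_II = stats.norm.cdf(c_alpha - delta1)          # P(d(X) <= c_alpha; theta = theta1), Eq. (42)
power = 1 - type_II                                 # P(d(X) > c_alpha; theta = theta1)

print(f"d(x0) = {d_obs:.3f}, p-value = {p_value:.3f}")
print(f"c_alpha = {c_alpha:.3f}, beta(theta1) = {type_II:.3f}, power = {power:.3f}")
# The p-value is the smallest alpha at which H0 would have been rejected with x0.
```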

Despite the fact that frequentist testing uses hypothetical reasoning, its main objective is also to learn from data about the true model M*(x) = {f(x; θ*)}, x ∈ R^n_X. This is because a test statistic like d(X) := √n(X̄_n − θ_0) constitutes nothing more than a scaled distance between θ* [the value behind the generation of x_0] and a hypothesized value θ_0, with θ* being replaced by its "best" estimator X̄_n.


Revisiting loss and risk functions

The above discussion raises serious doubts about the role of loss functions and admissibility in evaluating learning from data x_0 about θ*. To understand why the decision-theoretic framing misrepresents the frequentist approach, one needs to consider the role of loss functions in statistical inference more generally.


Where do loss functions come from?

A closer scrutiny of the decision-theoretic set-up reveals that the loss function needs to invoke "information from sources other than the data," which is usually not readily available. Indeed, such information is available only in very restrictive situations, such as acceptance sampling in quality control. In light of that, a proper understanding of the intended scope of statistical inference calls for distinguishing the special cases where the loss function is part and parcel of the available substantive information from those where no such information is either relevant or available. Tiao and Box ([25], p. 624) reiterated Fisher's [42] distinction:

"Now it is undoubtedly true on the one hand that situations exist where the loss function is at least approximately known (for example, certain problems in business and sampling inspection are of this sort). … On the other hand, a vast number of inferential problems occur, particularly in the analysis of scientific data, where there is no way of knowing in advance to what use the results of research will subsequently be put."

Cox [43] went further and questioned this framing even in cases where the inference might involve a decision:

"The reasons that the detailed techniques [decision-theoretic] seem of fairly limited applicability, even when a fairly clear-cut decision element is involved, may be (i) that, except in such fields as control theory and acceptance sampling, a major contribution of statistical technique is in presenting the evidence in incisive form for discussion, rather than in providing a mechanical presentation for the final decision. This is especially the case when a single major decision is involved. (ii) The central difficulty may be in formulating the elements required for the quantitative analysis, rather than in combining these elements via a decision rule." (p. 45)

Another important aspect of using loss functions in inference is that in practice they seem to be an add-on to the inference itself, since they bring to the problem information other than the data. In particular, the same statistical inference problem can give rise to very different decisions/actions depending on one's loss function. To illustrate that