
2.2 Parameter Estimation

The least squares and maximum likelihood paradigms differ in parameter estimation chiefly in how the objective function is formulated. In the least squares paradigm, the objective function is simply the sum of squared residuals. The boundary restriction on the dependent variable can be regarded as a set of linear constraints, so parameter estimation can be specified as a quadratic programming (QP) problem (Vanderbei, 2008):
\begin{align*}
\text{Minimize} \qquad & f\left( \beta \right)=\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{{{\left[ \left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \right]}^{2}}}} \\
\text{Subject to} \qquad & \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le b \\
& -\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le -a,
\end{align*}
where $n$ is the number of spatial units, ${{T}_{i}}$ is the number of temporal units for unit $i$, and the parameter space is $\beta \in {{\Omega }_{\beta }}$. We have not yet specified exactly what ${{\Omega }_{\beta }}$ should be; this specification is critical to solving the above QP problem successfully, and we discuss it in section 3.
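As a numerical illustration, a QP of this form can be solved with a general constrained optimizer. The sketch below uses synthetic demeaned data and illustrative bounds $a$ and $b$; all variable names and values are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic demeaned panel data (hypothetical): rows are (i, t) pairs
X = rng.normal(size=(200, 2))                    # demeaned covariates (x_it - xbar_i)
beta_true = np.array([0.5, -0.3])
y = X @ beta_true + 0.1 * rng.normal(size=200)   # demeaned outcome (y_it - ybar_i)

a, b = -3.0, 3.0                                 # illustrative bounds on fitted values

def ssr(beta):
    """Objective f(beta): sum of squared residuals."""
    return np.sum((y - X @ beta) ** 2)

constraints = [
    {"type": "ineq", "fun": lambda beta: b - X @ beta},  # (x_it - xbar_i) beta <= b
    {"type": "ineq", "fun": lambda beta: X @ beta - a},  # -(x_it - xbar_i) beta <= -a
]
res = minimize(ssr, x0=np.zeros(2), method="SLSQP", constraints=constraints)
beta_hat = res.x
```

When the unconstrained least squares solution already satisfies the bounds, the constrained estimate coincides with it; the constraints bind only when fitted values would otherwise leave $[a, b]$.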

We can modify the objective function slightly by imposing a distributional assumption on the demeaned dependent variable $\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)$ in (2.1):
\begin{equation}
\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)\sim TN\left[ \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta , {{\sigma }^{2}};p_{1},q_{1} \right],
\tag{2.2}
\end{equation}
Here the choice of the lower and upper limits, $p_{1}$ and $q_{1}$, is vital to successful maximum likelihood estimation. Theoretically, $p_{1}$ and $q_{1}$ should be set to their greatest and least possible values according to the available data:
\begin{align*}
& {{p}_{1}}=a-\left( \bar{\bar{y}}+t_{{{\sigma }_{b}}}^{\min }\cdot {{\sigma }_{b}} \right)\\
& {{q}_{1}}=b-\left( \bar{\bar{y}}+t_{{{\sigma }_{b}}}^{\max }\cdot {{\sigma }_{b}} \right).
\end{align*}
where ${{\sigma }_{b}}$ is the between-groups standard deviation estimated under the untruncated normal assumption, and $t_{{{\sigma }_{b}}}^{\min }$ and $t_{{{\sigma }_{b}}}^{\max }$ are the largest standardized deviations of ${{\bar{y}}_{i}}$ in the negative and positive directions, respectively.
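A small numerical sketch, with hypothetical group data and bounds, shows what these limits amount to in practice. Under one natural reading of the definitions, $t_{\sigma_b}^{\min}$ and $t_{\sigma_b}^{\max}$ are the extreme standardized deviations of the group means, in which case $p_1$ and $q_1$ reduce to $a-\min_i \bar{y}_i$ and $b-\max_i \bar{y}_i$. All numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 1.0                                 # hypothetical bounds on y

# Hypothetical panel: 3 districts, 20 periods each, values inside [a, b]
y_groups = [rng.uniform(0.1, 0.8, size=20) + m for m in (0.0, 0.05, 0.1)]

ybar = np.array([g.mean() for g in y_groups])   # time means ybar_i
grand = ybar.mean()                             # grand mean ybar-bar
sigma_b = ybar.std(ddof=1)                      # between-groups deviation
t_min = (ybar.min() - grand) / sigma_b          # largest negative standardized deviation
t_max = (ybar.max() - grand) / sigma_b          # largest positive standardized deviation

p1 = a - (grand + t_min * sigma_b)              # = a - min_i ybar_i
q1 = b - (grand + t_max * sigma_b)              # = b - max_i ybar_i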

Given this distributional assumption, the objective function can be specified as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}+\frac{1}{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \right]}^{2}} \right\}}},
\end{equation*}
where ${{D}_{it}}=\log \left\{ \sqrt{2\pi }\sigma \left[ \Phi \left( \frac{q_{1}-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta }{\sigma } \right)-\Phi \left( \frac{p_{1}-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta }{\sigma } \right) \right] \right\}$.

Note that one additional parameter, $\sigma$, appears in the likelihood function above, so the parameter space $\sigma \in {{\Omega }_{\sigma }}$ also needs to be specified. The inequality constraints and the parameter space $\beta \in {{\Omega }_{\beta }}$ remain the same.

From the perspective of the likelihood paradigm, both objective functions above are problematic, since the panel regression is incorrectly specified in the first place.9 To see why, first assume that the dependent variable ${{y}_{it}}$ is distributed as truncated normal
\begin{equation*}
{{y}_{it}}\sim TN\left( {{\mu }_{i}},{{\sigma }^{2}};a,b \right),
\end{equation*}
where ${\mu_{i}}$ is the district-level location parameter.

The time mean of ${{y}_{it}}$ is
\begin{equation*}
{{E}_{t}}\left( {{y}_{it}} \right)={{\mu }_{i}}-\frac{\sigma \left\{ \exp \left[ -\frac{{{\left( b-{{\mu }_{i}} \right)}^{2}}}{2{{\sigma }^{2}}} \right]-\exp \left[ -\frac{{{\left( a-{{\mu }_{i}} \right)}^{2}}}{2{{\sigma }^{2}}} \right] \right\}}{\sqrt{2\pi }\left[ \Phi \left( \frac{b-{{\mu }_{i}}}{\sigma } \right)-\Phi \left( \frac{a-{{\mu }_{i}}} {\sigma } \right) \right]}.
\end{equation*}
Unless $b-{{\mu }_{i}}={{\mu }_{i}}-a$ or $\left( a,b \right)\to \left( -\infty ,\infty \right)$, the time mean ${{E}_{t}}\left( {{y}_{it}} \right)$ (denoted ${{\bar{y}}_{i}}$) is a biased estimate of ${{\mu }_{i}}$ (Johnson and Kotz, 1970: 81).10 If we intend to purge the between-groups variation, we should subtract ${\mu_{i}}$ from ${{y}_{it}}$. The panel regressions above, however, mistakenly use the time mean ${{E}_{t}}\left( {{y}_{it}} \right)$ to estimate ${\mu_{i}}$, so the differencing operation ${{y}_{it}}-{{E}_{t}}\left( {{y}_{it}} \right)$ fails to generate valid within-groups variation. This is the common problem with the panel regressions in (2.1) and (2.2).
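The bias of the time mean can be checked numerically. The sketch below evaluates the closed-form expression for ${{E}_{t}}\left( {{y}_{it}} \right)$ given above and compares it with `scipy.stats.truncnorm`; the parameter values are illustrative, chosen only so that the truncation is asymmetric.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Illustrative parameter values (not from the paper): asymmetric truncation
mu, sigma, a, b = 0.3, 1.0, -1.0, 2.0

# Closed-form time mean of y_it ~ TN(mu, sigma^2; a, b), as in the text
num = sigma * (np.exp(-(b - mu) ** 2 / (2 * sigma ** 2))
               - np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)))
den = np.sqrt(2 * np.pi) * (norm.cdf((b - mu) / sigma)
                            - norm.cdf((a - mu) / sigma))
mean_formula = mu - num / den

# scipy parameterizes the truncation points in standard-normal units
mean_scipy = truncnorm.mean((a - mu) / sigma, (b - mu) / sigma,
                            loc=mu, scale=sigma)
# mean_formula agrees with mean_scipy, and both differ from mu,
# so the time mean ybar_i is indeed a biased estimate of mu_i here
```

With $b-\mu = 1.7$ and $\mu-a = 1.3$, the truncation removes more mass below $\mu$ than above it, pulling the mean above $\mu$; setting $b-\mu = \mu-a$ makes the two exponential terms cancel and removes the bias, as the text states.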

If we want to specify a model conceptually equivalent to the panel regression, we can use the individual-level dependent variable to estimate the district-level location parameters ${{\mu }_{i}}$ and then perform the demeaning operation to derive the within-groups regression.11 In this scenario, the dependent variable can be specified as
\begin{equation*}
\left( {{y}_{it}}-{\hat{\mu }_{i}} \right)\sim TN\left( {{x}^{*}_{it}}\beta ,{{\sigma }^{2}};p_{2},q_{2} \right),
\end{equation*}
where ${{x}^{*}_{it}}$ represents the covariate matrix that is fixed at the minimum after being demeaned, and
\begin{align*}
{{p}_{2}}&=a-\left( \hat{\mu }+t_{{{\sigma }_{{{{\hat{\mu }}}_{i}}}}}^{\min }\cdot {{\sigma }_{{{{\hat{\mu }}}_{i}}}} \right)\\
{{q}_{2}}&=b-\left( \hat{\mu }+t_{{{\sigma }_{{{{\hat{\mu }}}_{i}}}}}^{\max }\cdot {{\sigma }_{{{{\hat{\mu }}}_{i}}}} \right).
\end{align*}

Applying maximum likelihood estimation, we can derive the objective function as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}+\frac{1}{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\hat{\mu }}}_{i}} \right)-x_{it}^{*}\beta \right]}^{2}} \right\}}},
\end{equation*}
where ${{D}_{it}}=\log \left\{ \sqrt{2\pi }\sigma \left[ \Phi \left( \frac{q_{2}-x_{it}^{*}\beta }{\sigma } \right)-\Phi \left( \frac{p_{2}-x_{it}^{*}\beta }{\sigma } \right) \right] \right\}$. Since the demeaning operation is achieved with the maximum likelihood estimates ${{\hat{\mu }}_{i}}$, no demeaning of the covariate matrix is strictly necessary; however, we apply the same demeaned specification for the sake of comparability. In section 3, we apply the technique of constrained optimization to solve the three optimization problems under their respective parameter constraints.
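A minimal sketch of maximizing a truncated-normal log-likelihood of this form follows, using simulated data, illustrative truncation limits $p_2$ and $q_2$, and an optimizer over $\beta$ and $\log\sigma$ (the log transform keeps $\sigma$ positive). This illustrates the estimation idea only; it is not the paper's implementation, and the data, limits, and starting values are all assumptions.

```python
import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Hypothetical demeaned covariate and truncation limits
p2, q2 = -1.5, 1.5
X = rng.uniform(-0.5, 0.5, size=(300, 1))
beta_true, sigma_true = np.array([0.8]), 0.4

# Simulate (y_it - mu_hat_i) ~ TN(x*_it beta, sigma^2; p2, q2)
loc = X @ beta_true
y = truncnorm.rvs((p2 - loc) / sigma_true, (q2 - loc) / sigma_true,
                  loc=loc, scale=sigma_true, random_state=rng)

def negloglik(theta):
    """Negative truncated-normal log-likelihood in (beta, log sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    sigma = np.exp(log_sigma)
    loc = X @ beta
    # truncnorm.logpdf already includes the normalizing term D_it
    return -truncnorm.logpdf(y, (p2 - loc) / sigma, (q2 - loc) / sigma,
                             loc=loc, scale=sigma).sum()

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
beta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

In a full treatment, the box constraints on the fitted values and the parameter spaces $\Omega_\beta$ and $\Omega_\sigma$ would be imposed on the optimizer as well, as the text does in section 3.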

____________________

Footnotes

9 In contemporary statistical science, the likelihood theory is a crucial paradigm of inference for data analysis (Royall, 1997: xiii). It provides a unifying approach to statistical modeling for both frequentists and Bayesians under the criterion of maximum likelihood (Azzalini, 1996). The rapid development of political methodology in the last two decades has also witnessed the establishment of the likelihood paradigm in the scientific study of politics (King, 1998). As a model of inference, the fundamental assumption of the likelihood theory is the likelihood principle, which states that "all evidence, which is obtained from an experiment, about an unknown quantity $\theta$, is contained in the likelihood function of $\theta$ for the given function." (Berger and Wolpert, 1984: vii) In other words, given that the likelihood function is defined by the probability density (or mass) function, we must make a distributional assumption about the dependent variable to derive a likelihood function. The plausibility of this distributional assumption is therefore vital to the validity of the statistical inference.

10 When $b-{{\mu }_{i}}={{\mu }_{i}}-a$, the normal distribution is evenly truncated at both ends. When $\left( a,b \right)\to \left( -\infty ,\infty \right)$, the variable is not truncated at all. Both situations rarely occur when the dependent variable is distributed as truncated normal.

11 This involves a two-stage procedure. In the first stage, ${\hat{\mu }_{i}}$ is estimated from the individual-level observations ${{y}_{it}}$ without covariates. In the second stage, we take ${\hat{\mu }_{i}}$ as the district-level property and subtract it from ${{y}_{it}}$ to derive the complete within-groups deviation.
