The `birthwt`

dataset in the `MASS`

package
contains 10 measurements of infant-mother pairs, collected in
Springfield, MA in 1986. See `?MASS::birthwt`

for additional
details.

Consider the random variable birth weight which is labelled
`bwt`

in the dataset and is measured in grams. The
distribution of birthweight is uncertain. Neither the distribution
function (CDF) or the density function (PDF) are known.

The data collected in the `birthwt`

dataset can be used to
**estimate** the distribution. Let’s first plot the data.
There are several ways to plot the data, here we consider the eCDF and
the histogram.

## eCDF

Recall that the CDF is a function with takes candidate outcomes as inputs and returns cumulative left hand probabilities as outputs. Let \(F\) be the distribution function or CDF for infant birthweight.

\[ F(x) = P(\text{birth weight}\leq x) \]

If one were interested in calculating the probability of a birthweight less than or equal to 2000 grams, one could plug 2000 into \(F\).

\[ F(2000) = P(\text{birth weight} \leq 2000) \]

If the distribution function \(F\)
is unknown, how might one use the dataset to **estimate**
\(F(2000)\)? One approach is to
calculate the **empirical** probability. That is,

\[ \hat{F}(2000) = \frac{\text{# birth weights less than or equal to 2000}}{\text{# infants in dataset}} \]

The hat notation, \(\hat{\ }\),
indicates an **estimate** of the true, underlying
expression. In this case, \(\hat{F}\)
is an estimate of \(F\), the
distribution function of infant birthweight.

In words, the **empirical** probability is the
proportion of observations in the dataset which qualify as the event of
interest. In R, we can calculate the proportion with the following
command.

`## [1] 0.1005291`

To generate all cumulative left hand probabilities, we can create a
function. The function `Fhat`

is created so that an input a
vector `x`

returns an output vector of the same size.

`## [1] 0.005291005`

`## [1] 0.1005291`

`## [1] 0.5132275`

`## [1] 0.005291005 0.100529101 0.513227513 0.952380952`

We can plot `Fhat`

.

```
xxx <- seq(900,5000,by=1)
plot(xxx,Fhat(xxx,d1$bwt), col = "red", lwd = 2, type = "l", xlab = "Birth weight (grams)", ylab = "P(Birth weight <= x)")
```

Note that `Fhat`

has all the properties of a CDF. It
starts at zero, goes to one, and is non-decreasing. Because this
estimate of the distribution function is so important, it has its own
name and command. `Fhat`

is known as the **empirical
cumulative distribution function** or eCDF. In R, one can
generate the eCDF from a vector using the `ecdf`

command.

```
## Empirical CDF
## Call: ecdf(d1$bwt)
## x[1:131] = 709, 1021, 1135, ..., 4593, 4990
```

*Assessment*

- Create a plot of the eCDF of mother’s age in years.
- Based on the eCDF, what is the probability of observing an age of 35 or higher among mother-infant pairs?

## Smoothing

The eCDF is a powerful tool, but it has its drawbacks. An eCDF is not smooth. In many instances, the true underlying distribution is believed to be smooth. There are a number of ways to estimate the distribution so that the resulting estimate is smooth. For example, one may

- Choose from a smooth family of distributions
- Choose from a mixture of smooth distributions
- Use a smoothing kernel

### Smooth families

There are several well known and widely used distributions which may describe the distribution of birth weights.

Navigate to this Desmos calculator (link)

You’ll see the eCDF of the birth weight data.

#### Gaussian or Normal or Bell distribution

Click the circle in line 5. This will display the distribution function of a Gaussian or Normal random variable. This distribution as two parameters: the mean and standard deviation. Line 3 is a slider that controls the mean. Line 4 has a slider that controls the standard deviation.

*Assessment:*

- Use the sliders to find a Gaussian distribution which describes the
birth weight data reasonably well.

- What are your values of the mean and standard deviation?
- How does the mean parameter change the shape of the distribution? How about the standard deviation parameter?

The mathematical expression for the Gaussian distribution is not simple. If \(\mu\) is the mean and \(\sigma\) is the standard deviation, then

\[ F(x) = \int_{-\infty}^x (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\frac{(t-\mu)^2}{\sigma^2} \right)\, dt \]

Because the CDF of the Gaussian distribution is so important, it is generally denoted with the expression \(\Phi(x, \mu, \sigma^2)\).

In R, the Gaussian distribution function is
`pnorm(x, µ, σ)`

. To plot the Gaussian CDF with \(\mu=\) 3000 and \(\sigma=\) 750 overlaid on the eCDF, use
`pnorm(x, 3000, 750)`

, as in the code below.

*Assessment*:

- Generate a plot of the Gaussian CDF overlaid on the eCDF using the parameters you identified in the previous assessment.
- The term
*low birth weight*refers to infants which weight less than 2500 grams. Based on your Gaussian model of birthweight (with parameters you selected by eye), what is the probability of observing a low birth weight infant? - Create a plot of the eCDF of mother’s age in years. Overlay the
Gaussian CDF with the sample mean (
`mean(d1$age)`

) and sample standard deviation (`sd(d1$age)`

) of age as the parameters.

#### Gamma distribution

Click the circle in line 5 again (to make it disappear). Now click line 10 to reveal the CDF of the gamma distribution. This distribution also has two parameters. (Sometime the parameters are shape and scale and in other instances they are shape and rate. Here, we use shape (\(\alpha\)) and rate (\(\beta\)).) The lines 6 and 7 control the parameters for the gamma distribution.

*Assessment:*

- Use the sliders to find a gamma distribution which describes the
birth weight data reasonably well.

- What are your values of \(\alpha\) and \(\beta\)?

In R, the gamma distribution function is
`pgamma(x, α, β)`

. To plot the gamma CDF with \(\alpha=\) 14.75 and \(\beta=\) 0.005 overlaid on the eCDF, use
`pgamma(x, 14.75, 0.005)`

, as in the code below.

*Assessment*:

- Generate a plot of the gamma CDF overlaid on the eCDF using the parameters you identified in the previous assessment.
- The term
*very low birth weight*refers to infants which weight less than 1500 grams. Based on your gamma model of birthweight (with parameters you selected by eye), what is the probability of observing a very low birth weight infant? - Create a plot of the eCDF of mother’s age in years. Overlay the
gamma CDF with \(\alpha=\)
`mean(d1$age)^2/var(d1$age)`

and \(\beta=\)`mean(d1$age)/var(d1$age)`

as the parameters.

### Mixtures of smooth distributions

Another way to estimate a smooth distribution is to take a weighted average of two or more smooth distributions. For example, one could take the average of two smooth Gaussian distribution functions. There would be 5 parameters: \(\mu\) and \(\sigma\) for the first distribution, \(\mu\) and \(\sigma\) for the second distribution, and \(\alpha\) a number between 0 and 1 which indicates the relative weight of the first distribution to the second.

Mathematically, the mixture of 2 normals is expressed as:

\[ \hat{F}(x) = \alpha \Phi(x, \mu_1, \sigma_1) + (1-\alpha) \Phi(x, \mu_2, \sigma_2) \]

If \(\alpha=1\) then the CDF would be the first distribution, \(\Phi(x, \mu_1, \sigma_1)\). If \(\alpha = 0\), then the CDF would be the second distribution, \(\Phi(x, \mu_2, \sigma_2)\). Mixture distributions admit a larger number of possible functions than the individual components alone.

Click the circle in line 10 again (to make it disappear). Now click line 16 to reveal the CDF of the mixture of 2 Gaussian distributions. The lines 11 to 15 control the parameters for the mixture distribution.

*Assessment:*

- Use the sliders to find a mixture distribution which describes the
birth weight data reasonably well.

- What are your values of the parameters?
- Of all the distributions in this module, which estimate of the CDF best describes birth weight in your opinion?
- How did you judge which estimate was best? What was your criteria?
- Create a plot of the eCDF of mother’s age in years. Overlay the CDF of the mixture of 2 Gaussians with \(\mu_1=\) 19.8, \(\sigma_1=\) 2.8, \(\mu_2=\) 26.4, \(\sigma_2=\) 5, and \(\alpha=\) 0.48.
- Among the Gaussian, gamma, and mixture distributions, which estimate gave the best description of the distribution of maternal age, in your opinion? Explain why.

### Kernel methods for estimation of smooth distributions

This section will be covered in a later module.