Biostatistics: Exercise 2
Goodness-of-fit and linear models
The data sets are available for download in a .zip file.

A retrospective study on the effects of smoking gives the following numbers of smokers in four different patient groups:
Group       1    2    3    4
Smokers    83   90  129   70
Patients   86   93  136   82
What are the proportions of smokers in the respective samples?
Use `prop.test` to test the null hypothesis that the four populations from which the patients were drawn have the same true proportion of smokers. The alternative is that this proportion differs in at least one of the populations.
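One possible way to set this up in R is sketched below. The counts are copied from the table above; the variable names `smokers`, `patients`, `props` and `res` are only illustrative choices.

```r
# Counts copied from the table in the exercise
smokers  <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

# Sample proportion of smokers in each of the four groups
props <- smokers / patients
print(round(props, 3))

# Test of equal proportions across the four groups
# (H0: all four true proportions are equal)
res <- prop.test(x = smokers, n = patients)
print(res)
```

With four groups the test statistic is compared against a chi-squared distribution with 3 degrees of freedom; `res$p.value` holds the p-value to interpret.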
Under simple Mendelian inheritance, the distribution of human genotypes for a diallelic marker system should be p^2^ : 2pq : q^2^, where p and q are the allele frequencies (Hardy-Weinberg equilibrium).
Construct a simple chi^2^ goodness-of-fit test for the null hypothesis of Hardy-Weinberg equilibrium.

In a sample of schizophrenic patients, observed genotype counts for the Dopamine 3 receptor polymorphism were:
Genotype   A1A1   A1A2   A2A2
Count        45     35     15
Is there evidence for deviation from Hardy-Weinberg equilibrium in the underlying population?
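A minimal sketch of such a test, assuming the allele frequency p is estimated from the sample itself by allele counting. Because one parameter is estimated, the statistic is referred to a chi-squared distribution with 3 - 1 - 1 = 1 degree of freedom. All variable names are illustrative.

```r
# Observed genotype counts from the exercise
obs <- c(A1A1 = 45, A1A2 = 35, A2A2 = 15)
n   <- sum(obs)

# Estimate the frequency of allele A1 by allele counting
p_hat <- (2 * obs[["A1A1"]] + obs[["A1A2"]]) / (2 * n)
q_hat <- 1 - p_hat

# Expected genotype counts under Hardy-Weinberg equilibrium
expected <- n * c(p_hat^2, 2 * p_hat * q_hat, q_hat^2)

# Pearson chi-squared statistic
x2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: 3 classes - 1 - 1 estimated parameter
df <- length(obs) - 1 - 1
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(c(statistic = x2, df = df, p.value = p_value))
```

Note that calling `chisq.test(obs, p = expected / n)` would report a p-value based on 2 degrees of freedom, since it does not know that p was estimated from the same data; this is why the p-value is computed by hand with `pchisq()` here.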
Data set `cars` gives the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s. Plot the data set. Can a linear model (straight line) be used to describe the relation between the variables?
Graphically analyze the relation between the variables using `lowess()`. Does linear modeling work after taking logarithms?
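As a starting point, the plots and the log-log fit could be produced along the following lines. This is only a sketch: it uses the `cars` data frame shipped with R (columns `speed` and `dist`), and the object name `fit_log` and the colour choices are arbitrary.

```r
data(cars)

# Raw scale: scatter plot with a lowess smoother overlaid
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(lowess(cars$speed, cars$dist), col = "red")

# Log-log scale: straight-line fit compared with a lowess smoother
fit_log <- lm(log(dist) ~ log(speed), data = cars)
plot(log(dist) ~ log(speed), data = cars)
abline(fit_log, col = "blue")
lines(lowess(log(cars$speed), log(cars$dist)), col = "red")

print(summary(fit_log))
```

If the lowess curve stays close to the fitted line on the log scale, that supports a linear model for the transformed variables; residual plots such as `plot(fit_log)` can be used to check this further.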
Data set `GAGUrine` contains data collected by Susan Prosser on the concentration of the chemical GAG in the urine of 314 children aged from zero to seventeen years. Analyze these data, and produce a chart to help a pediatrician assess whether a child’s GAG concentration is “normal”.
The Janka hardness is an important structural property of Australian timbers, which is difficult to measure. It is, however, related to the density of the timber, which is relatively easy to measure. For the data in `janka`, a low-degree polynomial regression of hardness on density is suggested as appropriate. Fit models and check whether there are obvious outliers or heteroscedasticity, and whether these can be remedied by square root or log transformations.
Data set `tetrahymena` contains data about the growth of Tetrahymena cells: the diameter (μm) and concentration (counts/ml) of the cells, and whether glucose was added to the growth medium or not. Find an appropriate model for the diameter of the cells explained by the other variables.