Biostatistics: Exercise 2

Goodness-of-fit and linear models

The data sets are available for download in a .zip file.

  1. A retrospective study on the effects of smoking gives the following numbers of smokers in four different patient groups:

    Group      1    2    3     4
    Smokers    83   90   129   70
    Patients   86   93   136   82
    
    • What are the proportions of smokers in the respective samples?
    • Use prop.test to test the null hypothesis that the four populations from which the patients were drawn have the same true proportion of smokers. The alternative is that the proportion differs in at least one of the populations.
  2. Under simple Mendelian inheritance, the distribution of human genotypes for a diallelic marker system should be p^2^ : 2pq : q^2^, where p and q are the allele frequencies (Hardy-Weinberg equilibrium).
    • Construct a simple chi^2^ goodness-of-fit test for the null hypothesis of Hardy-Weinberg equilibrium.
    • In a sample of schizophrenic patients, the observed genotype counts for the dopamine D3 receptor polymorphism were

      Genotype   A1A1   A1A2   A2A2
      Count      45     35     15
      

      Is there evidence for deviation from Hardy-Weinberg equilibrium in the underlying population?

  3. Data set cars gives the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
    • Plot the data set. Can a linear model (straight line) be used for describing the relation between the variables?
    • Graphically analyze the relation between the variables using lowess().
    • Does linear modeling work after taking logarithms?
  4. Data set GAGUrine contains data collected by Susan Prosser on the concentration of a chemical GAG in the urine of 314 children aged from zero to seventeen years. Analyze these data, and produce a chart to help a pediatrician to assess if a child’s GAG concentration is “normal”.
  5. The Janka hardness is an important structural property of Australian timbers, which is difficult to measure. It is, however, related to the density of the timber, which is relatively easy to measure. For the data in janka, a low-degree polynomial regression of hardness on density is suggested as appropriate. Fit models and check whether there are obvious outliers or heteroscedasticity, and whether these can be remedied by square-root or log transformations.
  6. Data set tetrahymena contains data about the growth of tetrahymena cells: the diameter (μm) and concentration (counts/ml) of the cells and whether glucose was added to the growth medium or not. Find an appropriate model for the diameter of the cells explained by the other variables.
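
As a possible starting point for exercises 1 to 3, the following R sketch shows one way the calculations might be set up. It is not a model solution. The variable names (smokers, patients, obs, expected, hw_stat, fit_log and so on) are chosen here for illustration only. The sketch also assumes that the cars data frame shipped with R's datasets package matches the file in the .zip archive. For the Hardy-Weinberg test, the allele frequency p is estimated from the sample, so the sketch uses one degree of freedom (three genotype classes, minus one, minus one estimated parameter).

```r
# Exercise 1: sample proportions and a test of equal proportions.
smokers  <- c(83, 90, 129, 70)   # smokers per group, from the table
patients <- c(86, 93, 136, 82)   # patients per group, from the table
props <- smokers / patients      # proportion of smokers in each sample
smoke_test <- prop.test(smokers, patients)  # H0: all four proportions are equal

# Exercise 2: chi-squared goodness-of-fit test for Hardy-Weinberg equilibrium.
obs <- c(A1A1 = 45, A1A2 = 35, A2A2 = 15)   # observed genotype counts
n   <- sum(obs)
# Estimated frequency of allele A1 (two copies per homozygote, one per heterozygote).
p   <- unname((2 * obs["A1A1"] + obs["A1A2"]) / (2 * n))
q   <- 1 - p
expected <- n * c(p^2, 2 * p * q, q^2)      # expected counts under H0
hw_stat  <- sum((obs - expected)^2 / expected)
# Degrees of freedom: 3 classes - 1 - 1 estimated parameter = 1.
hw_pval  <- pchisq(hw_stat, df = 1, lower.tail = FALSE)

# Exercise 3: scatter plot, lowess smoother, and a linear model on the log scale.
plot(dist ~ speed, data = cars)
lines(lowess(cars$speed, cars$dist), col = "red")
fit_log <- lm(log(dist) ~ log(speed), data = cars)

print(round(props, 3))
print(smoke_test)
print(c(statistic = hw_stat, p.value = hw_pval))
print(summary(fit_log))
```

The Hardy-Weinberg statistic is computed by hand rather than with chisq.test(obs, p = ...), because chisq.test would use two degrees of freedom and so ignore the fact that p was estimated from the same data. Residual plots of fit_log, for example via plot(fit_log), would be a natural next step for judging whether the log-scale model is adequate.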