Introduction
There are a number of R functions available in the basic, pre-loaded packages such as base, utils or stats, and a huge number more are available via third-party packages. Today, we will work with some of the functions that are commonly used by R users coming from different disciplines. In particular, we will:
- Learn how to work with sets.
- Work with polynomials.
- Learn how to define and work with functions.
- Do some basic calculus.
- Learn about formulas.
- Work with some selected statistical tests.
- Learn more about setting up a statistical analysis workflow by fitting a linear model to example data.
Exercise 1. Create and work with sets
- Open RStudio and create three sets of genes:
G1 = {'ANK-1', 'ANK-c', 'GALNTL-1'}
G2 = {'ANK-1', 'FMA', 'RHO', 'GRP'}
G3 = {'GALNTL-1', 'ANK-c', 'HQX'}
Visualise membership relations between all elements of the sets using a Venn diagram. Use the package venn.
Click to see how
library('venn')
G1 <- c('ANK-1', 'ANK-c', 'GALNTL-1')
G2 <- c('ANK-1', 'FMA', 'RHO', 'GRP')
G3 <- c('GALNTL-1', 'ANK-c', 'HQX')
venn(list(G1 = G1, G2 = G2, G3 = G3))
- What is the:
- Union of G1 and G3?
- Union of intersections: G1 with G2 and G1 with G3?
- Difference between G2 and G3?
- Is the union of G1 and G2 equal to the intersection of G2 with G3?
- Are genes ANK-c and GALNTL-1 members of the intersection of G1 with G3?
Click to see how
union(G1, G3)                                          # union of G1 and G3
union(intersect(G1, G2), intersect(G1, G3))            # union of the two intersections
setdiff(G2, G3)                                        # difference between G2 and G3
setequal(union(G1, G2), intersect(G2, G3))             # is the union equal to the intersection?
is.element(c('ANK-c', 'GALNTL-1'), intersect(G1, G3))  # membership in the intersection of G1 and G3
Exercise 2. Define and work with polynomials
- Define the following polynomials p1 and p2: 5x^3 + 4x^2 + 7 and
2x^2 + 3x - 11.
Click to see how
library(polynom)
# Coefficients are given in increasing order of the powers of x
p1 <- polynomial(c(7, 0, 4, 5))   # 5x^3 + 4x^2 + 7
p2 <- polynomial(c(-11, 3, 2))    # 2x^2 + 3x - 11
- Define a polynomial (p3) with the following zeros: -3, 4, 7.
Click to see how
library(polynom)
p3 <- poly.calc(c(-3, 4, 7))   # monic polynomial with zeros at -3, 4 and 7
- Define a polynomial (p4) passing through the following points:
A(-3,7), B(24,-9), C(7,4).
Click to see how
p4 <- poly.calc(c(-3, 24, 7), c(7, -9, 4))   # Lagrange interpolation through A, B and C
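As an optional sanity check, predict() from the polynom package evaluates a polynomial at given points, so p4 should reproduce the y-coordinates of points A, B and C:
predict(p4, c(-3, 24, 7))   # should return 7, -9 and 4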
- Find the approximate maximum (value) of the p4 polynomial using its visualisation.
Click to see how
plot(p4, ylim = c(-1, 8))
# The maximum value is between 6 and 8.
- Perform the following operations:
- Find the sum of p1 and p3.
- Divide p4 by p2.
- Find the area under p4 on the [-10, 0] interval.
- Find the second order derivative of p1.
- Find the Greatest Common Divisor of p1 and p2.
- Find the Least Common Multiple of p1 and the sum of p2 and p3.
Click to see how
p1 + p3                   # sum of p1 and p3
p4 / p2                   # divide p4 by p2
integral(p4, c(-10, 0))   # area under p4 on the [-10, 0] interval
deriv(deriv(p1))          # second order derivative of p1
GCD(p1, p2)               # greatest common divisor of p1 and p2
LCM(p1, p2 + p3)          # least common multiple of p1 and p2 + p3
Exercise 3. Formulas
Observe that there is a convenient way of validating your formulas:
# First, define the formula
frm <- formula(y ~ (v1 + v2 + v3) %in% v)
# Check the validity of the above
terms <- attr(terms.formula(frm), "term.labels")
# See all terms that will be added together
terms
# Here: y ~ v1:v + v2:v + v3:v
Write formulas describing the following relations of y:
- depends on var1 and var2 but not on their interaction,
- depends on height but without the constant term,
- depends on interactions of 'a' with 'b', 'a' with 'c' and 'a' with 'd' but without additive effects,
- like above, but with additive effects of single variables,
- like above, but without the constant term and without the additive effect of 'c'.
Click to see how
y ~ var1 + var2                          # additive effects only, no interaction
y ~ 0 + height                           # height without the constant term
frm <- formula(y ~ (b + c + d) %in% a)   # interactions with a, no additive effects
# Check the validity of the above
terms <- attr(terms.formula(frm), "term.labels")
terms
y ~ a * (b + c + d)                      # interactions plus additive effects
y ~ a * (b + c + d) - c - 1              # as above, minus the additive effect of c and the constant term
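As an optional extra, the terms.formula() trick introduced at the start of this exercise can be used to double-check any of these answers; the name frm2 below is just for illustration:
frm2 <- formula(y ~ a * (b + c + d) - c - 1)
attr(terms.formula(frm2), "term.labels")   # terms that will enter the model
attr(terms.formula(frm2), "intercept")     # 0 means the constant term is dropped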
Exercise 4. Define and work with functions
- Define the function f(x) = 2x^3 + 3x^2 + sin(x/2) + 7 and find its value for x = 42.
Click to see how
f <- function(x) {
  y <- 2*x^3 + 3*x^2 + sin(x/2) + 7
  return(y)
}
f(42)
# f(42) is approximately 153476
- Find the zeros of f(x) on the [-10, 10] interval and the function values at those zeros, and plot the function over the same interval.
Click to see how
uniroot(f, lower = -10, upper = 10)
curve(f, from = -10, to = 10)
# One zero at about x = -2.16, where f(-2.16) = 5.24e-05
- Find the first order derivative of f(x) with respect to x. Use
both D() and deriv(). Are the resulting derivatives the same?
What is the difference between the functions?
Click to see how
D(expression(2*x^3 + 3*x^2 + sin(x/2) + 7), name = 'x')
deriv(~ 2*x^3 + 3*x^2 + sin(x/2) + 7, 'x')
The resulting derivatives are the same, just written in different ways. *D()* takes an expression as its argument and returns an expression, while *deriv()* works on formulas and returns an expression that, when evaluated, also carries the derivative values in a "gradient" attribute.
- Evaluate the derivative from FNS4.3 at x = {1, 3, 7}. What values
does it take at these points?
Click to see how
my.call <- D(expression(2*x^3 + 3*x^2 + sin(x/2) + 7), name = 'x')
x <- c(1, 3, 7)
eval(my.call)
# [1]  12.4  72.0 335.5
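As an optional aside, the object returned by deriv() can be evaluated in the same way; the derivative values are then read from its gradient attribute (the name my.deriv is just for illustration):
my.deriv <- deriv(~ 2*x^3 + 3*x^2 + sin(x/2) + 7, 'x')
x <- c(1, 3, 7)
attr(eval(my.deriv), 'gradient')   # same derivative values as above, one row per x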
- Perform numerical integration: $\int_1^7 f(x)\,dx$.
- What is the value?
- What is the absolute estimation error?
Click to see how
integrate(f, lower = 1, upper = 7)
# 1588 with absolute error < 1.8e-11
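If you prefer to extract these numbers programmatically rather than read them from the printed output (an optional extra), integrate() returns a list-like object whose components can be accessed directly:
res <- integrate(f, lower = 1, upper = 7)
res$value       # the estimated value of the integral
res$abs.error   # the absolute estimation error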
Exercise 5. Statistical tests
Draw N=30 random observations from N(0,1) (normal distribution with mean=0 and std. dev.=1) and N=38 random observations from N(0.02,1.1).
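A minimal sketch of drawing the two samples; the seed and the names x1 and x2 are arbitrary choices, the seed only makes the draws reproducible:
set.seed(42)                            # arbitrary seed for reproducibility
x1 <- rnorm(30, mean = 0, sd = 1)       # N = 30 from N(0, 1)
x2 <- rnorm(38, mean = 0.02, sd = 1.1)  # N = 38 from N(0.02, 1.1)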
- What statistical test(s) would you use to check whether the two samples come from the normal distribution?
- Can we really say something about the samples’ normality? Why?
- What parametric test would you use to check whether both samples come from the distribution with the same mean?
- What do you have to check before you apply the test?
Click to see the answer
- For example, one can plot a QQ plot for both samples. One can also use the Shapiro-Wilk test for normality. Can you think of any more tests?
- Well, here the sample size is low, so we may get false results!
- For instance Student's t-test. It is appropriate for sample sizes below N = 100; as a rule of thumb, N = 30 is about sufficient.
- Before applying the test, one has to check whether its assumptions are valid. Here, we have to check normality first.
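A sketch of how these checks could look in code, using the x1 and x2 samples drawn above; the choice of tests follows the answer and is not the only possible approach:
# Visual check of normality
qqnorm(x1); qqline(x1)
qqnorm(x2); qqline(x2)
# Shapiro-Wilk test of normality for each sample
shapiro.test(x1)
shapiro.test(x2)
# Two-sample t-test for equality of means
t.test(x1, x2)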
Exercise 6. Linear model and diagnostics
Package UsingR contains a dataset with body fat measurements of 252 males.
- load the dataset (install the package if necessary),
- read about the dataset in package help,
- construct a linear model with body.fat as the dependent variable and variables: age, weight, height, neck and hip as independent variables,
- perform model diagnostics,
- improve the model if necessary (hint: e.g. outliers),
- which variables are significant?
- how much of the variability can be explained by the model?
- how would you improve the model? Do this if you have time.
Click to see how
library('UsingR')
data('fat')
?fat
model <- lm(body.fat ~ age + weight + height + neck + hip, data = fat)
summary(model)
plot(model)   # diagnostic plots
# Remove outliers identified in the diagnostic plots
fat2 <- fat[-c(54, 39, 42), ]
model2 <- lm(body.fat ~ age + weight + height + neck + hip, data = fat2)
summary(model2)
plot(model2)
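The row numbers removed above come from the original answer; one possible way (among others) to flag such influential observations yourself is Cook's distance:
cd <- cooks.distance(model)         # influence of each observation on the fit
head(sort(cd, decreasing = TRUE))   # the most influential observations
which(cd > 4 / nrow(fat))           # 4/n is a common rule-of-thumb cutoff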