Features:
A list of things you need to know how to do:
| All Functions and Datasets | Add to Existing Plot | ANOVA Models | Bootstrap Methods |
![]() |
|||
![]() |
|||
| Input/Output-Files | |||
![]() |
![]() |
![]() |
|
![]() |
|||
![]() |
Missing Values | ||
![]() |
![]() |
||
![]() |
![]() |
||
![]() |
|||
![]() |
![]() |
![]() |
|
![]() |
![]() |
![]() |
|
![]() |
![]() |
![]() |
|
![]() |
|||
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
|
![]() |
Starting a new Splus project:
[48]ehlehl% cd Splus5project
[49]ehlehl% Splus5 CHAPTER
Creating data directory for chapter .
Splus5 chapter Splus5project initialized.
[50]ehlehl% Splus5
S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 5.1 Release 1 for Sun SPARC, SunOS 5.5 : 1999
Working data will be in .Data
Getting data into Splus:
> # Making a vector of numbers
> xshort <- c(0,1,2,3,4)
# c is a function which can take an
# arbitrary number of arguments and
# strings them out one after another
> # Printing out an object -- type the name
> xshort
[1] 0 1 2 3 4
> # Make a big object
> times <- seq(from=0, to=10,length=1001)
> # Print part of times
> times[1:10]
[1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
> # Generate normal data
noise <- rnorm(1001)
> # Read data from a file
> gas <- read.table("gas.data",header=TRUE)
> gas[1:3,]
Cost Distance UnitPrice
1 9.80 390.4 31.2
2 10.41 413.6 31.4
3 10.50 415.2 31.5
> # gas is now a data.frame
> summary(gas)
Cost Distance UnitPrice
Min.: 6.33 Min.:198.4 Min.:31.20
1st Qu.:11.21 1st Qu.:423.4 1st Qu.:36.10
Median:13.12 Median:443.1 Median:40.10
Mean:12.71 Mean:443.0 Mean:39.15
3rd Qu.:14.09 3rd Qu.:465.6 3rd Qu.:43.30
Max.:15.67 Max.:534.5 Max.:47.10
> # Now what objects have we created?
> objects()
[1] ".Last.value" "gas" "last.dump" "times"
[5] "xshort"
Manipulating data
> # Most functions operate on vectors componentwise
> y <- sin(2*pi*times)
> # add columns to the data.frame
> # Compute litres of gas used in tankful
> gas$GasUsed <- gas$Cost/(gas$UnitPrice/100)
> # Compute litres per hundred kilometres
> gas$Consumption <- gas$GasUsed/(gas$Distance/100)
> var(gas[,c(2,4,5)])
Distance GasUsed Consumption
Distance 1886.49916 90.893806 -10.2972991
GasUsed 90.89381 8.195885 0.3428260
Consumption -10.29730 0.342826 0.2426734
> sqrt(diag(var(gas[,c(2,4,5)])))
[1] 43.4338480 2.8628456 0.4926189
> apply(gas,2,mean)
Cost Distance UnitPrice GasUsed Consumption
12.7143 443.0374 39.15234 32.44423 7.346162
> # some examples of vector behaviour
> c(1,2,3)*c(1,2,3)
[1] 1 4 9
> # some matrix manipulation
> c(1,-1,0)%*%var(gas[,c(2,4,5)]) %*% c(1,-1,0)
[,1]
[1,] 1712.907
> solve(var(gas[,c(2,4,5)]))%*%var(gas[,c(2,4,5)])
Distance GasUsed Consumption
Distance 1 -1.332268e-15 2.220446e-16
GasUsed 0 1.000000e+00 -3.552714e-15
Consumption 0 -8.526513e-14 1.000000e+00
Making simple plots
> # Plot a graph of the sin(2*pi*t)
> # Open up a graphics window
> motif()
> plot(times,y,xlab="Time",ylab="Amplitude",
main="The Sine Function",type='l')
> # Saving a hard copy
> postscript(file='sine.ps',horizontal=F)
> plot(times,y,xlab="Time",ylab="Amplitude"
,main="The Sine Function",type='l')
> dev.off()
> postscript("gaspairs.ps",horizontal=F)
> pairs(gas)
> dev.off()
Other important plotting functions:
Regression: Splus has a built-in modelling language. Many different modelling functions use the same language.
> Dreg <- lm(Distance ~ GasUsed, data=gas)
> Dreg
Call:
lm(formula = Distance ~ GasUsed, data = gas)
Coefficients:
(Intercept) GasUsed
83.22513 11.09018
Degrees of freedom: 107 total; 105 residual
Residual standard error: 29.77981
> summary(Dreg)
Call: lm(formula = Distance ~ GasUsed, data = gas)
Residuals:
Min 1Q Median 3Q Max
-74.02 -16.77 -2.967 17.76 100.8
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 83.2251 32.9062 2.5292 0.0129
GasUsed 11.0902 1.0103 10.9766 0.0000
Residual standard error: 29.78 on 105 degrees of freedom
Multiple R-Squared: 0.5343
F-statistic: 120.5 on 1 and 105 degrees of freedom, the p-value is 0
Correlation of Coefficients:
(Intercept)
GasUsed -0.9962
> # what is Dreg?
> # List: with components
> names(Dreg)
[1] "coefficients" "residuals" "fitted.values" "effects"
[5] "R" "rank" "assign" "df.residual"
[9] "contrasts" "terms" "call"
> names(Dreg$call)
[1] "" "formula" "data"
> Dreg$call
lm(formula = Distance ~ GasUsed, data = gas)
> Dreg$call$formula
Distance ~ GasUsed
Categorical independent variables
> # Some data sets are built in to S
> # Different objects are kept in different places.
> objects(4)
> summary(fuel.frame)
...
[124] "freeny.x" "freeny.y" "fuel.frame"
[127] "fuel.frame.orig" "galaxy" "gas"
...
> summary(fuel.frame)
Weight Disp. Mileage Fuel Type
Min.:1845 Min.: 73.0 Min.:18.00 Min.:2.703 Compact:15
1st Qu.:2571 1st Qu.:113.8 1st Qu.:21.00 1st Qu.:3.704 Large: 3
Median:2885 Median:144.5 Median:23.00 Median:4.348 Medium:13
Mean:2901 Mean:152.1 Mean:24.58 Mean:4.210 Small:13
3rd Qu.:3231 3rd Qu.:180.0 3rd Qu.:27.00 3rd Qu.:4.762 Sporty: 9
Max.:3855 Max.:305.0 Max.:37.00 Max.:5.556 Van: 7
i> Freg <- lm(Fuel ~ Disp. + Weight +Type, data=fuel.frame)
> summary(Freg)
Call: lm(formula = Fuel ~ Disp. + Weight + Type, data = fuel.frame)
Residuals:
Min 1Q Median 3Q Max
-0.6973 -0.2444 -0.01367 0.2 0.6363
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 2.9606 0.6005 4.9306 0.0000
Disp. 0.0076 0.0017 4.3616 0.0001
Weight 0.0000 0.0003 0.1611 0.8726
Type1 -0.1453 0.1293 -1.1239 0.2662
Type2 0.0981 0.0470 2.0850 0.0420
Type3 -0.1241 0.0505 -2.4556 0.0174
Type4 -0.0436 0.0252 -1.7314 0.0893
Type5 0.1915 0.0337 5.6748 0.0000
Residual standard error: 0.3139 on 52 degrees of freedom
Multiple R-Squared: 0.8486
F-statistic: 41.65 on 7 and 52 degrees of freedom, the p-value is 0
Correlation of Coefficients:
(Intercept) Disp. Weight Type1 Type2 Type3 Type4
Disp. 0.4864
Weight -0.9415 -0.7477
Type1 0.3786 -0.2960 -0.1559
Type2 0.0888 0.3470 -0.2166 -0.4995
Type3 -0.7591 -0.0589 0.5929 -0.6197 0.1746
Type4 -0.3099 -0.1641 0.2937 -0.2997 0.1304 0.3021
Type5 0.7121 0.6013 -0.7688 -0.0053 0.2625 -0.3933 -0.2013
> coef(Freg)
(Intercept) Disp. Weight Type1 Type2 Type3
2.960617 0.007594073 4.16396e-05 -0.1452804 0.09808411 -0.1240942
Type4 Type5
-0.04358047 0.1915062
> anova(Freg)
Analysis of Variance Table
Response: Fuel
Terms added sequentially (first to last)
Df Sum of Sq Mean Sq F Value Pr(F)
Disp. 1 17.25253 17.25253 175.0661 0.000000e+00
Weight 1 7.93103 7.93103 80.4783 0.000000e+00
Type 5 3.54878 0.70976 7.2021 3.465263e-05
Residuals 52 5.12453 0.09855
> # Diagnostic plots:
> postscript("FuelDiag.ps",horizontal=F)
> # Lay the plots out in a 3 by 2 grid
> par(mfrow=c(3,2))
> plot(Freg)
> dev.off()
> # Examine the postscript file without leaving S
> !ghostview FuelDiag.ps&
Writing SPlus functions: Here I will make a contour plot
for a density which is a 50/50 mixture of two bivariate normal densities
with different correlations.
> x <- seq(-3, 3, length = 200)
> y <- x
> g <- function(x, y, rho1, rho2 = 0)
{
# g computes the density of a 50/50 mixture of two normal densities
# at a grid of x,y values
#
# Use outer to make a matrix whose ijth entry is x[i]*y[j]
q1 <- outer(x, y, "*")
y1 <- rep(1, length(y))
x1 <- rep(1, length(x))
qx <- outer(x^2, x1, "*")
qy <- outer(y1, y^2, "*")
den1 <- exp( - (qx - 2 * rho1 * q1 + qy)/
(2 * sqrt(1 - rho1^2)))/(2 * pi * sqrt(1 - rho1^2))
den2 <- exp( - (qx - 2 * rho2 * q1 + qy)/
(2 * sqrt(1 - rho2^2)))/(2 * pi * sqrt(1 - rho2^2))
(den1 + den2)/2
}
> postscript("partb.ps",height=6.5,width=6.5,horizontal=F)
> zb <- g(x, y, 0.8, -0.8)
> contour(x, y, zb, nlevels = 20, labex = 0)
> dev.off()
Simulations: I will simulate the sample mean when sampling
from the Cauchy distribution.
> # generate 1000 samples of size 25 and put them in 1000 by 25 matrix
> xc <- matrix(rcauchy(10000*25), nrow=10000)
> # compute sample means the really slow way
> xcm0 <- mean(xc[1,1])
> for ( i in 2:10000) xcm0 <- c(xcm0,mean(xc[i,]))
> # compute sample means the slow way:
> xcm <- apply(xc,1,mean)
> # faster using vector arithmetic
> one <- rep(1,25)
> xcm1 <- xc %*% one/25
> # Plot a kernel smoothed density estimate for xbar
> plot(ksmooth(xcm1,range.x=c(-6,6)),type='l',
xlab="x",ylab="Density")
> # Compare estimated density of xbar to
> # Original Cauchy density
> xs <- seq(-6,6,length=1000)
> lines(xs,dcauchy(xs),lty=2)
Pitfalls: When you write a function that calls a function
and so on you can get caught in a trap: when functions look for
variables you need to be aware of where they look.
> # in a new directory
> a <- 3
> g <- function(x) {x^a}
> g(2)
[1] 8
> # now remove all those functions and stick the whole thing
> # inside a function
> remove(ls())
>ls()
character(0)
> w <- function(){
a <- 3
g <- function(x) { x^a }
g(2)
}
> w()
Problem in g(2): Object "a" not found
Use traceback() to see the call stack
The problem is in where S looks for things it needs inside g.
Look up assign in Splus for help!
Looping: is slow so use the vector calculations - make the argument a vector. Use outer. Use matrix arithmetic.
Help: Function help or args when you know the name of the function but not how to use it. Use help.start() to get a netscape based on line help browser. Go to http://lib.stat.cmu.edu/ and look at S-news, the S-Archive and other things.