martedì 29 maggio 2007

Detecting outliers through boxplots of the features

This function detects univariate outliers simultaneously using boxplots
of the features:

require(dprep)

data(diabetes)

outbox(diabetes,nclass=1)

giovedì 24 maggio 2007

Selecting “contrasting” colors

From the R-list (as usual):

What I want to be able to do is to place text on a background of arbitrary
(but known RGB) colour so that the text is legible.
I guess that this is better described as a "contrasting" than a "complementary" colour.
Since luminance contrasts are necessary and sufficient for readable text,
you could use white for dark colors and black for light colors.
Luminance is roughly proportional to 0.2*(R^2.4)+0.6*(G^2.4),
suggesting something like

lightdark <-function (color){
rgb <- col2rgb(color)/255
L <- c(0.2, 0.6, 0) %*% rgb
ifelse(L >= 0.2, "#000060", "#FFFFA0")
}


This uses a pale yellow for dark backgrounds and a dark blue for light
backgrounds, and it seems to work reasonably well.

mercoledì 23 maggio 2007

Scatter plot with axes drawn on the same scale

I'd like to produce some scatter plots where N units on the X axis are > equal to N units on the Y axis (as measured with a ruler, on screen or paper).
x <- sample(10:200,40)
y <- sample(20:100,40)
windows(width = max(x),height = max(y))
plot(x,y)
# try:
plot(x, y, asp = 1)
# or, better:
library(MASS)
eqscplot(x,y)
#or
library(lattice)
xyplot(y ~ x, aspect = "iso")

martedì 22 maggio 2007

How to draw a plot with two Y axises and one X axis

plot(1:10)
par("usr")
# [1] 0.64 10.36 0.64 10.36
# Now resetting y axis' usr coordinates:
par(usr=c(par("usr")[1:2], 101, 105))
points(1:5, 105:101, col="red")
axis(4)

sabato 19 maggio 2007

Running R Programs on clusters

1. Syntax for running R programs in BATCH mode from the command-line

$ R CMD BATCH [options] my_script.R [outfile]
$ nohup nice -n 14 R CMD BATCH myRfile.R &


The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used is that of 'infile' and '.Rout' is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .Rdata file which can get often very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.

2. Submitting R script to Linux cluster via Torque Create the following shell script 'my_script.sh'

##################################
#!/bin/sh
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
##################################


This script doesn't need to have executable permissons. Use the following 'qsub' command to send this shell script to the Linux cluster from the directory where the R script 'my_script.R' is located. To utilize several CPUs on the Linux cluster, one can divide the input data into several smaller subsets and execute for each subset a separate process from a dedicated directory.

$ qsub my_script.sh

giovedì 17 maggio 2007

Quick and dirty function for descriptive statistics

desc <- function(mydata) {
require(e1071)
quantls <- quantile(x=mydata, probs=seq(from=0, to=1, by=0.25))
themean <- mean(mydata)
thesd <- sd(mydata)
kurt <- kurtosis(mydata)
skew <- skewness(mydata)
retlist <- list(Quantiles=quantls, Mean=themean,
StandDev=thesd,Skewness=skew, Kurtosis=kurt)
return(retlist)
}
# example
exampledata <- rnorm(10000)
summary(exampledata)
desc(exampledata)

mercoledì 16 maggio 2007

How can I turn a string into a variable?

If you have

varname <- c("a", "b", "d")

you can do

get(varname[1]) + 2

for a + 2

or

assign(varname[1], 2 + 2)

for a <- 2 + 2

or

eval(substitute(lm(y ~ x + variable), list(variable = as.name(varname[1]))

for lm(y ~ x + a)

At least in the first two cases it is often easier to just use a list, and then you can easily index it by name

vars <- list(a = 1:10, b = rnorm(100), d = LETTERS) vars[["a"]]

without any of this messing about.

martedì 15 maggio 2007

Large Matrix into a vector

mx <- matrix(rnorm(100,1:100),10,10)
vec <- c(mx)
or
vec <- c(as.matrix(mx))
or
dim(mx) <- NULL

lunedì 14 maggio 2007

Comparing two matrices, row by row

ar1 <- array(data=c(1:16),dim=c(4,4))
ar2 <- array(data=c(1,2,3,3,5:16),dim=c(4,4))
z<-ar1==ar2
ar1
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
ar2
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 3 8 12 16
z
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE
[4,] FALSE TRUE TRUE TRUE
which(z==FALSE)
[1] 4


Or

apply(ar1==ar2,1,all)

venerdì 11 maggio 2007

Density curve over a histogram

This post was updated.

x <- rnorm(1000)
hist(x, freq = FALSE, col = "grey")
curve(dnorm, col = 2, add = TRUE)

This thread (specifically the Ted Harding answer) from the r-help-list augments the usefulness of this simple tip.

This kind of plots cab be easily produced using the lessR package.

For example, from ?color.density:

library(lessR)
# generate 100 random normal data values
y <- rnorm(100)
# normal curve and general density curves superimposed over histogram
# all defaults
color.density(y)


giovedì 10 maggio 2007

Barplots of two sets

x <- c(0.0001, 0.0059, 0.0855, 0.9082)
y <- c(0.54, 0.813, 0.379, 0.35)
# create a two row matrix with x and y
height <- rbind(x, y)
# Use height and set 'beside = TRUE' to get pairs
# save the bar midpoints in 'mp'
# Set the bar pair labels to A:D
mp <- barplot(height, beside = TRUE,
ylim = c(0, 1), names.arg = LETTERS[1:4])
# Nel caso generale, i.e., che si usa di
# solito (height MUST be a matrix)
mp <- barplot(height, beside = TRUE)
# Draw the bar values above the bars
text(mp, height, labels = format(height, 4),
pos = 3, cex = .75)


martedì 8 maggio 2007

Fill a matrix with vectors of different lengths

v <- c(1,1,2,3,4,6)
binc <- function(x){
l <- sum(x)+1
y <- c(1,rep(0,l-1))
for (i in x) y <- y+c(rep(0,i),y)[1:l]
}
out <- lapply(1:length(v), function(i) binc(v[1:i]))
nout <- max(sapply(out, length))
sapply(out, function(x) c(x, rep(0, nout - length(x))))

lunedì 7 maggio 2007

Make many barplot into one plot

# I have 4 tables like this:
satu <- array(c(5,15,20,68,29,54,84,119), dim=c(2,4), dimnames=list(c("Negative", "Positive"), c("Black", "Brown", "Red", "Blond")))
dua <- array(c(50,105,30,8,29,25,84,9), dim=c(2,4), dimnames=list(c("Negative", "Positive"), c("Black", "Brown", "Red", "Blond")))
tiga <- array(c(9,16,26,68,12,4,84,12), dim=c(2,4), dimnames=list(c("Negative", "Positive"), c("Black", "Brown", "Red", "Blond")))
empat <- array(c(25,13,50,78,19,34,84,101), dim=c(2,4), dimnames=list(c("Negative", "Positive"), c("Black", "Brown", "Red", "Blond")))
# rbind() the tables together
TAB <- rbind(satu, dua, tiga, empat)
# Do the barplot and save the bar midpoints
mp <- barplot(TAB, beside = TRUE, axisnames = FALSE)
# Add the individual bar labels
mtext(1, at = mp, text = c("N", "P"),
line = 0, cex = 0.5)
# Get the midpoints of each sequential pair of bars
# within each of the four groups
at <- t(sapply(seq(1, nrow(TAB), by = 2),
function(x) colMeans(mp[c(x, x+1), ])))
# Add the group labels for each pair
mtext(1, at = at, text = rep(c("satu", "dua", "tiga", "empat"), 4),
line = 1, cex = 0.75)
# Add the color labels for each group
mtext(1, at = colMeans(mp), text = c("Black", "Brown", "Red", "Blond"), line = 2)

domenica 6 maggio 2007

How to clear screen in R (Windows)

This tip (taken from this old thread) does work only under Windows:

cls <- function() {
       require(rcom)
       wsh <- comCreateObject("Wscript.Shell")
       comInvoke(wsh, "SendKeys", "\014")
       invisible(wsh)
}
cls()
or
# An R function to clear the screen on RGui:
cls <- function() {
if (.Platform$GUI[1] != "Rgui")
return(invisible(FALSE))
if (!require(rcom, quietly = TRUE)) # Not shown any way!
stop("Package rcom is required for 'cls()'")
wsh <- comCreateObject("Wscript.Shell")
if (is.null(wsh)) {
    return(invisible(FALSE))
} else {
comInvoke(wsh, "SendKeys", "\014")
    return(invisible(TRUE))
}
}
cls()

See this StackOverflow Question for other solutions.

venerdì 4 maggio 2007

Fill a matrix with vectors of different lengths

V <- c(1,1,2,3,4,6)
binc <- function(x){
l <- sum(x)+1
y <- c(1,rep(0,l-1))
for (i in x) y <- y+c(rep(0,i),y)[1:l]
}
out <- lapply(1:length(v), function(i) binc(v[1:i]))
nout <- max(sapply(out, length))
sapply(out, function(x) c(x, rep(0, nout - length(x))))

giovedì 3 maggio 2007

Rotating a distribution plot by 90 degrees

x <- c( rnorm(50,10,2), rnorm(30,20,2) )
y <- 2+3*x + rnorm(80)
d.x <- density(x)
d.y <- density(y)
layout( matrix( c(0,2,2,1,3,3,1,3,3),ncol=3) )
plot(d.x$x, d.x$y, xlim=range(x), type='l')
plot(d.y$y, d.y$x, ylim=range(y), xlim=rev(range(d.y$y)), type='l')
plot(x,y, xlim=range(x), ylim=range(y) )





mercoledì 2 maggio 2007

ls() improved!

This marvelous little function shows all objects in the current workspace
by mode, class and 'size'! Thanks to Bendix Carstensen!

lls <- function (pos = 1, pat = "")

{

dimx <- function(dd) if (is.null(dim(dd)))

length(dd)

else dim(dd)

lll <- ls(pos = pos, pat = pat)

cat(formatC("mode", 1, 15), formatC("class", 1, 18),

formatC("name",1, max(nchar(lll)) + 1), "size\n-----------------------------------------------------------------\n")

if (length(lll) > 0)

{

for (i in 1:length(lll))

{

cat(formatC(eval(parse(t = paste("mode(", lll[i],

")"))), 1, 15), formatC(paste(eval(parse(t = paste("class(",

lll[i], ")"))), collapse = " "), 1, 18), formatC(lll[i],

1, max(nchar(lll)) + 1), " ", eval(parse(t = paste("dimx(", lll[i], ")"))), "\n")

}

}

}