R tips, tricks and codes

My R packages
I am the maintainer of the following packages on CRAN (The Comprehensive R Archive Network):

baseline: Baseline correction of spectra (docs)
EMSC: Extended Multiplicative Signal Correction (docs)
ER: Effect + Residual Modelling (docs)
fixedTimeEvents: The distribution of distances between discrete events in fixed time (docs)
HDANOVA: High-Dimensional Analysis of Variance with ASCA and similar methods (docs)
MatrixCorrelation: Matrix Correlation Coefficients (docs)
mixlm: Mixed Model ANOVA and Statistics for Education (docs)
multiblock: Data Analysis for Multiple Blocks (docs)
pls: Partial Least Squares and Principal Component Regression (docs)
plsVarSel: Variable selection in Partial Least Squares (docs)
RcmdrPlugin.NMBU: R Commander Plug-in for University Level Applied Statistics (docs, introduction)

There is a nice personal documentation/bragging page at: RDocumentation.

I also contribute to the following packages on CRAN:

microclass: Methods for Taxonomic Classification of Prokaryotes (docs)
microcontax: The ConTax Data Package (docs)
micropan: Microbial Pan-Genome Analysis (docs)
microseq: Basic Biological Sequence Analysis (docs)
oreo: Large Amplitude Oscillatory Shear (docs)
takos: Analysis of Differential Calorimetry Scans (docs)

The following packages are only available on GitHub and in NMBU’s repository:

microcontax.data: Data for the microcontax package
nmbu: Utility package for automatic update and startup of RcmdrPlugin.NMBU.
SelectWave: A shiny app for spectroscopic data analysis

My package development is Git-based: https://github.com/khliland.

R itself
R is a programming language and software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for developing statistical software, and is widely used for statistical software development and data analysis. [Wikipedia]
After working with R as my main programming language for several years, I have met many challenges and searched for information many times. Through this web page I hope to collect some of my experience and share it with other programmers, new or experienced.

Working efficiently with R

A complete computing environment with source code, the R Console, workspace, history, files, plots, packages and help pages visible at the same time is available through RStudio (cross-platform, freeware).
For Windows users, Notepad++ and NppToR in combination with R give colour-coded source code and code sourcing from the editor.
Linux users are often more familiar with Eclipse, which also has support for R (cross-platform, freeware).

Web resources

The main resource for R, including R downloads, packages, documentation, task views and search engines, is The R Project for Statistical Computing and its subdomain CRAN (the Comprehensive R Archive Network).
Among the many blogs concerning R, the one to rule them all is R-bloggers. It is a vast resource for tips, tricks and code.

Quick programs
One of my favourite activities in front of the television in the evenings is making quick programs and efficient solutions for large or repetitive problems. An example can be found over at R-Forge, where I have used Rcpp to combine C++ and R in the Needleman-Wunsch package. This is one solution to computing similarities between two sequences using a global or semi-global search. Using C++ ensures minimal overhead in the computations, and reducing a matrix problem to a double vector problem with extensive reuse of memory ensures a small memory footprint.
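
The memory idea can be illustrated with a minimal sketch in plain R. This is not the code from the R-Forge package: it assumes that "double vector" means keeping only two rows of the score matrix, it computes the global alignment score only (no traceback, no semi-global mode), and the function name nw_score and the scoring parameters are hypothetical choices made for this example.

```r
# Global Needleman-Wunsch score using two rows instead of a full matrix.
# nw_score, match_score, mismatch and gap are illustrative names only.
nw_score <- function(a, b, match_score = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
  prev <- gap * (0:length(y))
  for (i in seq_along(x)) {
    curr <- numeric(length(y) + 1)
    curr[1] <- gap * i  # aligning a prefix of a against the empty prefix of b
    for (j in seq_along(y)) {
      diag <- prev[j] + if (x[i] == y[j]) match_score else mismatch
      curr[j + 1] <- max(diag, prev[j + 1] + gap, curr[j] + gap)
    }
    prev <- curr  # only the previous row is needed for the next iteration
  }
  prev[length(y) + 1]
}

score_same <- nw_score("GATTACA", "GATTACA")
score_gap  <- nw_score("ABC", "AC")
```

Only two vectors of length nchar(b) + 1 are alive at any time, so memory grows with the length of one sequence rather than with the product of both lengths. The price is that the alignment itself cannot be traced back from the score alone.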

Quick functions
Though I usually work with wide data matrices having from a few hundred to tens of thousands of columns, I sometimes have to handle tall matrices. One such problem involved a little over 18 million milking records from more than 3 million cows. Associated with the cows were around 4 million health registrations that needed to be looked up in the cow table to assign additional attributes. After programming the lookup as a double for loop and testing it on a small subset of cows and registrations, I estimated that the whole lookup would take around 29 days on my fairly quick computer. After scratching my noodle and searching the web, I stumbled upon the match() function. This does exactly what I needed: for each element in the first vector, it returns the index of the first exact match in the second vector. Compared with the double loop, the time for the whole job dropped from an estimated 29 days to 0.89 seconds.
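
A small sketch of this kind of lookup is shown here. The two toy tables, their column names (cow_id, breed, diagnosis) and their values are made up for illustration and are not the original data.

```r
# Hypothetical toy tables; names and values are assumptions for this example.
cows <- data.frame(
  cow_id = c(101L, 102L, 103L, 104L),
  breed  = c("NRF", "Holstein", "NRF", "Jersey"),
  stringsAsFactors = FALSE
)
health <- data.frame(
  cow_id    = c(103L, 101L, 999L, 103L),
  diagnosis = c("mastitis", "ketosis", "mastitis", "lameness"),
  stringsAsFactors = FALSE
)

# For each health$cow_id, match() gives the position of its first exact
# match in cows$cow_id, or NA when the cow is not in the table.
idx <- match(health$cow_id, cows$cow_id)

# Indexing with idx copies the attribute across in one vectorised step;
# registrations without a matching cow get NA.
health$breed <- cows$breed[idx]
```

Here idx is 3, 1, NA, 3, so the unknown cow 999 simply ends up with a missing breed instead of stopping the lookup. The whole operation is two vectorised calls, with no explicit loop over registrations or cows.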
