R plot gossip

At the end of March, I began applying R to a few projects. There are lots of traps along the way. Sometimes it took me hours just to discover the peculiarity of some structure or function. For example, at first I thought it would be extremely easy to draw plots with ggplot in a loop. However, days after I had implemented a series of plots, I found that all the plots looked the same. S3 and S4 objects, as well as vectors, lists, and data frames, sometimes drive me crazy.

strange plots

When I tried to plot bars, I found some strange artifacts in several of the bar plots. Here is the task: I need to plot a list of data sets, each element as its own bar plot. First of all, I prepare the data for the plots as follows:

plot.list = lapply(up.list, function(y){
  # convert the element to a data frame, keeping all categories
  df = fortify(y, showCategory=Inf)
  log.m = as.data.frame(-log10(df$p.adjust))
  log.m$names = df$Description
  # sort by score, largest first, and keep at most the top 20 rows
  log.m <- log.m[order(log.m[,1],decreasing = TRUE),]
  showCategory = min(length(log.m[,1]), 20)
  log.m <- log.m[1:showCategory, ]
  # re-sort the kept rows in ascending order
  log.m <- log.m[order(log.m[,1],decreasing = FALSE),]
  return(log.m)
} )

fortify converts each element into a data frame. To sort log.m, use order(log.m[,1]) as the new row index. The next step keeps the top 20 rows of log.m (or fewer, if there are not that many) and then re-sorts those rows in ascending order.
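The sort, truncate, re-sort steps can be tried on a toy data frame without any extra packages. This is only a sketch: the df below is a made-up stand-in for what fortify returns, the column name score is my own, and top.n is set to 3 instead of 20 so the effect is visible.

```r
# Hypothetical stand-in for the data frame returned by fortify(); values are made up.
df <- data.frame(
  Description = c("term A", "term B", "term C", "term D"),
  p.adjust = c(0.01, 0.0001, 0.05, 0.001)
)

top.n <- 3  # the post uses 20

log.m <- data.frame(score = -log10(df$p.adjust), names = df$Description)

# Largest score first, then keep at most top.n rows.
# head() also copes with data frames that have fewer than top.n rows.
log.m <- head(log.m[order(log.m$score, decreasing = TRUE), ], top.n)

# Re-sort the kept rows in ascending order, as in the last step above.
log.m <- log.m[order(log.m$score), ]
print(log.m)
```

Using head() instead of 1:showCategory avoids computing the row count by hand, but the result should be the same as in the original code.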

Sometimes the descriptions are too long, so I looked for a way to cut off the longer strings, that is, to change the log.m$names = df$Description line. I searched for a substring function and found str_sub in the stringr package to do the job:

library(stringr)
log.m$names = unlist(lapply(df$Description, function(y){
  y <- trimws(y)
  # shorten descriptions of 53 or more characters to 50 plus an ellipsis
  if(str_length(y)>=53) {y = paste(str_sub(y, 1, 50), "...")}
  return(y)}))

The code seems to work, and the bars are supposed to be listed in order, but they are not:

To inspect the differences between the truncated names and the originals:

Note that the two names columns have different types: the truncated one is a character vector, whereas the original is a factor. To keep the types the same, I used as.factor() to convert the character vector to a factor:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y)>=53) {y = paste(str_sub(y, 1, 50), "...")}
  return(y)}))

It turns out that the plots still behave abnormally even though names is now a factor. Now the only difference between the two is their levels. So I effectively disabled the truncation, by raising the length threshold, to see whether the rest of the code works:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y)>=503) {y = paste(str_sub(y, 1, 500), "...")}
  return(y)}))

Yes, the stacked bars in the plots are gone now. My conclusion at this point was that str_sub was causing the abnormal plots, so the following code, which uses substr instead, should work as I expected:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y)>=33) {y = paste(substr(y, 1, 33), "...")}
  return(y)}))

But wait. ʕฅ•ω•ฅʔ I checked the names factor again after changing the code to:

if(str_length(y)>10) {y = substr(y, 1, 10)}

It seems that the shorter the kept length of the string, the worse the plot gets:

When I looked at the plot, I noticed the label negative r. It seems this label can be plotted more than once. Is it possible that all the bars labeled negative r are plotted together? First, how many times does negative r occur in the factor? 7 times.

 plot.list[[1]]$names

[1] ameboidal- negative r negative r endotheliu negative r negative r regulation negative r endothelia endothelia negative r epithelial
[13] negative r basement m collagen-c apical par positive r positive r regulation regulation
103 Levels: actin fila adrenal gl ameboidal- amyloid fi apical jun apical par apical pla basement m bicellular blood circ ... Z disc

Now, for plot.list[[1]], there are 6 joint lines, which means 7 bars are stacked on one line.
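Counting by eye gets tedious, so here is a small sketch that counts the repeated labels programmatically. The 20 labels are copied from the output above; the variable names labels and counts are my own.

```r
# The 20 truncated labels printed above for plot.list[[1]]$names.
labels <- factor(c(
  "ameboidal-", "negative r", "negative r", "endotheliu", "negative r",
  "negative r", "regulation", "negative r", "endothelia", "endothelia",
  "negative r", "epithelial", "negative r", "basement m", "collagen-c",
  "apical par", "positive r", "positive r", "regulation", "regulation"
))

counts <- table(labels)

# Labels that occur more than once: their bars share one position on the axis.
print(counts[counts > 1])

# Number of distinct axis positions versus number of bars.
print(c(positions = length(counts), bars = length(labels)))
```

Any label with a count above 1 is a candidate for the stacking seen in the plot, so negative r is not the only one affected here.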

To see the extreme case, we can give the names factor only one level, that is:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- as.character(trimws(y))
  if(str_length(y)>10) {y = substr(y, 1, 10)}
  return("aaa")}))

Maybe you can imagine what the plot looks like: all 20 bars are stacked together on one line.

If we want to keep the strings short and still not be bothered by the duplicate-label problem, we can use a hash function to make each label unique:

library(digest)
log.m$names = as.factor(sapply(df$Description, function(y){
  y <- as.character(trimws(y))
  if(str_length(y)>10) {
    hs <- digest(y, "xxhash32")
    y = paste(substr(y, 1, 10), hs)}
  return(y)}))
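If the hash suffix makes the labels hard to read, base R's make.unique() is a possible alternative that needs no extra package. This is a different technique from the hash approach, sketched here on made-up descriptions: it only adds a numeric suffix to labels that actually collide. The sep value " #" is an arbitrary choice of mine.

```r
# Hypothetical descriptions; the first two share the same first 10 characters.
descriptions <- c("negative regulation of cell migration",
                  "negative regulation of cell adhesion",
                  "Z disc")

width <- 10

# substr() and trimws() are vectorized, so no sapply() is needed here.
short <- substr(trimws(descriptions), 1, width)

# make.unique() leaves the first occurrence alone and numbers the repeats.
labels <- make.unique(short, sep = " #")
print(labels)

# Convert to a factor afterwards, as in the code above.
names.factor <- as.factor(labels)
```

The trade-off is that the suffix depends on the order of the input, whereas a hash depends only on the full description.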

The code for the plots is as follows:

lapply(seq_along(plot.list), function(y, i) {
  col <- y[[i]]
  ggplot(col, aes(reorder(x=col[,2], col[,1]), y=col[,1])) +
    geom_bar(stat="identity", fill= "#3399CC", color="grey50") +
    ggtitle(paste("title", i)) +
    theme(axis.text.y = element_text(size=15)) +
    scale_y_continuous(name="-log10(p-value)") +
    scale_x_discrete(name= "") +
    coord_flip()
}, y=plot.list)

All same plots

There are so many plots that it is a good idea to put multiple graphs on one page.

This is what happened: all the plots were the same after plotting in a loop. At first I thought the problem was that I had not preallocated memory for the list, because I had initialized it as follows:

plots = list()
for (i in seq(1, n, by=1)){
up.list[[i]] <- new-element
}

Then, I changed the initialization:

plots = vector("list", length(cluster.de))

Still, all plots are identical. The plotting code seems correct:

plots = vector("list", length(cluster.de))
for (id in seq(1, length(plot.list), by=1)) {
  col <- plot.list[[id]]
  plots[[id]] <- ggplot(col, aes(reorder(x=col[,2], col[,1]), y=col[,1])) +
    geom_bar(stat="identity", fill= "#3399CC", color="grey50") +
    ggtitle(paste("title", id)) +
    theme(axis.text.y = element_text(size=8)) +
    scale_y_continuous(name="-log10(p-value)") +
    scale_x_discrete(name= "") +
    coord_flip()
}

for (i in seq(1, length(plots), by=4)){
  ni = min(i+3, length(plots))
  p <- plot_grid(plotlist=plots[i:ni], ncol=2)
  print(p)
}

But in the 4 plots on one page, we can see that every plot is the same as the others.

For now, I haven't found a solution to this problem.
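One likely explanation, offered as a sketch rather than a confirmed diagnosis on the real data: aes() does not evaluate col[,1] and col[,2] when the plot object is created. It stores the expressions and evaluates them only when the plot is built for printing. By then the for loop has finished and col holds the last element of plot.list, so every plot draws the same bars. The lapply version is not affected in the same way, because each function call has its own col. Referring to columns by name inside aes() should avoid the problem. In the sketch below, demo.list, make_plot, and the column name score are hypothetical stand-ins, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical stand-in for plot.list: two data frames with named columns.
demo.list <- list(
  data.frame(score = c(1, 2, 3), names = c("a", "b", "c")),
  data.frame(score = c(10, 20, 30), names = c("d", "e", "f"))
)

# aes() refers to columns by name, so each plot reads its own data frame
# instead of looking up a loop variable at the moment it is drawn.
make_plot <- function(df, id) {
  ggplot(df, aes(x = reorder(names, score), y = score)) +
    geom_col(fill = "#3399CC", color = "grey50") +
    ggtitle(paste("title", id)) +
    coord_flip()
}

plots <- lapply(seq_along(demo.list),
                function(id) make_plot(demo.list[[id]], id))

# layer_data() returns the values each plot will actually draw,
# which makes it easy to check that the plots differ.
print(layer_data(plots[[1]])$y)
print(layer_data(plots[[2]])$y)
```

The resulting plots list can be passed to plot_grid(plotlist = ...) exactly as in the paging loop above.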