R plot gossip

Posted on 2019-04-20 | In Programming

At the end of March, I began applying R to some projects. There are lots of traps in the programming. Sometimes it took me hours just to discover a peculiarity of some structure or function. For example, at first I thought it would be extremely easy to plot with ggplot in a loop. However, days after I had implemented a series of plots, I found that all the plots looked identical. S3 and S4 objects, as well as vectors, lists and data frames, sometimes drive me crazy.

strange plots

When I tried to plot bars, I found some strange artifacts in several of the bar plots. Here is the task: I need to plot a list of data, with each element becoming one bar plot. First of all, I prepare the plot data as follows:

plot.list = lapply(up.list, function(y){
  df = fortify(y, showCategory = Inf)
  log.m = as.data.frame(-log10(df$p.adjust))
  log.m$names = df$Description
  log.m <- log.m[order(log.m[,1], decreasing = TRUE), ]
  showCategory = min(length(log.m[,1]), 20)
  log.m <- log.m[1:showCategory, ]
  log.m <- log.m[order(log.m[,1], decreasing = FALSE), ]
  return(log.m)
})

fortify is a way of converting the enrichment object into a data frame. To sort log.m, order(log.m[,1]) is simply used as the new row index. The next step is to take the top 20 rows of log.m and then sort those 20 elements again.
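
As a tiny standalone illustration of the order() trick (toy values and column names, not the real enrichment data):

toy <- data.frame(score = c(2.1, 5.7, 3.3, 9.8),
                  names = c("a", "b", "c", "d"))
toy <- toy[order(toy[, 1], decreasing = TRUE), ]  # order() returns the new row index
head(toy, 2)                                      # keep only the top rows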

The descriptions are sometimes too long, so I tried to find a way to cut off the longer strings, that is, to change the log.m$names = df$Description line. I searched for a substring function and found str_sub from the stringr package to do the job:

library(stringr)
log.m$names = unlist(lapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y) >= 53) {y = paste(str_sub(y, 1, 50), "...")}
  return(y)}))

The code seemed to work well, and the bars were supposed to be listed in order, but they were not:

To inspect the differences, here is a comparison between the cut-off names and the original ones:

Note that the types of the two name columns are different: the former is a character vector whereas the latter is a factor. To keep the same type, I use the as.factor() function to convert the character vector into a factor:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y) >= 53) {y = paste(str_sub(y, 1, 50), "...")}
  return(y)}))

It turns out that the plots still behave abnormally although the type of names has become a factor. Now the only difference left is the levels of the two factors. I effectively disabled the truncation to see whether the other parts work correctly:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y) >= 503) {y = paste(str_sub(y, 1, 500), "...")}
  return(y)}))

Yeah, the stacked bars disappear now. So I drew the conclusion that it was the str_sub function that led to the abnormal plots, and the following code, using substr instead, should work as I expected:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- trimws(y)
  if(str_length(y) >= 33) {y = paste(substr(y, 1, 33), "...")}
  return(y)}))

But wait ʕฅ•ω•ฅʔ. The problem came back when I checked the name factor after changing the code to:

if(str_length(y) > 10) {y = substr(y, 1, 10)}  # paste(str_sub(y, 1, 20), "...")

It seems that the shorter the kept string length, the more severe the problem in the plot:

When I looked at the plot, I noticed the label "negative r". It seems this "negative r" is plotted more than once. Is it possible that all the bars with the "negative r" label are plotted together? First, let's see how many times "negative r" appears in the factor: 7 times.

 plot.list[[1]]$names

[1] ameboidal- negative r negative r endotheliu negative r negative r regulation negative r endothelia endothelia negative r epithelial
[13] negative r basement m collagen-c apical par positive r positive r regulation regulation
103 Levels: actin fila adrenal gl ameboidal- amyloid fi apical jun apical par apical pla basement m bicellular blood circ ... Z disc

Now, for plot.list[[1]], that bar shows 6 visible joints, which means 7 bars are stacked on one line.
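
To see why duplicated labels end up on one line, here is a minimal sketch with made-up values: with stat = "identity", ggplot2 stacks the rows that share the same x label into a single bar.

library(ggplot2)
# Two different descriptions truncated to the same label "negative r" (toy values):
df <- data.frame(names = c("negative r", "negative r", "endotheliu"),
                 value = c(2, 3, 4))
# Rows that share a label are drawn at the same position and stacked, so the two
# "negative r" rows show up as a single bar of total height 5.
ggplot(df, aes(x = names, y = value)) +
  geom_bar(stat = "identity") +
  coord_flip()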

To see the extreme case, we can set only one level for the name factor, that is:

log.m$names = as.factor(sapply(df$Description, function(y){
  y <- as.character(trimws(y))
  if(str_length(y) > 10) {y = substr(y, 1, 10)}
  return("aaa")}))

Maybe you can imagine what the plot looks like: all 20 bars are stacked together on one line.

If we want to keep the strings short and at the same time not be bothered by the duplicate-label problem, we can introduce a hash function to make the labels unique:

library(digest)
log.m$names = as.factor(sapply(df$Description, function(y){
  y <- as.character(trimws(y))
  if(str_length(y) > 10) {
    hs <- digest(y, "xxhash32")
    y = paste(substr(y, 1, 10), hs)}
  return(y)}))

The plotting code is as follows:

lapply(seq_along(plot.list), function(y, i) {
  col <- y[[i]]
  ggplot(col, aes(x = reorder(col[,2], col[,1]), y = col[,1])) +
    geom_bar(stat = "identity", fill = "#3399CC", color = "grey50") +
    ggtitle(paste("title", i)) +
    theme(axis.text.y = element_text(size = 15)) +
    scale_y_continuous(name = "-log10(p-value)") +
    scale_x_discrete(name = "") +
    coord_flip()
}, y = plot.list)

All same plots

There are too many plots, so it is a good idea to put multiple graphs on one page.

This is what happened: all the plots were identical after plotting in a loop. At first I thought the problem was that I had not pre-allocated memory for the list, because I had initialized it like this:

plots = list()
for (i in seq(1, n, by=1)){
  plots[[i]] <- new_element   # pseudo-code: build each element here
}

Then, I changed the initialization:

plots = vector("list", length(cluster.de))

Still, all the plots are identical. The plotting code seems correct:

plots = vector("list", length(cluster.de))
for (id in seq(1, length(plot.list), by=1)) {
col <- plot.list[[id]]
plots[[id]] <- ggplot(col, aes(reorder(x=col[,2], col[,1]), y=col[,1])) +
geom_bar(stat="identity", fill= "#3399CC", color="grey50") +
ggtitle(paste("title", id)) +
theme(axis.text.y = element_text(size=8)) +
scale_y_continuous(name="-log10(p-value)") +
scale_x_discrete(name= "") +
coord_flip()
}

for (i in seq(1, length(plots), by=4)){
ni = min(i+3, length(plots))
p <-plot_grid(plotlist=plots[i:ni], ncol=2)
print(p)
}

But the four plots on one page show that every plot is the same as the others.

For now, I haven't found a solution to this problem.
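
For reference, a commonly cited explanation of this behaviour is that aes() evaluates expressions such as col[,1] lazily, so every stored plot reads the last value of col when it is finally printed. Below is a minimal sketch of a possible workaround, assuming that is indeed the cause here; the value/names column names are my own, and plot_grid() is assumed to come from cowplot.

library(ggplot2)
library(cowplot)   # plot_grid() is assumed to come from cowplot

plots <- lapply(seq_along(plot.list), function(id) {
  col <- plot.list[[id]]
  names(col) <- c("value", "names")        # refer to the columns by name instead of col[, i]
  ggplot(col, aes(x = reorder(names, value), y = value)) +
    geom_bar(stat = "identity", fill = "#3399CC", color = "grey50") +
    ggtitle(paste("title", id)) +
    scale_y_continuous(name = "-log10(p-value)") +
    scale_x_discrete(name = "") +
    coord_flip()
})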

Set up a Seurat environment through Anaconda

Posted on 2019-04-08 | In Programming

I'm using a computational cluster to run my R environment, but I don't have root or sudo access.
Usually I would choose Docker to set up a completely independent environment. That doesn't work here, because launching a Docker daemon requires root permission.

At first I used Linuxbrew to set up the environment, but it also breaks down sometimes when it comes to library dependencies.
The main reason is that Linuxbrew only installs software into the user's directory, while we still depend on some system libraries. For example, I found that the version of libstdc++ did not meet one program's requirement.
I installed several versions of gcc and tried to swap the related libraries for my own copies, but in the end even vim refused to work.

The good news is that I recently learned Anaconda supports building an R environment quite nicely.
First of all, download Anaconda3-2018.12-Linux-x86_64.sh according to your operating system.
Then install the software. I need the Seurat library, so a few other packages are required as well:

sh Anaconda3-2018.12-Linux-x86_64.sh
cd Anaconda3
source bin/activate
conda create -n r_env r-essentials r-base
conda activate r_env
conda install hdf5
conda install -c r r-git2r

Then install all the libraries you need inside R:

install.packages("hdf5r")
install.packages("Seurat")
install.packages(c("WriteXLS", "devtools", "urltools", "pheatmap", "shinythemes","BiocManager"))
library(devtools)
install_github("ggjlab/scMCA")
BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("piano", version = "3.8")

Generating PDFs from rmarkdown

It is easy to generate HTML through knitr. However, generating a PDF can sometimes be an obstacle. The most common advice is Pandoc.
However, when I tried to generate PDFs, everything that was well formatted in the HTML became really unacceptable. After some searching, I decided to generate the PDF through the webshot library in R. The installation steps are as follows:

install.packages("webshot")
webshot::install_phantomjs()

Now you can generate a PDF from an R script:

library(webshot)
webshot("x.html", "y.pdf")

Problems emerged when opening the PDF, because webshot produces the PDF as one huge page without breaks, which makes it hard for a PDF viewer to load. So I had to find another tool. html-pdf solves the paging problem; to install it:

conda install nodejs
npm install -g html-pdf

Easter egg: a knitter script to generate HTML or PDFs from Rmd

I just improved a script from Stack Overflow:

#!/bin/bash

### Test usage; if incorrect, output correct usage and exit
if [ "$#" -ne 2 ]; then
    echo "********************************************************************"
    echo "* Knitter version 1.0                                              *"
    echo "********************************************************************"
    echo -e "The 'knitter' script converts Rmd files into HTML or PDFs. \n"
    echo -e "usage: knitter file.Rmd file.{pdf,html} \n"
    echo -e "Spaces in the filename or directory name may cause failure. \n"
    exit
fi

# Stem and extension of file
extension1=`echo $1 | cut -f2 -d.`
extension2=`echo $2 | cut -f2 -d.`
name_without_ext=`echo $1 | cut -f1 -d.`
out_without_ext=`echo $2 | cut -f1 -d.`

### Test if the input file exists
if [[ ! -r $1 ]]; then
    echo -e "\n File does not exist, or option misspecified \n"
    exit
fi

### Test the file extension
if [[ $extension1 != Rmd ]]; then
    echo -e "\n Invalid input file, must be an Rmd file \n"
    exit
fi

# Create a temporary script
# Use the user-defined 'TMPDIR' if possible; else, use /tmp
if [[ -n $TMPDIR ]]; then
    pathy=$TMPDIR
else
    pathy=/tmp
fi

# Tempfile for the script
tempscript=`mktemp $pathy/tempscript.XXXXXX` || exit 1

if [[ $extension2 == "pdf" ]]; then
    echo "library(rmarkdown); rmarkdown::render('"${1}"', 'html_document');" >> $tempscript
    Rscript $tempscript
    html-pdf "${name_without_ext}".html "${2}"
    rm ${name_without_ext}.html
    echo generated "${2}"
fi

if [[ $extension2 == "html" ]]; then
    echo "library(rmarkdown); rmarkdown::render('"${1}"', 'html_document')" >> $tempscript
    Rscript $tempscript
    mv ${name_without_ext}.html ${2}
    echo generated "${2}"
fi

if [[ $extension2 == "both" ]]; then
    echo "library(rmarkdown); rmarkdown::render('"${1}"', 'html_document');" >> $tempscript
    Rscript $tempscript
    html-pdf "${name_without_ext}".html ${out_without_ext}.pdf
    mv ${name_without_ext}.html ${out_without_ext}.html
    echo generated "${out_without_ext}".html and "${out_without_ext}".pdf
fi

Mint-soft

Posted on 2019-03-17 | In soft

Essential software

A programmer's lifeline: the shell

First install zsh, then google "oh my zsh" and install it, then install my favorite plugin, jump, so I can jump to whatever directory I want.

Dictionary: GoldenDict

As a foreigner, the first task is looking up words, and the best choice on Linux is GoldenDict. For German, the Duden and Wahrig dictionaries are both quite good; for English I recommend the Collins bilingual dictionary.

Markdown editor

The first choice for a Markdown editor is obviously Typora, which works smoothly on all three major operating systems.

Cloud notebook

Whether on Windows or on Mac, WizNote is the most convenient note-taking app. WizNote now charges a fee, which is a good thing: staying free forever would probably be unsustainable, although charging will certainly drive away a lot of users. I have used WizNote for more than three years, but it does not perform well on Linux. I tried Gitnote and it feels quite good: as long as you have a GitHub account and create a repository of your own, you get the safest and most convenient place to store notes.

Looking for an Alfred replacement

On Mac I have been spoiled by Alfred; without it everything feels wrong. A quick search and I actually found one: Synapse. After trying it briefly it feels very good; I changed the shortcut to ALT+Space and set it to start on boot, which is quite comfortable.

Chinese input method

In fact, the choice of input methods on Linux is basically settled: ibus or fcitx. Since I am more used to shuangpin (double pinyin), my options are even fewer. After looking at several schemes I finally chose RIME (the Rime input method engine). On Mac I once used Sogou Pinyin and uninstalled it almost immediately; for quite a while I used Baidu Pinyin instead. On the one hand, Baidu has no bottom line in anything it does; on the other hand, when my wife uses my computer she is driven crazy by my shuangpin settings. In the end I found RIME really suits my taste. I did not expect to run into it again on Linux: google it, do a bit of configuration, and shuangpin works great.

I'll install these first and add more later as needed. Oh, and as for the extremely ugly Thunderbird, I give up on it; I'll try Nylas tomorrow.

install-Seurat

Posted on 2019-03-17 | In Programming

Installing the Seurat package on Mint

Mint is a Linux operating system based on Debian and Ubuntu. The lab's standard development environment is Mint. Running

cat /etc/linuxmint/info

shows that my Mint version is Mint 19. Open RStudio and follow Seurat's installation steps:

install.packages('Seurat')
library(Seurat)

The very first step failed: ggplot2 could not be installed because another package it depends on could not be found. I am not in the lab right now and did not take notes at the time, so I cannot recall that package's name. A Google search showed that the package requires R > 3.5, so my main task became upgrading my current R 3.4 to 3.5. There are plenty of tutorials online, all based on updating recv-keys, for example:

apt-key adv --keyserver keyserver.ubuntu.com --recv-keys xxxxxx
sudo apt-get update
sudo apt-get upgrade r-base r-base-dev

After several attempts, including editing the source.list file, everything failed. The final solution was to remove all packages of the old R version and install the new version from scratch.

sudo apt purge r-base* r-recommended r-cran-*
sudo apt autoremove
sudo apt update
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux ubuntu bionic-cran35/'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt update

After installing r-base-core, the rest of the installation can proceed. Opening RStudio shows that it is now R 3.5.

Adding blog comments the simple way

Posted on 2019-03-10

The blog's comment system

I wanted to get the blog's comment system up again. Previously I used Disqus, but after redeploying, most pages would not display, and I did not want to use Disqus anymore. Then I saw that someone had creatively built a comment system on top of GitHub, namely Gitment.
Following the tutorial I set up Gitment on GitHub, only to learn that Gitment depends on a request service hosted by its author, who no longer maintains it. The instructions are:

npm i --save gitment

and then modify the JS yourself to connect to a server you host yourself. WTF?

/node_modules/gitment/dist/gitment.browser.js

The detailed modification process can be found at:
https://sherry0429.github.io/2019/02/12/gitment%E4%BF%AE%E5%A4%8D/

I kept looking for other comment systems and found this article:
https://wangjiezhe.com/posts/2018-10-29-Hexo-NexT-3/
Following its tutorial I installed utterances, which looks quite good so far. The only thing is that what I see now is a single global comment thread.
I don't really understand the issue-term option and don't want to dig into it for now, so I simply set it to pathname.

Change theme to Next

Posted on 2019-03-10

No more fiddling with themes: obediently switching to NexT

Once again, for the second time early this year, the maupassant theme of my Hexo blog stopped working. With all the miscellaneous things going on at the start of the year, I simply gave up on fixing the blog, which meant new posts could not be published; the errors would just have to wait.

Now that I have a little time, I want to sort the blog out properly. The simplest approach, which has always worked, is to reinstall the whole environment: first Hexo, then maupassant. This time it failed; hexo generate produced a pile of errors. I even felt maupassant was beyond fixing: searching for the error messages, I found nobody who had run into the same problem. In the end I suspected that special characters in the formulas of my posts were breaking the Markdown parsing. I modified the hexo-renderer-marked JS, but the problem remained. Finally I decided to switch to another theme, yet as soon as I migrated all the posts over, it errored out again. The error was:

INFO  Start processing
FATAL Something's wrong. Maybe you can find the solution here: http://hexo.io/docs/troubleshooting.html
Template render error: (unknown path) [Line 65, Column 565]
expected variable end
at Object._prettifyError (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/lib.js:36:11)
at Template.render (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/environment.js:542:21)
at Environment.renderString (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/environment.js:380:17
... ...

Fine, start from scratch: add the files back one at a time and run hexo generate each time. I finally found a problematic file. For now I just commented it out; later I will slowly track down which special character causes the problem. At least the blog can be updated again.

Relapse

After settling on the NexT theme there was another setback: once I added the Disqus comment system, the blog turned into a blank page with only the file header displayed. Adding features feels great for a moment, but debugging is hell. I simply could not remember which change had broken the blog, so I had to start over once more. While looking at themes I noticed that hexo-theme-apollo, ranked fourth by stars, has stopped development; one sentence from its author convinced me to stop fiddling with themes:

Focusing on writing content beats polishing the blog's appearance. Have fun, everyone.

Follow-up work

  1. Track down which special character makes hexo generate fail.
  2. See whether the comment system can be restored; if not, leave it for now, as long as it does not hold up writing posts.

Expectation and Variance of Poisson Distribution

Posted on 2017-08-07

The pmf of the Poisson distribution is as follows:

$$f(X=k;\lambda)=\frac{\lambda^k e^{-\lambda}}{k!}$$

Our aim is to derive the expectation $E(X)$ and the variance $Var(X)$. The formula for the expectation is:
$$
E(X)=\sum_{k=0}^{\infty} k \frac{\lambda^k e^{-\lambda }}{k!}
$$

Notice that the $k=0$ term is equal to 0, that is:

$$\sum_{k=0}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}\Large|_{k=0}=0$$

Then the formula becomes:

$$E(X)=\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}$$

$$\begin{aligned}E(X)&=\sum_{k=1}^{\infty} k \frac{\lambda^k e^{-\lambda}}{k!}=\sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda}}{(k-1)!}\\&=\sum_{k=1}^{\infty} \frac{\lambda^{k-1}\lambda e^{-\lambda}}{(k-1)!}\\&=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}\end{aligned}$$

Now we take advantage of the Taylor expansion; recall that:

$$e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\cdots+\frac{x^{k-1}}{(k-1)!}+\cdots=\sum_{k=1}^{\infty}\frac{x^{k-1}}{(k-1)!}$$

Comparing this with $E(X)$, we get:

$$E(X)=\lambda e^{-\lambda}e^\lambda=\lambda$$

Since $Var(X)=E(X^2)-(E(X))^2$, we just need $E(X^2)$. Given that:

$$E(X)=\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}=\lambda$$

we can use this formula to derive $E(X^2)$:

$$\begin{aligned}E(X)=&\sum_{k=1}^{\infty} k \frac{\lambda^k e^{-\lambda}}{k!}=\lambda\\
\Leftrightarrow&\sum_{k=1}^{\infty} k \frac{\lambda^k}{k!}=\lambda e^{\lambda}\\
\Leftrightarrow&\frac{\partial\sum_{k=1}^{\infty} k \frac{\lambda^k}{k!}}{\partial \lambda}=\frac{\partial \lambda e^{\lambda}}{\partial \lambda}\\
\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k-1}}{k!}=e^\lambda+\lambda e^\lambda\\
\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k-1}e^{-\lambda}}{k!}=1+\lambda \\
\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k}e^{-\lambda}}{k!}=\lambda+\lambda^2=E(X^2)\end{aligned}$$

then,

$$Var(X)=E(X^2)-(E(X))^2=\lambda+\lambda^2-(\lambda)^2=\lambda$$

Thus, we have proved that the expectation and the variance of the Poisson distribution are both $\lambda$.
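
As a quick numerical sanity check (not part of the derivation), one can simulate Poisson draws in R and compare the sample mean and variance with $\lambda$:

set.seed(1)
lambda <- 4
x <- rpois(1e6, lambda)  # one million Poisson draws
mean(x)                  # close to lambda = 4
var(x)                   # also close to lambda = 4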

Why is the sample variance divided by N-1? (translation)

Posted on 2017-06-17

Original author: Vincent Spruy

Translator: Mingbo Cheng

Link to the English article

Link to the Chinese translation

Translator's note: for historical reasons, both "Gaussian distribution" and "normal distribution" refer to the distribution whose probability density function has the form $\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. In this article I will use the term "normal distribution".

Introduction

In this article, as the title suggests, I will derive the well-known formulas for the mean and variance of normally distributed data. For readers who are not interested in the "why" of this question but only in "when to use which", the answer is quite simple:

If you want to estimate the mean and variance of a data sample (the typical case), then the variance formula divides by $N-1$, i.e.:

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2$$

In the other case, if the true mean of the population is known, then the variance formula divides by $N$, i.e.:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2$$

However, the former is the situation you will encounter more often. In a moment I will give an example of estimating the dispersion of Gaussian white noise; in that example the mean of the noise is known to be 0, so only the variance needs to be estimated.

If the data is normally distributed, it is completely characterized by its mean $\mu$ and its variance $\sigma^2$. The variance is the square of the standard deviation $\sigma$, which represents the average distance of each data point from the mean; in other words, the variance expresses how spread out the data is. For a normal distribution, 68.3% of the data has a value between $\mu-\sigma$ and $\mu+\sigma$. The picture below shows the probability density function of a normal distribution with mean $\mu=10$ and variance $\sigma^2=3^2=9$:

Figure 1. Probability density function of a normal distribution. For normally distributed data, 68% of the samples fall within one standard deviation of the mean.

Usually we cannot obtain the whole population. In the example above, the typical situation is that we have a few observations, but not all possible observations along the x-axis of the figure. For example, we might have the following observations:

Table 1

Observation ID    Observed value
Observation 1     10
Observation 2     12
Observation 3     7
Observation 4     5
Observation 5     11

If we now compute the empirical mean by adding up all values and dividing by the number of observations, we get:

$$\mu=\frac{10+12+7+5+11}{5}=9\tag{1}$$.

We usually assume that the empirical mean is close to the unknown true mean of the distribution, so we can assume that the observed data comes from a normal distribution with mean $\mu=9$. In this example the true mean of the distribution was 10, so the empirical mean is indeed close to the true mean.

The variance of the data is computed as follows:

$$\begin{aligned}\sigma^2 &= \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2\\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4}\\&= 8.5.\end{aligned}\tag{2}$$

Likewise, we usually assume that the empirical variance is close to the true, unknown variance of the underlying distribution. In this example the true variance was 9, so the empirical variance is indeed close to the true variance.

The question at hand is then why the formulas we use to compute the empirical mean and the empirical variance are correct. In fact, another frequently used formula for the variance is defined as follows:

$$\begin{aligned}\sigma^2 &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{5} \\&= 6.8.\end{aligned}\tag{3}$$

The only difference between equation (2) and equation (3) is that the former divides by $N-1$ whereas the latter divides by $N$. Both formulas are actually correct; which one to use depends on the situation.

In the following sections we give, for a sample drawn from a normal distribution, the complete derivation of the best estimators of its unknown variance and mean. We will see that in some situations the variance is divided by $N$ and in others by $N-1$.

A formula that approximates a parameter (the mean or the variance) is called an estimator. In the following, we denote the true but unknown parameters of the distribution by $\hat{\mu}$ and $\hat{\sigma}^2$, whereas the estimators, i.e. the empirical mean and empirical variance, are denoted by $\mu$ and $\sigma^2$.

To find the optimal estimators, we first need an analytical expression for the likelihood of observing a specific data point $x_i$, given a population that is normally distributed with mean $\mu$ and standard deviation $\sigma$. A normal distribution with known parameters is usually written as $N(\mu,\sigma^2)$. Its likelihood function is:

$$x_i \sim N(\mu,\sigma^2) \Rightarrow P(x_i; \mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}.\tag{4}$$

To be able to compute the mean and variance, we obviously need more than one sample from this distribution. In the following, let $\vec{x}=(x_1,x_2,\cdots,x_N)$ be the vector containing all available samples (e.g. all the values from Table 1). If all these samples are statistically independent, we can write their joint likelihood as the product of the individual likelihoods:

$$\begin{aligned}P(\vec{x};\mu,\sigma^2)&=P(x_1,x_2,\cdots,x_N;\mu,\sigma^2)\\&=P(x_1;\mu,\sigma^2)P(x_2;\mu,\sigma^2)\cdots P(x_N;\mu,\sigma^2)\\&=\prod_{i=1}^{N}P(x_i;\mu,\sigma^2)\end{aligned}.\tag{5}$$

Plugging equation (4) into equation (5) gives an analytical expression for the joint probability density function:

$$\begin{aligned}P(\vec{x};\mu,\sigma)&=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\\&=\frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2}\end{aligned}.\tag{6}$$

Equation (6) will be important in the following sections. We will use it to derive the well-known estimators of the mean and variance of a normal distribution.

Minimum variance, unbiased estimators

To decide whether an estimator is a "good" estimator, we first need to define what a "good" estimator really is. The goodness of an estimator depends on two measures, namely its bias and its variance (yes, we will talk about the variance of the mean estimator and the variance of the variance estimator). Both measures are briefly discussed in this section.

Parameter bias

Imagine that we could obtain different (disjoint) subsets of the complete population. In analogy with the earlier example, suppose that apart from the data in Table 1 we also had a Table 2 and a Table 3 with different observations. A good estimator of the mean would then be one that, on average, equals the true mean: we can accept that the empirical mean of one subset does not equal the true mean, but a good estimator should guarantee that the average of the mean estimates over all subsets equals the true mean. This constraint is expressed mathematically by stating that the expected value of the estimator should equal the parameter value:

$$E(\mu)=\hat{\mu}\qquad E(\sigma^2)=\hat{\sigma}^2.\tag{7}$$

If the above condition holds, the estimators are called "unbiased". If it does not hold, the estimators are called "biased", meaning that on average they either underestimate or overestimate the true value of the parameter.

Parameter variance

Unbiased estimators guarantee that, on average, the values they estimate equal the true parameter. However, this does not mean that each individual estimate is a good one. For example, if the true mean is 10, an unbiased estimator could estimate the mean of one subset of the population as 50 and of another as -30. The expected value of the estimates would indeed be 10, equal to the true parameter, but the quality of the estimator clearly also depends on how spread out the individual estimates are. An estimator that produces the estimates (10, 15, 5, 12, 8) for five different subsets is unbiased, and so is one that produces (50, -30, 100, -90, 20) (translator's note: the last value in the original was 10; I changed it to 20 so that the average is 10). However, all estimates of the first estimator are obviously much closer to the true value than those of the second.

Therefore, a good estimator should not only have low bias but also low variance. This variance is expressed as the mean squared error of the estimator:

$$Var(\mu)=E[(\hat{\mu}-\mu)^2]$$

$$Var(\sigma^2)=E[(\hat{\sigma}-\sigma)^2]$$

A good estimator is therefore one with low bias and low variance. If an optimal estimator exists, it is an unbiased estimator whose variance is lower than that of every other possible estimator. Such an estimator is called the minimum variance unbiased (MVU) estimator. In the next section we derive the mathematical expressions for the mean and variance estimators of a normal distribution. We will see that the MVU variance estimator of a normal distribution requires division by $N$ under one set of assumptions and by $N-1$ under another.

Maximum likelihood estimation

Although there are many techniques for obtaining an estimator of a parameter from a subset of the population, the simplest of them all is probably maximum likelihood estimation.

The probability of the observations $\vec{x}$ was defined in equation (6) as $P(\vec{x};\mu,\sigma^2)$. If we fix $\mu$ and $\sigma^2$ in this function and let $\vec{x}$ vary, we obtain the normal distribution of Figure 1. However, we can also fix $\vec{x}$ and let $\mu$ and/or $\sigma^2$ vary. For example, we can choose $\vec{x}=(10,12,7,5,11)$ as in the earlier example, fix $\mu=10$, and let $\sigma^2$ vary. Figure 2 shows the curve obtained by plotting the likelihood against $\sigma^2$ for these fixed $\vec{x}$ and $\mu$:

Figure 2. The likelihood function for the specific observations $\vec{x}$ with $\mu=10$ fixed, as $\sigma^2$ varies.

In the figure above we computed the likelihood $P(\vec{x};\sigma^2)$ by fixing $\mu=10$ and letting $\sigma^2$ vary. Each point on the resulting curve represents the likelihood that the observations $\vec{x}$ are a sample from a normal distribution with that value of $\sigma^2$. The parameter value corresponding to the maximum likelihood is the one most likely to have generated the data from the distribution we defined; we can therefore determine the optimal $\sigma^2$ by finding the maximum of the likelihood curve. In this example the maximum lies at $\sigma^2=7.8$, so the standard deviation is $\sqrt{\sigma^2}=2.8$. Indeed, if we compute the variance in the traditional way with $\mu=10$ given, we find that it is exactly 7.8:

$$\frac{(10-10)^2+(12-10)^2+(7-10)^2+(5-10)^2+(11-10)^2}{5}=7.8$$

So the formula for computing the variance from sample data is simply obtained by finding the peak of the likelihood function. Furthermore, instead of fixing $\mu$, we can let both $\mu$ and $\sigma^2$ vary and find the maximum of the resulting two-dimensional likelihood function for both estimators.

Finding the maximum of a function is simple: take the derivative and set it to zero. To find the maximum of a function of two variables, we compute the partial derivative with respect to each variable and set both to zero. In the following, let $\hat{\mu}_{ML}$ be the optimal estimator of the population mean obtained by maximum likelihood, and let $\hat{\sigma}^2_{ML}$ be the optimal variance estimator. To maximize the likelihood we simply compute its (partial) derivatives and set them to zero, as follows:

$$\begin{aligned} &\hat{\mu}_{ML} = \arg\max_{\mu} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0 \end{aligned}$$

and

$$\begin{aligned} &\hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0 \end{aligned}$$

In the next section we use this technique to obtain the MVU estimators of $\mu$ and $\sigma^2$. We consider two cases:

In the first case we assume that the true mean $\hat{\mu}$ of the distribution is known. Then we only need to estimate the variance, and the problem reduces to finding the maximum of a one-dimensional likelihood function parameterized by $\sigma^2$. This situation is not common, but it does occur in practice. For example, if we know that a signal (e.g. the color value of a pixel in an image) should have a specific value, but the signal has been polluted by white noise (Gaussian noise with zero mean), then the mean of the distribution is known and only the variance needs to be estimated.

The second case deals with the situation where neither the true mean nor the true variance is known. This is the most common case, in which we have to estimate both the mean and the variance from the sample data.

We will see below that each case yields a different MVU estimator. Specifically, in the first case the variance estimator is normalized by $N$, whereas in the second case it is divided by $N-1$.

Estimating the variance when the mean is known

Parameter estimation

If the true mean of the distribution is known, the likelihood function has only one parameter, $\sigma^2$. Finding the maximum likelihood estimator then corresponds to solving:

$$\hat{\sigma}^2_{ML}=\arg\max_{\sigma^2} P(\vec{x};\sigma^2).\tag{8}$$

However, computing the derivative of $P(\vec{x};\sigma^2)$ as defined in equation (6) involves differentiating the exponential in the function. It is in fact much simpler to compute the derivative of the log-likelihood instead of the likelihood itself. Since the logarithm is a monotonically increasing function, its maximum is located at the same position as the maximum of the original likelihood function. We therefore solve instead:

$$\hat{\sigma}^2_{ML}=\arg\max_{\sigma^2}\log(P(\vec{x};\sigma^2)).\tag{9}$$

In the following, let $s=\sigma^2$ to simplify the notation. We maximize the log-likelihood by setting the derivative of the logarithm of equation (6) to zero:

$$\begin{aligned}
&\frac{\partial \log(P(\vec{x};\sigma^2))}{\partial \sigma^2}=0\\
\Leftrightarrow&\frac{\partial\log(P(\vec{x};s))}{\partial s}=0\\
\Leftrightarrow&\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi s)^{\frac{N}{2}}}e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2} \right)=0\\
\Leftrightarrow&\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi)^{\frac{N}{2}}}\right)+\frac{\partial}{\partial s}\log\left(s^{-\frac{N}{2}}\right)+\frac{\partial}{\partial s} \log\left(e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2}\right)=0\\
\Leftrightarrow&0+\frac{\partial}{\partial s}\left(-\frac{N}{2}\log (s)\right)+\frac{\partial}{\partial s}\left(-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2\right)=0\\
\Leftrightarrow&-\frac{N}{2s}+\frac{1}{2s^2}\sum_{i=1}^{N}(x_i-\mu)^2=0\\
\Leftrightarrow&\frac{N}{2s^2}\left(\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2-s\right)=0
\end{aligned}$$

Clearly, if $N>0$, the only solution of this equation is:

$$s=\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2.\tag{10}$$

Note that this maximum likelihood estimator of $\hat{\sigma}^2$ is indeed the traditional formula for computing the variance, with normalization factor $\frac{1}{N}$.

However, maximum likelihood estimation does not guarantee that the resulting estimator is unbiased. Moreover, even if the estimator is unbiased, maximum likelihood does not guarantee that it has minimum variance, i.e. that it is MVU. We therefore need to check whether the estimator of equation (10) is unbiased.

Performance evaluation

To determine whether the estimator of equation (10) is unbiased, we need to check whether the condition of equation (7) holds, i.e. whether:

$$E(s)=\hat{s}.$$

We plug equation (10) into $E(s)$ and compute:

$$\begin{aligned}E[s] &= E \left[\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[x_i^2 - 2x_i \mu + \mu^2 \right]\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu E[x_i] + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu^2 + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right)\end{aligned}$$

Furthermore, the true variance $\hat{s}$ has the very important property that $\hat{s}=E[x_i^2]-E[x_i]^2$, which can be rewritten as $E[x_i^2]=\hat{s}+E[x_i]^2=\hat{s}+\mu^2$. Using this property, we can continue from the formula above:

$$\begin{aligned}E[s]&=\frac{1}{N}(N E[x_i^2]-N\mu^2)\\&=\frac{1}{N}(N\hat{s}+N\mu^2-N\mu^2)\\&=\frac{1}{N}(N\hat{s})\\&=\hat{s}\end{aligned}$$

The condition $E[s]=\hat s$ of equation (7) is satisfied, so the estimator we obtained for the variance $\hat s$ of the data is unbiased. Moreover, since a maximum likelihood estimator that is unbiased is also minimum variance (MVU), no other estimator can do better than the one we obtained.

Therefore, when the true mean of the distribution is known, we compute the variance of normally distributed data by dividing by $N$, not by $N-1$.

Estimating the variance when the mean is unknown

Parameter estimation

In the previous section the true mean of the distribution was known, so we only had to estimate the variance of the data. However, if the true mean is unknown, an estimator of the mean has to be computed as well.

Moreover, the variance estimator has to make use of this mean estimator. We will see that as a result the variance estimator we derived before is no longer unbiased. We will then slightly increase the value of the variance estimator, by dividing by N-1 instead of N, to make it unbiased again.

As before, we use maximum likelihood estimation, based on the log-likelihood, to obtain both estimators. We first compute the maximum likelihood estimator of $\hat\mu$:

$$\begin{aligned}&\frac{\partial \log(P(\vec{x}; s, \mu))}{\partial \mu} = 0\\
\Leftrightarrow& \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\
\Leftrightarrow& \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial \mu} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\
\Leftrightarrow& \frac{\partial}{\partial \mu} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\
\Leftrightarrow& -\frac{1}{2s} \left(\sum_{i=1}^N -2(x_i - \mu) \right) = 0\\
\Leftrightarrow& \frac{1}{s} \left(\sum_{i=1}^N (x_i - \mu) \right) = 0 \\
\Leftrightarrow& \frac{N}{s} \left( \frac{1}{N} \sum_{i=1}^N x_i - \mu \right) = 0 \end{aligned}$$

Clearly, if $N>0$, this equation has only one solution:

$$\mu=\frac{1}{N}\sum_{i=1}^{N}x_i.\tag{11}$$

Note that this is indeed the well-known formula for computing the mean of a distribution. Although we all knew this formula already, we have now proved that the maximum likelihood estimator estimates the true, unknown mean of a normal distribution. For now, we simply assume that the variance estimator $\hat s$ we derived earlier in equation (10) is still the MVU variance estimator; in the next section, however, we will show that this estimator is now biased.

Performance evaluation

To determine whether the condition of equation (7) still holds, we check whether the estimator $\mu$ of the true mean $\hat \mu$ is unbiased:

$$E[\mu]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_i\right]=\frac{1}{N}\sum_{i=1}^N E[x_i]=\frac{1}{N}N E[x_i]=\frac{1}{N} N \hat\mu=\hat\mu.$$

Since $E[\mu]=\hat\mu$, our estimator of the distribution mean is unbiased. Because maximum likelihood estimation guarantees minimum variance whenever the estimator is unbiased, we have proved that $\mu$ is the MVU estimator of the mean.

We now check whether the variance estimator $s$, which is now based on the empirical mean $\mu$ instead of the true mean $\hat\mu$, is still an unbiased estimator of the true variance $\hat s$. We simply plug the obtained estimator $\mu$ into the formula derived earlier in equation (10):

$$\begin{aligned} s &= \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\&=\frac{1}{N}\sum_{i=1}^N \left(x_i - \frac{1}{N} \sum_{i=1}^N x_i \right)^2\\&=\frac{1}{N}\sum_{i=1}^N \left[x_i^2 - 2 x_i \frac{1}{N} \sum_{i=1}^N x_i + \left(\frac{1}{N} \sum_{i=1}^N x_i \right)^2 \right]\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{i=1}^N x_i}{N^2} + \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\end{aligned}$$

We now check once more whether the condition of equation (7) holds, to decide whether this estimator is unbiased:

$$\begin{aligned} E[s]&= E \left[ \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \right ]\\&= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \end{aligned}$$

Recall the very important property of the variance used earlier: the true variance $\hat s$ can be written as $\hat s = E[x_i^2]-E[x_i]^2$, i.e. $E[x_i^2]=\hat s + E[x_i]^2=\hat s +\mu^2$. Using this property, we derive:

$$\begin{aligned} E[s] &= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= s + \mu^2 - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= s + \mu^2 - \frac{E[\sum_{i=1}^N x_i^2 + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2}\\&= s + \mu^2 - \frac{E[N(s+\mu^2) + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + \sum_i^N \sum_{j\neq i}^N E[x_i] E[x_j]}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + N(N-1)\mu^2}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + N^2\mu^2 -N\mu^2}{N^2}\\&= s + \mu^2 - \frac{s+\mu^2 + N\mu^2 -\mu^2}{N}\\&= s + \mu^2 - \frac{s}{N} - \frac{\mu^2}{N} - \mu^2 + \frac{\mu^2}{N}\\&= s - \frac{s}{N}\\&= s \left( 1 - \frac{1}{N} \right)\\&= s \left(\frac{N-1}{N} \right) \end{aligned}$$

Clearly $E[s]\neq\hat s$, so this estimator of the variance of the distribution is no longer unbiased. In fact, on average this estimator underestimates the true variance by a factor of $\frac{N-1}{N}$. As the number of samples approaches infinity ($N\rightarrow\infty$) the bias goes to zero; for small sample sets, however, the bias is significant and needs to be eliminated.

Correcting the bias

Since the bias is merely a factor, we can eliminate it by multiplying the estimator of equation (10) by the inverse of the bias. We thus define an unbiased estimator $s'$ as follows:

$$\begin{aligned} s' &= \left ( \frac{N-1}{N} \right )^{-1} s\\s' &= \left ( \frac{N-1}{N} \right )^{-1} \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s' &=\left ( \frac{N}{N-1} \right ) \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s' &= \frac{1}{N-1}\sum_{i=1}^N(x_i - \mu)^2\end{aligned}$$

This estimator is now unbiased, and indeed it resembles the traditional formula for the variance, dividing by $N-1$ instead of $N$. Note, however, that the resulting estimator is no longer the minimum variance estimator, although it is the estimator with minimum variance among all unbiased estimators. If we divide by $N$ the estimator is biased; if we divide by $N-1$ it is not the minimum variance estimator. In general, though, a biased estimator is much worse than one with a slightly higher variance. Therefore, when the mean of the population is unknown, the variance should be divided by $N-1$ instead of $N$.
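
As a small illustration of my own (not part of the original article), a quick R simulation makes the bias visible: dividing by N systematically underestimates the true variance, while dividing by N-1 is unbiased on average:

set.seed(42)
N <- 5                    # small sample, true distribution N(mu = 10, sigma^2 = 9)
est_N   <- replicate(1e5, {x <- rnorm(N, 10, 3); mean((x - mean(x))^2)})  # divide by N
est_Nm1 <- replicate(1e5, {x <- rnorm(N, 10, 3); var(x)})                 # var() divides by N-1
mean(est_N)    # about 7.2 = 9 * (N-1)/N, i.e. biased low
mean(est_Nm1)  # about 9, i.e. unbiased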

Summary

In this article we derived the well-known formulas for computing the mean and variance of normally distributed data. Moreover, we showed that the normalization factor of the variance estimator is $\frac{1}{N}$ when the true population mean is known, and $\frac{1}{N-1}$ when the mean also has to be estimated.

PDF of this article

Decision Tree (ID3)

Posted on 2017-05-21 | In Algorithm

Preface

On April 9th, 2017, an incident occurred on United Airlines in which UA crew beat up a passenger and dragged him off a plane that was about to take off, attracting attention all around the world. Many people asked: how can a company that is so rude to passengers exist in this world? Actually, UA does well precisely because it has an extremely precise emergency-situation procedure, computed from big-data analysis. Computers can help us make decisions, but they have no emotions; this is effective in most cases, yet it cannot always be accepted by us human beings. Let's take a look at how an algorithm makes a decision:

This is a decision tree, which roughly represents how the UA algorithm made its decision. First of all, before take-off, four UA employees needed to fly from Chicago to Kentucky. The algorithm then checked whether any seats were left; if so, the passengers were safe for the moment. But UA3411 was full, so the algorithm began assessing the relative importance of employees and passengers. Obviously the algorithm considered the crew more important, for business reasons. Then, how to choose who should be evicted from the plane? The real algorithm was more complicated than the tree I drew, but "Asian or not" was one of the criteria. Why? Because Asians are seen as pushovers. The passenger agreed at first, but when he heard that he would have to wait a whole day, he realized he could not treat his patients and refused. Then he was beaten up and dragged off the plane.

As you have seen, this is a decision tree, which is similar to the human decision-making process. Decision trees are a simple but powerful algorithm in machine learning; in fact, you often use decision-tree-like reasoning yourself when making decisions.

Introduction

Decision trees are a classification and regression algorithm; we build a tree from statistics of the data. Today we only talk about how to classify a dataset using a decision tree. First we introduce some background knowledge from information theory, then we build a decision tree on the iris data using the ID3 algorithm.

Iris data

The Iris dataset is a very famous dataset deposited in the UCI machine learning repository. It describes three kinds of iris, with four feature columns as follows:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm

The last column represents iris categories:

  • Iris-setosa (n=50)
  • Iris-versicolor (n=50)
  • Iris-virginica (n=50)

Here, our task is to use this dataset to train a model and generate a decision tree. During the process we need to calculate some statistics to decide how to grow a better tree.

The dataset is very small so that you can easily download it and take a look.

Entropy and Information Gain

Entropy

Before talking about decision trees, I'd like to introduce some concepts from information theory. Entropy was originally a concept from thermodynamics; C. E. Shannon introduced it into information theory in 1948 as a measure of information (or uncertainty). It sounds like a strange concept, but it is actually easy to understand. For example, consider the knockout stage of the World Cup, with 16 teams. Suppose I know which team will win the championship and you have to find out by asking me yes/no questions: how many questions do you need? First you split the 16 teams into two groups of 8 and ask whether the winner is in the first 8 or in the other 8; I tell you it is in the other 8. Then you split those 8 teams again and ask whether the winner is in the first 4 or the other 4, and so on. The number of questions you need is the entropy of "who wins the championship":

$$Entropy(champion) = {\rm log}_2 16 = 4$$

That is, we can use 4 bits to represent which team wins. You may ask why we split the teams into two parts rather than three or four: that is because computers represent the world in binary, and $2^4=16$ means 4 bits can encode 16 outcomes. In this way entropy can quantify any piece of information. And if you already know which team will win the championship, the entropy is 0, because you need no further information to deduce the outcome.

Entropy really represents uncertainty. In ancient China, history had to be recorded on bamboo slips, which forced writers to use as few characters as possible. That means the entropy of a single classical Chinese character is higher than that of the words we speak today: if just a few characters were lost, a lot of the story would be lost with them. On the other hand, many songs start with "Yo, yo, check it now", which carries almost no information; we could drop those words and still interpret the song precisely. The entropy of such a phrase is low.

Assume $X$ is a discrete random variable with distribution
$$P(X=x_i)=p_i$$
then the entropy of $X$ is:
$$H(X)=-\sum_{i=1}^{n}p_i {\rm log}_2 p_i$$
where, if $p_i=0$, we define $0{\rm log}0 = 0$.

This equation may not look related to the entropy we computed in the championship example, so let's redo that calculation. Here $X$ represents which team wins the game, and we assume all teams are at the same level, so we have
$$p(X=x_1)=p(X=x_2)=p(X=x_3)=\cdots = p(X=x_{16})=\frac{1}{16}$$
the entropy is
$$H(X)=-\sum_{i=1}^{16}\frac{1}{16}{\rm log}_2 \frac{1}{16}=-16\times\frac{1}{16}\times {\rm log}_2 {2^{-4}}=4$$

Bingo, the answer is the same. In fact, if we have more information, the entropy is lower than 4: for example, the probability that Germany wins is higher than that of some Asian teams.

Entropy and Iris Data

Now we calculate the entropy of the iris data, which will be used to fit a decision tree in the following sections. We only care about the categories (setosa, versicolor and virginica). Recall the equation for entropy:
$$H(X)=-\sum_{i=1}^{n}p_i {\rm log}_2 p_i$$

The three kinds of flowers have 50 samples each, so the probability of each category is the same:
$$p_1=p_2=p_3=\frac{50}{50+50+50}=\frac{1}{3}$$
Then, the entropy is pretty easy to calculate
$$H(X)=-1\times (\frac{1}{3}{\rm log}_2\frac{1}{3}+\frac{1}{3}{\rm log}_2\frac{1}{3}+\frac{1}{3}{\rm log}_2\frac{1}{3})=1.5850$$
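
The same numbers can be checked with a couple of lines of R (a throwaway sketch):

entropy <- function(p) -sum(p * log2(p))  # entropy of a probability vector (no zero entries)
entropy(rep(1/3, 3))    # 1.584963: the three equally likely iris classes
entropy(rep(1/16, 16))  # 4: the 16-team World Cup example above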

Conditional Entropy

The meaning of conditional entropy is just what its name suggests. For random variables $(X, Y)$, the joint distribution is
$$P(X=x_i, Y=y_j)=p_{ij},\quad i=1,2,\cdots,m;\ j=1,2,\cdots,n$$
The conditional entropy $H(Y|X)$ represents the uncertainty of $Y$ that remains once we know the random variable $X$. It is defined as:
$$H(Y|X)=\sum_{i=1}^m p_i H(Y|X=x_i)$$
Here, $p_i=P(X=x_i)$.

Conditional Entropy and Iris Data

We now calculate a conditional entropy as an example. First, I randomly pick 15 rows of sepal length together with their categories. The result is as follows:

No. sepal length in cm categories
1 5.90 Iris-versicolor
2 7.20 Iris-virginica
3 5.00 Iris-versicolor
4 5.00 Iris-setosa
5 5.90 Iris-versicolor
6 5.70 Iris-setosa
7 5.20 Iris-versicolor
8 5.50 Iris-versicolor
9 4.80 Iris-setosa
10 4.60 Iris-setosa
11 6.50 Iris-versicolor
12 5.20 Iris-setosa
13 7.70 Iris-virginica
14 6.40 Iris-virginica
15 6.00 Iris-versicolor

The octave code is

1
2
3
4
5
6
7
%% octave
[a,b,c,d, cate] = textread("iris.data", "%f%f%f%f%s","delimiter", ",");

for i=1:15
x = floor(rand()*150);
fprintf('%f %s\n', a(x), cate{x} );
end

We take these 15 items as an example. Suppose we divide sepal length into two parts: greater than the mean and less than the mean. The mean is
$$mean = (5.90+7.2+\cdots+6.00)/15 = 5.7733$$
There are 8 elements less than 5.7733 and 7 greater ones. That is:

mean     indices greater than mean     indices less than mean
5.7733   1, 2, 5, 11, 13, 14, 15       3, 4, 6, 7, 8, 9, 10, 12

We let $x_1$ = greater (1, 2, 5, 11, 13, 14, 15) and $x_2$ = less (3, 4, 6, 7, 8, 9, 10, 12), then
$$H(Y|X=x_1)=-(p_1 {\rm log}_2 p_1 + p_2 {\rm log}_2 p_2 + p_3 {\rm log}_2 p_3)=-\left(\frac{4}{7}{\rm log}_2\frac{4}{7}+\frac{3}{7}{\rm log}_2\frac{3}{7}+0\,{\rm log}_2 0\right)=0.98523$$
$$H(Y|X=x_2)=-(p_1 {\rm log}_2 p_1 + p_2 {\rm log}_2 p_2+p_3 {\rm log}_2 p_3)=-\left(\frac{3}{8}{\rm log}_2\frac{3}{8}+0\,{\rm log}_2 0+\frac{5}{8}{\rm log}_2\frac{5}{8}\right)=0.95443$$

The conditional entropy is then
$$H(Y|X)=\sum_{i=1}^{2}p_i H(Y|x_i)=\frac{7}{15}\times 0.98523+\frac{8}{15}\times 0.95443=0.96880$$

Information Gain

Just as its name implies, information gain is the information we gain after adding a feature; in other words, how much uncertainty disappears when that information is added. For example, if I ask you to guess an NBA player, the uncertainty is very high, but only a few players remain on the list once I tell you he is Chinese: knowing the "Chinese" feature decreases the uncertainty, so you gain information. Information gain is calculated as
$$IG(Y, X)= H(Y)-H(Y|X)$$
Here we want to predict $Y$ from feature $X$. It is simply the entropy of $Y$ minus the conditional entropy of $Y$ given $X$. The meaning is obvious as well: $H(Y)$ is the uncertainty of $Y$, $H(Y|X)$ is the uncertainty of $Y$ once $X$ is known, and the difference is the information gain.
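
As a sketch, the information gain of a binary split can be computed directly from the class counts in the two branches; the counts used below are the ones from the 15-sample sepal-length example above:

H <- function(counts) {                       # entropy from raw class counts
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}
info.gain <- function(parent, left, right) {  # class-count vectors of the node and its two children
  n <- sum(parent)
  H(parent) - (sum(left) / n * H(left) + sum(right) / n * H(right))
}
# Counts ordered as (setosa, versicolor, virginica):
info.gain(parent = c(5, 7, 3), left = c(0, 4, 3), right = c(5, 3, 0))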

Information Gain and Iris Data

In this section I apply the information gain equations to the whole iris dataset. First, let $Y$ represent the iris category, and let $X_1,X_2,X_3,X_4$ represent sepal length, sepal width, petal length and petal width respectively.

We have computed that $H(Y)=1.5850$; next we calculate the four conditional entropies $H(Y|X_1),H(Y|X_2),H(Y|X_3),H(Y|X_4)$. Since the $X$ values are continuous, we split each feature at its mean. Then
$$\overline{X_1}=5.8433,\,\overline{X_2}=3.0540,\,\overline{X_3}=3.7587,\,\overline{X_4}=1.1987$$

$$H(Y|X_1)=\sum_{i=1}^2 p_i H(Y|X_{1i})=-\left(\frac{70}{150}\left(\frac{0}{70}{\rm log}_2\frac{0}{70}+\frac{26}{70}{\rm log}_2\frac{26}{70} +\frac{44}{70}{\rm log}_2\frac{44}{70}\right)+\frac{80}{150}\left(\frac{50}{80}{\rm log}_2\frac{50}{80}+\frac{24}{80}{\rm log}_2\frac{24}{80}+\frac{6}{80}{\rm log}_2\frac{6}{80}\right)\right)=1.09757$$

$$H(Y|X_2)=\sum_{i=1}^2 p_i H(Y|X_{2i})=-\left(\frac{67}{150}\left(\frac{42}{67}{\rm log}_2\frac{42}{67}+\frac{8}{67}{\rm log}_2\frac{8}{67}+\frac{17}{67}{\rm log}_2\frac{17}{67}\right)+\frac{83}{150}\left(\frac{8}{83}{\rm log}_2\frac{8}{83}+\frac{42}{83}{\rm log}_2\frac{42}{83}+\frac{33}{83}{\rm log}_2\frac{33}{83}\right)\right)=1.32433$$

$$H(Y|X_3)=\sum_{i=1}^2 p_i H(Y|X_{3i})=-\left(\frac{93}{150}\left(\frac{0}{93}{\rm log}_2\frac{0}{93}+\frac{43}{93}{\rm log}_2\frac{43}{93}+\frac{50}{93}{\rm log}_2\frac{50}{93}\right)+\frac{57}{150}\left(\frac{50}{57}{\rm log}_2\frac{50}{57}+\frac{7}{57}{\rm log}_2\frac{7}{57}+\frac{0}{57}{\rm log}_2\frac{0}{57}\right)\right)=0.821667$$

$$H(Y|X_4)=\sum_{i=1}^2 p_i H(Y|X_{4i})=-\left(\frac{90}{150}\left(\frac{0}{90}{\rm log}_2\frac{0}{90}+\frac{40}{90}{\rm log}_2\frac{40}{90}+\frac{50}{90}{\rm log}_2\frac{50}{90}\right)+\frac{60}{150}\left(\frac{50}{60}{\rm log}_2\frac{50}{60}+\frac{10}{60}{\rm log}_2\frac{10}{60}+\frac{0}{60}{\rm log}_2\frac{0}{60}\right)\right)=0.854655$$
The information gains are then easy to compute:
$$IG(Y, X_1)=H(Y)-H(Y|X_1)=1.5850-1.09757=0.487427$$

$$IG(Y, X_2)=H(Y)-H(Y|X_2)=1.5850-1.32433=0.260669$$

$$IG(Y, X_3)=H(Y)-H(Y|X_3)=1.5850-0.821667=0.763333$$

$$IG(Y, X_4)=H(Y)-H(Y|X_4)=1.5850-0.854655=0.730345$$
We find that $IG(Y, X_3)$ is larger than the others, which means feature $X_3$ supplies the most information.

ID3(Iterative Dichotomiser 3)

The ID3 algorithm was developed by Ross Quinlan in 1986 and is a classic algorithm, as are C4.5 and CART. We first compute the information gain of each feature on the iris data, then choose the feature with the maximum gain to divide the data into two parts. For each part we apply the same procedure recursively until all the data in a node belongs to a single category. Since we already know the information gains from the last section, we obviously choose $X_3$ as the first splitting feature.

Let's take a look at the first split using feature $X_3$. We start with 150 items; after testing whether $X_3>3.7587$, we divide the data into two parts, one with 93 items and the other with 57. From the table below we can see that there is no setosa in node B and no virginica in node C, which means this feature splits the data very well, since it separates setosa from virginica.

Node B Node C
setosa 0 50
versicolor 43 7
virginica 50 0

The stopping condition of the algorithm is decided by the IG. When the IG is below some threshold, or when only one category is left, we stop. If the IG is below the threshold (e.g. 0.01) while more than one category remains, we have to choose a final category for the leaf; the rule is to pick the category with the most samples.

Take node H for example, with the IG threshold set to 0.01. We calculate the information gain for each feature; the biggest IG comes from feature 2 (sepal width in cm) and is 0.003204, which is less than 0.01, so we have to make H a leaf. The leaf contains 0 Iris-setosa, 25 Iris-versicolor and 44 Iris-virginica, so we assign the majority class (Iris-virginica) to it.

Summary

Today we talked about the decision tree algorithm. First I introduced three background concepts: entropy, conditional entropy and information gain. Then we applied the ID3 algorithm to the iris data to build a decision tree.

One of the most significant advantages of decision trees is that the result can be explained. If the algorithm decided that UA should drag a passenger off, the company could trace the tree to find the chain of reasons. It is also very useful for telling customers why we recommend something to them; in such circumstances we can train a decision tree model.

One shortcoming is that information gain tends to favour features with many distinct values. To resolve this problem, Ross Quinlan improved the algorithm by using the information gain ratio rather than the IG. Breiman subsequently introduced the CART algorithm, which can be applied to classification as well as regression. More recently, researchers have developed even more powerful algorithms such as Random Forest and Gradient Boosted Decision Trees.

Reference

  1. 《统计学习方法》,李航
  2. 《数学之美》,吴军
  3. http://www.shogun-toolbox.org/static/notebook/current/DecisionTrees.html
  4. https://en.wikipedia.org/

Appendix code

%% octave main function file
%% iris data dowload link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
[a,b,c,d, cate] = textread("iris.data", "%f%f%f%f%s","delimiter", ",");


%for i=1:15
% x = floor(rand()*150);
% fprintf('%f %s\n', a(x), cate{x} );
%end;

features = [a, b, c, d];
for i=1:length(features(1, :))
col = features(:, i);
me = mean(col);
disp(me);
feat(i).greater = find(col > me);
feat(i).less = find(col <= me);
end

total = (1:150)';
decision(feat, length(features(1, :)), cate, total);
fprintf('\n')
%% octave: decision tree file
function decision(feat, feat_size, cate, total)
if length(total) == 0
return
end
fprintf('(-%d-)', length(total));
%plogp = @(x)[x*log2(x)];
function e = plogp(pi)
if pi == 0
e = 0;
else
e = pi*log2(pi);
end
end

function d = div(a, b)
if b == 0
d = 0;
else
d = a/b;
end
end

debug = 0;

function m = maxc(cate, cates, total)
maxidx = 1;
max_c = 0;
for i=1:length(cates)
c =find(strcmp(cate, cates{i}));
cl = length(intersect(c, total));
if debug == 1 fprintf('\n%d##%d %s###',i, cl, char(cates{i})) end
%if (debug == 1 && cl <10 && cl >0) disp(intersect(c, total)') end
if cl > max_c
max_c = cl;
maxidx = i;
end
end
if debug == 1 fprintf('\n****%d %d******\n', maxidx, max_c) end
%m = cates(maxidx);
m = maxidx;
end


% compute h(y)
cates = unique(cate);
hx = 0;
for i = 1:length(cates)
c = find(strcmp(cate, cates{i}));
rc = intersect(c, total);
hx -= plogp(length(rc)/length(total));
end
%fprintf('hx = %f\n', hx)
% compute h(y|x)
max_feature = 1;
max_ig = 0;

max_left = intersect(feat(1).greater, total);
max_right = intersect(feat(1).less, total);
for i=1:feat_size
hxh = 0;
hxl = 0;
feat_greater = intersect(feat(i).greater, total);
feat_less = intersect(feat(i).less, total);
ge = length(feat_greater);
le = length(feat_less);

if (ge+le) == 0
continue
end

for j = 1:length(cates);
c = find(strcmp(cate, cates{j}));
xh = length(intersect(feat_greater, c));
xl = length(intersect(feat_less, c));
hxh -= plogp(div(xh, ge));
hxl -= plogp(div(xl, le));
end
% compute hx - h(y|x)
hxy = (ge/(ge+le))*hxh + ((le)/(ge+le))*hxl;
ig = hx - hxy;

if ig > max_ig
max_ig = ig;
max_feature = i;
max_left= feat_less;
max_right = feat_greater;
end

end

left = max_left;
right = max_right;
%fprintf('feature:ig %d %f %d %d ------ \n', max_feature, max_ig, length(left), length(right));


if debug == 1 printf("\033[0;32;1m-ig--%f \033[0m", max_ig); end
if(max_ig < 0.01)
%fprintf('<%s>', char(maxc(cate, cates, total)))
printf("\033[0;31;1m<%d>\033[0m", maxc(cate, cates, total));
return
end
fprintf("\033[0;34;1m#%d \033[0m", max_feature);
fprintf('{' )
decision(feat, feat_size, cate, left);
decision(feat, feat_size, cate, right);
fprintf('}')
end

A Tutorial on Singular Value Decomposition

Posted on 2017-05-01 | In Algorithm

Preface

Under some circumstances we want to compress data to save storage space. For example, when the iPhone 7 was released, many people were trapped in a dilemma: should I buy a 32G iPhone without enough free space, or a 128G one with a lot of storage wasted? I had indeed been trapped in that dilemma. I still remember having only 8G of storage in total on my first Android phone, and what annoyed me most were my thousands of photos; I confess I have always been a mad picture taker. I knew there were techniques that compress a picture by reducing pixels, but that is not enough because, as you know, over some regions of a picture the pixels share nearly the same color. An extreme example: for a pure-color picture, all we need to know is the RGB value and the size, and reproducing the picture is done without extra effort. What I was dreaming of is done perfectly by Singular Value Decomposition (SVD).

Introduction

Before SVD, I will first introduce some mathematical concepts covering linear transformations and eigenvectors & eigenvalues. This background knowledge is meant to make SVD straightforward. You can skip it if you are already familiar with it.

Linear transformation

Given a matrix $A$ and a vector $\vec{x}$, we want to compute the product of $A$ and $\vec{x}$:

$$\vec{x}=\begin{pmatrix}1\\3\end{pmatrix}\qquad A=\begin{pmatrix}2 & 1 \\ -1 & 1 \end{pmatrix}\qquad\vec{y}=A\vec{x}$$

But what happens when we do this multiplication? Actually, when we multiply $A$ and $\vec{x}$, we are changing the coordinate axes in which the vector $\vec{x}$ is expressed. Let's begin with a simpler example and let

$$A=\begin{pmatrix}1 & 0\\ 0 &1\end{pmatrix}$$

then we have
$$A\vec{x}=\begin{pmatrix}1 & 0\\ 0 &1\end{pmatrix}\begin{pmatrix}1\\3\end{pmatrix}=\begin{pmatrix}1\\3\end{pmatrix}$$

You may have noticed that we always get the same $\vec{x}$ back after left-multiplying by this $A$. In this case we use the coordinate axes $i=\begin{pmatrix}1 \\ 0\end{pmatrix}$ and $j=\begin{pmatrix}0 \\ 1\end{pmatrix}$, as the figure below demonstrates. That is, if we want to represent $\begin{pmatrix}1\\3\end{pmatrix}$ in this coordinate system, we can compute the transformation as follows:

\begin{align} A\vec{x}=1\cdot i + 3\cdot j = 1\cdot \begin{pmatrix}1 \\ 0\end{pmatrix} + 3\cdot \begin{pmatrix}0 \\ 1\end{pmatrix}=\begin{pmatrix}1\\3\end{pmatrix}\end{align}

As we know, a vector can be placed anywhere in space, and to compute the sum of two vectors the simplest way is to connect them head to tail. In our example, computing $A\vec{x}$ means adding the two vectors (the green dashed lines) up, and the answer is still $\begin{pmatrix}1\\3\end{pmatrix}$.

Now we take $i=\begin{pmatrix}2\\ -1\end{pmatrix}$ and $j=\begin{pmatrix}1\\1\end{pmatrix}$ as the coordinate axes (the red vectors), which means $A=\begin{pmatrix}2 & 1 \\ -1 & 1\end{pmatrix}$. I put the original vectors (the black ones) in the figure as well, so we can see what happens when we switch to the new coordinate axes.

First we multiply $j$ by 3 and $i$ by 1. Then we move vector $3\cdot j$ so that its tail connects to the head of $i$. We can now read off the coordinates of $1\cdot i+3\cdot j$ (the blue vector). Let's verify the result by multiplying $A$ and $\vec{x}$:

$$A\vec{x}=\begin{pmatrix}2 & 1 \\ -1 & 1\end{pmatrix}\begin{pmatrix}1\\3\end{pmatrix}=1\cdot \begin{pmatrix}2 \\ -1\end{pmatrix} + 3\cdot \begin{pmatrix}1 \\ 1\end{pmatrix}=\begin{pmatrix}5\\2\end{pmatrix}$$


Here you can think of the matrix $A$ as a function $f(x)\rightarrow y$: when you substitute an $x$, you get exactly one $y$ according to the rule $f$. In fact, the multiplication transforms the vector from one coordinate system to another.
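
The same transformation can be checked numerically in R (a one-off sketch):

A <- matrix(c(2, -1, 1, 1), nrow = 2)  # columns are the new axes i = (2,-1) and j = (1,1)
x <- c(1, 3)
A %*% x                                # (5, 2), matching the computation above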

Exercise

  1. $A=\begin{pmatrix}1 & 2 \\ 3 & 4\end{pmatrix}$, draw a picture of how it stretches and rotates $x=\begin{pmatrix}1\\3\end{pmatrix}$.
  2. Find a matrix $A$ that rotates $\vec{x}=\begin{pmatrix}1\\3\end{pmatrix}$ by $90^{\circ}$ and by $180^{\circ}$.
  3. What if $A=\begin{pmatrix}1 & 2 \\ 2 & 4\end{pmatrix}$?

EigenVector and EigenValue

Eigenvectors and eigenvalues are an extremely important concept in linear algebra and are used everywhere, including in the SVD we are discussing today. However, many people do not know how to interpret them. In fact, eigenvectors and eigenvalues are easy to understand once we know what a linear transformation is.

A Problem

Before we start, let's look at a question: if we want to multiply a matrix by itself 1000 times, how can we calculate the result efficiently?
$$AAA\cdots A= \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\cdots \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}=\prod_{i=1}^{1000}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}$$

Intuition

In the last section we saw that multiplying a vector by a matrix $A$ means using $A$ to stretch and rotate the vector, in order to represent it in a new coordinate system. However, for a given $A$ there are some vectors that are only stretched and never rotated. Assume $A=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}$ and let $\vec{x}=\begin{pmatrix}1 \\ -1\end{pmatrix}$. When we multiply $A$ and $\vec{x}$:

$$A\vec{x}=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}\begin{pmatrix}1 \\ -1\end{pmatrix}=\begin{pmatrix}2 \\ -2\end{pmatrix}=2\cdot \begin{pmatrix}1 \\ -1\end{pmatrix}$$

It turns out we can choose any vector along $\vec{x}$ and the outcome is the same, for example:

$$A\vec{x}=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}\begin{pmatrix}-3 \\ 3\end{pmatrix}=\begin{pmatrix}-6 \\ 6\end{pmatrix}=2\cdot \begin{pmatrix}-3 \\ 3\end{pmatrix}$$

We call vectors like $\begin{pmatrix}-3 \\ 3\end{pmatrix}$ and $\begin{pmatrix}1 \\ -1\end{pmatrix}$ eigenvectors, and 2 the corresponding eigenvalue. In practice we usually choose unit eigenvectors (length equal to 1), given that there are innumerable eigenvectors along the same line.

I won't cover how to compute these vectors and values here and just list the answer:

\begin{align}&\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix}
\begin{pmatrix}{-1}/{\sqrt{2}} \\ {1}/{\sqrt{2}}\end{pmatrix}=
2\begin{pmatrix}{-1}/{\sqrt{2}} \\ {1}/{\sqrt{2}}\end{pmatrix} \qquad\qquad\vec{x_1}=\begin{pmatrix}{-1}/{\sqrt{2}} \\ {1}/{\sqrt{2}}\end{pmatrix} \qquad\lambda_1=2\\
&\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix}
\begin{pmatrix}1 \\ 0\end{pmatrix}\qquad=\qquad
3\begin{pmatrix}1 \\ 0\end{pmatrix}\qquad\qquad\quad\,\,\,\vec{x_2}=\begin{pmatrix}1 \\ 0\end{pmatrix}
\qquad\lambda_2=3
\end{align}
Notice that $|\vec{x_1}|=1$ and $|\vec{x_2}|=1$
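
R's eigen() reproduces these values (a quick check, up to the sign of the eigenvectors):

A <- matrix(c(3, 0, 1, 2), nrow = 2)  # the matrix ((3,1),(0,2)), entered column by column
e <- eigen(A)
e$values   # 3 and 2
e$vectors  # unit eigenvectors, e.g. (1,0) and (-1/sqrt(2), 1/sqrt(2)) up to sign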

EigenValue Decomposition

If we put two EigenVectors and corresponding EigenValues together, we can get the following equation:
$$AQ=\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix}
\begin{pmatrix}
{-1}/{\sqrt{2}}&1\\
{1}/{\sqrt{2}}&0
\end{pmatrix}=
\begin{pmatrix}
{-1}/{\sqrt{2}}&1 \\ {1}/{\sqrt{2}}&0
\end{pmatrix}
\begin{pmatrix}
2 & 0\\
0 & 3
\end{pmatrix}=Q\Lambda
$$
Then we have $AQ=Q\Lambda$. The conclusion still holds if we introduce more dimensions, that is
\begin{align}
A\vec{x_1}=\lambda_1\vec{x_1}\\
A\vec{x_2}=\lambda_2\vec{x_2}\\
\vdots\qquad\\
A\vec{x_k}=\lambda_k\vec{x_k}
\end{align}

$$Q=
\begin{pmatrix}
x_{11}& x_{21} &\cdots& x_{k1}\\
x_{12}& x_{22} &\cdots& x_{k2}\\
\vdots& \vdots& &\vdots\\
x_{1m}& x_{2m} &\cdots& x_{km}
\end{pmatrix}
\qquad\Lambda=
\begin{pmatrix}
\lambda_1 & 0 & \cdots&0\\
0 &\lambda_2&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&\cdots&\cdots&\lambda_k
\end{pmatrix}$$

If we right-multiply both sides of $AQ=Q\Lambda$ by $Q^{-1}$, we have:
$$AQQ^{-1}=A=Q\Lambda Q^{-1}$$
This is the eigenvalue decomposition (EVD).

Resolution

Now let's look at the question from the beginning of this section:
\begin{align}
AAA\cdots A&= \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\cdots \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}=\prod_{i=1}^{1000}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\\
AAA\cdots A &= Q\Lambda Q^{-1}Q\Lambda Q^{-1}Q\Lambda Q^{-1}\cdots Q\Lambda Q^{-1}=Q\left(\prod_{i=1}^{1000}\Lambda\right) Q^{-1}\\
AAA\cdots A &=Q\Lambda\Lambda\cdots \Lambda Q^{-1}=
Q\begin{pmatrix}2^{1000} & 0 \\0 & 3^{1000}\end{pmatrix}Q^{-1}
\end{align}
The calculation is extremely simple using EVD.
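
A sketch of the same trick in R, using a smaller power so the numbers stay finite (3^1000 overflows a double):

A <- matrix(c(3, 0, 1, 2), nrow = 2)
e <- eigen(A)
Q <- e$vectors
A10_evd  <- Q %*% diag(e$values^10) %*% solve(Q)              # Q Lambda^10 Q^{-1}
A10_loop <- Reduce(`%*%`, replicate(10, A, simplify = FALSE)) # multiply A ten times
all.equal(A10_evd, A10_loop)                                  # TRUE up to floating point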

Exercise

  1. Research how to compute eigenvectors and eigenvalues, then compute them for $\begin{pmatrix}1 & 2 & 3\\4 & 5 &6\\7 & 8 & 9\end{pmatrix}$.
  2. Think about which factor determines how many eigenvalues we can get.

Singular Value Decompositon

Notice that eigenvalue decomposition is applied to square matrices. Is there an approach to decompose non-square matrices? The answer is yes, and its name is Singular Value Decomposition.

Intuition

First of all, let’s take a look at what SVD looks like

From the picture we can see that the matrix $A$ is decomposed into 3 components: $U$, $\Sigma$ and $V^{T}$. $U$ and $V^T$ are both square matrices, and $\Sigma$ has the same size as $A$. I also want to emphasize that $U$ and $V^T$ are orthogonal matrices, which means $U^T=U^{-1}$ and $V^T=V^{-1}$.

Deduction

In the Linear Transformation section we transformed a vector into another coordinate system. Assume you have a non-square matrix $A$, and you want to transform from the vectors $V=(\vec{v_1}, \vec{v_2},\cdots,\vec{v_n})^T$ to another coordinate system $U=(\vec{u_1}, \vec{u_2},\cdots,\vec{u_n})^T$. The point is that the $\vec{v_i}$ and $\vec{u_i}$ have unit length and all directions are perpendicular, i.e. each $\vec{v_i}$ is at a right angle to every other $\vec{v_j}$; we call such matrices orthogonal matrices. In addition, we need a factor $\Sigma=(\sigma_1,\sigma_2,\sigma_3,\cdots,\sigma_n)$ representing the stretch along each direction $\vec{u_i}$, i.e. we want to transform $A$ from $V=(\vec{v_1}, \vec{v_2},\cdots,\vec{v_n})^T$ to $(\sigma_1 \vec{u_1},\sigma_2 \vec{u_2}, \sigma_3 \vec{u_3},\cdots,\sigma_n \vec{u_n})^T$. From the picture below we can see that we want to transform the circle coordinate system into the ellipse one.

\begin{align}
\vec{v_1}, \vec{v_2}, \vec{v_3},\cdots,\vec{v_n} \qquad\rightarrow \qquad &\vec{u_1},\vec{u_2},\vec{u_3},\cdots,\vec{u_n}\\
&\sigma_1,\sigma_2, \sigma_3,\cdots,\sigma_n
\end{align}

Recall that $A$ maps each direction $\vec{v_i}$ to a scaled new direction $\sigma_i\vec{u_i}$, so we have
$$A \vec{v_1}=\sigma_1 \vec{u_1}\\
A \vec{v_2}=\sigma_2 \vec{u_2}\\
\vdots\\
A \vec{v_j}=\sigma_j \vec{u_j}$$

\begin{align}
&\begin{pmatrix}\\A\\\end{pmatrix}\begin{pmatrix}\\ \vec{v_1},\vec{v_2},\cdots,\vec{v_n}\\\end{pmatrix}=\begin{pmatrix}\\ \vec{u_1}, \vec{u_2},\cdots,\vec{u_n}\\ \end{pmatrix}\begin{pmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n
\end{pmatrix}\\
&\ C^{m\times n}\qquad\quad C^{n\times n}\qquad\qquad\qquad C^{m\times n}\qquad \qquad \qquad C^{n\times n}
\end{align}
Which is
$$A_{m\times n}V_{n\times n} = \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}$$

\begin{align}
A_{m\times n}V_{n\times n} &= \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}\\
A_{m\times n}V_{n\times n}V_{n\times n}^{-1} &= \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{-1}\\
A_{m\times n}&=\hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{-1}\\&=\hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{T}
\end{align}

We need to massage the equation a little in order to continue the deduction. First we extend the matrix $\hat{\Sigma}$ vertically to size $m \times n$. Then we extend $\hat{U}$ horizontally to $m\times m$; any values can be put into the added columns.

Because we need to compute $U^{-1}$ and $V^{-1}$, the equation is adjusted to
$$A_{m\times n} = U_{m\times m}\Sigma_{m\times n}V^T_{n\times n}$$
For future convenience, we sort all the $\sigma$'s, which means:
$$\sigma_1\geq\sigma_2\geq\sigma_3 \geq\cdots\geq \sigma_m$$.

How to calculate $U$, $V^T$ and $\Sigma$

To decompose the matrix $A$ we need to calculate $U$, $V^T$ and $\Sigma$. Remember that $U^T = U^{-1}$ and $V^T = V^{-1}$; we will use this property below.

\begin{align}
A &= U\Sigma V^T
\end{align}

\begin{align}
AA^T&=U\Sigma V^T(U\Sigma V^T)^T\\
&=U\Sigma V^TV\Sigma^T U^T\\
&=U\Sigma V^{-1}V\Sigma^T U^T\\
&=U\Sigma I\Sigma^T U^T\\
&=U\Sigma^2 U^T
\end{align}

\begin{align}
(AA^T)U&=(U\Sigma^2 U^T)U\\
&=(U\Sigma^2 )U^{-1}U\\
&=U\Sigma^2
\end{align}


\begin{align}
A^TA
&=(U\Sigma V^T)^TU\Sigma V^T\\
&=V\Sigma^T U^TU\Sigma V^T\\
&=V\Sigma^T U^{-1}U\Sigma V^T\\
&=V\Sigma^T I\Sigma V^T\\
&=V\Sigma^2 V^T
\end{align}

\begin{align}(A^TA)V&=(V\Sigma^2 V^T)V\\
&=(V\Sigma^2)V^{-1}V\\
&=V\Sigma^2
\end{align}

Image Compression

First, let's look at the process of compressing a picture. The left picture is the original grayscale image; on the right are the reproduced pictures under different compression rates. Before compression the size of the picture is 1775 KB. The picture still looks almost the same when we compress it down to about 100 KB, which means we can save 90% of the storage space.

To compress a picture, we simply decompose its matrix through SVD; then, instead of using the original $U_{m\times m}$, $\Sigma_{m\times n}$ and $V_{n\times n}$, we shrink each matrix to the new sizes $U_{m\times r}$, $\Sigma_{r\times r}$ and $V_{r\times n}$. The reconstructed matrix $R$ is still $m\times n$, but we have discarded the entries that are less important than the ones we keep.

%% octave: core code of svd compressions
X = imread(filename);
[U S V] = svd(double(X));
R = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';

Summary

Today we covered the mathematical background of SVD, including linear transformations and eigenvectors & eigenvalues. Before SVD we first discussed eigenvalue decomposition, after which singular value decomposition was easy to derive. In the last section we took an example to see how SVD can be applied to image compression.

Now, back to the question of how to save storage on a 32G iPhone 7: the conclusion is obvious, use SVD to compress images and shrink the size of our photos.

Reference

  1. https://www.youtube.com/watch?v=EokL7E6o1AE
  2. https://www.youtube.com/watch?v=cOUTpqlX-Xs
  3. https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw
  4. https://yhatt.github.io/marp/
  5. https://itunes.apple.com/cn/itunes-u/linear-algebra/id354869137
  6. http://www.ams.org/samplings/feature-column/fcarc-svd
  7. https://www.cnblogs.com/LeftNotEasy/archive/2011/01/19/svd-and-applications.html