Despite being a simple model, Logistic regression is very popular in Machine Learning, especially in industry, and gradient descent is among the most widely used of the many optimization methods. The aim of this article is to show how the familiar formulas are derived.

1. Model a problem

First of all, we know that Logistic regression is a statistical model. Suppose we have a coin and we would like to know the probability of heads and tails. The Bernoulli distribution is a natural model for the coin. The hypothesis is as follows: \[P(y=1|\mu)=\mu \qquad(1)\]

\(y=1\) denotes heads, and \(y=0\) denotes tails. \(\mu\) is the parameter: if \(\mu=0.5\), the probability of heads equals 0.5, meaning heads and tails are equally likely. We also have: \[P(y=0|\mu)=1-\mu\qquad(2)\]

Looking at these two formulas, we can combine them into one more general formula: \[P(y|\mu)={\mu}^y\cdot{(1-\mu)}^{(1-y)} \qquad(3)\]

Let’s test it: \(y\) takes only the two values \(0\) and \(1\); if \(y=1\), then \(P(y|\mu)=\mu\), otherwise \(P(y|\mu)=1-\mu\). Suppose we have a dataset \(D=\{y\_1,y\_2,y\_3,...,y\_m\}\) of observed values and we want to estimate the parameter \(\mu\); the problem becomes the following question: \[P(\mu|D)=?\qquad (4)\]
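Equation (3) is easy to sanity-check in code. A minimal sketch (the function name `bernoulli_pmf` is our own choice, not from the article):

```python
# Bernoulli likelihood from equation (3): P(y | mu) = mu^y * (1 - mu)^(1 - y).
def bernoulli_pmf(y, mu):
    return mu ** y * (1 - mu) ** (1 - y)

print(bernoulli_pmf(1, 0.5))  # mu itself when y = 1
print(bernoulli_pmf(0, 0.7))  # 1 - mu (approximately 0.3) when y = 0
```

As expected, the single expression reproduces both equations (1) and (2) depending on \(y\).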

2. Estimate Parameter

We use Bayes' formula to transform the problem into another one: \[P(\mu|D)=\frac{P(D|\mu)\cdot P(\mu)}{P(D)}=\frac{P(y\_1,y\_2,y\_3...,y\_m|\mu)\cdot P(\mu)}{P(D)}\qquad (5)\]

Here the denominator \(P(D)\) is a constant, and so is the prior \(P(\mu)\) if we treat \(\mu\) as an unknown constant rather than a random variable (equivalently, if we assume a uniform prior). So, up to a constant factor, we have: \[P(\mu|D)\triangleq P(y\_1,y\_2,y\_3...,y\_m|\mu) \qquad(6)\]

We could try a range of values of \(\mu\) (e.g., \(\mu=0.1\), \(\mu=0.72\)), but we want the \(\mu\) that maximizes \(P(\mu|D)\): when \(P(\mu|D)\) reaches its maximum, \(\mu\) is most likely the right parameter. We assume each \(y\_i\) is independent of the others given the parameter \(\mu\); then we have: \[P(\mu|D)\triangleq P(y\_1|\mu)P(y\_2|\mu)P(y\_3|\mu)\cdots P(y\_m|\mu)={\prod}\_{i=1}^{m}P(y\_i|\mu)\qquad(7)\]

Now the problem is to maximize \({\prod}\_{i=1}^{m}P(y\_i|\mu)\), that is: \[L = \underset{\mu}{argmax}{\prod}\_{i=1}^{m}P(y\_i|\mu)=\underset{\mu}{argmax}{\prod}\_{i=1}^{m}[{\mu}^{y\_i}\cdot{(1-\mu)}^{(1-y\_i)}] \qquad(8)\]

The product is hard to maximize directly, so we take the logarithm, which is monotonic and therefore leaves the maximizer unchanged: \[L =\underset{\mu}{argmax}\,ln({\prod}\_{i=1}^{m}[{\mu}^{y\_i}\cdot{(1-\mu)}^{(1-y\_i)}])\qquad(9)\]

Then we have: \[L = \underset{\mu}{argmax}{\sum}\_{i=1}^{m}[y\_i\, ln(\mu) + (1-y\_i)\, ln(1-\mu)]\qquad(10)\]
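For the plain Bernoulli case, setting the derivative of (10) to zero gives the sample mean as the maximizer, and we can confirm this numerically with a grid search. A minimal sketch on a hypothetical toy dataset:

```python
# Numerical check of equation (10): the mu maximizing the Bernoulli
# log-likelihood is the sample mean of the observed y values.
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical coin flips: 6 heads out of 8

def log_likelihood(mu, ys):
    # Equation (10): sum of y*ln(mu) + (1-y)*ln(1-mu)
    return sum(y * math.log(mu) + (1 - y) * math.log(1 - mu) for y in ys)

# Scan a grid of candidate mu values and keep the best one.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda mu: log_likelihood(mu, data))
print(best)                     # matches the sample mean
print(sum(data) / len(data))    # 6/8 = 0.75
```

The grid maximizer lands exactly on the sample mean \(6/8 = 0.75\), as the calculus predicts.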

3. Logistic Regression

If we let \(\mu=h\_{\theta}(x^{(i)})\), so that the parameter now depends on the input \(x^{(i)}\) of the \(i\)-th example through \(\theta\), then we have:

\[L = \underset{\theta}{argmax}{\sum}\_{i=1}^{m}[y\_i\, ln(h\_{\theta}(x^{(i)})) + (1-y\_i)\, ln(1-h\_{\theta}(x^{(i)}))]\qquad(11)\]

Here, \[h\_{\theta}(x)=\frac{1}{1+e^{-(\theta\_0 x\_0+\theta\_1 x\_1+\theta\_2 x\_2+\cdots+\theta\_n x\_n)}}=\frac{1}{1+e^{-{\Theta}X}}\qquad (12)\]
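Equation (12) translates directly into code. A minimal sketch (the helper names `sigmoid` and `h_theta` are our own):

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def h_theta(theta, x):
    # Equation (12): h_theta(x) = sigmoid(Theta . X), with the dot product
    # written out explicitly instead of using a linear-algebra library.
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(h_theta([0.0, 0.0], [1.0, 2.0]))  # 0.5 when all weights are zero
```

With all weights zero the linear term vanishes and the model outputs 0.5, i.e. total uncertainty, which is a handy sanity check.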

\(h\_{\theta}(x)\) is called the sigmoid function, and we now use the parameter \(\theta\) in place of \(\mu\). The sigmoid is a pleasant function to work with because its derivative takes an elegant form. Let \(\sigma=\frac{1}{1+e^{-x}}\); then:

\[\begin{align} \frac{\partial{\sigma}}{\partial x} &= \frac{-1}{(1+e^{-x})^2}\cdot e^{-x}\cdot(-1)\\\\ &=\frac{e^{-x}}{(1+e^{-x})^2}=\frac{1+e^{-x}-1}{(1+e^{-x})^2}\\\\ &=\frac{1}{1+e^{-x}}-\frac{1}{(1+e^{-x})^2}\\\\ &=\sigma(1-\sigma)\qquad \qquad (13) \end{align}\]
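We can verify the identity \(\sigma'=\sigma(1-\sigma)\) from equation (13) numerically with a central finite difference. A minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Finite-difference check of equation (13): sigma'(x) = sigma(x) * (1 - sigma(x)).
eps = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    analytic = sigmoid(x) * (1 - sigmoid(x))
    print(x, numeric, analytic)  # the two columns agree to many decimals
```

At every test point the finite-difference slope matches \(\sigma(1-\sigma)\) to within the step-size error.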

4. Gradient Descent

We turn the maximization into a minimization problem by defining the cost function:

\[J(\theta) = -\frac{1}{m}{\sum}\_{i=1}^{m}[y\_i\, ln(h\_{\theta}(x^{(i)})) + (1-y\_i)\, ln(1-h\_{\theta}(x^{(i)}))]\qquad (14) \]
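Equation (14) can be implemented in a few lines. A minimal sketch on a hypothetical toy dataset, where each row of `X` starts with the bias feature \(x\_0=1\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    # Equation (14): J(theta) = -(1/m) * sum_i [y_i*ln(h) + (1-y_i)*ln(1-h)]
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # hypothetical examples, bias first
y = [0, 1, 1]
print(cost([0.0, 0.0], X, y))  # ln(2), since h = 0.5 for every example
```

With \(\theta=0\), every prediction is 0.5 and the cost is exactly \(ln\,2\approx 0.693\), another useful sanity check when debugging.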

If we want to minimize \(J(\theta)\), we need to know the gradient \(\frac{\partial J(\theta)}{\partial\theta\_j}\). Writing \(\sigma\) as shorthand for \(h\_\theta(x^{(i)})\), that is: \[\begin{align} \frac{\partial J(\theta)}{\partial\theta\_j} &=\frac{\partial}{\partial\theta\_j}(-\frac{1}{m}{\sum}\_{i=1}^{m}[y\_i ln(h\_\theta(x^{(i)})) + (1-y\_i)ln(1-h\_\theta(x^{(i)}))])\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y\_i}{h\_\theta(x^{(i)})} + \frac{y\_i-1}{1-{h\_\theta(x^{(i)})}} ]\frac{\partial h\_\theta (x^{(i)})}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[{\frac{y\_i}{\sigma}} + \frac{y\_i-1}{1-\sigma} ]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y\_i-y\_i\sigma}{\sigma(1-\sigma)} + \frac{\sigma y\_i-\sigma}{\sigma(1-\sigma)}]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y\_i-\sigma}{\sigma(1-\sigma)}]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y\_i-\sigma}{\sigma(1-\sigma)}]\cdot\sigma(1-\sigma)\frac{\partial{\Theta}x^{(i)}}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[y\_i-\sigma]\frac{\partial{\Theta}x^{(i)}}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[y\_i-\sigma]{x\_j^{(i)}}\\\\ &= \frac{1}{m}{\sum}\_{i=1}^{m}[h\_{\theta}(x^{(i)})-y\_i]{x\_j^{(i)}} \qquad(15) \end{align}\]
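The final line of equation (15) is all that is needed in code. A minimal sketch of the full gradient vector on a hypothetical toy dataset (bias feature \(x\_0=1\) in the first column):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    # Equation (15): dJ/dtheta_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
    m = len(X)
    grads = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        for j, xj in enumerate(xi):
            grads[j] += (h - yi) * xj / m
    return grads

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # hypothetical examples, bias first
y = [0, 1, 1]
g = gradient([0.0, 0.0], X, y)
print(g)
```

At \(\theta=0\) every prediction is 0.5, so the components are easy to check by hand: \(g\_0=(0.5-0.5-0.5)/3=-1/6\) and \(g\_1=(0-0.5-1)/3=-0.5\).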

Notice that this gradient has the same form as the gradient of the squared-error cost in Linear Regression, which is a remarkably elegant result.

5. Summary

We have shown how to derive the cost function of Logistic Regression and obtained its gradient. The next step is to use gradient descent to iteratively find the minimum of the cost function.
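That iteration can be sketched in a few lines. A minimal gradient-descent loop for the cost in equation (14), on a tiny hypothetical dataset; real code would vectorize with a linear-algebra library and choose the learning rate more carefully:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-feature dataset; x_0 = 1 is the bias feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
alpha = 0.5  # learning rate (an assumption, tuned by hand)

for _ in range(2000):
    m = len(X)
    grads = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        # Gradient from equation (15): (1/m) * sum_i (h - y_i) * x_ij
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        for j, xj in enumerate(xi):
            grads[j] += (h - yi) * xj / m
    # Descend along the negative gradient.
    theta = [t - alpha * g for t, g in zip(theta, grads)]

print(theta)
```

After training, the model assigns probability below 0.5 to the examples labeled 0 and above 0.5 to those labeled 1, so the learned decision boundary separates the two classes.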

