Summary of reading: painless conjugate gradient
I finally finished reading Painless Conjugate Gradient by Jonathan Richard Shewchuk^{1} today.
Quadratic form and Steepest Descent
The author spends almost half of the article on this part. To me, the most useful takeaways are:

Solving a linear system $Ax = b$ (with $A$ symmetric positive-definite) can be regarded as minimizing the quadratic form $f(x) = \frac{1}{2}x^T A x - b^T x + c$.

Equation (8) in the paper, $f(p) = f(x) + \frac{1}{2}(p - x)^T A (p - x)$ where $x$ solves $Ax = b$, suggests that minimizing the quadratic form is the same as minimizing the $A$-weighted error term $e^T A e$.

Representing an iterative algorithm as a matrix power makes it possible to analyze its convergence based on the distribution of the eigenvalues of $A$.
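To make the first two takeaways concrete, here is a small Steepest Descent sketch on the paper's sample problem ($A = \begin{pmatrix}3 & 2\\2 & 6\end{pmatrix}$, $b = (2, -8)^T$); the solver loop is my own minimal version, not code from the paper:

```python
import numpy as np

# The paper's sample system: minimizing f(x) = 1/2 x^T A x - b^T x
# is equivalent to solving A x = b for SPD A.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.zeros(2)
for _ in range(50):
    r = b - A @ x                  # residual = negative gradient of f
    alpha = (r @ r) / (r @ A @ r)  # exact line search along r
    x = x + alpha * r

x_star = np.linalg.solve(A, b)     # direct solution for comparison
assert np.allclose(x, x_star, atol=1e-6)
```

The per-step error contraction is governed by the spread of $A$'s eigenvalues (here $2$ and $7$), which is the eigenvalue-based convergence analysis the paper develops.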
Conjugate Directions
Section 7.1 introduces the concept of Conjugate Directions. Conjugate Directions is just a special kind of descent method, in which a set of $A$-orthogonal (i.e. $d_i^T A d_j = 0$ for $i \neq j$) vectors $d_0, d_1, \dots, d_{n-1}$ is used for line search, and exactly one step is taken along each vector.
Representing error in search directions
We can derive the step size by a procedure similar to the one used for Steepest Descent (equation 31 in the paper): $\alpha_i = \frac{d_i^T r_i}{d_i^T A d_i}$.
Since we require that exactly one step is made along each search direction, we should be able to represent the initial error by these search directions (equation 33): $e_0 = \sum_{j=0}^{n-1} \delta_j d_j$.
The paper uses a constructive proof which derives a formula for $\delta_j$ to show that such a representation does exist (equation 34): $\delta_j = \frac{d_j^T A e_0}{d_j^T A d_j}$.
Surprisingly, we find that $\delta_j = -\alpha_j$. This hints that each step completely eliminates the error along one direction. In fact this is true, and it is proved on page 27.
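This can be checked numerically. The sketch below is my own demo, not from the paper: it relies on the fact that the eigenvectors of a symmetric $A$ are automatically $A$-orthogonal, so they serve as ready-made conjugate directions; the random SPD matrix and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # random SPD matrix (demo assumption)
b = rng.standard_normal(4)
x_star = np.linalg.solve(A, b)

_, V = np.linalg.eigh(A)
d = [V[:, j] for j in range(4)]    # eigenvectors: A-orthogonal directions

x = np.zeros(4)
e0 = x - x_star                    # initial error e_0
for j in range(4):
    r = b - A @ x
    alpha = (d[j] @ r) / (d[j] @ A @ d[j])       # step size (equation 31)
    delta = (d[j] @ A @ e0) / (d[j] @ A @ d[j])  # coefficient (equation 34)
    assert np.isclose(delta, -alpha)             # delta_j = -alpha_j
    x = x + alpha * d[j]

assert np.allclose(x, x_star)      # error fully eliminated after n steps
```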
A-orthogonality
Another requirement on the search directions is $A$-orthogonality. This is achieved by performing a Gram-Schmidt conjugation on a set of linearly independent vectors $u_0, u_1, \dots, u_{n-1}$.
Similar to the traditional Gram-Schmidt process, we can compute a set of $A$-orthogonal vectors by linear combination, $d_i = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_k$, with coefficients (equation 37): $\beta_{ik} = -\frac{u_i^T A d_k}{d_k^T A d_k}$.
Note that we have to store all previous search directions to compute a new one $d_i$. This will be resolved in Conjugate Gradients by exploiting properties of the search space (described below).
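A minimal conjugation sketch, assuming a random SPD $A$ and the coordinate axes as the independent vectors $u_i$ (both are my choices for the demo, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)    # random SPD matrix (demo assumption)
U = np.eye(4)                  # u_i = coordinate axes, clearly independent

d = []
for i in range(4):
    di = U[:, i].copy()
    for dk in d:               # note: every previous direction is needed here
        beta = -(U[:, i] @ A @ dk) / (dk @ A @ dk)  # equation 37
        di = di + beta * dk
    d.append(di)

# verify pairwise A-orthogonality: d_i^T A d_j = 0 for i != j
for i in range(4):
    for j in range(i):
        assert abs(d[i] @ A @ d[j]) < 1e-9
```

The inner loop over all previous directions is exactly the storage cost the text mentions; Conjugate Gradients will reduce it to a single term.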
Space of search directions
In section 7.3 the author introduces a space $\mathcal{D}_i$ which is spanned by $d_0$, $d_1$, …, $d_{i-1}$, and proves that the current residual is orthogonal to the whole previous search space ($d_j^T r_i = 0$ for $j < i$) in equations 38 and 39.
This property will be used to simplify GramSchmidt coefficients.
Conjugate Gradients
We finally reach the Conjugate Gradient method in chapter 8.
In section 7.2 we said that Gram-Schmidt conjugation is used to construct a set of $A$-orthogonal vectors $d_i$ from linearly independent vectors $u_i$. But $u_i$ has not been determined yet.
In Conjugate Gradients we just let $u_i = r_i$.
To see why, we write down the iteration step for the residual (equation 43): $r_{i+1} = r_i - \alpha_i A d_i$.
We can see that the new residual $r_{i+1}$ is a linear combination of $r_i$ and $Ad_i$.
By letting $u_i = r_i$, we have $\mathcal{D}_i = \operatorname{span}\{r_0, Ar_0, A^2r_0, \dots, A^{i-1}r_0\}$, a Krylov subspace.
Because $A\mathcal{D}_i$ is included in $\mathcal{D}_{i+1}$, the fact that the next residual is orthogonal to $\mathcal{D}_{i+1}$ (equation 39) implies that $r_{i+1}$ is $A$-orthogonal to $\mathcal{D}_i$. In the paper's words, "$r_{i+1}$ is already $A$-orthogonal to all of the previous search directions except $d_i$!"
Therefore the Gram-Schmidt coefficients for each $d_i$ simplify to a single scalar, $\beta_i = \frac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$, and we only need to store the last search direction at each iteration step.
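Putting it all together, a compact CG sketch (the loop body is my own minimal version; it reuses the paper's sample $A$ and $b$ and stores only the latest residual and direction):

```python
import numpy as np

# The paper's sample system.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.zeros(2)
r = b - A @ x
d = r.copy()                  # first search direction is the residual
rr = r @ r
for _ in range(len(b)):       # at most n steps for an n x n system
    alpha = rr / (d @ A @ d)  # step size along d
    x = x + alpha * d
    r = r - alpha * (A @ d)   # residual update (equation 43)
    rr_new = r @ r
    beta = rr_new / rr        # the single surviving Gram-Schmidt coefficient
    d = r + beta * d          # conjugate the new residual against d only
    rr = rr_new

assert np.allclose(x, np.linalg.solve(A, b))
```

Note that no list of past directions appears anywhere, which is exactly the simplification this section derives.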
References

Shewchuk, Jonathan Richard. “An introduction to the conjugate gradient method without the agonizing pain.” (1994).