Linear Algebra for ML

November 15, 2023


I took a Linear Algebra (LA) class in my first year of college. It was taught in such a mathematical and algebraic way that I couldn't appreciate it at all.

I got a B+ in that class. I remember not having time to finish the final exam and getting mad at myself, then disappointed. I also realized I didn't understand Linear Algebra at all. Like, why does the number of columns in A have to match the number of rows in B for a matmul?

Now that I'm starting grad school with the goal of becoming an AI engineer, it's time to really understand LA.

Today I watched this video by the amazing Charles Frye, who was also involved in teaching FSDL (Full Stack Deep Learning).

He also made videos for calculus and probability in his series Math4ML, along with exercise notebooks.

His analogy comparing linear algebra to programming really helped me make sense of the how and the why behind it.

Here are some takeaways.

  • LA is like programming
    • in programming, we combine functions with matching types through function composition
    • in LA, we combine matrices with matching shapes through matrix multiplication
    • so, think of a matrix as a function, and matmul as composing it with (or applying it to) another matrix (see the first numpy sketch after this list)
  • why arrays as functions?
    • easy to change slightly, small changes to entries -> small changes to function
    • LA can be made lightning fast (TPU, BLAS)
    • arrays represent linear functions
    • linear functions play nice with weighted sums
    • this makes linear functions easy to reason about
      • ex: 0 is always sent to 0
      • everything sent to 0 is called the kernel
        • the kernel is itself made of weighted sums (it's closed under them)
      • weighted sums define the rank of the function
        • We can make new non-kernel elements by making weighted sums of known non-kernel elements.
        • Rank answers the question: how many non-kernel elements do I need to know in order to make every element that’s not in the kernel?
  • SVD is matrix "refactoring"
    • "separation of concerns" = Eigenvectors ("eigendecomposition")
    • removing redundant code = Low-rank decomposition
    • break up into functions (decomposing) = Singular Value Decomposition (SVD)
  • Function & Matrix decomposition
    • any function can be decomposed into "representatives", reversible renaming, and type conversion
    • any matrix can be decomposed into A (tall matrix), B (square matrix), C (wide matrix)
    • in SVD, special choices are made: B is diagonal, and A and C are unitary (no growing/shrinking); see the second sketch after this list
  • why do SVD?
    • calculate low-rank approximation
    • used in PCA for preprocessing and analyzing data
    • classify and measure matrices in LA
  • applied to video
    • many matrices are nearly low rank (you can notice patterns)
      • full rank = absence of patterns in matrix
    • simplest low-rank pattern in a video is the background
      • roughly an outer product of one vector (the background) with a vector of per-frame weights, i.e. a repeat() function
      • think of it as an approximation
      • used to compress data (JPEG does something similar with a Fourier-style transform, the DCT)
      • can be used to separate out the foreground: the rank-1 approximation is the background, and subtracting it leaves (approximately) the foreground (the third sketch after this list walks through this)
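
To make the "matrices are functions, matmul is composition" idea concrete, here's a minimal numpy sketch (the shapes and values are made up for illustration). It also answers my old exam question: the columns of A have to match the rows of B because A can only consume vectors of exactly the size that B produces.

    import numpy as np

    # A matrix is a function: it eats a vector of one size and returns a vector of another.
    A = np.random.randn(2, 3)   # a function from R^3 to R^2
    B = np.random.randn(3, 4)   # a function from R^4 to R^3
    x = np.random.randn(4)      # an input in R^4

    # Applying the two functions one after the other...
    y = A @ (B @ x)

    # ...is the same as applying the single composed function A @ B.
    C = A @ B                   # the composition: R^4 -> R^2, shape (2, 4)
    assert np.allclose(C @ x, y)

    # The shape rule is just type checking: A's input size (3 columns)
    # must equal B's output size (3 rows), or the composition is ill-typed.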
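
And here's a quick check of the "tall / square-diagonal / wide" refactoring on a made-up non-square matrix of low rank (the sizes are arbitrary; this is just a sketch of what np.linalg.svd gives back):

    import numpy as np

    rng = np.random.default_rng(1)

    # A 5x4 matrix of rank 2, built on purpose from a tall and a wide factor.
    M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

    U, S, Vt = np.linalg.svd(M)
    r = int(np.sum(S > 1e-10))        # numerical rank, here 2

    A = U[:, :r]                      # (5, 2)  tall
    B = np.diag(S[:r])                # (2, 2)  square and diagonal
    C = Vt[:r, :]                     # (2, 4)  wide

    assert np.allclose(A @ B @ C, M)  # the three pieces rebuild M

    # A has orthonormal columns, so it never grows or shrinks a vector.
    y = rng.standard_normal(r)
    assert np.isclose(np.linalg.norm(A @ y), np.linalg.norm(y))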
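
Finally, a toy version of the video example: frames become columns of a matrix, the rank-1 piece of the SVD recovers the shared background, and subtracting it leaves the moving foreground. The frame size and the synthetic "object" are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "video": 100 pixels per frame, 30 frames, one frame per column.
    background = rng.random(100)
    frames = np.tile(background, (30, 1)).T      # shape (100, 30), exactly rank 1
    frames[40:50, 10:20] += 1.0                  # a bright "object" passing through some frames

    # SVD: frames = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(frames, full_matrices=False)

    # Rank-1 approximation = outer product of the first singular vectors.
    rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])   # roughly the background, repeated per frame

    foreground = frames - rank1                  # what's left is (roughly) the moving object
    print(np.abs(foreground).max())              # the residual is concentrated where the object was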

LA concepts

  • matrix: a function that transforms data
  • matmul: applying a function to an input
  • rank: count of unique, independent features in data
    • low rank: repeating patterns, useful for compressing data
    • full rank: implies complex data
  • kernel: the set of inputs that a function sends to zero; helps identify redundant or dependent features in data (see the sketch after this list)
  • eigenvectors: directions that a matrix only stretches or shrinks; for a covariance matrix, these are the directions in which the data varies the most
  • SVD: decompose a function into simpler parts, mirroring how any function factors into pieces that are:
    • Surjective (onto): Every possible output is represented.
    • Bijective (one-to-one and onto): Each input has a unique output and vice versa.
    • Injective (one-to-one): Every input leads to a unique output.
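
Here's a small numpy sketch tying rank, kernel, and SVD together on a deliberately redundant matrix (the numbers are arbitrary; the third column is just the sum of the first two, so it carries no new information):

    import numpy as np

    # 3x3 matrix whose third column = first column + second column (a redundant feature).
    M = np.array([[1., 0., 1.],
                  [0., 1., 1.],
                  [2., 3., 5.]])

    print(np.linalg.matrix_rank(M))   # 2, not 3: only two independent directions

    # SVD exposes both the rank and the kernel.
    U, S, Vt = np.linalg.svd(M)
    print(S)                          # last singular value is ~0 -> low rank

    # The row of Vt matching the ~0 singular value spans the kernel:
    k = Vt[-1]                        # proportional to [1, 1, -1]
    print(np.allclose(M @ k, 0))      # True: this direction is "redundant"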

More Resources

Hacker's approach

Visualize

Papers/Courses/Textbooks