This repository is a companion to my blog post, "Building a Faster Triangular Solver than MKL", where I explain how I discovered a triangular solver that beats MKL (for matrix sizes that are multiples of 8, on AVX2, single-threaded). The code for that solver, as well as a proof-of-concept divide-and-conquer solver, is here. It should be compiled with GCC or Clang on Linux. The GitHub Actions workflow should be your guide to trying this out locally.
Note that if you have a CPU with AVX-512 support, you will likely see BLAS perform much better than my implementation. This code is meant to be educational.
Please read the blog post for the full story and context of this repository.
If this example helps you with your work, consider saying thank you by buying me a coffee!