I was playing bloons td back when it was flash, in firefox. It was sometimes too slow. So i fired up perf and found out what horrors flash player was doing with memcpy. One byte memcpy, completely unaligned memcpy.
So i wrote an ssse3 memcpy that could do one byte unaligned with xmm registers. It was 30% faster then whatever glibc was doing and made the game playable. Was planing to submit it to glibc, but they came up with something different that was just as fast.
Pouet.net