> Vectorization is a process by which floating-point computations in scientific code are compiled into special instructions that execute elementary operations (+,-,*, etc.) or functions (exp, cos, etc.) in parallel on fixed-size vector arrays.
I guess this is a scientific computing course or something, but even so I feel it’s important to point out that most processors have many vector instructions that operate on integers, bitfields, characters, and the like. The fundamental premise of “do a thing on a bunch of data at once” isn’t limited to floating-point operations.
It seems odd in a scientific computing class not to mention classic Cray-style vectors. I mean, they haven’t been popular for a while, but you never know, RISC-V might bring them back.
Vectorization (disambiguation) https://en.wikipedia.org/wiki/Vectorization :
> Array programming, a style of computer programming where operations are applied to whole arrays instead of individual elements
> Automatic vectorization, a compiler optimization that transforms loops to vector operations
> Image tracing, the creation of vector graphics from raster graphics
> Word embedding, mapping words to vectors, in natural language processing
> Vectorization (mathematics), a linear transformation which converts a matrix into a column vector
Vector (disambiguation) https://en.wikipedia.org/wiki/Vector
> Vector (mathematics and physics):
> Row and column vectors, single row or column matrices
> Vector space
> Vector field, a vector for each point
And then there are a number of CS usages of the word vector for 1D arrays.
Compute kernel: https://en.wikipedia.org/wiki/Compute_kernel
GPGPU > Vectorization, Stream Processing > Compute kernels: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...
sympy.utilities.lambdify.lambdify() https://github.com/sympy/sympy/blob/a76b02fcd3a8b7f79b3a88df... :
> """Convert a SymPy expression into a function that allows for fast numeric evaluation [with the CPython math module, mpmath, NumPy, SciPy, CuPy, JAX, TensorFlow, SymPt, numexpr,]
PyTorch lambdify PR, sympytorch: https://github.com/sympy/sympy/pull/20516#issuecomment-78428...
sympytorch: https://github.com/patrick-kidger/sympytorch :
> Turn SymPy expressions into PyTorch Modules.
> SymPy floats (optionally) become trainable parameters. SymPy symbols are inputs to the Module.
sympy2jax https://github.com/MilesCranmer/sympy2jax :
> Turn SymPy expressions into parametrized, differentiable, vectorizable, JAX functions.
> All SymPy floats become trainable input parameters. SymPy symbols become columns of a passed matrix.
Auto-vectorization is easier to get into than other SIMD-frameworks like CUDA, OpenCL, ROCm, Intel's ISPC and whatnot. But in my experience, auto-vectorizers are just not as flexible as the proper SIMD-tools.
I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit. You barely have to do anything and suddenly your for-loops are AVX512-optimized, though maybe not to the fullest extent possible.
Still, I suggest that programmers also learn how to properly make SIMD code. Maybe intrinsics are too hard in practice, but ISPC, CUDA, and other SIMD-programming environments make things far easier to learn than you might expect.
------------
ISPC in particular is Intel's SIMD programming language, much akin to CUDA except that it compiles to CPU SIMD instructions such as AVX512. So for AVX512-like code execution environments, the ISPC language/compiler is exceptionally useful.
It's harder to learn a new language than a few compiler options that enable auto-vectorization, however. So in practice, auto-vectorization will continue to be used. But for tasks that specifically benefit from SIMD thinking, the dedicated ISPC language should be beneficial.
> I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit.
Somewhat related: 55 GiB/s FizzBuzz https://codegolf.stackexchange.com/questions/215216/high-thr... (discussion: 29031488)
What do you mean by learning auto vectorization? Do you mean learning how to structure your code so that your compiler of choice has a good chance of pulling it off?
Even more important than learning how to vectorize is learning the basics of computer architecture: registers, caches, pipelining, branch prediction, and superscalar execution.
One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics right, such as using multiple partial sums.
Once you do the work such that your code actually allows execution to happen in parallel, then leveraging superscalar and simd happens automatically.
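For anyone who hasn't seen the trick, here is a minimal sketch of the "multiple partial sums" idea in C++ (the function names and the choice of four accumulators are just illustrative):

```cpp
#include <cstddef>

// Naive sum: each iteration depends on the previous one, so the adds
// cannot overlap in the pipeline, and the compiler cannot vectorize the
// reduction unless it is allowed to reorder float operations.
double sum_naive(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Four independent accumulators break the dependency chain, letting the
// superscalar core keep several additions in flight and giving the
// compiler room to map them onto SIMD registers. The result can differ
// slightly from the naive sum because float addition is not associative.
double sum_partial(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // scalar tail for the leftover elements
    return (s0 + s1) + (s2 + s3);
}
```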
> auto-vectorizers are just not as flexible as the proper SIMD-tools
What is a "proper" SIMD tool?
What would happen if I tried to run a program that contains AVX instructions on a system that doesn't support them? Do I need to recompile the program?
I wonder how much these apply to the Julia compiler: based on the docs for `@simd` and the writing here [1], only the first two criteria on this page [2] seem to be really a requirement in Julia's case.
The one about avoiding function calls is a no-go, since things like + and * are function calls too. But is it still required for all function calls to be inline-able?
Indirect indexing isn't mentioned in Julia's case, but is it one of the things that would disable auto-vectorization and require an explicit @simd annotation?
[1] https://viralinstruction.com/posts/hardware/#74a3ddb4-8af1-1... [2] https://cvw.cac.cornell.edu/vector/coding_vectorizable
Yes, it is required for all functions to be inlined in Julia. Note that Julia 1.8 and newer supports callsite inlining, so you can, for example, do `@inline exp(x[i])` inside your `for` loop, and this will cause `exp` to get inlined there, and should enable vectorization. For functions like `log` or `sin`, this is not enough, but you could try `@turbo` from LoopVectorization.jl, which supports these and some others.
`@simd ivdep` may be needed when you have indirect indexing. Do not forget the `ivdep`, but only use it if you know it is actually safe; it can lead to hard-to-track-down bugs if it is not.
When I was in grad school programming on a Cray, "vectorization" was about chaining operations across an array and getting the inner-loop dependencies right, rather than parallel operations across elements of wide words. Interesting how the definitions have changed with architectural changes.
From what I recall, in vector machines, operations are still done in parallel due to chaining; the difference is that you're doing e.g. add(v[1]) and mul(v[0]) in parallel, then add(v[2]) and mul(v[1]), and so on, rather than add(v[0]) and add(v[1]) in parallel, then mul(v[0]) and mul(v[1]), &c, which makes support for any vector length trivial.
(I'm sure you know this; this is just for the benefit of anyone else perusing the comment section.)
This presentation seems very “let’s get ya going on the Xeons we’ve got in Stampede today” focused. Which, to be fair, is probably what lots of researchers who are interested in using (rather than studying) HPC need to know.
RISC-V has, or at least had, some classic vector extension proposal. Not sure if it actually got implemented.
This looks like a curious case, non-fixed-length vectorization code in RISC-V:
> "Perhaps the most interesting part of the open RISC-V instruction set architecture (ISA) is the vector extension (RISC-V "V"). In contrast to the average single-instruction multipe-data (SIMD) instruction set, RISC-V vector instructions are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is flexible in choosing a vector register size while RISC-V "V" binary code is portable between different CPU implementations."
Classical vector machines (think Cray) had flexible vector sizes. But eventually they were buried under a tide of consumer processors, which were fundamentally scalar. Then as these machines took over the world, they had to go into every niche, so the MMX style “fixed length vector instruction” was invented, for CPUs that might have to do some math while still fitting into that paradigm.
RISC-V also has the P extension, which follows this pattern. It will be very cool if RISC-V proves flexible enough to really enable a resurgence of classic vectors though.
We also need something like a Tensorization: Introduction.
We really need to name tensors differently. "Array" would be a more fitting name.
Tensors are really different mathematical objects, with a far richer structure than those used in the Deep Learning context.
Tensor ops are usually implemented using the principles of vectorizing code.
After all most tensors are implemented as 1D arrays with strides for each dimension with the innermost dimension always contiguous.
PyTorch internals: http://blog.ezyang.com/2019/05/pytorch-internals/
A good basic thing to know is that a tensor is a one-dimensional array under the hood. At least that's my mental model - I have not yet looked at the code :-)
Therefore "converting" a 2 * 5 * 10 tensor into a 2 * 50 one is an immediate operation; it doesn't have to scan, it just changes some metadata.
From a hardware perspective, vector instructions operate on small 1D vectors, whereas tensor instructions operate on small 2D matrices. I say “instructions”, but it’s really only matrix multiply or matrix multiply and accumulate - most other instructions are fine staying as 1D.
This is mostly going to be the same. The main difference in tensorization is that your vectors are not contiguous, so you lose some speedup. How you typically optimize this is by flattening your tensor; then your data is contiguous and all the same rules apply.
This is overly simplified though. Things get different when we start talking about {S,M}I{S,M}D (page addresses SIMD) or GPUs. Parallel computing is a whole other game, and CUDA takes it to the next level (lots of people can write CUDA kernels, not a lot of people can write GOOD CUDA kernels (I'm definitely not a pro, but know some)). Parallelism adds another level of complexity due to being able to access different memory locations simultaneously. If you think concurrent optimization is magic, parallel optimization is wizardry (I still believe this is true)
Sorry if this is a different sense of that same word…
I can understand almost everything in the explanations of how ChatGPT, neural networks and fine-tuning works
But one thing isn’t being explained…
How do you embed the words as vectors????
Do you just make up your own scheme?
I understand about vector databases being used for search, to retrieve snippets and stuff them into a prompt. I don’t mean that.
I mean how did people embed the words WHEN THEY USE THE EMBEDDINGS API?
I think this whole vector database and pinecone stuff will be phased out once the LLM windows get larger, like with the new Claude from Anthropic. And the ability to fine tune the model on your data. Then the LLM can make its own lower-rank model of your entire corpus of data, rather than having to do vector lookups and stuffing it into a prompt.
> Sorry if this is a different sense of that same word…
it is, hence the downvotes. This one is more about CPU operations, AVX/SIMD
I know I’m probably going to get downvoted for this, but I’m going to post it anyhow :-)
https://jvns.ca/blog/2023/02/08/why-does-0-1-plus-0-2-equal-...
Perhaps you could point out in which way you think this is relevant?
I thought I was crazy when looking at all those vector extensions piling up made me think "why not just have some generic-sized vector instructions, and then let the CPU sort out what is the largest chunk size it actually supports internally" — but apparently no, I am not! It has actually been done before, and we might even have that again in the future. That's nice.
ARM's newer SVE/SVE2 instructions can do something like this now. SVE allows the vector length to vary between 128 and 2048 bits, at 128 bit increments. RISC-V's vector RVV instructions also allow the vector length to vary across different CPU implementations. This way, the same RVV vector code can run on different CPUs having different vector lengths.
Definitely not autovectorized.
Compilers are pretty good at restructuring code. As a programmer you need to know some common situations where the compiler would like to reorder your code, but can't.
E.g. when you do summation using a for loop and the compiler is trying to preserve the rounding behavior of your float operations. Or because you created a dependency chain with a for loop where each iteration requires the previous iteration's result.
Pretty much.
If one wants to sum two arrays, then instead of looping through the elements one at a time, one might iterate by chunks, so that the compiler can easily tie each chunk together into a single operation that can then be vectorized.
If I recall, you can absolutely loop through the elements in a tight loop and the compiler (e.g. GCC) will auto-vectorize for you (if you have the relevant optimization flag set).
The trick with coding for auto-vectorization is to keep your loops small and free of clutter.
I don't have the documentation handy, but I think you only need to follow a couple of rules (a minimal sketch follows the list):
- loop must have a defined size (for-loop instead of while-loop)
- don't muck with pointers inside the loop (simple pointer increment is okay)
- don't modify other variables (only the array should be modified)
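A minimal example of a loop shaped to follow those rules, assuming GCC or Clang with something like -O3 -march=native; __restrict is a common extension that promises the arrays don't alias, which removes one more reason for the vectorizer to bail out:

```cpp
#include <cstddef>

// Trip count known up front, no pointer fiddling in the body, and only
// the output array is written: this shape typically auto-vectorizes.
void add_arrays(const float* __restrict a,
                const float* __restrict b,
                float* __restrict out,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```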
Here are some examples, or you could use this library itself in Rust: https://github.com/vorner/slipstream
> One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics right, such as using multiple partial sums.
Ehhh... I've got multiple hats from my college classes.
Yes, optimization says that multiple partial sums is fastest. But... my floating-point analysis classes have my brain yelling "cancellation error". The way to minimize cancellation error is to sort the numbers by magnitude and separate the positives from the negatives. You add the numbers from smallest to largest in magnitude, keeping the signs separate, and then finally combine the two sums at the end.
The only two partial sums you should be keeping are "positive numbers" and "negative numbers", and everything else needs to be folded together to minimize rounding and cancellation errors.
--------
You know, for stability and minimization of errors. There are other, possibly faster, methods to be employed. But optimization of partial sums is probably the last thing I'm thinking about if I see "float" or "double".
If things are "int" or "long", I'll have a higher chance of actually thinking (and caring) about optimization, because stability/cancellation error isn't a thing with int or long.
The reason you need multiple sums is because there is a loop dependency which causes every iteration to stall and prevents pipelining and saturating the multiple compute units of your superscalar processor.
Yes, floating-point isn't associative, so the results aren't numerically equivalent, and that is why you need to do that optimization yourself since the compiler isn't allowed to do it for you.
As per the exercise, there is no information about the distribution of the data, so there is no indication that any order would be any better than another for numerical stability.
If you care about stability you'd implement a Kahan sum, which isn't really affected by changing the order of the summation, so you can still parallelize it.
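For reference, a minimal Kahan (compensated) summation sketch in C++; note that flags like -ffast-math can legally optimize the compensation away:

```cpp
#include <cstddef>

// The compensation term `c` captures the low-order bits that are lost
// when a small element is added to a large running total, roughly
// extending the effective precision of the accumulator.
double kahan_sum(const double* a, std::size_t n) {
    double sum = 0.0, c = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   // what was rounded off in `sum + y`
        sum = t;
    }
    return sum;
}
```

Running one such compensated accumulator per lane and combining the (sum, c) pairs at the end is a common way to reconcile this with the partial-sums trick.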
Is asking a question that fewer than 1% of applicants "can even get the basics correct" -- which implies that only a tiny fraction of 1% actually provide a fully complete and correct answer -- a good use of either the applicants' or the interviewer's time? Is it effective in distinguishing candidates, or are you only seeking 1 success (fewer, actually) per 100?
1% success rate is typical when interviewing developers.
Also they get asked multiple exercises, not doing great at one is not eliminatory.
Being able to quickly identify whether an individual is exceptionally good is very important, so the employer can fast-track them and respond quickly enough that they don't accept another offer.
There is no such thing as 'failure to autovectorize' in CUDA or OpenCL. All code is vectorized in these languages. SIMD is fundamental to the language model.
As such, it's easier to write high performance code in practice. Intel's ISPC is the closest tool that replicates this effect for AVX512.
The cheeky answer would be to throw away the trash every decade, but the real answer is that there is tooling in place to pick the best possible implementation of a function at link time on modern *nix systems to target several levels of SIMD support e.g. SSE2 only, SSE4.2, AVX2, common collection of AVX512 extensions.
What tool are you referring to? Can you please point me to those resources?
E.g. ISPC can compile for multiple targets at once. The runtime system selects the best compatible implementation for the machine running the code.
It crashes with an illegal-instruction fault (illegal opcode / SIGILL).
So when I am downloading a binary - i.e. Krita/GIMP or any open source project - how do the devs ensure that the common binary will utilize these newer instructions without hitting an 'illegal opcode' error?
That sounds like what would be called a pipeline today.
No, classic Cray did have a fixed vector size (4096 bits, or 64 double precision floating point values) both in the HW and the ISA. What it did have, however, which most contemporary SIMD ISAs lack, was a vector length register which allowed masking the end of the vector register. That meant the compiler didn't have to generate a scalar loop tail to handle the final iterations of a vectorized loop; it just set the vector length register to mask off the end of the vector data register, which meant that load/store/arithmetic instructions only worked on the first N elements in the vector rather than the full 64.
I think the RISC-V V extension, as well as ARM SVE, both have something similar as well. What RVV and SVE have, that Cray vector ISA didn't, is the ability to support different HW vector register lengths with the same ISA.
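To make the "scalar loop tail" concrete, this is the shape a fixed-width SIMD target forces (plain C++ strip-mining, just for illustration); with a vector length register the final trip simply runs with VL = n mod 8 instead of falling back to element-at-a-time code:

```cpp
#include <cstddef>

void scale(float* x, std::size_t n, float k) {
    std::size_t i = 0;
    // Main loop: each 8-element chunk can map onto one full-width SIMD operation.
    for (; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            x[i + j] *= k;
    // Scalar tail: the leftover n % 8 elements, handled one at a time.
    for (; i < n; ++i)
        x[i] *= k;
}
```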
Same with vectors!
https://en.wikipedia.org/wiki/Vector_(mathematics_and_physic...
As a transplant to the ML world who migrated from physics I do find it funny that people in ML will insist that gradients are vectors. Once you get to general relativity your physics professors will beat into you the fact that gradients are not vectors. (They are one-forms.) But in the ML context it's a distinction without a difference, unless you're in a really specialized corner of the field like information geometry.
Tensor processing units sounds far more cutting edge than number array processing units.
Names can have different meanings based on context, even within pure mathematics. I think a mathematician would have little trouble discerning which kind of tensor is meant, as this sort of reusing existing terms is pretty common.
Just calling it an array might be underselling it, at this point. Perhaps a tensor in the context of CompSci is just a particular type of array with certain expected properties.
> Just calling it an array might be underselling it, at this point
I like the NDArray terminology used by numpy. I think it gets the point across more clearly than “tensor”, and conceptually they’re pretty much equivalent
You could apply that argument to “isomorphic” functions, c++ “vectors” or compute “vectorization”, “integer” types… even though the semantics aren’t exactly the same between the computing and math terms, they’re related enough that they do help with understanding for those coming from a math background (which to be fair is probably fewer and fewer programmers over time).
Isomorphic javascript is one of the most egregious.
I think that naming battle is lost.
Would indeed be cool to have true tensor processing units :-)
PKD level of cool as you could argue those chips directly process spacetime chunks
I think the bigger difficulty lies in how to store and process this stuff on a GPU/other accelerator.
If the OP cares about implementation details about how an API like PyTorch is made, I think the MiniTorch 'book' is a pretty good intro:
The MiniTorch book seems amazing, thanks
> the principles of vectorizing code.
> 1D arrays
I think accelerators mostly do computations over small matrices and not vectors, so it is "matricized" code, but I am not an expert in this area.
ChatGPT interpretation:
Sure! Let me break it down for you:
When we talk about "tensor ops," we're referring to operations performed on tensors, which are multidimensional arrays of numbers commonly used in mathematics and computer science.
Now, these tensor operations are typically implemented using a technique called "vectorizing code." In simple terms, vectorizing code means performing operations on entire arrays of data instead of looping through each element one by one. It's like doing multiple calculations at once, which can be more efficient and faster.
Tensors are usually represented as 1D arrays, meaning all the elements are arranged in a single line. Each dimension of the tensor has a concept called "stride," which represents how many elements we need to skip to move to the next element in that dimension. This helps us efficiently access and manipulate the data in the tensor.
Additionally, the innermost dimension of a tensor is always "contiguous," which means the elements are stored sequentially without any gaps. This arrangement also aids in efficient processing of the tensor data.
So, in summary, tensor ops involve performing operations on multidimensional arrays, and we use vectorized code to do these operations more efficiently. Tensors are represented as 1D arrays with strides for each dimension, and the innermost dimension is always stored sequentially without any gaps.
That's pretty much how every serious dense matrix (or more generally, multi-dimensional array) representation has been done since, well, forever. It's only C programmers writing "My First Matrix Program" that insist on allocating each row separately due to the lack of native multidimensional arrays in that language.
Fortran did it:
A 2-dimensional array A will, in the object program, be stored sequentially in the order A(1,1), A(2,1), ..., A(m,1), A(1,2), A(2,2), ..., A(m,2), ..., A(m,n). Thus it is stored "columnwise", with the first of its subscripts varying most rapidly, and the last varying least rapidly. The same is true of 3-dimensional arrays.
1-dimensional arrays are of course simply stored sequentially. All arrays are stored backwards in storage; i.e. the above sequence is in the order of decreasing absolute location.
https://ia800807.us.archive.org/0/items/history-of-fortran/I...
Yes! I was surprised how simple it was. Obvious once you know how. I never really considered multidimensional storage until looking into Machine Learning recently. My only brush with it is relational database tuning, but that is very different, as the number of "dimensions" is small and they are usually not homogeneous.
If there is matrix multiply at the hardware level, it's fair to have another name than vectorization. For example, the dimensions and partitioning of large matrices to fit would be specific to that design, and very different from rolling things out on 1D arrays.
Maybe they should call it a Matrix Multiplication Xcelerator just to mess with people.
From the introductory text “The ultimate goal of vectorization is an increase in floating-point performance…”. I find it a bit funny that by vectorizing float executions we’ll be able to calculate wrong answers faster ;-)
It's not a wrong answer, any more than writing down ⅓ as 0.333, multiplying that by 3, and discovering that the result is 0.999 and not 1.
And there's nothing in vectorization that requires floating-point numbers; it's just that the algorithms that tend to demand the most out of a computer tend to be large numerical algorithms, which use floating-point.
They're not wrong answers, they're floating-point answers. What's wrong is the assumption that infinite precision will fit in a handful of bytes.
for many practical applications where we compute things, the input data isn't fully accurate, it's an estimate with significant error or uncertainty. when we consider how the outputs of computations are used, in many cases it isn't useful for outputs to be arbitrarily precise. maybe the output is used to make a decision and the expected cost or benefit decision won't significantly change even if the output is wrong by 1%, for example.
one person's "wrong" is another person's approximation. by reducing accuracy we can often get very large performance improvements. good engineering makes an appropriate tradeoff.
there's even more scope for this kind of thing inside some numerical algorithms where having a "pretty good" guess gives a massive speed boost, but the algorithm can be iterated to find quite precise answers even if the "pretty good" guess is an approximation that's off by 20%
there's yet more scope for this if we consider if we're trying to mathematically model reality -- even if we perform all the math with arbitrary precision calculations, the theoretical model itself is still an approximation of reality, there's some error there. there's limited value in solving a model exactly, as there's always some approximation error between the model and reality.
it's amazing we have such good support for high-performance scientific quality ieee floating point math everywhere in commodity hardware. it's a great tool to have in the toolbelt
How is this relevant? This is more of a critique of floats than it is about vectorization.
Still irrelevant.
The linked document describes Intel's autovectorizer, its warnings, and the compiler flags that report which loops were autovectorized or not, as well as listing specific reason codes why.
Microsoft, GCC and Clang all do this too, though with different compiler flags and messages.
I'd say that the whole point of the document listed here is to build the programmer up to understanding these messages and knowing specifically how to fix the issues that cause an autovectorization failure.
In Rust I've noticed that using iterators instead of for loops tends to trigger auto-vectorization more often.
In Rust using iterators is almost always better than for loops. With iterators the compiler is virtually guaranteed to optimize out bounds checks, versus indexing in.
Kahan summation doesn't fix cancellation error. It just extends the precision by another 53 bits or so.
The only way you fix cancellation error is summing the positives in one variable, and the negatives in another, and do a big cancellation at the end.
Anyway, my overall point is not to criticize Kahan summing or other techniques. My point is that these techniques are incompatible with optimizations. The faster way is often times a less stable operation.
cpu cache line optimization is relevant to a superminority of programming contexts
if you're working in those contexts then it's all fine, but it's important to acknowledge that most software development is bottlenecked by concerns far above this level of concern
I think crest may be referring to function multiversioning, but that only helps choose/call the implementation, not generate it.
github.com/google/highway supports compiling your single source code once into several versions, and then dispatching to the best one.
Disclosure: I am the main author.
Depends on the binary. Compiling for the lowest common denominator is one strategy. The other is to have the code detect at runtime what's available and only use that subset.
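A sketch of that second strategy on GCC/Clang; __builtin_cpu_supports and the target attribute are the usual GCC/Clang mechanisms for this, and the function names here are made up:

```cpp
#include <cstdio>

// AVX2 variant: the attribute lets the compiler use AVX2 inside this one
// function even if the rest of the binary targets a baseline ISA.
__attribute__((target("avx2")))
static double dot_avx2(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Baseline variant, safe on any x86-64 CPU.
static double dot_baseline(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    // Pick the implementation once, based on what the running CPU reports.
    double (*dot)(const double*, const double*, int) =
        __builtin_cpu_supports("avx2") ? dot_avx2 : dot_baseline;
    std::printf("%f\n", dot(a, b, 4));
}
```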
Interesting. How do the devs decide on the lowest common denominator? Do you happen to know any blog or resources that explains the process?
> Once you get to general relativity your physics professors will beat into you the fact that gradients are not vectors.
In linear algebra you should learn this as soon as you hear about it. Is the English Wikipedia page correct in that you always use nabla for the gradient? https://en.wikipedia.org/wiki/Gradient#Generalizations
Because `grad f = ∇f` holds in scalar fields f only (not in vector or tensor fields (tensors that aren't scalars)).
Even if you are taking the gradient of a scalar field, the resulting object is not a vector. The difference is that if you apply some transformation to your coordinate system a gradient transforms differently from a normal vector.
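Concretely, for a scalar function f and a change of coordinates x → x′, the components pick up opposite Jacobian factors (standard formulas, with summation over the repeated index j implied):

```latex
% vector components: contravariant
v'^{\,i} = \frac{\partial x'^{\,i}}{\partial x^{j}}\, v^{j}
% gradient (one-form) components: covariant
(\mathrm{d}f)'_{\,i} = \frac{\partial x^{j}}{\partial x'^{\,i}}\, \frac{\partial f}{\partial x^{j}}
```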
In the ML context it generally doesn't matter because you don't do these transformations (and more generally there is no metric for your space). So you can just treat a gradient as an array of numbers. But in a physical context the distinction starts to matter, at least on a manifold with curvature.
The one place it does matter in ML (that I'm aware of) is information geometry, where you try to do a transformation from the normal coordinate space to a more "natural" coordinate system that is based on the Fisher information, and this introduces curvature into model.
> the result is 0.999 and not 1.
But 0.999... is 1.
Edit: Nevermind, I see you meant "writing down 1/3 as 0.333" literally.
Most devs don't care, so it depends on the compiler default. However, some people who are eager for optimization will turn on -march=native to get the best profile out of the currently running processor family. This is where the trouble comes in.
I touched that question in "Hardware Support" section, on page 6 of that article http://const.me/articles/simd/simd.pdf
TLDR: SSE1 and SSE2 are parts of the base AMD64 ISA, they are always available while compiling for 64 bits. SSE3, SSE4.1 are old, market penetration is very close to 100%, always available in practice. For the newer extensions like AVX and FMA, the optimal tradeoff depends on the product and the market.
Do I need to download the source and compile it using correct flags on the target system for optimal performance? In my previous institution they would download the commonly distributed R language binary and run as is on a brand new system, and I was wondering if recompiling from source would have made a difference.
I don’t know, it depends on the specific library.
If they were compiling for least-common denominator like SSE2, recompiling with -march=native might make a difference. The exact amount of that difference depends heavily on the code.
Automatic vectorizers are limited: they only generate good vectorized code for purely vertical operations on long, dense arrays of FP32 or FP64 values.
OTOH, some widely-used libraries like Eigen https://eigen.tuxfamily.org/index.php?title=Main_Page are using manually-written intrinsics under the hood, selecting an implementation with preprocessor macros. If your R library uses Eigen, compiling for AVX2 or AVX512 will probably cause a non-trivial performance win.
If your library uses runtime dispatch to select best implementation in runtime, and is compiled recently (i.e. you don’t have new instruction sets in your CPU which weren’t available back then), you’re unlikely to get much profit.
BTW, I’ve noticed C standard library implemented by VC++ uses runtime dispatch, functions like memcpy() are using AVX vectors internally, when supported.