> Vectorization is a process by which floating-point computations in scientific code are compiled into special instructions that execute elementary operations (+,-,*, etc.) or functions (exp, cos, etc.) in parallel on fixed-size vector arrays.
I guess this is a scientific computing course or something, but even so I feel it’s important to point out that most processors have many vector instructions that operate on integers, bitfields, characters, and the like. The fundamental premise of “do a thing on a bunch of data at once” isn’t limited to floating-point operations.
It seems odd in a scientific computing class not to mention classic Cray-style vectors. I mean, they haven’t been popular for a while, but you never know, RISC-V might bring them back.
Vectorization (disambiguation) https://en.wikipedia.org/wiki/Vectorization :
> Array programming, a style of computer programming where operations are applied to whole arrays instead of individual elements
> Automatic vectorization, a compiler optimization that transforms loops to vector operations
> Image tracing, the creation of vector graphics from raster graphics
> Word embedding, mapping words to vectors, in natural language processing
> Vectorization (mathematics), a linear transformation which converts a matrix into a column vector
Vector (disambiguation) https://en.wikipedia.org/wiki/Vector
> Vector (mathematics and physics):
> Row and column vectors, single row or column matrices
> Vector space
> Vector field, a vector for each point
And then there are a number of CS usages of the word vector for 1D arrays.
Compute kernel: https://en.wikipedia.org/wiki/Compute_kernel
GPGPU > Vectorization, Stream Processing > Compute kernels: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...
sympy.utilities.lambdify.lambdify() https://github.com/sympy/sympy/blob/a76b02fcd3a8b7f79b3a88df... :
> """Convert a SymPy expression into a function that allows for fast numeric evaluation [with the CPython math module, mpmath, NumPy, SciPy, CuPy, JAX, TensorFlow, SymPt, numexpr,]
PyTorch lambdify PR, sympytorch: https://github.com/sympy/sympy/pull/20516#issuecomment-78428...
sympytorch: https://github.com/patrick-kidger/sympytorch :
> Turn SymPy expressions into PyTorch Modules.
> SymPy floats (optionally) become trainable parameters. SymPy symbols are inputs to the Module.
sympy2jax https://github.com/MilesCranmer/sympy2jax :
> Turn SymPy expressions into parametrized, differentiable, vectorizable, JAX functions.
> All SymPy floats become trainable input parameters. SymPy symbols become columns of a passed matrix.
Auto-vectorization is easier to get into than other SIMD-frameworks like CUDA, OpenCL, ROCm, Intel's ISPC and whatnot. But in my experience, auto-vectorizers are just not as flexible as the proper SIMD-tools.
I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit. You barely have to do anything and suddenly your for-loops are AVX512-optimized, though maybe not to the fullest extent possible.
Still, I suggest that programmers also learn how to properly make SIMD code. Maybe intrinsics are too hard in practice, but ISPC, CUDA, and other SIMD-programming environments make things far easier to learn than you might expect.
------------
ISPC in particular is Intel's SIMD programming language, much akin to CUDA except that it compiles to CPU SIMD instructions such as AVX512. So for AVX512-like code execution environments, the ISPC language/compiler is exceptionally useful.
It's harder to learn a new language than a few compiler options that enable auto-vectorization, however. So in practice, auto-vectorization will continue to be used. But for tasks that specifically benefit from SIMD thinking, the dedicated ISPC language should be beneficial.
> I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit.
Somewhat related: 55 GiB/s FizzBuzz https://codegolf.stackexchange.com/questions/215216/high-thr... (discussion: 29031488)
What do you mean by learning auto vectorization? Do you mean learning how to structure your code so that your compiler of choice has a good chance of pulling it off?
Even more important than learning how to vectorize is learning the basics of computer architecture: registers, caches, pipelining, branch prediction, and superscalar execution.
One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics right, such as using multiple partial sums.
Once you do the work such that your code actually allows execution to happen in parallel, then leveraging superscalar and simd happens automatically.
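For anyone who hasn't seen the trick, here is a minimal sketch of the "multiple partial sums" idea in C++ (the function names and the choice of four accumulators are just illustrative):

```cpp
#include <cstddef>

// Naive sum: each iteration depends on the previous one, so the adds
// cannot overlap in the pipeline, and the compiler cannot vectorize the
// reduction unless it is allowed to reorder float operations.
double sum_naive(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Four independent accumulators break the dependency chain, letting the
// superscalar core keep several additions in flight and giving the
// compiler room to map them onto SIMD registers. The result can differ
// slightly from the naive sum because float addition is not associative.
double sum_partial(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // scalar tail for the leftover elements
    return (s0 + s1) + (s2 + s3);
}
```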
> auto-vectorizers are just not as flexible as the proper SIMD-tools
What is a "proper" SIMD tool?
What would happen if I tried to run a program that contains AVX instructions on a system that doesn't support them? Do I need to recompile the program?
I wonder how much these apply to the Julia compiler: based on the docs for `@simd` and the writing here [1], only the first two criteria on this page [2] seem to be really a requirement in Julia's case.
The one about avoiding function calls is a no-go, since things like + and * are function calls too. But is it still required for all function calls to be inline-able?
Indirect indexing isn't mentioned in Julia's case, but is it one of the things that would disable auto-vectorization and require an explicit @simd annotation?
[1] https://viralinstruction.com/posts/hardware/#74a3ddb4-8af1-1... [2] https://cvw.cac.cornell.edu/vector/coding_vectorizable
Yes, it is required for all functions to be inlined in Julia. Note that Julia 1.8 and newer supports callsite inlining, so you can, for example, do `@inline exp(x[i])` inside your `for` loop, and this will cause `exp` to get inlined there, and should enable vectorization. For functions like `log` or `sin`, this is not enough, but you could try `@turbo` from LoopVectorization.jl, which supports these and some others.
`@simd ivdep` may be needed when you have indirect indexing. Do not forget the `ivdep`, but only use it if you know it is actually safe; it can lead to hard-to-track-down bugs if it is not.
When I was in grad school programming on a Cray, "vectorization" was about chaining operations across an array and getting the inner-loop dependencies right, rather than parallel operations across elements of wide words. Interesting how the definitions have changed with architectural changes.
From what I recall, in vector machines, operations are still done in parallel due to chaining; the difference is that you're doing e.g. add(v[1]) and mul(v[0]) in parallel, then add(v[2]) and mul(v[1]), and so on, rather than add(v[0]) and add(v[1]) in parallel, then mul(v[0]) and mul(v[1]), &c, which makes support for any vector length trivial.
(I'm sure you know this; this is just for the benefit of anyone else perusing the comment section.)
This presentation seems very “let’s get ya going on the Xeons we’ve got in Stampede today” focused. Which, to be fair, is probably what lots of researchers who are interested in using (rather than studying) HPC need to know.
RISC-V has, or at least had, some classic vector extension proposal. Not sure if it actually got implemented.
This looks like a curious case, non-fixed-length vectorization code in RISC-V:
> "Perhaps the most interesting part of the open RISC-V instruction set architecture (ISA) is the vector extension (RISC-V "V"). In contrast to the average single-instruction multipe-data (SIMD) instruction set, RISC-V vector instructions are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is flexible in choosing a vector register size while RISC-V "V" binary code is portable between different CPU implementations."
Classical vector machines (think Cray) had flexible vector sizes. But eventually they were buried under a tide of consumer processors, which were fundamentally scalar. Then as these machines took over the world, they had to go into every niche, so the MMX style “fixed length vector instruction” was invented, for CPUs that might have to do some math while still fitting into that paradigm.
RISC-V also has the P extension, which follows this pattern. It will be very cool if RISC-V proves flexible enough to really enable a resurgence of classic vectors though.
We also need something like a Tensorization: Introduction.
We really need to name tensors differently. "Array" would be a more fitting name.
Tensors are really different mathematical objects, with a far richer structure than those used in the Deep Learning context.
Tensor ops are usually implemented using the principles of vectorizing code.
After all most tensors are implemented as 1D arrays with strides for each dimension with the innermost dimension always contiguous.
PyTorch internals: http://blog.ezyang.com/2019/05/pytorch-internals/
A good basic thing to know is that a tensor is a one-dimensional array under the hood. At least that's my mental model - I have not yet looked at the code :-)
Therefore "converting" a 2 * 5 * 10 tensor into a 2 * 50 one is an immediate operation; it doesn't have to scan, it just changes some metadata.
From a hardware perspective, vector instructions operate on small 1D vectors, whereas tensor instructions operate on small 2D matrices. I say “instructions”, but it’s really only matrix multiply or matrix multiply and accumulate - most other instructions are fine staying as 1D.
This is mostly going to be the same. The main difference in tensorization is that your vectors are not contiguous, so you lose some speedup. How you typically optimize this is by flattening your tensor; then your data is contiguous and all the same rules apply.
This is overly simplified though. Things get different when we start talking about {S,M}I{S,M}D (page addresses SIMD) or GPUs. Parallel computing is a whole other game, and CUDA takes it to the next level (lots of people can write CUDA kernels, not a lot of people can write GOOD CUDA kernels (I'm definitely not a pro, but know some)). Parallelism adds another level of complexity due to being able to access different memory locations simultaneously. If you think concurrent optimization is magic, parallel optimization is wizardry (I still believe this is true)
Sorry if this is a different sense of that same word…
I can understand almost everything in the explanations of how ChatGPT, neural networks and fine-tuning works
But one thing isn’t being explained…
How do you embed the words as vectors????
Do you just make up your own scheme?
I understand about vector databases being used for search, to retrieve snippets and stuff them into a prompt. I don’t mean that.
I mean how did people embed the words WHEN THEY USE THE EMBEDDINGS API?
I think this whole vector database and pinecone stuff will be phased out once the LLM windows get larger, like with the new Claude from Anthropic. And the ability to fine tune the model on your data. Then the LLM can make its own lower-rank model of your entire corpus of data, rather than having to do vector lookups and stuffing it into a prompt.
> Sorry if this is a different sense of that same word…
it is, hence the downvotes. This one is more about CPU operations, AVX/SIMD
I know I’m probably going to get downvoted for this, but I’m going to post it anyhow :-)
https://jvns.ca/blog/2023/02/08/why-does-0-1-plus-0-2-equal-...
Perhaps you could point out in which way you think this is relevant?
I thought I was crazy when looking at all those vector extensions piling up made me think "why not just have some generic-sized vector instructions, and then let the CPU sort out what is the largest chunk size it actually supports internally" — but apparently no, I am not! It has actually been done before, and we might even have that again in the future. That's nice.
ARM's newer SVE/SVE2 instructions can do something like this now. SVE allows the vector length to vary between 128 and 2048 bits, at 128 bit increments. RISC-V's vector RVV instructions also allow the vector length to vary across different CPU implementations. This way, the same RVV vector code can run on different CPUs having different vector lengths.
Definitely not autovectorized.
Compilers are pretty good at restructuring code. As a programmer you need to know some common situations where the compiler would like to reorder your code, but can't.
E.g. when you do summation using a for loop and the compiler is trying to preserve the rounding behavior of your float operations. Or because you created a dependency chain with a for loop where each iteration requires the previous iteration's result.
Pretty much.
If one wants to sum two arrays, then instead of looping through the elements one at a time, one might iterate by chunks, so that the compiler can easily tie each chunk together into a single operation that can then be vectorized.
If I recall, you can absolutely loop through the elements in a tight loop and the compiler (e.g. GCC) will auto-vectorize for you (if you have the relevant optimization flag set).
The trick with coding for auto-vectorization is to keep your loops small and free of clutter.
I don't have the documentation handy, but I think you only need to follow a couple of rules (a minimal sketch follows the list):
- loop must have a defined size (for-loop instead of while-loop)
- don't muck with pointers inside the loop (simple pointer increment is okay)
- don't modify other variables (only the array should be modified)
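A minimal example of a loop shaped to follow those rules, assuming GCC or Clang with something like -O3 -march=native; __restrict is a common extension that promises the arrays don't alias, which removes one more reason for the vectorizer to bail out:

```cpp
#include <cstddef>

// Trip count known up front, no pointer fiddling in the body, and only
// the output array is written: this shape typically auto-vectorizes.
void add_arrays(const float* __restrict a,
                const float* __restrict b,
                float* __restrict out,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```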
Here are some examples, or you could use this library itself in Rust: https://github.com/vorner/slipstream
> One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics right, such as using multiple partial sums.
Ehhh... I've got multiple hats from my college classes.
Yes, optimization says that multiple partial sums is fastest. But... my floating-point analysis classes have my brain yelling "cancellation error". The way to minimize cancellation error is to sort the numbers by magnitude and separate the positives from the negatives. You add the numbers from smallest to largest in magnitude, keeping the signs separate, and then finally combine the two sums at the end.
The only two partial sums you should be keeping are "positive numbers" and "negative numbers", and everything else needs to be folded together to minimize rounding and cancellation errors.
--------
You know, for stability and minimization of errors. There are other, possibly faster, methods to be employed. But optimization of partial sums is probably the last thing I'm thinking about if I see "float" or "double".
If things are "int" or "long", I'll have a higher chance of actually thinking (and caring) about optimization, because stability/cancellation error isn't a thing with int or long.
The reason you need multiple sums is because there is a loop dependency which causes every iteration to stall and prevents pipelining and saturating the multiple compute units of your superscalar processor.
Yes, floating-point isn't associative, so the results aren't numerically equivalent, and that is why you need to do that optimization yourself since the compiler isn't allowed to do it for you.
As per the exercise, there is no information about the distribution of the data, so there is no indication that any order would be any better than another for numerical stability.
If you care about stability you'd implement a Kahan sum, which isn't really affected by changing the order of the summation, so you can still parallelize it.
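For reference, a minimal Kahan (compensated) summation sketch in C++; note that flags like -ffast-math can legally optimize the compensation away:

```cpp
#include <cstddef>

// The compensation term `c` captures the low-order bits that are lost
// when a small element is added to a large running total, roughly
// extending the effective precision of the accumulator.
double kahan_sum(const double* a, std::size_t n) {
    double sum = 0.0, c = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   // what was rounded off in `sum + y`
        sum = t;
    }
    return sum;
}
```

Running one such compensated accumulator per lane and combining the (sum, c) pairs at the end is a common way to reconcile this with the partial-sums trick.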
Is asking a question that fewer than 1% of applicants "can even get the basics correct" -- which implies that only a tiny fraction of 1% actually provide a fully complete and correct answer -- a good use of either the applicants' or the interviewer's time? Is it effective in distinguishing candidates, or are you only seeking 1 success (fewer, actually) per 100?
1% success rate is typical when interviewing developers.
Also they get asked multiple exercises, not doing great at one is not eliminatory.
Being able to quickly identify whether an individual is exceptionally good is very important, so the employer can fast-track them and respond quickly enough that they don't accept another offer.
There is no such thing as 'failure to autovectorize' in CUDA or OpenCL. All code is vectorized in these languages. SIMD is fundamental to the language model.
As such, it's easier to write high performance code in practice. Intel's ISPC is the closest tool that replicates this effect for AVX512.
The cheeky answer would be to throw away the trash every decade, but the real answer is that there is tooling in place to pick the best possible implementation of a function at link time on modern *nix systems to target several levels of SIMD support e.g. SSE2 only, SSE4.2, AVX2, common collection of AVX512 extensions.
What tool are you referring to? Can you please point me to those resources?
E.g. ISPC can compile for multiple targets at once. The runtime system selects the best compatible implementation for the machine running the code.
It crashes with an illegal-instruction fault (illegal opcode / SIGILL).
So when I am downloading a binary - i.e. Krita/GIMP or any open source project - how do the devs ensure that the common binary will utilize these newer instructions without hitting an 'illegal opcode' error?
That sounds like what would be called a pipeline today.
No, classic Cray did have a fixed vector size (4096 bits, or 64 double precision floating point values) both in the HW and the ISA. What it did have, however, which most contemporary SIMD ISAs lack, was a vector length register which allowed masking the end of the vector register. That meant the compiler didn't have to generate a scalar loop tail to handle the final iterations of a vectorized loop; it just set the vector length register to mask off the end of the vector data register, which meant that load/store/arithmetic instructions only worked on the first N elements in the vector rather than the full 64.
I think the RISC-V V extension, as well as ARM SVE, both have something similar as well. What RVV and SVE have, that Cray vector ISA didn't, is the ability to support different HW vector register lengths with the same ISA.
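To make the "scalar loop tail" concrete, this is the shape a fixed-width SIMD target forces (plain C++ strip-mining, just for illustration); with a vector length register the final trip simply runs with VL = n mod 8 instead of falling back to element-at-a-time code:

```cpp
#include <cstddef>

void scale(float* x, std::size_t n, float k) {
    std::size_t i = 0;
    // Main loop: each 8-element chunk can map onto one full-width SIMD operation.
    for (; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            x[i + j] *= k;
    // Scalar tail: the leftover n % 8 elements, handled one at a time.
    for (; i < n; ++i)
        x[i] *= k;
}
```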
Same with vectors!
https://en.wikipedia.org/wiki/Vector_(mathematics_and_physic...
As a transplant to the ML world who migrated from physics I do find it funny that people in ML will insist that gradients are vectors. Once you get to general relativity your physics professors will beat into you the fact that gradients are not vectors. (They are one-forms.) But in the ML context it's a distinction without a difference, unless you're in a really specialized corner of the field like information geometry.
Tensor processing units sounds far more cutting edge than number array processing units.
Names can have different meanings based on context, even within pure mathematics. I think a mathematician would have little trouble discerning which kind of tensor is meant, as this sort of reusing existing terms is pretty common.
Just calling it an array might be underselling it, at this point. Perhaps a tensor in the context of CompSci is just a particular type of array with certain expected properties.
> Just calling it an array might be underselling it, at this point
I like the NDArray terminology used by numpy. I think it gets the point across more clearly than “tensor”, and conceptually they’re pretty much equivalent
You could apply that argument to “isomorphic” functions, c++ “vectors” or compute “vectorization”, “integer” types… even though the semantics aren’t exactly the same between the computing and math terms, they’re related enough that they do help with understanding for those coming from a math background (which to be fair is probably fewer and fewer programmers over time).
Isomorphic javascript is one of the most egregious.
I think that naming battle is lost.
Would indeed be cool to have true tensor processing units :-)
PKD level of cool as you could argue those chips directly process spacetime chunks
I think the bigger difficulty lies in how to store and process this stuff on a GPU/other accelerator.
If the OP cares about implementation details about how an API like PyTorch is made, I think the MiniTorch 'book' is a pretty good intro:
The MiniTorch book seems amazing, thanks
> the principles of vectorizing code.
> 1D arrays
I think accelerators mostly do computations over small matrices and not vectors, so it is "matricized" code, but I am not an expert in this area.
ChatGPT interpretation:
Sure! Let me break it down for you:
When we talk about "tensor ops," we're referring to operations performed on tensors, which are multidimensional arrays of numbers commonly used in mathematics and computer science.
Now, these tensor operations are typically implemented using a technique called "vectorizing code." In simple terms, vectorizing code means performing operations on entire arrays of data instead of looping through each element one by one. It's like doing multiple calculations at once, which can be more efficient and faster.
Tensors are usually represented as 1D arrays, meaning all the elements are arranged in a single line. Each dimension of the tensor has a concept called "stride," which represents how many elements we need to skip to move to the next element in that dimension. This helps us efficiently access and manipulate the data in the tensor.
Additionally, the innermost dimension of a tensor is always "contiguous," which means the elements are stored sequentially without any gaps. This arrangement also aids in efficient processing of the tensor data.
So, in summary, tensor ops involve performing operations on multidimensional arrays, and we use vectorized code to do these operations more efficiently. Tensors are represented as 1D arrays with strides for each dimension, and the innermost dimension is always stored sequentially without any gaps.
That's pretty much how every serious dense matrix (or more generally, multi-dimensional array) representation has been done since, well, forever. It's only C programmers writing "My First Matrix Program" that insist on allocating each row separately due to the lack of native multidimensional arrays in that language.
Fortran did it:
A 2-dimensional array A will, in the object program, be stored sequentially in the order A(1,1), A(2,1), ..., A(m,1), A(1,2), A(2,2), ..., A(m,2), ..., A(m,n). Thus it is stored "columnwise", with the first of its subscripts varying most rapidly, and the last varying least rapidly. The same is true of 3-dimensional arrays.
1-dimensional arrays are of course simply stored sequentially. All arrays are stored backwards in storage; i.e. the above sequence is in the order of decreasing absolute location.
https://ia800807.us.archive.org/0/items/history-of-fortran/I...
Yes! I was surprised how simple it was. Obvious once you know how. I never really considered multidimensional storage until looking into Machine Learning recently. My only brush with it is relational database tuning, but that is very different, as the number of "dimensions" is small and they are usually not homogeneous.
If there is matrix multiply at the hardware level, it's fair to have another name than vectorization. For example, the dimensions and partitioning of large matrices to fit would be specific to that design, and very different from rolling things out on 1D arrays.
Maybe they should call it a Matrix Multiplication Xcelerator just to mess with people.
From the introductory text “The ultimate goal of vectorization is an increase in floating-point performance…”. I find it a bit funny that by vectorizing float executions we’ll be able to calculate wrong answers faster ;-)
It's not a wrong answer, any more than writing down ⅓ as 0.333, multiplying that by 3, and discovering that the result is 0.999 and not 1.
And there's nothing in vectorization that requires floating-point numbers; it's just that the algorithms that tend to demand the most out of a computer tend to be large numerical algorithms, which use floating-point.
They're not wrong answers, they're floating-point answers. What's wrong is the assumption that infinite precision will fit in a handful of bytes.
for many practical applications where we compute things, the input data isn't fully accurate, it's an estimate with significant error or uncertainty. when we consider how the outputs of computations are used, in many cases it isn't useful for outputs to be arbitrarily precise. maybe the output is used to make a decision and the expected cost or benefit decision won't significantly change even if the output is wrong by 1%, for example.
one person's "wrong" is another person's approximation. by reducing accuracy we can often get very large performance improvements. good engineering makes an appropriate tradeoff.
there's even more scope for this kind of thing inside some numerical algorithms where having a "pretty good" guess gives a massive speed boost, but the algorithm can be iterated to find quite precise answers even if the "pretty good" guess is an approximation that's off by 20%
there's yet more scope for this if we consider if we're trying to mathematically model reality -- even if we perform all the math with arbitrary precision calculations, the theoretical model itself is still an approximation of reality, there's some error there. there's limited value in solving a model exactly, as there's always some approximation error between the model and reality.
it's amazing we have such good support for high-performance scientific quality ieee floating point math everywhere in commodity hardware. it's a great tool to have in the toolbelt
How is this relevant? This is more of a critique of floats than it is about vectorization.
Still irrelevant.
The linked document describes Intel's autovectorizer, its warnings, and the compiler flags that report which loops were autovectorized or not, as well as listing specific reason codes why.
Microsoft, GCC and Clang all do this too, though with different compiler flags and messages.
I'd say that the whole point of the document listed here is to build the programmer up to understanding these messages and knowing specifically how to fix the issues that cause an autovectorization failure.
In Rust I've noticed that using iterators instead of for loops tends to trigger auto-vectorization more often.
In Rust using iterators is almost always better than for loops. With iterators the compiler is virtually guaranteed to optimize out bounds checks, versus indexing in.
Kahan summation doesn't fix cancellation error. It just extends the precision by another 53 bits or so.
The only way you fix cancellation error is summing the positives in one variable, and the negatives in another, and do a big cancellation at the end.
Anyway, my overall point is not to criticize Kahan summing or other techniques. My point is that these techniques are incompatible with optimizations. The faster way is often times a less stable operation.
cpu cache line optimization is relevant to a superminority of programming contexts
if you're working in those contexts then it's all fine, but it's important to acknowledge that most software development is bottlenecked by concerns far above this level of concern
I think crest may be referring to function multiversioning, but that only helps choose/call the implementation, not generate it.
github.com/google/highway supports compiling your single source code once into several versions, and then dispatching to the best one.
Disclosure: I am the main author.
Depends on the binary. Compiling for the lowest common denominator is one strategy. The other is to have the code detect at runtime what's available and only use that subset.
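A sketch of that second strategy on GCC/Clang; __builtin_cpu_supports and the target attribute are the usual GCC/Clang mechanisms for this, and the function names here are made up:

```cpp
#include <cstdio>

// AVX2 variant: the attribute lets the compiler use AVX2 inside this one
// function even if the rest of the binary targets a baseline ISA.
__attribute__((target("avx2")))
static double dot_avx2(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Baseline variant, safe on any x86-64 CPU.
static double dot_baseline(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    // Pick the implementation once, based on what the running CPU reports.
    double (*dot)(const double*, const double*, int) =
        __builtin_cpu_supports("avx2") ? dot_avx2 : dot_baseline;
    std::printf("%f\n", dot(a, b, 4));
}
```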
Interesting. How do the devs decide on the lowest common denominator? Do you happen to know any blog or resources that explains the process?
> Once you get to general relativity your physics professors will beat into you the fact that gradients are not vectors.
In linear algebra you should learn this as soon as you hear about it. Is the English Wikipedia page correct in that you always use nabla for the gradient? https://en.wikipedia.org/wiki/Gradient#Generalizations
Because `grad f = ∇f` holds in scalar fields f only (not in vector or tensor fields (tensors that aren't scalars)).
Even if you are taking the gradient of a scalar field, the resulting object is not a vector. The difference is that if you apply some transformation to your coordinate system a gradient transforms differently from a normal vector.
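Concretely, for a scalar function f and a change of coordinates x → x′, the components pick up opposite Jacobian factors (standard formulas, with summation over the repeated index j implied):

```latex
% vector components: contravariant
v'^{\,i} = \frac{\partial x'^{\,i}}{\partial x^{j}}\, v^{j}
% gradient (one-form) components: covariant
(\mathrm{d}f)'_{\,i} = \frac{\partial x^{j}}{\partial x'^{\,i}}\, \frac{\partial f}{\partial x^{j}}
```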
In the ML context it generally doesn't matter because you don't do these transformations (and more generally there is no metric for your space). So you can just treat a gradient as an array of numbers. But in a physical context the distinction starts to matter, at least on a manifold with curvature.
The one place it does matter in ML (that I'm aware of) is information geometry, where you try to do a transformation from the normal coordinate space to a more "natural" coordinate system that is based on the Fisher information, and this introduces curvature into model.
> the result is 0.999 and not 1.
But 0.999... is 1.
Edit: Nevermind, I see you meant "writing down 1/3 as 0.333" literally.
Most devs don't care, so it depends on the compiler default. However, some people who are eager for optimization will turn on -march=native to get the best profile out of the currently running processor family. This is where the trouble comes in.
I touched that question in "Hardware Support" section, on page 6 of that article http://const.me/articles/simd/simd.pdf
TLDR: SSE1 and SSE2 are parts of the base AMD64 ISA, they are always available while compiling for 64 bits. SSE3, SSE4.1 are old, market penetration is very close to 100%, always available in practice. For the newer extensions like AVX and FMA, the optimal tradeoff depends on the product and the market.
Do I need to download the source and compile it using correct flags on the target system for optimal performance? In my previous institution they would download the commonly distributed R language binary and run as is on a brand new system, and I was wondering if recompiling from source would have made a difference.
I don’t know, it depends on the specific library.
If they were compiling for least-common denominator like SSE2, recompiling with -march=native might make a difference. The exact amount of that difference depends heavily on the code.
Automatic vectorizers are limited: they only generate good vectorized code for purely vertical operations on long, dense arrays of FP32 or FP64 values.
OTOH, some widely-used libraries like Eigen https://eigen.tuxfamily.org/index.php?title=Main_Page are using manually-written intrinsics under the hood, selecting an implementation with preprocessor macros. If your R library uses Eigen, compiling for AVX2 or AVX512 will probably cause a non-trivial performance win.
If your library uses runtime dispatch to select best implementation in runtime, and is compiled recently (i.e. you don’t have new instruction sets in your CPU which weren’t available back then), you’re unlikely to get much profit.
BTW, I’ve noticed C standard library implemented by VC++ uses runtime dispatch, functions like memcpy() are using AVX vectors internally, when supported.