Windows ARM support for MSVC Compiler#597
Windows ARM support for MSVC Compiler#597dousse-adobe wants to merge 6 commits intoRenderKit:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds Windows ARM64 (MSVC) build support by tightening SIMD type conversions and extending ARM/NEON feature detection across the codebase.
Changes:
- Detect Windows ARM64 in CMake/MSVC toolchain setup and enable ARM-specific build flags.
- Replace implicit SIMD/native-type conversions with explicit accessors (
m128(),m128i(),m256i(),vec_float(),vec_int(), etc.) throughout kernels, tutorials, and math types. - Extend ARM64 handling in intrinsics/emulation and architecture-specific traversal/intersector paths.
Reviewed changes
Copilot reviewed 79 out of 80 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| CMakeLists.txt | Detect Windows ARM64 generator platform and enable EMBREE_ARM. |
| common/cmake/msvc.cmake | Add ARM-specific MSVC defines/flags (incl. /Zc:preprocessor). |
| common/cmake/check_arm_neon.cpp | Accept _M_ARM64 as NEON-capable for compile checks. |
| common/sys/thread.cpp | Include ARM emulation header for _M_ARM64. |
| common/sys/sysinfo.h | Treat _M_ARM64 similarly to __ARM_NEON in ISA selection. |
| common/sys/sysinfo.cpp | Recognize _M_ARM64 for CPU name/type and feature bits. |
| common/sys/platform.h | Define __64BIT__ for _M_ARM64. |
| common/sys/intrinsics.h | Add _M_ARM64 conditions for bit-scan/popcnt and includes. |
| common/math/emath.h | Add _M_ARM64 handling for scalar SIMD-assisted math helpers. |
| common/math/color.h | Extend ARM64 reciprocal/rsqrt paths to _M_ARM64. |
| common/math/bbox.h | Enable SIMD-specialized BBox<Vec3fa> empty() for _M_ARM64. |
| common/math/vec2.h | Include SIMD helpers for _M_ARM64. |
| common/math/vec2fa.h | Replace implicit SIMD conversions with explicit accessors for _M_ARM64. |
| common/math/vec3.h | Include SIMD helpers for _M_ARM64 and adjust Vec3 ctor access. |
| common/math/vec3ba.h | Rename SIMD storage member and add explicit accessors. |
| common/math/vec3ia.h | Rename SIMD storage member and add explicit accessors/conversions. |
| common/math/vec4.h | Include SIMD helpers for _M_ARM64 and adjust Vec4 ctor access. |
| common/math/linearspace3.h | Use explicit m128() when converting SIMD vectors to Vec3fa. |
| common/simd/simd.h | Include SSE wrapper on _M_ARM64. |
| common/simd/sse.h | Enable blendv_ps helper on _M_ARM64. |
| common/simd/arm/emulation.h | Adjust precise-div macros for _M_ARM64. |
| common/simd/arm/avx2neon.h | MSVC compatibility, _M_ARM64 lane-access workarounds, and warning suppression. |
| common/simd/vuint8_avx.h | Remove implicit casts; add explicit m256i() and update uses. |
| common/simd/vint8_avx.h | Remove implicit casts; add explicit accessors and update uses. |
| common/simd/vllong4_avx2.h | Remove implicit casts; add explicit m256i() and update uses/masks. |
| common/simd/vllong8_avx512.h | Remove implicit casts; add explicit m512i() and update uses/masks. |
| common/simd/vdouble4_avx.h | Remove implicit cast; use m256d() in operations/comparisons/select. |
| common/simd/vboolf8_avx.h | Remove implicit conversions; add explicit m256()/m256d() and update ops. |
| common/simd/vboolf8_avx512.h | Remove implicit __mmask8 conversion; add packed mask getters and update ops. |
| common/simd/vboolf4_sse2.h | Remove implicit conversions; add explicit accessors and update ops/shuffles. |
| common/simd/vboolf4_avx512.h | Remove implicit __mmask8 conversion; add packed mask getters and update ops. |
| common/simd/vboolf16_avx512.h | Remove implicit __mmask16 conversion; add packed mask getters and update ops. |
| common/simd/vboold4_avx.h | Remove implicit conversions; add explicit accessors and update ops. |
| common/simd/vboold4_avx512.h | Remove implicit __mmask8 conversion; add packed mask getters and update ops. |
| common/simd/vboold8_avx512.h | Remove implicit __mmask8 conversion; add packed mask getters and update ops. |
| tutorials/common/scenegraph/scenegraph.cpp | Use explicit .m128() when constructing scalar vector types; adjust comparison logic. |
| tutorials/common/math/random_sampler.h | Use explicit .m128() when converting shifted int vector to Vec3fa. |
| kernels/common/state.cpp | Treat _M_ARM64 as ARM/NEON in ISA support checks and debug asserts. |
| kernels/common/isa.h | Enable SIMD4 target definition for _M_ARM64. |
| kernels/common/default.h | Replace implicit float->int vector conversion with explicit vec_int(). |
| kernels/common/buffer.h | Use explicit .m128() when converting loaded vfloat4 to Vec3fa. |
| kernels/common/acceln.cpp | Include _M_ARM64 in packet-validity early-out checks. |
| kernels/common/accel.h | Enable 4-wide intersector wrappers under _M_ARM64. |
| kernels/bvh/node_intersector_packet.h | Treat _M_ARM64 like __aarch64__ in FMA-related traversal math. |
| kernels/bvh/node_intersector_frustum.h | Treat _M_ARM64 like __aarch64__ in frustum traversal math. |
| kernels/bvh/node_intersector1.h | Treat _M_ARM64 like __aarch64__ in FMA-related traversal math and conditions. |
| kernels/bvh/bvh_traverser1.h | Replace implicit conversions with explicit mask/int accessors for AVX512VL path. |
| kernels/bvh/bvh_statistics.cpp | Include _M_ARM64 in BVH4 statistics instantiation conditions. |
| kernels/bvh/bvh_node_qaabb.h | Replace implicit int/float vector conversions with explicit vec_int()/vec_float(). |
| kernels/bvh/bvh_builder_morton.cpp | Use explicit .m128() conversions for BBox3fx construction. |
| kernels/bvh/bvh.cpp | Include _M_ARM64 in BVH4 instantiation conditions. |
| kernels/builders/primrefgen_presplit.h | Use explicit .m128i() when converting select(...) result to Vec3ia. |
| kernels/builders/heuristic_binning.h | Use explicit .m128i() for clamp/floor-to-int conversions. |
| kernels/builders/bvh_builder_morton.h | Use explicit .m128i() for centroid quantization to vint4. |
| kernels/geometry/subgrid_intersector_pluecker.h | Use explicit vec_float() for vint->vfloat conversions. |
| kernels/geometry/subgrid_intersector_moeller.h | Use explicit vec_float() for vint->vfloat conversions. |
| kernels/geometry/linei.h | Fix mask type for select() on 8-wide vectors (vboolf8). |
| kernels/geometry/instance_intersector.cpp | Instantiate packet intersectors for _M_ARM64. |
| kernels/geometry/grid_soa.h | Use explicit vec_float() for vint->vfloat conversions. |
| kernels/geometry/grid_soa.cpp | Use explicit vec_int() when converting clamped floats to ints. |
| kernels/geometry/curveNi_intersector.h | Use explicit .m128() when converting vfloat4 to Vec3fa. |
| kernels/geometry/curveNi_mb_intersector.h | Use explicit .m128() when converting vfloat4 to Vec3fa. |
| kernels/subdiv/tessellation.h | Use explicit .m256() for vint8->vfloat8 conversions. |
| kernels/subdiv/patch_eval_simd.h | Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing). |
| kernels/subdiv/patch_eval_grid.h | Use explicit vec_float() for ix/iy -> vfloatx conversions. |
| kernels/subdiv/feature_adaptive_eval_simd.h | Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing). |
| kernels/subdiv/feature_adaptive_eval_grid.h | Use explicit vec_float() for ix/iy -> vfloatx conversions. |
| kernels/subdiv/linear_bezier_patch.h | Use explicit .m128() when converting vfloat4 values into Vec2fa. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #if !defined(__ARM_NEON) || !defined(_M_ARM64) | ||
| assert(sse2::getISA() <= SSE2); |
There was a problem hiding this comment.
The preprocessor guard in State::verify() is using || where it should likely be &&. As written, the assert(sse2::getISA() <= SSE2) will still run whenever either __ARM_NEON or _M_ARM64 is not defined, which is true for many ARM build configurations and defeats the intent of skipping this assert on ARM/NEON builds. Consider changing the condition to only run the assert when neither __ARM_NEON nor _M_ARM64 are defined (i.e., both are absent).
| __forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); } | ||
| __forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); } |
There was a problem hiding this comment.
vboolf4::m128i() and vboolf4::m128d() currently use numeric conversions (_mm_cvtps_epi32 / _mm_cvtps_pd) rather than bit-casts. For mask types this changes the underlying bit pattern and breaks semantics (e.g., _mm_movedup_pd(v.m128d()) and any code expecting a bitwise reinterpretation). These accessors should preserve bits (casts), not convert values.
| __forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); } | |
| __forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); } | |
| __forceinline const __m128i m128i() const { return _mm_castps_si128(v); } | |
| __forceinline const __m128d m128d() const { return _mm_castps_pd(v); } |
| __forceinline __mmask8 packedMask8() const { return v; } | ||
|
|
||
| /* return packed 16 bits mask */ | ||
| __forceinline __mmask8 packedMask16() const { return (__mmask16)v; } |
There was a problem hiding this comment.
packedMask16() is declared to return __mmask8 but returns a __mmask16 value. This is inconsistent with the name and with its usage in AVX-512 mask intrinsics that operate on 16-bit masks. It should return __mmask16 (and only widen/narrow explicitly where needed) to avoid confusion and accidental truncation/overload issues.
| __forceinline __mmask8 packedMask16() const { return (__mmask16)v; } | |
| __forceinline __mmask16 packedMask16() const { return (__mmask16)v; } |
| __m256d _mm256_permute4x64_pd (__m256d a, const int imm8) | ||
| { | ||
| float64x2_t lo,hi; | ||
| lo[0] = _mm256_permute4x64_pd_select(a,imm8 >> 0); | ||
| lo[1] = _mm256_permute4x64_pd_select(a,imm8 >> 2); | ||
| hi[0] = _mm256_permute4x64_pd_select(a,imm8 >> 4); | ||
| hi[1] = _mm256_permute4x64_pd_select(a,imm8 >> 6); | ||
| lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 0), lo, 0); | ||
| lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 2), lo, 1); | ||
| hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 4), hi, 0); | ||
| hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 6), hi, 1); |
There was a problem hiding this comment.
_mm256_permute4x64_pd builds lo/hi via vsetq_lane_f64 but float64x2_t lo, hi; are left uninitialized before being passed to vsetq_lane_f64. These intrinsics typically read the input vector to produce the result, so using an uninitialized vector is undefined behavior. Initialize lo/hi first (e.g., to zero) or construct them in a way that doesn't depend on prior contents.
|
Thanks for your patch! This will take a little time to review. Don't mind the Copilot review, I ran it mainly to check for typos etc, it will not be a substitute for a full manual review. |
Summary
I come back to revive this old but great piece of work.
Reasoning
We are at Adobe in the process of porting our applications to Windows ARM platforms.
Embree being used in the 3DIVA division (Substance toolsuite notably), and MSVC being our main windows compiler, we would like to propose these changes.
Implementation Details
The PR comes with a few twists though...
I updated the sse2neon.h header to the latest state while taking care to back port the changes you added in it.
I wanted to make sure the tests suite (see the
Resultssection) passes before submitting this PR but encountered an issue which leads to a design adjustment in the way you use vectorized types in embree.The problem:
Types such as
vfloat<...>,vbool<...>,vint<...>, ... make use of__m128,__m128i,__m256, ... native types for sse/avx operations. These types are then reinterpreted when compiling for ARM platforms to other types likefloat32x4_t,int32x4_t, ...For GCC and Clang, it's perfect.
For MSVC this is not the end of the story: the
arm_neon.hdeployed by msvc does this:This is not an issue when used in a pure C context (can be argued but not the subject). But when used in a C++ context and when you rely on function overload to achieve some operations, well... you get for instance:
This is the first visible issue.
And then you have implicit cast from vectorized types to their inner types:
This is not a compile issue this time but logic issue as you never know how the types will be interpreted.
Possible native solution
After some feedbacks from users, the visual studio team added these options to their compiler
/Zc:arm64-aliased-neon-typesand/Zc:arm64-aliased-neon-types-This needs MSVC 2022 17.2 version (june 7th, 2022)
Types are too restrictive, are not builtin types but class types where conversion like int32x4 -> int64x2 are not trivially supported (So
__m256to__m128are not either)And still, with the unaliased types flag enabled there are types and functions needed in your implementation (e.g. poly128) which still lack compiler implementation:
https://developercommunity.visualstudio.com/t/ARM64-intrinsics-vmull_p64-and-vmull_h/10090361?ftype=problem&page=25
Replacing the functions used, and implementing type conversion could have been an option, but I chose what seemed like a safe and logic solution (quite verbose as well).
The suggested solution
I removed the implicit cast operators from your vectorized types and replaced them with explicit cast functions. (As simple as that)
So for the previous example:
I had to go through all the AVX/SSE instructions you use, and check their documentation so we do use the right type/conversion.
I did not use any AI agent to assist me so I am sure of the changes in conversions used.
Results
I could not run your CI environment locally but found my way through some of your tests. As a baseline, I configured the project with
-DEMBREE_TESTING_INSTALL_TESTS=ON, cmake installed the project, and used the testing suite in the deployed package as a reference.What I get is 100% passed tests on both Windows x64 and Windows ARM.