
Windows ARM support for MSVC Compiler#597

Open
dousse-adobe wants to merge 6 commits intoRenderKit:masterfrom
dousse-adobe:dousse/arm-forreal

Conversation

@dousse-adobe dousse-adobe commented Apr 23, 2026

Summary

This PR adds support for building embree for Windows ARM64 platforms, using MSVC as the compiler.
@anthony-linaro

I am coming back to revive this old but great piece of work.

Reasoning

At Adobe, we are in the process of porting our applications to Windows ARM platforms.
Embree is used in the 3DIVA division (notably in the Substance tool suite), and MSVC is our main Windows compiler, so we would like to propose these changes.

Implementation Details

The PR comes with a few twists though...
I updated the sse2neon.h header to the latest state, taking care to back-port the changes you had added to it.
I wanted to make sure the test suite (see the Results section) passed before submitting this PR, but I encountered an issue which led to a design adjustment in the way vectorized types are used in embree.

The problem:

Types such as vfloat<...>, vbool<...>, vint<...> use the native __m128, __m128i, __m256, ... types for SSE/AVX operations. When compiling for ARM platforms, these types are reinterpreted as other types such as float32x4_t, int32x4_t, ...
For GCC and Clang this works perfectly.
For MSVC, however, this is not the end of the story: the arm_neon.h shipped with MSVC does this:

typedef __n128   int32x4_t;
typedef __n128   int64x2_t;
typedef __n128   float32x4_t;
typedef __n128   float64x2_t;

This is not an issue in a pure C context (that could be argued, but it is not the subject here). But in a C++ context, where function overloading is used to implement some operations, you get for instance:

    // In vint4_sse2.h
    // __m128i becomes int32x4_t via sse2neon, and then __n128
    __forceinline vint(__m128i a) : v(a) {}

    // ERROR: Redeclaration of the constructor
    // __m128 becomes float32x4_t via sse2neon, and then __n128
    __forceinline explicit vint(__m128 a) : v(_mm_cvtps_epi32(a)) {}

This is the first visible issue.
And then you have implicit cast from vectorized types to their inner types:

// In vfloat4_sse.h
  template<>
  struct vfloat<4>
  {
  ...
    __forceinline operator const __m128&() const { return v; }
    __forceinline operator       __m128&()       { return v; }
  ...

This time it is not a compile error but a logic issue, as you never know how the types will be interpreted.

Possible native solution

After some feedback from users, the Visual Studio team added the /Zc:arm64-aliased-neon-types and /Zc:arm64-aliased-neon-types- options to their compiler.

This requires MSVC 2022 version 17.2 (June 7th, 2022).
The unaliased types are too restrictive: they are not builtin types but class types, and conversions like int32x4 -> int64x2 are not trivially supported (so neither is __m256 to __m128).

And even with the unaliased-types flag enabled, some types and functions needed by your implementation (e.g. poly128) still lack a compiler implementation:
https://developercommunity.visualstudio.com/t/ARM64-intrinsics-vmull_p64-and-vmull_h/10090361?ftype=problem&page=25

Replacing the functions used and implementing the type conversions could have been an option, but I chose what seemed like a safe and logical solution (quite verbose as well).

The suggested solution

I removed the implicit cast operators from your vectorized types and replaced them with explicit cast functions (as simple as that).
So for the previous example:

// In vfloat4_sse.h
  template<>
  struct vfloat<4>
  {
  ...
    // These were removed
    //__forceinline operator const __m128&() const { return v; }
    //__forceinline operator       __m128&()       { return v; }
    // These were added
    __forceinline const __m128& m128() const { return v; }
    __forceinline __m128&       m128()       { return v; }
    __forceinline __m128i       m128i() const      { return _mm_cvtps_epi32(v); }
  ...

I went through all the AVX/SSE instructions you use and checked their documentation, so that we use the right type/conversion in each case.
I did not use any AI agent to assist me, so I am confident in the conversion changes.

Results

I could not run your CI environment locally, but found my way through some of your tests. As a baseline, I configured the project with -DEMBREE_TESTING_INSTALL_TESTS=ON, installed the project with cmake, and used the testing suite in the deployed package as a reference.

What I get is 100% passed tests on both Windows x64 and Windows ARM.


Copilot AI left a comment


Pull request overview

Adds Windows ARM64 (MSVC) build support by tightening SIMD type conversions and extending ARM/NEON feature detection across the codebase.

Changes:

  • Detect Windows ARM64 in CMake/MSVC toolchain setup and enable ARM-specific build flags.
  • Replace implicit SIMD/native-type conversions with explicit accessors (m128(), m128i(), m256i(), vec_float(), vec_int(), etc.) throughout kernels, tutorials, and math types.
  • Extend ARM64 handling in intrinsics/emulation and architecture-specific traversal/intersector paths.

Reviewed changes

Copilot reviewed 79 out of 80 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
CMakeLists.txt Detect Windows ARM64 generator platform and enable EMBREE_ARM.
common/cmake/msvc.cmake Add ARM-specific MSVC defines/flags (incl. /Zc:preprocessor).
common/cmake/check_arm_neon.cpp Accept _M_ARM64 as NEON-capable for compile checks.
common/sys/thread.cpp Include ARM emulation header for _M_ARM64.
common/sys/sysinfo.h Treat _M_ARM64 similarly to __ARM_NEON in ISA selection.
common/sys/sysinfo.cpp Recognize _M_ARM64 for CPU name/type and feature bits.
common/sys/platform.h Define __64BIT__ for _M_ARM64.
common/sys/intrinsics.h Add _M_ARM64 conditions for bit-scan/popcnt and includes.
common/math/emath.h Add _M_ARM64 handling for scalar SIMD-assisted math helpers.
common/math/color.h Extend ARM64 reciprocal/rsqrt paths to _M_ARM64.
common/math/bbox.h Enable SIMD-specialized BBox<Vec3fa> empty() for _M_ARM64.
common/math/vec2.h Include SIMD helpers for _M_ARM64.
common/math/vec2fa.h Replace implicit SIMD conversions with explicit accessors for _M_ARM64.
common/math/vec3.h Include SIMD helpers for _M_ARM64 and adjust Vec3 ctor access.
common/math/vec3ba.h Rename SIMD storage member and add explicit accessors.
common/math/vec3ia.h Rename SIMD storage member and add explicit accessors/conversions.
common/math/vec4.h Include SIMD helpers for _M_ARM64 and adjust Vec4 ctor access.
common/math/linearspace3.h Use explicit m128() when converting SIMD vectors to Vec3fa.
common/simd/simd.h Include SSE wrapper on _M_ARM64.
common/simd/sse.h Enable blendv_ps helper on _M_ARM64.
common/simd/arm/emulation.h Adjust precise-div macros for _M_ARM64.
common/simd/arm/avx2neon.h MSVC compatibility, _M_ARM64 lane-access workarounds, and warning suppression.
common/simd/vuint8_avx.h Remove implicit casts; add explicit m256i() and update uses.
common/simd/vint8_avx.h Remove implicit casts; add explicit accessors and update uses.
common/simd/vllong4_avx2.h Remove implicit casts; add explicit m256i() and update uses/masks.
common/simd/vllong8_avx512.h Remove implicit casts; add explicit m512i() and update uses/masks.
common/simd/vdouble4_avx.h Remove implicit cast; use m256d() in operations/comparisons/select.
common/simd/vboolf8_avx.h Remove implicit conversions; add explicit m256()/m256d() and update ops.
common/simd/vboolf8_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboolf4_sse2.h Remove implicit conversions; add explicit accessors and update ops/shuffles.
common/simd/vboolf4_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboolf16_avx512.h Remove implicit __mmask16 conversion; add packed mask getters and update ops.
common/simd/vboold4_avx.h Remove implicit conversions; add explicit accessors and update ops.
common/simd/vboold4_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboold8_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
tutorials/common/scenegraph/scenegraph.cpp Use explicit .m128() when constructing scalar vector types; adjust comparison logic.
tutorials/common/math/random_sampler.h Use explicit .m128() when converting shifted int vector to Vec3fa.
kernels/common/state.cpp Treat _M_ARM64 as ARM/NEON in ISA support checks and debug asserts.
kernels/common/isa.h Enable SIMD4 target definition for _M_ARM64.
kernels/common/default.h Replace implicit float->int vector conversion with explicit vec_int().
kernels/common/buffer.h Use explicit .m128() when converting loaded vfloat4 to Vec3fa.
kernels/common/acceln.cpp Include _M_ARM64 in packet-validity early-out checks.
kernels/common/accel.h Enable 4-wide intersector wrappers under _M_ARM64.
kernels/bvh/node_intersector_packet.h Treat _M_ARM64 like __aarch64__ in FMA-related traversal math.
kernels/bvh/node_intersector_frustum.h Treat _M_ARM64 like __aarch64__ in frustum traversal math.
kernels/bvh/node_intersector1.h Treat _M_ARM64 like __aarch64__ in FMA-related traversal math and conditions.
kernels/bvh/bvh_traverser1.h Replace implicit conversions with explicit mask/int accessors for AVX512VL path.
kernels/bvh/bvh_statistics.cpp Include _M_ARM64 in BVH4 statistics instantiation conditions.
kernels/bvh/bvh_node_qaabb.h Replace implicit int/float vector conversions with explicit vec_int()/vec_float().
kernels/bvh/bvh_builder_morton.cpp Use explicit .m128() conversions for BBox3fx construction.
kernels/bvh/bvh.cpp Include _M_ARM64 in BVH4 instantiation conditions.
kernels/builders/primrefgen_presplit.h Use explicit .m128i() when converting select(...) result to Vec3ia.
kernels/builders/heuristic_binning.h Use explicit .m128i() for clamp/floor-to-int conversions.
kernels/builders/bvh_builder_morton.h Use explicit .m128i() for centroid quantization to vint4.
kernels/geometry/subgrid_intersector_pluecker.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/subgrid_intersector_moeller.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/linei.h Fix mask type for select() on 8-wide vectors (vboolf8).
kernels/geometry/instance_intersector.cpp Instantiate packet intersectors for _M_ARM64.
kernels/geometry/grid_soa.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/grid_soa.cpp Use explicit vec_int() when converting clamped floats to ints.
kernels/geometry/curveNi_intersector.h Use explicit .m128() when converting vfloat4 to Vec3fa.
kernels/geometry/curveNi_mb_intersector.h Use explicit .m128() when converting vfloat4 to Vec3fa.
kernels/subdiv/tessellation.h Use explicit .m256() for vint8->vfloat8 conversions.
kernels/subdiv/patch_eval_simd.h Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing).
kernels/subdiv/patch_eval_grid.h Use explicit vec_float() for ix/iy -> vfloatx conversions.
kernels/subdiv/feature_adaptive_eval_simd.h Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing).
kernels/subdiv/feature_adaptive_eval_grid.h Use explicit vec_float() for ix/iy -> vfloatx conversions.
kernels/subdiv/linear_bezier_patch.h Use explicit .m128() when converting vfloat4 values into Vec2fa.


Comment thread kernels/common/state.cpp
Comment on lines +178 to 179
#if !defined(__ARM_NEON) || !defined(_M_ARM64)
assert(sse2::getISA() <= SSE2);

Copilot AI Apr 23, 2026


The preprocessor guard in State::verify() is using || where it should likely be &&. As written, the assert(sse2::getISA() <= SSE2) will still run whenever either __ARM_NEON or _M_ARM64 is not defined, which is true for many ARM build configurations and defeats the intent of skipping this assert on ARM/NEON builds. Consider changing the condition to only run the assert when neither __ARM_NEON nor _M_ARM64 are defined (i.e., both are absent).

Comment on lines +41 to +42
__forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); }
__forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); }

Copilot AI Apr 23, 2026


vboolf4::m128i() and vboolf4::m128d() currently use numeric conversions (_mm_cvtps_epi32 / _mm_cvtps_pd) rather than bit-casts. For mask types this changes the underlying bit pattern and breaks semantics (e.g., _mm_movedup_pd(v.m128d()) and any code expecting a bitwise reinterpretation). These accessors should preserve bits (casts), not convert values.

Suggested change
__forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); }
__forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); }
__forceinline const __m128i m128i() const { return _mm_castps_si128(v); }
__forceinline const __m128d m128d() const { return _mm_castps_pd(v); }

__forceinline __mmask8 packedMask8() const { return v; }

/* return packed 16 bits mask */
__forceinline __mmask8 packedMask16() const { return (__mmask16)v; }

Copilot AI Apr 23, 2026


packedMask16() is declared to return __mmask8 but returns a __mmask16 value. This is inconsistent with the name and with its usage in AVX-512 mask intrinsics that operate on 16-bit masks. It should return __mmask16 (and only widen/narrow explicitly where needed) to avoid confusion and accidental truncation/overload issues.

Suggested change
__forceinline __mmask8 packedMask16() const { return (__mmask16)v; }
__forceinline __mmask16 packedMask16() const { return (__mmask16)v; }

Comment on lines 880 to +886
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
{
float64x2_t lo,hi;
lo[0] = _mm256_permute4x64_pd_select(a,imm8 >> 0);
lo[1] = _mm256_permute4x64_pd_select(a,imm8 >> 2);
hi[0] = _mm256_permute4x64_pd_select(a,imm8 >> 4);
hi[1] = _mm256_permute4x64_pd_select(a,imm8 >> 6);
lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 0), lo, 0);
lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 2), lo, 1);
hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 4), hi, 0);
hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 6), hi, 1);

Copilot AI Apr 23, 2026


_mm256_permute4x64_pd builds lo/hi via vsetq_lane_f64 but float64x2_t lo, hi; are left uninitialized before being passed to vsetq_lane_f64. These intrinsics typically read the input vector to produce the result, so using an uninitialized vector is undefined behavior. Initialize lo/hi first (e.g., to zero) or construct them in a way that doesn't depend on prior contents.

@stefanatwork (Collaborator)

Thanks for your patch! This will take a little time to review. Don't mind the Copilot review — I ran it mainly to check for typos etc.; it will not be a substitute for a full manual review.
