
Windows ARM support for MSVC Compiler#597

Open
dousse-adobe wants to merge 6 commits intoRenderKit:masterfrom
dousse-adobe:dousse/arm-forreal

Conversation

@dousse-adobe dousse-adobe commented Apr 23, 2026

Summary

This PR adds support for building embree for Windows ARM64 platforms, using MSVC as the compiler.
@anthony-linaro

I am coming back to revive this old but great piece of work.

Reasoning

At Adobe, we are in the process of porting our applications to Windows ARM platforms.
Embree is used in the 3DIVA division (notably in the Substance tool suite), and MSVC is our main Windows compiler, so we would like to propose these changes.

Implementation Details

The PR comes with a few twists though...
I updated the sse2neon.h header to the latest state, taking care to back-port the changes you had added to it.
I wanted to make sure the test suite (see the Results section) passed before submitting this PR, but I encountered an issue which led to a design adjustment in the way vectorized types are used in embree.

The problem:

Types such as vfloat<...>, vbool<...>, vint<...> use the native __m128, __m128i, __m256, ... types for SSE/AVX operations. When compiling for ARM platforms, these types are reinterpreted as other types such as float32x4_t, int32x4_t, ...
For GCC and Clang this works perfectly.
For MSVC, however, this is not the end of the story: the arm_neon.h shipped with MSVC does this:

typedef __n128   int32x4_t;
typedef __n128   int64x2_t;
typedef __n128   float32x4_t;
typedef __n128   float64x2_t;

This is not an issue in a pure C context (that could be argued, but it is not the subject here). But in a C++ context, where function overloading is used to implement some operations, you get for instance:

    // In vint4_sse2.h
    // __m128i becomes int32x4_t via sse2neon, and then __n128
    __forceinline vint(__m128i a) : v(a) {}

    // ERROR: Redeclaration of the constructor
    // __m128 becomes float32x4_t via sse2neon, and then __n128
    __forceinline explicit vint(__m128 a) : v(_mm_cvtps_epi32(a)) {}

This is the first visible issue.
And then you have implicit cast from vectorized types to their inner types:

// In vfloat4_sse.h
  template<>
  struct vfloat<4>
  {
  ...
    __forceinline operator const __m128&() const { return v; }
    __forceinline operator       __m128&()       { return v; }
  ...

This time it is not a compile error but a logic issue, as you never know how the types will be interpreted.

Possible native solution

After some feedback from users, the Visual Studio team added the /Zc:arm64-aliased-neon-types and /Zc:arm64-aliased-neon-types- options to their compiler.

This requires MSVC 2022 version 17.2 (June 7th, 2022).
The unaliased types are too restrictive: they are not builtin types but class types, and conversions like int32x4 -> int64x2 are not trivially supported (so neither is __m256 to __m128).

And even with the unaliased-types flag enabled, some types and functions needed by your implementation (e.g. poly128) still lack a compiler implementation:
https://developercommunity.visualstudio.com/t/ARM64-intrinsics-vmull_p64-and-vmull_h/10090361?ftype=problem&page=25

Replacing the functions used and implementing the type conversions could have been an option, but I chose what seemed like a safe and logical solution (quite verbose as well).

The suggested solution

I removed the implicit cast operators from your vectorized types and replaced them with explicit cast functions (as simple as that).
So for the previous example:

// In vfloat4_sse.h
  template<>
  struct vfloat<4>
  {
  ...
    // These were removed
    //__forceinline operator const __m128&() const { return v; }
    //__forceinline operator       __m128&()       { return v; }
    // These were added
    __forceinline const __m128& m128() const { return v; }
    __forceinline __m128&       m128()       { return v; }
    __forceinline __m128i       m128i() const      { return _mm_cvtps_epi32(v); }
  ...

I went through all the AVX/SSE instructions you use and checked their documentation, so that we use the right type/conversion in each case.
I did not use any AI agent to assist me, so I am confident in the conversion changes.

Results

I could not run your CI environment locally, but found my way through some of your tests. As a baseline, I configured the project with -DEMBREE_TESTING_INSTALL_TESTS=ON, installed the project with cmake, and used the testing suite in the deployed package as a reference.

What I get is 100% passed tests on both Windows x64 and Windows ARM.


Copilot AI left a comment


Pull request overview

Adds Windows ARM64 (MSVC) build support by tightening SIMD type conversions and extending ARM/NEON feature detection across the codebase.

Changes:

  • Detect Windows ARM64 in CMake/MSVC toolchain setup and enable ARM-specific build flags.
  • Replace implicit SIMD/native-type conversions with explicit accessors (m128(), m128i(), m256i(), vec_float(), vec_int(), etc.) throughout kernels, tutorials, and math types.
  • Extend ARM64 handling in intrinsics/emulation and architecture-specific traversal/intersector paths.

Reviewed changes

Copilot reviewed 79 out of 80 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
CMakeLists.txt Detect Windows ARM64 generator platform and enable EMBREE_ARM.
common/cmake/msvc.cmake Add ARM-specific MSVC defines/flags (incl. /Zc:preprocessor).
common/cmake/check_arm_neon.cpp Accept _M_ARM64 as NEON-capable for compile checks.
common/sys/thread.cpp Include ARM emulation header for _M_ARM64.
common/sys/sysinfo.h Treat _M_ARM64 similarly to __ARM_NEON in ISA selection.
common/sys/sysinfo.cpp Recognize _M_ARM64 for CPU name/type and feature bits.
common/sys/platform.h Define __64BIT__ for _M_ARM64.
common/sys/intrinsics.h Add _M_ARM64 conditions for bit-scan/popcnt and includes.
common/math/emath.h Add _M_ARM64 handling for scalar SIMD-assisted math helpers.
common/math/color.h Extend ARM64 reciprocal/rsqrt paths to _M_ARM64.
common/math/bbox.h Enable SIMD-specialized BBox<Vec3fa> empty() for _M_ARM64.
common/math/vec2.h Include SIMD helpers for _M_ARM64.
common/math/vec2fa.h Replace implicit SIMD conversions with explicit accessors for _M_ARM64.
common/math/vec3.h Include SIMD helpers for _M_ARM64 and adjust Vec3 ctor access.
common/math/vec3ba.h Rename SIMD storage member and add explicit accessors.
common/math/vec3ia.h Rename SIMD storage member and add explicit accessors/conversions.
common/math/vec4.h Include SIMD helpers for _M_ARM64 and adjust Vec4 ctor access.
common/math/linearspace3.h Use explicit m128() when converting SIMD vectors to Vec3fa.
common/simd/simd.h Include SSE wrapper on _M_ARM64.
common/simd/sse.h Enable blendv_ps helper on _M_ARM64.
common/simd/arm/emulation.h Adjust precise-div macros for _M_ARM64.
common/simd/arm/avx2neon.h MSVC compatibility, _M_ARM64 lane-access workarounds, and warning suppression.
common/simd/vuint8_avx.h Remove implicit casts; add explicit m256i() and update uses.
common/simd/vint8_avx.h Remove implicit casts; add explicit accessors and update uses.
common/simd/vllong4_avx2.h Remove implicit casts; add explicit m256i() and update uses/masks.
common/simd/vllong8_avx512.h Remove implicit casts; add explicit m512i() and update uses/masks.
common/simd/vdouble4_avx.h Remove implicit cast; use m256d() in operations/comparisons/select.
common/simd/vboolf8_avx.h Remove implicit conversions; add explicit m256()/m256d() and update ops.
common/simd/vboolf8_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboolf4_sse2.h Remove implicit conversions; add explicit accessors and update ops/shuffles.
common/simd/vboolf4_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboolf16_avx512.h Remove implicit __mmask16 conversion; add packed mask getters and update ops.
common/simd/vboold4_avx.h Remove implicit conversions; add explicit accessors and update ops.
common/simd/vboold4_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
common/simd/vboold8_avx512.h Remove implicit __mmask8 conversion; add packed mask getters and update ops.
tutorials/common/scenegraph/scenegraph.cpp Use explicit .m128() when constructing scalar vector types; adjust comparison logic.
tutorials/common/math/random_sampler.h Use explicit .m128() when converting shifted int vector to Vec3fa.
kernels/common/state.cpp Treat _M_ARM64 as ARM/NEON in ISA support checks and debug asserts.
kernels/common/isa.h Enable SIMD4 target definition for _M_ARM64.
kernels/common/default.h Replace implicit float->int vector conversion with explicit vec_int().
kernels/common/buffer.h Use explicit .m128() when converting loaded vfloat4 to Vec3fa.
kernels/common/acceln.cpp Include _M_ARM64 in packet-validity early-out checks.
kernels/common/accel.h Enable 4-wide intersector wrappers under _M_ARM64.
kernels/bvh/node_intersector_packet.h Treat _M_ARM64 like __aarch64__ in FMA-related traversal math.
kernels/bvh/node_intersector_frustum.h Treat _M_ARM64 like __aarch64__ in frustum traversal math.
kernels/bvh/node_intersector1.h Treat _M_ARM64 like __aarch64__ in FMA-related traversal math and conditions.
kernels/bvh/bvh_traverser1.h Replace implicit conversions with explicit mask/int accessors for AVX512VL path.
kernels/bvh/bvh_statistics.cpp Include _M_ARM64 in BVH4 statistics instantiation conditions.
kernels/bvh/bvh_node_qaabb.h Replace implicit int/float vector conversions with explicit vec_int()/vec_float().
kernels/bvh/bvh_builder_morton.cpp Use explicit .m128() conversions for BBox3fx construction.
kernels/bvh/bvh.cpp Include _M_ARM64 in BVH4 instantiation conditions.
kernels/builders/primrefgen_presplit.h Use explicit .m128i() when converting select(...) result to Vec3ia.
kernels/builders/heuristic_binning.h Use explicit .m128i() for clamp/floor-to-int conversions.
kernels/builders/bvh_builder_morton.h Use explicit .m128i() for centroid quantization to vint4.
kernels/geometry/subgrid_intersector_pluecker.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/subgrid_intersector_moeller.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/linei.h Fix mask type for select() on 8-wide vectors (vboolf8).
kernels/geometry/instance_intersector.cpp Instantiate packet intersectors for _M_ARM64.
kernels/geometry/grid_soa.h Use explicit vec_float() for vint->vfloat conversions.
kernels/geometry/grid_soa.cpp Use explicit vec_int() when converting clamped floats to ints.
kernels/geometry/curveNi_intersector.h Use explicit .m128() when converting vfloat4 to Vec3fa.
kernels/geometry/curveNi_mb_intersector.h Use explicit .m128() when converting vfloat4 to Vec3fa.
kernels/subdiv/tessellation.h Use explicit .m256() for vint8->vfloat8 conversions.
kernels/subdiv/patch_eval_simd.h Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing).
kernels/subdiv/patch_eval_grid.h Use explicit vec_float() for ix/iy -> vfloatx conversions.
kernels/subdiv/feature_adaptive_eval_simd.h Use explicit .m128i() for float->int vector conversion (MSVC ARM aliasing).
kernels/subdiv/feature_adaptive_eval_grid.h Use explicit vec_float() for ix/iy -> vfloatx conversions.
kernels/subdiv/linear_bezier_patch.h Use explicit .m128() when converting vfloat4 values into Vec2fa.


Comment thread kernels/common/state.cpp
Comment on lines +178 to 179
#if !defined(__ARM_NEON) || !defined(_M_ARM64)
assert(sse2::getISA() <= SSE2);

Copilot AI Apr 23, 2026


The preprocessor guard in State::verify() is using || where it should likely be &&. As written, the assert(sse2::getISA() <= SSE2) will still run whenever either __ARM_NEON or _M_ARM64 is not defined, which is true for many ARM build configurations and defeats the intent of skipping this assert on ARM/NEON builds. Consider changing the condition to only run the assert when neither __ARM_NEON nor _M_ARM64 are defined (i.e., both are absent).

Comment on lines +41 to +42
__forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); }
__forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); }

Copilot AI Apr 23, 2026


vboolf4::m128i() and vboolf4::m128d() currently use numeric conversions (_mm_cvtps_epi32 / _mm_cvtps_pd) rather than bit-casts. For mask types this changes the underlying bit pattern and breaks semantics (e.g., _mm_movedup_pd(v.m128d()) and any code expecting a bitwise reinterpretation). These accessors should preserve bits (casts), not convert values.

Suggested change
__forceinline const __m128i m128i() const { return _mm_cvtps_epi32(v); }
__forceinline const __m128d m128d() const { return _mm_cvtps_pd(v); }
__forceinline const __m128i m128i() const { return _mm_castps_si128(v); }
__forceinline const __m128d m128d() const { return _mm_castps_pd(v); }

__forceinline __mmask8 packedMask8() const { return v; }

/* return packed 16 bits mask */
__forceinline __mmask8 packedMask16() const { return (__mmask16)v; }

Copilot AI Apr 23, 2026


packedMask16() is declared to return __mmask8 but returns a __mmask16 value. This is inconsistent with the name and with its usage in AVX-512 mask intrinsics that operate on 16-bit masks. It should return __mmask16 (and only widen/narrow explicitly where needed) to avoid confusion and accidental truncation/overload issues.

Suggested change
__forceinline __mmask8 packedMask16() const { return (__mmask16)v; }
__forceinline __mmask16 packedMask16() const { return (__mmask16)v; }

Comment on lines 880 to +886
__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)
{
float64x2_t lo,hi;
lo[0] = _mm256_permute4x64_pd_select(a,imm8 >> 0);
lo[1] = _mm256_permute4x64_pd_select(a,imm8 >> 2);
hi[0] = _mm256_permute4x64_pd_select(a,imm8 >> 4);
hi[1] = _mm256_permute4x64_pd_select(a,imm8 >> 6);
lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 0), lo, 0);
lo = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 2), lo, 1);
hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 4), hi, 0);
hi = vsetq_lane_f64(_mm256_permute4x64_pd_select(a,imm8 >> 6), hi, 1);

Copilot AI Apr 23, 2026


_mm256_permute4x64_pd builds lo/hi via vsetq_lane_f64 but float64x2_t lo, hi; are left uninitialized before being passed to vsetq_lane_f64. These intrinsics typically read the input vector to produce the result, so using an uninitialized vector is undefined behavior. Initialize lo/hi first (e.g., to zero) or construct them in a way that doesn't depend on prior contents.

@stefanatwork (Collaborator)

Thanks for your patch! This will take a little time to review. Don't mind the Copilot review — I ran it mainly to check for typos etc.; it will not be a substitute for a full manual review.
