
host group mutex #23

Open · wants to merge 3 commits into master
Conversation

PhilipDeegan (Member) commented Oct 12, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new synchronization mechanisms for improved GPU multi-launch functionality.
    • Added methods for managing host functions with mutexes and calculating group indices.
    • Enhanced testing capabilities with new functions for threaded group operations.
  • Bug Fixes

    • Enhanced error handling for group size validation.
  • Refactor

    • Improved structure and inheritance of existing classes related to stream functions.
    • Reformatted macro definitions for better readability.

coderabbitai bot commented Oct 12, 2024

Walkthrough

The changes primarily involve modifications to the multi_launch.hpp file in the mkn::gpu namespace, enhancing the GPU multi-launch framework. Key updates include adjustments to the StreamFunction constructor, the introduction of new structs like StreamGroupFunction and StreamHostGroupMutexFunction, and the addition of new methods for group index calculation and mutex management. A corresponding test function has been added to async_streaming.cpp to validate the new functionality, ensuring synchronized operations across host and device threads.

Changes

File: inc/mkn/gpu/multi_launch.hpp
  • Updated the StreamFunction constructor to accept StreamFunctionMode const mode_.
  • Added a group_idx_modulo function for group index calculation.
  • Introduced the StreamGroupFunction struct, inheriting from StreamFunction.
  • Refactored StreamGroupBarrierFunction to inherit from StreamGroupFunction.
  • Added the StreamHostGroupMutexFunction struct for mutex management.
  • Updated StreamLauncher with host_group_mutex and host_group_idx methods for mutex-based host functions.
  • Refactored the run methods in StreamHostGroupMutexFunction and StreamGroupBarrierFunction.
  • Modified error handling for group size validation.

File: test/any/async_streaming.cpp
  • Introduced the test_threaded_host_group_mutex function for testing host group mutex functionality.
  • Introduced the test_threaded_host_group_idx function for testing group index functionality.
  • Updated the main function to include the new test functions.

File: inc/mkn/gpu/rocm.hpp
  • Reformatted the MKN_GPU_ASSERT macro for improved readability.
  • Introduced a bytes variable in the alloc_managed function for size calculation.

Possibly related PRs

  • threaded stream launcher group barrier sync points #22: The changes in this PR introduce the StreamGroupBarrierFunction, which is directly related to the new synchronization mechanisms introduced in the main PR, particularly in how barriers are managed within the ThreadedStreamLauncher.

🐰 In the garden where we play,
New functions hop and dance today.
With mutex locks and groups so fine,
Our streams align, a perfect line!
Synchronized, we leap and bound,
In this code, new joys are found! 🌼


coderabbitai bot left a comment

Caution

Inline review comments failed to post

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
test/any/async_streaming.cpp (1)

142-142: Consider adding more comprehensive result verification

The current check if (vals != std::vector<size_t>{3, 12}) return 1; verifies the final state of vals. While this is a good basic check, consider adding more detailed assertions or error messages to provide better feedback in case of failure.

You could enhance the verification like this:

if (vals != std::vector<size_t>{3, 12}) {
    std::cerr << "Unexpected vals: ";
    for (const auto& val : vals) {
        std::cerr << val << " ";
    }
    std::cerr << std::endl;
    return 1;
}
inc/mkn/gpu/multi_launch.hpp (5)

53-53: Avoid unnecessary const qualifier on pass-by-value parameter

The mode_ parameter is passed by value in the constructor. Adding const to a pass-by-value parameter has no effect since it doesn't prevent the caller or callee from modifying the local copy. Consider removing the redundant const qualifier.

Apply this diff to remove the unnecessary const:

-      StreamFunction(Strat& strat_, StreamFunctionMode const mode_) : strat{strat_}, mode{mode_} {}
+      StreamFunction(Strat& strat_, StreamFunctionMode mode_) : strat{strat_}, mode{mode_} {}

61-63: Prefer passing fundamental types by value instead of by const reference

For fundamental types like std::size_t, passing by value is more efficient than passing by const reference due to potential overhead associated with references. Consider changing the function parameters to pass by value.

Apply this diff to update the parameter passing:

-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+std::size_t group_idx_modulo(std::size_t gs, std::size_t i) {

68-69: Prefer passing fundamental types by value instead of by const reference

In the constructor of StreamGroupFunction, gs is a std::size_t, a fundamental type. Passing it by value is generally more efficient than passing by const reference. Similarly, consider removing the unnecessary const from the mode_ parameter.

Apply this diff:

-      StreamGroupFunction(std::size_t const& gs, Strat& strat_, StreamFunctionMode const mode_)
+      StreamGroupFunction(std::size_t gs, Strat& strat_, StreamFunctionMode mode_)

281-283: Prefer passing fundamental types by value instead of by const reference

In the constructor of StreamHostGroupMutexFunction, gs is a std::size_t. Passing fundamental types by value is more efficient than passing by const reference. Consider changing the parameter to pass by value.

Apply this diff:

-      StreamHostGroupMutexFunction(std::size_t const& gs, Strat& strat, Fn&& fn_)
+      StreamHostGroupMutexFunction(std::size_t gs, Strat& strat, Fn&& fn_)

342-344: Prefer passing fundamental types by value instead of by const reference

In the host_group_mutex method, group_size is a std::size_t. Passing fundamental types by value is more efficient than passing by const reference. Consider changing the parameter to pass by value.

Apply this diff:

-      This& host_group_mutex(std::size_t const& group_size, Fn&& fn) {
+      This& host_group_mutex(std::size_t group_size, Fn&& fn) {
🛑 Comments failed to post (1)
inc/mkn/gpu/multi_launch.hpp (1)

289-296: 🛠️ Refactor suggestion

Ensure mutex is properly acquired before using

In the run method, the unique_lock is constructed with std::defer_lock and try_lock is used. If the lock is not acquired, fn(i) is not called, and strat.status[i] is set to SFS::FIRST to retry. This logic may lead to busy waiting and potential performance issues if the mutex is heavily contended. Consider implementing a back-off strategy or using condition variables to avoid tight loops.

Consider refactoring the code to include a back-off mechanism:

      void run(std::uint32_t const i) override {
        auto const gidx = Super::group_idx(i);

        std::unique_lock<std::mutex> lock(mutices[gidx], std::defer_lock);

        if (lock.try_lock()) {
          fn(i);
          strat.status[i] = SFS::WAIT;  // done
        } else {
+         // Introduce a small sleep to prevent a tight loop in case of contention
+         std::this_thread::sleep_for(std::chrono::milliseconds(1));
          strat.status[i] = SFS::FIRST;  // retry
        }
      }

Committable suggestion was skipped due to low confidence.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/multi_launch.hpp (4)

61-63: LGTM: New group_idx_modulo function added

The group_idx_modulo function is a useful addition for group-based operations. Its implementation is correct and improves code readability by encapsulating the group index calculation.

Consider marking this function as constexpr and inline to allow for compile-time evaluation and potential performance improvements:

-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+constexpr inline std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
   return ((i - (i % gs)) / gs);
 }

65-75: LGTM: New StreamGroupFunction struct added

The StreamGroupFunction struct is a well-designed abstraction for group-based stream functions. It properly inherits from StreamFunction and provides useful group-related functionality.

Consider marking the group_idx method as constexpr and noexcept for potential compile-time optimizations and to indicate it doesn't throw exceptions:

-  std::size_t group_idx(std::size_t const& i) const { return group_idx_modulo(group_size, i); }
+  constexpr std::size_t group_idx(std::size_t const& i) const noexcept { return group_idx_modulo(group_size, i); }

267-298: LGTM: New StreamHostGroupMutexFunction added for group-based mutex synchronization

The StreamHostGroupMutexFunction is a valuable addition for synchronizing host-side operations within groups. Its implementation using std::mutex is appropriate for host-side synchronization.

Consider using std::shared_mutex instead of std::mutex if multiple readers are expected to access shared resources simultaneously. This could potentially improve performance in scenarios where read operations are more frequent than write operations:

-  std::vector<std::mutex> mutices;
+  std::vector<std::shared_mutex> mutices;

Also, to prevent potential busy-waiting, you might want to add a small sleep or yield when the lock isn't acquired:

   if (lock.try_lock()) {
     fn(i);
     strat.status[i] = SFS::WAIT;  // done
   } else {
     strat.status[i] = SFS::FIRST;  // retry
+    std::this_thread::yield();  // or std::this_thread::sleep_for(std::chrono::milliseconds(1));
   }

Line range hint 1-461: Overall: Excellent additions to enhance group-based operations and synchronization

The changes introduce well-designed group-based functionality and mutex synchronization to the GPU multi-launch framework. The new features are cohesively integrated with the existing codebase and provide valuable enhancements for parallel processing scenarios.

Consider adding more inline documentation or comments to explain the usage and benefits of the new group-based features and mutex synchronization. This would help users understand when and how to best utilize these new capabilities in their GPU multi-launch operations.

test/any/async_streaming.cpp (1)

135-137: Remove unnecessary mutable keyword in lambda

The lambda function passed to .host() does not modify any captured variables, yet it's marked as mutable. Unless there's a specific reason, consider removing the mutable keyword for clarity.

Apply this diff to remove mutable:

-.host([&](auto i) mutable {
+.host([&](auto i) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7938041 and 0707fb0.

📒 Files selected for processing (2)
  • inc/mkn/gpu/multi_launch.hpp (5 hunks)
  • test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (3)

53-53: LGTM: Improved parameter passing in StreamFunction constructor

The change to pass mode_ as a constant reference is a good practice. It potentially improves performance by avoiding unnecessary copies without changing the functionality.


Line range hint 222-265: LGTM: StreamGroupBarrierFunction updated to use new group-based abstraction

The changes to StreamGroupBarrierFunction properly integrate it with the new StreamGroupFunction base class. This improves code consistency and leverages the new group functionality.

The run method now correctly uses the group_idx method from the base class, which is a good improvement in code reuse and consistency.


339-344: LGTM: New host_group_mutex method added to ThreadedStreamLauncher

The host_group_mutex method is a well-implemented addition to the ThreadedStreamLauncher class. It provides a convenient way to add group-based, mutex-synchronized host functions to the launcher.

The method correctly creates and adds a new StreamHostGroupMutexFunction to the fns vector, properly forwarding the provided function and group size.

test/any/async_streaming.cpp (2)

119-152: Function test_threaded_host_group_mutex implementation looks correct

The implementation of the test_threaded_host_group_mutex function appears to correctly utilize the ThreadedStreamLauncher with host_group_mutex to synchronize host operations across groups. The logic aligns with the intended functionality.


139-139: ⚠️ Potential issue

Undefined function group_idx_modulo

The function group_idx_modulo used here is not defined within the provided code or the included headers. This will lead to a compilation error due to the undefined reference.

Run the following script to check if group_idx_modulo is defined elsewhere in the codebase:

Ensure that group_idx_modulo is properly defined or included from the appropriate header file.

Comment on lines 156 to 157
return test() + test_threaded() + test_threaded(6) + test_threaded_group_barrier() +
test_threaded_host_group_mutex();

🛠️ Refactor suggestion

Check the cumulative return value in main

The main function returns the sum of the test function results. Since each test function returns 0 on success and 1 on failure, the cumulative return value could exceed 1, making it harder to interpret the specific failure. Consider handling test results individually or using bitwise operations to identify specific test failures.

For clearer result interpretation, you could modify main as follows:

-return test() + test_threaded() + test_threaded(6) + test_threaded_group_barrier() +
-       test_threaded_host_group_mutex();
+int result = 0;
+result |= test();
+result |= test_threaded() << 1;
+result |= test_threaded(6) << 2;
+result |= test_threaded_group_barrier() << 3;
+result |= test_threaded_host_group_mutex() << 4;
+return result;

This way, each test's result occupies a unique bit in the return value.


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
inc/mkn/gpu/multi_launch.hpp (3)

61-63: New group_idx_modulo function looks good, but consider adding a safety check

The new group_idx_modulo function is a good addition for calculating group indices. However, to prevent potential undefined behavior, consider adding a check for division by zero.

Consider adding a check for gs != 0 at the beginning of the function:

 std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+  if (gs == 0) {
+    throw std::invalid_argument("Group size cannot be zero");
+  }
   return ((i - (i % gs)) / gs);
 }

267-298: Well-implemented StreamHostGroupMutexFunction with a suggestion for improvement

The new StreamHostGroupMutexFunction struct is a good addition that provides mutex-based synchronization for host group functions. The implementation is consistent with the existing code style and error handling.

The run method's try-lock mechanism prevents deadlocks, which is good. However, it might lead to busy waiting in high-contention scenarios.

Consider implementing a backoff strategy or using a condition variable to reduce CPU usage in case of high contention. For example:

// Note: backoff_time and max_backoff_time would be new member variables, e.g.
//   std::chrono::milliseconds backoff_time{1};
//   static constexpr std::chrono::milliseconds max_backoff_time{64};
void run(std::uint32_t const i) override {
  std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

  if (lock.try_lock()) {
    fn(i);
    backoff_time = std::chrono::milliseconds{1};  // reset on success
    strat.status[i] = SFS::WAIT;  // done
  } else {
    // Exponential backoff to avoid a tight retry loop
    std::this_thread::sleep_for(backoff_time);
    backoff_time = std::min(backoff_time * 2, max_backoff_time);
    strat.status[i] = SFS::FIRST;  // retry
  }
}

This approach would reduce CPU usage while still maintaining the non-blocking nature of the current implementation.


Line range hint 385-409: Improved efficiency in get_work, but potential race condition introduced

The changes to the get_work method improve efficiency by allowing work to resume from where it left off in previous calls. This is a good optimization for scenarios with many work items.

However, the introduction of the shared work_i variable, which is modified outside the lock, could lead to race conditions in a multi-threaded environment.

To address the potential race condition, consider one of the following approaches:

  1. Move the work_i = 0 reset inside the lock:
 std::pair<SFP, std::size_t> get_work() {
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
+  work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
-  work_i = 0;
  1. Use a thread-local variable instead of a shared member variable:
 std::pair<SFP, std::size_t> get_work() {
+  static thread_local std::size_t work_i = 0;
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
   work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
-  std::size_t work_i = 0;

Either of these approaches would maintain the efficiency improvement while avoiding potential race conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 0707fb0 and cd06ef9.

📒 Files selected for processing (1)
  • inc/mkn/gpu/multi_launch.hpp (8 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)

53-53: Improved efficiency in StreamFunction constructor

The change to pass mode_ as a const reference is a good optimization. It prevents unnecessary copying and ensures that the mode_ parameter cannot be accidentally modified within the constructor.


65-75: Well-designed StreamGroupFunction struct

The new StreamGroupFunction struct is a good addition that provides a clear abstraction for group-based stream functions. It properly inherits from StreamFunction and introduces the necessary group-related functionality.

The group_idx method effectively utilizes the group_idx_modulo function, maintaining consistency in the codebase.


Line range hint 222-265: Improved StreamGroupBarrierFunction with better inheritance

The changes to StreamGroupBarrierFunction are well-implemented:

  1. Inheriting from StreamGroupFunction improves code organization and reuse.
  2. The constructor properly initializes the base class with the group size.
  3. The run method now uses the group_idx method from the base class, which is more consistent.

These modifications enhance the overall design and maintainability of the code.


339-344: Well-implemented host_group_mutex method

The new host_group_mutex method in ThreadedStreamLauncher is a good addition. It provides a clean and consistent way to add mutex-based host group functions to the launcher. The implementation correctly creates and adds a new StreamHostGroupMutexFunction to the fns vector, maintaining consistency with other similar methods in the class.


Line range hint 1-445: Overall assessment of changes to multi_launch.hpp

The changes to this file significantly enhance the GPU multi-launch framework by introducing well-designed group-based abstractions and improving existing functionality. Key improvements include:

  1. The new StreamGroupFunction provides a solid foundation for group-based operations.
  2. The StreamHostGroupMutexFunction adds useful mutex-based synchronization for host group functions.
  3. The ThreadedStreamLauncher class has been extended with new capabilities that leverage these group-based abstractions.

While these changes are generally positive, there are a few areas that require attention:

  1. The group_idx_modulo function should include a check for division by zero.
  2. The locking mechanism in StreamHostGroupMutexFunction could be optimized to reduce potential busy waiting.
  3. The use of the shared work_i variable in ThreadedStreamLauncher introduces potential thread safety issues that should be addressed.

Addressing these points will further improve the robustness and efficiency of the framework. Overall, this update represents a significant step forward in the capabilities of the GPU multi-launch system.

@@ -383,6 +440,7 @@ struct ThreadedStreamLauncher : public StreamLauncher<Datas, ThreadedStreamLaunc

private:
bool joined = false, started = false, done = false;
std::size_t work_i = 0;

⚠️ Potential issue

Reconsider the use of shared work_i variable

The addition of the work_i member variable introduces shared mutable state in a multi-threaded environment. This could lead to race conditions and unexpected behavior.

Consider one of the following alternatives:

  1. Make work_i a local variable within the get_work method if its state doesn't need to be preserved between calls.

  2. If the state needs to be preserved, use a thread-local variable:

std::pair<SFP, std::size_t> get_work() {
  static thread_local std::size_t work_i = 0;
  // ... rest of the method ...
}
  1. If shared state is necessary, ensure all access to work_i is properly synchronized, preferably by keeping all operations on it within the locked section of get_work.

These alternatives would help prevent potential race conditions and improve the thread safety of the code.

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/rocm.hpp (1)

208-210: LGTM: Improved readability in alloc_managed function

The introduction of the bytes variable enhances code clarity by separating the size calculation from the hipMallocManaged call. This change improves readability and maintainability without altering the function's behavior.

Consider adding a comment explaining the purpose of the bytes variable for even better clarity:

+ // Calculate total bytes to allocate
  auto const bytes = size * sizeof(T);
  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
  MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));
test/any/async_streaming.cpp (1)

142-142: Define expected values as constants for clarity

The comparison if (vals != std::vector<size_t>{3, 12}) uses hardcoded magic numbers, which can hinder code readability. Consider defining these values as named constants or adding comments to explain their significance.

inc/mkn/gpu/multi_launch.hpp (3)

61-63: Simplify group index calculation

The group_idx_modulo function can be simplified. Currently, it calculates the group index using:

return ((i - (i % gs)) / gs);

This expression is equivalent to integer division i / gs for integers. You can simplify the function to:

return i / gs;

This makes the code clearer and may improve performance slightly.


61-63: Avoid passing primitive types by const reference

In the group_idx_modulo function, the parameters gs and i are passed as const&. Since std::size_t is a primitive type, it's more efficient to pass them by value:

std::size_t group_idx_modulo(std::size_t gs, std::size_t i) {
  return i / gs;
}

This change avoids unnecessary indirection and aligns with best practices.


361-365: Avoid passing std::size_t by const reference

In the host_group_mutex method, group_size is passed as std::size_t const&. Since std::size_t is a primitive type, it's more efficient to pass it by value:

This& host_group_mutex(std::size_t group_size, Fn&& fn) {
  // ...
}

This change eliminates unnecessary indirection and follows C++ best practices for passing primitive types.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cd06ef9 and cb24f6d.

📒 Files selected for processing (3)
  • inc/mkn/gpu/multi_launch.hpp (8 hunks)
  • inc/mkn/gpu/rocm.hpp (2 hunks)
  • test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (4)
inc/mkn/gpu/rocm.hpp (2)

48-51: LGTM: Improved macro formatting

The reformatting of the MKN_GPU_ASSERT macro enhances code readability without altering its functionality. This change aligns with best practices for macro definitions.


Line range hint 1-458: Overall assessment: Minor improvements to code quality

The changes in this file, while minor, contribute positively to the overall code quality and readability. They align well with the broader objectives of the pull request to enhance the GPU multi-launch framework. No functional changes or potential issues were identified in this file.

test/any/async_streaming.cpp (1)

196-201: Refactor 'main' to better identify test failures

As previously mentioned, returning the sum of test results can make it difficult to identify which specific test failed. Consider using bitwise operations or handling each test result individually to improve failure detection.

inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 413-437: Resolution of previous concurrency issue with work_i

The potential race condition with the shared variable work_i, as previously noted in past reviews, has been addressed. Access to work_i is now properly synchronized using the work_ mutex in the get_work method, ensuring thread safety.

Also applies to: 471-471

Comment on lines +119 to +152
std::uint32_t test_threaded_host_group_mutex(std::size_t const& nthreads = 2) {
  using T = double;
  KUL_DBG_FUNC_ENTER;

  std::size_t constexpr group_size = 3;
  std::vector<size_t> vals((C + 1) / group_size);  // 2 values;
  std::vector<ManagedVector<T>> vecs(C + 1, ManagedVector<T>(NUM, 0));
  for (std::size_t i = 0; i < vecs.size(); ++i) std::fill_n(vecs[i].data(), NUM, i);

  ManagedVector<T*> datas(C + 1);
  for (std::size_t i = 0; i < vecs.size(); ++i) datas[i] = vecs[i].data();
  auto views = datas.data();

  ThreadedStreamLauncher{vecs, nthreads}
      .dev([=] __device__(auto const& i) { views[i][mkn::gpu::idx()] += 1; })
      .host([&](auto i) mutable {
        std::this_thread::sleep_for(200ms);
        for (auto& e : vecs[i]) e += 1;
      })
      .host_group_mutex(group_size,  // lambda scope is locked per group
                        [&](auto const i) { vals[group_idx_modulo(group_size, i)] += i; })
      .dev([=] __device__(auto const& i) { views[i][mkn::gpu::idx()] += 3; })();

  if (vals != std::vector<size_t>{3, 12}) return 1;

  std::size_t val = 5;
  for (auto const& vec : vecs) {
    for (auto const& e : vec)
      if (e != val) return 1;
    ++val;
  }

  return 0;
}

🛠️ Refactor suggestion

Refactor duplicated code in test functions to improve maintainability

The functions test_threaded_host_group_mutex and test_threaded_host_group_idx contain similar code blocks for vector initialization and launcher setup. Refactoring these common sections into reusable helper functions can enhance code maintainability and reduce duplication.

Also applies to: 154-192

KUL_DBG_FUNC_ENTER;

std::size_t constexpr group_size = 3;
std::vector<size_t> vals((C + 1) / group_size); // 2 values;

⚠️ Potential issue

Adjust the size calculation of 'vals' to handle all group sizes

The current calculation for the size of the vals vector may not correctly account for cases where (C + 1) is not perfectly divisible by group_size. This could lead to incorrect indexing or out-of-bounds errors.

Apply this diff to ensure the size of vals correctly accommodates all elements:

- std::vector<size_t> vals((C + 1) / group_size);  // 2 values;
+ std::vector<size_t> vals((C + group_size - 1) / group_size);

        for (auto& e : vecs[i]) e += 1;
      })
      .host_group_mutex(group_size,  // lambda scope is locked per group
                        [&](auto const i) { vals[group_idx_modulo(group_size, i)] += i; })

💡 Codebase verification

⚠️ Potential issue

Include multi_launch.hpp to define group_idx_modulo

The function group_idx_modulo is defined in inc/mkn/gpu/multi_launch.hpp, but this header is not included in test/any/async_streaming.cpp. Please add the necessary include to prevent compilation errors.

  • File: test/any/async_streaming.cpp
🔗 Analysis chain

Ensure 'group_idx_modulo' is defined to prevent compilation errors

The function group_idx_modulo is used but not defined within this file or included from any header. This will result in a compilation error.

Please define group_idx_modulo or include the appropriate header where it is declared.

To verify if group_idx_modulo is defined elsewhere, run the following script:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for the definition of 'group_idx_modulo' in the codebase.

# Test: Expect to find the definition of 'group_idx_modulo'
rg -A 5 "group_idx_modulo"

Length of output: 1176


Script:

#!/bin/bash
# Description: Check if 'multi_launch.hpp' is included in 'test/any/async_streaming.cpp'

rg '#include\s+"inc/mkn/gpu/multi_launch\.hpp"' test/any/async_streaming.cpp

Length of output: 78

Comment on lines +287 to +294
std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

if (lock.try_lock()) {
  fn(i);
  strat.status[i] = SFS::WAIT;  // done
} else {
  strat.status[i] = SFS::FIRST;  // retry
}

⚠️ Potential issue

Potential data race on strat.status[i]

In the run method of StreamHostGroupMutexFunction, strat.status[i] is modified without synchronization. Since strat.status[i] may be accessed concurrently by multiple threads in methods like get_work and is_fn_finished, this can lead to data races and undefined behavior.

Consider one of the following solutions:

  • Protect access to strat.status[i] with a mutex or lock.
  • Use std::atomic<SFS> for strat.status[i] to ensure thread-safe access.

Apply this diff to change status to an atomic vector:

- std::vector<SFS> status;
+ std::vector<std::atomic<SFS>> status;

Ensure all reads and writes to status[i] use atomic operations.

Committable suggestion was skipped due to low confidence.
