OmniSciDB  72c90bc290
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
GpuSharedMemCodeBuilder Class Reference

#include <GpuSharedMemoryUtils.h>

+ Collaboration diagram for GpuSharedMemCodeBuilder:

Public Member Functions

 GpuSharedMemCodeBuilder (llvm::Module *module, llvm::LLVMContext &context, const QueryMemoryDescriptor &qmd, const std::vector< TargetInfo > &targets, const std::vector< int64_t > &init_agg_values, const size_t executor_id)
 
void codegen ()
 
void injectFunctionsInto (llvm::Function *query_func)
 
llvm::Function * getReductionFunction () const
 
llvm::Function * getInitFunction () const
 
std::string toString () const
 

Protected Member Functions

void codegenReduction ()
 
void codegenInitialization ()
 
llvm::Function * createReductionFunction () const
 
llvm::Function * createInitFunction () const
 
llvm::Function * getFunction (const std::string &func_name) const
 

Protected Attributes

size_t executor_id_
 
llvm::Module * module_
 
llvm::LLVMContext & context_
 
llvm::Function * reduction_func_
 
llvm::Function * init_func_
 
const QueryMemoryDescriptor query_mem_desc_
 
const std::vector< TargetInfotargets_
 
const std::vector< int64_t > init_agg_values_
 

Detailed Description

This is a builder class for extra functions that are required to support GPU shared memory usage for GroupByPerfectHash query types.

This class does not own its own LLVM module and uses a pointer to the global module provided to it as an argument during construction

Definition at line 43 of file GpuSharedMemoryUtils.h.

Constructor & Destructor Documentation

GpuSharedMemCodeBuilder::GpuSharedMemCodeBuilder ( llvm::Module *  module,
llvm::LLVMContext &  context,
const QueryMemoryDescriptor qmd,
const std::vector< TargetInfo > &  targets,
const std::vector< int64_t > &  init_agg_values,
const size_t  executor_id 
)

This class currently works only with:

  1. row-wise output memory layout
  2. GroupByPerfectHash
  3. single-column group by
  4. Keyless hash strategy (no redundant group column in the output buffer)

All conditions in 1, 3, and 4 can be easily relaxed if proper code is added to support them in the future.

Definition at line 24 of file GpuSharedMemoryUtils.cpp.

References CHECK, QueryMemoryDescriptor::didOutputColumnar(), QueryMemoryDescriptor::getQueryDescriptionType(), GroupByPerfectHash, QueryMemoryDescriptor::hasKeylessHash(), and query_mem_desc_.

31  : executor_id_(executor_id)
32  , module_(llvm_module)
33  , context_(context)
34  , reduction_func_(nullptr)
35  , init_func_(nullptr)
36  , query_mem_desc_(qmd)
37  , targets_(targets)
38  , init_agg_values_(init_agg_values) {
53 }
const QueryMemoryDescriptor query_mem_desc_
llvm::LLVMContext & context_
QueryDescriptionType getQueryDescriptionType() const
const std::vector< int64_t > init_agg_values_
const std::vector< TargetInfo > targets_
#define CHECK(condition)
Definition: Logger.h:291
llvm::Function * reduction_func_

+ Here is the call graph for this function:

Member Function Documentation

void GpuSharedMemCodeBuilder::codegen ( )

generates code for both the reduction and initialization steps required for shared memory usage

Definition at line 55 of file GpuSharedMemoryUtils.cpp.

References CHECK, codegenInitialization(), codegenReduction(), createInitFunction(), createReductionFunction(), DEBUG_TIMER, init_func_, reduction_func_, and verify_function_ir().

55  {
56  auto timer = DEBUG_TIMER(__func__);
57 
58  // codegen the init function
63 
64  // codegen the reduction function:
69 }
llvm::Function * createInitFunction() const
void verify_function_ir(const llvm::Function *func)
#define CHECK(condition)
Definition: Logger.h:291
#define DEBUG_TIMER(name)
Definition: Logger.h:412
llvm::Function * reduction_func_
llvm::Function * createReductionFunction() const

+ Here is the call graph for this function:

void GpuSharedMemCodeBuilder::codegenInitialization ( )
protected

Generates code for the shared memory buffer initialization

This function generates code to initialize the shared memory buffer, the way we initialize the group by output buffer on the host. Similar to the reduction function, it is assumed that there are at least as many threads as there are entries in the buffer. Each entry is assigned to a single thread, and then all slots corresponding to that entry are initialized with aggregate init values.

Definition at line 258 of file GpuSharedMemoryUtils.cpp.

References CHECK, CHECK_GE, anonymous_namespace{GpuSharedMemoryUtils.cpp}::codegen_smem_dest_slot_ptr(), context_, ResultSet::fixupQueryMemoryDescriptor(), getFunction(), init_agg_values_, init_func_, ll_int(), query_mem_desc_, sync_threadblock(), targets_, and UNREACHABLE.

Referenced by codegen().

258  {
259  CHECK(init_func_);
260  // similar to the rest of the system, we used fixup QMD to be able to handle reductions
261  // it should be removed in the future.
262  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
263  CHECK(!fixup_query_mem_desc.didOutputColumnar());
264  CHECK(fixup_query_mem_desc.hasKeylessHash());
265  CHECK_GE(init_agg_values_.size(), targets_.size());
266 
267  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", init_func_);
268  auto bb_body = llvm::BasicBlock::Create(context_, ".body", init_func_);
269  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", init_func_);
270 
271  llvm::IRBuilder<> ir_builder(bb_entry);
272  const auto func_thread_index = getFunction("get_thread_index");
273  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
274 
275  // declare dynamic shared memory:
276  const auto declare_smem_func = getFunction("declare_dynamic_shared_memory");
277  const auto shared_mem_buffer =
278  ir_builder.CreateCall(declare_smem_func, {}, "shared_mem_buffer");
279 
280  const auto entry_count = ll_int(fixup_query_mem_desc.getEntryCount(), context_);
281  const auto is_thread_inbound =
282  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
283  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
284 
285  ir_builder.SetInsertPoint(bb_body);
286  // compute byte offset assigned to this thread:
287  const auto row_size_bytes = ll_int(fixup_query_mem_desc.getRowWidth(), context_);
288  auto byte_offset_ll = ir_builder.CreateMul(row_size_bytes, thread_idx, "byte_offset");
289 
290  const auto dest_byte_stream = ir_builder.CreatePointerCast(
291  shared_mem_buffer, llvm::Type::getInt8PtrTy(context_), "dest_byte_stream");
292 
293  // each thread will be responsible for one
294  const auto& col_slot_context = fixup_query_mem_desc.getColSlotContext();
295  size_t init_agg_idx = 0;
296  for (size_t target_logical_idx = 0; target_logical_idx < targets_.size();
297  ++target_logical_idx) {
298  const auto& target_info = targets_[target_logical_idx];
299  const auto& slots_for_target = col_slot_context.getSlotsForCol(target_logical_idx);
300  for (size_t slot_idx = slots_for_target.front(); slot_idx <= slots_for_target.back();
301  slot_idx++) {
302  const auto slot_size = fixup_query_mem_desc.getPaddedSlotWidthBytes(slot_idx);
303 
304  auto casted_dest_slot_address = codegen_smem_dest_slot_ptr(context_,
305  fixup_query_mem_desc,
306  ir_builder,
307  slot_idx,
308  target_info,
309  dest_byte_stream,
310  byte_offset_ll);
311 
312  llvm::Value* init_value_ll = nullptr;
313  if (slot_size == sizeof(int32_t)) {
314  init_value_ll =
315  ll_int(static_cast<int32_t>(init_agg_values_[init_agg_idx++]), context_);
316  } else if (slot_size == sizeof(int64_t)) {
317  init_value_ll =
318  ll_int(static_cast<int64_t>(init_agg_values_[init_agg_idx++]), context_);
319  } else {
320  UNREACHABLE() << "Invalid slot size encountered.";
321  }
322  ir_builder.CreateStore(init_value_ll, casted_dest_slot_address);
323 
324  // if not the last loop, we compute the next offset:
325  if (slot_idx != (col_slot_context.getSlotCount() - 1)) {
326  byte_offset_ll = ir_builder.CreateAdd(
327  byte_offset_ll, ll_int(static_cast<size_t>(slot_size), context_));
328  }
329  }
330  }
331 
332  ir_builder.CreateBr(bb_exit);
333 
334  ir_builder.SetInsertPoint(bb_exit);
335  // synchronize all threads within a threadblock:
336  const auto sync_threadblock = getFunction("sync_threadblock");
337  ir_builder.CreateCall(sync_threadblock, {});
338  ir_builder.CreateRet(shared_mem_buffer);
339 }
llvm::ConstantInt * ll_int(const T v, llvm::LLVMContext &context)
#define UNREACHABLE()
Definition: Logger.h:338
#define CHECK_GE(x, y)
Definition: Logger.h:306
const QueryMemoryDescriptor query_mem_desc_
llvm::LLVMContext & context_
const std::vector< int64_t > init_agg_values_
llvm::Value * codegen_smem_dest_slot_ptr(llvm::LLVMContext &context, const QueryMemoryDescriptor &query_mem_desc, llvm::IRBuilder<> &ir_builder, const size_t slot_idx, const TargetInfo &target_info, llvm::Value *dest_byte_stream, llvm::Value *byte_offset)
static QueryMemoryDescriptor fixupQueryMemoryDescriptor(const QueryMemoryDescriptor &)
Definition: ResultSet.cpp:766
const std::vector< TargetInfo > targets_
#define CHECK(condition)
Definition: Logger.h:291
__device__ void sync_threadblock()
llvm::Function * getFunction(const std::string &func_name) const

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void GpuSharedMemCodeBuilder::codegenReduction ( )
protected

Generates code for the reduction functionality (from shared memory into global memory)

The reduction function is going to be used to reduce group by buffer stored in the shared memory, back into global memory buffer. The general procedure is very similar to the what we have ResultSetReductionJIT, with some major differences that will be discussed below:

The general procedure is as follows:

  1. the function takes three arguments: 1) dest_buffer_ptr which points to global memory group by buffer (what existed before), 2) src_buffer_ptr which points to the shared memory group by buffer, exclusively accessed by each specific GPU thread-block, 3) total buffer size.
  2. We assign each thread to a specific entry (all targets within that entry), so any thread with an index larger than max entries, will have an early return from this function
  3. It is assumed here that there are at least as many threads in the GPU as there are entries in the group by buffer. In practice, given the buffer sizes that we deal with, this is a reasonable asumption, but can be easily relaxed in the future if needed to: threads can form a loop and process all entries until all are finished. It should be noted that we currently don't use shared memory if there are more entries than number of threads.
  4. We loop over all slots corresponding to a specific entry, and use ResultSetReductionJIT's reduce_one_entry_idx to reduce one slot from the destination buffer into source buffer. The only difference is that we should replace all agg_* funcitons within this code with their agg_*_shared counterparts, which use atomics operations and are used on the GPU.
  5. Once all threads are done, we return from the function.

Definition at line 98 of file GpuSharedMemoryUtils.cpp.

References run_benchmark_import::args, CHECK, context_, executor_id_, ResultSet::fixupQueryMemoryDescriptor(), get_int_type(), CodegenUtil::getCalledFunctionName(), QueryMemoryDescriptor::getEntryCount(), getFunction(), result_set::initialize_target_values_for_storage(), ll_int(), module_, query_mem_desc_, reduction_func_, sync_threadblock(), and targets_.

Referenced by codegen().

98  {
100  // adding names to input arguments:
101  auto arg_it = reduction_func_->arg_begin();
102  auto dest_buffer_ptr = &*arg_it;
103  dest_buffer_ptr->setName("dest_buffer_ptr");
104  arg_it++;
105  auto src_buffer_ptr = &*arg_it;
106  src_buffer_ptr->setName("src_buffer_ptr");
107  arg_it++;
108  auto buffer_size = &*arg_it;
109  buffer_size->setName("buffer_size");
110 
111  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", reduction_func_);
112  auto bb_body = llvm::BasicBlock::Create(context_, ".body", reduction_func_);
113  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", reduction_func_);
114  llvm::IRBuilder<> ir_builder(bb_entry);
115 
116  // synchronize all threads within a threadblock:
117  const auto sync_threadblock = getFunction("sync_threadblock");
118  ir_builder.CreateCall(sync_threadblock, {});
119 
120  const auto func_thread_index = getFunction("get_thread_index");
121  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
122 
123  // branching out of out of bound:
124  const auto entry_count = ll_int(query_mem_desc_.getEntryCount(), context_);
125  const auto entry_count_i32 =
126  ll_int(static_cast<int32_t>(query_mem_desc_.getEntryCount()), context_);
127  const auto is_thread_inbound =
128  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
129  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
130 
131  ir_builder.SetInsertPoint(bb_body);
132 
133  // cast src/dest buffers into byte streams:
134  auto src_byte_stream = ir_builder.CreatePointerCast(
135  src_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "src_byte_stream");
136  const auto dest_byte_stream = ir_builder.CreatePointerCast(
137  dest_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "dest_byte_stream");
138 
139  // running the result set reduction JIT code to get reduce_one_entry_idx function
140  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
141  auto rs_reduction_jit = std::make_unique<GpuReductionHelperJIT>(
142  fixup_query_mem_desc,
143  targets_,
145  executor_id_);
146  auto reduction_code = rs_reduction_jit->codegen();
147  CHECK(reduction_code.module);
148  reduction_code.module->setDataLayout(
149  "e-p:64:64:64-i1:8:8-i8:8:8-"
150  "i16:16:16-i32:32:32-i64:64:64-"
151  "f32:32:32-f64:64:64-v16:16:16-"
152  "v32:32:32-v64:64:64-v128:128:128-n16:32:64");
153  reduction_code.module->setTargetTriple("nvptx64-nvidia-cuda");
154  llvm::Linker linker(*module_);
155  std::unique_ptr<llvm::Module> owner(reduction_code.module);
156  bool link_error = linker.linkInModule(std::move(owner));
157  CHECK(!link_error);
158 
159  // go through the reduction code and replace all occurances of agg functions
160  // with their _shared counterparts, which are specifically used in GPUs
161  auto reduce_one_entry_func = getFunction("reduce_one_entry");
162  bool agg_func_found = true;
163  while (agg_func_found) {
164  agg_func_found = false;
165  for (auto it = llvm::inst_begin(reduce_one_entry_func);
166  it != llvm::inst_end(reduce_one_entry_func);
167  it++) {
168  if (!llvm::isa<llvm::CallInst>(*it)) {
169  continue;
170  }
171  auto& func_call = llvm::cast<llvm::CallInst>(*it);
172  auto const func_name = CodegenUtil::getCalledFunctionName(func_call);
173  if (func_name) {
174  std::string_view func_name_str = *func_name;
175  if (func_name_str.substr(0, 4) == "agg_") {
176  if (func_name_str.substr(func_name_str.length() - 7) == "_shared") {
177  continue;
178  }
179  agg_func_found = true;
180  std::vector<llvm::Value*> args;
181  args.reserve(func_call.getNumOperands());
182  for (size_t i = 0; i < func_call.getNumOperands() - 1; ++i) {
183  args.push_back(func_call.getArgOperand(i));
184  }
185  auto gpu_agg_func = getFunction(std::string(func_name_str) + "_shared");
186  llvm::ReplaceInstWithInst(&func_call,
187  llvm::CallInst::Create(gpu_agg_func, args, ""));
188  break;
189  }
190  }
191  }
192  }
193  const auto reduce_one_entry_idx_func = getFunction("reduce_one_entry_idx");
194  CHECK(reduce_one_entry_idx_func);
195 
196  // qmd_handles are only used with count distinct and baseline group by
197  // serialized varlen buffer is only used with SAMPLE on varlen types, which we will
198  // disable for current shared memory support.
199  const auto null_ptr_ll =
200  llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(context_, 0));
201  const auto thread_idx_i32 = ir_builder.CreateCast(
202  llvm::Instruction::CastOps::Trunc, thread_idx, get_int_type(32, context_));
203  ir_builder.CreateCall(reduce_one_entry_idx_func,
204  {dest_byte_stream,
205  src_byte_stream,
206  thread_idx_i32,
207  entry_count_i32,
208  null_ptr_ll,
209  null_ptr_ll,
210  null_ptr_ll},
211  "");
212  ir_builder.CreateBr(bb_exit);
213  llvm::ReturnInst::Create(context_, bb_exit);
214 }
std::optional< std::string_view > getCalledFunctionName(llvm::CallInst &call_inst)
llvm::ConstantInt * ll_int(const T v, llvm::LLVMContext &context)
const QueryMemoryDescriptor query_mem_desc_
llvm::Type * get_int_type(const int width, llvm::LLVMContext &context)
llvm::LLVMContext & context_
std::vector< int64_t > initialize_target_values_for_storage(const std::vector< TargetInfo > &targets)
static QueryMemoryDescriptor fixupQueryMemoryDescriptor(const QueryMemoryDescriptor &)
Definition: ResultSet.cpp:766
const std::vector< TargetInfo > targets_
#define CHECK(condition)
Definition: Logger.h:291
__device__ void sync_threadblock()
llvm::Function * getFunction(const std::string &func_name) const
llvm::Function * reduction_func_

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

llvm::Function * GpuSharedMemCodeBuilder::createInitFunction ( ) const
protected

Creates the initialization function in the LLVM module, with predefined arguments and return type

Definition at line 354 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

354  {
355  std::vector<llvm::Type*> input_arguments;
356  input_arguments.push_back(
357  llvm::Type::getInt64PtrTy(context_)); // a pointer to the buffer
358  input_arguments.push_back(llvm::Type::getInt32Ty(context_)); // buffer size in bytes
359 
360  llvm::FunctionType* ft = llvm::FunctionType::get(
361  llvm::Type::getInt64PtrTy(context_), input_arguments, false);
362  const auto init_function = llvm::Function::Create(
363  ft, llvm::Function::ExternalLinkage, "init_smem_func", module_);
364  return init_function;
365 }
llvm::LLVMContext & context_

+ Here is the caller graph for this function:

llvm::Function * GpuSharedMemCodeBuilder::createReductionFunction ( ) const
protected

Create the reduction function in the LLVM module, with predefined arguments and return type

Definition at line 341 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

341  {
342  std::vector<llvm::Type*> input_arguments;
343  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
344  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
345  input_arguments.push_back(llvm::Type::getInt32Ty(context_));
346 
347  llvm::FunctionType* ft =
348  llvm::FunctionType::get(llvm::Type::getVoidTy(context_), input_arguments, false);
349  const auto reduction_function = llvm::Function::Create(
350  ft, llvm::Function::ExternalLinkage, "reduce_from_smem_to_gmem", module_);
351  return reduction_function;
352 }
llvm::LLVMContext & context_

+ Here is the caller graph for this function:

llvm::Function * GpuSharedMemCodeBuilder::getFunction ( const std::string &  func_name) const
protected

Search for a particular funciton name in the module, and returns it if found

Definition at line 367 of file GpuSharedMemoryUtils.cpp.

References CHECK, and module_.

Referenced by codegenInitialization(), and codegenReduction().

367  {
368  const auto function = module_->getFunction(func_name);
369  CHECK(function) << func_name << " is not found in the module.";
370  return function;
371 }
#define CHECK(condition)
Definition: Logger.h:291

+ Here is the caller graph for this function:

llvm::Function* GpuSharedMemCodeBuilder::getInitFunction ( ) const
inline

Definition at line 65 of file GpuSharedMemoryUtils.h.

References init_func_.

65 { return init_func_; }
llvm::Function* GpuSharedMemCodeBuilder::getReductionFunction ( ) const
inline

Definition at line 64 of file GpuSharedMemoryUtils.h.

References reduction_func_.

64 { return reduction_func_; }
llvm::Function * reduction_func_
void GpuSharedMemCodeBuilder::injectFunctionsInto ( llvm::Function *  query_func)

Once the reduction and init functions are generated, this function takes the main query function and replaces the previous placeholders, which were inserted in the query template, with these new functions.

Definition at line 405 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and anonymous_namespace{GpuSharedMemoryUtils.cpp}::replace_called_function_with().

405  {
407  CHECK(init_func_);
408  replace_called_function_with(query_func, "init_shared_mem", init_func_);
409  replace_called_function_with(query_func, "write_back_nop", reduction_func_);
410 }
void replace_called_function_with(llvm::Function *main_func, const std::string &target_func_name, llvm::Function *replace_func)
#define CHECK(condition)
Definition: Logger.h:291
llvm::Function * reduction_func_

+ Here is the call graph for this function:

std::string GpuSharedMemCodeBuilder::toString ( ) const

Definition at line 412 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and serialize_llvm_object().

412  {
414  CHECK(init_func_);
416 }
std::string serialize_llvm_object(const T *llvm_obj)
#define CHECK(condition)
Definition: Logger.h:291
llvm::Function * reduction_func_

+ Here is the call graph for this function:

Member Data Documentation

llvm::LLVMContext& GpuSharedMemCodeBuilder::context_
protected
size_t GpuSharedMemCodeBuilder::executor_id_
protected

Definition at line 93 of file GpuSharedMemoryUtils.h.

Referenced by codegenReduction().

const std::vector<int64_t> GpuSharedMemCodeBuilder::init_agg_values_
protected

Definition at line 100 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization().

llvm::Function* GpuSharedMemCodeBuilder::init_func_
protected
llvm::Module* GpuSharedMemCodeBuilder::module_
protected
const QueryMemoryDescriptor GpuSharedMemCodeBuilder::query_mem_desc_
protected
llvm::Function* GpuSharedMemCodeBuilder::reduction_func_
protected
const std::vector<TargetInfo> GpuSharedMemCodeBuilder::targets_
protected

Definition at line 99 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization(), and codegenReduction().


The documentation for this class was generated from the following files: