OmniSciDB  95562058bd
GpuSharedMemCodeBuilder Class Reference

#include <GpuSharedMemoryUtils.h>


Public Member Functions

 GpuSharedMemCodeBuilder (llvm::Module *module, llvm::LLVMContext &context, const QueryMemoryDescriptor &qmd, const std::vector< TargetInfo > &targets, const std::vector< int64_t > &init_agg_values)
 
void codegen ()
 
void injectFunctionsInto (llvm::Function *query_func)
 
llvm::Function * getReductionFunction () const
 
llvm::Function * getInitFunction () const
 
std::string toString () const
 

Protected Member Functions

void codegenReduction ()
 
void codegenInitialization ()
 
llvm::Function * createReductionFunction () const
 
llvm::Function * createInitFunction () const
 
llvm::Function * getFunction (const std::string &func_name) const
 

Protected Attributes

llvm::Module * module_
 
llvm::LLVMContext & context_
 
llvm::Function * reduction_func_
 
llvm::Function * init_func_
 
const QueryMemoryDescriptor query_mem_desc_
 
const std::vector< TargetInfo > targets_
 
const std::vector< int64_t > init_agg_values_
 

Detailed Description

This is a builder class for extra functions that are required to support GPU shared memory usage for GroupByPerfectHash query types.

This class does not own its own LLVM module; it uses a pointer to the global module provided to it as an argument during construction.

Definition at line 43 of file GpuSharedMemoryUtils.h.

Constructor & Destructor Documentation

GpuSharedMemCodeBuilder::GpuSharedMemCodeBuilder ( llvm::Module *  module,
llvm::LLVMContext &  context,
const QueryMemoryDescriptor &  qmd,
const std::vector< TargetInfo > &  targets,
const std::vector< int64_t > &  init_agg_values 
)

This class currently works only with:

  1. row-wise output memory layout
  2. GroupByPerfectHash
  3. single-column group by
  4. Keyless hash strategy (no redundant group column in the output buffer)

All conditions in 1, 3, and 4 can be easily relaxed if proper code is added to support them in the future.

Definition at line 21 of file GpuSharedMemoryUtils.cpp.

References CHECK, QueryMemoryDescriptor::didOutputColumnar(), QueryMemoryDescriptor::getQueryDescriptionType(), GroupByPerfectHash, QueryMemoryDescriptor::hasKeylessHash(), and query_mem_desc_.

27  : module_(module)
28  , context_(context)
29  , reduction_func_(nullptr)
30  , init_func_(nullptr)
31  , query_mem_desc_(qmd)
32  , targets_(targets)
33  , init_agg_values_(init_agg_values) {
48 }


Member Function Documentation

void GpuSharedMemCodeBuilder::codegen ( )

generates code for both the reduction and initialization steps required for shared memory usage

Definition at line 50 of file GpuSharedMemoryUtils.cpp.

References CHECK, codegenInitialization(), codegenReduction(), createInitFunction(), createReductionFunction(), DEBUG_TIMER, init_func_, reduction_func_, and verify_function_ir().

50  {
51  auto timer = DEBUG_TIMER(__func__);
52 
53  // codegen the init function
58 
59  // codegen the reduction function:
64 }


void GpuSharedMemCodeBuilder::codegenInitialization ( )
protected

Generates code for the shared memory buffer initialization

This function generates code to initialize the shared memory buffer, the way we initialize the group by output buffer on the host. Similar to the reduction function, it is assumed that there are at least as many threads as there are entries in the buffer. Each entry is assigned to a single thread, and then all slots corresponding to that entry are initialized with aggregate init values.
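The per-thread layout logic described above can be sketched on the host. This is a hypothetical stand-in for the generated IR, not code from the module: plain memory writes replace the IR stores, and the slot widths and init values in the test are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side sketch of the initialization one thread performs for its entry.
// Mirrors the generated code: byte_offset = row_width * thread_idx, then one
// store per slot, with the value truncated to the slot's padded width.
void init_entry(std::vector<uint8_t>& smem,
                size_t thread_idx,
                size_t row_width,
                const std::vector<size_t>& slot_widths,
                const std::vector<int64_t>& init_vals) {
  size_t byte_offset = row_width * thread_idx;
  for (size_t i = 0; i < slot_widths.size(); ++i) {
    uint8_t* dest = smem.data() + byte_offset;
    if (slot_widths[i] == sizeof(int32_t)) {
      int32_t v = static_cast<int32_t>(init_vals[i]);
      std::memcpy(dest, &v, sizeof v);
    } else if (slot_widths[i] == sizeof(int64_t)) {
      int64_t v = init_vals[i];
      std::memcpy(dest, &v, sizeof v);
    }  // any other slot size is UNREACHABLE in the real code
    byte_offset += slot_widths[i];
  }
}
```

In the generated IR each iteration of this loop is unrolled per slot, and the final `sync_threadblock()` barrier ensures every entry is initialized before any thread reads the buffer.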

Definition at line 244 of file GpuSharedMemoryUtils.cpp.

References CHECK, CHECK_GE, anonymous_namespace{GpuSharedMemoryUtils.cpp}::codegen_smem_dest_slot_ptr(), context_, ResultSet::fixupQueryMemoryDescriptor(), getFunction(), init_agg_values_, init_func_, ll_int(), query_mem_desc_, sync_threadblock(), targets_, and UNREACHABLE.

Referenced by codegen().

244  {
245  CHECK(init_func_);
246  // similar to the rest of the system, we use the fixed-up QMD to be able to handle reductions;
247  // it should be removed in the future.
248  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
249  CHECK(!fixup_query_mem_desc.didOutputColumnar());
250  CHECK(fixup_query_mem_desc.hasKeylessHash());
251  CHECK_GE(init_agg_values_.size(), targets_.size());
252 
253  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", init_func_);
254  auto bb_body = llvm::BasicBlock::Create(context_, ".body", init_func_);
255  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", init_func_);
256 
257  llvm::IRBuilder<> ir_builder(bb_entry);
258  const auto func_thread_index = getFunction("get_thread_index");
259  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
260 
261  // declare dynamic shared memory:
262  const auto declare_smem_func = getFunction("declare_dynamic_shared_memory");
263  const auto shared_mem_buffer =
264  ir_builder.CreateCall(declare_smem_func, {}, "shared_mem_buffer");
265 
266  const auto entry_count = ll_int(fixup_query_mem_desc.getEntryCount(), context_);
267  const auto is_thread_inbound =
268  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
269  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
270 
271  ir_builder.SetInsertPoint(bb_body);
272  // compute byte offset assigned to this thread:
273  const auto row_size_bytes = ll_int(fixup_query_mem_desc.getRowWidth(), context_);
274  auto byte_offset_ll = ir_builder.CreateMul(row_size_bytes, thread_idx, "byte_offset");
275 
276  const auto dest_byte_stream = ir_builder.CreatePointerCast(
277  shared_mem_buffer, llvm::Type::getInt8PtrTy(context_), "dest_byte_stream");
278 
279  // each thread will be responsible for one
280  const auto& col_slot_context = fixup_query_mem_desc.getColSlotContext();
281  size_t init_agg_idx = 0;
282  for (size_t target_logical_idx = 0; target_logical_idx < targets_.size();
283  ++target_logical_idx) {
284  const auto& target_info = targets_[target_logical_idx];
285  const auto& slots_for_target = col_slot_context.getSlotsForCol(target_logical_idx);
286  for (size_t slot_idx = slots_for_target.front(); slot_idx <= slots_for_target.back();
287  slot_idx++) {
288  const auto slot_size = fixup_query_mem_desc.getPaddedSlotWidthBytes(slot_idx);
289 
290  auto casted_dest_slot_address = codegen_smem_dest_slot_ptr(context_,
291  fixup_query_mem_desc,
292  ir_builder,
293  slot_idx,
294  target_info,
295  dest_byte_stream,
296  byte_offset_ll);
297 
298  llvm::Value* init_value_ll = nullptr;
299  if (slot_size == sizeof(int32_t)) {
300  init_value_ll =
301  ll_int(static_cast<int32_t>(init_agg_values_[init_agg_idx++]), context_);
302  } else if (slot_size == sizeof(int64_t)) {
303  init_value_ll =
304  ll_int(static_cast<int64_t>(init_agg_values_[init_agg_idx++]), context_);
305  } else {
306  UNREACHABLE() << "Invalid slot size encountered.";
307  }
308  ir_builder.CreateStore(init_value_ll, casted_dest_slot_address);
309 
310  // if not the last loop, we compute the next offset:
311  if (slot_idx != (col_slot_context.getSlotCount() - 1)) {
312  byte_offset_ll = ir_builder.CreateAdd(
313  byte_offset_ll, ll_int(static_cast<size_t>(slot_size), context_));
314  }
315  }
316  }
317 
318  ir_builder.CreateBr(bb_exit);
319 
320  ir_builder.SetInsertPoint(bb_exit);
321  // synchronize all threads within a threadblock:
322  const auto sync_threadblock = getFunction("sync_threadblock");
323  ir_builder.CreateCall(sync_threadblock, {});
324  ir_builder.CreateRet(shared_mem_buffer);
325 }


void GpuSharedMemCodeBuilder::codegenReduction ( )
protected

Generates code for the reduction functionality (from shared memory into global memory)

The reduction function is used to reduce the group by buffer stored in shared memory back into the global memory buffer. The general procedure is very similar to what we have in ResultSetReductionJIT, with some major differences that will be discussed below:

The general procedure is as follows:

  1. The function takes three arguments: 1) dest_buffer_ptr, which points to the global memory group by buffer (what existed before), 2) src_buffer_ptr, which points to the shared memory group by buffer, exclusively accessed by each specific GPU thread-block, and 3) the total buffer size.
  2. We assign each thread to a specific entry (all targets within that entry), so any thread with an index larger than the max number of entries will return early from this function.
  3. It is assumed here that there are at least as many threads in the GPU as there are entries in the group by buffer. In practice, given the buffer sizes that we deal with, this is a reasonable assumption, but it can be easily relaxed in the future if needed: threads can form a loop and process all entries until all are finished. It should be noted that we currently don't use shared memory if there are more entries than threads.
  4. We loop over all slots corresponding to a specific entry, and use ResultSetReductionJIT's reduce_one_entry_idx to reduce one slot from the source buffer into the destination buffer. The only difference is that we should replace all agg_* functions within this code with their agg_*_shared counterparts, which use atomic operations and are used on the GPU.
  5. Once all threads are done, we return from the function.
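The per-thread procedure above can be sketched in plain C++. This is a hypothetical single-threaded stand-in: a simple in-place sum replaces reduce_one_entry_idx, and plain additions replace the agg_*_shared atomics the GPU code actually calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One thread reduces one entry (all of its slots) from the shared memory
// buffer (src) into the global memory buffer (dest). Out-of-bound threads
// return early, as in step 2 above.
void reduce_entry(int64_t* dest, const int64_t* src, size_t thread_idx,
                  size_t entry_count, size_t slots_per_entry) {
  if (thread_idx >= entry_count) {
    return;  // early exit for threads beyond the entry count
  }
  for (size_t s = 0; s < slots_per_entry; ++s) {
    const size_t i = thread_idx * slots_per_entry + s;
    dest[i] += src[i];  // agg_sum_shared would be an atomic add on the GPU
  }
}
```

The atomics matter in the real code because multiple thread-blocks reduce their shared memory buffers into the same global buffer concurrently.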

Definition at line 93 of file GpuSharedMemoryUtils.cpp.

References run_benchmark_import::args, CHECK, context_, ResultSet::fixupQueryMemoryDescriptor(), get_int_type(), QueryMemoryDescriptor::getEntryCount(), getFunction(), initialize_target_values_for_storage(), ll_int(), module_, query_mem_desc_, reduction_func_, sync_threadblock(), and targets_.

Referenced by codegen().

93  {
95  // adding names to input arguments:
96  auto arg_it = reduction_func_->arg_begin();
97  auto dest_buffer_ptr = &*arg_it;
98  dest_buffer_ptr->setName("dest_buffer_ptr");
99  arg_it++;
100  auto src_buffer_ptr = &*arg_it;
101  src_buffer_ptr->setName("src_buffer_ptr");
102  arg_it++;
103  auto buffer_size = &*arg_it;
104  buffer_size->setName("buffer_size");
105 
106  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", reduction_func_);
107  auto bb_body = llvm::BasicBlock::Create(context_, ".body", reduction_func_);
108  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", reduction_func_);
109  llvm::IRBuilder<> ir_builder(bb_entry);
110 
111  // synchronize all threads within a threadblock:
112  const auto sync_threadblock = getFunction("sync_threadblock");
113  ir_builder.CreateCall(sync_threadblock, {});
114 
115  const auto func_thread_index = getFunction("get_thread_index");
116  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
117 
118  // branching out if out of bounds:
119  const auto entry_count = ll_int(query_mem_desc_.getEntryCount(), context_);
120  const auto entry_count_i32 =
121  ll_int(static_cast<int32_t>(query_mem_desc_.getEntryCount()), context_);
122  const auto is_thread_inbound =
123  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
124  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
125 
126  ir_builder.SetInsertPoint(bb_body);
127 
128  // cast src/dest buffers into byte streams:
129  auto src_byte_stream = ir_builder.CreatePointerCast(
130  src_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "src_byte_stream");
131  const auto dest_byte_stream = ir_builder.CreatePointerCast(
132  dest_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "dest_byte_stream");
133 
134  // running the result set reduction JIT code to get reduce_one_entry_idx function
135  auto rs_reduction_jit = std::make_unique<GpuReductionHelperJIT>(
137  targets_,
139  auto reduction_code = rs_reduction_jit->codegen();
140  reduction_code.module->setDataLayout(
141  "e-p:64:64:64-i1:8:8-i8:8:8-"
142  "i16:16:16-i32:32:32-i64:64:64-"
143  "f32:32:32-f64:64:64-v16:16:16-"
144  "v32:32:32-v64:64:64-v128:128:128-n16:32:64");
145  reduction_code.module->setTargetTriple("nvptx64-nvidia-cuda");
146 
147  llvm::Linker linker(*module_);
148  bool link_error = linker.linkInModule(std::move(reduction_code.module));
149  CHECK(!link_error);
150 
151  // go through the reduction code and replace all occurrences of agg functions
152  // with their _shared counterparts, which are specifically used on GPUs
153  auto reduce_one_entry_func = getFunction("reduce_one_entry");
154  bool agg_func_found = true;
155  while (agg_func_found) {
156  agg_func_found = false;
157  for (auto it = llvm::inst_begin(reduce_one_entry_func);
158  it != llvm::inst_end(reduce_one_entry_func);
159  it++) {
160  if (!llvm::isa<llvm::CallInst>(*it)) {
161  continue;
162  }
163  auto& func_call = llvm::cast<llvm::CallInst>(*it);
164  std::string func_name = func_call.getCalledFunction()->getName().str();
165  if (func_name.length() > 4 && func_name.substr(0, 4) == "agg_") {
166  if (func_name.length() > 7 &&
167  func_name.substr(func_name.length() - 7) == "_shared") {
168  continue;
169  }
170  agg_func_found = true;
171  std::vector<llvm::Value*> args;
172  for (size_t i = 0; i < func_call.getNumArgOperands(); ++i) {
173  args.push_back(func_call.getArgOperand(i));
174  }
175  auto gpu_agg_func = getFunction(func_name + "_shared");
176  llvm::ReplaceInstWithInst(&func_call,
177  llvm::CallInst::Create(gpu_agg_func, args, ""));
178  break;
179  }
180  }
181  }
182  const auto reduce_one_entry_idx_func = getFunction("reduce_one_entry_idx");
183  CHECK(reduce_one_entry_idx_func);
184 
185  // qmd_handles are only used with count distinct and baseline group by
186  // serialized varlen buffer is only used with SAMPLE on varlen types, which we will
187  // disable for current shared memory support.
188  const auto null_ptr_ll =
189  llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(context_, 0));
190  const auto thread_idx_i32 = ir_builder.CreateCast(
191  llvm::Instruction::CastOps::Trunc, thread_idx, get_int_type(32, context_));
192  ir_builder.CreateCall(reduce_one_entry_idx_func,
193  {dest_byte_stream,
194  src_byte_stream,
195  thread_idx_i32,
196  entry_count_i32,
197  null_ptr_ll,
198  null_ptr_ll,
199  null_ptr_ll},
200  "");
201  ir_builder.CreateBr(bb_exit);
202  llvm::ReturnInst::Create(context_, bb_exit);
203 }
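The renaming rule applied by the rewrite loop above can be sketched in isolation. The helper below is hypothetical (not part of the class); it captures only the string-matching logic, not the llvm::ReplaceInstWithInst call rewriting.

```cpp
#include <string>

// Maps an aggregate builtin name to its GPU shared memory counterpart:
// agg_* becomes agg_*_shared, names already ending in _shared are left
// alone, and anything else is returned unchanged.
std::string shared_counterpart(const std::string& func_name) {
  const std::string prefix = "agg_";
  const std::string suffix = "_shared";
  if (func_name.compare(0, prefix.size(), prefix) != 0) {
    return func_name;  // not an aggregate builtin; leave untouched
  }
  if (func_name.size() >= suffix.size() &&
      func_name.compare(func_name.size() - suffix.size(), suffix.size(),
                        suffix) == 0) {
    return func_name;  // already the shared variant; avoids infinite rewriting
  }
  return func_name + suffix;
}
```

The outer while loop in the real code restarts the instruction scan after each replacement because ReplaceInstWithInst invalidates the iterator.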


llvm::Function * GpuSharedMemCodeBuilder::createInitFunction ( ) const
protected

Creates the initialization function in the LLVM module, with predefined arguments and return type

Definition at line 340 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

340  {
341  std::vector<llvm::Type*> input_arguments;
342  input_arguments.push_back(
343  llvm::Type::getInt64PtrTy(context_)); // a pointer to the buffer
344  input_arguments.push_back(llvm::Type::getInt32Ty(context_)); // buffer size in bytes
345 
346  llvm::FunctionType* ft = llvm::FunctionType::get(
347  llvm::Type::getInt64PtrTy(context_), input_arguments, false);
348  const auto init_function = llvm::Function::Create(
349  ft, llvm::Function::ExternalLinkage, "init_smem_func", module_);
350  return init_function;
351 }


llvm::Function * GpuSharedMemCodeBuilder::createReductionFunction ( ) const
protected

Create the reduction function in the LLVM module, with predefined arguments and return type

Definition at line 327 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

327  {
328  std::vector<llvm::Type*> input_arguments;
329  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
330  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
331  input_arguments.push_back(llvm::Type::getInt32Ty(context_));
332 
333  llvm::FunctionType* ft =
334  llvm::FunctionType::get(llvm::Type::getVoidTy(context_), input_arguments, false);
335  const auto reduction_function = llvm::Function::Create(
336  ft, llvm::Function::ExternalLinkage, "reduce_from_smem_to_gmem", module_);
337  return reduction_function;
338 }


llvm::Function * GpuSharedMemCodeBuilder::getFunction ( const std::string &  func_name) const
protected

Searches for a particular function name in the module and returns it if found.

Definition at line 353 of file GpuSharedMemoryUtils.cpp.

References CHECK, and module_.

Referenced by codegenInitialization(), and codegenReduction().

353  {
354  const auto function = module_->getFunction(func_name);
355  CHECK(function) << func_name << " is not found in the module.";
356  return function;
357 }


llvm::Function* GpuSharedMemCodeBuilder::getInitFunction ( ) const
inline

Definition at line 64 of file GpuSharedMemoryUtils.h.

References init_func_.

64 { return init_func_; }
llvm::Function* GpuSharedMemCodeBuilder::getReductionFunction ( ) const
inline

Definition at line 63 of file GpuSharedMemoryUtils.h.

References reduction_func_.

63 { return reduction_func_; }
void GpuSharedMemCodeBuilder::injectFunctionsInto ( llvm::Function *  query_func)

Once the reduction and init functions are generated, this function takes the main query function and replaces the previous placeholders, which were inserted in the query template, with these new functions.

Definition at line 390 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and anonymous_namespace{GpuSharedMemoryUtils.cpp}::replace_called_function_with().

390  {
392  CHECK(init_func_);
393  replace_called_function_with(query_func, "init_shared_mem", init_func_);
394  replace_called_function_with(query_func, "write_back_nop", reduction_func_);
395 }


std::string GpuSharedMemCodeBuilder::toString ( ) const

Definition at line 397 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and serialize_llvm_object().

397  {
399  CHECK(init_func_);
401 }


Member Data Documentation

llvm::LLVMContext& GpuSharedMemCodeBuilder::context_
protected
const std::vector<int64_t> GpuSharedMemCodeBuilder::init_agg_values_
protected

Definition at line 98 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization().

llvm::Function* GpuSharedMemCodeBuilder::init_func_
protected
llvm::Module* GpuSharedMemCodeBuilder::module_
protected
const QueryMemoryDescriptor GpuSharedMemCodeBuilder::query_mem_desc_
protected
llvm::Function* GpuSharedMemCodeBuilder::reduction_func_
protected
const std::vector<TargetInfo> GpuSharedMemCodeBuilder::targets_
protected

Definition at line 97 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization(), and codegenReduction().


The documentation for this class was generated from the following files: GpuSharedMemoryUtils.h and GpuSharedMemoryUtils.cpp.