OmniSciDB  d2f719934e
GpuSharedMemCodeBuilder Class Reference

#include <GpuSharedMemoryUtils.h>

Public Member Functions

 GpuSharedMemCodeBuilder (llvm::Module *module, llvm::LLVMContext &context, const QueryMemoryDescriptor &qmd, const std::vector< TargetInfo > &targets, const std::vector< int64_t > &init_agg_values)
 
void codegen ()
 
void injectFunctionsInto (llvm::Function *query_func)
 
llvm::Function * getReductionFunction () const
 
llvm::Function * getInitFunction () const
 
std::string toString () const
 

Protected Member Functions

void codegenReduction ()
 
void codegenInitialization ()
 
llvm::Function * createReductionFunction () const
 
llvm::Function * createInitFunction () const
 
llvm::Function * getFunction (const std::string &func_name) const
 

Protected Attributes

llvm::Module * module_
 
llvm::LLVMContext & context_
 
llvm::Function * reduction_func_
 
llvm::Function * init_func_
 
const QueryMemoryDescriptor query_mem_desc_
 
const std::vector< TargetInfo > targets_
 
const std::vector< int64_t > init_agg_values_
 

Detailed Description

This is a builder class for extra functions that are required to support GPU shared memory usage for GroupByPerfectHash query types.

This class does not own its own LLVM module; it uses a pointer to the global module that is provided as an argument during construction.
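
A minimal usage sketch (variable names such as module, context, query_mem_desc, targets, init_agg_values, and query_func are assumptions standing in for the caller's state):

GpuSharedMemCodeBuilder smem_builder(module, context, query_mem_desc, targets, init_agg_values);
smem_builder.codegen();                        // emit the init and reduction functions
smem_builder.injectFunctionsInto(query_func);  // patch them into the query template
llvm::Function* init_fn = smem_builder.getInitFunction();
llvm::Function* reduction_fn = smem_builder.getReductionFunction();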

Definition at line 43 of file GpuSharedMemoryUtils.h.

Constructor & Destructor Documentation

GpuSharedMemCodeBuilder::GpuSharedMemCodeBuilder ( llvm::Module *  module,
llvm::LLVMContext &  context,
const QueryMemoryDescriptor &  qmd,
const std::vector< TargetInfo > &  targets,
const std::vector< int64_t > &  init_agg_values 
)

This class currently works only with:

  1. row-wise output memory layout
  2. GroupByPerfectHash
  3. single-column group by
  4. Keyless hash strategy (no redundant group column in the output buffer)

All conditions in 1, 3, and 4 can be easily relaxed if proper code is added to support them in the future.
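
The constructor body is elided in the listing below; judging by the References that follow, it plausibly enforces these conditions with checks along these lines (a hedged sketch, not the verbatim source):

CHECK(!qmd.didOutputColumnar());  // (1) row-wise output memory layout
CHECK(qmd.getQueryDescriptionType() == QueryDescriptionType::GroupByPerfectHash);  // (2)
CHECK(qmd.hasKeylessHash());  // (4) keyless hash strategy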

Definition at line 21 of file GpuSharedMemoryUtils.cpp.

References CHECK, QueryMemoryDescriptor::didOutputColumnar(), QueryMemoryDescriptor::getQueryDescriptionType(), GroupByPerfectHash, QueryMemoryDescriptor::hasKeylessHash(), and query_mem_desc_.

27  : module_(module)
28  , context_(context)
29  , reduction_func_(nullptr)
30  , init_func_(nullptr)
31  , query_mem_desc_(qmd)
32  , targets_(targets)
33  , init_agg_values_(init_agg_values) {
48 }

Member Function Documentation

void GpuSharedMemCodeBuilder::codegen ( )

Generates code for both the reduction and initialization steps required for shared memory usage.
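
The function body is elided in the listing below; the References suggest the following sequence (a hedged reconstruction, not the verbatim source):

// codegen the init function
init_func_ = createInitFunction();
codegenInitialization();
verify_function_ir(init_func_);

// codegen the reduction function
reduction_func_ = createReductionFunction();
codegenReduction();
verify_function_ir(reduction_func_);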

Definition at line 50 of file GpuSharedMemoryUtils.cpp.

References CHECK, codegenInitialization(), codegenReduction(), createInitFunction(), createReductionFunction(), DEBUG_TIMER, init_func_, reduction_func_, and verify_function_ir().

50  {
51  auto timer = DEBUG_TIMER(__func__);
52 
53  // codegen the init function
58 
59  // codegen the reduction function:
64 }

void GpuSharedMemCodeBuilder::codegenInitialization ( )
protected

Generates code for the shared memory buffer initialization

This function generates code to initialize the shared memory buffer, in the same way we initialize the group by output buffer on the host. As with the reduction function, it is assumed that there are at least as many threads as there are entries in the buffer. Each entry is assigned to a single thread, which then initializes all slots corresponding to that entry with the aggregate init values.
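
The emitted initialization function (created by createInitFunction(), documented below) has the following C-level shape; the parameter names are illustrative:

// returns a pointer to the initialized shared memory buffer
int64_t* init_smem_func(int64_t* groupby_buffer,  // pointer to the buffer
                        int32_t buffer_size);     // buffer size in bytes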

Definition at line 245 of file GpuSharedMemoryUtils.cpp.

References CHECK, CHECK_GE, anonymous_namespace{GpuSharedMemoryUtils.cpp}::codegen_smem_dest_slot_ptr(), context_, ResultSet::fixupQueryMemoryDescriptor(), getFunction(), init_agg_values_, init_func_, ll_int(), query_mem_desc_, sync_threadblock(), targets_, and UNREACHABLE.

Referenced by codegen().

245  {
246  CHECK(init_func_);
247  // similar to the rest of the system, we use a fixed-up QMD to handle reductions;
248  // this fixup should be removed in the future.
249  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
250  CHECK(!fixup_query_mem_desc.didOutputColumnar());
251  CHECK(fixup_query_mem_desc.hasKeylessHash());
252  CHECK_GE(init_agg_values_.size(), targets_.size());
253 
254  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", init_func_);
255  auto bb_body = llvm::BasicBlock::Create(context_, ".body", init_func_);
256  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", init_func_);
257 
258  llvm::IRBuilder<> ir_builder(bb_entry);
259  const auto func_thread_index = getFunction("get_thread_index");
260  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
261 
262  // declare dynamic shared memory:
263  const auto declare_smem_func = getFunction("declare_dynamic_shared_memory");
264  const auto shared_mem_buffer =
265  ir_builder.CreateCall(declare_smem_func, {}, "shared_mem_buffer");
266 
267  const auto entry_count = ll_int(fixup_query_mem_desc.getEntryCount(), context_);
268  const auto is_thread_inbound =
269  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
270  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
271 
272  ir_builder.SetInsertPoint(bb_body);
273  // compute byte offset assigned to this thread:
274  const auto row_size_bytes = ll_int(fixup_query_mem_desc.getRowWidth(), context_);
275  auto byte_offset_ll = ir_builder.CreateMul(row_size_bytes, thread_idx, "byte_offset");
276 
277  const auto dest_byte_stream = ir_builder.CreatePointerCast(
278  shared_mem_buffer, llvm::Type::getInt8PtrTy(context_), "dest_byte_stream");
279 
280  // each thread will be responsible for one entry (all of its slots):
281  const auto& col_slot_context = fixup_query_mem_desc.getColSlotContext();
282  size_t init_agg_idx = 0;
283  for (size_t target_logical_idx = 0; target_logical_idx < targets_.size();
284  ++target_logical_idx) {
285  const auto& target_info = targets_[target_logical_idx];
286  const auto& slots_for_target = col_slot_context.getSlotsForCol(target_logical_idx);
287  for (size_t slot_idx = slots_for_target.front(); slot_idx <= slots_for_target.back();
288  slot_idx++) {
289  const auto slot_size = fixup_query_mem_desc.getPaddedSlotWidthBytes(slot_idx);
290 
291  auto casted_dest_slot_address = codegen_smem_dest_slot_ptr(context_,
292  fixup_query_mem_desc,
293  ir_builder,
294  slot_idx,
295  target_info,
296  dest_byte_stream,
297  byte_offset_ll);
298 
299  llvm::Value* init_value_ll = nullptr;
300  if (slot_size == sizeof(int32_t)) {
301  init_value_ll =
302  ll_int(static_cast<int32_t>(init_agg_values_[init_agg_idx++]), context_);
303  } else if (slot_size == sizeof(int64_t)) {
304  init_value_ll =
305  ll_int(static_cast<int64_t>(init_agg_values_[init_agg_idx++]), context_);
306  } else {
307  UNREACHABLE() << "Invalid slot size encountered.";
308  }
309  ir_builder.CreateStore(init_value_ll, casted_dest_slot_address);
310 
311  // if not the last loop, we compute the next offset:
312  if (slot_idx != (col_slot_context.getSlotCount() - 1)) {
313  byte_offset_ll = ir_builder.CreateAdd(
314  byte_offset_ll, ll_int(static_cast<size_t>(slot_size), context_));
315  }
316  }
317  }
318 
319  ir_builder.CreateBr(bb_exit);
320 
321  ir_builder.SetInsertPoint(bb_exit);
322  // synchronize all threads within a threadblock:
323  const auto sync_threadblock = getFunction("sync_threadblock");
324  ir_builder.CreateCall(sync_threadblock, {});
325  ir_builder.CreateRet(shared_mem_buffer);
326 }

void GpuSharedMemCodeBuilder::codegenReduction ( )
protected

Generates code for the reduction functionality (from shared memory into global memory)

The reduction function is used to reduce the group by buffer stored in shared memory back into the global memory buffer. The general procedure is very similar to what we have in ResultSetReductionJIT, with some major differences that will be discussed below:

The general procedure is as follows:

  1. The function takes three arguments: 1) dest_buffer_ptr, which points to the global memory group by buffer (what existed before), 2) src_buffer_ptr, which points to the shared memory group by buffer, exclusively accessed by each specific GPU thread-block, and 3) the total buffer size (see the signature sketch after this list).
  2. We assign each thread to a specific entry (all targets within that entry), so any thread with an index larger than the maximum number of entries returns early from this function.
  3. It is assumed here that there are at least as many threads in the GPU as there are entries in the group by buffer. In practice, given the buffer sizes that we deal with, this is a reasonable assumption, but it can be easily relaxed in the future if needed: threads can form a loop and process all entries until all are finished. It should be noted that we currently don't use shared memory if there are more entries than threads.
  4. We loop over all slots corresponding to a specific entry and use ResultSetReductionJIT's reduce_one_entry_idx to reduce one entry from the source buffer into the destination buffer. The only difference is that we replace all agg_* functions within this code with their agg_*_shared counterparts, which use atomic operations and are used on the GPU.
  5. Once all threads are done, we return from the function.
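
For reference, the reduction function created by createReductionFunction() (documented below) has this C-level shape, with argument names taken from the listing:

void reduce_from_smem_to_gmem(int64_t* dest_buffer_ptr,  // global memory group by buffer
                              int64_t* src_buffer_ptr,   // shared memory group by buffer
                              int32_t buffer_size);      // total buffer size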

Definition at line 93 of file GpuSharedMemoryUtils.cpp.

References run_benchmark_import::args, CHECK, context_, ResultSet::fixupQueryMemoryDescriptor(), get_int_type(), QueryMemoryDescriptor::getEntryCount(), getFunction(), i, result_set::initialize_target_values_for_storage(), ll_int(), module_, query_mem_desc_, reduction_func_, sync_threadblock(), and targets_.

Referenced by codegen().

93  {
95  // adding names to input arguments:
96  auto arg_it = reduction_func_->arg_begin();
97  auto dest_buffer_ptr = &*arg_it;
98  dest_buffer_ptr->setName("dest_buffer_ptr");
99  arg_it++;
100  auto src_buffer_ptr = &*arg_it;
101  src_buffer_ptr->setName("src_buffer_ptr");
102  arg_it++;
103  auto buffer_size = &*arg_it;
104  buffer_size->setName("buffer_size");
105 
106  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", reduction_func_);
107  auto bb_body = llvm::BasicBlock::Create(context_, ".body", reduction_func_);
108  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", reduction_func_);
109  llvm::IRBuilder<> ir_builder(bb_entry);
110 
111  // synchronize all threads within a threadblock:
112  const auto sync_threadblock = getFunction("sync_threadblock");
113  ir_builder.CreateCall(sync_threadblock, {});
114 
115  const auto func_thread_index = getFunction("get_thread_index");
116  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
117 
118  // branch out if out of bounds:
119  const auto entry_count = ll_int(query_mem_desc_.getEntryCount(), context_);
120  const auto entry_count_i32 =
121  ll_int(static_cast<int32_t>(query_mem_desc_.getEntryCount()), context_);
122  const auto is_thread_inbound =
123  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
124  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
125 
126  ir_builder.SetInsertPoint(bb_body);
127 
128  // cast src/dest buffers into byte streams:
129  auto src_byte_stream = ir_builder.CreatePointerCast(
130  src_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "src_byte_stream");
131  const auto dest_byte_stream = ir_builder.CreatePointerCast(
132  dest_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "dest_byte_stream");
133 
134  // running the result set reduction JIT code to get reduce_one_entry_idx function
135  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
136  auto rs_reduction_jit = std::make_unique<GpuReductionHelperJIT>(
137  fixup_query_mem_desc,
138  targets_,
140  auto reduction_code = rs_reduction_jit->codegen();
141  reduction_code.module->setDataLayout(
142  "e-p:64:64:64-i1:8:8-i8:8:8-"
143  "i16:16:16-i32:32:32-i64:64:64-"
144  "f32:32:32-f64:64:64-v16:16:16-"
145  "v32:32:32-v64:64:64-v128:128:128-n16:32:64");
146  reduction_code.module->setTargetTriple("nvptx64-nvidia-cuda");
147 
148  llvm::Linker linker(*module_);
149  bool link_error = linker.linkInModule(std::move(reduction_code.module));
150  CHECK(!link_error);
151 
152  // go through the reduction code and replace all occurrences of agg functions
153  // with their _shared counterparts, which are specifically used in GPUs
154  auto reduce_one_entry_func = getFunction("reduce_one_entry");
155  bool agg_func_found = true;
156  while (agg_func_found) {
157  agg_func_found = false;
158  for (auto it = llvm::inst_begin(reduce_one_entry_func);
159  it != llvm::inst_end(reduce_one_entry_func);
160  it++) {
161  if (!llvm::isa<llvm::CallInst>(*it)) {
162  continue;
163  }
164  auto& func_call = llvm::cast<llvm::CallInst>(*it);
165  std::string func_name = func_call.getCalledFunction()->getName().str();
166  if (func_name.length() > 4 && func_name.substr(0, 4) == "agg_") {
167  if (func_name.length() > 7 &&
168  func_name.substr(func_name.length() - 7) == "_shared") {
169  continue;
170  }
171  agg_func_found = true;
172  std::vector<llvm::Value*> args;
173  for (size_t i = 0; i < func_call.getNumArgOperands(); ++i) {
174  args.push_back(func_call.getArgOperand(i));
175  }
176  auto gpu_agg_func = getFunction(func_name + "_shared");
177  llvm::ReplaceInstWithInst(&func_call,
178  llvm::CallInst::Create(gpu_agg_func, args, ""));
179  break;
180  }
181  }
182  }
183  const auto reduce_one_entry_idx_func = getFunction("reduce_one_entry_idx");
184  CHECK(reduce_one_entry_idx_func);
185 
186  // qmd_handles are only used with count distinct and baseline group by
187  // serialized varlen buffer is only used with SAMPLE on varlen types, which we will
188  // disable for current shared memory support.
189  const auto null_ptr_ll =
190  llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(context_, 0));
191  const auto thread_idx_i32 = ir_builder.CreateCast(
192  llvm::Instruction::CastOps::Trunc, thread_idx, get_int_type(32, context_));
193  ir_builder.CreateCall(reduce_one_entry_idx_func,
194  {dest_byte_stream,
195  src_byte_stream,
196  thread_idx_i32,
197  entry_count_i32,
198  null_ptr_ll,
199  null_ptr_ll,
200  null_ptr_ll},
201  "");
202  ir_builder.CreateBr(bb_exit);
203  llvm::ReturnInst::Create(context_, bb_exit);
204 }

llvm::Function * GpuSharedMemCodeBuilder::createInitFunction ( ) const
protected

Creates the initialization function in the LLVM module, with predefined arguments and return type

Definition at line 341 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

341  {
342  std::vector<llvm::Type*> input_arguments;
343  input_arguments.push_back(
344  llvm::Type::getInt64PtrTy(context_)); // a pointer to the buffer
345  input_arguments.push_back(llvm::Type::getInt32Ty(context_)); // buffer size in bytes
346 
347  llvm::FunctionType* ft = llvm::FunctionType::get(
348  llvm::Type::getInt64PtrTy(context_), input_arguments, false);
349  const auto init_function = llvm::Function::Create(
350  ft, llvm::Function::ExternalLinkage, "init_smem_func", module_);
351  return init_function;
352 }

llvm::Function * GpuSharedMemCodeBuilder::createReductionFunction ( ) const
protected

Creates the reduction function in the LLVM module, with predefined arguments and return type

Definition at line 328 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

328  {
329  std::vector<llvm::Type*> input_arguments;
330  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
331  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
332  input_arguments.push_back(llvm::Type::getInt32Ty(context_));
333 
334  llvm::FunctionType* ft =
335  llvm::FunctionType::get(llvm::Type::getVoidTy(context_), input_arguments, false);
336  const auto reduction_function = llvm::Function::Create(
337  ft, llvm::Function::ExternalLinkage, "reduce_from_smem_to_gmem", module_);
338  return reduction_function;
339 }

llvm::Function * GpuSharedMemCodeBuilder::getFunction ( const std::string &  func_name) const
protected

Searches for a particular function name in the module and returns it if found

Definition at line 354 of file GpuSharedMemoryUtils.cpp.

References CHECK, and module_.

Referenced by codegenInitialization(), and codegenReduction().

354  {
355  const auto function = module_->getFunction(func_name);
356  CHECK(function) << func_name << " is not found in the module.";
357  return function;
358 }

llvm::Function* GpuSharedMemCodeBuilder::getInitFunction ( ) const
inline

Definition at line 64 of file GpuSharedMemoryUtils.h.

References init_func_.

64 { return init_func_; }
llvm::Function* GpuSharedMemCodeBuilder::getReductionFunction ( ) const
inline

Definition at line 63 of file GpuSharedMemoryUtils.h.

References reduction_func_.

63 { return reduction_func_; }
void GpuSharedMemCodeBuilder::injectFunctionsInto ( llvm::Function *  query_func)

Once the reduction and init functions are generated, this function takes the main query function and replaces the previous placeholders, which were inserted in the query template, with these new functions.

Definition at line 391 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and anonymous_namespace{GpuSharedMemoryUtils.cpp}::replace_called_function_with().

391  {
393  CHECK(init_func_);
394  replace_called_function_with(query_func, "init_shared_mem", init_func_);
395  replace_called_function_with(query_func, "write_back_nop", reduction_func_);
396 }

std::string GpuSharedMemCodeBuilder::toString ( ) const

Definition at line 398 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and serialize_llvm_object().

398  {
400  CHECK(init_func_);
402 }
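
The elided lines plausibly serialize both generated functions; a hedged sketch based on the References (serialize_llvm_object()):

std::string GpuSharedMemCodeBuilder::toString() const {
  CHECK(reduction_func_);
  CHECK(init_func_);
  return serialize_llvm_object(reduction_func_) + serialize_llvm_object(init_func_);
}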

Member Data Documentation

llvm::LLVMContext& GpuSharedMemCodeBuilder::context_
protected
const std::vector<int64_t> GpuSharedMemCodeBuilder::init_agg_values_
protected

Definition at line 98 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization().

llvm::Function* GpuSharedMemCodeBuilder::init_func_
protected
llvm::Module* GpuSharedMemCodeBuilder::module_
protected
const QueryMemoryDescriptor GpuSharedMemCodeBuilder::query_mem_desc_
protected
llvm::Function* GpuSharedMemCodeBuilder::reduction_func_
protected
const std::vector<TargetInfo> GpuSharedMemCodeBuilder::targets_
protected

Definition at line 97 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization(), and codegenReduction().


The documentation for this class was generated from the following files:

GpuSharedMemoryUtils.h
GpuSharedMemoryUtils.cpp