OmniSciDB  c0231cc57d
GpuSharedMemCodeBuilder Class Reference

#include <GpuSharedMemoryUtils.h>


Public Member Functions

 GpuSharedMemCodeBuilder (llvm::Module *module, llvm::LLVMContext &context, const QueryMemoryDescriptor &qmd, const std::vector< TargetInfo > &targets, const std::vector< int64_t > &init_agg_values, const size_t executor_id)
 
void codegen ()
 
void injectFunctionsInto (llvm::Function *query_func)
 
llvm::Function * getReductionFunction () const
 
llvm::Function * getInitFunction () const
 
std::string toString () const
 

Protected Member Functions

void codegenReduction ()
 
void codegenInitialization ()
 
llvm::Function * createReductionFunction () const
 
llvm::Function * createInitFunction () const
 
llvm::Function * getFunction (const std::string &func_name) const
 

Protected Attributes

size_t executor_id_
 
llvm::Module * module_
 
llvm::LLVMContext & context_
 
llvm::Function * reduction_func_
 
llvm::Function * init_func_
 
const QueryMemoryDescriptor query_mem_desc_
 
const std::vector< TargetInfo > targets_
 
const std::vector< int64_t > init_agg_values_
 

Detailed Description

This is a builder class for extra functions that are required to support GPU shared memory usage for GroupByPerfectHash query types.

This class does not own its LLVM module; it uses a pointer to the global module that is provided as an argument during construction.

Definition at line 43 of file GpuSharedMemoryUtils.h.
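
For orientation, a minimal usage sketch (not taken from the source; llvm_module, ctx, qmd, targets, init_vals, executor_id, and query_func are hypothetical names standing in for caller state):

    GpuSharedMemCodeBuilder builder(
        llvm_module, ctx, qmd, targets, init_vals, executor_id);
    builder.codegen();                        // generate the init and reduction functions
    builder.injectFunctionsInto(query_func);  // replace placeholders in the query function
    llvm::Function* init_fn = builder.getInitFunction();
    llvm::Function* reduce_fn = builder.getReductionFunction();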

Constructor & Destructor Documentation

GpuSharedMemCodeBuilder::GpuSharedMemCodeBuilder ( llvm::Module *  module,
llvm::LLVMContext &  context,
const QueryMemoryDescriptor qmd,
const std::vector< TargetInfo > &  targets,
const std::vector< int64_t > &  init_agg_values,
const size_t  executor_id 
)

This class currently works only with:

  1. row-wise output memory layout
  2. GroupByPerfectHash
  3. single-column group by
  4. Keyless hash strategy (no redundant group column in the output buffer)

Conditions 1, 3, and 4 can easily be relaxed in the future if proper supporting code is added (see the sketch below).
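
A sketch of the precondition checks implied by these restrictions, based on the References list below; the exact CHECK statements in the constructor body (elided in the snippet) may differ:

    CHECK(qmd.getQueryDescriptionType() == QueryDescriptionType::GroupByPerfectHash);
    CHECK(!qmd.didOutputColumnar());  // 1. row-wise output memory layout
    CHECK(qmd.hasKeylessHash());      // 4. keyless hash strategy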

Definition at line 23 of file GpuSharedMemoryUtils.cpp.

References CHECK, QueryMemoryDescriptor::didOutputColumnar(), QueryMemoryDescriptor::getQueryDescriptionType(), GroupByPerfectHash, QueryMemoryDescriptor::hasKeylessHash(), and query_mem_desc_.

30  : executor_id_(executor_id)
31  , module_(llvm_module)
32  , context_(context)
33  , reduction_func_(nullptr)
34  , init_func_(nullptr)
35  , query_mem_desc_(qmd)
36  , targets_(targets)
37  , init_agg_values_(init_agg_values) {
52 }


Member Function Documentation

void GpuSharedMemCodeBuilder::codegen ( )

Generates code for both the reduction and initialization steps required for shared memory usage.

Definition at line 54 of file GpuSharedMemoryUtils.cpp.

References CHECK, codegenInitialization(), codegenReduction(), createInitFunction(), createReductionFunction(), DEBUG_TIMER, init_func_, reduction_func_, and verify_function_ir().

54  {
55  auto timer = DEBUG_TIMER(__func__);
56 
57  // codegen the init function
62 
63  // codegen the reduction function:
68 }


void GpuSharedMemCodeBuilder::codegenInitialization ( )
protected

Generates code for the shared memory buffer initialization

This function generates code to initialize the shared memory buffer in the same way we initialize the group-by output buffer on the host (a host-side analogue is sketched below). As with the reduction function, it is assumed that there are at least as many threads as there are entries in the buffer. Each entry is assigned to a single thread, and then all slots corresponding to that entry are initialized with aggregate init values.
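
A hedged host-side C++ analogue of what the generated IR does for a single entry (not from the source; in the real code the padded slot widths and init values come from the QueryMemoryDescriptor and init_agg_values_):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Initialize every slot of one row/entry, mirroring the 4-byte and
    // 8-byte branches in the IR below.
    void init_one_entry_host(int8_t* row_ptr,
                             const std::vector<size_t>& padded_slot_widths,
                             const std::vector<int64_t>& init_agg_values) {
      size_t offset = 0;
      for (size_t slot = 0; slot < padded_slot_widths.size(); ++slot) {
        const size_t slot_size = padded_slot_widths[slot];
        if (slot_size == sizeof(int32_t)) {
          const auto value = static_cast<int32_t>(init_agg_values[slot]);
          std::memcpy(row_ptr + offset, &value, sizeof(value));
        } else {  // 8-byte slot
          std::memcpy(row_ptr + offset, &init_agg_values[slot], sizeof(int64_t));
        }
        offset += slot_size;
      }
    }

On the GPU, thread i effectively runs this logic at row_ptr = shared_mem_buffer + i * row_width, followed by a threadblock barrier (sync_threadblock).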

Definition at line 254 of file GpuSharedMemoryUtils.cpp.

References CHECK, CHECK_GE, anonymous_namespace{GpuSharedMemoryUtils.cpp}::codegen_smem_dest_slot_ptr(), context_, ResultSet::fixupQueryMemoryDescriptor(), getFunction(), init_agg_values_, init_func_, ll_int(), query_mem_desc_, sync_threadblock(), targets_, and UNREACHABLE.

Referenced by codegen().

254  {
255  CHECK(init_func_);
256  // similar to the rest of the system, we use the fixed-up QMD to be able to handle reductions;
257  // it should be removed in the future.
258  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
259  CHECK(!fixup_query_mem_desc.didOutputColumnar());
260  CHECK(fixup_query_mem_desc.hasKeylessHash());
261  CHECK_GE(init_agg_values_.size(), targets_.size());
262 
263  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", init_func_);
264  auto bb_body = llvm::BasicBlock::Create(context_, ".body", init_func_);
265  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", init_func_);
266 
267  llvm::IRBuilder<> ir_builder(bb_entry);
268  const auto func_thread_index = getFunction("get_thread_index");
269  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
270 
271  // declare dynamic shared memory:
272  const auto declare_smem_func = getFunction("declare_dynamic_shared_memory");
273  const auto shared_mem_buffer =
274  ir_builder.CreateCall(declare_smem_func, {}, "shared_mem_buffer");
275 
276  const auto entry_count = ll_int(fixup_query_mem_desc.getEntryCount(), context_);
277  const auto is_thread_inbound =
278  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
279  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
280 
281  ir_builder.SetInsertPoint(bb_body);
282  // compute byte offset assigned to this thread:
283  const auto row_size_bytes = ll_int(fixup_query_mem_desc.getRowWidth(), context_);
284  auto byte_offset_ll = ir_builder.CreateMul(row_size_bytes, thread_idx, "byte_offset");
285 
286  const auto dest_byte_stream = ir_builder.CreatePointerCast(
287  shared_mem_buffer, llvm::Type::getInt8PtrTy(context_), "dest_byte_stream");
288 
289  // each thread will be responsible for one entry:
290  const auto& col_slot_context = fixup_query_mem_desc.getColSlotContext();
291  size_t init_agg_idx = 0;
292  for (size_t target_logical_idx = 0; target_logical_idx < targets_.size();
293  ++target_logical_idx) {
294  const auto& target_info = targets_[target_logical_idx];
295  const auto& slots_for_target = col_slot_context.getSlotsForCol(target_logical_idx);
296  for (size_t slot_idx = slots_for_target.front(); slot_idx <= slots_for_target.back();
297  slot_idx++) {
298  const auto slot_size = fixup_query_mem_desc.getPaddedSlotWidthBytes(slot_idx);
299 
300  auto casted_dest_slot_address = codegen_smem_dest_slot_ptr(context_,
301  fixup_query_mem_desc,
302  ir_builder,
303  slot_idx,
304  target_info,
305  dest_byte_stream,
306  byte_offset_ll);
307 
308  llvm::Value* init_value_ll = nullptr;
309  if (slot_size == sizeof(int32_t)) {
310  init_value_ll =
311  ll_int(static_cast<int32_t>(init_agg_values_[init_agg_idx++]), context_);
312  } else if (slot_size == sizeof(int64_t)) {
313  init_value_ll =
314  ll_int(static_cast<int64_t>(init_agg_values_[init_agg_idx++]), context_);
315  } else {
316  UNREACHABLE() << "Invalid slot size encountered.";
317  }
318  ir_builder.CreateStore(init_value_ll, casted_dest_slot_address);
319 
320  // if not the last slot, we compute the next offset:
321  if (slot_idx != (col_slot_context.getSlotCount() - 1)) {
322  byte_offset_ll = ir_builder.CreateAdd(
323  byte_offset_ll, ll_int(static_cast<size_t>(slot_size), context_));
324  }
325  }
326  }
327 
328  ir_builder.CreateBr(bb_exit);
329 
330  ir_builder.SetInsertPoint(bb_exit);
331  // synchronize all threads within a threadblock:
332  const auto sync_threadblock = getFunction("sync_threadblock");
333  ir_builder.CreateCall(sync_threadblock, {});
334  ir_builder.CreateRet(shared_mem_buffer);
335 }


void GpuSharedMemCodeBuilder::codegenReduction ( )
protected

Generates code for the reduction functionality (from shared memory into global memory)

The reduction function is used to reduce the group-by buffer stored in shared memory back into the global memory buffer. The general procedure is very similar to what we have in ResultSetReductionJIT, with some major differences that will be discussed below:

The general procedure is as follows:

  1. The function takes three arguments: 1) dest_buffer_ptr, which points to the global memory group-by buffer (what existed before), 2) src_buffer_ptr, which points to the shared memory group-by buffer that is exclusively accessed by each GPU thread block, and 3) the total buffer size.
  2. We assign each thread to a specific entry (all targets within that entry), so any thread with an index larger than the maximum number of entries returns early from this function.
  3. It is assumed here that there are at least as many threads in the GPU as there are entries in the group by buffer. In practice, given the buffer sizes that we deal with, this is a reasonable assumption, but it can easily be relaxed in the future if needed: threads can form a loop and process all entries until all are finished. It should be noted that we currently don't use shared memory if there are more entries than threads.
  4. We loop over all slots corresponding to a specific entry and use ResultSetReductionJIT's reduce_one_entry_idx to reduce one slot at a time from the source buffer into the destination buffer. The only difference is that we replace all agg_* functions within this code with their agg_*_shared counterparts, which use atomic operations and are used on the GPU (see the sketch after this list).
  5. Once all threads are done, we return from the function.
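
The renaming rule from step 4, factored out of the rewrite loop in the code below into a hypothetical standalone predicate for illustration:

    #include <string>

    bool needs_shared_replacement(const std::string& func_name) {
      const bool is_agg = func_name.length() > 4 && func_name.substr(0, 4) == "agg_";
      const bool already_shared = func_name.length() > 7 &&
          func_name.substr(func_name.length() - 7) == "_shared";
      return is_agg && !already_shared;
    }
    // e.g. "agg_sum"        -> true:  the call is rewritten to "agg_sum_shared"
    //      "agg_sum_shared" -> false: the call is left untouched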

Definition at line 97 of file GpuSharedMemoryUtils.cpp.

References run_benchmark_import::args, CHECK, context_, executor_id_, ResultSet::fixupQueryMemoryDescriptor(), get_int_type(), QueryMemoryDescriptor::getEntryCount(), getFunction(), result_set::initialize_target_values_for_storage(), ll_int(), module_, query_mem_desc_, reduction_func_, sync_threadblock(), and targets_.

Referenced by codegen().

97  {
99  // adding names to input arguments:
100  auto arg_it = reduction_func_->arg_begin();
101  auto dest_buffer_ptr = &*arg_it;
102  dest_buffer_ptr->setName("dest_buffer_ptr");
103  arg_it++;
104  auto src_buffer_ptr = &*arg_it;
105  src_buffer_ptr->setName("src_buffer_ptr");
106  arg_it++;
107  auto buffer_size = &*arg_it;
108  buffer_size->setName("buffer_size");
109 
110  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", reduction_func_);
111  auto bb_body = llvm::BasicBlock::Create(context_, ".body", reduction_func_);
112  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", reduction_func_);
113  llvm::IRBuilder<> ir_builder(bb_entry);
114 
115  // synchronize all threads within a threadblock:
116  const auto sync_threadblock = getFunction("sync_threadblock");
117  ir_builder.CreateCall(sync_threadblock, {});
118 
119  const auto func_thread_index = getFunction("get_thread_index");
120  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
121 
122  // branch out if out of bounds:
123  const auto entry_count = ll_int(query_mem_desc_.getEntryCount(), context_);
124  const auto entry_count_i32 =
125  ll_int(static_cast<int32_t>(query_mem_desc_.getEntryCount()), context_);
126  const auto is_thread_inbound =
127  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
128  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
129 
130  ir_builder.SetInsertPoint(bb_body);
131 
132  // cast src/dest buffers into byte streams:
133  auto src_byte_stream = ir_builder.CreatePointerCast(
134  src_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "src_byte_stream");
135  const auto dest_byte_stream = ir_builder.CreatePointerCast(
136  dest_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "dest_byte_stream");
137 
138  // run the result set reduction JIT code to get the reduce_one_entry_idx function
139  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
140  auto rs_reduction_jit = std::make_unique<GpuReductionHelperJIT>(
141  fixup_query_mem_desc,
142  targets_,
144  executor_id_);
145  auto reduction_code = rs_reduction_jit->codegen();
146  CHECK(reduction_code.module);
147  reduction_code.module->setDataLayout(
148  "e-p:64:64:64-i1:8:8-i8:8:8-"
149  "i16:16:16-i32:32:32-i64:64:64-"
150  "f32:32:32-f64:64:64-v16:16:16-"
151  "v32:32:32-v64:64:64-v128:128:128-n16:32:64");
152  reduction_code.module->setTargetTriple("nvptx64-nvidia-cuda");
153  llvm::Linker linker(*module_);
154  std::unique_ptr<llvm::Module> owner(reduction_code.module);
155  bool link_error = linker.linkInModule(std::move(owner));
156  CHECK(!link_error);
157 
158  // go through the reduction code and replace all occurrences of agg functions
159  // with their _shared counterparts, which are specifically used on GPUs
160  auto reduce_one_entry_func = getFunction("reduce_one_entry");
161  bool agg_func_found = true;
162  while (agg_func_found) {
163  agg_func_found = false;
164  for (auto it = llvm::inst_begin(reduce_one_entry_func);
165  it != llvm::inst_end(reduce_one_entry_func);
166  it++) {
167  if (!llvm::isa<llvm::CallInst>(*it)) {
168  continue;
169  }
170  auto& func_call = llvm::cast<llvm::CallInst>(*it);
171  std::string func_name = func_call.getCalledFunction()->getName().str();
172  if (func_name.length() > 4 && func_name.substr(0, 4) == "agg_") {
173  if (func_name.length() > 7 &&
174  func_name.substr(func_name.length() - 7) == "_shared") {
175  continue;
176  }
177  agg_func_found = true;
178  std::vector<llvm::Value*> args;
179  for (size_t i = 0; i < func_call.getNumOperands() - 1; ++i) {
180  args.push_back(func_call.getArgOperand(i));
181  }
182  auto gpu_agg_func = getFunction(func_name + "_shared");
183  llvm::ReplaceInstWithInst(&func_call,
184  llvm::CallInst::Create(gpu_agg_func, args, ""));
185  break;
186  }
187  }
188  }
189  const auto reduce_one_entry_idx_func = getFunction("reduce_one_entry_idx");
190  CHECK(reduce_one_entry_idx_func);
191 
192  // qmd_handles are only used with count distinct and baseline group by
193  // serialized varlen buffer is only used with SAMPLE on varlen types, which we will
194  // disable for current shared memory support.
195  const auto null_ptr_ll =
196  llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(context_, 0));
197  const auto thread_idx_i32 = ir_builder.CreateCast(
198  llvm::Instruction::CastOps::Trunc, thread_idx, get_int_type(32, context_));
199  ir_builder.CreateCall(reduce_one_entry_idx_func,
200  {dest_byte_stream,
201  src_byte_stream,
202  thread_idx_i32,
203  entry_count_i32,
204  null_ptr_ll,
205  null_ptr_ll,
206  null_ptr_ll},
207  "");
208  ir_builder.CreateBr(bb_exit);
209  llvm::ReturnInst::Create(context_, bb_exit);
210 }


llvm::Function * GpuSharedMemCodeBuilder::createInitFunction ( ) const
protected

Creates the initialization function in the LLVM module, with predefined arguments and return type
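
The C-level signature this corresponds to, derived from the IR types in the body below (per codegenInitialization(), the generated body returns a pointer to the declared shared memory buffer):

    extern "C" int64_t* init_smem_func(int64_t* buffer, int32_t buffer_size_bytes);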

Definition at line 350 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

350  {
351  std::vector<llvm::Type*> input_arguments;
352  input_arguments.push_back(
353  llvm::Type::getInt64PtrTy(context_)); // a pointer to the buffer
354  input_arguments.push_back(llvm::Type::getInt32Ty(context_)); // buffer size in bytes
355 
356  llvm::FunctionType* ft = llvm::FunctionType::get(
357  llvm::Type::getInt64PtrTy(context_), input_arguments, false);
358  const auto init_function = llvm::Function::Create(
359  ft, llvm::Function::ExternalLinkage, "init_smem_func", module_);
360  return init_function;
361 }


llvm::Function * GpuSharedMemCodeBuilder::createReductionFunction ( ) const
protected

Creates the reduction function in the LLVM module, with predefined arguments and return type
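
The C-level signature derived from the IR types in the body below; the argument names follow those assigned in codegenReduction():

    extern "C" void reduce_from_smem_to_gmem(int64_t* dest_buffer_ptr,
                                             int64_t* src_buffer_ptr,
                                             int32_t buffer_size);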

Definition at line 337 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

337  {
338  std::vector<llvm::Type*> input_arguments;
339  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
340  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
341  input_arguments.push_back(llvm::Type::getInt32Ty(context_));
342 
343  llvm::FunctionType* ft =
344  llvm::FunctionType::get(llvm::Type::getVoidTy(context_), input_arguments, false);
345  const auto reduction_function = llvm::Function::Create(
346  ft, llvm::Function::ExternalLinkage, "reduce_from_smem_to_gmem", module_);
347  return reduction_function;
348 }


llvm::Function * GpuSharedMemCodeBuilder::getFunction ( const std::string &  func_name) const
protected

Searches for a particular function name in the module and returns it if found

Definition at line 363 of file GpuSharedMemoryUtils.cpp.

References CHECK, and module_.

Referenced by codegenInitialization(), and codegenReduction().

363  {
364  const auto function = module_->getFunction(func_name);
365  CHECK(function) << func_name << " is not found in the module.";
366  return function;
367 }


llvm::Function* GpuSharedMemCodeBuilder::getInitFunction ( ) const
inline

Definition at line 65 of file GpuSharedMemoryUtils.h.

References init_func_.

65 { return init_func_; }
llvm::Function* GpuSharedMemCodeBuilder::getReductionFunction ( ) const
inline

Definition at line 64 of file GpuSharedMemoryUtils.h.

References reduction_func_.

64 { return reduction_func_; }
void GpuSharedMemCodeBuilder::injectFunctionsInto ( llvm::Function *  query_func)

Once the reduction and init functions are generated, this function takes the main query function and replaces the placeholders that were inserted into the query template with these new functions (a sketch of the replacement helper follows).
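
A sketch of what a helper like replace_called_function_with() might do, mirroring the call-rewriting pattern used in codegenReduction(); this is an assumption, not the actual implementation:

    void replace_called_function_with(llvm::Function* main_func,
                                      const std::string& target_func_name,
                                      llvm::Function* replace_func) {
      for (auto it = llvm::inst_begin(main_func); it != llvm::inst_end(main_func); ++it) {
        auto* call = llvm::dyn_cast<llvm::CallInst>(&*it);
        if (!call || !call->getCalledFunction() ||
            call->getCalledFunction()->getName() != target_func_name) {
          continue;
        }
        std::vector<llvm::Value*> args(call->arg_begin(), call->arg_end());
        // ReplaceInstWithInst invalidates the iterator, so stop after one rewrite.
        llvm::ReplaceInstWithInst(call, llvm::CallInst::Create(replace_func, args, ""));
        return;
      }
    }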

Definition at line 400 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and anonymous_namespace{GpuSharedMemoryUtils.cpp}::replace_called_function_with().

400  {
402  CHECK(init_func_);
403  replace_called_function_with(query_func, "init_shared_mem", init_func_);
404  replace_called_function_with(query_func, "write_back_nop", reduction_func_);
405 }


std::string GpuSharedMemCodeBuilder::toString ( ) const

Definition at line 407 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and serialize_llvm_object().

407  {
409  CHECK(init_func_);
411 }


Member Data Documentation

llvm::LLVMContext& GpuSharedMemCodeBuilder::context_
protected
size_t GpuSharedMemCodeBuilder::executor_id_
protected

Definition at line 93 of file GpuSharedMemoryUtils.h.

Referenced by codegenReduction().

const std::vector<int64_t> GpuSharedMemCodeBuilder::init_agg_values_
protected

Definition at line 100 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization().

llvm::Function* GpuSharedMemCodeBuilder::init_func_
protected
llvm::Module* GpuSharedMemCodeBuilder::module_
protected
const QueryMemoryDescriptor GpuSharedMemCodeBuilder::query_mem_desc_
protected
llvm::Function* GpuSharedMemCodeBuilder::reduction_func_
protected
const std::vector<TargetInfo> GpuSharedMemCodeBuilder::targets_
protected

Definition at line 99 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization(), and codegenReduction().


The documentation for this class was generated from the following files:

GpuSharedMemoryUtils.h
GpuSharedMemoryUtils.cpp