OmniSciDB  c0231cc57d
GpuSharedMemCodeBuilder Class Reference

#include <GpuSharedMemoryUtils.h>


Public Member Functions

 GpuSharedMemCodeBuilder (llvm::Module *module, llvm::LLVMContext &context, const QueryMemoryDescriptor &qmd, const std::vector< TargetInfo > &targets, const std::vector< int64_t > &init_agg_values, const size_t executor_id)
 
void codegen ()
 
void injectFunctionsInto (llvm::Function *query_func)
 
llvm::Function * getReductionFunction () const
 
llvm::Function * getInitFunction () const
 
std::string toString () const
 

Protected Member Functions

void codegenReduction ()
 
void codegenInitialization ()
 
llvm::Function * createReductionFunction () const
 
llvm::Function * createInitFunction () const
 
llvm::Function * getFunction (const std::string &func_name) const
 

Protected Attributes

size_t executor_id_
 
llvm::Module * module_
 
llvm::LLVMContext & context_
 
llvm::Function * reduction_func_
 
llvm::Function * init_func_
 
const QueryMemoryDescriptor query_mem_desc_
 
const std::vector< TargetInfo > targets_
 
const std::vector< int64_t > init_agg_values_
 

Detailed Description

This is a builder class for extra functions that are required to support GPU shared memory usage for GroupByPerfectHash query types.

This class does not own its LLVM module; it uses a pointer to the global module that is provided as an argument during construction.

Definition at line 43 of file GpuSharedMemoryUtils.h.
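
For orientation, a minimal usage sketch (not taken from the source; llvm_module, ctx, qmd, targets, init_vals, executor_id, and query_func are hypothetical names standing in for caller state):

    GpuSharedMemCodeBuilder builder(
        llvm_module, ctx, qmd, targets, init_vals, executor_id);
    builder.codegen();                        // generate the init and reduction functions
    builder.injectFunctionsInto(query_func);  // replace placeholders in the query function
    llvm::Function* init_fn = builder.getInitFunction();
    llvm::Function* reduce_fn = builder.getReductionFunction();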

Constructor & Destructor Documentation

GpuSharedMemCodeBuilder::GpuSharedMemCodeBuilder ( llvm::Module *  module,
llvm::LLVMContext &  context,
const QueryMemoryDescriptor qmd,
const std::vector< TargetInfo > &  targets,
const std::vector< int64_t > &  init_agg_values,
const size_t  executor_id 
)

This class currently works only with:

  1. row-wise output memory layout
  2. GroupByPerfectHash
  3. single-column group by
  4. Keyless hash strategy (no redundant group column in the output buffer)

Conditions 1, 3, and 4 can easily be relaxed in the future if proper supporting code is added (see the sketch below).
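
A sketch of the precondition checks implied by these restrictions, based on the References list below; the exact CHECK statements in the constructor body (elided in the snippet) may differ:

    CHECK(qmd.getQueryDescriptionType() == QueryDescriptionType::GroupByPerfectHash);
    CHECK(!qmd.didOutputColumnar());  // 1. row-wise output memory layout
    CHECK(qmd.hasKeylessHash());      // 4. keyless hash strategy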

Definition at line 23 of file GpuSharedMemoryUtils.cpp.

References CHECK, QueryMemoryDescriptor::didOutputColumnar(), QueryMemoryDescriptor::getQueryDescriptionType(), GroupByPerfectHash, QueryMemoryDescriptor::hasKeylessHash(), and query_mem_desc_.

30  : executor_id_(executor_id)
31  , module_(llvm_module)
32  , context_(context)
33  , reduction_func_(nullptr)
34  , init_func_(nullptr)
35  , query_mem_desc_(qmd)
36  , targets_(targets)
37  , init_agg_values_(init_agg_values) {
52 }


Member Function Documentation

void GpuSharedMemCodeBuilder::codegen ( )

Generates code for both the reduction and initialization steps required for shared memory usage.

Definition at line 54 of file GpuSharedMemoryUtils.cpp.

References CHECK, codegenInitialization(), codegenReduction(), createInitFunction(), createReductionFunction(), DEBUG_TIMER, init_func_, reduction_func_, and verify_function_ir().

54  {
55  auto timer = DEBUG_TIMER(__func__);
56 
57  // codegen the init function
62 
63  // codegen the reduction function:
68 }


void GpuSharedMemCodeBuilder::codegenInitialization ( )
protected

Generates code for the shared memory buffer initialization

This function generates code to initialize the shared memory buffer in the same way we initialize the group-by output buffer on the host (a host-side analogue is sketched below). As with the reduction function, it is assumed that there are at least as many threads as there are entries in the buffer. Each entry is assigned to a single thread, and then all slots corresponding to that entry are initialized with aggregate init values.
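
A hedged host-side C++ analogue of what the generated IR does for a single entry (not from the source; in the real code the padded slot widths and init values come from the QueryMemoryDescriptor and init_agg_values_):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Initialize every slot of one row/entry, mirroring the 4-byte and
    // 8-byte branches in the IR below.
    void init_one_entry_host(int8_t* row_ptr,
                             const std::vector<size_t>& padded_slot_widths,
                             const std::vector<int64_t>& init_agg_values) {
      size_t offset = 0;
      for (size_t slot = 0; slot < padded_slot_widths.size(); ++slot) {
        const size_t slot_size = padded_slot_widths[slot];
        if (slot_size == sizeof(int32_t)) {
          const auto value = static_cast<int32_t>(init_agg_values[slot]);
          std::memcpy(row_ptr + offset, &value, sizeof(value));
        } else {  // 8-byte slot
          std::memcpy(row_ptr + offset, &init_agg_values[slot], sizeof(int64_t));
        }
        offset += slot_size;
      }
    }

On the GPU, thread i effectively runs this logic at row_ptr = shared_mem_buffer + i * row_width, followed by a threadblock barrier (sync_threadblock).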

Definition at line 254 of file GpuSharedMemoryUtils.cpp.

References CHECK, CHECK_GE, anonymous_namespace{GpuSharedMemoryUtils.cpp}::codegen_smem_dest_slot_ptr(), context_, ResultSet::fixupQueryMemoryDescriptor(), getFunction(), init_agg_values_, init_func_, ll_int(), query_mem_desc_, sync_threadblock(), targets_, and UNREACHABLE.

Referenced by codegen().

254  {
255  CHECK(init_func_);
256  // similar to the rest of the system, we use the fixed-up QMD to be able to handle reductions;
257  // it should be removed in the future.
258  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
259  CHECK(!fixup_query_mem_desc.didOutputColumnar());
260  CHECK(fixup_query_mem_desc.hasKeylessHash());
261  CHECK_GE(init_agg_values_.size(), targets_.size());
262 
263  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", init_func_);
264  auto bb_body = llvm::BasicBlock::Create(context_, ".body", init_func_);
265  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", init_func_);
266 
267  llvm::IRBuilder<> ir_builder(bb_entry);
268  const auto func_thread_index = getFunction("get_thread_index");
269  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
270 
271  // declare dynamic shared memory:
272  const auto declare_smem_func = getFunction("declare_dynamic_shared_memory");
273  const auto shared_mem_buffer =
274  ir_builder.CreateCall(declare_smem_func, {}, "shared_mem_buffer");
275 
276  const auto entry_count = ll_int(fixup_query_mem_desc.getEntryCount(), context_);
277  const auto is_thread_inbound =
278  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
279  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
280 
281  ir_builder.SetInsertPoint(bb_body);
282  // compute byte offset assigned to this thread:
283  const auto row_size_bytes = ll_int(fixup_query_mem_desc.getRowWidth(), context_);
284  auto byte_offset_ll = ir_builder.CreateMul(row_size_bytes, thread_idx, "byte_offset");
285 
286  const auto dest_byte_stream = ir_builder.CreatePointerCast(
287  shared_mem_buffer, llvm::Type::getInt8PtrTy(context_), "dest_byte_stream");
288 
289  // each thread will be responsible for one entry:
290  const auto& col_slot_context = fixup_query_mem_desc.getColSlotContext();
291  size_t init_agg_idx = 0;
292  for (size_t target_logical_idx = 0; target_logical_idx < targets_.size();
293  ++target_logical_idx) {
294  const auto& target_info = targets_[target_logical_idx];
295  const auto& slots_for_target = col_slot_context.getSlotsForCol(target_logical_idx);
296  for (size_t slot_idx = slots_for_target.front(); slot_idx <= slots_for_target.back();
297  slot_idx++) {
298  const auto slot_size = fixup_query_mem_desc.getPaddedSlotWidthBytes(slot_idx);
299 
300  auto casted_dest_slot_address = codegen_smem_dest_slot_ptr(context_,
301  fixup_query_mem_desc,
302  ir_builder,
303  slot_idx,
304  target_info,
305  dest_byte_stream,
306  byte_offset_ll);
307 
308  llvm::Value* init_value_ll = nullptr;
309  if (slot_size == sizeof(int32_t)) {
310  init_value_ll =
311  ll_int(static_cast<int32_t>(init_agg_values_[init_agg_idx++]), context_);
312  } else if (slot_size == sizeof(int64_t)) {
313  init_value_ll =
314  ll_int(static_cast<int64_t>(init_agg_values_[init_agg_idx++]), context_);
315  } else {
316  UNREACHABLE() << "Invalid slot size encountered.";
317  }
318  ir_builder.CreateStore(init_value_ll, casted_dest_slot_address);
319 
320  // if not the last slot, we compute the next offset:
321  if (slot_idx != (col_slot_context.getSlotCount() - 1)) {
322  byte_offset_ll = ir_builder.CreateAdd(
323  byte_offset_ll, ll_int(static_cast<size_t>(slot_size), context_));
324  }
325  }
326  }
327 
328  ir_builder.CreateBr(bb_exit);
329 
330  ir_builder.SetInsertPoint(bb_exit);
331  // synchronize all threads within a threadblock:
332  const auto sync_threadblock = getFunction("sync_threadblock");
333  ir_builder.CreateCall(sync_threadblock, {});
334  ir_builder.CreateRet(shared_mem_buffer);
335 }


void GpuSharedMemCodeBuilder::codegenReduction ( )
protected

Generates code for the reduction functionality (from shared memory into global memory)

The reduction function is used to reduce the group-by buffer stored in shared memory back into the global memory buffer. The general procedure is very similar to what we have in ResultSetReductionJIT, with some major differences that will be discussed below:

The general procedure is as follows:

  1. The function takes three arguments: 1) dest_buffer_ptr, which points to the global memory group-by buffer (what existed before), 2) src_buffer_ptr, which points to the shared memory group-by buffer that is exclusively accessed by each GPU thread block, and 3) the total buffer size.
  2. We assign each thread to a specific entry (all targets within that entry), so any thread with an index larger than the maximum number of entries returns early from this function.
  3. It is assumed here that there are at least as many threads in the GPU as there are entries in the group by buffer. In practice, given the buffer sizes that we deal with, this is a reasonable assumption, but it can easily be relaxed in the future if needed: threads can form a loop and process all entries until all are finished. It should be noted that we currently don't use shared memory if there are more entries than threads.
  4. We loop over all slots corresponding to a specific entry and use ResultSetReductionJIT's reduce_one_entry_idx to reduce one slot at a time from the source buffer into the destination buffer. The only difference is that we replace all agg_* functions within this code with their agg_*_shared counterparts, which use atomic operations and are used on the GPU (see the sketch after this list).
  5. Once all threads are done, we return from the function.
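
The renaming rule from step 4, factored out of the rewrite loop in the code below into a hypothetical standalone predicate for illustration:

    #include <string>

    bool needs_shared_replacement(const std::string& func_name) {
      const bool is_agg = func_name.length() > 4 && func_name.substr(0, 4) == "agg_";
      const bool already_shared = func_name.length() > 7 &&
          func_name.substr(func_name.length() - 7) == "_shared";
      return is_agg && !already_shared;
    }
    // e.g. "agg_sum"        -> true:  the call is rewritten to "agg_sum_shared"
    //      "agg_sum_shared" -> false: the call is left untouched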

Definition at line 97 of file GpuSharedMemoryUtils.cpp.

References run_benchmark_import::args, CHECK, context_, executor_id_, ResultSet::fixupQueryMemoryDescriptor(), get_int_type(), QueryMemoryDescriptor::getEntryCount(), getFunction(), result_set::initialize_target_values_for_storage(), ll_int(), module_, query_mem_desc_, reduction_func_, sync_threadblock(), and targets_.

Referenced by codegen().

97  {
99  // adding names to input arguments:
100  auto arg_it = reduction_func_->arg_begin();
101  auto dest_buffer_ptr = &*arg_it;
102  dest_buffer_ptr->setName("dest_buffer_ptr");
103  arg_it++;
104  auto src_buffer_ptr = &*arg_it;
105  src_buffer_ptr->setName("src_buffer_ptr");
106  arg_it++;
107  auto buffer_size = &*arg_it;
108  buffer_size->setName("buffer_size");
109 
110  auto bb_entry = llvm::BasicBlock::Create(context_, ".entry", reduction_func_);
111  auto bb_body = llvm::BasicBlock::Create(context_, ".body", reduction_func_);
112  auto bb_exit = llvm::BasicBlock::Create(context_, ".exit", reduction_func_);
113  llvm::IRBuilder<> ir_builder(bb_entry);
114 
115  // synchronize all threads within a threadblock:
116  const auto sync_threadblock = getFunction("sync_threadblock");
117  ir_builder.CreateCall(sync_threadblock, {});
118 
119  const auto func_thread_index = getFunction("get_thread_index");
120  const auto thread_idx = ir_builder.CreateCall(func_thread_index, {}, "thread_index");
121 
122  // branch out if out of bounds:
123  const auto entry_count = ll_int(query_mem_desc_.getEntryCount(), context_);
124  const auto entry_count_i32 =
125  ll_int(static_cast<int32_t>(query_mem_desc_.getEntryCount()), context_);
126  const auto is_thread_inbound =
127  ir_builder.CreateICmpSLT(thread_idx, entry_count, "is_thread_inbound");
128  ir_builder.CreateCondBr(is_thread_inbound, bb_body, bb_exit);
129 
130  ir_builder.SetInsertPoint(bb_body);
131 
132  // cast src/dest buffers into byte streams:
133  auto src_byte_stream = ir_builder.CreatePointerCast(
134  src_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "src_byte_stream");
135  const auto dest_byte_stream = ir_builder.CreatePointerCast(
136  dest_buffer_ptr, llvm::Type::getInt8PtrTy(context_, 0), "dest_byte_stream");
137 
138  // run the result set reduction JIT code to get the reduce_one_entry_idx function
139  auto fixup_query_mem_desc = ResultSet::fixupQueryMemoryDescriptor(query_mem_desc_);
140  auto rs_reduction_jit = std::make_unique<GpuReductionHelperJIT>(
141  fixup_query_mem_desc,
142  targets_,
144  executor_id_);
145  auto reduction_code = rs_reduction_jit->codegen();
146  CHECK(reduction_code.module);
147  reduction_code.module->setDataLayout(
148  "e-p:64:64:64-i1:8:8-i8:8:8-"
149  "i16:16:16-i32:32:32-i64:64:64-"
150  "f32:32:32-f64:64:64-v16:16:16-"
151  "v32:32:32-v64:64:64-v128:128:128-n16:32:64");
152  reduction_code.module->setTargetTriple("nvptx64-nvidia-cuda");
153  llvm::Linker linker(*module_);
154  std::unique_ptr<llvm::Module> owner(reduction_code.module);
155  bool link_error = linker.linkInModule(std::move(owner));
156  CHECK(!link_error);
157 
158  // go through the reduction code and replace all occurrences of agg functions
159  // with their _shared counterparts, which are specifically used on GPUs
160  auto reduce_one_entry_func = getFunction("reduce_one_entry");
161  bool agg_func_found = true;
162  while (agg_func_found) {
163  agg_func_found = false;
164  for (auto it = llvm::inst_begin(reduce_one_entry_func);
165  it != llvm::inst_end(reduce_one_entry_func);
166  it++) {
167  if (!llvm::isa<llvm::CallInst>(*it)) {
168  continue;
169  }
170  auto& func_call = llvm::cast<llvm::CallInst>(*it);
171  std::string func_name = func_call.getCalledFunction()->getName().str();
172  if (func_name.length() > 4 && func_name.substr(0, 4) == "agg_") {
173  if (func_name.length() > 7 &&
174  func_name.substr(func_name.length() - 7) == "_shared") {
175  continue;
176  }
177  agg_func_found = true;
178  std::vector<llvm::Value*> args;
179  for (size_t i = 0; i < func_call.getNumOperands() - 1; ++i) {
180  args.push_back(func_call.getArgOperand(i));
181  }
182  auto gpu_agg_func = getFunction(func_name + "_shared");
183  llvm::ReplaceInstWithInst(&func_call,
184  llvm::CallInst::Create(gpu_agg_func, args, ""));
185  break;
186  }
187  }
188  }
189  const auto reduce_one_entry_idx_func = getFunction("reduce_one_entry_idx");
190  CHECK(reduce_one_entry_idx_func);
191 
192  // qmd_handles are only used with count distinct and baseline group by
193  // serialized varlen buffer is only used with SAMPLE on varlen types, which we will
194  // disable for current shared memory support.
195  const auto null_ptr_ll =
196  llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(context_, 0));
197  const auto thread_idx_i32 = ir_builder.CreateCast(
198  llvm::Instruction::CastOps::Trunc, thread_idx, get_int_type(32, context_));
199  ir_builder.CreateCall(reduce_one_entry_idx_func,
200  {dest_byte_stream,
201  src_byte_stream,
202  thread_idx_i32,
203  entry_count_i32,
204  null_ptr_ll,
205  null_ptr_ll,
206  null_ptr_ll},
207  "");
208  ir_builder.CreateBr(bb_exit);
209  llvm::ReturnInst::Create(context_, bb_exit);
210 }


llvm::Function * GpuSharedMemCodeBuilder::createInitFunction ( ) const
protected

Creates the initialization function in the LLVM module, with predefined arguments and return type
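
The C-level signature this corresponds to, derived from the IR types in the body below (per codegenInitialization(), the generated body returns a pointer to the declared shared memory buffer):

    extern "C" int64_t* init_smem_func(int64_t* buffer, int32_t buffer_size_bytes);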

Definition at line 350 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

350  {
351  std::vector<llvm::Type*> input_arguments;
352  input_arguments.push_back(
353  llvm::Type::getInt64PtrTy(context_)); // a pointer to the buffer
354  input_arguments.push_back(llvm::Type::getInt32Ty(context_)); // buffer size in bytes
355 
356  llvm::FunctionType* ft = llvm::FunctionType::get(
357  llvm::Type::getInt64PtrTy(context_), input_arguments, false);
358  const auto init_function = llvm::Function::Create(
359  ft, llvm::Function::ExternalLinkage, "init_smem_func", module_);
360  return init_function;
361 }


llvm::Function * GpuSharedMemCodeBuilder::createReductionFunction ( ) const
protected

Creates the reduction function in the LLVM module, with predefined arguments and return type
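
The C-level signature derived from the IR types in the body below; the argument names follow those assigned in codegenReduction():

    extern "C" void reduce_from_smem_to_gmem(int64_t* dest_buffer_ptr,
                                             int64_t* src_buffer_ptr,
                                             int32_t buffer_size);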

Definition at line 337 of file GpuSharedMemoryUtils.cpp.

References context_, and module_.

Referenced by codegen().

337  {
338  std::vector<llvm::Type*> input_arguments;
339  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
340  input_arguments.push_back(llvm::Type::getInt64PtrTy(context_));
341  input_arguments.push_back(llvm::Type::getInt32Ty(context_));
342 
343  llvm::FunctionType* ft =
344  llvm::FunctionType::get(llvm::Type::getVoidTy(context_), input_arguments, false);
345  const auto reduction_function = llvm::Function::Create(
346  ft, llvm::Function::ExternalLinkage, "reduce_from_smem_to_gmem", module_);
347  return reduction_function;
348 }


llvm::Function * GpuSharedMemCodeBuilder::getFunction ( const std::string &  func_name) const
protected

Searches for a particular function name in the module and returns it if found

Definition at line 363 of file GpuSharedMemoryUtils.cpp.

References CHECK, and module_.

Referenced by codegenInitialization(), and codegenReduction().

363  {
364  const auto function = module_->getFunction(func_name);
365  CHECK(function) << func_name << " is not found in the module.";
366  return function;
367 }


llvm::Function* GpuSharedMemCodeBuilder::getInitFunction ( ) const
inline

Definition at line 65 of file GpuSharedMemoryUtils.h.

References init_func_.

65 { return init_func_; }
llvm::Function* GpuSharedMemCodeBuilder::getReductionFunction ( ) const
inline

Definition at line 64 of file GpuSharedMemoryUtils.h.

References reduction_func_.

64 { return reduction_func_; }
void GpuSharedMemCodeBuilder::injectFunctionsInto ( llvm::Function *  query_func)

Once the reduction and init functions are generated, this function takes the main query function and replaces the placeholders that were inserted into the query template with these new functions (a sketch of the replacement helper follows).
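
A sketch of what a helper like replace_called_function_with() might do, mirroring the call-rewriting pattern used in codegenReduction(); this is an assumption, not the actual implementation:

    void replace_called_function_with(llvm::Function* main_func,
                                      const std::string& target_func_name,
                                      llvm::Function* replace_func) {
      for (auto it = llvm::inst_begin(main_func); it != llvm::inst_end(main_func); ++it) {
        auto* call = llvm::dyn_cast<llvm::CallInst>(&*it);
        if (!call || !call->getCalledFunction() ||
            call->getCalledFunction()->getName() != target_func_name) {
          continue;
        }
        std::vector<llvm::Value*> args(call->arg_begin(), call->arg_end());
        // ReplaceInstWithInst invalidates the iterator, so stop after one rewrite.
        llvm::ReplaceInstWithInst(call, llvm::CallInst::Create(replace_func, args, ""));
        return;
      }
    }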

Definition at line 400 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and anonymous_namespace{GpuSharedMemoryUtils.cpp}::replace_called_function_with().

400  {
402  CHECK(init_func_);
403  replace_called_function_with(query_func, "init_shared_mem", init_func_);
404  replace_called_function_with(query_func, "write_back_nop", reduction_func_);
405 }


std::string GpuSharedMemCodeBuilder::toString ( ) const

Definition at line 407 of file GpuSharedMemoryUtils.cpp.

References CHECK, init_func_, reduction_func_, and serialize_llvm_object().

407  {
409  CHECK(init_func_);
411 }


Member Data Documentation

llvm::LLVMContext& GpuSharedMemCodeBuilder::context_
protected
size_t GpuSharedMemCodeBuilder::executor_id_
protected

Definition at line 93 of file GpuSharedMemoryUtils.h.

Referenced by codegenReduction().

const std::vector<int64_t> GpuSharedMemCodeBuilder::init_agg_values_
protected

Definition at line 100 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization().

llvm::Function* GpuSharedMemCodeBuilder::init_func_
protected
llvm::Module* GpuSharedMemCodeBuilder::module_
protected
const QueryMemoryDescriptor GpuSharedMemCodeBuilder::query_mem_desc_
protected
llvm::Function* GpuSharedMemCodeBuilder::reduction_func_
protected
const std::vector<TargetInfo> GpuSharedMemCodeBuilder::targets_
protected

Definition at line 99 of file GpuSharedMemoryUtils.h.

Referenced by codegenInitialization(), and codegenReduction().


The documentation for this class was generated from the following files:

GpuSharedMemoryUtils.h
GpuSharedMemoryUtils.cpp