OmniSciDB  72c90bc290
TableOptimizer Class Reference

Driver for running cleanup processes on a table. TableOptimizer provides functions for cleanup processes that improve performance on a table. Only tables that have been modified by updates or deletes are candidates for cleanup. If the table descriptor corresponds to a sharded table, the table optimizer processes each physical shard. More...

#include <TableOptimizer.h>


Public Member Functions

 TableOptimizer (const TableDescriptor *td, Executor *executor, const Catalog_Namespace::Catalog &cat)
 
void recomputeMetadata () const
 Recomputes per-chunk metadata for each fragment in the table. Updates and deletes can cause chunk metadata to become wider than the values in the chunk. Recomputing the metadata narrows the range to fit the chunk, as well as setting or unsetting the nulls flag as appropriate. More...
 
void recomputeMetadataUnlocked (const TableUpdateMetadata &table_update_metadata) const
 Recomputes column chunk metadata for the given set of fragments. The caller of this method is expected to have already acquired the executor lock. More...
 
void vacuumDeletedRows () const
 Compacts fragments to remove deleted rows. When a row is deleted, a boolean deleted system column is set to true. Vacuuming removes all deleted rows from a fragment. Note that vacuuming is a checkpointing operation, so data on disk will increase even though the number of rows for the current epoch has decreased. More...
 
void vacuumFragmentsAboveMinSelectivity (const TableUpdateMetadata &table_update_metadata) const
 Vacuums fragments with a deleted rows percentage that exceeds the configured minimum vacuum selectivity threshold. More...
 

Private Member Functions

DeletedColumnStats recomputeDeletedColumnMetadata (const TableDescriptor *td, const std::set< size_t > &fragment_indexes={}) const
 
void recomputeColumnMetadata (const TableDescriptor *td, const ColumnDescriptor *cd, const std::unordered_map< int, size_t > &tuple_count_map, std::optional< Data_Namespace::MemoryLevel > memory_level, const std::set< size_t > &fragment_indexes) const
 
std::set< size_t > getFragmentIndexes (const TableDescriptor *td, const std::set< int > &fragment_ids) const
 
void vacuumFragments (const TableDescriptor *td, const std::set< int > &fragment_ids={}) const
 
DeletedColumnStats getDeletedColumnStats (const TableDescriptor *td, const std::set< size_t > &fragment_indexes) const
 

Private Attributes

const TableDescriptor * td_
 
Executor * executor_
 
const Catalog_Namespace::Catalog & cat_
 

Static Private Attributes

static constexpr size_t ROW_SET_SIZE {1000000000}
 

Detailed Description

Driver for running cleanup processes on a table. TableOptimizer provides functions for cleanup processes that improve performance on a table. Only tables that have been modified by updates or deletes are candidates for cleanup. If the table descriptor corresponds to a sharded table, the table optimizer processes each physical shard.

Definition at line 38 of file TableOptimizer.h.
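
A minimal usage sketch of the public interface above, assuming a valid TableDescriptor pointer, Executor pointer, and Catalog reference are already available from the server context (the free function and variable names here are illustrative, not part of the class):

#include <TableOptimizer.h>

// Sketch: recompute chunk metadata, then vacuum deleted rows, for one table.
void optimize_table(const TableDescriptor* td,
                    Executor* executor,
                    const Catalog_Namespace::Catalog& cat) {
  TableOptimizer optimizer(td, executor, cat);
  optimizer.recomputeMetadata();   // narrow per-chunk min/max and the nulls flag
  optimizer.vacuumDeletedRows();   // compact fragments; checkpoints the table
}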

Constructor & Destructor Documentation

TableOptimizer::TableOptimizer ( const TableDescriptor *  td,
Executor *  executor,
const Catalog_Namespace::Catalog &  cat 
)

Definition at line 29 of file TableOptimizer.cpp.

References CHECK.

32  : td_(td), executor_(executor), cat_(cat) {
33  CHECK(td);
34 }

Member Function Documentation

DeletedColumnStats TableOptimizer::getDeletedColumnStats ( const TableDescriptor *  td,
const std::set< size_t > &  fragment_indexes 
) const
private

Definition at line 222 of file TableOptimizer.cpp.

References anonymous_namespace{TableOptimizer.cpp}::build_ra_exe_unit(), cat_, CHECK_EQ, DeletedColumnStats::chunk_stats_per_fragment, ColumnDescriptor::columnId, CPU, executor_, anonymous_namespace{TableOptimizer.cpp}::get_compilation_options(), anonymous_namespace{TableOptimizer.cpp}::get_execution_options(), get_logical_type_info(), get_table_infos(), Catalog_Namespace::Catalog::getDatabaseId(), Catalog_Namespace::Catalog::getDeletedColumn(), TableDescriptor::hasDeletedCol, kCOUNT, LOG, anonymous_namespace{TableOptimizer.cpp}::set_metadata_from_results(), TableDescriptor::tableId, DeletedColumnStats::total_row_count, DeletedColumnStats::visible_row_count_per_fragment, and logger::WARNING.

Referenced by recomputeDeletedColumnMetadata(), and vacuumFragmentsAboveMinSelectivity().

224  {
225  if (!td->hasDeletedCol) {
226  return {};
227  }
228 
229  auto cd = cat_.getDeletedColumn(td);
230  const auto column_id = cd->columnId;
231 
232  const auto input_col_desc = std::make_shared<const InputColDescriptor>(
233  column_id, td->tableId, cat_.getDatabaseId(), 0);
234  const auto col_expr = makeExpr<Analyzer::ColumnVar>(
235  cd->columnType, shared::ColumnKey{cat_.getDatabaseId(), td->tableId, column_id}, 0);
236  const auto count_expr =
237  makeExpr<Analyzer::AggExpr>(cd->columnType, kCOUNT, col_expr, false, nullptr);
238 
239  const auto ra_exe_unit = build_ra_exe_unit(input_col_desc, {count_expr.get()});
240  const auto table_infos = get_table_infos(ra_exe_unit, executor_);
241  CHECK_EQ(table_infos.size(), size_t(1));
242 
243  const auto co = get_compilation_options(ExecutorDeviceType::CPU);
244  const auto eo = get_execution_options();
245 
246  DeletedColumnStats deleted_column_stats;
247  Executor::PerFragmentCallBack compute_deleted_callback =
248  [&deleted_column_stats, cd](
249  ResultSetPtr results, const Fragmenter_Namespace::FragmentInfo& fragment_info) {
250  // count number of tuples in $deleted as total number of tuples in table.
251  if (cd->isDeletedCol) {
252  deleted_column_stats.total_row_count += fragment_info.getPhysicalNumTuples();
253  }
254  if (fragment_info.getPhysicalNumTuples() == 0) {
255  // TODO(adb): Should not happen, but just to be safe...
256  LOG(WARNING) << "Skipping completely empty fragment for column "
257  << cd->columnName;
258  return;
259  }
260 
261  const auto row = results->getNextRow(false, false);
262  CHECK_EQ(row.size(), size_t(1));
263 
264  const auto& ti = cd->columnType;
265 
266  auto chunk_metadata = std::make_shared<ChunkMetadata>();
267  chunk_metadata->sqlType = get_logical_type_info(ti);
268 
269  const auto count_val = read_scalar_target_value<int64_t>(row[0]);
270 
271  // min element 0 max element 1
272  std::vector<TargetValue> fakerow;
273 
274  auto num_tuples = static_cast<size_t>(count_val);
275 
276  // calculate min
277  if (num_tuples == fragment_info.getPhysicalNumTuples()) {
278  // nothing deleted
279  // min = false;
280  // max = false;
281  fakerow.emplace_back(TargetValue{int64_t(0)});
282  fakerow.emplace_back(TargetValue{int64_t(0)});
283  } else {
284  if (num_tuples == 0) {
285  // everything marked as delete
286  // min = true
287  // max = true
288  fakerow.emplace_back(TargetValue{int64_t(1)});
289  fakerow.emplace_back(TargetValue{int64_t(1)});
290  } else {
291  // some deleted
292  // min = false
293  // max = true;
294  fakerow.emplace_back(TargetValue{int64_t(0)});
295  fakerow.emplace_back(TargetValue{int64_t(1)});
296  }
297  }
298 
299  // place manufacture min and max in fake row to use common infra
300  if (!set_metadata_from_results(*chunk_metadata, fakerow, ti, false)) {
301  LOG(WARNING) << "Unable to process new metadata values for column "
302  << cd->columnName;
303  return;
304  }
305 
306  deleted_column_stats.chunk_stats_per_fragment.emplace(
307  std::make_pair(fragment_info.fragmentId, chunk_metadata->chunkStats));
308  deleted_column_stats.visible_row_count_per_fragment.emplace(
309  std::make_pair(fragment_info.fragmentId, num_tuples));
310  };
311 
312  executor_->executeWorkUnitPerFragment(ra_exe_unit,
313  table_infos[0],
314  co,
315  eo,
316  cat_,
317  compute_deleted_callback,
318  fragment_indexes);
319  return deleted_column_stats;
320 }
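
The callback above never scans the $deleted values themselves: it manufactures a fake (min, max) row from the COUNT result so that set_metadata_from_results() can be reused. A standalone sketch of that decision logic, with plain integers and an illustrative helper name:

#include <cstddef>
#include <cstdint>
#include <utility>

// Infer (min, max) of the boolean $deleted column from counts alone:
// visible == physical -> nothing deleted, visible == 0 -> everything deleted,
// anything in between -> a mix of deleted and visible rows.
std::pair<int64_t, int64_t> deleted_flag_min_max(size_t visible_rows,
                                                 size_t physical_rows) {
  if (visible_rows == physical_rows) {
    return {0, 0};  // min = false, max = false
  }
  if (visible_rows == 0) {
    return {1, 1};  // min = true, max = true
  }
  return {0, 1};    // min = false, max = true
}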

std::set< size_t > TableOptimizer::getFragmentIndexes ( const TableDescriptor *  td,
const std::set< int > &  fragment_ids 
) const
private

Definition at line 421 of file TableOptimizer.cpp.

References CHECK, shared::contains(), and TableDescriptor::fragmenter.

Referenced by recomputeMetadataUnlocked(), and vacuumFragmentsAboveMinSelectivity().

423  {
424  CHECK(td->fragmenter);
425  auto table_info = td->fragmenter->getFragmentsForQuery();
426  std::set<size_t> fragment_indexes;
427  for (size_t i = 0; i < table_info.fragments.size(); i++) {
428  if (shared::contains(fragment_ids, table_info.fragments[i].fragmentId)) {
429  fragment_indexes.emplace(i);
430  }
431  }
432  return fragment_indexes;
433 }
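
The same id-to-position translation can be sketched with standard containers only; here a std::vector of fragment ids stands in for the fragmenter's fragment list and the helper name is illustrative:

#include <cstddef>
#include <set>
#include <vector>

// Return the positional indexes of the fragments whose ids appear in wanted_ids,
// mirroring getFragmentIndexes().
std::set<size_t> fragment_indexes_for_ids(const std::vector<int>& fragment_ids_in_order,
                                          const std::set<int>& wanted_ids) {
  std::set<size_t> indexes;
  for (size_t i = 0; i < fragment_ids_in_order.size(); ++i) {
    if (wanted_ids.count(fragment_ids_in_order[i]) > 0) {
      indexes.insert(i);
    }
  }
  return indexes;
}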

void TableOptimizer::recomputeColumnMetadata ( const TableDescriptor *  td,
const ColumnDescriptor *  cd,
const std::unordered_map< int, size_t > &  tuple_count_map,
std::optional< Data_Namespace::MemoryLevel >  memory_level,
const std::set< size_t > &  fragment_indexes 
) const
private

Definition at line 322 of file TableOptimizer.cpp.

References anonymous_namespace{TableOptimizer.cpp}::build_ra_exe_unit(), cat_, CHECK, CHECK_EQ, ColumnDescriptor::columnId, ColumnDescriptor::columnName, ColumnDescriptor::columnType, CPU, executor_, TableDescriptor::fragmenter, anonymous_namespace{TableOptimizer.cpp}::get_compilation_options(), anonymous_namespace{TableOptimizer.cpp}::get_execution_options(), get_logical_type_info(), get_table_infos(), Catalog_Namespace::Catalog::getDatabaseId(), logger::INFO, kCOUNT, kINT, kMAX, kMIN, LOG, anonymous_namespace{TableOptimizer.cpp}::set_metadata_from_results(), TableDescriptor::tableId, and logger::WARNING.

Referenced by recomputeMetadata(), and recomputeMetadataUnlocked().

327  {
328  const auto ti = cd->columnType;
329  if (ti.is_varlen()) {
330  LOG(INFO) << "Skipping varlen column " << cd->columnName;
331  return;
332  }
333 
334  const auto column_id = cd->columnId;
335  const auto input_col_desc = std::make_shared<const InputColDescriptor>(
336  column_id, td->tableId, cat_.getDatabaseId(), 0);
337  const auto col_expr = makeExpr<Analyzer::ColumnVar>(
338  cd->columnType, shared::ColumnKey{cat_.getDatabaseId(), td->tableId, column_id}, 0);
339  auto max_expr =
340  makeExpr<Analyzer::AggExpr>(cd->columnType, kMAX, col_expr, false, nullptr);
341  auto min_expr =
342  makeExpr<Analyzer::AggExpr>(cd->columnType, kMIN, col_expr, false, nullptr);
343  auto count_expr =
344  makeExpr<Analyzer::AggExpr>(cd->columnType, kCOUNT, col_expr, false, nullptr);
345 
346  if (ti.is_string()) {
347  const SQLTypeInfo fun_ti(kINT);
348  const auto fun_expr = makeExpr<Analyzer::KeyForStringExpr>(col_expr);
349  max_expr = makeExpr<Analyzer::AggExpr>(fun_ti, kMAX, fun_expr, false, nullptr);
350  min_expr = makeExpr<Analyzer::AggExpr>(fun_ti, kMIN, fun_expr, false, nullptr);
351  }
352  const auto ra_exe_unit = build_ra_exe_unit(
353  input_col_desc, {min_expr.get(), max_expr.get(), count_expr.get()});
354  const auto table_infos = get_table_infos(ra_exe_unit, executor_);
355  CHECK_EQ(table_infos.size(), size_t(1));
356 
357  const auto co = get_compilation_options(ExecutorDeviceType::CPU);
358  const auto eo = get_execution_options();
359 
360  std::unordered_map</*fragment_id*/ int, ChunkStats> stats_map;
361 
362  Executor::PerFragmentCallBack compute_metadata_callback =
363  [&stats_map, &tuple_count_map, cd](
364  ResultSetPtr results, const Fragmenter_Namespace::FragmentInfo& fragment_info) {
365  if (fragment_info.getPhysicalNumTuples() == 0) {
366  // TODO(adb): Should not happen, but just to be safe...
367  LOG(WARNING) << "Skipping completely empty fragment for column "
368  << cd->columnName;
369  return;
370  }
371 
372  const auto row = results->getNextRow(false, false);
373  CHECK_EQ(row.size(), size_t(3));
374 
375  const auto& ti = cd->columnType;
376 
377  auto chunk_metadata = std::make_shared<ChunkMetadata>();
378  chunk_metadata->sqlType = get_logical_type_info(ti);
379 
380  const auto count_val = read_scalar_target_value<int64_t>(row[2]);
381  if (count_val == 0) {
382  // Assume chunk of all nulls, bail
383  return;
384  }
385 
386  bool has_nulls = true; // default to wide
387  auto tuple_count_itr = tuple_count_map.find(fragment_info.fragmentId);
388  if (tuple_count_itr != tuple_count_map.end()) {
389  has_nulls = !(static_cast<size_t>(count_val) == tuple_count_itr->second);
390  } else {
391  // no deleted column calc so use raw physical count
392  has_nulls =
393  !(static_cast<size_t>(count_val) == fragment_info.getPhysicalNumTuples());
394  }
395 
396  if (!set_metadata_from_results(*chunk_metadata, row, ti, has_nulls)) {
397  LOG(WARNING) << "Unable to process new metadata values for column "
398  << cd->columnName;
399  return;
400  }
401 
402  stats_map.emplace(
403  std::make_pair(fragment_info.fragmentId, chunk_metadata->chunkStats));
404  };
405 
406  executor_->executeWorkUnitPerFragment(ra_exe_unit,
407  table_infos[0],
408  co,
409  eo,
410  cat_,
411  compute_metadata_callback,
412  fragment_indexes);
413 
414  auto* fragmenter = td->fragmenter.get();
415  CHECK(fragmenter);
416  fragmenter->updateChunkStats(cd, stats_map, memory_level);
417 }
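
The has_nulls decision above relies only on counts: COUNT(col) skips NULLs, so a non-NULL count smaller than the fragment's expected visible row count implies the chunk contains NULLs. A simplified standalone sketch of that rule (helper name illustrative):

#include <cstddef>
#include <optional>

// visible_rows carries the $deleted-adjusted tuple count when available;
// otherwise the raw physical tuple count is used, as in the callback above.
bool chunk_has_nulls(size_t non_null_count,
                     std::optional<size_t> visible_rows,
                     size_t physical_rows) {
  const size_t expected = visible_rows.value_or(physical_rows);
  return non_null_count != expected;
}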

DeletedColumnStats TableOptimizer::recomputeDeletedColumnMetadata ( const TableDescriptor *  td,
const std::set< size_t > &  fragment_indexes = {} 
) const
private

Definition at line 206 of file TableOptimizer.cpp.

References cat_, CHECK, TableDescriptor::fragmenter, Catalog_Namespace::Catalog::getDeletedColumn(), getDeletedColumnStats(), TableDescriptor::hasDeletedCol, and report::stats.

Referenced by recomputeMetadata(), and recomputeMetadataUnlocked().

208  {
209  if (!td->hasDeletedCol) {
210  return {};
211  }
212 
213  auto stats = getDeletedColumnStats(td, fragment_indexes);
214  auto* fragmenter = td->fragmenter.get();
215  CHECK(fragmenter);
216  auto cd = cat_.getDeletedColumn(td);
217  fragmenter->updateChunkStats(cd, stats.chunk_stats_per_fragment, {});
218  fragmenter->setNumRows(stats.total_row_count);
219  return stats;
220 }

void TableOptimizer::recomputeMetadata ( ) const

Recomputes per-chunk metadata for each fragment in the table. Updates and deletes can cause chunk metadata to become wider than the values in the chunk. Recomputing the metadata narrows the range to fit the chunk, as well as setting or unsetting the nulls flag as appropriate.

Definition at line 134 of file TableOptimizer.cpp.

References cat_, CHECK_GE, Catalog_Namespace::DBMetadata::dbId, DEBUG_TIMER, executor_, Catalog_Namespace::Catalog::getAllColumnMetadataForTable(), Catalog_Namespace::Catalog::getCurrentDB(), Catalog_Namespace::Catalog::getDataMgr(), Catalog_Namespace::Catalog::getPhysicalTablesDescriptors(), lockmgr::TableLockMgrImpl< TableDataLockMgr >::getWriteLockForTable(), logger::INFO, LOG, TableDescriptor::nShards, recomputeColumnMetadata(), recomputeDeletedColumnMetadata(), ROW_SET_SIZE, report::stats, TableDescriptor::tableId, TableDescriptor::tableName, and td_.

Referenced by Parser::OptimizeTableStmt::execute(), and migrations::MigrationMgr::migrateDateInDaysMetadata().

134  {
135  auto timer = DEBUG_TIMER(__func__);
137 
138  LOG(INFO) << "Recomputing metadata for " << td_->tableName;
139 
140  CHECK_GE(td_->tableId, 0);
141 
142  std::vector<const TableDescriptor*> table_descriptors;
143  if (td_->nShards > 0) {
144  const auto physical_tds = cat_.getPhysicalTablesDescriptors(td_);
145  table_descriptors.insert(
146  table_descriptors.begin(), physical_tds.begin(), physical_tds.end());
147  } else {
148  table_descriptors.push_back(td_);
149  }
150 
151  auto& data_mgr = cat_.getDataMgr();
152 
153  // acquire write lock on table data
155 
156  for (const auto td : table_descriptors) {
157  ScopeGuard row_set_holder = [this] { executor_->row_set_mem_owner_ = nullptr; };
158  executor_->row_set_mem_owner_ = std::make_shared<RowSetMemoryOwner>(
159  ROW_SET_SIZE, executor_->executor_id_, /*num_threads=*/1);
160  const auto table_id = td->tableId;
161  const auto stats = recomputeDeletedColumnMetadata(td);
162 
163  // TODO(adb): Support geo
164  auto col_descs = cat_.getAllColumnMetadataForTable(table_id, false, false, false);
165  for (const auto& cd : col_descs) {
166  recomputeColumnMetadata(td, cd, stats.visible_row_count_per_fragment, {}, {});
167  }
168  data_mgr.checkpoint(cat_.getCurrentDB().dbId, table_id);
169  executor_->clearMetaInfoCache();
170  }
171 
172  data_mgr.clearMemory(Data_Namespace::MemoryLevel::CPU_LEVEL);
173  if (data_mgr.gpusPresent()) {
174  data_mgr.clearMemory(Data_Namespace::MemoryLevel::GPU_LEVEL);
175  }
176 }
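
Conceptually, recomputing metadata replaces a stale, overly wide (min, max, has_nulls) triple with one derived from the values actually present in the chunk. A standalone model of that narrowing, using a vector of nullable integers in place of a real chunk (all types and names here are illustrative):

#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

struct ChunkStatsSketch {
  int64_t min;
  int64_t max;
  bool has_nulls;
};

// Scan the chunk's current contents and produce tight stats. Stale metadata may
// claim a wider [min, max] or has_nulls == true even after the offending rows
// were updated or deleted; a fresh scan narrows the range back down.
ChunkStatsSketch narrow_stats(const std::vector<std::optional<int64_t>>& chunk) {
  ChunkStatsSketch stats{INT64_MAX, INT64_MIN, false};
  for (const auto& value : chunk) {
    if (!value) {
      stats.has_nulls = true;
      continue;
    }
    stats.min = std::min(stats.min, *value);
    stats.max = std::max(stats.max, *value);
  }
  return stats;
}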

void TableOptimizer::recomputeMetadataUnlocked ( const TableUpdateMetadata &  table_update_metadata) const

Recomputes column chunk metadata for the given set of fragments. The caller of this method is expected to have already acquired the executor lock.

Definition at line 178 of file TableOptimizer.cpp.

References cat_, CHECK, TableUpdateMetadata::columns_for_metadata_update, Data_Namespace::CPU_LEVEL, DEBUG_TIMER, getFragmentIndexes(), Catalog_Namespace::Catalog::getMetadataForTable(), recomputeColumnMetadata(), recomputeDeletedColumnMetadata(), and report::stats.

179  {
180  auto timer = DEBUG_TIMER(__func__);
181  std::map<int, std::list<const ColumnDescriptor*>> columns_by_table_id;
182  auto& columns_for_update = table_update_metadata.columns_for_metadata_update;
183  for (const auto& entry : columns_for_update) {
184  auto column_descriptor = entry.first;
185  columns_by_table_id[column_descriptor->tableId].emplace_back(column_descriptor);
186  }
187 
188  for (const auto& [table_id, columns] : columns_by_table_id) {
189  auto td = cat_.getMetadataForTable(table_id);
190  const auto stats = recomputeDeletedColumnMetadata(td);
191  for (const auto cd : columns) {
192  CHECK(columns_for_update.find(cd) != columns_for_update.end());
193  auto fragment_indexes = getFragmentIndexes(td, columns_for_update.find(cd)->second);
194  recomputeColumnMetadata(td,
195  cd,
196  stats.visible_row_count_per_fragment,
197  Data_Namespace::MemoryLevel::CPU_LEVEL,
198  fragment_indexes);
199  }
200  }
201 }

void TableOptimizer::vacuumDeletedRows ( ) const

Compacts fragments to remove deleted rows. When a row is deleted, a boolean deleted system column is set to true. Vacuuming removes all deleted rows from a fragment. Note that vacuuming is a checkpointing operation, so data on disk will increase even though the number of rows for the current epoch has decreased.

Definition at line 435 of file TableOptimizer.cpp.

References cat_, Catalog_Namespace::Catalog::checkpoint(), File_Namespace::GlobalFileMgr::compactDataFiles(), DEBUG_TIMER, Catalog_Namespace::Catalog::getDatabaseId(), Catalog_Namespace::Catalog::getDataMgr(), Data_Namespace::DataMgr::getGlobalFileMgr(), Catalog_Namespace::Catalog::getPhysicalTablesDescriptors(), Catalog_Namespace::Catalog::getTableEpochs(), lockmgr::TableLockMgrImpl< TableDataLockMgr >::getWriteLockForTable(), Catalog_Namespace::Catalog::removeFragmenterForTable(), Catalog_Namespace::Catalog::setTableEpochsLogExceptions(), TableDescriptor::tableId, td_, and vacuumFragments().

Referenced by Parser::OptimizeTableStmt::execute(), and anonymous_namespace{DdlCommandExecutor.cpp}::vacuum_table_if_required().

435  {
436  auto timer = DEBUG_TIMER(__func__);
437  const auto table_id = td_->tableId;
438  const auto db_id = cat_.getDatabaseId();
439  const auto table_lock =
440  lockmgr::TableDataLockMgr::getWriteLockForTable(cat_, td_->tableName);
441  const auto table_epochs = cat_.getTableEpochs(db_id, table_id);
442  const auto shards = cat_.getPhysicalTablesDescriptors(td_);
443  try {
444  for (const auto shard : shards) {
445  vacuumFragments(shard);
446  }
447  cat_.checkpoint(table_id);
448  } catch (...) {
449  cat_.setTableEpochsLogExceptions(db_id, table_epochs);
450  throw;
451  }
452 
453  for (auto shard : shards) {
454  cat_.removeFragmenterForTable(shard->tableId);
455  cat_.getDataMgr().getGlobalFileMgr()->compactDataFiles(cat_.getDatabaseId(),
456  shard->tableId);
457  }
458 }
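
The try/catch above follows a capture-then-restore pattern: table epochs are snapshotted before any fragment is compacted and are restored if vacuuming or the checkpoint throws, so on-disk state stays consistent. A generic sketch of the same pattern with a hypothetical epoch-store interface (not the Catalog API):

#include <functional>
#include <vector>

// Hypothetical interface used only to illustrate the rollback pattern.
struct EpochStore {
  std::function<std::vector<int>()> read_epochs;
  std::function<void(const std::vector<int>&)> restore_epochs;
};

template <typename Work>
void run_with_epoch_rollback(EpochStore& store, Work&& work) {
  const auto saved_epochs = store.read_epochs();
  try {
    work();  // e.g. vacuum each shard, then checkpoint
  } catch (...) {
    store.restore_epochs(saved_epochs);
    throw;  // rethrow after restoring, as vacuumDeletedRows() does
  }
}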

void TableOptimizer::vacuumFragments ( const TableDescriptor *  td,
const std::set< int > &  fragment_ids = {} 
) const
private

Definition at line 498 of file TableOptimizer.cpp.

References cat_, UpdelRoll::catalog, CHECK_EQ, CHUNK_KEY_COLUMN_IDX, CHUNK_KEY_FRAGMENT_IDX, ColumnDescriptor::columnId, shared::contains(), Data_Namespace::CPU_LEVEL, anonymous_namespace{TableOptimizer.cpp}::delete_cpu_chunks(), TableDescriptor::fragmenter, anonymous_namespace{TableOptimizer.cpp}::get_uncached_cpu_chunk_keys(), Chunk_NS::Chunk::getChunk(), Data_Namespace::DataMgr::getChunkMetadataVecForKeyPrefix(), Catalog_Namespace::Catalog::getDatabaseId(), Catalog_Namespace::Catalog::getDataMgr(), Catalog_Namespace::Catalog::getDeletedColumn(), Catalog_Namespace::Catalog::getLogicalTableId(), UpdelRoll::logicalTableId, UpdelRoll::memoryLevel, UpdelRoll::stageUpdate(), UpdelRoll::table_descriptor, and TableDescriptor::tableId.

Referenced by vacuumDeletedRows(), and vacuumFragmentsAboveMinSelectivity().

499  {
500  // "if not a table that supports delete return, nothing more to do"
501  const ColumnDescriptor* cd = cat_.getDeletedColumn(td);
502  if (nullptr == cd) {
503  return;
504  }
505  // vacuum chunks which show sign of deleted rows in metadata
506  ChunkKey chunk_key_prefix = {cat_.getDatabaseId(), td->tableId, cd->columnId};
507  ChunkMetadataVector chunk_metadata_vec;
508  auto& data_mgr = cat_.getDataMgr();
509  data_mgr.getChunkMetadataVecForKeyPrefix(chunk_metadata_vec, chunk_key_prefix);
510  for (auto& [chunk_key, chunk_metadata] : chunk_metadata_vec) {
511  auto fragment_id = chunk_key[CHUNK_KEY_FRAGMENT_IDX];
512  // If delete has occurred, only vacuum fragments that are in the fragment_ids set.
513  // Empty fragment_ids set implies all fragments.
514  if (chunk_metadata->chunkStats.max.tinyintval == 1 &&
515  (fragment_ids.empty() || shared::contains(fragment_ids, fragment_id))) {
516  auto cpu_chunks_to_delete =
517  get_uncached_cpu_chunk_keys(cat_, td->tableId, fragment_id);
518 
519  UpdelRoll updel_roll;
520  updel_roll.catalog = &cat_;
521  updel_roll.logicalTableId = cat_.getLogicalTableId(td->tableId);
522  updel_roll.memoryLevel = Data_Namespace::MemoryLevel::CPU_LEVEL;
523  updel_roll.table_descriptor = td;
524  CHECK_EQ(cd->columnId, chunk_key[CHUNK_KEY_COLUMN_IDX]);
525  const auto chunk = Chunk_NS::Chunk::getChunk(cd,
526  &cat_.getDataMgr(),
527  chunk_key,
528  updel_roll.memoryLevel,
529  0,
530  chunk_metadata->numBytes,
531  chunk_metadata->numElements);
532  td->fragmenter->compactRows(&cat_,
533  td,
534  fragment_id,
535  td->fragmenter->getVacuumOffsets(chunk),
536  updel_roll.memoryLevel,
537  updel_roll);
538  updel_roll.stageUpdate();
539 
540  delete_cpu_chunks(cat_, cpu_chunks_to_delete);
541  }
542  }
543  td->fragmenter->resetSizesFromFragments();
544 }
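
Whether a fragment is vacuumed is decided from chunk metadata alone: the $deleted chunk's max is 1 only if at least one row in the fragment is marked deleted, and an empty fragment_ids filter means every fragment is considered. A simplified sketch of that filter using plain structs (names are illustrative):

#include <cstdint>
#include <set>
#include <vector>

struct DeletedChunkInfo {
  int fragment_id;
  int8_t deleted_max;  // max of the boolean $deleted column: 0 or 1
};

std::vector<int> fragments_to_vacuum(const std::vector<DeletedChunkInfo>& chunks,
                                     const std::set<int>& fragment_ids) {
  std::vector<int> result;
  for (const auto& chunk : chunks) {
    const bool has_deleted_rows = chunk.deleted_max == 1;
    const bool selected =
        fragment_ids.empty() || fragment_ids.count(chunk.fragment_id) > 0;
    if (has_deleted_rows && selected) {
      result.push_back(chunk.fragment_id);
    }
  }
  return result;
}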

void TableOptimizer::vacuumFragmentsAboveMinSelectivity ( const TableUpdateMetadata &  table_update_metadata) const

Vacuums fragments with a deleted rows percentage that exceeds the configured minimum vacuum selectivity threshold.

Definition at line 546 of file TableOptimizer.cpp.

References cat_, Catalog_Namespace::Catalog::checkpoint(), Catalog_Namespace::Catalog::checkpointWithAutoRollback(), DEBUG_TIMER, Data_Namespace::DISK_LEVEL, executor_, TableUpdateMetadata::fragments_with_deleted_rows, g_vacuum_min_selectivity, Catalog_Namespace::Catalog::getDatabaseId(), getDeletedColumnStats(), getFragmentIndexes(), Catalog_Namespace::Catalog::getMetadataForTable(), Catalog_Namespace::Catalog::getTableEpochs(), lockmgr::TableLockMgrImpl< TableDataLockMgr >::getWriteLockForTable(), TableDescriptor::persistenceLevel, shared::printContainer(), ROW_SET_SIZE, Catalog_Namespace::Catalog::setTableEpochsLogExceptions(), TableDescriptor::tableId, td_, vacuumFragments(), DeletedColumnStats::visible_row_count_per_fragment, and VLOG.

547  {
548  if (td_->persistenceLevel != Data_Namespace::MemoryLevel::DISK_LEVEL) {
549  return;
550  }
551  auto timer = DEBUG_TIMER(__func__);
552  std::map<const TableDescriptor*, std::set<int32_t>> fragments_to_vacuum;
553  for (const auto& [table_id, fragment_ids] :
554  table_update_metadata.fragments_with_deleted_rows) {
555  auto td = cat_.getMetadataForTable(table_id);
556  // Skip automatic vacuuming for tables with uncapped epoch
557  if (td->maxRollbackEpochs == -1) {
558  continue;
559  }
560 
561  DeletedColumnStats deleted_column_stats;
562  {
564  executor_->execute_mutex_);
565  ScopeGuard row_set_holder = [this] { executor_->row_set_mem_owner_ = nullptr; };
566  executor_->row_set_mem_owner_ = std::make_shared<RowSetMemoryOwner>(
567  ROW_SET_SIZE, executor_->executor_id_, /*num_threads=*/1);
568  deleted_column_stats =
569  getDeletedColumnStats(td, getFragmentIndexes(td, fragment_ids));
570  executor_->clearMetaInfoCache();
571  }
572 
573  std::set<int32_t> filtered_fragment_ids;
574  for (const auto [fragment_id, visible_row_count] :
575  deleted_column_stats.visible_row_count_per_fragment) {
576  auto total_row_count =
577  td->fragmenter->getFragmentInfo(fragment_id)->getPhysicalNumTuples();
578  float deleted_row_count = total_row_count - visible_row_count;
579  if ((deleted_row_count / total_row_count) >= g_vacuum_min_selectivity) {
580  filtered_fragment_ids.emplace(fragment_id);
581  }
582  }
583 
584  if (!filtered_fragment_ids.empty()) {
585  fragments_to_vacuum[td] = filtered_fragment_ids;
586  }
587  }
588 
589  if (!fragments_to_vacuum.empty()) {
590  const auto db_id = cat_.getDatabaseId();
591  const auto table_lock =
592  lockmgr::TableDataLockMgr::getWriteLockForTable(cat_, td_->tableName);
593  const auto table_epochs = cat_.getTableEpochs(db_id, td_->tableId);
594  try {
595  for (const auto& [td, fragment_ids] : fragments_to_vacuum) {
596  vacuumFragments(td, fragment_ids);
597  VLOG(1) << "Auto-vacuumed fragments: " << shared::printContainer(fragment_ids)
598  << ", table id: " << td->tableId;
599  }
600  cat_.checkpointWithAutoRollback(td_->tableId);
601  } catch (...) {
602  cat_.setTableEpochsLogExceptions(db_id, table_epochs);
603  throw;
604  }
605  } else {
606  // Checkpoint, even when no data update occurs, in order to ensure that epochs are
607  // uniformly incremented in distributed mode.
608  cat_.checkpoint(td_->tableId);
609  }
610 }
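
Per-fragment selection reduces to comparing the deleted-row fraction against g_vacuum_min_selectivity. A standalone sketch of that test (helper name illustrative):

#include <cstddef>

// Fraction of the fragment's physical rows that are no longer visible. The
// fragment is auto-vacuumed when this fraction reaches the configured minimum
// selectivity (for example, 0.1 for 10%).
bool should_auto_vacuum(size_t visible_rows,
                        size_t physical_rows,
                        float min_selectivity) {
  if (physical_rows == 0) {
    return false;  // nothing to compact
  }
  const float deleted_fraction =
      static_cast<float>(physical_rows - visible_rows) / physical_rows;
  return deleted_fraction >= min_selectivity;
}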

Member Data Documentation

Executor* TableOptimizer::executor_
private
constexpr size_t TableOptimizer::ROW_SET_SIZE {1000000000}
static private

Definition at line 101 of file TableOptimizer.h.

Referenced by recomputeMetadata(), and vacuumFragmentsAboveMinSelectivity().

const TableDescriptor* TableOptimizer::td_
private

The documentation for this class was generated from the following files:

TableOptimizer.h
TableOptimizer.cpp