General Guidelines on Cubes
This article describes the general guidelines for designing Cubes that optimally serve fast-access known queries for interactive dashboards, reports, and analytical requirements.
Before you begin designing your Cube, work through the following steps:
- Step 1: The design of a Cube is driven entirely by the kind of queries it is meant to serve. Know your data and the business requirements on this data to build an optimal Cube design. Identify the query patterns that will be executed on the Cube.
- Step 2: Determine the data model you are working with. This includes all the Fact and Dimension tables/entities, the columns/attributes, and their relationships. Identify the volume and size information for these tables.
- Step 3: Determine the measures computed through the queries. Examples of measures are Sum, Min, Max, and Avg operations on the Fact columns. Also, identify the columns for “Group By” and “Where” clauses which help to roll-up or filter the data.
- Step 4: Identify the combinations that are not used, based on the known queries. Pruning them helps optimize Cube storage and build time.
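To illustrate why Step 4 matters, the full cuboid lattice grows exponentially with the number of dimension attributes, so pruning combinations that known queries never use pays off quickly. A minimal sketch in Python (the dimension names and query patterns are hypothetical):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (subset of dimensions) in the full lattice."""
    cuboids = []
    for r in range(len(dimensions) + 1):
        cuboids.extend(combinations(dimensions, r))
    return cuboids

# Hypothetical dimension attributes identified in Steps 1-3
dims = ["country", "state", "city", "product", "channel", "date"]
lattice = cuboid_lattice(dims)

# Six dimensions already produce 2^6 = 64 cuboids to compute and store
print(len(lattice))  # 64

# Keeping only the combinations seen in known queries cuts this sharply
known_query_dims = [{"country", "date"}, {"product", "channel"}]
needed = [c for c in lattice if set(c) in known_query_dims]
print(len(needed))  # 2
```

Every dimension attribute added to the Cube doubles the lattice, which is why the pruning mechanisms described under Best Practices below exist.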
Types of Dimension Attributes
- Normal Dimension Attributes: The dimension attributes that participate in operations such as “Group By”, “Where”, or “Join” are considered Normal dimension attributes. The Fact data is aggregated across all possible combinations of these normal attributes for faster querying.
- Derived Dimension Attributes: The dimension attributes that are not Normal are considered derived. The Fact data is not aggregated for these attributes. However, you can still query using these attributes, and any necessary computation is performed on the fly.
Best Practices
Best practices for Cubes include the following:
Reducing Combinations
- Aggregation Groups: The Cube lattice is made up of all possible combinations of the dimension attributes that are of interest to the user. Each aggregation group is a subset of the cube lattice, removing the other combinations that are unnecessary. By using aggregation groups, you can control which cuboids are computed or skipped. This can significantly reduce the computation and storage overhead, especially when the cube serves a fixed dashboard whose SQL queries require only specific cuboids.
- Joint Dimensions: If two or more dimensions are “joint”, then any valid cuboid will either contain none of these dimensions or contain them all. In other words, these dimensions will always be “together”. This is useful when the cube designer is sure that some of the dimension attributes will always be queried together. It is also a powerful tool for pruning combinations on less-likely-to-be-used dimensions.
- Hierarchy Dimensions: Creating hierarchies on large dimensions allows users to start at a high level with few members and traverse the dimension without having to load all the members. A hierarchy must follow a many-to-one tree structure. A classic example is a standard Date dimension where you traverse year -> month -> day, so that navigating to a particular date in a ten-year date dimension involves traversing ~50 members rather than 3650. Additionally, if the dimensions Country, State, and City form a “hierarchy” relation, then only the combinations Country, Country-State, or Country-State-City are retained.
- Mandatory Dimensions: This dimension type is used for cuboid pruning. If a dimension attribute or group of dimension attributes is specified as “mandatory”, then those combinations without such dimension attribute or group of dimension attributes are pruned.
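The pruning rules above can be expressed as simple predicates over candidate cuboids. The following sketch (attribute names and groupings are hypothetical, not from the product) shows how mandatory, joint, and hierarchy constraints shrink the lattice:

```python
from itertools import combinations

def valid_cuboid(cuboid, mandatory, joints, hierarchies):
    """Return True if the cuboid survives all three pruning rules."""
    c = set(cuboid)
    # Mandatory: every cuboid must contain the mandatory attributes
    if not mandatory <= c:
        return False
    # Joint: each joint group appears entirely or not at all
    for group in joints:
        if c & group and not group <= c:
            return False
    # Hierarchy: a lower level may appear only with all levels above it
    for levels in hierarchies:
        for i, attr in enumerate(levels):
            if attr in c and not set(levels[:i]) <= c:
                return False
    return True

dims = ["date", "country", "state", "city", "channel", "device"]
all_cuboids = [c for r in range(len(dims) + 1)
               for c in combinations(dims, r)]

kept = [c for c in all_cuboids if valid_cuboid(
    c,
    mandatory={"date"},                      # every query filters on date
    joints=[{"channel", "device"}],          # always queried together
    hierarchies=[["country", "state", "city"]],
)]
print(len(all_cuboids), "->", len(kept))  # 64 -> 8
```

Here three pruning rules reduce a 64-cuboid lattice to 8 cuboids, which is the kind of saving these dimension types are designed to deliver at scale.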
Best practices for measures are as follows:
- Judiciously select the limit value for the Top-N measure, since this stores Top-N Fact data for all combinations, thereby increasing storage and build-time computation.
- Specify the error limit for the Distinct Count measure at the right threshold as required. Specifying 0 or a low limit will increase the build time.
- The Raw measure requires all the data to be stored at the most granular level to query the detail data on the measure column, so it is ideally suited to smaller datasets.
Cube Configurations
- Query Pushdowns: Ad-hoc, on the fly, or one-time queries that cannot be serviced by the Cube can be routed to alternate execution engines such as Hive, Spark or BQ for faster performance.
- HBase Installation: HBase installations are explained using the following questions:
1: How to install HBase?
You can install HBase from your cluster distribution manager. For CDH, it is Cloudera Manager, and for HDP it is Ambari. The installation should be standard with the default configurations.
2: Which HBase configuration must be used?
You must use the distributed (master-slave) configuration of HBase. Configure HBase in Service Recovery mode. This means there will be two master nodes (primary and secondary), and the remaining nodes will operate as slave nodes.
3: How many region servers must be used?
This depends on parameters, such as intended replication factor, region size, region server memory available, size of data to serve, and so on. For example, if you have set replication factor to 3, you should use at least 3 region servers.
4: How much memory must be assigned to a region server?
Estimate the ratio of raw disk space to required Java heap as follows:
RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores
If you have the following parameters (Defaults in HBase 0.94):
- 10GB regions
- 128MB memstores
- HDFS replication factor of 3
- 40% of the heap use for the memstores
Then: 10GB / 128MB * 3 * 0.4 = 96.
With the default settings, this means that if you want to serve 10 TB worth of disk space per region server, you need about a 107 GB Java heap.
Alternatively, if you give a region server a 10 GB heap, you can only utilize about 1 TB of disk space per region server machine.
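The sizing arithmetic above can be captured in a small helper. The parameter names are illustrative, and the defaults mirror the HBase 0.94 figures quoted above:

```python
def disk_to_heap_ratio(region_size_gb=10.0,
                       memstore_size_gb=0.125,  # 128 MB memstores
                       replication_factor=3,
                       heap_fraction_for_memstores=0.4):
    """GB of raw disk served per GB of Java heap:
    RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores
    """
    return (region_size_gb / memstore_size_gb
            * replication_factor
            * heap_fraction_for_memstores)

ratio = disk_to_heap_ratio()
print(ratio)  # 96.0 with the defaults above

# ~10 TB of disk per region server therefore needs roughly a 107 GB heap
print(round(10 * 1024 / ratio))  # 107

# Conversely, a 10 GB heap serves roughly 960 GB (~1 TB) of disk
print(10 * ratio)  # 960.0
```

Plugging in your own region size, memstore size, and replication factor gives a quick first-order heap estimate before you fine-tune against measured load.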
For more details, see http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html
5: How many regions can a region server handle?
A region server can serve about 1,000 regions, depending on the node and the configuration. The optimal value is around 200 regions per server.
6: What is HBase region cut size for Infoworks Cube?
HBase region cut size for Infoworks cube is 5GB.
7: Do I need to install Region Server on all data nodes?
This depends on factors such as the total number of HBase regions in your cluster. Each region server can host hundreds of regions. Determine the number of region servers (which may be lower than the number of Data Nodes). It is recommended that you install HBase RegionServers on each data node. Region Servers can benefit from "short-circuit" reads which can only happen when the RegionServer is co-located with the Data Node.
8: What is the impact if you do not install Region Servers on all the datanodes?
Some data access will not be able to utilize short-circuit reads. Short-circuit reads read directly from disk instead of going through the data node when the data is local.
9: Can we share the HBase node with other components such as Yarn, Spark, or Presto?
You can configure nodes to give x% of memory to Yarn/Presto/Spark and the rest of the memory to HBase. However, you must monitor I/O: running MR or other jobs on the same nodes that host HBase could create an I/O bottleneck.
References:
- https://community.hortonworks.com/questions/38231/number-of-region-servers-phoenix-server-on-hbase.html
- https://community.hortonworks.com/questions/58523/region-server-sizing.html
- https://mapr.com/blog/in-depth-look-hbase-architecture/
- http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html
For more details, refer to our Knowledge Base and Best Practices!
For help, contact our support team!
(C) 2015-2022 Infoworks.io, Inc. Confidential.