A Hadoop cluster is a computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.

This is a formula to estimate Hadoop storage (H):

H = c * r * S / (1 - i)

where:

c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and the size of the data. When no compression is used, c = 1.

r = replication factor. It is usually 3 in a production cluster.

S = size of data to be moved to Hadoop.

i = intermediate factor, usually 1/4 or 1/3. It represents the fraction of disk space reserved as Hadoop's working space for storing intermediate results of the Map phase.

Example: with no compression (c = 1), a replication factor of 3, and an intermediate factor of 1/4 (i = 0.25):

H = 1 * 3 * S / (1 - 1/4) = 3S / (3/4) = 4S

With the assumptions above, the Hadoop storage is estimated to be 4 times the initial data size.
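
To make the formula concrete, here is a minimal Python sketch of the storage estimate. The function name estimate_hadoop_storage and the example figure of 100 TB are illustrative assumptions, not part of the original formula:

    def estimate_hadoop_storage(S, c=1.0, r=3, i=0.25):
        """Estimate Hadoop storage H = c * r * S / (1 - i).

        S: size of data to be moved to Hadoop (e.g. in TB)
        c: average compression ratio (1.0 = no compression)
        r: replication factor (usually 3 in production)
        i: intermediate factor (usually 1/4 or 1/3)
        """
        return c * r * S / (1 - i)

    # 100 TB of raw data with the defaults above -> 400.0 TB,
    # i.e. 4 times the initial data size.
    print(estimate_hadoop_storage(100))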

This is a formula to estimate the number of data nodes (n):

n = H / d = c * r * S / ((1 - i) * d)

where:

d = disk space available per node.

All other parameters remain the same as above.
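
Extending the same sketch, a small helper can estimate the node count. Rounding up to a whole node, and the example figures (500 TB of data, 48 TB of usable disk per node), are assumptions for illustration:

    import math

    def estimate_data_nodes(S, d, c=1.0, r=3, i=0.25):
        """Estimate the number of data nodes n = H / d, rounded up.

        d: disk space available per node (same unit as S)
        """
        H = c * r * S / (1 - i)
        return math.ceil(H / d)

    # 500 TB of data, 48 TB of usable disk per node:
    # H = 3 * 500 / 0.75 = 2000 TB -> 2000 / 48 = 41.67 -> 42 nodes
    print(estimate_data_nodes(500, 48))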

Thanks...!!