Sunday, September 4, 2016

Taming the Data Lake


Big data is no longer a fad that geeks, major enterprises and start-ups alike are in love with; it is a reality driven by the dynamic and diverse nature of channels, business lines, innovative products and customer behavior. All 4 Vs of data (Volume, Velocity, Variety and Veracity) are real, and we analysts, data scientists, data professionals, strategists and business leaders have to live with them. Investments are being made in Technology, Infrastructure and Talent, but as a wise man once said, “all problems in the world cannot be solved by throwing money at them”.
Reality Check:
It is not as simple as creating a data lake where everything can be dumped for Data Scientists and Analysts to feed off. Adoption decisions are made predominantly as an Investment question (the cost of data storage, preparation, management and retrieval), but adoption is also a Returns question (Reports, Business Analytics, Advanced Analytics, Data Products, Decision Engines, etc.), and that side is usually ignored. Investment-only decisions usually create a sub-optimal experience for end users: for example, the lake may be efficient for Reporting but very slow and inefficient when an Analyst has to use it, or vice versa. Adoption and Engagement need a Strategic framework of key corporate needs, a Tactical, Outcome-Focused delivery approach and an iterative learning execution model.
Scalable Metrics Model:
The RDBMS structure is still one of the most “go-to” frameworks for the Enterprise Data Warehouse and has been for decades. Its reliability, stability, speed and ease of understanding make it optimal for many core services. Its downsides are limited flexibility and extensibility, the cost of modifications and the rigidity of the structure, which is what the Hadoop Distributed File System (HDFS) framework tries to address. But a lack of structure brings its own problems of performance, reliability, error correction, etc., and just forcing a structure via Metadata or Aggregates might not be sufficient for a wide variety of users. We need a hybrid framework that combines the strengths of RDBMS with the merits of HDFS, whose key objective is to serve the diverse needs of users, and which is malleable enough to change efficiently and effectively with those needs. It has to be modular enough to address primarily one bucket of needs at a time (e.g., Reports/Decision Engines by function) but also provide connections that help connect the dots (e.g., a Deep Dive into drivers). The Scalable Metrics Model is one such option, and we are discussing it at the Global Big Data Conference at Santa Clara on Sep 2nd.
More about the Global Big Data Conference:
It brings together leaders and practitioners in the field of Big Data and provides a platform for sharing ideas, getting feedback and learning about the new trends and technologies in the industry today. We are excited to be a part of it and hope to have a very good discussion and learning session.
 The slides on Scalable Metrics Model can be found at: