Saturday, February 4, 2017

Just enough Python programming for interviews

Factorial of a number (using while loop)
Sum every 3rd number
Adding consecutive numbers
Checking if a number is prime
Palindrome check
List functions
Average of list
Perfect number
LCM
Recursive algorithm - Fibonacci series
Recursive algorithm - Exponent
Recursive algorithm - GCD

Sunday, September 4, 2016

Taming the Data Lake


Big data is no longer a fad that geeks, major enterprises and start-ups alike are in love with - it is a reality driven by the dynamic and diverse nature of channels, business lines, innovative products and customer behavior. All four Vs of data - Volume, Velocity, Variety and Veracity - are real, and we analysts, data scientists, data professionals, strategists and business leaders have to live with them. Investments are being made in technology, infrastructure and talent, but as a wise man once said, "not all the problems in the world can be solved by throwing money at them".
Reality Check:
It is not as simple as creating a data lake where everything can be dumped and data scientists and analysts can feed off it. Adoption should not be just an investment question (the cost of data storage, preparation, management and retrieval), which is predominantly what decisions are based on. It is also a returns question (reports, business analytics, advanced analytics, data products, decision engines, etc.), which is usually ignored when the decision is made. Investment-only decisions usually create a sub-optimal experience for end users: the lake may be efficient for reporting but very slow and inefficient when an analyst has to use it, or vice versa. Adoption and engagement need a strategic framework of key corporate needs, a tactical outcome-focused delivery approach and an iterative learning execution model.
Scalable Metrics Model:
The RDBMS structure is still one of the most common "go-to" frameworks for an enterprise data warehouse and has been for decades. Its reliability, stability, speed and ease of understanding make it optimal for many core services. The downside is flexibility, extensibility, the cost of modifications and the rigidity of the structure, which is what the Hadoop file system framework tries to address. But lack of structure brings its own problems of performance, reliability, error correction, etc., and simply forcing a structure via metadata or aggregates might not be sufficient for a wide variety of users. We need a hybrid framework that brings the strengths of RDBMS together with the merits of HDFS, whose key objective is to serve the diverse needs of users and which is malleable enough to change efficiently and effectively with those needs. It has to be modular enough to address a given bucket of needs (e.g., reports/decision engines by function) but also provide connections that help connect the dots (e.g., deep dives into drivers). The Scalable Metrics Model is one such option, and we are discussing it at the Global Big Data Conference in Santa Clara on Sep 2nd.
More about the Global Big Data Conference:
It brings together leaders and practitioners in the field of big data and provides a platform for sharing ideas, getting feedback and learning about the new trends and technologies in the industry today. We are excited to be a part of it and hope to have a very good chat and learning session.
 The slides on Scalable Metrics Model can be found at:


Sunday, July 12, 2015

Metrics on the fly – with Hadoop and Tableau

Let's take a use case: customer segmentation. How do we segment our customer base? One common approach is to accumulate whatever dimensions and metrics you can find for each user, store them at the customer grain, and then put a reporting layer on top of it so that analysts can create reports and visualizations.

There are numerous ways to do aggregations - one way is to build aggregates on top of a star schema. This approach can be costly to process in a database (ELT) or in an ETL tool, especially if the customer base is huge.

Some challenges in building aggregates:
1. Many full table scans are required - e.g., # transactions by user, # logins by user
2. The requirements for these metrics change frequently - we have to make frequent DDL and ETL changes

How about using Hadoop? Hadoop/Hive is good at handling huge datasets, but poor at interactivity and ease of use, which is where Tableau excels. Let's discuss how to marry these two technologies to create a metrics framework where we can add metrics on the fly and actually do data analysis. Assume the required base tables for deriving dimensions and calculating metrics are ingested into Hadoop (using Sqoop) and available as Hive tables.

The idea is to create a metrics (final fact) table with id columns and one metrics column of map datatype that bags all the metrics. Since the metrics column is a map, it can hold metric names and their corresponding values. This way we are not constrained to predefined columns; we can add as many metrics as we go.

CREATE EXTERNAL TABLE `customer_metrics`(
  `customer_id` int,
  `metrics` map<string,string>)

Data Model

1. Determine the granularity of the metrics table - in our case it is customer_id (it can be multiple columns too).
2. Write Hive queries to compute the metrics at that granularity. Create separate Hive queries - one query for one or more metrics - and make sure they all share the same granularity (e.g., Login_attempts, Last_login_date, Transaction_count).
3. Create a staging table with three columns (id, key and value):
CREATE EXTERNAL TABLE `stg_customer_metrics`(
  `customer_id` int,
  `key` string,
  `value` string)
4. Create the target metrics table:
CREATE EXTERNAL TABLE `customer_metrics`(
  `customer_id` int,
  `metrics` map<string,string>)

ETL framework:

Now let's create a scalable ETL framework so that metrics can be added on the fly. The metrics query is going to be a union of any number of subqueries, and each subquery should follow these standards:
1. It should select the granularity columns and an array of key-value pairs.
2. Each key-value pair should be in the format 'metric_name'=metric_value.
3. This can be achieved by concatenating a static string to the aggregated value field.
4. The result is a union of all the subqueries, each with the grain columns and an array of key-value pairs.
5. Convert the array column into multiple rows using the Hive explode() function.
6. Split each key-value pair into two columns, key and value.
7. Load the result of this entire query into the staging table created earlier.

INSERT OVERWRITE TABLE stg_customer_metrics
SELECT
  customer_id,
  split(keyvalue, '=')[0] AS key,
  split(keyvalue, '=')[1] AS value
FROM (

  SELECT u.customer_id,  /* METRICS 1 & 2: LOGIN ATTEMPTS AND LAST LOGIN DATE */
         array(
           concat_ws('=', 'Login_attempts',
                     cast(count(CASE WHEN u.CSTMR_STA_NM = 'LOGIN' THEN 1 END) AS STRING)),
           concat_ws('=', 'Last_login_date', cast(max(u.chg_dt) AS STRING))
         ) AS mp
  FROM customer_login u
  GROUP BY u.customer_id

  UNION ALL

  SELECT u.customer_id,  /* METRIC 3: TRANSACTION COUNT */
         array(
           concat_ws('=', 'Transaction_count', cast(count(DISTINCT u.PYMT_ID) AS STRING))
         ) AS mp
  FROM user_transaction u
  GROUP BY u.customer_id  /* GRANULARITY: CUSTOMER_ID */

) b
LATERAL VIEW explode(mp) mptable AS keyvalue;
The staging table may look like this:
Customer_id    Key                  Value
1123           Login_attempts       10
1123           Last_login_date      2015-07-07
1123           Transaction_count    2
9989           Login_attempts       20
9989           Last_login_date      2015-06-01
9989           Transaction_count    5
To add new metrics, create another select query and append it to the union in the staging-table load, as in the sketch below.
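For example, a hypothetical new metric - say a support-case count from an assumed customer_support_case table with a CASE_ID column (both names are illustrative, not part of the original design) - is just one more subquery appended to the UNION ALL:

  UNION ALL

  SELECT u.customer_id,  /* HYPOTHETICAL METRIC 4: SUPPORT CASE COUNT */
         array(
           concat_ws('=', 'Support_case_count', cast(count(DISTINCT u.CASE_ID) AS STRING))
         ) AS mp
  FROM customer_support_case u
  GROUP BY u.customer_id

No DDL change is needed on the staging or metrics tables; the new key simply shows up as one more entry in the map.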

Load to final metrics table


ADD JAR udf-0.0.1-SNAPSHOT.jar;
-- register the custom UDAF from the jar (the class name depends on your own UDF implementation)
CREATE TEMPORARY FUNCTION UNION_MAP AS '<your.udaf.class>';

INSERT OVERWRITE TABLE customer_metrics
SELECT customer_id, UNION_MAP(MAP(`key`, `value`))
FROM stg_customer_metrics
GROUP BY customer_id;
Customer_id    Metrics
1123           {"Login_attempts":"10", "Last_login_date":"2015-07-07", "Transaction_count":"2"}
9989           {"Login_attempts":"20", "Last_login_date":"2015-06-01", "Transaction_count":"5"}

View:

Create a view on top of this final table. It helps in:
1. Creating a reference metadata layer
2. Abstracting the complexity of selecting from the map datatype and insulating reports from changes to the metrics table (you are free to add new metrics without impacting the reporting layer)
3. Easy drag and drop in the Tableau reporting layer
4. Serving different views to different user groups

CREATE VIEW `vw_customer_metrics` AS
SELECT `customer_id`
  , `metrics`["Login_attempts"]    AS `Login_attempts`
  , `metrics`["Last_login_date"]   AS `Last_login_date`
  , `metrics`["Transaction_count"] AS `Transaction_count`
FROM customer_metrics;
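Note that the map stores every value as a string, so numeric metrics need a cast before they are used in calculations. A hypothetical analyst query against the view might look like:

SELECT customer_id,
       cast(`Transaction_count` AS INT) AS txn_count
FROM vw_customer_metrics
WHERE cast(`Login_attempts` AS INT) >= 10;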

Reporting layer (Tableau):

Pull this view into the Tableau layer as an extract so that the report is fast and interactive. You can add new calculations in the Tableau layer, especially for non-additive facts.
(If you are new to Tableau, I strongly suggest downloading Tableau Public at https://public.tableau.com/s/ and checking out the free learning videos at http://www.tableau.com/learn/training - worthwhile for anyone who works with data.)
Bringing these metrics into the Tableau layer provides advantages that the Hadoop layer cannot:
1. Interactivity (faster response)
2. Slicing and dicing (without invoking MapReduce every time)
3. Blending with other data sets (quickly, without IT :))
4. Calculated columns
5. Visualization (of course!)

Conclusion:


We designed this with the mindset of "business logic at the cost of resources" - implementing the business logic correctly matters more than squeezing out every resource. Depending on the specific use case, we also need to think about file formats, compression and incremental-load logic. The Tableau layer is not effective on huge, detailed data, so aggregating the dataset and reducing the number of rows (and size) matters a lot during data analysis. Pushing the business logic down to the Hadoop layer abstracts the complexity and reduces clutter in the Tableau layer.

Friday, June 6, 2014

Hadoop Summit 2014 - commentary

Three days of continuous presentations: some really cool ones, many sales pitches, a bit of useful networking, food and drinks :). Apart from all that, my agenda was to get the data warehousing side of Hadoop - to learn some more tricks for my long-running quest to implement a BI solution on Hadoop. I did learn some, and got some benchmarks to validate some designs. I feel confident that a lot of effort is being put into this space by the Hadoop community.

As one of the presenters quoted Churchill:
"Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning"

This conference was organized by Hortonworks, not Cloudera. YARN and Tez dominated the talks; I saw "MapReduce is dead" in a couple of presentations - hope it's for the good. Enterprise users will have to move to YARN in the near future or they will miss out on future tools and essential enhancements to current tools like Hive and Spark.

In a way, these conferences give an idea of how mature the technology is and how serious businesses are about using it. But you can always relate to this quote: "Big data is like teen sex. Everybody is talking about it, everyone thinks everyone else is doing it, so everyone claims they are doing it."

Coming to the overall themes - I could see four from my perspective.

Performance

  • Parallelism and MapReduce are old news; now getting results fast is the coolest thing.
  • Key concepts to look for in this area are YARN, Tez, ORC (a compressed columnar format), Parquet (another columnar format) and in-memory query engines (Impala, Spark). There were also talks about collecting statistics to make Hive behave more like a database. A small example of putting a couple of these together is sketched below.
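A minimal, hedged sketch of what this looks like in practice - switching the execution engine to Tez and storing a hypothetical fact table as ORC, assuming Hive 0.13 with Tez configured on the cluster (table names are illustrative):

set hive.execution.engine=tez;

-- store a hypothetical fact table in ORC so queries scan only the needed columns
CREATE TABLE sales_fact_orc STORED AS ORC
AS SELECT * FROM sales_fact;

-- gather basic table statistics for the optimizer
ANALYZE TABLE sales_fact_orc COMPUTE STATISTICS;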


Security
Most enterprises have brought their data into Hadoop, but there is a big gap to be filled in securing that data. Some concepts along this line:

  • Keeping Hadoop behind the database and restricting direct access to Hadoop. This way you can utilize the database's security features for Hadoop objects (Teradata and Oracle already have appliances for this)
  • Hive 0.13 - improved security features like grant/revoke, roles, column- and row-level access, and permissions for views
  • LUKS (Linux Unified Key Setup)
  • Apache Sentry
  • Apache Knox


Federation
Data blending - this is where the big players want a role ("we too do Hadoop") and want to make Hadoop coexist with the traditional data warehouse ecosystem.

  • Teradata Hadoop connector
  • Oracle smart-something!
  • Microsoft PDW (Parallel Data Warehouse) and PolyBase
  • My favorite way of data blending is with Tableau (of course, on a smaller scale)


ACID
This is the topic I am most interested in: how to do ACID on Hive, or failing that how to mock it, because I feel updates are critical to truly delivering a data warehouse on Hadoop.

  • Store data in a columnar format (ORC, Parquet)
  • Maintain a single partition using compaction (this table is going to be huge, but that's what Hadoop is made for)
  • Keep appending the history (don't delete)
  • Get the latest data while reading; new ACID input formats and readers accomplish this
I have to write a separate blog post on this topic; I need to get deeper into the architecture and do a POC. A minimal sketch of the read-side "latest record wins" pattern is below.
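In the meantime, here is a rough sketch of the append-and-read-latest idea in plain Hive (my own sketch, not from the talk), assuming a hypothetical customer_dim_hist table that keeps every version of a row along with a load_ts timestamp:

-- append every change; resolve the current version at read time
CREATE VIEW customer_dim_current AS
SELECT customer_id, name, status, load_ts
FROM (
  SELECT h.*,
         row_number() OVER (PARTITION BY customer_id ORDER BY load_ts DESC) AS rn
  FROM customer_dim_hist h
) t
WHERE rn = 1;

Compaction would then periodically rewrite the history table so the read-side work stays bounded.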


Tuesday, May 13, 2014

Learning something new? - try varying your methods

Nowadays learning is largely democratized and nothing stops you from learning or mastering a totally new subject, thanks to MOOCs (read: Massive Open Online Courses.. infographic). These learning systems are already disrupting the traditional university structure. I'm pretty sure we are just at the beginning of a great education revolution. (Very excited about the learning options my 3-year-old daughter will have!)
Although free is great, three things are often missing in these learning platforms: the planning, control and validation processes of traditional methods (some MOOCs, like Coursera, do offer them). So it's up to the learner to supply these three musts of any learning venture.
Let's assume you have found a subject area - be it microeconomics, a programming language, a new human language or natural history. Not sure what's right for you or your career? I cannot answer that, as it is a personal thing. But have the attitude of "what's the worst that might happen?" - you may end up learning something new, however minuscule, or you might lose 3-4 hours of your life; treat it like a wrong movie choice.
The road ahead is not going to be easy, especially at the beginning; as I said, be an explorer.
"All change is hard at first, messy in the middle and gorgeous at the end." - Robin Sharma
This formula might help in making sure you reach the goal you are capable of, or even go beyond it.
Warm-up
1. If it's totally new or something you are scared of, cover the basics. Spend a lot of time understanding the fundamentals; it is worth it, as it will ease the learning of the complex things that come later in the process.
2. Don't target expert problems first - it may break your courage very early in the process.
Finding
1. Find the method that suits your style (YouTube, webinars, books, online courses, community courses, podcasts, audiobooks, SlideShare, meetups, specific learning blogs).
2. Find the material that suits your knowledge. Not all learning materials are created equal; it should match your level of competency.
3. Vary your learning methods - accept that Google's or Amazon's top search result may not be the one you are looking for, so try different methods. The method should also suit your lifestyle: can you watch YouTube at the office? Read a PDF on your reader while commuting, or listen to a podcast?
4. Fail faster - move on, maybe to a different author, a different YouTube playlist, a different course.
5. Find "the one" - the steps above will lead you to it; stick to it, follow it, and go as far as that method can take you.
Execution:
1. Create a "plan" and a achievable target.
2. Make incremental progress. Dont have targets like "learn and build a social site in a month." instead "learn to create a dynamic website using PHP in a week/month"
3. Track your progress based on the plan.This will prevent you from getting distracted.
4. "cost it" - there might be some $ involved in some learning methods, you might be surprised to know how cheaper it is to learn when compared to the returns. The biggest cost you will incur is your time.
5. "try it" - creating a sandbox, executing a sample code or trying to solve an existing problem.
6. "do it" - a small project, get certification if its available or publish it your personal blog. The options are limitless.
Stuck:
You are fortunate if you can find a human who is willing to mentor and good at mentoring. The other options are joining a meetup, following a niche blog, identifying a dedicated forum.. and your "GRIT" to get out of a stuck situation.
“When the pupil is ready, the Master appears”
Happy learning!

"Big Data" - Im a believer!

What is the use case where Hadoop is "the" best technology to use? I have been working with Hadoop and big data for some years, using it for production analytics and learning and improving every day. I have attended numerous big data webinars and conferences and talked up the might of big data to people who have never used it or are scared of it. But I had not felt the real benefit of using Hadoop or big data (I am talking about my own case!) beyond it being cost effective; most of the work can be done with some other existing technology. With the traction Hadoop is getting, there must be some really BIG use case here. Yes, there are stories about Facebook, Twitter and other sites, but there must be something that only Hadoop can do. I couldn't find that unavoidable purpose for this technology, until...
I recently enrolled in the "Machine Learning" course on Coursera - an awesome course from an awesome teacher (Andrew Ng). I had never understood why we learned all that math in school and college (why the heck differentiation and integration?); I thought I could have just learned languages and computer languages instead. But this course changed my perspective on math like a slap across the face: all the things we see, use and consume are built on some form of math. It is an eye-opener. I would recommend it to anyone with a data background - you will no longer see data as a waste of memory, you will see it as a gold mine waiting for the right explorer.

In machine learning there are lots of techniques for predicting something (give it 'x' and it gives you 'y' - not quite that simple :o). I am still a beginner, but what I found is that we need to run thousands, in some cases even millions, of iterations just to find a single parameter, and there can be thousands of parameters. That was the aha! moment for me: Hadoop/big data is the only place where you can store and process humongous data in a cost-effective way. Previously, engineers might have limited the number of iterations because of resources; not anymore - they have the power of HDFS and MapReduce to store and process, respectively.

I have seen many compelling use cases in numerous webinars, conferences and whitepapers, but machine learning / predictive modeling is the most compelling reason (at least for me) that Hadoop is indispensable for the future of analytics - especially since we live in a digital-social world where the number of factors (x1, x2, x3, ..., xn) that could affect any outcome (y) is ever increasing. Now I am no longer just a practitioner but a believer!

Friday, June 14, 2013

HBase Conference 2013

At my current workplace we are using Hadoop and related technologies for data warehousing. The main advantages we get out of this are, of course, cost and scalability. But Hadoop and related technologies (especially Hive and Pig) are not perfect for DW and BI development projects. The major issues we faced are updates and capturing incremental data, since we are trying to build a dimensional model with slowly changing dimensions.

I have heard many people mention HBase, which can do updates in Hadoop. But my questions were: is HBase suitable for DW-style updates? Is it suitable for dimensional modeling? How about maintenance, since DW systems constantly undergo change?

So I got an opportunity to attend the HBase conference to get my answer.. and it is certainly a no. There are some use cases where HBase is used to serve reports or an online reporting portal, but it is certainly unsuitable for bus-architecture-based dimensional models.

Anyway, HBase seems to serve online applications magnificently, and it was nice to see case studies from services we use every day, like Groupon and JW Player.

Some notes from the sessions I was able to attend:

Apache Hive and HBase

Data retrieval and updates are very programmatic and hard in HBase.
Since Hive is becoming very popular because of its SQL nature, this presentation talks about how to use Hive on top of HBase:
  Hive over HBase - online, unstructured, used by programmers
  Hive over HDFS - offline, structured, used by analysts
  Use case 1 - keep dimensions in HBase and facts in Hive; a query can derive data from both (a rough sketch follows the presentation link below)
  Use case 2 - continuous updates go to HBase with a periodic dump to HDFS; join the two with a single Hive query
Actual presentation link
http://www.slideshare.net/hortonworks/integration-of-hive-and-hbase
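A minimal sketch of use case 1, assuming a hypothetical HBase table customer_dim (row key = customer id, one column family d) and an existing Hive fact table sales_fact (both names are illustrative) - the Hive HBase storage handler exposes the HBase table to Hive so the two can be joined directly:

-- map the HBase dimension table into Hive
CREATE EXTERNAL TABLE customer_dim_hbase (
  customer_id string,
  customer_name string,
  segment string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:name,d:segment")
TBLPROPERTIES ("hbase.table.name" = "customer_dim");

-- join the HBase-backed dimension with a Hive (HDFS) fact table
SELECT d.segment, sum(f.sale_amt) AS total_sales
FROM sales_fact f
JOIN customer_dim_hbase d ON (f.customer_id = d.customer_id)
GROUP BY d.segment;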

Phoenix Project

An SQL skin for HBase
An embedded JDBC driver to query HBase

presentation link
http://www.slideshare.net/dmitrymakarchuk/phoenix-h-basemeetup

Impala and HBase

Cloudera is trying to help the analyst community by returning query results faster
The concern could be efficiency
It may not be suitable for a batch process like ETL, as there is a possibility of failure
It can be used for analyst queries
Not the best yet - in other words, not production ready yet
I'm skeptical about the optimizers these tools use

Couldn't find the actual presentation, but this PPT has most of the info:
http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-hadoop

Apache Drill

Another SQL layer for HBase. I liked this talk since the presenter from MapR admitted that the product is not yet production ready.
The design has a distributed cache on each data node, which means more data transfer to build this cache.
Maybe suitable for small, quick analytical queries; we have to wait and see for production use cases.

http://www.slideshare.net/ApacheDrill/apache-drill-technical-overview

HBase in Pinterest - Use case

An interesting talk about pinning, unpinning, following, etc. Couldn't find the actual presentation.

And there is WibiData's Kiji.. maybe this link will give some idea:
http://gigaom.com/2012/11/14/wibidata-open-sources-kiji-to-make-hbase-more-useful/