While working on a Hadoop Big Data project for a major US retailer, we had to slice and dice the sales transaction table very intensely. Some of the built-in Apache PIG features, such as aggregation over a data partition, made our job easier by helping us avoid writing custom code. In this installment of our Hadoop Tutorials series, I will describe how we did this.
Here is a very simplified version of a retailer's transaction table.
Transaction Id | Product Id | Transaction Type | Quantity |
1000 | 1 | S | 10 |
1001 | 1 | R | 3 |
1002 | 2 | S | 6 |
Typically, sales transactions are written in journal entry fashion. Two common types of transactions are Sale and Return. For example, Transaction Ids 1000 and 1002 are sales transactions, where the Transaction Type is “S”. Transaction Id 1001 is a return transaction, where the Transaction Type is “R”.
Suppose a business report needs to show the Total Sale, Total Return, and Net Sale (Total Sale – Total Return) for each Product Id. The report should look like the following table.
Product ID | Total Sale | Total Return | Net Sale |
1 | 10 | 3 | 7 |
2 | 6 | 0 | 6 |
This report requires pivoting and aggregation over a Product Id partition of the entire dataset. In SQL, the PARTITION BY and OVER syntax would achieve the goal. The Hadoop ecosystem provides a number of tools that can accomplish an equivalent result; in Apache PIG, the same can be done with the nested FOREACH (curly bracket) syntax.
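To make the per-partition idea concrete before diving into Pig, here is a rough Python sketch (illustrative only, not part of the original pipeline) that computes the same report from the sample rows above:

```python
from collections import defaultdict

# Sample rows: (transaction_id, product_id, transaction_type, quantity)
rows = [
    ("1000", 1, "S", 10),
    ("1001", 1, "R", 3),
    ("1002", 2, "S", 6),
]

# Partition the rows by product id, analogous to GROUP ... BY productId.
partitions = defaultdict(list)
for _, product_id, txn_type, qty in rows:
    partitions[product_id].append((txn_type, qty))

# Aggregate within each partition, analogous to the nested FOREACH block.
report = {}
for product_id, txns in sorted(partitions.items()):
    total_sale = sum(q for t, q in txns if t == "S")
    total_return = sum(q for t, q in txns if t == "R")
    report[product_id] = (total_sale, total_return, total_sale - total_return)

print(report)  # {1: (10, 3, 7), 2: (6, 0, 6)}
```

The key point is that the aggregation functions see only the rows of the current partition, never the whole dataset.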
Here is the PIG script:
/* Load the input CSV file */
transaction = LOAD 'tran.csv' USING PigStorage(',') AS (transactionId:chararray, productId:int, transactionType:chararray, quantity:int);
/* Group by the Product Id */
groupByProductId = GROUP transaction BY productId;
/* Aggregation over a partition with curly brackets. Notice that the filter and aggregation functions don't operate on the entire dataset. They only work on the partition for the current Product Id. */
report = FOREACH groupByProductId {
    /* get all the Sale transactions for the current Product Id */
    sale = FILTER transaction BY transactionType == 'S';
    /* get all the Return transactions for the current Product Id */
    returnTxn = FILTER transaction BY transactionType == 'R';
    GENERATE group AS productId
        /* get the total sale for the current Product Id */
        , SUM(sale.quantity) AS totalSale
        /* get the total return for the current Product Id; if there is no return record, put zero */
        , (COUNT(returnTxn) > 0 ? SUM(returnTxn.quantity) : 0) AS totalReturn
        /* get the net sale for the current Product Id */
        , SUM(sale.quantity) - (COUNT(returnTxn) > 0 ? SUM(returnTxn.quantity) : 0) AS netSale;
};
/* Store the report in a CSV file */
STORE report INTO 'report.csv' USING PigStorage(',');
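One subtlety worth calling out: in Pig, SUM over an empty bag yields null rather than 0, which is why the script guards the return total with a bincond of the form COUNT(...) > 0 ? SUM(...) : 0. A small, illustrative Python analogue of that guard (Python's own sum of an empty list is 0, so None stands in for Pig's null):

```python
def pig_sum(bag):
    """Mimic Pig's SUM semantics: null (None) for an empty bag, total otherwise."""
    values = list(bag)
    return sum(values) if values else None

return_quantities = []  # a product with no return transactions

unguarded = pig_sum(return_quantities)                            # None, like Pig's null
guarded = pig_sum(return_quantities) if return_quantities else 0  # the bincond guard

print(unguarded, guarded)  # None 0
```

Without the guard, a product with no returns would get a null Total Return and a null Net Sale instead of the zeros the report expects.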
As you can see, this straightforward approach let us use off-the-shelf tooling to simplify the aggregation process and shorten our time to results. Thanks to the flexibility of PIG, we retained the performance and scale benefits of Hadoop while keeping the simplicity and readability of SQL.
Let me know if there are other techniques or tools you’d like to see us highlight in future Hadoop tutorials!
The post Hadoop Tutorials: Aggregation over Data Partition with Apache PIG appeared first on Blogs@Intel.