Basic Operators
Operator | Description | Example |
---|---|---|
Arithmetic Operators | +, -, *, /, %, ?: | X = FOREACH A GENERATE f1, f2, f1%f2; X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B)); |
Boolean Operators | and, or, not | X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1)); |
Cast Operators | Casting from one datatype to another | B = FOREACH A GENERATE (int)$0 + 1; B = FOREACH A GENERATE $0 + 1, $1 + 1.0 |
Comparison Operators | ==, !=, >, <, >=, <=, matches | X = FILTER A BY (f1 == 8); X = FILTER A BY (f2 == ‘apache’); X = FILTER A BY (f1 matches ‘.*apache.*’); |
Construction Operators | Used to construct tuple (), bag {} and map [] | B = foreach A generate (name, age); B = foreach A generate {(name, age)}, {name, age}; B = foreach A generate [name, gpa]; |
Dereference Operators | dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)) and maps (map#’key’) | X = FOREACH A GENERATE f2.t1,f2.t3 (dereferencing is used to retrieve two fields from tuple f2) |
Disambiguate Operator | ( :: ) used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators | A = load ‘data1′ as (x, y); B = load ‘data2′ as (x, y, z); C = join A by x, B by x; D = foreach C generate A::y; |
Flatten Operator | Flatten un-nests tuples as well as bags | consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c). |
Null Operator | is null, is not null | X = FILTER A BY f1 is not null; |
Sign Operators | + -> has no effect, – -> changes the sign of a positive/negative number | A = LOAD ‘data’ as (x, y, z); B = FOREACH A GENERATE -x, y; |
Relational Operators
Operator | Description | Example |
---|---|---|
COGROUP/GROUP | Groups the data in one or more relations. The COGROUP operator groups together tuples that have the same group key (key field) | A = load ‘student’ AS (name:chararray,age:int,gpa:float); B = GROUP A BY age; |
CROSS | Computes the cross product of two or more relations | X = CROSS A,B A = (1, 2, 3) B = (2, 4) DUMP X; (4, 2, 1) (8, 9) (1,2,3,2,4) (1, 3) (1,2,3,8,9) (1,2,3,1,3) (4,2,1,2,4) (4,2,1,8,9) (4,2,1,1,3) |
DEFINE | Assigns an alias to a UDF or streaming command. | DEFINE CMD `perl PigStreaming.pl – nameMap` input(stdin using PigStreaming(‘,’)) output(stdout using PigStreaming(‘,’)); A = LOAD ‘file’; B = STREAM B THROUGH CMD; |
DISTINCT | Removes duplicate tuples in a relation. | X = DISTINCT A; A = (8,3,4) DUMP X; (1,2,3) (1,2,3) (4,3,3) (4,3,3) (4,3,3) (8,3,4) (1,2,3) |
FILTER | Selects tuples from a relation based on some condition. | X = FILTER A BY f3 == 3; A = (1,2,3) DUMP X; (4,5,6) (1,2,3) (7,8,9) (4,3,3) (4,3,3) (8,4,3) (8,4,3) |
FOREACH | Generates transformation of data for each row as specified | X = FOREACH A GENERATE a1, a2; A = (1,2,3) DUMP X; (4,2,5) (1,2) (8,3,6) (4,2) (8,3) |
IMPORT | Import macros defined in a separate file. | /* myscript.pig */ IMPORT ‘my_macro.pig’; |
JOIN | Performs an inner join of two or more relations based on common field values. | X = JOIN A BY a1, B BY b1; DUMP X (1,2,1,3) A = (1,2) B = (1,3) (1,2,1,2) (4,5) (1,2) (4,5,4,7) (4,7) |
LOAD | Loads data from the file system. | A = LOAD ‘myfile.txt’; LOAD ‘myfile.txt’ AS (f1:int, f2:int, f3:int); |
MAPREDUCE | Executes native MapReduce jobs inside a Pig script. | A = LOAD ‘WordcountInput.txt’; B = MAPREDUCE ‘wordcount.jar’ STORE A INTO ‘inputDir’ LOAD ‘outputDir’ AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; |
ORDERBY | Sorts a relation based on one or more fields. | A = LOAD ‘mydata’ AS (x: int, y: map[]); B = ORDER A BY x; |
SAMPLE | Partitions a relation into two or more relations, selects a random data sample with the stated sample size. | Relation X will contain 1% of the data in relation A. A = LOAD ‘data’ AS (f1:int,f2:int,f3:int); X = SAMPLE A 0.01; |
SPLIT | Partitions a relation into two or more relations based on some expression. | SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null); |
STORE | Stores or saves results to the file system. | STORE A INTO ‘myoutput’ USING PigStorage (‘*’); 1*2*3 4*2*1 |
STREAM | Sends data to an external script or program | A = LOAD ‘data’; B = STREAM A THROUGH `stream.pl -n 5`; |
UNION | Computes the union of two or more relations. (Does not preserve the order of tuples) | X = UNION A, B; A = (1,2,3) B = (2,4) DUMP X; (4,2,1) (8,9) (1,2,3) (1,3) (4,2,1) (2,4) (8,9) (1,3) |
Functions
Function | Syntax | Description |
---|---|---|
AVG | AVG(expression | Computes the average of the numeric values in a single-column bag. |
CONCAT | CONCAT (expression, expression) | Concatenates two expressions of identical type. |
COUNT | COUNT(expression) | Computes the number of elements in a bag, it ignores null. |
COUNT_STAR | COUNT_STAR(expression) | Computes the number of elements in a bag, it includes null. |
DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |
DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |
IsEmpty | IsEmpty(expression) | Checks if a bag or map is empty. |
MAX | MAX(expression) | Computes the maximum of the numeric values or chararrays in a single-column bag |
MIN | MIN(expression) | Computes the minimum of the numeric values or chararrays in a single-column bag. |
SIZE | SIZE(expression) | Computes the number of elements based on any Pig data type. SIZE includes NULL values in the size computation |
SUM | SUM(expression) | Computes the sum of the numeric values in a single-column bag. |
TOKENIZE | TOKENIZE(expression [, 'field_delimiter']) | Splits a string and outputs a bag of words. |
Load/Store Functions
FUnction | Syntax | Description |
---|---|---|
Handling Compression | A = load ‘myinput.gz’; store A into ‘myoutput.gz’; |
PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression. |
BinStorage | A = LOAD ‘data’ USING BinStorage(); | Loads and stores data in machine-readable format. |
JsonLoader, JsonStorage | A = load ‘a.json’ using JsonLoader(); | Load or store JSON data. |
PigDump | STORE X INTO ‘output’ USING PigDump(); | Stores data in UTF-8 format. |
PigStorage | A = LOAD ‘student’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float); | Loads and stores data as structured text files. |
TextLoader | A = LOAD ‘data’ USING TextLoader(); | Loads unstructured data in UTF-8 format. |
Math Functions
Operator | Description | Example |
---|---|---|
ABS | ABS(expression) | Returns the absolute value of an expression. If the result is not negative (x ≥ 0), the result is returned. If the result is negative (x < 0), the negation of the result is returned. |
ACOS | ACOS(expression) | Returns the arc cosine of an expression. |
ASIN | ASIN(expression) | Returns the arc sine of an expression. |
ATAN | ATAN(expression) | Returns the arc tangent of an expression. |
CBRT | CBRT(expression) | Returns the cube root of an expression. |
CEIL | CEIL(expression) | Returns the value of an expression rounded up to the nearest integer. This function never decreases the result value. |
COS | COS(expression) | Returns the trigonometric cosine of an expression. |
COSH | COSH(expression) | Returns the hyperbolic cosine of an expression. |
EXP | EXP(expression) | Returns Euler’s number e raised to the power of x. |
FLOOR | FLOOR(expression) | Returns the value of an expression rounded down to the nearest integer. This function never increases the result value. |
LOG | LOG(expression) | Returns the natural logarithm (base e) of an expression. |
LOG10 | LOG10(expression) | Returns the base 10 logarithm of an expression. |
RANDOM | RANDOM( ) | Returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0. |
ROUND | ROUND(expression) | Returns the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double). |
SIN | SIN(expression) | Returns the sine of an expression. |
SINH | SINH(expression) | Returns the hyperbolic sine of an expression. |
SQRT | SQRT(expression) | Returns the positive square root of an expression. |
TAN | TAN(expression) | Returns the trignometric tangent of an angle. |
TANH | TANH(expression) | Returns the hyperbolic tangent of an expression. |
String Functions
Operator | Description | Example |
---|---|---|
INDEXOF | INDEXOF(string, ‘character’, startIndex) | Returns the index of the first occurrence of a character in a string, searching forward from a start index. |
LAST_INDEX | LAST_INDEX_OF(expression) | Returns the index of the last occurrence of a character in a string, searching backward from a start index. |
LCFIRST | LCFIRST(expression) | Converts the first character in a string to lower case. |
LOWER | LOWER(expression) | Converts all characters in a string to lower case. |
REGEX_EXTRACT | REGEX_EXTRACT (string, regex, index) | Performs regular expression matching and extracts the matched group defined by an index parameter. The function uses Java regular expression form. |
REGEX_EXTRACT_ALL | REGEX_EXTRACT (string, regex) | Performs regular expression matching and extracts all matched groups. The function uses Java regular expression form. |
REPLACE | REPLACE(string, ‘oldChar’, ‘newChar’); | Replaces existing characters in a string with new characters. |
STRSPLIT | STRSPLIT(string, regex, limit) | Splits a string around matches of a given regular expression. |
SUBSTRING | SUBSTRING(string, startIndex, stopIndex) | Returns a substring from a given string. |
TRIM | TRIM(expression) | Returns a copy of a string with leading and trailing white space removed. |
UCFIRST | UCFIRST(expression) | Returns a string with the first character converted to upper case. |
UPPER | UPPER(expression) | Returns a string converted to upper case. |
Tuple, Bag, Map Functions
Operator | Description | Example |
---|---|---|
TOTUPLE | TOTUPLE(expression [, expression ...]) | Converts one or more expressions to type tuple. |
TOBAG | TOBAG(expression [, expression ...]) | Converts one or more expressions to individual tuples which are then placed in a bag. |
TOMAP | TOMAP(key-expression, value-expression [, key-expression, value-expression ...]) | Converts key/value expression pairs into a map. Needs an even number of expressions as parameters. The elements must comply with map type rules. |
TOP | TOP(topN,column,relation) | Returns the top-n tuples from a bag of tuples. |
User Defined Functions (UDFs)
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript and Ruby.Registering UDFs
Registering Java UDFs:
---register_java_udf.pig register 'your_path_to_piggybank/piggybank.jar'; divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);Registering Python UDFs (The Python script must be in your current directory):
--register_python_udf.pig register 'production.py' using jython as bballudfs; players = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);Writing UDFs
Java UDFs:
package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try{ String str = (String)input.get(0); return str.toUpperCase(); }catch(Exception e){ throw new IOException("Caught exception processing input row ", e); } } }Python UDFs
#Square - Square of a number of any data type @outputSchemaFunction("squareSchema") -- Defines a script delegate function that defines schema for this function depending upon the input type. def square(num): return ((num)*(num)) @schemaFunction("squareSchema") --Defines delegate function and is not registered to Pig. def squareSchema(input): return input #Percent- Percentage @outputSchema("percent:double") --Defines schema for a script UDF in a format that Pig understands and is able to parse def percent(num, total): return num * 100 / total
Data Types
Simple Types
Operator | Description | Example |
---|---|---|
int | Signed 32-bit integer | 10 |
long | Signed 64-bit integer | Data: 10L or 10l Display: 10L |
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F Display: 10.5F or 1050.0F |
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2 Display: 10.5 or 1050.0 |
chararray | Character array (string) in Unicode UTF-8 format | hello world |
bytearray | Byte array (blob) | |
boolean | boolean | true/false (case insensitive) |