Tech @ Runtastic • 14.06.2017 • Runtastic Tech Team

How to Write User Defined Functions for Hive

by Tanja Grill & David Eigenstuhler, Data Engineers

Hive offers several built-in functions that can be used when querying data in Hive. They come in very handy when you want to do some basic calculations, string manipulations or simply work with date values in different formats.

A simple example would be the greatest() function that returns the greatest value of a given list of values.

select greatest(40,100,3);
----
100

Although Hive comes with a comprehensive list of built-in functions, you might find yourself in a situation where the function you need to process your data is not on that list. Fortunately, this is not an issue at all, as Hive allows you to develop user-defined functions, also known as UDFs. What’s even better, it doesn’t take an Einstein to build UDFs: you just need to know a little Java, and soon you’ll be able to use your new UDF in a Hive query.

This post will show you how to write a Hive UDF in Java. For this, we came up with the perfect example. If you have ever tried to convert a UTC timestamp to a readable datetime string in UTC, you might have come across the function from_utc_timestamp. Unfortunately, you most likely found out that by default Hive converts the timestamp to your local time zone when displaying it as a readable string. This means you would then need to convert this string back to UTC again. Therefore, we will show you how to build a user-defined function that converts UTC timestamps to UTC time strings in one easy function.
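For illustration, the built-in from_unixtime has the same pitfall: it renders the value in the time zone of the Hive server instead of UTC. The output below assumes a server running in CET, so it is one hour ahead of the UTC value we actually want:

select from_unixtime(cast(1487080137000 / 1000 as bigint));
----
2017-02-14 14:48:57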

Let’s create a base class which extends org.apache.hadoop.hive.ql.exec.UDF and has an overloaded method named evaluate. The evaluate method is passed a UTC timestamp in milliseconds and an optional mode string, returns an org.apache.hadoop.io.Text object, and uses the Joda-Time library to convert the timestamp. With the optional string argument, you can specify the format you want, such as date or datetime, or extract only a single part of the timestamp, such as the month.

A base class like this also gives us the chance to easily implement the same timestamp-to-time-string conversion for other time zones.

The result should look something like this:

package com.yourcompany.hadoop.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public abstract class TimestampBase extends UDF {

  protected static final Text DEFAULT_MODE = new Text("time");
  protected static final Text OUT_OF_RANGE = new Text("out of range");
  protected static final Text INVALID_ARGS = new Text("invalid argument");

  protected final DateTimeZone dateTimeZone;

  public TimestampBase(DateTimeZone dtz) {
    dateTimeZone = dtz;
  }

  // Called for timestamp_to_utc(ts); falls back to the default "time" mode.
  public Text evaluate(final LongWritable inputTimestamp) {
    return evaluate(inputTimestamp, DEFAULT_MODE);
  }

  // Called for timestamp_to_utc(ts, mode); guards against NULL input from Hive.
  public Text evaluate(final LongWritable inputTimestamp, final Text mode) {
    if (inputTimestamp == null) {
      return INVALID_ARGS;
    }
    String modeAsString = (mode == null) ? DEFAULT_MODE.toString() : mode.toString();
    return evaluate(inputTimestamp.get(), modeAsString);
  }

  private Text evaluate(final long inputTimestamp, final String mode) {
    if (inputTimestamp < 0) {
      return OUT_OF_RANGE;
    }
    // Interpret the input as epoch milliseconds and render it in the target time zone.
    DateTime dt = new DateTime(inputTimestamp).withZone(dateTimeZone);
    DateTimeFormatter fmt;
    switch (mode.toLowerCase()) {
    case "year":
      fmt = DateTimeFormat.forPattern("yyyy");
      break;
    case "month":
      fmt = DateTimeFormat.forPattern("MM");
      break;
    case "day":
      fmt = DateTimeFormat.forPattern("dd");
      break;
    case "date":
      fmt = DateTimeFormat.forPattern("yyyy-MM-dd");
      break;
    case "hour":
      fmt = DateTimeFormat.forPattern("HH");
      break;
    case "minute":
      fmt = DateTimeFormat.forPattern("mm");
      break;
    case "second":
      fmt = DateTimeFormat.forPattern("ss");
      break;
    case "week":
      fmt = DateTimeFormat.forPattern("ww");
      break;
    case "time":
    default:
      fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
    }
    return new Text(dt.toString(fmt));
  }
}

To use our user-defined function within Hive, we now write another class inheriting from the above base class, add a description and define the time zone.

package com.yourcompany.hadoop.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.joda.time.DateTimeZone;

@Description(
    name = "timestamp_to_utc",
    value = "_FUNC_(timestamp_ms [, mode]) - converts a UTC timestamp in milliseconds to a UTC time string")
public final class TimestampToUTC extends TimestampBase {
  public TimestampToUTC() {
    super(DateTimeZone.UTC);
  }
}
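To give you an idea of what the base class buys us: a function for another time zone only needs its own constructor. The class name and zone ID below are just an example you can adapt:

package com.yourcompany.hadoop.hive.udf;

import org.joda.time.DateTimeZone;

// Example only: the same conversion rendered in Central European Time.
public final class TimestampToCET extends TimestampBase {
  public TimestampToCET() {
    super(DateTimeZone.forID("Europe/Vienna"));
  }
}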

To be able to use this function in Hive, you have to compile your code into a JAR, add it to your Hive classpath and store it in HDFS.

hive> add jar my_jar.jar;
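If the JAR is not in HDFS yet, upload it to the directory that the create function statement below points to (adjust the path to your setup):

hdfs dfs -put my_jar.jar /user/hive/myjars/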

As a final step, the function needs to be registered with the following commands:

hive> create database if not exists udfs;
hive> use udfs;
hive> drop function if exists timestamp_to_utc;
hive> create function timestamp_to_utc as "com.yourcompany.hadoop.hive.udf.TimestampToUTC" USING JAR 'hdfs://your_hdfs_nameserver/user/hive/myjars/my_jar.jar';
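To check that the registration worked, and to see the description we attached to the class, you can ask Hive to describe the new function. Since it was created in the udfs database, you can also call it from other databases as udfs.timestamp_to_utc.

hive> describe function timestamp_to_utc;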

And voilà, now you can use your UDF in Hive!

select timestamp_to_utc(1487080137000);
-----
2017-02-14 13:48:57
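The optional second argument picks one of the formats defined in the base class, for example:

select timestamp_to_utc(1487080137000, 'date');
-----
2017-02-14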

If you are using Impala as well, you are in luck: you can use your new UDF there, too. Run the following commands and your function will also be available in Impala queries:

impala> use udfs;
impala> drop function if exists timestamp_to_utc(bigint);
impala> drop function if exists timestamp_to_utc(bigint, string);
impala> create function timestamp_to_utc(bigint) returns string location '/user/hive/myjars/my_jar.jar' symbol='com.yourcompany.hadoop.hive.udf.TimestampToUTC';
impala> create function timestamp_to_utc(bigint, string) returns string location '/user/hive/myjars/my_jar.jar' symbol='com.yourcompany.hadoop.hive.udf.TimestampToUTC';
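Afterwards the function can be used in Impala just like in Hive (remember to use udfs first or qualify the function with the database name):

impala> select timestamp_to_utc(1487080137000);
-----
2017-02-14 13:48:57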

Dear reader, we hope our little excursion into user-defined functions will help you create your own neat functions.

Feel free to share your opinions in the comments section below 🙂

***

Runtastic Tech Team

We are made up of all the technical departments at Runtastic like iOS, Android, Web, Infrastructure, Data Engineering, etc. We're eager to tell you about how we work here at Runtastic, what our processes look like, the cool stuff we create and what we have learned along the way. And BTW we are always looking for talented people who want to join us!