Working with the Kafka Consumer and Producer Steps in Kettle

The other day a partner asked how to work with the Kafka Marketplace plugins for Kettle contributed by Ruckus Wireless.  I decided to kick the tires and get the steps up and running.

I first started off by downloading Kafka; you can find it here:

http://kafka.apache.org/

I downloaded version 0.9 of Kafka.  I happened to have a Cloudera Quickstart VM running (CDH 5.4), so I figured I’d run through the quick start of Kafka from that VM.  I had no trouble starting up Kafka and sending and receiving basic messages via the console consumer and producer.  Getting started with Kafka is very simple!  Now on to Kettle.
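
For reference, the quick start boils down to a handful of commands like these, run from the Kafka install directory (skip the ZooKeeper step if, as on the Cloudera VM, ZooKeeper is already running):

# start ZooKeeper and a Kafka broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
# create the "test" topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
# type messages into the console producer; read them back with the console consumer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning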

Within Spoon (version 6.0), I installed the Kafka Marketplace plugins.  After restarting, I created a very simple transformation.  I placed an “Apache Kafka Consumer” step on the canvas followed by a “Write to Log” step; can’t get much simpler than that!

In the Kafka Consumer dialog, I specified the topic name as “test” to match what I did during the Kafka quick start.  I then set the “zookeeper.connect” property to the location of ZooKeeper running on my Cloudera VM, “192.168.56.102:2181”.  Finally, I specified the “group.id” as “kettle-group”.
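
Spelled out in properties form, the step configuration boiled down to this (the values are specific to my setup):

topic             = test
zookeeper.connect = 192.168.56.102:2181
group.id          = kettle-group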

Now that I had things wired up, I figured it was time to run!  I had some basic questions at this point: Which message does the consumer group start reading from in the Kafka topic?  How long does the step run before exiting?  We’ll get to those answers in a few minutes.  First, let’s run it and see what happens…

BOOM!

2015/12/23 11:58:18 - Apache Kafka Consumer.0 - ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : Error initializing step [Apache Kafka Consumer]
2015/12/23 11:58:18 - Apache Kafka Consumer.0 - ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : java.lang.NoClassDefFoundError: kafka/consumer/ConsumerConfig

Fun with Java classes.  I’m not exactly sure why Kettle can’t find the Kafka class here.  I quickly resolved this by placing all of the plugin’s lib jar files on Spoon’s main classpath:

cp plugins/pentaho-kafka-consumer/lib/* lib

Note that this was a hammer of a solution.  I renamed all the jar files to start with “kafka” so that I could quickly undo my change if necessary (a rough sketch of the rename is below).  Also, I’ve created the following issue over on GitHub; maybe there’s a better approach to fixing this one that I haven’t thought of yet.

https://github.com/RuckusWirelessIL/pentaho-kafka-consumer/issues/11
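
For reference, here’s a rough sketch of that copy-and-rename, run from Spoon’s install directory (the “kafka-” prefix just makes the copied jars easy to spot and remove later):

for f in plugins/pentaho-kafka-consumer/lib/*.jar; do
  # copy each plugin jar onto Spoon's main classpath under a recognizable name
  cp "$f" "lib/kafka-$(basename "$f")"
done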

Once I restarted Spoon, I re-ran the transformation and … got no results from Kafka.  I tried a bunch of different configurations and sent additional messages to Kafka, but no luck.  So I did what any developer would do and checked out the latest source code.

git clone https://github.com/RuckusWirelessIL/pentaho-kafka-consumer.git

From there I ran “mvn package” and got a fresh build.  I replaced plugins/steps/pentaho-kafka-consumer with the new target/pentaho-kafka-consumer-TRUNK-SNAPSHOT.zip.  After running it and seeing a similar NoClassDefFoundError, I repeated my earlier workaround with the new plugin jars, moving them to the main classpath.
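
Roughly, the rebuild and redeploy looked like this (a sketch; the Spoon install location and the zip’s internal layout are assumptions on my part):

cd pentaho-kafka-consumer
mvn package
# remove the old plugin and unpack the fresh build in its place
rm -rf $SPOON_HOME/plugins/steps/pentaho-kafka-consumer
unzip target/pentaho-kafka-consumer-TRUNK-SNAPSHOT.zip -d $SPOON_HOME/plugins/steps/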

Another thing I ran into was on the Kafka configuration side.  Kafka was advertising the hostname of my VM, which my host OS couldn’t resolve.  I fixed this by setting advertised.host.name in config/server.properties to the public IP address of the VM.
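
The one-line change in config/server.properties looked like this (192.168.56.102 is my VM’s address; substitute your own):

advertised.host.name=192.168.56.102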

After restarting Spoon, I successfully read in the messages from Kafka!  Note that at this time you can’t reset the message offset for a specified group, so the only way to re-read messages is to change the “group.id”.  This is a feature that Ruckus is considering adding; it would be a great way to contribute to the open source plugin!

After getting the Consumer working, I went ahead and tried out the Producer.  Note that the Producer step needs binary data to feed a topic.  All I had to do was feed in binary data, specify the topic name (I used “test” again), and set the “metadata.broker.list” property to the correct IP and port, and it worked like a charm!  I didn’t have to rebuild the producer plugin like I did the consumer, but without the consumer jars placed in the lib folder, the producer wouldn’t function either.

So how might you use Kettle and Kafka together?  Kafka is becoming the de facto big data message queue, and it can be used in combination with Spark and other Hadoop technologies for data ingestion and streaming.  Kettle can be used to populate a Kafka topic via the Apache Kafka Producer, or to consume messages from a topic via the Apache Kafka Consumer for downstream processing.  Ruckus Wireless, the company that contributed the steps, uses Pentaho Data Integration to ingest data into Vertica and then visualizes the data with Pentaho Business Analytics.  You can learn more about the Ruckus Wireless use case here:

http://www.pentaho.com/customers/ruckus-wireless

Here are links to the github locations for the plugins:

https://github.com/RuckusWirelessIL/pentaho-kafka-consumer/ 

https://github.com/RuckusWirelessIL/pentaho-kafka-producer/

First Impressions of Pentaho Business Analytics Cookbook

I just finished reading Packt Publishing’s Pentaho Business Analytics Cookbook by Sergio Ramazzina.  It’s a great, up-to-date guide to utilizing the full Pentaho Analytics Suite, including a mix of both enterprise and community components.  Useful details around configuring data sources, building your first set of reports, parameterization, dashboarding, and a whole lot more are covered step by step to make sure you walk away with a good understanding of what tools are available and how to get started with them.  This is the first definitive guide I have seen around Pentaho Mobile, and I really appreciate Chapter 11 on customizing the Pentaho experience for your business.

If you are looking to get up to speed on Pentaho Analytics very quickly, I highly recommend this book!

Mondrian 4, OSGi in Pentaho 5.1 CE

During the development of 5.1, Pentaho has taken steps to integrate Mondrian 4 into our business analytics platform.  This article goes over what we have accomplished so far and where we are headed, along with instructions for getting Mondrian 4 working with Pentaho 5.1 Community Edition.

Pentaho Enterprise Edition has Mondrian 4 bundled for a specific reason - we’ve now introduced native MongoDB support as a plugin to Mondrian 4.  This use case allows customers to slice and dice data from MongoDB collections in Pentaho Analyzer.  You can learn more about the capability here:  http://www.pentaho.com/request-analyzer-mongodb

As we continue to evolve the Pentaho Platform, we need a more flexible plugin architecture for driving innovation.  To allow both Mondrian 3 and Mondrian 4 runtime environments, we’ve introduced OSGi as a core part of the platform.  Mondrian 4 is our first use case, but we’ll be introducing many others in future versions.

Once Mondrian 4 is installed as an OSGi bundle, it is available as an OLAP4J resource to the platform via Pentaho’s proxy driver, aptly named “PentahoSystemDriver”.  The steps below walk you through getting Mondrian 4 up and running within Pentaho CE 5.1.  Note that these instructions won’t work against previous versions of Pentaho, and they are not necessary in Pentaho EE 5.1, where Mondrian 4 is already configured and installed.

Download Pentaho 5.1 CE

You can download Pentaho 5.1 from sourceforge here: http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/5.1/

Make sure it is working normally before continuing with these instructions.

Deploy Mondrian 4’s required OSGi bundles

We first need to add Mondrian 4 and its dependencies as OSGi Bundles.  Copy the following JARs, which are now OSGi compatible, to pentaho-solutions/system/osgi/bundles:

Also, you may copy the Mondrian properties file to “mondrian.cfg” in the same folder to customize various Mondrian 4 properties.
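
As an illustrative example (these particular settings are not required; they’re just two standard Mondrian properties), a minimal mondrian.cfg might contain:

# cap the size of result sets
mondrian.result.limit=50000
# pretty-print any SQL that Mondrian logs
mondrian.rolap.generate.formatted.sql=true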

Install the CDA Plugin via the Marketplace

Go to the Marketplace perspective and install Community Data Access (CDA).  This plugin allows you to query various data sources, including Mondrian 4 via OLAP4J.

Setup a Mondrian 4 Database and Schema

I already had Foodmart installed in a local MySQL instance.  I also deployed the Mondrian 4 FoodMart.mondrian.xml schema (https://github.com/pentaho/mondrian/blob/lagunitas/demo/FoodMart.mondrian.xml) to the BA Server by uploading the file into the /public/Foodmart folder.

Create a new CDA file that queries Foodmart

I copied the olap4j example that comes with CDA.  Here are the important parts:

<DataSources>
  <Connection id="1" type="olap4j">
    <Driver>mondrian.olap4j.MondrianOlap4jDriver</Driver>
    <Url>jdbc:mondrian4:</Url>
    <Property name="JdbcUser">foodmart</Property>
    <Property name="JdbcPassword">foodmart</Property>
    <Property name="Jdbc">jdbc:mysql://localhost:3306/foodmart_mondrian_4</Property>
    <Property name="JdbcDrivers">com.mysql.jdbc.Driver</Property>
    <Property name="Catalog">solution:/public/Foodmart/FoodMart.mondrian.xml</Property>
  </Connection>
</DataSources>

<Query>
  select {[Measures].[Unit Sales]} ON COLUMNS,
    NON EMPTY [Time].[Time].[1997].Children ON ROWS
  from [Sales]
  where ([Product].[${productfamily}])
</Query>

You can download the full CDA file I used here: mondrian4.cda

Upload the CDA file in the /public/Foodmart folder.

You can now run the CDA file and see the results come back from Mondrian 4!

Additional Info

For developers wanting to access Mondrian 4 metadata the way they can access Mondrian 3 schemas by schema name in the Pentaho Platform, we’ve done some low-level plumbing to get folks started.  If you define a Mondrian 4 connection in pentaho-solutions/system/olap4j.properties, you can gain access to it programmatically through code like the following:

import org.pentaho.platform.plugin.services.connections.mondrian.MDXOlap4jConnection;
import org.pentaho.platform.plugin.services.importexport.legacy.MondrianCatalogRepositoryHelper;
import org.pentaho.platform.plugin.services.importexport.legacy.MondrianCatalogRepositoryHelper.Olap4jServerInfo;
...
// repo, catalogName, and properties are assumed to be in scope here.
final MondrianCatalogRepositoryHelper helper = new MondrianCatalogRepositoryHelper( repo );
if ( helper.getOlap4jServers().contains( catalogName ) ) {
  // Look up the connection details defined in olap4j.properties.
  final Olap4jServerInfo serverInfo = helper.getOlap4jServerInfo( catalogName );
  properties.setProperty( "url", serverInfo.URL );
  properties.setProperty( "driver", serverInfo.className );
  if ( serverInfo.user != null ) {
    properties.setProperty( "user", serverInfo.user );
  }
  if ( serverInfo.password != null ) {
    properties.setProperty( "password", serverInfo.password );
  }
  // Open an MDX-capable olap4j connection through Pentaho's connection factory.
  MDXOlap4jConnection connection =
    (MDXOlap4jConnection) PentahoConnectionFactory.getConnection( IPentahoConnection.MDX_OLAP4J_DATASOURCE,
      properties, PentahoSessionHolder.getSession(), null );
}

Using this utility code is nice because the MDXOlap4jConnection will manage mapping Pentaho’s roles to Mondrian’s.
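
For reference, here’s a sketch of what an entry in olap4j.properties might look like.  The key names are my assumption, inferred from the Olap4jServerInfo fields used above (name, className, URL, user, password), so treat this as illustrative rather than definitive:

mondrian4.name=mondrian4
mondrian4.className=mondrian.olap4j.MondrianOlap4jDriver
mondrian4.URL=jdbc:mondrian4:Jdbc=jdbc:mysql://localhost:3306/foodmart_mondrian_4;Catalog=solution:/public/Foodmart/FoodMart.mondrian.xml
mondrian4.user=foodmart
mondrian4.password=foodmart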

So how does all of this work?

Pentaho has bundled Apache Felix into the Pentaho Platform.  Felix is an OSGi container which now manages Mondrian 4 and its dependencies.  The core bundles that make up Pentaho’s OSGi container can be found in pentaho-solutions/system/osgi/core_bundles; here you’ll find a number of utility OSGi jars, including Gemini Blueprint, which we use for wiring OSGi components.  Blueprint is similar to the Spring Framework.  Also, the Mondrian 4 jar contains metadata that registers it with the Pentaho platform as an available OLAP4J driver with the JDBC prefix “mondrian4”.  You can check out the metadata file OSGI-INF/blueprint/beans.xml to see the specific XML used to declare the driver.  To see how the internal wiring is done, and how PentahoSystemDriver is involved, you can check out the pentaho-platform package org.pentaho.platform.osgi.
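
To give a feel for the wiring, here’s a simplified, illustrative Blueprint descriptor.  The blueprint, bean, and service elements follow the standard Blueprint schema, but the exact service properties Pentaho keys on (for example the “prefix” entry below) are my assumption; check the real OSGI-INF/blueprint/beans.xml inside the Mondrian 4 jar for the actual declaration:

<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">
  <!-- Instantiate the Mondrian 4 olap4j driver inside the bundle. -->
  <bean id="mondrian4Driver" class="mondrian.olap4j.MondrianOlap4jDriver"/>
  <!-- Publish it as a java.sql.Driver service so PentahoSystemDriver can route
       jdbc:mondrian4: URLs to it (the "prefix" key is an assumption). -->
  <service ref="mondrian4Driver" interface="java.sql.Driver">
    <service-properties>
      <entry key="prefix" value="mondrian4"/>
    </service-properties>
  </service>
</blueprint>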

There is still a lot of work to do!

Here are some of the areas we will need to complete in future versions to make this a seamless experience:

  • Ship the Mondrian 4 binaries with the release!
  • Update Pentaho’s Datasource Manager to easily manage Mondrian 4 connections instead of editing olap4j.properties
  • Enable Mondrian 4 to access Pentaho Database Connections
  • Create an easy-to-use Mondrian 4 Schema Editor - maybe Ivy Information Systems’ work will move in that direction?
  • Have projects such as Pivot4J and Saiku support Mondrian 4 connections via OSGi

If any of this work is of interest to you, folks on the Mondrian mailing list and at Pentaho would be happy to help point the way!

Thanks!

Will

Pentaho 5.0 Reporting by Example: Beginner’s Guide

It’s been four years since I published Pentaho Reporting 3.5 for Java Developers.  A lot has changed in Pentaho Reporting since then, so it’s great to see a new book now available from Packt, Pentaho 5.0 Reporting by Example: Beginner’s Guide, co-authored by Mariano Mattio and Dario Bernabeu.  This book has a different purpose than the Java Developers book; its focus is a deeper dive into examples to quickly bring folks up to speed on the various capabilities of Pentaho Reporting.

For those who are already familiar with the basics of Pentaho Reporting, I would still recommend this book for a couple of reasons.  First, Chapter 12 covers both content linking and sparklines, very useful features for your everyday reports.  Second, one of the newest features in Pentaho Reporting 5.0 is stylesheets; in Chapter 13, this book does a great job of introducing them and getting you started with this powerful capability.

Thanks Mariano and Dario for this great contribution!

Pentaho’s move to GitHub

Last week, Nick Baker led the effort to transition the Pentaho Platform and many of our open source commons and plugin projects over to GitHub.  Check out his forum post about the transition here:

http://forums.pentaho.com/showthread.php?135048-BA-Server-source-code-migration-to-GitHub

While SVN was a great tool to work with, GitHub offers a number of capabilities that are important to us as an open source engineering organization.

Over the next few months, we’ll be transitioning the remaining open source projects Pentaho hosts over to GitHub; next up is Pentaho Reporting!

So go ahead and fork, have fun!

Pentaho Marketplace is Here!

I’m proud to announce that, included in today’s release of Pentaho BA Server 4.8 and Pentaho Data Integration 4.4 to SourceForge, we’ve bundled the first version of the Pentaho Marketplace as a plugin!  With the Marketplace, it is now easy to download and install cool plugins developed by the community.

To get started with the BA Server Marketplace, log in as an administrator and click the Marketplace toolbar icon or select Tools -> Marketplace.  In Spoon, select Help -> Marketplace.  Find the plugin you want to install or upgrade, and click!

I would like to say thank you for all the hard work done by the many folks who helped this project reach 1.0!  The Pedros @ WebDetails worked around the clock to deliver the BA Server Marketplace; their team did a great job with the UI experience as well as a telemetry capability, so the community can see which plugins are popular and how the Marketplace is used (more on that in a future blog!).  Also, Wes Brown, Pentaho’s head of pixel management, provided a lot of great feedback and ideas for the UI, helping make the user experience fantastic!  On the PDI side, Matt Casters did a great job putting together the first version of the Marketplace within Spoon, with assistance from Sean Flatley and Matt Burgess, two of our core Kettle developers at Pentaho.  I hear Matt B. is also brewing up a number of new plugins; go check out more about them on his blog :-).

It’s great to see so many helping hands on a project like this, all done out of passion for the product and with the goal of opening it up to even more capabilities and contributions!

If you do find any issues with the Marketplace, please let us know.  We already have many plans for future versions, so keep an eye out … in the Marketplace … for Marketplace updates :-).

Finally, are you a plugin developer and would like to see your plugin appear in the Marketplace?  Go check out https://github.com/pentaho/marketplace-metadata for more details.  We want to get as many plugins listed as possible!

Announcing Pentaho Developer IRC Office Hours

I’m pleased to announce that Pentaho’s Engineering Team will be hosting IRC Office Hours each week.  IRC is a great place to go and chat with Pentaho’s developers, but sometimes we’re too busy traveling the world or hacking away at the next release to catch up with folks in the irc.freenode.net ##pentaho channel.  We’re hoping that hosting office hours will allow for more collaboration, so as a community we can continue to expand and build on the #1 open source business analytics and data integration platform.

Check out the wiki for full details:
http://wiki.pentaho.com/display/COM/IRC+Office+Hours

Pentaho Data Integration 4 Cookbook: Get your Swiss Army knife out

This weekend I had the pleasure of reading Maria Roldan and Adrian Pulvirenti’s Pentaho Data Integration 4 Cookbook, published by Packt Publishing.  I was one of the reviewers for Maria’s first Packt book, Pentaho 3.2 Data Integration: Beginner’s Guide, and I’m a Packt author myself, so when I was asked if I’d be willing to write about the most recent addition to the Pentaho collection of books, I happily obliged.

I highly recommend this book to anyone looking to learn more about PDI.  The book has many great recipes for specific situations, but throughout it you also learn many important Swiss-Army-knife skills that will aid you in your daily use of Pentaho Data Integration.  The book covers everything from dealing with unstructured text files to working with fuzzy logic.  As a Java developer, I especially appreciate the many uses of the User Defined Java Class step in the more advanced scenarios.  The book also introduces the many uses of Pentaho Data Integration within Pentaho’s BI Suite, allowing power BI developers to create a flow of information from a transformation to a report or dashboard.

Chapter 6, Understanding Data Flows, may be the most important chapter in this book.  Managing the merging and splitting of data within a transformation requires key insights that this book covers in detail.  Having this information will allow you to take your transformation building skills to the next level.

Thanks Maria and Adrian for this wonderful piece of work!  The copy I received will reside in the bullpen at Pentaho’s headquarters here in Orlando; I’m sure many of the engineers here will use and learn from it!  Now don’t waste any more time, get your own copy today!

Recent Pentaho Tech Tips

Hi Folks,

I wanted to share a couple of the technical articles I’ve written in the last month:

Mondrian Cache Priming and Cache Control in the BI Server - This article covers how to take control of your Mondrian cache via action sequences, including priming the cache with MDX queries as well as using Mondrian’s CacheControl API to flush specific segments of the cache.
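
As a small taste of the CacheControl side, here’s a minimal Java sketch of flushing a cube’s cells; it assumes you already have a mondrian.olap.Connection in hand (how to obtain one from an action sequence is covered in the article):

import mondrian.olap.CacheControl;
import mondrian.olap.Connection;
import mondrian.olap.Cube;
...
// connection is an existing mondrian.olap.Connection
CacheControl cacheControl = connection.getCacheControl( null );
// Find the cube whose cache we want to flush.
Cube salesCube = connection.getSchema().lookupCube( "Sales", true );
// Build a cell region covering the cube's measures, then flush it.
CacheControl.CellRegion region = cacheControl.createMeasuresRegion( salesCube );
cacheControl.flush( region );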

Customizing your Pentaho Metadata Query in Pentaho Reporting - This article describes how to customize Pentaho’s Metadata Query Language (MQL) in a significant way before execution, allowing reports to respond to user input through prompting in ways that weren’t possible before.

Enjoy!

Will

5 Tips for Styling with Pentaho Report Designer

I’m a big fan of Google Analytics; I use it for all my personal websites to see what type of traffic I get.  One of my colleagues, who is also impressed with their reports, wanted to know if you could make a Pentaho report look as good as Google’s output.  I quickly threw together the following report to show that you can design just about anything in Pentaho Report Designer!

Check out the PDF and HTML rendering of the report.  Feel free to use the PRPT as a template for your own reports.

Here are my top 5 recommendations for folks when designing reports like this:

  1. Don’t be tempted to use lines and rectangles.  Instead, use the padding and borders of bands and elements.
  2. Inline subreports allow you to lay out pretty much anything; use them!
  3. The message-field report element is very powerful; you can specify number and date formats as part of the message, e.g. $(field, date, MMM yyyy).  See the example just after this list.
  4. Make sure to test rendering in the output formats that you care about.  HTML renders as a set of tables, so you can’t have overlapping objects in your report.
  5. Take advantage of the “Paste Formatting” option; it allows you to copy colors, font sizes, etc., and will save you a lot of time.
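
As promised in tip 3, here’s a small illustrative message value; the field names report_date and total_sales are made up for the example:

Sales for $(report_date, date, MMM yyyy): $(total_sales, number, #,##0.00)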

And of course, don’t forget to get a copy of Pentaho Reporting 3.5 for Java Developers :-).  The book covers many topics; you can learn a lot about formula functions, chart options, shortcut keys, and much more.
