Wednesday, August 5, 2015

Top 10 Machine Learning Libraries and Services for Java Developers

In recent times, Machine Learning has emerged as one of the most talked-about topics in information science and technology. Although the subject has intrigued researchers and academics for decades, it is only now that every software engineer is trying to get the hang of it. It is changing the face of computing forever and, in a way, accelerating the move from Software Eating The World to Software Eating Software.

In this post, I will cover (or rather, list) some of the top Machine Learning libraries and services available to a Java developer and highlight the salient points of each. Please note that some of the most powerful machine learning libraries are written in Python, and they are not covered in this post at all.

Before I get to the list of libraries, here is a short list of algorithms, or broad categories of problems, that most Machine Learning libraries cover either fully or partially:

  • Classification
  • Regression
  • Clustering
  • Ranking

Here are a few example applications:
  • Outlier Detection
  • Recommendation
  • Natural Language Processing
  • Neural Networks
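
To make the categories above a little more concrete, here is a minimal sketch of classification in plain Java: a k-nearest-neighbour classifier that labels a point by majority vote among its closest training examples. It is only an illustration. The class name NearestNeighbourSketch and the toy data are my own assumptions and do not come from any of the libraries listed below, which offer far more capable implementations.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not taken from any library in this post.
final class NearestNeighbourSketch {

    // Squared Euclidean distance between two feature vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Classify `query` by majority vote among its k nearest training points.
    // On a tie, the label that reaches the top vote count first wins.
    static String classify(double[][] features, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[features.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by distance to the query, nearest first.
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(features[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            String label = labels[order[n]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups of 2-D points.
        double[][] features = {
            {1.0, 1.0}, {1.2, 0.8}, {0.9, 1.1},
            {5.0, 5.0}, {5.2, 4.9}, {4.8, 5.1}
        };
        String[] labels = {"small", "small", "small", "large", "large", "large"};
        System.out.println(classify(features, labels, new double[] {1.1, 1.0}, 3));
    }
}
```

Regression, clustering and ranking follow the same pattern of learning from examples; only the kind of answer changes (a number, a group, or an ordering instead of a label).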

1. Apache Spark MLlib

2. Deeplearning4j - http://deeplearning4j.org
  • One of the top ML Libraries in Java
  • Integrates with Hadoop, Spark
  • Use Cases
    • Face/image recognition
    • Voice search
    • Speech-to-text (transcription)
    • Spam filtering (anomaly detection)
    • E-commerce fraud detection
    • Regression 


3. Apache Mahout - http://mahout.apache.org
  • Runs on a Hadoop cluster, so it scales out as nodes are added
  • Good for recommendations
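
For readers new to recommendations, the idea can be sketched in plain Java as item-based collaborative filtering: an item a user has not rated is scored by how similar it is to the items that user has already rated. This is not Mahout's API. The class ItemBasedRecommenderSketch, its method names, and the convention that a rating of 0 means "not rated" are all assumptions made for this illustration.

```java
// Hypothetical illustration class; Mahout's real recommenders are far richer.
final class ItemBasedRecommenderSketch {

    // Cosine similarity between two item columns of the ratings matrix.
    // Unrated cells (0) simply contribute nothing to the dot product.
    static double cosine(double[][] ratings, int itemA, int itemB) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (double[] row : ratings) {
            dot += row[itemA] * row[itemB];
            normA += row[itemA] * row[itemA];
            normB += row[itemB] * row[itemB];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Similarity-weighted average of the ratings the user has already given.
    static double predict(double[][] ratings, int user, int item) {
        double weighted = 0.0, totalSimilarity = 0.0;
        for (int other = 0; other < ratings[user].length; other++) {
            if (other == item || ratings[user][other] == 0.0) {
                continue;
            }
            double similarity = cosine(ratings, item, other);
            weighted += similarity * ratings[user][other];
            totalSimilarity += similarity;
        }
        return totalSimilarity == 0.0 ? 0.0 : weighted / totalSimilarity;
    }

    // Index of the unrated item with the highest predicted rating,
    // or -1 if the user has rated everything.
    static int recommend(double[][] ratings, int user) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int item = 0; item < ratings[user].length; item++) {
            if (ratings[user][item] != 0.0) {
                continue;
            }
            double score = predict(ratings, user, item);
            if (score > bestScore) {
                bestScore = score;
                best = item;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Rows are users, columns are items, 0 means "not rated".
        double[][] ratings = {
            {5, 0, 0, 1},
            {5, 5, 0, 1},
            {4, 5, 0, 1},
            {1, 0, 5, 5},
            {1, 1, 5, 4}
        };
        System.out.println("Recommended item for user 0: " + recommend(ratings, 0));
    }
}
```

In this toy matrix, user 0 likes item 0, and the users who like item 0 also like item 1, so item 1 scores higher than item 2. A library such as Mahout adds what the sketch leaves out: distributed computation, sparse data structures and a choice of similarity measures.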

4. Google Prediction APIs (as service) - https://cloud.google.com/prediction

Types of problems for which a few ready-made examples are available:
  • Classification
  • Regression 
Google Prediction provides two types of APIs: one that uses hosted models, and another where you first train a model with sufficient data of your own and then ask the API for predictions.
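
The second style, train first and predict afterwards, has the same two-step shape in almost every library and service. The sketch below shows that shape with a tiny least-squares line fit in plain Java. It does not call the Google Prediction API; the class TrainThenPredictSketch and its train and predict methods are hypothetical names used only for illustration.

```java
// Hypothetical illustration of the train-then-predict workflow.
// This is a local least-squares fit, not a client for any hosted service.
final class TrainThenPredictSketch {
    private final double intercept;
    private final double slope;

    private TrainThenPredictSketch(double intercept, double slope) {
        this.intercept = intercept;
        this.slope = slope;
    }

    // Training step: fit y = intercept + slope * x by ordinary least squares.
    static TrainThenPredictSketch train(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be identical");
        }
        double fittedSlope = sxy / sxx;
        return new TrainThenPredictSketch(meanY - fittedSlope * meanX, fittedSlope);
    }

    // Prediction step: apply the fitted line to a new input.
    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Toy training data that lies exactly on y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        TrainThenPredictSketch model = train(x, y);
        System.out.println(model.predict(5));
    }
}
```

With a hosted model the training step is skipped: someone else has already done it, and you only call the prediction step.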


5. IBM Watson + AlchemyAPI (as service) - http://www.ibm.com/smarterplanet/us/en/ibmwatson


You can try the Alchemy APIs at http://www.alchemyapi.com/products/demo . AlchemyAPI is one of IBM's newest and most significant acquisitions. Its two primary offerings are:
  • AlchemyLanguage - Text Analytics and Natural Language Processing.
  • AlchemyVision - Leverages deep learning for photo and image processing.
It is easy to register and get started.


6. MS Machine Learning (as service) - http://azure.microsoft.com/en-in/services/machine-learning

Microsoft has put a lot of effort into an intuitive UI where users can create a model, train it and run analytics, all by drag and drop. A new user will need some time to get accustomed to the various UI controls and to how the application works. Developers can create their own models and sell them in the Azure Marketplace.

Signing in with an Outlook account works seamlessly.

7. Amazon AWS Machine Learning - https://aws.amazon.com/machine-learning

  • Provides visualisation tools to create ML Models
  • Simple API support for the models generated this way
  • Highly scalable; can generate billions of predictions in a day

8. Weka - http://www.cs.waikato.ac.nz/ml/weka
  • Provides a graphical user interface, command line interface and Java API
  • One of the most popular Java machine learning libraries
  • Available under the GPL license

9. Mallet - http://mallet.cs.umass.edu
  • Statistical natural language processing, document classification, clustering, topic modeling and information extraction.

10. H2O - http://0xdata.com
  • In-memory data engine
  • Designed for running various types of statistical computations (including Deep Learning)
  • Works with Hadoop Distributed File System

Others that deserve a mention but did not make the top 10:

JSAT - https://code.google.com/p/java-statistical-analysis-tool
  • Library for quickly getting started with Machine Learning problems
  • Available under GPL 3, but the author is open to discussion
  • List of supported algorithms is impressive
  • One-man project, done in the author's free time. The creator is Edward Raff (@EdwardRaffML)

LensKit - http://lenskit.org
  • Focused on building recommender systems, primarily for research-based projects
  • Good for trying things out. For scale and production environments, one can move to Apache Mahout.
  • Actively developed and maintained.

Oryx - https://github.com/cloudera/oryx
  • Built on top of Mahout
  • Supports streaming instead of batch jobs, making it realtime
  • Still at an early stage.
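
The batch versus streaming distinction can be illustrated without any framework. A batch job recomputes its answer from the full data set each time, while a streaming job folds in each new record as it arrives and always has a current answer. The sketch below does this for a running mean and variance using Welford's online algorithm. It is not Oryx code; the class StreamingMeanSketch and its method names are hypothetical.

```java
// Hypothetical illustration of streaming (incremental) versus batch computation.
final class StreamingMeanSketch {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    // Streaming update (Welford): fold one new observation into the statistics.
    void accept(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    double mean() {
        return mean;
    }

    // Sample variance; defined as 0 until at least two values have been seen.
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    // Batch equivalent: needs the whole data set in hand before it can answer.
    static double batchMean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        StreamingMeanSketch stats = new StreamingMeanSketch();
        for (double v : values) {
            stats.accept(v); // an up-to-date answer is available after every record
        }
        System.out.println("streaming mean = " + stats.mean());
        System.out.println("batch mean     = " + batchMean(values));
    }
}
```

Both approaches agree on the final mean; the streaming version simply never has to wait for, or store, the full data set.
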
Java-ML - http://java-ml.sourceforge.net
  • Provides a collection of algorithms
  • No new release since 2012

Hopefully you will find this short post useful. Leave your feedback and comments.

Connect with me on Twitter: @satya_paul