Sunday, October 25, 2015

10 Things you must know about Charts, Graphs & Visualization

"In God we trust, all others must bring data," said W. Edwards Deming. But we all know it is easier said than understood when data is actually presented in front of you. Often you need a statistician to interpret it. And they are rare! What you need is a straight and simple way to represent data that the brain can interpret relatively easily. Hence the need for Charts, Graphs and other visualization.

I started exploring the various Charts & Graphs libraries for generating insight for my hobby project www.fanffair.com. The goal was to identify the top posts based on statistical analysis of the Like, Share and Comments data associated with each post and share the top posts on the fanffair Facebook Page. That in turn led to a second hobby project of mine, www.statspanda.com. In this blog post, I am going to share my notes, learning & observations as I sifted through various charting libraries ranging from Google Charts, morris.js, raphael.js, xCharts, nvd3, Flot charts, dygraphs, Rickshaw, Highcharts et al to D3.js.

All of the above are charting libraries that you either download or use as a hosted library, invoking their JavaScript methods to create a chart or graph in your web application. Some of them are pretty straightforward to use, while libraries like D3.js require some programming skill and knowledge of HTML, JavaScript etc. But unlike the charting libraries, D3.js gives you all the power and flexibility for creating exotic visualizations. You can also pick from the visualizations created and curated by Mike Bostock.

If you want to avoid either of the options, then you can go for online services that take your data and generate the charts and graphs online. There are many such services, some of them are really big. Many of them are often categorized as data analytics companies as they deal a lot in the data layer. I have provided a list of such services/companies in later part of this post. In fact, statspanda.com will qualify for this category.

There is a fourth category where companies sell ready-to-use Dashboard software. These products come with pre-built charts and graphs but are not limited to charting. One needs to customize, implement and host them on one's own.

To start with, I will cover various chart types and some basics around their usage. I will also briefly touch upon the technologies used behind these charting libraries. Then I will move on to talk very briefly about some of the libraries I have explored closely. This will be followed by a list of companies that sell Charts and Graphs as a service or as Dashboard software.


Popular Chart Types (#1)


Line Chart
Spline Chart
Bar Chart
Pie Chart
Doughnut Chart
Area Chart
Bubble Chart
Scatter Plot
Bullet Chart
Gauge Chart
Combination Chart
Creative Visualization

Sample Charts generated using StatsPanda.com



Factors Behind Making a Choice (#2)

There are a host of factors that may influence your choice of a particular type of chart or visualization. Here are a few factors that I could compile:

Size of Data points - this is probably the most important factor behind making a choice. For example, a Pie Chart may not work very well if you have more than, say, 30 data points, but a Line Chart may still be a good choice. Similarly, an Area Chart or Line Chart can represent a very large set of data. However, remember that the number of data points also indicates whether you are looking at low level or high level data. Millions of raw records may translate into a couple of data points once they are aggregated.
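To illustrate the aggregation point, here is a minimal JavaScript sketch of how raw records collapse into a handful of chart-ready data points. The record shape (`page`, `likes`), the sample values and the function name `aggregateBy` are all hypothetical, made up only for this example.

```javascript
// Hypothetical helper: sums a numeric field per group key.
function aggregateBy(records, keyField, valueField) {
  const totals = new Map();
  for (const record of records) {
    const key = record[keyField];
    totals.set(key, (totals.get(key) || 0) + record[valueField]);
  }
  // One { label, value } pair per distinct key, ready to feed a chart.
  return Array.from(totals, ([label, value]) => ({ label, value }));
}

// Illustrative, made-up raw feed: one record per post.
const rawPosts = [
  { page: "A", likes: 10 },
  { page: "B", likes: 4 },
  { page: "A", likes: 7 },
  { page: "B", likes: 1 },
  { page: "A", likes: 3 },
];

// Five raw records become two data points, one per page.
const points = aggregateBy(rawPosts, "page", "likes");
console.log(points);
```

With real data the raw feed would have millions of rows, but the aggregated output stays as small as the number of distinct keys, and that is the number that should drive the choice of chart.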

Types of Data Points - Next, check if you are dealing with only one type of data or there are multiple different types of data that need to be plotted. Often combination charts are used if the data types are different or different sets of data need to be represented. Example - using a combination of Bar Chart & Line Chart for representing revenue and profit numbers of one company vis-a-vis using multiple Line Charts for representing revenue numbers of different companies.

Relationship of Data Points - If multiple data types are being presented in a Combination Chart, then whether or not the different types of data are related also influences the choice of chart type.

Complexity Of Relationship - How the data sets are related matters in the selection of a chart. Complex relationships like many-to-many relations or multiple relations between two data points are not easy to represent in regular Charts & Graphs and call for creative visualization.

Static vs Animation vs Flow Representation - Most Charts and Visualizations really capture a snapshot or static state of data. In certain cases you may need to show a transition from one snapshot of data to the other (i.e. animated) or capture the states that the data passes through (i.e. data flow). A good example of the latter is the Sankey Chart.

A general rule of thumb, as per a paper from IBM, for the number of data points each chart type handles well:
  • Pie chart: 3-10
  • Bar chart: fewer than 50
  • Line chart: fewer than 500
  • Bubble plot: fewer than 500
  • Scatter plot: fewer than 10,000
  • Creative Visualization comes in handy for data points > 10,000
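
The rule of thumb above can be turned into a small lookup. This is only a sketch in JavaScript: the function name `suggestChartTypes` is hypothetical, the lower bound of 1 for everything except the pie chart is my assumption, and so is the treatment of exactly 10,000 points, which the list leaves undefined.

```javascript
// Bounds taken from the rule-of-thumb list; "fewer than N" becomes max = N - 1.
// Assumption: every chart except the pie chart needs at least 1 data point.
const CHART_LIMITS = [
  { type: "Pie chart", min: 3, max: 10 },
  { type: "Bar chart", min: 1, max: 49 },
  { type: "Line chart", min: 1, max: 499 },
  { type: "Bubble plot", min: 1, max: 499 },
  { type: "Scatter plot", min: 1, max: 9999 },
];

// Returns every chart type whose range contains pointCount.
function suggestChartTypes(pointCount) {
  if (!Number.isInteger(pointCount) || pointCount < 1) {
    throw new RangeError("pointCount must be a positive integer");
  }
  const candidates = CHART_LIMITS
    .filter(({ min, max }) => pointCount >= min && pointCount <= max)
    .map(({ type }) => type);
  // Assumption: anything no regular chart covers (10,000 and up) falls
  // through to creative visualization.
  return candidates.length > 0 ? candidates : ["Creative Visualization"];
}

console.log(suggestChartTypes(30));
console.log(suggestChartTypes(50000));
```

The ranges overlap, so the function returns a list of candidates rather than a single answer. The other factors discussed above still decide which candidate fits.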

Rendering Engines (#3)

SVG - Scalable Vector Graphics (SVG) is an XML-based vector image format for two-dimensional graphics with support for interactivity and animation. The SVG specification is an open standard developed by the World Wide Web Consortium (W3C) since 1999. It is widely supported. Traditionally, MS-Internet Explorer has been a laggard, but SVG is supported natively starting with IE 9. Some of the popular libraries like morris.js, D3.js and others built on top of D3.js use SVG.


Canvas (HTML5) - The canvas element was introduced as part of the HTML5 spec and provides a bitmap drawing surface that is scripted from JavaScript. Examples: flot, chartjs, jqPlot and other JavaScript based visualisation libraries.

VML - Vector Markup Language (VML) is an XML-based file format for two-dimensional vector graphics. Developed and promoted by Microsoft, it is mainly relevant for older Internet Explorer versions that lack SVG support.

Chart Performance (#4)


Check out Charts performance at http://jsperf.com/charts-comparison-d3-js-kendo-highcharts-echart-flot-gr 



HTML5 Based Libraries (#5)


chartjs.org - Offers six chart types. HTML5 Canvas based, responsive. One of the best options for the chart types it supports.


canvasjs.com - Pretty good & rich collection of Charts. Free for non-commercial use.


D3.js & D3.js Based Libraries (#6)

d3.js is the most powerful option and gives the highest amount of flexibility, but that comes at the cost of added complexity in terms of the programming skills needed to develop a chart or graph. If you want to avoid that, please refer to some of the libraries below, which are pretty straightforward to use.

NVD3 - d3.js based charts library with good options. Some of them are unique when compared to other generally available charts e.g. Scatter / Bubble Chart, Stacked Area Chart. Reads data from CSV and other text formats.

C3.js - Another D3.js based charting library. Pretty comprehensive set of charts, simple and easy to use, provides good documentation and is available under The MIT License (MIT). C3.js has a pretty active Google Group, so it is easy to get support.

Rickshaw - Built on top of d3.js. Very neat library for creating time series graphs, line charts, bar charts etc. It is free, open source and available under the MIT License.

http://d3pie.org - nice pie charts, offers good options. Available under The MIT License (MIT)

xCharts - D3.js based library. Uses HTML, CSS, SVG. Default charts have polished look but very limited options. It was developed by https://www.tenxer.com/ and has been made free with no strings attached.

dimplejs - it's crazy, it's very good. Reads data from CSV and other text formats. It's an open source project by http://align-alytics.com/

dc.js - It's amazing. Just look at this - http://dc-js.github.io/dc.js/ . It's available under the Apache License, Version 2.0.
A few cool examples - http://dc-js.github.io/dc.js/examples/cust.html

http://d3plus.org - Built on top of d3.js, simple to use. It has a limited set of examples/APIs but some of them are pretty good e.g. Geo Map, Tree Map, Simple Network.

https://github.com/mbostock/d3/wiki/Gallery - d3 example library in one page.

http://bl.ocks.org/mbostock - Amazing Charts and Visualization examples by Mike Bostock, the creator of D3.js and many other libraries used for rich visualization.

http://bost.ocks.org/mike/ - Another repository by Mike Bostock.



Pure JavaScript (#7)

Flot Charts - A pure JavaScript plotting library.


Google Charts - It has a very rich set of charts & graphs and is simple to use. You may like to look at the AngularJS Google Chart Tools directive as well.


jqPlot - Pure Javascript charting library.  Offers good set of options. Look and feel is not very polished. 

morris.js - raphael.js based library. Simple to use. Cool fluid look and feel. Limited options, most of the common scenarios are covered. Uses HTML, CSS, SVG

dygraphs - Has good options for time series graphs but the look and feel is not very polished. However, dygraphs can handle huge datasets running into millions of data points. It takes .txt and .csv files as inputs.


Mini Charting Libraries (#8)


Peity is a simple jQuery plugin that converts an element's content into a simple mini pie, line or bar chart.

jQuery Sparkline generates sparklines (small inline charts) directly in the browser using data supplied either inline in the HTML, or via javascript.


Commercial Offerings (#9)

This section covers a list of Charts and Graphs libraries. 

HighCharts - Offers very rich set of options. Probably one of the best libraries. Available free for non-commercial use, under Creative Commons License

Canvas JS - HTML5 JavaScript Charting Library with a simple API and, as per its makers, 10x better performance compared to SVG/Flash based charts. Charts are responsive & can run across devices including iPhone, Android, Desktops, etc. Offers a good set of Charts and Graphs.

ZingChart - Javascript Charting library. Offers a rich set of options. Visually appealing and good for large data sets. 

JSChart - Javascript Charts. Limited options.

Ember Charts - Open source. Offers limited popular options of Charts, Tables etc.


Charts & Graphs as Service (#10)

statspanda.com : Offers creation and hosting of Charts, Graphs & Dashboards as Service. Purely REST Api driven, also provides a REST API Console.

tableau.com : Grand Daddy of Visualisation. Offers Desktop, Cloud based service as well as Back-end data integration. Has huge array of very creative Charts Graphs and Dashboards. 

datawrapper.de : Creation of embeddable Charts, Maps. Primarily used by publishing, news and media companies. 

chartio.com : Provides a data integration layer with various sources like CSV, Amazon RedShift or Stripe etc. It then converts the data into intuitive Visualisation such as Charts, Graphs etc.

chartblocks.com : Basic chart building tool. It reads the data from a spreadsheet and gives tools to customise the chart. The charts can be shared on popular social media or can be embedded as an iFrame.

infogr.am : Create charts and infographics and publish them easily, primarily used by the news and media companies. 

jaspersoft.com : Create Charts, Graphs, Dashboard and embed or publish. Good integration with Amazon AWS (RDS, Redshift) . Owned by TIBCO. Focuses on Apps and On-prem Applications for embedding the Charts, Graphs and Dashboards created/hosted on their platform.

Amazon QuickSight : Works with AWS dataset to create visualisation. Offers good set of Graphs, Tables and other Visualisation. Amazon QuickSight uses a new, Super-fast, Parallel, In-memory Calculation Engine (“SPICE”) to perform advanced calculations and render visualizations rapidly. 

domo.com : Offers good set of connectors. Dashboards are designed for specific roles and for different industry verticals. Rich set of chart, graphs and other visualization are offered. Provides good integration with Amazon. 

gooddata.com : Primary positioning as Business Intelligence platform for real time analytics. Offers good set of charts, graphs and dashboard options. Has good set of Connectors.

keen.io : Keen IO is an API platform that lets developers collect and study custom events at a massive scale and converts them to visualisation. It's designed in a way so that users can embed the visualisation like Charts, Dashboards in their apps, websites. All popular Charts and Graphs are available for the Dashboards. 

chartbeat.com : Focused at Advertising and Publishing industries. Provides insights using Charts, Graphs and other visualisation.

vida.io : Create attractive visualisation, embed and use them, can create Dashboard. 

plot.ly : Plotly is the data visualization and collaboration platform for engineers and data scientists. Provides integration with Python, Excel, MATLAB & R. Offering available in three different formats - Cloud, On Prem, and Desktop Tool. 

zoomdata.com : Focused at Visualisation for Big Data, can handle large volume of data. Provides integration with Hadoop, NoSQL, ElasticSearch, Solr and Spark.

datahero.com : Provides connectors to a large array of Cloud Based data sources like Box, Dropbox, Stripe, Hubspot, MailChimp, Google Drive, Google Analytics, MixPanel etc. You can then convert the data to a Chart, Graph and other visualisation. You can create one dashboard composed of multiple Charts and Graphs from different Datasources. Has focused offerings for various industry verticals.

getdataseed.com : Provides data exploration tool sets that can import data from an Excel Spreadsheet and fetch from Dropbox, Google Drive or any public url. Can convert the data into Charts, Graphs, Maps and other Visualisation. Provides REST Api integration.

anaplan.com : Excel Spreadsheet on Cloud with ability to create crisp Charts, Graphs, Dashboards and other visualisation on the fly. Excel Spreadsheet like data is editable. Has focused offerings around Finance, Sales, Operations and HR.

collabion.com : Creates Charts, Graphs and Dashboards from SharePoint Data. Needs to be downloaded and installed, and requires Server Licenses.

www.domo.com : Offers good options of Charts, Graphs and other visualisation. Provides wide range of connectors for various datasources and Apps ranging from Excel, Google sheets, Box, Facebook, Marketo, Salesforce etc. Solutions are tailored for various roles, industry verticals and operations.

www.klipfolio.com : Offers creation of company Dashboards composed of cool Charts, Graphs and other visualisation. Provides a wide range of connectors to all major datasources and various options for data infusion. 

Microsoft Power BI : Available as Desktop Application and mobile Apps for iOS, Android and Windows Phones. Wide range of Charts, Graphs can be glued to a Dashboard quickly, supports drag and drop features. Provides wide range of Datasource connectors. Supports natural language query inside a Dashboard.

RJMetrics.com : Offers creation of Charts, Graphs and Dashboards on Cloud. Has good Data Integration with popular Datasources. RJMetrics Pipeline transfers the data to Redshift. Good for large volumes of Data - syncs anything from less than 5 million rows to upwards of 500 million rows.

http://kilometer.io : SaaS Analytics Tool

https://chartmogul.com : Subscription Analytics

gramener.com : They are into creative visualization. Provides visualization of the data from various business operations like Sales, Marketing, HR etc.

Finally, I still feel the blog looks like a work in progress. In fact, I would spend more time taking a deeper look into all the companies listed under #10. However, I hope you found the post useful even in current shape and form.

Tuesday, September 1, 2015

How to Add Charts, Graphs and Visualization to a Blog Post

Over the last few months, I have written a few blog posts where I used pretty sophisticated Charts, Graphs and Visualization. They make a blog post a lot more meaningful and readable, and readers love those charts, tables & displays. In this post, I will share the service I use for those visuals and a few quick steps on how to create a chart and add it to your blog.

I use the services of http://www.StatsPanda.com, it's in public beta and it's free for "Individual" users.

StatsPanda.com Home Page


Once you login using your Facebook account, it redirects to registration page and during registration it gives two options - "Individual" and "Enterprise". Choose "Individual".

Go to the API Console. I suggest that you spend some time exploring the listed Visualization APIs and trying them out in order to understand the structure of the JSON inputs. It's not super complicated, although it may take some time. Initially you can use the example input data provided along with the API Documentation.

API Console to create Charts, Graphs etc


Once decided, go ahead and create the Chart of your choice and that would give you a unique url and iFrame Code for your chart. For this blog I created a Stacked Area Chart with the example data that API Console provided. In the above screenshot, you can see a greyed out box right below the chart. The content of that box is copied below for your reference. You can straight away use that in your blog.

<iframe src="http://www.statspanda.com/charts/ui/nvd3/stacked-area-chart?unique_key=dc27b99e933da8969218fb3d8e06aeafdc61e170dad7c5aed1b5d41381b2d8" style="border: #E8E8E8 1px solid; height: 300px; width: 100%;"></iframe>
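
If you embed several charts, the snippet can be generated instead of copied each time. Here is a minimal JavaScript sketch under stated assumptions: the function name `buildEmbedCode` is hypothetical, the URL and inline style are copied from the snippet above, and I am assuming that only the `unique_key` value changes from one stacked area chart to the next.

```javascript
// URL and inline style copied from the iFrame snippet above. Assumption:
// only the unique_key differs between charts of this type.
const CHART_BASE_URL =
  "http://www.statspanda.com/charts/ui/nvd3/stacked-area-chart";

// Hypothetical helper: builds the iFrame markup for a given chart key.
function buildEmbedCode(uniqueKey, heightPx = 300) {
  const src = `${CHART_BASE_URL}?unique_key=${encodeURIComponent(uniqueKey)}`;
  const style = `border: #E8E8E8 1px solid; height: ${heightPx}px; width: 100%;`;
  return `<iframe src="${src}" style="${style}"></iframe>`;
}

// "YOUR_UNIQUE_KEY" is a placeholder for the key the API Console gives you.
console.log(buildEmbedCode("YOUR_UNIQUE_KEY"));
```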

Here is an example of how the Visualisation really appears in a blog once you include the iFrame code snippet in your blog.



The chart is dynamic and not a jpg pic, data is being served from the Website. In fact, if you want you can "Edit" the input JSON Data and have a different chart of same type.

Hopefully you find it useful and are able to use the charts and graphs in your blogs.



Wednesday, August 5, 2015

Top 10 Machine Learning Libraries and Services for Java Developers

In recent times, Machine Learning has emerged as one of the most talked about topics in the field of information science and technology. Although the subject has probably intrigued the researchers and academicians for decades, it's only now every Software Engineer is trying to get a hang of it. It's changing the face of computing for ever and in a way, accelerating the move from Software Eating The World to Software Eating Software. 

In this blog, I will cover (rather list out) some of the top Machine Learning libraries and services available for a Java developer and try to highlight some of the salient points associated with each of those libraries. Please note some of the most powerful machine learning libraries are in Python and they are not covered in this post at all.

Before I go to the list of libraries, here is a short list of algorithms or broad categories of problems that most of the Machine Learning libraries would cover either fully or partially -

  • Classification
  • Regression
  • Clustering
  • Ranking

Here are few examples of application:
  • Outlier Detection
  • Recommendation
  • Natural Language Processing
  • Neural Networks

1. Apache SPARK MLlib 

2. Deeplearning4j - http://deeplearning4j.org
  • One of the top ML Libraries in Java
  • Integrates with Hadoop, Spark
  • Use Cases
    • Face/image recognition
    • Voice search
    • Speech-to-text (transcription)
    • Spam filtering (anomaly detection)
    • E-commerce fraud detection
    • Regression 


3. Apache Mahout - http://mahout.apache.org
  • Runs on a Hadoop Cluster, so it scales out by adding nodes
  • Good for recommendations

4. Google Prediction APIs (as service) - https://cloud.google.com/prediction

Types of problems for which you may find a few ready examples
  • Classification
  • Regression
Google Prediction provides two types of APIs - one that leverages hosted models, and another where you first have to train the model with sufficient data of your own and then ask the APIs to predict.


5. IBM Watson + AlchemyAPI (as service) - http://www.ibm.com/smarterplanet/us/en/ibmwatson


You can try Alchemy APIs at http://www.alchemyapi.com/products/demo . One of the newest and a very significant acquisition by IBM is Alchemy API. Here are the two primary offerings from Alchemy API -
  • AlchemyLanguage - Text Analytics and Natural Language Processing.
  • AlchemyVision - Leverages deep learning for photo and image processing.
Easy to get registered and get started.


6. MS Machine Learning (as service) - http://azure.microsoft.com/en-in/services/machine-learning

Microsoft has put a lot of effort into coming up with an intuitive UI where users can create a model, train it and run the analytics - all with drag and drop. For a new user it would take some time to get accustomed to the various UI controls and how the Application works. Developers can create their own models and sell them in the Azure Marketplace.

Outlook account works seamlessly. 

7.  Amazon AWS Machine Learning - https://aws.amazon.com/machine-learning 

  • Provides visualisation tools to create ML Models
  • Simple API support for the models generated this way
  • Highly Scalable, can generate billions of predictions in a day

8. Weka - http://www.cs.waikato.ac.nz/ml/weka
  • Provides a graphical user interface, command line interface and Java API
  • One of the most popular Java machine learning libraries
  • Available under GPL License

9. Mallet - http://mallet.cs.umass.edu
  • Statistical natural language processing, document classification, clustering, topic modeling and information extraction.

10. H2O - http://0xdata.com
  • In-memory data engine
  • Designed for running various types of statistical computations (including Deep Learning)
  • Works with Hadoop Distributed File System

Others deserving a mention but could not make it to the list of top 10

JSAT - https://code.google.com/p/java-statistical-analysis-tool
  • Library for quickly getting started with Machine Learning problems
  • Available under GPL 3 but author is open for discussion
  • List of supported algorithms is impressive
  • One man project, done in his free time. Creator is Edward Raff @EdwardRaffML

LensKit - http://lenskit.org
  • Focused on building recommender system, primarily for research based projects
  • Good for trying out. For scale and for production env, one can move to Apache Mahout.
  • Actively developed, managed.

oryx - https://github.com/cloudera/oryx
  • Built on top of Mahout
  • Supports streaming instead of batch jobs, making it realtime
  • Still in early stage.

Java-ML - http://java-ml.sourceforge.net
  • Provides a collection of algorithms
  • No new release since 2012

Hopefully you will find this short post useful. Leave your feedback and comments.

Connect to me on twitter: @satya_paul

Sunday, January 11, 2015

Did TCS under report employee strength in 2014?

In the wake of recent news on layoffs at TCS, I started looking into TCS Annual reports. As I sifted through the Annual Reports of TCS from last 10 years, something did not seem to be quite right around the employee strength reported in Annual Report in 2014.

Usually, for an IT Services Company, revenue growth is directly proportional to the growth in employee strength. So it seemed odd that the Y-O-Y growth rate of Operating Profit was the highest in 2014 whereas the employee growth was the lowest in the history of TCS. So, I started looking into the details of the data based on TCS Annual Reports since 2005.

All the Charts are reproduced courtesy of www.StatsPanda.com

TCS Y-O-Y Key Growth Data (Revenue, Op Margin, Productivity, Avg Salary, Employee Growth)

 

In 2014, Operating Profit Growth (39.43%) and Productivity Gain (19.57%) were the highest while Employee Growth (8.62%) was the lowest in the history of TCS. That surely raises some eyebrows.
So, I decided to regenerate the same chart by changing the Employee Strength from 300,000 (as reported in AR 2014) to 320,000. Now, take a look at the chart below. The numbers look much more realistic - Productivity Gain (12.1%) & Employee Growth (15.86%).
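
The adjusted percentages can be reproduced from the reported ones with a few lines of arithmetic. The JavaScript sketch below rests on two assumptions of mine that the post does not spell out: employee growth is measured on year-end headcount, and productivity means revenue per year-end employee.

```javascript
// Figures quoted in this post for 2014, based on the reported headcount.
const reportedHeadcount = 300000;
const adjustedHeadcount = 320000; // the what-if figure used in this post
const reportedEmployeeGrowth = 0.0862;
const reportedProductivityGain = 0.1957;

// Assumption: growth is measured on year-end headcount, so the prior-year
// headcount can be backed out of the reported growth rate.
const priorHeadcount = reportedHeadcount / (1 + reportedEmployeeGrowth);

// Employee growth if the year-end headcount had been 320,000.
const adjustedEmployeeGrowth = adjustedHeadcount / priorHeadcount - 1;

// Assumption: productivity is revenue per employee. Revenue is unchanged,
// so productivity scales by reportedHeadcount / adjustedHeadcount.
const adjustedProductivityGain =
  (1 + reportedProductivityGain) * (reportedHeadcount / adjustedHeadcount) - 1;

const toPercent = (x) => (x * 100).toFixed(2) + "%";
console.log(toPercent(adjustedEmployeeGrowth));   // 15.86%
console.log(toPercent(adjustedProductivityGain)); // 12.10%
```

Both results agree with the regenerated chart, which suggests the chart was built on the same two definitions.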

TCS Y-O-Y Data (with Employee Strength as 320,000 as on Mar'2014)



Next, I looked into the Average Salary Increase data and the productivity gain data and tried to see the impact if we change the Employee Strength from 300 K to 320 K at the end of Mar'2014. Check out the generated charts and the analysis below.

TCS Avg Productivity vs Avg Salary when Employee Strength is 300K as on Mar'2014



As per the chart above, TCS Employees received an average salary increase of 14.35% in 2014 vis-a-vis 11.81% in 2013. But a little asking around among TCS employees revealed that the Average Salary Increase was better in 2012 - 2013 than in 2013 - 2014. Then why do we see a different trend here? Now, take a look at the same chart regenerated with employee strength as 320 K. You can see that the Average Salary increase now matches what I heard from TCS Employees and is consistent with the past trend - the Average Salary increase changes from 14.35% to 7.21%.
 
Similar observations can be made with respect to Productivity Gain as well. A 19.57% Productivity gain in one financial year is near impossible unless there is a fundamental shift in Business Model and Revenue Pattern. With the correction of employee strength from 300 K to 320 K, that too gets corrected from 19.57% to 12.1%, a much more grounded and realistic number.

TCS Avg Productivity vs Avg Salary when Employee Strength is 320K as on Mar'2014


Lastly, take a look at the Utilization Ratio in the chart below. It has remained range bound over the years and no significant upside can be seen in 2014 w.r.t. 2013. So, when there is no significant change in Utilization Ratio and there is a dip in Employee Growth Y-O-Y, how would one explain such steep growth in Revenue and Operating Profit, and such a gain in Productivity, in 2014?

TCS Key Metrics along with Utilization Ratio

Based on the data and analysis above, it appears TCS did under report their total employee strength at the end of Mar 2014. This observation actually seems to be consistent with the news around layoffs as that's the only way to get rid of off-the-book employees. So, why did TCS do this?

To show (at least on paper) -

1. Superior growth in productivity number.
2. Higher Average Salary Growth. 

These two are important factors when it comes to valuation. Above par growth in Productivity numbers does indicate superior & improving asset quality, better strategy and execution and that commands higher premium. Higher Average Salary Growth gets reflected in lower attrition as well as increase in ability to attract talent. These factors help de-risk current and future revenue stream. Overall, these numbers are very critical and any change in their values will have impact on the valuation as well. The under reporting does mean that TCS was planning to lay off people in subsequent period. However, this seems to be a reactive step and does indicate the discomfort of TCS Management with the increase in employee strength in higher salary band.

This problem is not going to disappear any time soon. You can expect similar waves of employees in the higher salary band (basically the hires from 2006 and earlier) to keep bothering TCS Management for the next few years. The organisation will go through a period of turbulence, which may force the management to re-engineer the business model, leading to a period of uncertainty for TCS for the next few years.

As for the ongoing layoffs this year, there seems to be a sense of urgency and an accelerated pace of reducing the employees in the higher salary band. In fact, it's getting uglier now. This could be related to the fact that TCS under reported the employee strength at the end of Mar 2014 and as a result they have to do a catch up job now. So, while superior productivity numbers & a better average salary increase made the investors happy & the valuation sky rocketed, it also created a challenge for TCS to handle now. It brought unnecessary instability to the organisation and increased the risk. TCS needs to handle the employee layoffs more gracefully and better manage the investors' expectations to make the growth story sustainable. Hopefully they will report more grounded numbers that are closer to reality.

The valuation at current level does seem to be very high and I will not be surprised if we see a correction of 20% in next one to two years time (Rs 1900 - 2100 range).

This article is speculative in nature and a work of subjective interpretation of data available in the public domain. I am not a TCS employee and never was. I don't have any professional relationship with TCS whatsoever and hold no shares of TCS.



You can follow me at @satya_paul