Saturday, August 20, 2016

Stupid! You're Never Gonna Make Money in Gambling.

Recently I bought two dice thinking I would use them to play with my three-and-a-half-year-old daughter, hoping that might help enhance her maths skills. Unfortunately, she lost interest in the dice within the first 10 minutes, but they made me curious enough to spend a few more days with them and, finally, to write a Java program that can simulate rolling two dice millions of times at will.

A simple game involves rolling two dice and adding the numbers appearing on the faces landing upwards to determine the winning number. The person who bet on that number is the winner. Every die has 6 faces numbered 1 through 6, so the sum of the two dice will be one of these 11 numbers - 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. Interestingly, some of these numbers have a higher chance of appearing than others, simply because they can be produced by more combinations. For example, 2 can be produced only by 1 + 1, but 7 can be produced by 1 + 6, 2 + 5, 3 + 4, 4 + 3, 5 + 2 and 6 + 1. That makes 7 a far more lucrative option for a gambler to pick compared to 2. Overall there are 6 x 6 = 36 different ways of arriving at one of these 11 numbers.

Let's go ahead and spend some time looking into the data that the Java program generated and how gambling uses this variability in the probability distribution to give rich people an adrenaline rush. The program simulates rolling two dice by generating two random numbers in the range 1 - 6, sums them up to get the winning number, and repeats the steps as many times as you ask it to.
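
To make the mechanics concrete, here is a minimal sketch of such a simulation in Java. It is illustrative rather than the exact program I used (the class and variable names are just for this post), but it captures the logic: roll two dice, add them up, tally the sums.

import java.util.Random;

public class DiceSimulation {
    public static void main(String[] args) {
        int rounds = 10_000_000;              // number of rolls to simulate
        long[] frequency = new long[13];      // indices 2..12 hold the count for each sum
        Random random = new Random();

        for (int i = 0; i < rounds; i++) {
            int die1 = random.nextInt(6) + 1; // random number in the range 1 - 6
            int die2 = random.nextInt(6) + 1;
            frequency[die1 + die2]++;         // the sum decides the winning number
        }

        for (int sum = 2; sum <= 12; sum++) {
            System.out.printf("Number %d: %,d (%.2f%%)%n",
                    sum, frequency[sum], 100.0 * frequency[sum] / rounds);
        }
    }
}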

Let's generate the frequency distribution by rolling the two dice 10 million times.


| Number | Frequency | Freq in % |
------------------------------------------
| Number 2  | 277,148   | 2.77%  |
| Number 3  | 555,619   | 5.56%  |
| Number 4  | 831,981   | 8.32%  |
| Number 5  | 1,110,738 | 11.11% |
| Number 6  | 1,389,822 | 13.90% |
| Number 7  | 1,666,267 | 16.66% |
| Number 8  | 1,389,151 | 13.89% |
| Number 9  | 1,112,411 | 11.12% |
| Number 10 | 833,645   | 8.34%  |
| Number 11 | 555,447   | 5.55%  |
| Number 12 | 277,771   | 2.78%  |
| Total Rounds | 10,000,000 | 100% |
------------------------------------------

This distribution is very close to what you get from a simple combinatorial calculation: the number of ways to roll a sum n with two dice is 6 - |n - 7|, so the probability of n is (6 - |n - 7|) / 36. (Although, personally, I never found permutations & combinations simple.)

Now, let's make it a slightly more serious affair. To have a real gamble, we will bring in real money and a Table Owner who runs the game. The Table Owner takes a certain percentage as his cut, and the rest of the money goes to the winning bet.
  
The game will be played by 11 players. Each player bets only on one specific number and bets $10 per round. The Table Owner charges 20% of the amount betted. The game is played 2 million times, so the total money involved is $220 million.
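
A bare-bones sketch of this version of the game, in the same illustrative spirit as the earlier snippet: in every round each of the 11 players puts in $10, the Table Owner keeps 20% of the pot, and whoever bet on the number that comes up takes the rest.

import java.util.Random;

public class FixedBetGame {
    public static void main(String[] args) {
        int rounds = 2_000_000;
        double betPerPlayer = 10.0;
        double pot = 11 * betPerPlayer;            // 11 players bet $10 every round
        double ownerCut = 0.20 * pot;              // Table Owner charges 20% of the pot
        double prize = pot - ownerCut;             // the rest goes to the winning bet

        double[] playerWinnings = new double[13];  // index n = player who always bets on n
        double ownerTotal = 0;
        Random random = new Random();

        for (int i = 0; i < rounds; i++) {
            int winningNumber = (random.nextInt(6) + 1) + (random.nextInt(6) + 1);
            playerWinnings[winningNumber] += prize;
            ownerTotal += ownerCut;
        }

        for (int n = 2; n <= 12; n++) {
            System.out.printf("Player %d earned $%,.0f%n", n, playerWinnings[n]);
        }
        System.out.printf("Table Owner earned $%,.0f%n", ownerTotal);
    }
}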

| Player Position | Money Earned | % of Total Bet Amount | Net |
----------------------------------------------------------------
| Player 2  | $4,876,344  | 2%  |     |
| Player 3  | $9,773,192  | 4%  |     |
| Player 4  | $14,633,696 | 7%  |     |
| Player 5  | $19,573,752 | 9%  |     |
| Player 6  | $24,457,400 | 11% | +ve |
| Player 7  | $29,277,952 | 13% | +ve |
| Player 8  | $24,552,264 | 11% | +ve |
| Player 9  | $19,553,160 | 9%  |     |
| Player 10 | $14,669,600 | 7%  |     |
| Player 11 | $9,750,488  | 4%  |     |
| Player 12 | $4,882,152  | 2%  |     |
| Table Owner | $44,000,000 | 20% |   |
| Spend Per Player | $20,000,000 | 9% | |
| Total Amount Betted | $220,000,000 | 100% | |
----------------------------------------------------------------

As you can see, the Table Owner gains the most. Apart from him, only three players made money - Player 6, Player 7 & Player 8 - while the other eight players lost money.

But we know that was not realistic. How can you expect a gambler to try his luck with only one number? Let's remove that restriction along with a few others: the number of players can be anything between 1 and 11, players can bet any amount from $10 to $1,000 on any number, and the Table Owner has no pre-defined cut. However, we will still not allow more than one bet to be placed on the same number. The number of players and the amounts they bet in a particular round are randomly determined by the program. The game is played 1 million times.


TOTAL AMOUNT BETTED BY EACH PLAYER POSITION
------------------------------------
| Player Position 2  | $275,613,052 |
| Player Position 3  | $274,947,829 |
| Player Position 4  | $276,025,231 |
| Player Position 5  | $275,850,639 |
| Player Position 6  | $275,374,362 |
| Player Position 7  | $275,526,040 |
| Player Position 8  | $275,197,852 |
| Player Position 9  | $274,973,749 |
| Player Position 10 | $275,777,883 |
| Player Position 11 | $275,233,244 |
| Player Position 12 | $275,308,337 |
| Total Betted Amount | $3,029,828,218 |
------------------------------------

TOTAL AMOUNT WON BY EACH PLAYER POSITION
-----------------------------------------------
| Player Position 2  | $57,980,399  | 1.91%  |     |
| Player Position 3  | $118,038,318 | 3.90%  |     |
| Player Position 4  | $175,887,913 | 5.81%  |     |
| Player Position 5  | $234,566,644 | 7.74%  |     |
| Player Position 6  | $292,437,606 | 9.65%  | +ve |
| Player Position 7  | $350,616,777 | 11.57% | +ve |
| Player Position 8  | $294,861,804 | 9.73%  | +ve |
| Player Position 9  | $233,939,146 | 7.72%  |     |
| Player Position 10 | $176,005,502 | 5.81%  |     |
| Player Position 11 | $117,401,400 | 3.87%  |     |
| Player Position 12 | $58,484,553  | 1.93%  |     |
| Table Owner | $919,608,156 | 30.35% | |
| Total Amount Betted | $3,029,828,218 | 100.00% | |
-----------------------------------------------

Take a look: in this case there was no pre-determined cut for the Table Owner, but he still made close to a billion ;). As for the playing positions, the same trend continues, i.e. positions 6, 7 and 8 gave positive returns and every other number lost money. If that's the case, why would anyone put money on any other number? They wouldn't put money anywhere other than 7. Obviously, this is not a fair game.

So, let's design a Fair Game. The goal is to neutralise the effect of position by giving differential returns.

Here are two tables showing the return for every $1 you bet on each number, if yours is the winning bet:


Table Owner doesn't charge anything
----------------------
| Number 2  | $36 |
| Number 3  | $18 |
| Number 4  | $12 |
| Number 5  | $9  |
| Number 6  | $7  |
| Number 7  | $6  |
| Number 8  | $7  |
| Number 9  | $9  |
| Number 10 | $12 |
| Number 11 | $18 |
| Number 12 | $36 |
----------------------

Table Owner takes a cut of 20%
----------------------
| Number 2  | $29 |
| Number 3  | $14 |
| Number 4  | $10 |
| Number 5  | $7  |
| Number 6  | $6  |
| Number 7  | $5  |
| Number 8  | $6  |
| Number 9  | $7  |
| Number 10 | $10 |
| Number 11 | $15 |
| Number 12 | $29 |
----------------------

Take another look at the tables above and you will realise that the return on the winning number looks pretty good and, at the same time, it's fair. No number has any advantage over another in terms of overall return over a sufficiently large number of games. At the same time it can accommodate the risk appetite of any gambler by giving differentiated returns, inversely proportional to the probability of a number coming up as the winning bet. Voila! We just designed a game of gamble that is probably the closest to what you would see in a casino.
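
If you want to see where those payout numbers come from: the number of combinations that produce a sum n is 6 - |n - 7|, a fair payout per $1 is 36 divided by that count (so that probability times payout is exactly 1), and the 20% cut simply scales it by 0.8. Here is a tiny sketch that reproduces the two tables above, up to rounding:

public class FairPayoffTable {
    public static void main(String[] args) {
        double cut = 0.20;                       // Table Owner's cut; set to 0 for the first table
        for (int n = 2; n <= 12; n++) {
            int ways = 6 - Math.abs(n - 7);      // combinations of two dice that produce the sum n
            double fairPayout = 36.0 / ways;     // (ways / 36) * payout = 1, i.e. a fair bet
            double payoutWithCut = (1 - cut) * fairPayout;
            System.out.printf("Number %d: fair $%.0f, with cut $%.0f%n",
                    n, fairPayout, payoutWithCut);
        }
    }
}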

Let's check the returns for a few numbers - 7, 8 and 12 - over 3 million bets on each of them, when the Table Owner takes a 30% cut and each bet is $10. The returns are similar irrespective of the probability of the number coming up.

Gambler betting on 7
-----------------------------------------------
| Table Owner | $9,526,117 | 31.75% |
| Player 7 | $20,473,883 | 68.25% |
| Total Amount Betted | $30,000,000 | 100.00% |
-----------------------------------------------

Gambler betting on 8
-----------------------------------------------
| Table Owner | $9,218,150 | 30.73% |
| Player 8 | $20,781,850 | 69.27% |
| Total Amount Betted | $30,000,000 | 100.00% |
-----------------------------------------------

Gambler betting on 12
-----------------------------------------------
| Table Owner | $9,019,975 | 30.07% |
| Player 12 | $20,980,025 | 69.93% |
| Total Amount Betted | $30,000,000 | 100.00% |
-----------------------------------------------

Next, I am going to try the same game, i.e. a fair payoff matrix with the Table Owner taking a cut, but with the rest of the factors randomised. In the example below the number of players ranges from 6 to 20, the game runs for anywhere between 10 and 100,000 rounds, the bet amount ranges from $10 to $1,000 and the Table Owner takes a cut of 20%. The outcome is similar: on average, all the players lose roughly the amount that the Table Owner charges.

No. of Rounds: 9,973
No. of Players: 18

TOTAL AMOUNT BETTED BY GAMBLERS
-----------------------------------------
| Player 0  | $5,032,736 | 5.55% |
| Player 1  | $5,014,831 | 5.53% |
| Player 2  | $5,090,774 | 5.61% |
| Player 3  | $5,022,313 | 5.54% |
| Player 4  | $5,064,714 | 5.58% |
| Player 5  | $5,069,510 | 5.59% |
| Player 6  | $5,003,299 | 5.52% |
| Player 7  | $5,052,238 | 5.57% |
| Player 8  | $5,041,264 | 5.56% |
| Player 9  | $5,039,512 | 5.56% |
| Player 10 | $5,049,130 | 5.57% |
| Player 11 | $5,049,573 | 5.57% |
| Player 12 | $5,015,401 | 5.53% |
| Player 13 | $5,024,943 | 5.54% |
| Player 14 | $5,047,221 | 5.56% |
| Player 15 | $5,024,303 | 5.54% |
| Player 16 | $5,026,509 | 5.54% |
| Player 17 | $5,028,077 | 5.54% |
| Total Amount Betted | $90,696,348 | 100.00% |
-----------------------------------------

Total Amount Won
--------------------------------------------
| Player 0  | $4,090,285 | 4.51% |
| Player 1  | $3,955,758 | 4.36% |
| Player 2  | $4,209,126 | 4.64% |
| Player 3  | $4,013,216 | 4.42% |
| Player 4  | $4,310,571 | 4.75% |
| Player 5  | $4,083,093 | 4.50% |
| Player 6  | $3,910,301 | 4.31% |
| Player 7  | $4,170,266 | 4.60% |
| Player 8  | $3,931,574 | 4.33% |
| Player 9  | $4,144,855 | 4.57% |
| Player 10 | $4,211,682 | 4.64% |
| Player 11 | $4,117,135 | 4.54% |
| Player 12 | $3,904,212 | 4.30% |
| Player 13 | $4,074,304 | 4.49% |
| Player 14 | $4,022,716 | 4.44% |
| Player 15 | $3,983,948 | 4.39% |
| Player 16 | $4,161,625 | 4.59% |
| Player 17 | $3,849,358 | 4.24% |
| Table Owner | $17,552,323 | 19.35% |
| Total Amount Won | $90,696,348 | 100.00% |
--------------------------------------------

If you generated the payoff matrix without paying anything to the Table Owner, the long-term returns for the gamblers would reflect that, i.e. the amount betted and the amount won would be almost the same. But that's not how the real world works. Also, such patterns hold good only over a sufficiently large number of games. For a smaller number of games the results can be very unpredictable: gamblers may make a killing or may lose significantly. In fact, the Table Owner may not make any money, or may end up earning much more than the planned cut.

Across all of the data shared above, one point becomes pretty obvious: a gambler is never going to make money in the long run, irrespective of which number he puts his bet on. That's true even if he spends all his wealth and his entire lifetime playing a fair game of gamble. Gambling is meant for the adrenaline rush, and one pays for that service. That's it. No one is smart enough to make money from gambling over a period of time, unless they are the one running the Game of Gamble.

Hope you enjoyed reading the post.


Sunday, October 25, 2015

10 Things you must know about Charts, Graphs & Visualization

"In God we trust, all others must bring data" said W. Edwards Deming. But we all know it is easier said than understood when data is actually presented in front of you. Often you need a statistician to interpret it. And they are rare! What you need is a straight and simple way to represent data that brain can interpret relatively easily. Here comes the need for Charts, Graphs and other visualization.

I started exploring various charting and graphing libraries to generate insights for my hobby project www.fanffair.com. The goal was to identify the top posts based on statistical analysis of the Like, Share and Comment data associated with each post and share the top posts on the fanffair Facebook Page. That in turn led to a second hobby project of mine, www.statspanda.com. In this blog post, I am going to share my notes, learnings & observations as I sifted through various charting libraries ranging from Google Charts, morris.js, raphael.js, xCharts, NVD3, Flot charts, dygraphs, Rickshaw, Highcharts et al. to D3.js.

All of the above are charting libraries that you either download or use as a hosted library and then invoke their JavaScript methods to create a chart or graph in your web application. Some of them are pretty straightforward to use, while libraries like D3.js require some amount of programming skill and knowledge of HTML, JavaScript etc. In return, D3.js gives you all the power and flexibility for creating exotic visualizations. You can also pick from the visualizations created and curated by Mike Bostock.

If you want to avoid either of those options, you can go for online services that take your data and generate the charts and graphs online. There are many such services, some of them really big. Many of them are often categorized as data analytics companies, as they deal a lot with the data layer. I have provided a list of such services/companies in a later part of this post. In fact, statspanda.com qualifies for this category.

There is a fourth category, where companies sell ready-to-use dashboard software. These come with pre-built charts and graphs, but are not limited to charting. One needs to customize, implement and host them on one's own.

To start with, I will cover various chart types and some basics around their usage. I will also briefly touch upon the technologies used behind these charting libraries. Then I will move on to talk very briefly about some of the libraries I have explored closely. This will be followed by a list of companies that sell Charts and Graphs as a service or as dashboard software.


Popular Chart Types (#1)

  • Line Chart
  • Spline Chart
  • Bar Chart
  • Pie Chart
  • Doughnut Chart
  • Area Chart
  • Bubble Chart
  • Scatter Plot
  • Bullet Chart
  • Gauge Chart
  • Combination Chart
  • Creative Visualization

Sample Charts generated using StatsPanda.com



Factors Behind Making a Choice (#2)

There are a host of factors that may influence your choice of a particular type of chart or visualization. Here are a few factors that I could compile:

Number of Data Points - this is probably the most important factor behind making a choice. For example, a Pie Chart may not work very well if you have more than, say, 30 data points, but a Line Chart may still be a good choice. Similarly, an Area Chart or Line Chart can represent a very large set of data. However, remember that the number of data points also depends on whether it's low-level or high-level data: millions of raw data feeds may translate into a couple of data points once aggregated.

Types of Data Points - Next, check whether you are dealing with only one type of data or whether there are multiple different types that need to be plotted. Combination charts are often used if the data types are different or different sets of data need to be represented. Example: using a combination of a Bar Chart & a Line Chart to represent the revenue and profit numbers of one company, versus using multiple Line Charts to represent the revenue numbers of different companies.

Relationship of Data Points - If multiple data types are being presented in a Combination Chart, then whether the different types of data are related or not also influences the choice of chart type.

Complexity of Relationship - How the data sets are related matters in the selection of a chart. Complex relationships, like many-to-many relations or multiple relations between two data points, are not easy to represent in regular Charts & Graphs and call for creative visualization.

Static vs Animation vs Flow Representation - Most of the Charts and Visualizations really capture a snapshot or static state of data. In certain cases you may need to show a transition from one snapshot of data to the other (i.e. animated) or capture the states that the data pass through (i.e. data flow). A good example is Sankey Chart.   

A general rule of thumb, as per a paper from IBM, on how many data points each chart type handles well:
  • Pie chart: 3-10
  • Bar chart: fewer than 50
  • Line chart: fewer than 500
  • Bubble plot: fewer than 500
  • Scatter plot: fewer than 10,000
  • Creative Visualization comes handy for data points > 10,000  
Rendering Engines (#3)

SVG - Scalable Vector Graphics (SVG) is an XML-based vector image format for two-dimensional graphics with support for interactivity and animation. The SVG specification is an open standard developed by the World Wide Web Consortium (W3C) since 1999. It is widely supported; traditionally, Microsoft Internet Explorer was a laggard, but SVG is supported natively starting with IE 9. Some of the popular libraries like morris.js, D3.js and others built on top of D3.js use SVG.


Canvas (HTML5) - Introduced as part of HTML 5 spec. example: flot, chartjs, jqPlot and other Javascript based visualisation libraries.

VML - Vector Markup Language (VML) is an XML-based file format for two-dimensional vector graphics. Developed and promoted by Microsoft.

Chart Performance (#4)


Check out Charts performance at http://jsperf.com/charts-comparison-d3-js-kendo-highcharts-echart-flot-gr 



HTML5 Based Libraries (#5)


chartjs.org - Offers six chart types. HTML5 Canvas based, responsive. One of the best options for the chart types it supports.


canvasjs.com - Pretty good & rich collection of Charts. Free for non-commercial use.


D3.js & D3.js Based Libraries (#6)

d3.js is the best and gives the highest amount of flexibility, but that comes at the cost of added complexity in terms of the programming skills needed to develop a chart or graph. If you want to avoid that, please refer to some of the libraries below that are built on top of D3.js and are pretty straightforward to use.

NVD3 d3.js based charts library, has good options. Some of them are unique when compared to other generally available charts e.g. Scatter / Bubble Chart, Stacked Area Chart. Reads data from CSV and other text formats.

C3.js - Another D3.js based charting library. Pretty comprehensive set of charts, simple and easy to use, provides good documentation and available under The MIT License (MIT). C3.js has a pretty active Google Group, easy to get support.

Rickshaw built on top of d3.js. Very neat library for creating time series graphs, line chart, bar chart etc. It is free, open source and available under MIT License

http://d3pie.org - nice pie charts, offers good options. Available under The MIT License (MIT)

xCharts - D3.js based library. Uses HTML, CSS, SVG. Default charts have polished look but very limited options. It was developed by https://www.tenxer.com/ and has been made free with no strings attached.

dimplejs - it's crazy, it's very good. Reads data from CSV and other text formats. It's an open source project by http://align-alytics.com/

dc.js - It's amazing. Just look at this - http://dc-js.github.io/dc.js/ . It's available under Apache License, Version 2.0 (the "License")
Few cool examples - http://dc-js.github.io/dc.js/examples/cust.html

http://d3plus.org Built on top of d3.js, simple to use. It has limited set of examples/apis but some of them are pretty good e.g. Geo Map, Tree Map, Simple Network.

https://github.com/mbostock/d3/wiki/Gallery - d3 example library in one page.

http://bl.ocks.org/mbostock - Amazing Charts and Visualization examples by Mike Bostock, the creator of D3.js and many other libraries used for rich visualization.

http://bost.ocks.org/mike/ - Another repository by Mike Bostock.



Pure JavaScript (#7)

Flot Charts - Flot Charts are pure Javascript library. 


Google Charts - It has a very rich set of charts & graphs, simple to use. You may like to look at the AngularJS Google Chart Tools directive as well.


jqPlot - Pure Javascript charting library.  Offers good set of options. Look and feel is not very polished. 

morris.js - raphael.js based library. Simple to use. Cool fluid look and feel. Limited options, most of the common scenarios are covered. Uses HTML, CSS, SVG

dygraphs - has good options for time series graphs but look and feel is not very polished. However, dygraphs can handle huge datasets running into millions of data points. It takes .txt and .csv file as inputs.


Mini Charting Libraries (#8)


Peity is a simple jQuery (jquery.com) plugin that converts an element's content into a mini pie, line or bar chart.

jQuery Sparkline generates sparklines (small inline charts) directly in the browser using data supplied either inline in the HTML, or via javascript.


Commercial Offerings (#9)

This section covers a list of Charts and Graphs libraries. 

HighCharts - Offers very rich set of options. Probably one of the best libraries. Available free for non-commercial use, under Creative Commons License

Canvas JS - HTML5 JavaScript charting library with a simple API and a claimed 10x better performance compared to SVG/Flash based charts. Charts are responsive & can run across devices including iPhone, Android, desktops, etc. Offers a good set of Charts and Graphs.

ZingChart - Javascript Charting library. Offers a rich set of options. Visually appealing and good for large data sets. 

JSChart - Javascript Charts. Limited options.

Ember Charts - Open source. Offers limited popular options of Charts, Tables etc.


Charts & Graphs as Service (#10)

statspanda.com : Offers creation and hosting of Charts, Graphs & Dashboards as Service. Purely REST Api driven, also provides a REST API Console.

tableau.com : Grand Daddy of Visualisation. Offers Desktop, Cloud based service as well as Back-end data integration. Has huge array of very creative Charts Graphs and Dashboards. 

datawrapper.de : Creation of embeddable Charts, Maps. Primarily used by publishing, news and media companies. 

chartio.com : Provides a data integration layer with various sources like CSV, Amazon Redshift or Stripe. It then converts the data into intuitive visualisations such as Charts, Graphs etc.

chartblocks.com : Basic chart-building tool. It reads data from a spreadsheet and gives you tools to customise the chart. The charts can be shared on popular social media or embedded as an iFrame.

infogr.am : Create charts and infographics and publish them easily, primarily used by the news and media companies. 

jaspersoft.com : Create Charts, Graphs, Dashboard and embed or publish. Good integration with Amazon AWS (RDS, Redshift) . Owned by TIBCO. Focuses on Apps and On-prem Applications for embedding the Charts, Graphs and Dashboards created/hosted on their platform.

Amazon QuickSight : Works with AWS dataset to create visualisation. Offers good set of Graphs, Tables and other Visualisation. Amazon QuickSight uses a new, Super-fast, Parallel, In-memory Calculation Engine (“SPICE”) to perform advanced calculations and render visualizations rapidly. 

domo.com : Offers good set of connectors. Dashboards are designed for specific roles and for different industry verticals. Rich set of chart, graphs and other visualization are offered. Provides good integration with Amazon. 

gooddata.com : Primary positioning as Business Intelligence platform for real time analytics. Offers good set of charts, graphs and dashboard options. Has good set of Connectors.

keen.io : Keen IO is an API platform that lets developers collect and study custom events at a massive scale and converts them to visualisation. It's designed in a way so that users can embed the visualisation like Charts, Dashboards in their apps, websites. All popular Charts and Graphs are available for the Dashboards. 

chartbeat.com : Focused at Advertising and Publishing industries. Provides insights using Charts, Graphs and other visualisation.

vida.io : Create attractive visualisation, embed and use them, can create Dashboard. 

plot.ly : Plotly is the data visualization and collaboration platform for engineers and data scientists. Provides integration with Python, Excel, MATLAB & R. Offering available in three different formats - Cloud, On Prem, and Desktop Tool. 

zoomdata.com : Focused at Visualisation for Big Data, can handle large volume of data. Provides integration with Hadoop, NoSQL, ElasticSearch, Solr and Spark.

datahero.com : Provides connectors to a large array of cloud-based data sources like Box, Dropbox, Stripe, Hubspot, MailChimp, Google Drive, Google Analytics, MixPanel etc. You can then convert the data into a Chart, Graph or other visualisation. You can create one dashboard composed of multiple Charts and Graphs from different data sources. Has focused offerings for various industry verticals.

getdataseed.com : Provides data exploration tools that can import data from Excel spreadsheets or fetch it from Dropbox, Google Drive or any public URL. Can convert the data into Charts, Graphs, Maps and other visualisations. Provides REST API integration.

anaplan.com : Excel Spreadsheet on Cloud with ability to create crisp Charts, Graphs, Dashboards and other visualisation on the fly. Excel Spreadsheet like data is editable. Has focused offerings around Finance, Sales, Operations and HR.

collabion.com : Creates Charts, Graphs and Dashboards from SharePoint data. Needs to be downloaded and installed, and requires server licenses.

www.domo.com : Offers good options of Charts, Graphs and other visualisation. Provides wide range of connectors for various datasources and Apps ranging from Excel, Google sheets, Box, Facebook, Marketo, Salesforce etc. Solutions are tailored for various roles, industry verticals and operations.

www.klipfolio.com : Offers creation of company Dashboards composed of cool Charts, Graphs and other visualisation. Provides a wide range of connectors to all major datasources and various options for data infusion. 

Microsoft Power BI : Available as Desktop Application and mobile Apps for iOS, Android and Windows Phones. Wide range of Charts, Graphs can be glued to a Dashboard quickly, supports drag and drop features. Provides wide range of Datasource connectors. Supports natural language query inside a Dashboard.

RJMetrics.com : Offers creation of Charts, Graphs and Dashboards on the Cloud. Has good data integration with popular data sources. RJMetrics Pipeline transfers the data to Redshift. Good for large volumes of data, from less than 5 million rows to upwards of 500 million rows synced.

http://kilometer.io : SaaS Analytics Tool

https://chartmogul.com : Subscription Analytics

gramener.com : They are into creative visualization. Provides visualization of the data from various business operations like Sales, Marketing, HR etc.

Finally, I still feel this post looks like a work in progress. In fact, I plan to spend more time taking a deeper look into all the companies listed under #10. However, I hope you found the post useful even in its current shape and form.

Tuesday, September 1, 2015

How to Add Charts, Graphs and Visualization to a Blog Post

Over the last few months, I have written a few blog posts where I used pretty sophisticated Charts, Graphs and Visualizations. They make a blog post a lot more meaningful and readable, and readers love those charts, tables & displays. In this post, I will share the service I use for those visuals and a few quick steps on how to create a chart and add it to your blog.

I use the services of http://www.StatsPanda.com, it's in public beta and it's free for "Individual" users.

StatsPanda.com Home Page


Once you login using your Facebook account, it redirects to registration page and during registration it gives two options - "Individual" and "Enterprise". Choose "Individual".

Go to API Console. I will suggest that you spend some time exploring the listed Visualization APIs. You may have to spend some time trying out the APIs in order to understand the JSON Inputs structures. It's not super complicated, although it may take some time. Initially you can use the example input data provided along with API Documentation.

API Console to create Charts, Graphs etc


Once decided, go ahead and create the Chart of your choice and that would give you a unique url and iFrame Code for your chart. For this blog I created a Stacked Area Chart with the example data that API Console provided. In the above screenshot, you can see a greyed out box right below the chart. The content of that box is copied below for your reference. You can straight away use that in your blog.

<iframe src="http://www.statspanda.com/charts/ui/nvd3/stacked-area-chart?unique_key=dc27b99e933da8969218fb3d8e06aeafdc61e170dad7c5aed1b5d41381b2d8" style="border: #E8E8E8 1px solid; height: 300px; width: 100%;"></iframe>

Here is an example of how the visualisation actually appears in a blog once you include the iFrame code snippet:



The chart is dynamic and not a JPG image; the data is served from the website. In fact, if you want, you can "Edit" the input JSON data and get a different chart of the same type.

Hopefully you find it useful and able to use the charts and graphs in your blogs.



Wednesday, August 5, 2015

Top 10 Machine Learning Libraries and Services for Java Developers

In recent times, Machine Learning has emerged as one of the most talked-about topics in the field of information science and technology. Although the subject has probably intrigued researchers and academicians for decades, it's only now that every Software Engineer is trying to get a hang of it. It's changing the face of computing forever and, in a way, accelerating the move from Software Eating The World to Software Eating Software.

In this blog, I will cover (rather list out) some of the top Machine Learning libraries and services available for a Java developer and try to highlight some of the salient points associated with each of those libraries. Please note some of the most powerful machine learning libraries are in Python and they are not covered in this post at all.

Before I go to the list of libraries, here is a short list of algorithms or broad categories of problems that most of the Machine Learning libraries would cover either fully or partially -

  • Classification
  • Regression
  • Clustering
  • Ranking

Here are a few examples of applications:
  • Outlier Detection
  • Recommendation
  • Natural Language Processing
  • Neural Networks

1. Apache SPARK MLlib 

2. Deeplearning4j - http://deeplearning4j.org
  • One of the top ML Libraries in Java
  • Integrates with Hadoop, Spark
  • Use Cases
    • Face/image recognition
    • Voice search
    • Speech-to-text (transcription)
    • Spam filtering (anomaly detection)
    • E-commerce fraud detection
    • Regression 


3. Apache Mahout - http://mahout.apache.org
  • Runs on Hadoop Cluster, so infinite scalability
  • Good for recommendations

4. Google Prediction API (as service) - https://cloud.google.com/prediction

Types of problems where you may find a few ready examples:
  • Classification
  • Regression 
Google Prediction provides two types of APIs - one that leverages the hosted models, and another where you train the model with your own (sufficiently large) data and then use the APIs to predict.


5. IBM Watson + AlchemyAPI (as service) - http://www.ibm.com/smarterplanet/us/en/ibmwatson


You can try Alchemy APIs at http://www.alchemyapi.com/products/demo . One of the newest and a very significant acquisition by IBM is Alchemy API. Here are the two primary offerings from Alchemy API -
  • AlchemyLanguage - Text Analytics and Natural Language Processing.
  • AlchemyVision - Leverages deep learning for photo and image processing.
Easy to get registered and get started.


6. MS Machine Learning (as service) - http://azure.microsoft.com/en-in/services/machine-learning

Microsoft has made a big deal of coming up with an intuitive UI where users can create a model, train it and run the analytics, all with drag and drop. For a new user it will take some time to get accustomed to the various UI controls and how the application works. Developers can create their own models and sell them in the Azure Marketplace.

Outlook account works seamlessly. 

7.  Amazon AWS Machine Learning - https://aws.amazon.com/machine-learning 

  • Provides visualisation tools to create ML Models
  • Simple API support for the models generated this way
  • Highly scalable, can generate billions of predictions a day
8. Weka - http://www.cs.waikato.ac.nz/ml/weka
  • Provides a graphical user interface, a command line interface and a Java API (a short usage sketch follows)
  • One of the most popular Java machine learning libraries
  • Available under the GPL License 
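
A minimal sketch of how the Weka Java API is typically used (this assumes a local dataset file named data.arff with the class label as the last attribute; package and class names as in the Weka 3.x API):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaQuickStart {
    public static void main(String[] args) throws Exception {
        // Load a dataset (ARFF/CSV) and mark the last attribute as the class label
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree and evaluate it with 10-fold cross-validation
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
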
9. Mallet - http://mallet.cs.umass.edu
  • Statistical natural language processing, document classification, clustering, topic modeling and information extraction.
10. H2O - http://0xdata.com
  • In-memory data engine
  • Designed for running various types of statistical computations (including Deep Learning)
  • Works with Hadoop Distributed File System

Others deserving a mention but which could not make it to the top 10:

JSAT - https://code.google.com/p/java-statistical-analysis-tool
  • Library for quickly getting started with Machine Learning problems
  • Available under GPL 3 but author is open for discussion
  • List of supported algorithms is impressive
  • One man project, done in his free time. Creator is Edward Raff @EdwardRaffML

LensKit - http://lenskit.org
  • Focused on building recommender system, primarily for research based projects
  • Good for trying out. For scale and for production env, one can move to Apache Mahout.
  • Actively developed, managed.
oryx - https://github.com/cloudera/oryx
  • Built on top of Mahout
  • Supports streaming instead of batch jobs, making it realtime
  • Still in early stage. 
Java-ML - http://java-ml.sourceforge.net
  • Provides a collection of algorithms
  • No new release since 2012
Hopefully you will find this short post useful. Leave your feedback and comments.

Connect to me on twitter: @satya_paul

Sunday, January 11, 2015

Did TCS under report employee strength in 2014?

In the wake of recent news of layoffs at TCS, I started looking into the TCS annual reports. As I sifted through the annual reports of the last 10 years, something did not seem quite right about the employee strength reported in the 2014 Annual Report.

Since, for an IT services company, revenue growth is usually directly proportional to growth in employee strength, it seemed odd that the Y-o-Y growth in Operating Profit was the highest in 2014 while employee growth was the lowest in the history of TCS. So, I started looking into the details of the data, based on TCS annual reports since 2005.

All the Charts are reproduced courtesy of www.StatsPanda.com

TCS Y-O-Y Key Growth Data (Revenue, Op Margin, Productivity, Avg Salary, Employee Growth)

 

In 2014, Operating Profit growth (39.43%) and Productivity gain (19.57%) were the highest, while Employee growth (8.62%) was the lowest in the history of TCS. That surely raises some eyebrows.

So, I decided to regenerate the same chart after tweaking the employee strength from 300,000 (as reported in AR 2014) to 320,000. Now, take a look at the chart below. The numbers look much more realistic - Productivity gain (12.1%) & Employee growth (15.86%).

TCS Y-O-Y Data (with Employee Strength as 320,000 as on Mar'2014)



Next, I looked into the Average Salary Increase data and the productivity gain data and tried to see the impact if we change the Employee Strength from 300 K to 320 K at the end of Mar'2014. Check out the generated charts and the analysis below.

TCS Avg Productivity vs Avg Salary when Employee Strength is 300K as on Mar'2014



As per the chart above, TCS employees received an average salary increase of 14.35% in 2014 vis-a-vis 11.81% in 2013. But a little bit of poking around among TCS employees revealed that the average salary increase was actually better in 2012-13 than in 2013-14. Then why do we see a different trend here? Now, take a look at the same chart regenerated with an employee strength of 320K. The average salary increase now matches what I heard from TCS employees and is consistent with the past trend - it changes from 14.35% to 7.21%.
 
Similar observations can be made with respect to the productivity gain. A 19.57% productivity gain in one financial year is near impossible unless there is a fundamental shift in the business model and revenue pattern. With the employee strength corrected from 300K to 320K, even that gets corrected from 19.57% to 12.1%, a much more grounded and realistic number.
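
A quick back-of-the-envelope check of that adjustment, assuming productivity and average salary are simply revenue and salary cost divided by headcount, so that moving the March 2014 headcount from 300,000 to 320,000 scales every per-employee figure by 300/320:

public class HeadcountAdjustment {
    public static void main(String[] args) {
        double scale = 300_000.0 / 320_000.0;   // reported headcount vs assumed headcount at Mar 2014
        System.out.printf("Productivity gain:   %.1f%%%n", (1.1957 * scale - 1) * 100); // ~12.1%
        System.out.printf("Avg salary increase: %.1f%%%n", (1.1435 * scale - 1) * 100); // ~7.2%
        System.out.printf("Employee growth:     %.1f%%%n",
                (320_000.0 / (300_000.0 / 1.0862) - 1) * 100);                          // ~15.9%
    }
}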

TCS Avg Productivity vs Avg Salary when Employee Strength is 320K as on Mar'2014


Lastly, take a look at the Utilization Ratio in the chart below. It has remained range-bound over the years, and no significant upside can be seen in 2014 with respect to 2013. So, when there is no significant change in the Utilization Ratio and there is a dip in Y-o-Y employee growth, how would one explain such steep growth in Revenue and Operating Profit, and such a gain in Productivity, in 2014?

TCS Key Metrics along with Utilization Ratio

Based on the data and analysis above, it appears TCS did under-report their total employee strength at the end of March 2014. This observation actually seems consistent with the news about layoffs, as layoffs are the only way to get rid of off-the-book employees. So, why would TCS do this?

To show (at least on paper) -

1. Superior growth in productivity number.
2. Higher Average Salary Growth. 

These two are important factors when it comes to valuation. Above-par growth in productivity numbers indicates superior & improving asset quality and better strategy and execution, and that commands a higher premium. Higher average salary growth gets reflected in lower attrition as well as an increased ability to attract talent. These factors help de-risk the current and future revenue stream. Overall, these numbers are very critical, and any change in their values will have an impact on the valuation as well. The under-reporting would also mean that TCS was already planning to lay off people in the subsequent period. However, this seems to be a reactive step, and it does indicate the discomfort of TCS management with the increase in employee strength in the higher salary bands.

This problem is not going to disappear any time soon. You can expect similar waves of employees in higher salary bands to keep bothering TCS management for the next few years (basically the hires from 2006 and earlier). The organisation will go through a period of turbulence that may force the management to re-engineer the business model, leading to a period of uncertainty for TCS over the next few years.

As for the ongoing layoffs this year, there seems to be a sense of urgency and an accelerated pace in reducing the number of employees in the higher salary bands. In fact, it's getting uglier now. This could be related to the fact that TCS under-reported the employee strength at the end of March 2014, and as a result they have to play catch-up now. So, while superior productivity numbers & better average salary increases made the investors happy & the valuation sky-rocketed, they also created a challenge that TCS has to handle now. It brought unnecessary instability to the organisation and increased the risk. TCS needs to handle the employee layoffs more gracefully and better manage investors' expectations to make the growth story sustainable. Hopefully they will report more grounded numbers, closer to reality.

The valuation at the current level does seem very high, and I will not be surprised if we see a correction of 20% in the next one to two years (the Rs 1,900 - 2,100 range).

This article is speculative in nature and a work of subjective interpretation of data available in the public domain. I am not a TCS employee and never was. I don't have any professional relationship with TCS whatsoever, and I hold no TCS shares.



You can follow me at @satya_paul 

Tuesday, May 13, 2014

Distributed Datastores - let's take a look under the hood

This one is a continuation of my last post, where I looked into various alternatives to traditional RDBMS databases. In this post I will cover some of the basics and go over the factors that influence the choice of a datastore in general and NoSQL databases in particular. I will also cover the trade-offs associated with the choice of a datastore.

Fundamentally, a Database is a specialized software system that allows you to write/store ( i.e. create, update, delete), read, and even do some amount of processing of the data e.g. executing the aggregate functions.

In a world dominated by RDBMSs, databases are expected to be ACID compliant; in fact, it is a must-have and an important measure of quality. This is the case with all RDBMSs, and they have been doing that job fairly well for many decades. So, what changed recently? At the core, there are a handful of needs that became very important:
  • Increased Complexity of relationship
  • Need for Flexible Data Structure
  • High Availability
  • Scalability (typically referred as Web Scale) 
Increased Complexity of relationship between entities is handled well by Graph Databases. Typical applications include recommendations, social network etc.

Document Databases do exceedingly well when it comes to supporting Flexible Data Structures. Column Family Databases also provide some amount of flexibility, each row can have a different set of attributes. However, in this post, without getting into further details on those factors, I will shift the focus on last two points and explore how various parameters really influence the choice.

So, how does anyone achieve High Availability (HA) for any system? By building redundancy into the system, and databases are no exception: they create replicated failover nodes. Failover nodes are exact replicas of the master node and remain passive unless required. Usually, databases ensure HA this way, but the challenge is different when it comes to distributed, partitioned databases. Second, it is one thing to ensure HA against a node or machine failure, and an entirely different thing to ensure that the DB is never unavailable should there be a network, machine, power or any other failure, e.g. a data center going down. Typically this is achieved by putting replicas across different data centers spread over different geographies, and those replicas are not offline. This is also known as Geographically Distributed High Availability (GDHA). Thus network partition tolerance becomes critical. Not all databases support GDHA. Note that GDHA is more than Disaster Recovery (DR), where the replicated nodes remain offline and are used only when a disaster hits the master node. Usually the focus of DR systems is not limited to databases; they keep the entire stack ready.

The other big issue is Scalability. How much can a database (read: RDBMS) grow? It can grow as much as the largest machine will allow it to. But what if you hit that ceiling too? The obvious answer would be to add a second machine. That's correct, but then can the database still meet the important quality measure called ACID, or be made highly available, when some of the operations (i.e. reading, writing or processing of the data) are happening on distributed systems? The simple answer is NO, and that's when you start looking into the trade-off matrix. You take a second look at the operations as discrete activities and take a call on what is critical for your business and what you can give up.

Before we go any further, let's put down the definition of ACID for reference:
  • Atomic: Atomicity refers to the ability of the database to guarantee that either all of the tasks of a transaction are performed or none of them are. 
  • Consistent: The consistency property ensures that the database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful or not).
  • Isolated: Isolation refers to the requirement that other operations/transactions cannot access or see the data in an intermediate state during a transaction.
  • Durable: Durability refers to the guarantee that once the user has been notified of success, the transaction will persist, and not be undone.
Databases achieve these by effective handling of concurrency, i.e. controlling how many actors can modify the state of the data at the same time. Here are the main concurrency-handling mechanisms/options (a small sketch of the optimistic approach follows the list):
  • Lock or Exclusive Lock or Pessimistic Lock. Some databases allow only one user to modify a record, row or document at a time. Preventive.
  • MVCC (multi-version concurrency control), or Optimistic Locking, is a mechanism that guarantees consistent reads. It allows multiple users to modify a record, creating multiple potentially conflicting versions, without acquiring an exclusive lock. However, it puts a check in place when it comes to committing the changes into the database: at that point, only the first user who attempts the commit succeeds. 
Locks ensure that changes are committed atomically in the case of a successful transaction and that everything is rolled back in the case of a transaction failure.
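
To make the optimistic approach concrete, here is a small, purely illustrative sketch of the compare-and-set-on-a-version-number pattern (a real MVCC engine inside a database is far more involved, keeping multiple versions around for readers):

public class VersionedRecord {
    private String value;
    private long version = 0;

    // Readers take a snapshot of the value together with its version.
    public synchronized long currentVersion() { return version; }
    public synchronized String read() { return value; }

    // Optimistic commit: no lock is held while the caller prepares its change;
    // the write succeeds only if nobody committed after the caller read expectedVersion.
    public synchronized boolean commit(long expectedVersion, String newValue) {
        if (version != expectedVersion) {
            return false;                 // conflict detected; the caller must re-read and retry
        }
        value = newValue;
        version++;                        // the first committer wins
        return true;
    }

    public static void main(String[] args) {
        VersionedRecord record = new VersionedRecord();
        long v = record.currentVersion();
        System.out.println(record.commit(v, "first write"));   // true
        System.out.println(record.commit(v, "second write"));  // false: stale version, retry needed
    }
}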

Next, let's take a look at Replication, i.e. copying the datastore to a different node. High Availability is achieved by replicating a database node. Replication comes in two forms:
  • Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
  • Peer-to-peer/Master-Master replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts, while peer-to-peer replication avoids funnelling all writes onto a single point of failure. Another important factor to consider is that when data is written on one node, it takes time before the change is reflected on all nodes. The propagation can be done synchronously or asynchronously for a particular transaction, and that choice determines whether your database offers Consistency or Eventual Consistency. In the case of peer-to-peer replication, the same record can be modified by two different transactions on two different nodes; how a database handles these scenarios is also influenced by the choice of Consistency vis-a-vis Eventual Consistency. There are specific databases that excel in one use case over the other. I plan to cover that in my next blog.

While database replication primarily helps to handle failover and ensures higher availability, it also helps scalability. Master-slave replication works well for read scalability: write operations can take place only on the master node, and the slaves then sync up with the master either synchronously or asynchronously. Peer-to-peer or master-master replication helps achieve both read and write scalability, as both read and write operations can take place on all the replicas. Here, all the replicas hold the same full copy of the database, so an individual node can still grow only as much as the machine it runs on; this is the territory of Scaling Up or Vertical Scaling, and it should work well for most systems. However, for near-infinite or web scale, one needs to go for Scale-Out or Horizontal Scaling, where data is partitioned or sharded across multiple nodes. This allows databases to keep growing just by adding new hardware (usually commodity hardware). Note that each partitioned node holds a different subset of the data and may have its own replicas for high availability; each partitioned node is actually a database in its own capacity.

Scale Up vs Scale Out

Now let's look at the trade-off matrix I mentioned earlier in this post. This trade-off is captured by the CAP Theorem, also known as Brewer's theorem. It states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency (all nodes see the same data at the same time). Note this consistency is different than what it is in ACID.
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
A distributed system can achieve only two of them at a time.
Here is a nice summary of how different datastores comply with the CAP Theorem, from a presentation by Aleksandar Bradic:




Finally, here is a matrix, I prepared, to capture various parameters that one would consider while analyzing a Distributed DB System.


Not all values are filled. I will continue to work on this and update it further. 

Connect to me on twitter @satya_paul
Check out my storyboard on www.fanffair.com - http://www.fanffair.com/storyboard/satyajitp2011