MathFinder: Math API Discovery and Migration

MathFinder is an Eclipse plugin supported by a unit test mining backend for discovering and migrating math APIs. It is intended to make (re)implementing math algorithms in Java easier. Given a math expressions (see the syntax below), it returns a pseudo-code involving calls to suitable Java APIs. At present, it supports programming tasks that require use of matrix and linear algebra APIs. The underlying technique is however general and can be extended to support other math domains.



Installing the MathFinder Eclipse plugin
  • First, download the plugin.
  • Put the plugin file in the dropins directory of your eclipse installation. 
  • The plugin relies on some auxiliary files to function, in particular the operator index that the backend constructs. These are here. Unzip this file into your home folder (update Eclipse->Preferences->MathFinder if you choose to place it elsewhere).
  • Then, open Eclipse. Open the MathFinder view: Window -> Show View -> Other -> SampleCategory -> MathFinder View.



Illustrative Examples

Here is a quick example to get you started:

1. Implementing PageRank in Java

Consider the Scilab code for the PageRank algorithm, a ranking algorithm used by Google. 
(This has been adapted from Wikipedia.)

%input: matrix W of doubles
%and double scalars d, v_error
pagerank(W, d, v_error)
  N = cols(W);  
  v = rand(N, 1);
  v = v ./ normf(v);
  last_v = ones(N, 1) * INF;
  s = (1 - d) / N;
  M_hat = d * W + s * ones(N, N);
  while(normf(v - last_v) > v_error)
  last_v = v;
     v = M_hat * v;
     v = v ./ normf(v);
  end

Even this reasonably small algorithm requires 9 matrix operators that are not supported by the standard Java library. Selecting a third-party library that supports all of them is a tedious and time-consuming task. Using MathFinder, you can find methods to implement the math expressions in the pseudo code by using the expressions themselves as queries.

For example, to search for methods to implement v = v ./ normf(v), use the query
double M v; v = v ./ normf(v);
Here, double M tells the tool that v is a double matrix.

In fact the tool does much more - it tells you how to call the method, it ranks methods based on relevance and gives you pseudo-code snippets to implement the expression - from multiple libraries.
  • A history of the queries typed is maintained in the left pane
  • A count of the number of operators supported by each library is maintained on the first tab
  • The second tab provides snippets to implement the current query from each library. If the method has a side effect, it clones the receiver in a temporary before invoking the method.
  • The third tab displays, library-wise, all the results fetched against the current query
Here are some additional queries from this example you could try out by pasting them into the search bar:
  1. int S N; double M W; N = columns(W);
  2. double M last_v; int S N; double S inf; last_v = ones(N, 1) * inf;
  3. double M Mat_hat, Mat; double S d; double S N; int S K; Mat_hat = (d .* Mat) + (((1. - d) / N) .* ones(K, K));
  4. double M v, last_v; double S res; res = norm(v - last_v);
  5. double M v, M_hat; v = M_hat * v;
Simple isn't it? It is fairly natural to go from the expression to a query - the only thing you need to do is to add type annotations.

The query expression expression form “LHS = RHS” indicates that the programmer wants methods to implement the RHS, and the types of the result and the LHS should be the same. The tool supports the following types at present:
double M - double Matrix
double S - double Scalar
int M - integer Matrix
int S - integer Scalar

2. API Migration for Weka (machine learning) library

Another application of the tool is to migrate clients that use an existing math library to another math library.

To illustrate this, consider the following snippet from the Weka machine learning library:

// W*(P^T*W)^-1
tmp = W.times(((P.transpose()).times(W)).inverse());
// X_new = X*W*(P^T*W)^-1
X_new = getX(instances).times(tmp);
// factor = W*(P^T*W)^-1 * b_hat
m_PLS1_RegVector = tmp.times(b_hat);

This code is from the PLSFilter class and is a part of a method that implements the PLS1 algorithm to find the Partial Least Squares Regression on a given data set.

Suppose you wish to migrate the first statement shown above. The first step is to frame a math query that embodies the semantics of the statement. This involves mapping the library methods used to operators of the math expression language, and specifying the query types of the variables used. So the query to use is:
double M tmp, W, P; tmp = W∗inv(P’∗W);

The same query is used to fetch pseudo-code snippets from multiple libraries. It returns, for example, the following pseudo-code snippet that uses the SimpleMatrix class of EJML:
SimpleMatrix tmp, P, W, T1, T2, T3;
T1 = P.transpose(); T2 = T1.mult(W);
T3 = T2.invert(); tmp = W.mult(T3);

The tool does not currently support method chaining, but it is straightforward to modify the snippet it returns to chain methods on the same receiver.



Under the Hood

Interested in figuring out how the whole thing works?

The idea is to mine information that is embedded in unit tests of the library methods to later discover them using math queries. We use (the set of input/output objects in) unit tests as a description of method semantics - i.e. we use the set of input/output objects as a description of the meaning of the method. 

The key insight in MathFinder is to use an interpreter for a math language (such as Scilab) to evaluate subexpressions on unit tests of library methods. In simple terms, this means taking the values of parameters in the unit tests, and plugging them into the variables used in the query and then evaluating the query in Scilab. The result of the interpretation on inputs of a test can be compared to the output of the test. The MathFinder hypothesis is that if a subexpression results in the same value as the output of a method on a large number of tests, the method can be used for computing the subexpression. 



Given a query expression, MathFinder extracts subexpressions from it. Given a subexpression and a method, MathFinder computes the set of all candidate mappings between variables of the subexpression and method parameters. MathFinder then searches for a mapping that maximizes the number of unit tests on which the subexpression gives results equivalent to the method outputs. The idea is that there may be many ways to plug in values from a unit test into query variables. MathFinder tries all of them, and picks the one that agrees with what the unit test originally computed in the most number of tests.

Going back to our first example expression v = v ./ normf(v);
Here, normf stands for the Frobenius norm, and ./ is matrix-scalar division.

The first step is to type the variables; the variable v is given the type "double M" standing for a matrix of doubles in the query 
double M v; v = v ./ normf(v);

The expression form “LHS = RHS” indicates that the programmer wants methods to implement the RHS, and the types of the result and the LHS should be the same.
MathFinder then parses the math expression and decomposes it into subexpressions (similar to three-address code generation in compilers). The subexpressions have a single operator on the RHS by default. v = v ./ normf(v) is decomposed as
double S T1; double M v; T1 = normf(v); v = v ./ T1;

Operator precedence enforces the sequential ordering of computation and temporary variables like T1 are used to explicate data flow. 
Results are obtained against each subexpression by mining the unit tests using an interpreter as outlined earlier.

MathFinder then uses the results mined against each subexpression to issue a pseudo-code snippet. The snippet takes a set of input objects, returns an output object, and performs the computation queried for. The input objects correspond to variables from the RHS of the query and the output object, to the LHS variable. By convention, objects are given the same name as the variables they correspond to. A sequence of API method calls, with a call corresponding to every operator used in the query, is used to generate the output object. In this sequence, if the return value of a method is passed as an argument to another, then their library types should be compatible, and the output object is the return value of the last method. MathFinder suggests this snippet from JBlas
DoubleMatrix v; double T1; T1 = v.norm2(); v = v.div(T1);
where div is discovered to implement v = v ./ T1.

MathFinder can thus automate the process of API discovery and comprehension to a large extent. You will still have to verify the validity of the results and translate the pseudo-code to Java code by introducing object instantiations as necessary. You can use snippets from different libraries for different (sub)expressions in her algorithm, provided you writes code to convert between datatypes of the different libraries.



Digging Deeper

A good place to start would be to read our FASE paper describing the technique.
We have released the sources for both the plugin and our backend. If you have some new ideas about how to make things better, well, its all open source, so hack away!

Development
This is meant to serve as a brief guide to setting up Eclipse to modify the MathFinder backend.

Prerequisites
The first step is to install prerequisites:

The Scilab interpreter 
We use Scilab 5.3.3 
Download the appropriate file for your operating system, for example,
for 64-bit linux, use
wget -c http://www.scilab.org/download/5.3.3/scilab-5.3.3.bin.linux-x86_64.tar.gz
tar xvf scilab-5.3.3.bin.linux-x86_64.tar.gz    

The ant build system for Java
sudo apt-get install ant

hadoop-0.20.2
You need to have Hadoop-0.20.2 set up as a project in Eclipse.
A good resource to help set this up is Shuyo's blog

          antlr-3.4 or greater
                    The grammars we use to parse our query language are in antlr.
                    A good resource to help set up the antlr plugin for eclipse is Scott Stanchfield's video

Setting up the workspace
Now unpack the mathComplete tarball in your workspace, and import it as a java project.
Add the hadoop project to its build path. If you don't have antlr-3.4-complete.jar in your
classpath, add that as well.
If all goes well, it should build in your workspace.

Note
One possible error could be that Operator.g in iisc.csa.mathComplete.whitebox is not compiled.
This is because of an error in the antlr plugin, because of which a list of tokens a lexer 
generates is not placed in the right folder.
Simply place the Linlex.tokens file from iisc.csa.mathComplete.whitebox into the "antlr-generated"
source folder

You are now ready to start hacking!



How To

1. How do I index more methods?
The packages unitTests.* contain unit tests for the libraries we surveyed.
See the MatrixTest.java file inside unitTests.jama to get an idea of how we instrument a unit test.
After you have added your test, to regenerate the index, run the script gendata.sh in the mathComplete folder.
cd mathComplete
./gendata.sh <number of tests> <intermediate directory> <output directory>

2. How do I test the backend?
We have provided a driver, in the file RetrievalDriver.java in the
csa.iisc.mathComplete.mapreduce package.
Running this driver will let you see the results retrieved against a query. 
Use the run configuration "RetrievalDriver" that we have provided from within eclipse.
The command line arguments expected are:
0 - path to unit test index
1 - path to mapreduce output. Any temporary folder will do
2 - path to the query file 
3 - true/false according to whether the job is being run on a cluster or not. 
    say false if you are on a standalone machine 
4 - path to file where results are to be cached 
5 -  Number of reducers to use (if this job is being run on a cluster)
6 -  The number of tests/method in the unit test index being used

3. How do I build a complete operator index
Here is how to build the operator index:
Note: This operator index is built against a set of benchmark queries provided together with the source (mathComplete/input/)
From within eclipse: use the BatchProcessing run configuration provided.
You first need to set some environment variables [use the "Environment tab" under the Run Configuration window of Eclipse].
(if you are unfamiliar with this, refer the eclipse documentation - http://help.eclipse.org/juno/topic/org.eclipse.cdt.doc.user/tasks/cdt_t_run_env.htm )
On linux, these are the paths to use:
LD_LIBRARY_PATH= path-to-scilab-5.3.3/lib/scilab
SCI=path-to-scilab-5.3.3/share/scilab
For more operating systems, instructions are here

Here are the meanings of the command line parameters
0 -  The input unit test index
1 -  The output for the mapreduce job. Any temporary folder will do
2 -  The name of the file containing the query
3 -  A boolean which is true if this job is to be run on a cluster
4 -  The full path to the operator index
5 -  Number of reducers to use (if this job is being run on a cluster)
6 -  The number of tests/method in the unit test index being used
7 -  Directory where the text results against each query are to be written

Running this will take some time, and will result in generating the operator index in the path you specified

4. How do I modify the behaviour of the plugin?
The code for the plugin is very experimental - modify at your own risk. You have been warned!
Download this file, untar it, and import it as an eclipse project.

5. I just can't get something to work, what do I do?
          We don't have a mailing list at present, so please write to the developers: anirudh_s at csa dot iisc dot ernet dot in



Publications

Experimental Data

  • Some experimental data (the tasks we used in the user study) connected to this publication is here.
  • The queries we framed to evaluate precision and recall of these tasks are here.
  • The larger set of benchmark queries used in connection with our journal version is here.
  • The migrated versions of Weka 3.6: In these versions, Weka is migrated to use EJML (25MB+), JBLAS (25MB+), or COLT (25MB+) instead of Jama.


People

Contact
Queries/bug reports may be sent to anirudh_s at csa dot iisc dot ernet dot in

Acknowledgements
This work is partially funded by Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science. We also thank ETAPS and IARCS for student travel grants to partially support travel to present this work at FASE 2013.

LICENSE

MathFinder is available under the Apache License, version 2.0.


Projected completed: 2014