MathFinder: Math API Discovery and Migration
MathFinder is an Eclipse plugin supported by a unit test mining backend for discovering and migrating math APIs. It is intended to make (re)implementing math algorithms in Java easier. Given a math expressions (see the syntax below), it returns a pseudo-code involving calls to suitable Java APIs. At present, it supports programming tasks that require use of matrix and linear algebra APIs. The underlying technique is however general and can be extended to support other math domains.
Installing the MathFinder Eclipse plugin
Here is a quick example to get you started:
1. Implementing PageRank in Java
Consider the Scilab code for the PageRank algorithm, a ranking algorithm used by Google.
(This has been adapted from Wikipedia.)
%input: matrix W of doubles%and double scalars d, v_errorpagerank(W, d, v_error)N = cols(W);v = rand(N, 1);v = v ./ normf(v);last_v = ones(N, 1) * INF;s = (1 - d) / N;M_hat = d * W + s * ones(N, N);while(normf(v - last_v) > v_error)last_v = v;v = M_hat * v;v = v ./ normf(v);end
Even this reasonably small algorithm requires 9 matrix operators that are not supported by the standard Java library. Selecting a third-party library that supports all of them is a tedious and time-consuming task. Using MathFinder, you can find methods to implement the math expressions in the pseudo code by using the expressions themselves as queries.
For example, to search for methods to implement v = v ./ normf(v), use the query
double M v; v = v ./ normf(v);
Here, double M tells the tool that v is a double matrix.
In fact the tool does much more - it tells you how to call the method, it ranks methods based on relevance and gives you pseudo-code snippets to implement the expression - from multiple libraries.
Here are some additional queries from this example you could try out by pasting them into the search bar:
Simple isn't it? It is fairly natural to go from the expression to a query - the only thing you need to do is to add type annotations.
The query expression expression form “LHS = RHS” indicates that the programmer wants methods to implement the RHS, and the types of the result and the LHS should be the same. The tool supports the following types at present:
2. API Migration for Weka (machine learning) library
Another application of the tool is to migrate clients that use an existing math library to another math library.
To illustrate this, consider the following snippet from the Weka machine learning library:
This code is from the PLSFilter class and is a part of a method that implements the PLS1 algorithm to find the Partial Least Squares Regression on a given data set.
Suppose you wish to migrate the first statement shown above. The first step is to frame a math query that embodies the semantics of the statement. This involves mapping the library methods used to operators of the math expression language, and specifying the query types of the variables used. So the query to use is:
double M tmp, W, P; tmp = W∗inv(P’∗W);
The same query is used to fetch pseudo-code snippets from multiple libraries. It returns, for example, the following pseudo-code snippet that uses the SimpleMatrix class of EJML:
SimpleMatrix tmp, P, W, T1, T2, T3;
T1 = P.transpose(); T2 = T1.mult(W);
T3 = T2.invert(); tmp = W.mult(T3);
The tool does not currently support method chaining, but it is straightforward to modify the snippet it returns to chain methods on the same receiver.
Under the Hood
Interested in figuring out how the whole thing works?
The idea is to mine information that is embedded in unit tests of the library methods to later discover them using math queries. We use (the set of input/output objects in) unit tests as a description of method semantics - i.e. we use the set of input/output objects as a description of the meaning of the method.
The key insight in MathFinder is to use an interpreter for a math language (such as Scilab) to evaluate subexpressions on unit tests of library methods. In simple terms, this means taking the values of parameters in the unit tests, and plugging them into the variables used in the query and then evaluating the query in Scilab. The result of the interpretation on inputs of a test can be compared to the output of the test. The MathFinder hypothesis is that if a subexpression results in the same value as the output of a method on a large number of tests, the method can be used for computing the subexpression.
Given a query expression, MathFinder extracts subexpressions from it. Given a subexpression and a method, MathFinder computes the set of all candidate mappings between variables of the subexpression and method parameters. MathFinder then searches for a mapping that maximizes the number of unit tests on which the subexpression gives results equivalent to the method outputs. The idea is that there may be many ways to plug in values from a unit test into query variables. MathFinder tries all of them, and picks the one that agrees with what the unit test originally computed in the most number of tests.
Going back to our first example expression v = v ./ normf(v);
Here, normf stands for the Frobenius norm, and ./ is matrix-scalar division.
The first step is to type the variables; the variable v is given the type "double M" standing for a matrix of doubles in the query
double M v; v = v ./ normf(v);
The expression form “LHS = RHS” indicates that the programmer wants methods to implement the RHS, and the types of the result and the LHS should be the same.
MathFinder then parses the math expression and decomposes it into subexpressions (similar to three-address code generation in compilers). The subexpressions have a single operator on the RHS by default. v = v ./ normf(v) is decomposed as
double S T1; double M v; T1 = normf(v); v = v ./ T1;
Operator precedence enforces the sequential ordering of computation and temporary variables like T1 are used to explicate data flow.
Results are obtained against each subexpression by mining the unit tests using an interpreter as outlined earlier.
MathFinder then uses the results mined against each subexpression to issue a pseudo-code snippet. The snippet takes a set of input objects, returns an output object, and performs the computation queried for. The input objects correspond to variables from the RHS of the query and the output object, to the LHS variable. By convention, objects are given the same name as the variables they correspond to. A sequence of API method calls, with a call corresponding to every operator used in the query, is used to generate the output object. In this sequence, if the return value of a method is passed as an argument to another, then their library types should be compatible, and the output object is the return value of the last method. MathFinder suggests this snippet from JBlas
DoubleMatrix v; double T1; T1 = v.norm2(); v = v.div(T1);
where div is discovered to implement v = v ./ T1.
MathFinder can thus automate the process of API discovery and comprehension to a large extent. You will still have to verify the validity of the results and translate the pseudo-code to Java code by introducing object instantiations as necessary. You can use snippets from different libraries for different (sub)expressions in her algorithm, provided you writes code to convert between datatypes of the different libraries.
A good place to start would be to read our FASE paper describing the technique.
We have released the sources for both the plugin and our backend. If you have some new ideas about how to make things better, well, its all open source, so hack away!
This is meant to serve as a brief guide to setting up Eclipse to modify the MathFinder backend.
The first step is to install prerequisites:
The Scilab interpreter
We use Scilab 5.3.3Download the appropriate file for your operating system, for example,for 64-bit linux, usewget -c http://www.scilab.org/download/5.3.3/scilab-5.3.3.bin.linux-x86_64.tar.gztar xvf scilab-5.3.3.bin.linux-x86_64.tar.gz
The ant build system for Java
sudo apt-get install ant
You need to have Hadoop-0.20.2 set up as a project in Eclipse.A good resource to help set this up is Shuyo's blog
antlr-3.4 or greater
The grammars we use to parse our query language are in antlr.
A good resource to help set up the antlr plugin for eclipse is Scott Stanchfield's video
Setting up the workspace
One possible error could be that Operator.g in iisc.csa.mathComplete.whitebox is not compiled.This is because of an error in the antlr plugin, because of which a list of tokens a lexergenerates is not placed in the right folder.Simply place the Linlex.tokens file from iisc.csa.mathComplete.whitebox into the "antlr-generated"source folder
You are now ready to start hacking!
1. How do I index more methods?
2. How do I test the backend?
3. How do I build a complete operator index
Running this will take some time, and will result in generating the operator index in the path you specified
4. How do I modify the behaviour of the plugin?
5. I just can't get something to work, what do I do?
We don't have a mailing list at present, so please write to the developers: anirudh_s at csa dot iisc dot ernet dot in
Queries/bug reports may be sent to anirudh_s at csa dot iisc dot ernet dot in
This work is partially funded by Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science. We also thank ETAPS and IARCS for student travel grants to partially support travel to present this work at FASE 2013.
MathFinder is available under the Apache License, version 2.0.
Projected completed: 2014