
Essay: Mining Preconditions Using Boa Language – Improve Accuracy and Verify Program Accuracy

Published: 1 April 2019. Last modified: 23 July 2024. Words: 3,027.



\documentclass[conference]{IEEEtran}

\usepackage{graphicx}

\graphicspath{ {./images/} }

\begin{document}

\title{Mining Preconditions Using Boa Language}

\author{\IEEEauthorblockN{Bishal Neupane}
\IEEEauthorblockA{Department of Computer Science\\
Bowling Green State University\\
Bowling Green, Ohio, 43403\\
bneupan@bgsu.edu}}

% make the title area
\maketitle

% As a general rule, do not put math, special symbols or citations

% in the abstract

\begin{abstract}
In this work, the research on mining the preconditions of APIs is extended to improve the accuracy of the results. A control flow graph is added to increase the clarity of the analysis, and a large-scale code corpus is mined to find the preconditions of the APIs.
\newline
Keywords -- Preconditions, APIs, Source Code Mining
\end{abstract}

\section{Introduction}

The software development industry widely uses application programming interfaces (APIs). APIs are functions that provide specific functionality to programs, and their proper use makes programs easier to develop. To use an API correctly, it is important to know its preconditions: the predicates that must be satisfied before calling the API. An API can be used accurately only when these conditions are met. This work mines the large code corpus present in software repositories to find the preconditions of APIs [1]. Modern ultra-large software repositories such as GitHub, SourceForge, and Google Code hold an enormous amount of code. Mining is the process of extracting data and information from these repositories, and the extracted information can be extremely useful for improving software engineering techniques and methodology.

\newline

The focus of this study is to collect the APIs present in software repositories and find the potential preconditions that must hold before calling those APIs. For example, to call the ensureCapacity() API we need an ArrayList object. A popular example is calling the substring method of the JDK String class. If the API is called as substring(int beginIndex, int endIndex), a few conditions must be true: beginIndex must not be negative, endIndex must not be larger than the length of the String object, beginIndex must not be larger than endIndex, and the receiver must be a String object. These are the preconditions required to call the substring API. APIs can be used correctly and efficiently when we know their preconditions. Accurate use of APIs helps a program avoid many errors; it also assists in verifying program correctness [2], generating test cases [9], and detecting bugs [1].
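As a concrete illustration of these preconditions, the following Java sketch checks them explicitly before delegating to substring. The class and method names are illustrative, not part of the paper.

```java
// A minimal sketch: the three documented preconditions of
// String.substring(int, int), checked explicitly before the call.
public class SubstringGuard {
    // Returns true iff the documented preconditions hold.
    public static boolean preconditionsHold(String rcv, int beginIndex, int endIndex) {
        return beginIndex >= 0                // beginIndex must not be negative
            && endIndex <= rcv.length()       // endIndex must not exceed the length
            && beginIndex <= endIndex;        // beginIndex must not exceed endIndex
    }

    public static String safeSubstring(String rcv, int beginIndex, int endIndex) {
        if (!preconditionsHold(rcv, beginIndex, endIndex)) {
            throw new IllegalArgumentException("substring precondition violated");
        }
        return rcv.substring(beginIndex, endIndex);
    }
}
```

A call such as safeSubstring("hello", 1, 3) satisfies all three guards and returns "el", while safeSubstring("hello", -1, 3) fails the first guard.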

newline

There are various ways to find preconditions. One can derive them by studying the original documentation and encoding them manually in the required format [1]. Program-analysis-based and data-mining-based approaches are the two approaches generally used for deriving specifications.

\subsection{Program Analysis}

The program-analysis-based approach needs a large number of test cases. With dynamic approaches, it may be difficult to find preconditions because the test suites are incomplete. Similarly, static approaches typically analyze only a small number of APIs, which may not lead to ideal results. These are the downsides of the dynamic and static approaches respectively [1].

\subsection{Data Mining}

The paper [1] introduces an approach that combines program analysis with source code mining. The key idea of Nguyen et al. [1] is that APIs occur frequently in an ultra-large code corpus, so for each API the true preconditions will appear at virtually all call sites, whereas client-specific conditions will appear much less frequently [1]. The combination of program analysis and source code mining therefore allows the preconditions to be derived from the large code corpus [1].

\newline

The procedure involves finding the client methods in the software repositories that call the APIs. We then build a control flow graph to represent the control dependence relation, which identifies the predicates that must hold to reach the call sites of the APIs. Next, we normalize and infer the extracted conditions to express them in a reduced, appropriate form, followed by filtering and ranking. The paper by Nguyen et al. [1] develops a well-organized mechanism for finding preconditions using this procedure. My work here is to develop code that addresses some shortcomings and bugs which exclude some of the preconditions and API call sites that this process should find. I am also adding the control flow graph to represent the control dependence relation. While doing so, I aim to maintain accuracy with good precision and recall.

\section{Motivation}

The main motivation comes from the substring example. The research [1] finds the preconditions of a given API by following the steps described above, but there are cases the API-finder method may miss. One of them involves java.util.ArrayList.ensureCapacity, the API that increases the capacity of an ArrayList instance to ensure that it can hold at least the number of elements specified by the minimum-capacity argument.

In order to call this API, the object must be an ArrayList. The API-finder can find a regular call where ensureCapacity is invoked on a plain receiver, but some of the cases it may miss are:

1) (ArrayList) l
\newline
2) ((ArrayList) l).size()
\newline
3) m((ArrayList) l, o)
\newline

These are cast expressions, which were not addressed by the previous algorithm. There are similar cases where we must handle annotations, parentheses, and other expression kinds. This motivates extending the program to a few more APIs and addressing these shortcomings of the research.
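Written out as compilable Java, the three shapes above look like the following sketch; an API finder that only matches calls on a plain receiver would overlook the cast expressions. The helper m and the variable names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class CastCallSites {
    // Illustrative helper taking an ArrayList argument.
    static void m(ArrayList<?> a, Object o) { }

    public static int demo() {
        List<String> l = new ArrayList<>();
        l.add("x");

        // 1) A bare cast expression, later used as the call receiver.
        ArrayList<String> a = (ArrayList<String>) l;
        a.ensureCapacity(10);

        // 2) A call whose receiver is itself a cast expression.
        int n = ((ArrayList<String>) l).size();

        // 3) A cast expression used as an argument.
        m((ArrayList<String>) l, "o");
        return n;
    }
}
```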

\section{Related Work}

There are various techniques used for precondition mining. We discussed program analysis and data mining earlier; within program analysis there are two approaches, static and dynamic. The paper by Nguyen et al. [1] uses both dynamic and static analysis along with data mining techniques.

newline

Earlier, the technique by Ramanathan et al. [5] derived preconditions using static inference. It gathers the predicates at each program point by analyzing the call sites and then uses a path-sensitive data-flow analysis, collecting the predicates along each path and merging them. Predicates computed within a procedure are memoized and then used to compute preconditions. There are important differences between the two approaches: the process of Nguyen et al. [1] is carried out over a large code corpus, while the work of Ramanathan et al. [5] operates on an individual client program containing the API's call sites [1]. The former considers predicates across projects while the latter does not [7]. There are also dynamic approaches to mining specifications: Weimer and Necula [6] find complex code conditions using dynamic analysis, and a comparable dynamic component is also used in [1].

\section{Methodology}

The paper [1] describes the process in discrete steps. The first step is to take the API whose preconditions we are looking for. We then find all methods that call the API and draw a control flow graph to represent the control dependence relation. We extract the possible preconditions and normalize each condition. We then carry out inference and filtering, and finally rank the results in a list.

\subsection{Extracting Preconditions}

We start with an API for which we are trying to find preconditions. The API-finder algorithm finds all methods that call the API. Here we take the example from [1] of the client code of the API String.substring(int, int) in the project SeMoA [4]. We know the preconditions for substring: beginIndex $\geq$ 0, endIndex $\leq$ the string length, and beginIndex $\leq$ endIndex. The client code from the project is shown in Figure 1.
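To make the extraction step concrete, here is a hypothetical client method in the spirit of Figure 1; the guard shapes and values are invented for illustration. The negations of the early-return guards are exactly the conditions that dominate the substring call site, and they are what the extraction step records as candidate preconditions.

```java
public class Client {
    // Hypothetical client method: in the CFG, servletPathStart is arg0,
    // extraPathStart is arg1, and completePath is the receiver rcv.
    public static String setFragmentation(String completePath,
                                          int servletPathStart,
                                          int extraPathStart) {
        if (completePath == null) return null;                    // guard on rcv
        if (servletPathStart < 0) return null;                    // guard on arg0
        if (extraPathStart > completePath.length()) return null;  // guard on arg1
        if (servletPathStart > extraPathStart) return null;       // relates arg0, arg1
        // All guards passed: the negations of the early returns hold here,
        // i.e. rcv != null, arg0 >= 0, arg1 <= rcv.length(), arg0 <= arg1,
        // which are the candidate preconditions at this call site.
        return completePath.substring(servletPathStart, extraPathStart);
    }
}
```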

\newline

\includegraphics[width=0.5\textwidth]{page}

Figure 1: Client code of API String.substring(int, int) in project SeMoA [4], \url{http://goo.gl/u0HK16}

\newline

The method setFragmentation(int servletPathStart, int extraPathStart) is present in the project [4]. In the control flow graph (Figure 2), servletPathStart is denoted by arg0, extraPathStart by arg1, and the completePath\_ object by rcv.

\newline

\includegraphics[width=0.5\textwidth]{cfg}

Figure 2: Control flow graph of the client code from Figure 1 (String.substring(int, int) in project SeMoA [4])

\newline

When we look at Figure 2, we see several conditions that must be false to reach the API call in the control flow graph. This analysis of the client code and control flow graph yields several observations [1]: preconditions can be figured out from the conditions that must be satisfied before reaching the API call sites [1]; the arguments that are passed must be taken into account [1]; there may be client-specific conditions, called noises, which should be minimized; and the relationships between different conditions should be considered [1]. Conditions that occur frequently are likely to be preconditions, because the actual preconditions will be checked almost every time the API is called.

The code in Figure 1 is present in the SeMoA project, and the first step is to extract the possible preconditions according to the control dependence relation. As noted in the observations, preconditions are mined from the guard conditions at the call sites that use the APIs. Figure 2 shows the control flow graph of the code in Figure 1, which identifies the conditions that must be satisfied at the substring calls on lines 19 and 20. The set of all API methods is taken from the client code in the software repositories, and for each API we build the control dependence relation and extract the conditions, similar to [1]. For the code in Figure 1, the extracted preconditions are listed in Table 1.

\newline

\includegraphics[width=0.5\textwidth]{pc}

Table 1: Extracted preconditions from Figure 1 and Figure 2

\subsection{Normalization}

After the preconditions are listed in the table, we normalize and reduce them to a standard form. Normalization is the process in which different but equivalent conditions are compared and expressed in one standard form [1]. For example, conditions written as arg0 = 0, arg0 - 1 = -1, and arg0 + 1 = 1 all express the same constraint; they are normalized and recorded on the map as arg0 = 0. After normalization, we take the normalized conditions and infer further conditions using an algorithm [1].

\subsection{Inference}

In the inference process, a non-strict inequality can be seen as the disjunction of a strict inequality and an equality, each of which may be checked at different call sites [1]. The algorithm in Figure 3 infers such non-strict preconditions [1]: when the two component conditions are checked equally often, all their call sites are counted toward the inferred condition; otherwise, the call sites of the less frequent condition are added [1]. The key idea is the frequency of the conditions: a precondition with a higher frequency is stronger evidence than others, and the frequency count also helps reduce noise while finding the preconditions.
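The normalization and inference steps can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it recognizes only conditions of the toy shape variable$\pm$constant op constant, and the inferred call-site count is simply the sum of the two component counts, whereas the paper's rule distinguishes equal from unequal frequencies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Conditions {
    // Tiny illustrative grammar: "<var>[+-<const>] <op> <const>", e.g. "arg0-1==-1".
    private static final Pattern P =
        Pattern.compile("(\\w+)([+-]\\d+)?(>=|<=|==|>|<)(-?\\d+)");

    // Normalization: move the additive constant to the right-hand side so
    // that equivalent conditions collapse to one canonical string.
    public static String normalize(String cond) {
        Matcher m = P.matcher(cond.replaceAll("\\s+", ""));
        if (!m.matches()) return cond;  // leave unrecognized forms untouched
        int shift = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
        long rhs = Long.parseLong(m.group(4)) - shift;
        return m.group(1) + m.group(3) + rhs;
    }

    // Inference: if both "v>c" and "v==c" are checked at some call sites,
    // add the inferred non-strict condition "v>=c" (count summed here).
    public static Map<String, Integer> infer(Map<String, Integer> counts) {
        Map<String, Integer> out = new HashMap<>(counts);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String c = e.getKey();
            int i = c.indexOf('>');
            if (i < 0 || c.startsWith(">=", i)) continue;
            Integer eq = counts.get(c.substring(0, i) + "==" + c.substring(i + 1));
            if (eq != null) {
                out.merge(c.substring(0, i) + ">=" + c.substring(i + 1),
                          e.getValue() + eq, Integer::sum);
            }
        }
        return out;
    }
}
```

For example, normalize("arg0-1==-1") yields "arg0==0", and if "arg0>0" is checked at 7 sites and "arg0==0" at 3, infer adds "arg0>=0" with 10 sites.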

\includegraphics[width=0.5\textwidth]{inf}

Figure 3: Inferring the inequality preconditions [1]

\newline

\subsection{Filtration}

The inferred preconditions may not all be correct, and some may be project-specific, so we go through a process of filtration. In this process we remove all conditions that are checked only once, and also remove conditions that have low confidence of being checked before the API call [1]. The confidence level is the ratio of the number of code locations checking the condition before calling the API to the total number of locations calling the API [1]. For each API, a threshold of $\sigma = 0.5$ is used [1]. We compute the confidence level for both client projects ($conf_{pr}$) and client methods ($conf_m$) using the following formula [1].

\newline

\includegraphics[width=0.3\textwidth]{filt}
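A minimal sketch of this filtering rule, assuming the counts of checking and calling locations have already been gathered (per project or per method); the class and method names are illustrative:

```java
public class Filter {
    static final double SIGMA = 0.5;  // threshold used in the paper

    // Confidence = (#locations checking the condition before the API call)
    //            / (#locations calling the API).
    public static double confidence(int checkingLocations, int callingLocations) {
        return callingLocations == 0
            ? 0.0
            : (double) checkingLocations / callingLocations;
    }

    // A condition survives filtering if it is checked more than once
    // and its confidence reaches the threshold.
    public static boolean keep(int checkingLocations, int callingLocations) {
        return checkingLocations > 1
            && confidence(checkingLocations, callingLocations) >= SIGMA;
    }
}
```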

\subsection{Ranking}

The final step ranks the preconditions according to the combined confidence level, computed as $conf(p) = conf_{pr}(p) \times conf_m(p)$. The preconditions are sorted by this confidence, and only the top one percent are listed in the final result [1].
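The ranking step can be sketched as follows, assuming the two confidence values have already been computed for each candidate precondition; the class names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class Ranker {
    public static class Candidate {
        public final String precondition;
        public final double confPr;  // project-level confidence
        public final double confM;   // method-level confidence

        public Candidate(String precondition, double confPr, double confM) {
            this.precondition = precondition;
            this.confPr = confPr;
            this.confM = confM;
        }

        // Combined confidence: conf(p) = conf_pr(p) * conf_m(p).
        public double score() { return confPr * confM; }
    }

    // Ranks candidate preconditions by combined confidence, highest first.
    public static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Candidate::score).reversed());
        List<String> out = new ArrayList<>();
        for (Candidate c : sorted) out.add(c.precondition);
        return out;
    }
}
```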

\section{Future Work}

For future work, our colleague is implementing the full pipeline, including filtering and ranking, in the Boa language, which will make the algorithm more effective and efficient. The goal is to complete the research using the Boa language before moving on to other work.

\section{Threats to Validity}

The control dependence part involves manual work, so human error is always possible and may lead to unwanted results. Precision and recall also depend on manual checking through the repositories. Similarly, the data sets may not be good enough to generalize the results [1]. This study involves only the Java language and the JDK; other programming languages are not represented.

\section{Conclusion}

The published paper [1] has a well-organized procedure for finding the preconditions of APIs from client methods. So far in this research, the processes of finding the APIs in client methods, computing the control dependence relation, normalization, and inference have been carried out using Boa code, following the same procedure.

We are going to develop code that addresses some shortcomings and bugs to make the procedure more effective. We added the control flow graph, which explains the control dependence relation, and tried to improve precision and recall by improving the code and addressing the shortcomings and exceptions that could not be handled earlier.

\section{Acknowledgement}

This work is an extension of the research paper [1].

\clearpage

\begin{thebibliography}{20}

\bibitem{Nguyen} Nguyen, H. A., Dyer, R., Nguyen, T. N., \& Rajan, H. (2014). Mining preconditions of APIs in large-scale code corpus. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). doi:10.1145/2635868.2635924

\bibitem{glenn} Glenn Ammons, Rastislav Bodík, and James R. Larus. 2002. Mining specifications. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '02). ACM, New York, NY, USA, 4-16. doi:10.1145/503272.503275

\bibitem{dyer} Dyer, R., Nguyen, H. A., Rajan, H., \& Nguyen, T. N. (2013). Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In 2013 35th International Conference on Software Engineering (ICSE). doi:10.1109/icse.2013.6606588

\bibitem{semoa} SeMoA -- Secure Mobile Agents. \url{http://sourceforge.net/projects/semoa/}

\bibitem{murali} Murali Krishna Ramanathan, Ananth Grama, and Suresh Jagannathan. 2007. Static specification inference using predicate mining. SIGPLAN Not. 42, 6 (June 2007), 123-134. doi:10.1145/1273442.1250749

\bibitem{westley} Westley Weimer and George C. Necula. 2005. Mining temporal specifications for error detection. In Proceedings of the 11th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS '05). Springer-Verlag, Berlin, Heidelberg, 461-476. doi:10.1007/978-3-540-31980-1\_30

\bibitem{rajan} Dyer, R., Nguyen, H. A., Rajan, H., \& Nguyen, T. N. (2013). Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In 2013 35th International Conference on Software Engineering (ICSE). doi:10.1109/icse.2013.6606588

\bibitem{dr} Nguyen, H. A. (n.d.). Analyzing repetitiveness in big code to support software maintenance and evolution. doi:10.31274/etd-180810-4144

\bibitem{patrice} Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05). ACM, New York, NY, USA, 213-223. doi:10.1145/1065010.1065036

\end{thebibliography}

\end{document}
