Representative Sample Selection via Frequent Subgraph Analysis
First, the accurate separation of malicious components and the legitimate part from the majority of Android malware, which are repackaged popular apps, is nontrivial [9]–[12]. Zhou and Jiang [7] found that 86% of Android malware samples are repackaged apps produced by injecting malicious components into legitimate apps. The injected malicious components are hidden within the functionalities of popular apps and usually constitute only a small portion of the repackaged apps. Differentiating between the legitimate part and malicious components of malware is difficult for existing features, such as system calls [13] and sensitive path [14].
Second, polymorphic variants of Android malware that belong to the same family perform the same malicious activities with different implementations. Therefore, such malware can easily evade existing classification solutions [15], [16] that seek an exact match of a given specification. For example, Listing 1 illustrates different implementations of the same functionality (i.e., obtain device id, phone number, and voice mail number) in two malware samples. The two malware samples belong to the same family, geinimi. These bot-like malware samples steal personal information and send it to a remote server. Three major differences (highlighted in red) are observed in the two implementations. First, the structures of class names are different. Second, the arguments of the two functions are different. One takes a service (Lcom/geinimi/Adservice), one of the four basic components of Android apps, as an argument. By contrast, the other uses an object of the class rally/e as an argument. Third, the former function contains two more statements (including one invocation) than the latter.
To address the above challenges, we propose a novel approach that exploits the following two observations:
Observation 1. Android malware usually invokes sensitive application program interface (API) calls that operate on sensitive data to perform malicious activities. For example, the malware samples presented in Listing 1 invoke getLine1Number() to obtain the phone number of users.
Observation 2. Malware and its variants within the same family invoke sensitive API calls by following similar patterns even if their code is obfuscated. As illustrated in Listing 1, three commonly invoked sensitive API calls (i.e., getDeviceId(), getLine1Number(), and getVoiceMailNumber()), which are highlighted in blue, exist in the two methods of different malware samples. The three sensitive API calls are sequentially invoked in the two methods, thus illustrating a similar pattern of sensitive API calls in different samples within the same family.
By exploiting the above two observations, we first distill program semantics into a function call graph (FCG) representation and assign different weights to different sensitive API calls with a term frequency-inverse document frequency (TF-IDF)-like approach (see Section II-A for details). TF-IDF is a numerical statistic that evaluates the importance of a word to a document in a collection or corpus.
Then, we propose two key techniques to solve the challenges (see Section II-B for details), as follows: 1) We propose a clustering-based approach to extract common malicious behavior in each family and to address the inaccurate separation of malicious components and the legitimate part of repackaged apps. Thus, we can reduce the side effects of the legitimate part in the malware. 2) For the different implementations of the same functionality, we propose a weighted-sensitive-API-call-based graph matching approach to calculate the similarity between graphs generated by community detection algorithms.
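To make the weighting and matching ideas above concrete, the following is a minimal sketch and not the formulation given in Section II. It assumes that the "term frequency" of a sensitive API call is the fraction of samples in a family that invoke it, that the "inverse document frequency" is the logarithm of the number of families divided by the number of families in which the call appears, and that two graphs are compared by a weighted Jaccard ratio over their sets of sensitive API calls. The function names tfidf_like_weights and weighted_similarity, the family label "other", and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_like_weights(samples):
    """Assumed TF-IDF-like weighting of sensitive API calls per family.

    samples: iterable of (family, set_of_sensitive_api_names) pairs.
    Returns {family: {api: weight}}.
    """
    family_sizes = defaultdict(int)                  # samples per family
    counts = defaultdict(lambda: defaultdict(int))   # family -> api -> samples using it
    for family, apis in samples:
        family_sizes[family] += 1
        for api in set(apis):
            counts[family][api] += 1

    # Number of distinct families in which each API call appears.
    families_using = defaultdict(int)
    for api_counts in counts.values():
        for api in api_counts:
            families_using[api] += 1

    n_families = len(family_sizes)
    weights = {}
    for family, size in family_sizes.items():
        weights[family] = {}
        for api, used_by in counts[family].items():
            tf = used_by / size                                # share of the family using the call
            idf = math.log(n_families / families_using[api])   # rarity across families
            weights[family][api] = tf * idf
    return weights


def weighted_similarity(apis_a, apis_b, api_weights):
    """Weighted Jaccard ratio over two sets of sensitive API calls (an assumed stand-in)."""
    shared = sum(api_weights.get(api, 0.0) for api in apis_a & apis_b)
    total = sum(api_weights.get(api, 0.0) for api in apis_a | apis_b)
    return shared / total if total > 0 else 0.0


if __name__ == "__main__":
    variant_a = {"getDeviceId", "getLine1Number", "getVoiceMailNumber"}
    variant_b = {"getDeviceId", "getLine1Number"}
    samples = [
        ("geinimi", variant_a),
        ("geinimi", variant_b),
        ("other", {"getDeviceId", "sendTextMessage"}),
    ]
    weights = tfidf_like_weights(samples)
    print(weights["geinimi"])
    print(weighted_similarity(variant_a, variant_b, weights["geinimi"]))
```

Under these assumptions, a call that every family uses receives zero weight, whereas a call concentrated in one family dominates the similarity score. Because only sensitive API calls are compared, renaming user-defined functions does not change the result.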
Community detection algorithms determine whether a graph has community structure, that is, whether its nodes can be easily grouped into sets such that each set is internally densely connected. Our approach can detect homogeneous malicious behavior while tolerating minor differences in implementation, such as function renaming and junk-code insertion. Sensitive API calls constitute only a small portion of all Android API calls, and they cannot be easily obfuscated by typical existing obfuscation techniques, whereas the names of user-defined functions are usually obfuscated as a, b, or c.
To represent common malicious behaviors shared by malware samples within the same family, we construct frequent subgraphs (fregraphs), which are novel graph-based features extracted from the generated FCGs, on the basis of the two key techniques. Moreover, we propose and develop FalDroid, an automatic system for classifying Android malware and selecting representative samples of each family in accordance with fregraphs, in 8,100 lines of Java code and 900 lines of Python code. We apply FalDroid to 8,407 malware samples in 36 different families and find that it exhibits impressive familial classification performance. Moreover, it can effectively reduce workload and accelerate malware analysis.
In summary, our major contributions include the following:
(i) We propose fregraph, a novel graph-based feature, to represent the common behavior of malware within the same family. We then employ fregraph to conduct malware familial classification and representative malware selection.
(ii) We propose a novel weighted-sensitive-API-call-based graph matching approach that can detect the homogeneous malicious behavior of malware within the same family while tolerating minor differences in implementation.
(iii) We design and implement FalDroid, a novel system that can handle the familial classification of large-scale Android malware with high accuracy and effectively decrease the number of malware samples to be analyzed.
(iv) We conduct extensive experiments to evaluate FalDroid. Our results show that FalDroid can achieve 94.2% accuracy and requires only approximately 4.6 seconds to process an app. Moreover, it can dramatically decrease the cost of malware investigation by selecting only 8.5% to 22% of the samples as representatives that present the most malicious behavior among all samples.
The remainder of this paper is organized as follows. The methodology of FalDroid is detailed in Section II, and its two usages are presented in Section III. The experimental results are reported in Section IV. After providing a discussion of the limitations of FalDroid in Section V, we introduce related work in Section VI. We conclude the paper with a discussion of future work in Section VII.