Introduction to IDAPython

Friday, June 24 2005 10:07.18 CDT Author: ero

Python and IDA

Python is a powerful scripting language which has features greatly appreciated by its followers. Versatility, speed of development and readability are among the top ones.



IDA provides the advanced user with IDC, a C-like scripting language to automate some of the tasks of analysis. Yet, compared to Python, IDC feels clumsy and slow. Many times has the author (and others) wished for something more versatile.



IDAPython (Erdélyi 2005) was first introduced in an earlier joint paper, Carrera and Erdélyi 2004, where a general overview was given together with minimal examples comparing IDC and equivalent Python scripts.



Python goes well beyond the possibilities of IDC by providing networking support, advanced I/O and a host of other features not available in IDC at all.



In this article, a series of examples will be introduced in order to get acquainted with IDAPython and its possibilities.



The examples presented in this paper are known to work with IDA 4.8 and IDAPython 0.7.0 running under Linux.



IDAPython keeps the same global dictionary regardless of the input method. Whether Python code is run from external files or typed in its notepad, the data is persistent. This is extremely convenient when a script gathers and parses certain data but one does not yet know, or does not yet want to decide, what to do with it. Having such data always accessible makes a wonderful environment for poking and tinkering around.
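
The persistence described above can be imitated outside IDA with plain Python. What follows is only a minimal sketch and not IDAPython's actual implementation: shared_globals, the two snippet strings and the function names in the data are hypothetical stand-ins for the interpreter's global dictionary and for code coming from a file and from the notepad.

```python
# One dictionary plays the role of the global namespace that is kept
# between runs (assumption: a simplified model, not IDAPython internals).
shared_globals = {}

# First "script": gather some data and leave it in the namespace.
script_from_file = "collected = {'sub_401000': 3, 'sub_401020': 7}"
exec(script_from_file, shared_globals)

# Later "notepad" snippet: the earlier data is still there to poke at.
notepad_snippet = "largest = max(collected, key=collected.get)"
exec(notepad_snippet, shared_globals)

print(shared_globals["largest"])
```

Because both snippets are executed against the same dictionary, the second one can use the name collected without gathering the data again.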



IDAPython provides the full API available to those writing plugins and also the well-known IDC functions. It's possible to access nearly anything within IDA's database.



Walking the Functions

As an introductory script, the first example will loop through all the functions IDA has found and any others the user has defined, and will print their effective addresses and names. The script is nearly identical to one of the examples in (Carrera and Erdélyi 2004):



### Walk the functions

# Get the address under the cursor; its segment bounds the walk
ea = ScreenEA()

# Loop through all the functions
for function_ea in Functions(SegStart(ea), SegEnd(ea)):
    # Print the address and the function name.
    print hex(function_ea), GetFunctionName(function_ea)

Functions such as ScreenEA and GetFunctionName also exist in IDC and are described in the IDC documentation.



The function Functions() is provided by IDAPython's idautils module, which is automatically imported on load.



Walking the Segments

This example will loop through all segments and fetch their data, byte by byte, storing it in a Python string.



### Going through the segments

segments = dict()

# For each segment
for seg_ea in Segments():
    data = []
    # For each byte in the address range of the segment
    for ea in range(seg_ea, SegEnd(seg_ea)):
        # Fetch byte
        data.append(chr(Byte(ea)))
    # Put the data together
    segments[SegName(seg_ea)] = ''.join(data)

# Loop through the dictionary and print the segment's names
# and their sizes
for seg_name, seg_data in segments.items():
    print seg_name, len(seg_data)

The function Segments() is again provided by idautils. Byte(), SegEnd() and SegName() exist in IDC and their functionality is quite self-evident.



Function Connectivity

The third example is a bit more elaborate. It will go through all the functions and will find all the calls performed to and from each of them. The references will be stored in two dictionaries and, in the end, a list of functions with their indegree and outdegree will be shown.



### Indegree and outdegree of functions

from sets import Set

# Get the address under the cursor; its segment bounds the walk
ea = ScreenEA()

callers = dict()
callees = dict()

# Loop through all the functions
for function_ea in Functions(SegStart(ea), SegEnd(ea)):

    f_name = GetFunctionName(function_ea)

    # Create a set with all the names of the functions calling (referring to)
    # the current one.
    callers[f_name] = Set(map(GetFunctionName, CodeRefsTo(function_ea, 0)))

    # For each of the incoming references
    for ref_ea in CodeRefsTo(function_ea, 0):

        # Get the name of the referring function
        caller_name = GetFunctionName(ref_ea)

        # Add the current function to the list of functions
        # called by the referring function
        callees[caller_name] = callees.get(caller_name, Set())
        callees[caller_name].add(f_name)

# Get the list of all functions
functions = Set(callees.keys()+callers.keys())

# For each of the functions, print the number of functions calling it and
# number of functions being called. In short, indegree and outdegree
for f in functions:
    print '%d:%s:%d' % (len(callers.get(f, [])), f, len(callees.get(f, [])))
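
The bookkeeping behind the two dictionaries can be tried without IDA. The sketch below assumes the call relationships are already available as (caller, callee) name pairs; the helper degrees() and the sample pairs are hypothetical and only stand in for what CodeRefsTo() and GetFunctionName() would provide inside IDA.

```python
from collections import defaultdict

def degrees(call_pairs):
    """Return {name: (indegree, outdegree)} from (caller, callee) pairs."""
    callers = defaultdict(set)   # callee -> set of functions calling it
    callees = defaultdict(set)   # caller -> set of functions it calls
    for caller, callee in call_pairs:
        callers[callee].add(caller)
        callees[caller].add(callee)
    names = set(callers) | set(callees)
    # .get() avoids creating empty entries for missing names
    return {n: (len(callers.get(n, ())), len(callees.get(n, ())))
            for n in names}

# Hypothetical call pairs; the repeated pair is counted only once
# because sets are used, as in the example above.
calls = [("main", "parse"), ("main", "report"),
         ("parse", "report"), ("main", "parse")]

result = degrees(calls)
for name in sorted(result):
    indegree, outdegree = result[name]
    print("%d:%s:%d" % (indegree, name, outdegree))
```

Using sets means that several call sites between the same two functions count as a single connection, which matches the behaviour of the Set-based dictionaries in the example.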

Walking the Instructions

The fourth example will take us to the instruction level. For each segment, we will walk through all the defined elements by means of Heads(start address, end address) and check whether the element defined at each address is an instruction; if so, the mnemonic will be fetched and its occurrence count will be updated in the mnemonics dictionary. Finally, the mnemonics and their number of occurrences are shown.



### Mnemonics histogram

mnemonics = dict()

# For each of the segments
for seg_ea in Segments():

    # For each of the defined elements
    for head in Heads(seg_ea, SegEnd(seg_ea)):

        # If it's an instruction
        if isCode(GetFlags(head)):

            # Get the mnemonic and increment the mnemonic
            # count
            mnem = GetMnem(head)
            mnemonics[mnem] = mnemonics.get(mnem, 0)+1

# Sort the mnemonics by number of occurrences, as (count, mnemonic) pairs
by_count = map(lambda x: (x[1], x[0]), mnemonics.items())
by_count.sort()

# Print the sorted list
for count, mnemonic in by_count:
    print mnemonic, count
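
The counting and sorting step can be sketched on its own with the standard library. The list mnems below is hypothetical sample data standing in for the values GetMnem() would return for each instruction; collections.Counter is used here as a convenience and is not part of the original script.

```python
from collections import Counter

# Hypothetical mnemonics, one per instruction visited.
mnems = ["push", "mov", "call", "mov", "pop", "mov", "call", "ret"]

# Count the occurrences of each mnemonic.
histogram = Counter(mnems)

# Least frequent first; ties are broken alphabetically, which is the
# same order obtained by sorting (count, mnemonic) tuples.
ordered = sorted(histogram.items(), key=lambda item: (item[1], item[0]))

for mnem, count in ordered:
    print(mnem, count)
```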

Cyclomatic Complexity

The next example goes a bit further. It will go through all the functions and for each of them it will compute the Cyclomatic Complexity. The Cyclomatic Complexity measures the complexity of the code by looking at the nodes and edges (basic blocks and branches) of the graph of a function. It is usually defined as:



CC = Edges - Nodes + 2



The function cyclomatic_complexity() will compute its value, given the function's start address as input.
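
As a small worked illustration of the formula, independent of IDA, the sketch below applies CC = Edges - Nodes + 2 to hand-written graphs. The helper cc_from_edges() and the block names are hypothetical; the graphs are given as (source, target) pairs and every node is assumed to appear in at least one edge.

```python
def cc_from_edges(edges):
    """Apply CC = Edges - Nodes + 2 to a graph given as edge pairs."""
    edge_set = set(edges)
    nodes = set()
    for source, target in edge_set:
        nodes.add(source)
        nodes.add(target)
    return len(edge_set) - len(nodes) + 2

# Straight-line code: two blocks, one edge.
straight = [("entry", "exit")]

# An if/else: four blocks, four edges.
if_else = [("entry", "then"), ("entry", "else"),
           ("then", "exit"), ("else", "exit")]

# A loop: the body jumps back to the loop head.
loop = [("entry", "head"), ("head", "body"),
        ("body", "head"), ("head", "exit")]

print(cc_from_edges(straight), cc_from_edges(if_else), cc_from_edges(loop))
```

Straight-line code yields 1 (1 - 2 + 2), while the single decision in the if/else and in the loop raises the value to 2 (4 - 4 + 2).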



The example can be run in two different modes. The first one is invoked as usual, through IDAPython, by locating the Python script and running it. A second way is to launch IDA and make it run the script in batch mode; that will be explored in the next section.



In this example function chunks are not considered. Recent versions of IDA added support for function chunks, which are a result of some compilers' optimization processes. It is possible to walk the chunks by using the API function func_tail_iterator_t(). The following code shows how to iterate through the chunks.



### Collecting function chunks

function_chunks = []

# Get the tail iterator
func_iter = func_tail_iterator_t(get_func(ea))

# While the iterator's status is valid
status = func_iter.main()

while status:
    # Get the chunk
    chunk = func_iter.chunk()
    # Store its start and ending address as a tuple
    function_chunks.append((chunk.startEA, chunk.endEA))

    # Get the last status
    status = func_iter.next()

### Cyclomatic complexity

import os
from sets import Set

def cyclomatic_complexity(function_ea):
    """Calculate the cyclomatic complexity measure for a function.

    Given the starting address of a function, it will find all
    the basic block's boundaries and edges between them and will
    return the cyclomatic complexity, defined as:

        CC = Edges - Nodes + 2
    """

    f_start = function_ea
    f_end = FindFuncEnd(function_ea)

    edges = Set()
    boundaries = Set((f_start,))

    # For each defined element in the function.
    for head in Heads(f_start, f_end):

        # If the element is an instruction
        if isCode(GetFlags(head)):

            # Get the references made from the current instruction
            # and keep only the ones local to the function.
            refs = CodeRefsFrom(head, 0)
            refs = Set(filter(lambda x: x>=f_start and x<=f_end, refs))

            if refs:
                # If the flow continues also to the next (address-wise)
                # instruction, we add a reference to it.
                # For instance, a conditional jump will not branch
                # if the condition is not met, so we save that
                # reference as well.
                next_head = NextHead(head, f_end)
                if isFlow(GetFlags(next_head)):
                    refs.add(next_head)

                # Update the boundaries found so far.
                boundaries.union_update(refs)

                # For each of the references found, an edge is
                # created.
                for r in refs:
                    # If the flow could also come from the address
                    # previous to the destination of the branching
                    # an edge is created.
                    if isFlow(GetFlags(r)):
                        edges.add((PrevHead(r, f_start), r))
                    edges.add((head, r))

    return len(edges) - len(boundaries) + 2


def do_functions():

    cc_dict = dict()

    # For each of the segments
    for seg_ea in Segments():
        # For each of the functions
        for function_ea in Functions(seg_ea, SegEnd(seg_ea)):
            cc_dict[GetFunctionName(function_ea)] = cyclomatic_complexity(function_ea)

    return cc_dict


# Wait until IDA has done all the analysis tasks.
# If loaded in batch mode, the script will be run before
# everything is finished, so the script will explicitly
# wait until the autoanalysis is done.
autoWait()

# Collect data
cc_dict = do_functions()

# Get the list of functions and sort it.
functions = cc_dict.keys()
functions.sort()
ccs = cc_dict.values()

# If the environment variable IDAPYTHON exists and its value is 'auto'
# the results will be appended to a data file and the script will quit
# IDA. Otherwise it will just output the results.
if os.getenv('IDAPYTHON') == 'auto':
    results = file('example5.dat', 'a+')
    results.write('%3.4f,%03d,%03d %s\n' % (
        sum(ccs)/float(len(ccs)), max(ccs), min(ccs), GetInputFile()))
    results.close()
    Exit(0)

else:
    # Print the cyclomatic complexity for each of the functions.
    for f in functions:
        print f, cc_dict[f]

    # Print the maximum, minimum and average cyclomatic complexity.
    print 'Max: %d, Min: %d, Avg: %f' % (max(ccs), min(ccs), sum(ccs)/float(len(ccs)))

Automating IDA through IDAPython

As mentioned in the last section, the previous example has a second way of operating. IDAPython now supports running Python scripts on startup, from the command line. Such functionality comes in handy, to say the least, when analyzing a set of binaries in batch mode.



The switch -OIDAPython:/path/to/python/script.py can be used to tell IDAPython which script to run. Another switch which might come in handy is -A, which instructs IDA to run in batch mode, not asking anything, just performing the auto-analysis. With those two options combined it is possible to auto-analyze a binary and run a Python script to perform some mining. A function which will usually be required is autoWait(), which makes the Python script wait until IDA is done performing the analysis. It is a good idea to call it at the beginning of any script. To analyze a bunch of files, a command like the following could be entered (if working in Bash on Linux):



for virus in virus/*.idb; do IDAPYTHON='auto' idal -A -OIDAPython:example5.py $virus; done

It will go through all the .idb files in the virus/ directory and will invoke idal with each of them, running the script example5.py on load.
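
The same loop could also be driven from Python rather than Bash. The following is only a sketch under that assumption: build_batch_commands() is a hypothetical helper that prepares one command line per file, using just the idal binary, the -A and -OIDAPython: switches and the IDAPYTHON variable mentioned above. It does not launch IDA itself.

```python
import glob
import os

def build_batch_commands(paths, script, ida="idal"):
    """Return one (argv, env) pair per input file, without running anything."""
    # Copy the current environment and add the variable the script checks.
    env = dict(os.environ, IDAPYTHON="auto")
    commands = []
    for path in paths:
        argv = [ida, "-A", "-OIDAPython:" + script, path]
        commands.append((argv, env))
    return commands

# Same file selection as the Bash loop; the list is simply empty if the
# virus/ directory does not exist.
for argv, env in build_batch_commands(sorted(glob.glob("virus/*.idb")),
                                      "example5.py"):
    print(" ".join(argv))
    # subprocess.call(argv, env=env) would start IDA for this file.
```

Keeping the command construction separate from the launch makes it easy to inspect what would be run before committing to a long batch.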



The script is the one in the last example. If it finds the environment variable IDAPYTHON set to 'auto', it will just collect the data and append it to a file instead of showing it in IDA's messages window. Subsequently it will call Exit() to close the database and quit.



It would be equally easy to analyze a set of executables in batch mode. If IDB files are given, IDA will just load them and no auto-analysis will be performed; otherwise, if a binary file is provided, the analysis will be done and the script run once it has finished.



All this allows for a good degree of automation in the analysis of a set of binaries. For instance, the next table is the output of running the previous script on a bunch of malware IDBs. A nice feature is the clear clustering of the families by their cyclomatic complexity values.



Output of running the example in batch mode on a set of malware binaries.



          Cyclomatic Complexity
Sample    Avg.     Max   Min   Filename

Klez      7.4197   148   001   klez_a.ex
          7.4975   148   001   klez_b.ex
          7.5972   148   001   klez_c.ex
          7.5972   148   001   klez_d.ex
          7.0349   148   001   klez_e.ex
          7.0502   148   001   klez_f.ex
          7.0502   148   001   klez_g.ex
          7.0573   148   001   klez_h.ex
          7.0573   148   001   klez_i.ex
          7.0502   148   001   klez-j.ex

Mimail    3.2190   052   001   mimailA.ex_.1.unp
          3.2353   052   001   mimailB.ex_
          3.2313   052   001   mimailC.ex_.1.unp
          3.4148   052   001   mimailD.ex_
          2.8110   052   001   mimailE.ex_.1.unp
          2.7953   052   001   mimailF.ex_.1.unp
          2.7638   052   001   mimailG.ex_.1.unp
          2.7874   052   001   mimailH.ex_.1.unp
          2.8376   052   001   mimailI.ex_.1.unp
          2.8632   052   001   mimailJ.ex_
          2.8984   052   001   mimailL.ex_.1.unp
          2.8231   052   001   mimail-m_u.ex
          3.4375   052   001   outlook_.dmp
          3.1138   052   001   mimail-s_u.ex

Sasser    6.5301   039   001   sasser.avpe
          6.5422   039   001   sasser-b.avpe
          6.6098   039   001   sasser-c.avpe
          6.5955   041   001   sasser-d.ex_unp.exe
          6.5444   041   001   sasser-e.unp
          6.8452   041   001   sasser-f.unp
          8.0000   041   001   sasser-g.unp

Netsky    7.3505   041   001   netskyaa.unp
          7.4947   041   001   netsky_unk.unp
          7.1667   041   001   netsky_ac.ex_unp
          5.9694   051   001   Netsky.AD.unp
          7.3125   041   001   virus.ex_.1.unp
          7.2478   041   001   your_details.doc.exe.2.unp
          8.0407   123   001   userconfig9x.dl.1.unp
          7.9068   041   001   netsky-q-dll.unp
          7.9068   041   001   netsky-q-dll.unp
          7.5702   041   001   netsky-r-dll_unp_.exe
          7.5657   041   001   list0_unp_.pif
          7.5743   041   001   private.unp.pi_
          7.5268   041   001   netsky_v_unp_.exe
          7.8824   041   001   netsky-w.unp
          6.8165   041   001   netsky.pif.2.unp
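
The data file written in batch mode is easy to read back for further processing. The sketch below assumes each line has the form "average,maximum,minimum filename", as produced by the results.write() call in the script; parse_result_line() is a hypothetical helper, and the three sample lines reuse values from the table above.

```python
from collections import defaultdict

def parse_result_line(line):
    """Split 'avg,max,min filename' into typed fields."""
    # Only the first space separates the numbers from the file name.
    numbers, filename = line.strip().split(" ", 1)
    average, highest, lowest = numbers.split(",")
    return float(average), int(highest), int(lowest), filename

sample_lines = [
    "7.4197,148,001 klez_a.ex\n",
    "3.2190,052,001 mimailA.ex_.1.unp\n",
    "6.5301,039,001 sasser.avpe\n",
]

# Group the files by their maximum complexity, one simple way of making
# clusters of similar samples visible.
by_max = defaultdict(list)
for line in sample_lines:
    average, highest, lowest, filename = parse_result_line(line)
    by_max[highest].append(filename)

for highest in sorted(by_max):
    print(highest, by_max[highest])
```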



Visualizing Binaries

This example is based on the one collecting the indegree and outdegree of all functions. This time, we will use that information to generate a graph of the call-tree and plot it using pydot (Carrera 2005a), a package to interface with Graphviz (Ellson et al. 2005).



The code follows. The only changes from the example it is based on are the lines creating the graph, setting some defaults and then adding the edges.



### Visualizing Binaries

from sets import Set
import pydot

# Get the address under the cursor; its segment bounds the walk
ea = ScreenEA()

callers = dict()
callees = dict()

# Loop through all the functions
for function_ea in Functions(SegStart(ea), SegEnd(ea)):

    f_name = GetFunctionName(function_ea)

    # For each of the incoming references
    for ref_ea in CodeRefsTo(function_ea, 0):

        # Get the name of the referring function
        caller_name = GetFunctionName(ref_ea)

        # Add the current function to the list of functions
        # called by the referring function
        callees[caller_name] = callees.get(caller_name, Set())
        callees[caller_name].add(f_name)

# Create graph
g = pydot.Dot(type='digraph')

# Set some defaults
g.set_rankdir('LR')
g.set_size('11,11')
g.add_node(pydot.Node('node', shape='ellipse', color='lightblue', style='filled'))
g.add_node(pydot.Node('edge', color='lightgrey'))

# Get the list of all functions
functions = Set(callees.keys()+callers.keys())

# For each of the functions and each of the called ones, add
# the corresponding edges.
for f in functions:
    if callees.has_key(f):
        for f2 in callees[f]:
            g.add_edge(pydot.Edge(f, f2))

# Write the output to a Postscript file
g.write_ps('example6.ps')
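
To see what ends up being handed to Graphviz, the graph can also be written out by hand as DOT text. This is a sketch of an alternative to pydot, not what the script above does: to_dot() is a hypothetical helper, the sample call data is made up, and function names are assumed not to contain double quotes.

```python
def to_dot(callees):
    """Render {caller: set of callees} as Graphviz DOT source text."""
    lines = [
        "digraph G {",
        "  rankdir=LR;",
        "  node [shape=ellipse, color=lightblue, style=filled];",
        "  edge [color=lightgrey];",
    ]
    # Sorting keeps the output stable between runs.
    for caller in sorted(callees):
        for callee in sorted(callees[caller]):
            lines.append('  "%s" -> "%s";' % (caller, callee))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical call data in the same shape as the callees dictionary.
sample_callees = {"main": {"parse", "report"}, "parse": {"report"}}
dot_source = to_dot(sample_callees)
print(dot_source)
```

The resulting text can be saved to a file and rendered with any of the Graphviz layout tools.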

Some example output is shown next. The different plots are obtained by using the different plotting utilities provided by Graphviz.











Projects Using IDAPython

It might also be useful to check some already existing projects based solely on IDAPython. Some of them are:

idb2reml (Carrera 2005): exports IDB information to an XML format, REML (ReverseEngineering ML).

pyreml (Carrera 2005a): loads the REML produced by idb2reml and provides a set of functions to perform advanced analysis.

This paper is also available in PDF form. The PDF version is 64 pages long, contains a full function reference in addition to this article, and is available from Introduction to IDAPython.pdf.
