This article is for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Intel® Developer Zone offers tools and how-to information for cross-platform app development, platform and technology information, code samples, and peer expertise to help developers innovate and succeed. Join our communities for Android, Internet of Things, Intel® RealSense™ Technology, and Windows to download tools, access dev kits, share ideas with like-minded developers, and participate in hackathons, contests, roadshows, and local events.

Introduction

The new buzz in the mobile marketplace is about Android 64-bit systems. In September 2013, Apple released the iPhone* 5s with a 64-bit A7 processor onboard. Thus began the mobile technology race.

It turns out that the Linux* kernel on which Android is based has supported processors with 64-bit registers for a long time. Ubuntu is "GNU/Linux" while Android is "Dalvik/Linux". Dalvik is the process virtual machine (VM) in Google's Android operating system, which specifically executes applications written for Android. This makes Dalvik an integral part of the Android software stack, which is typically used on mobile devices such as mobile phones and tablet computers, and more recently on devices such as smart TVs and wearables. Nevertheless, all developers who use the NDK have to rebuild their programs for the new architecture, and the ease or difficulty of this process depends on the tools that Google will provide. In addition, Google should provide backward compatibility, i.e., 32-bit NDK applications should run on 64-bit Android.

The first Intel 64-bit processors for mobile devices were created in the 3rd quarter of 2013 and were the new powerful multicore System on a Chip (SoC) for mobile and desktop devices. This new SoC family consists of Intel® Atom™ processors for tablets and 2 in 1 devices, Intel® Celeron® processors, and Intel® Pentium® processors for 2 in 1 devices, laptops, desktop PCs and All in One PCs.

In October 2014, Google released a preview emulator image of the 64-bit Android L for developers. This allowed them to test their programs and rewrite code, if necessary, before the OS is released. In a Google+ blog, developers indicated that programs created entirely with Java* do not require porting. They ran them "as is" in the L version of the emulator, which supports 64-bit architecture. Those using other languages, especially C and C++, will have to perform some steps to build against the new Android NDK. Several older versions of Android-based devices with 64-bit processors are on the market. However, manufacturers may have to update them rather quickly; otherwise, there will be a lack of software apps for users.

Android 64-bit L emulator

In June 2014, Google announced that Android would support 64-bit in the coming release. This is great news for those who want the most performance possible out of their devices and apps. The list of benefits highlighted by Google in this update includes a larger number of registers, increased addressable memory space, and new instruction sets.

The Android emulator supports many hardware features likely to be found on mobile devices, including:

An ARM* v5 CPU and the corresponding memory-management unit (MMU)

A 16-bit LCD display

One or more keyboards (a Qwerty-based keyboard and associated Dpad/Phone buttons)

A sound chip with output and input capabilities

Flash memory partitions (emulated through disk image files on the development machine)

A GSM modem, including a simulated SIM Card

A camera, using a webcam connected to your development computer.

Sensors like an accelerometer, using data from a USB-connected Android device

This is a great step forward for building our favorite devices and apps. Unfortunately, we’ll have to wait for Android L to drop before we can enjoy these new performance boosts. A few weeks after Android L releases, Revision 10 of the Native Development Kit (NDK) should be posted with support for the three 64-bit architectures that will be able to run the new version of Android: arm64-v8a, x86_64, and mips64. If you’ve built an app using Java, your code will automatically have better performance on the new x86 64-bit architecture. Google has updated the NDK to revision 10b and added an emulator image developers can use to prepare their apps to run on devices built with Intel's 64-bit chips.

Keep in mind, the NDK is only for native apps, not those built with Java on the regular Android SDK. If you have been looking forward to getting your apps running on 64-bit, or if you need to update to the latest version of the NDK, hit the developer portal to get your download started.

Developing with the x86_64 Android NDK

The Native Development Kit (NDK) is a toolset that allows you to implement parts of your app using native code languages such as C and C++. For certain types of apps, this can be helpful so you can reuse existing code libraries written in these languages, but most apps do not need the Android NDK. You need to balance the benefits of using the NDK against its drawbacks. Notably, using native code on Android generally does not result in a noticeable performance improvement, but it always increases your app complexity. You should only use the NDK if it is essential to your app and not because you simply prefer to program in C/C++.

You can download the latest version of Android NDK from: https://developer.android.com/tools/sdk/ndk/index.html

In this section I'll review how to compile a sample application using the Android NDK.

We will use the sample application, san-angeles, located in the Android NDK samples directory:

$ANDROID_NDK/samples/san-angeles

Native code is located in the jni/ directory:

$ANDROID_NDK/samples/san-angeles/jni

Native code is compiled for specified CPU architecture(s). Android applications may contain libraries for several architectures in one apk file.

To set target architectures you need to create the Application.mk file inside the jni/ directory. The following line will compile the native libraries for all supported architectures:

APP_ABI := all

Sometimes, it’s better to specify a list of target architectures. This line compiles the libraries for x86 and ARM architectures:

APP_ABI := x86 armeabi armeabi-v7a

Because we are building a 64-bit app, we need to compile the libraries for x86_64 architectures:

APP_ABI := x86_64

Run the following commands to change to the sample directory and build the libraries with the NDK's ndk-build script:

cd $ANDROID_NDK/samples/san-angeles

$ANDROID_NDK/ndk-build

After the successful build, open the sample in Eclipse* as an Android application and click "Run". Select the emulator or a connected Android device where you want to run the application.

To support all available devices you need to compile the application for all architectures. If the apk file size with libraries for all architectures is too big, consider following the instructions in Google Play Multiple APK Support to prepare a separate apk file for each platform.
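When an apk carries libraries for several architectures, it can help to log which ABI a given native library was actually compiled for. The following is a minimal sketch, assuming a GCC- or Clang-based NDK toolchain that defines the usual predefined architecture macros; the function names buildAbi and abiIs64Bit are hypothetical and not part of the NDK.

```cpp
#include <cstring>

// Returns the Android ABI name this file was compiled for, based on
// compiler-predefined macros (GCC/Clang). Hypothetical helper, not an NDK API.
const char* buildAbi() {
#if defined(__x86_64__)
    return "x86_64";
#elif defined(__i386__)
    return "x86";
#elif defined(__aarch64__)
    return "arm64-v8a";
#elif defined(__arm__)
    return "armeabi";    // armeabi and armeabi-v7a are not distinguished here
#elif defined(__mips64)
    return "mips64";     // checked before __mips__, which is also set on mips64
#elif defined(__mips__)
    return "mips";
#else
    return "unknown";
#endif
}

// True for the three 64-bit ABIs named in this article.
bool abiIs64Bit(const char* abi) {
    return std::strcmp(abi, "x86_64") == 0 ||
           std::strcmp(abi, "arm64-v8a") == 0 ||
           std::strcmp(abi, "mips64") == 0;
}
```

Calling buildAbi() from a JNI entry point and writing the result to the log is one way to confirm that the x86_64 library, rather than a 32-bit fallback, was loaded on a 64-bit device.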

Checking supported architectures

You can use this command to check which architectures are included in an apk file:

aapt dump badging file.apk

The following line lists all architectures:

native-code: 'armeabi', 'armeabi-v7a', 'x86', 'x86_64'

Another method is to open the apk file as a zip file and view subdirectories in the lib/ directory.
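The zip-based check relies on the apk layout lib/<abi>/<library>.so. As a rough sketch, assuming you already have the list of entry paths from any zip tool, the ABI names can be collected as follows; abisFromEntries is a hypothetical helper and the entry names in the usage note are illustrative.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collects ABI directory names from apk entry paths of the form
// "lib/<abi>/<file>". Entries outside lib/ are ignored.
std::set<std::string> abisFromEntries(const std::vector<std::string>& entries) {
    std::set<std::string> abis;
    const std::string prefix = "lib/";
    for (const std::string& path : entries) {
        if (path.compare(0, prefix.size(), prefix) != 0) {
            continue;  // not under lib/
        }
        const std::size_t slash = path.find('/', prefix.size());
        if (slash == std::string::npos || slash == prefix.size()) {
            continue;  // no ABI subdirectory in this path
        }
        abis.insert(path.substr(prefix.size(), slash - prefix.size()));
    }
    return abis;
}
```

Given the entries "lib/x86_64/libsanangeles.so" and "lib/armeabi/libsanangeles.so", the function returns the set containing armeabi and x86_64, which should match the native-code line reported by aapt.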

Optimization of 64-bit programs

Reducing the amount of memory an app consumes

When a program is compiled in 64-bit mode, it consumes more memory than its 32-bit version. This increase often goes unnoticed, but memory consumption can sometimes be twice that of the 32-bit version. The amount of memory consumed is determined by the following factors:

Some objects, like pointers, require larger amounts of memory

Data alignment and data structure padding

Increased stack memory consumption

64-bit systems have a larger amount of memory available to user applications than 32-bit systems. So if a program takes 300 Mbytes on a 32-bit system with 2 Gbytes of memory but needs 400 Mbytes on a 64-bit system with 8 Gbytes of memory, then in relative units (roughly 15% of the available memory versus 5%) the program takes three times less memory on the 64-bit system. The one disadvantage is performance loss. Although 64-bit programs are faster, extracting larger amounts of data from memory might cancel all the advantages and even reduce performance. Transferring data between the memory and microprocessor (cache) is not very cheap.

One way to reduce memory consumption is to optimize data structures. Another way is to use memory-saving data types. For instance, if we need to store a lot of integer numbers and we know that their values will never exceed UINT_MAX, we may use the type "unsigned" instead of "size_t", as discussed in the next section.
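Both techniques can be illustrated with a short sketch. The struct and function names below are hypothetical; the exact sizes depend on the ABI, but on a typical x86_64 target the first struct occupies 24 bytes and the reordered one 16.

```cpp
#include <cstddef>
#include <cstdint>

// A pointer placed between two small fields: padding is added after each
// char so that the pointer, and the struct as a whole, stay pointer-aligned.
struct Padded {
    char  tag;
    void* data;
    char  flag;
};

// Same fields with the largest member first: the two chars now share a
// single padded tail, so less space is wasted.
struct Reordered {
    void* data;
    char  tag;
    char  flag;
};

// Bytes needed to store `count` values as size_t versus a 32-bit unsigned type.
constexpr std::size_t bytesAsSizeT(std::size_t count) {
    return count * sizeof(std::size_t);
}
constexpr std::size_t bytesAsUint32(std::size_t count) {
    return count * sizeof(std::uint32_t);
}
```

In a 64-bit build, storing a million values as std::uint32_t rather than size_t halves the storage, provided the values really do fit in 32 bits.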

Using memsize-types in address arithmetic

Using ptrdiff_t and size_t types in address arithmetic might give you an additional performance gain along with making the code safer. For example, using the type int, whose size differs from the pointer's capacity, as an index results in additional data conversion commands in the binary code. We might have 64-bit code where the pointers' size is 64 bits while the size of the int type remains the same: 32 bits.

It is not easy to give a brief example to show that size_t is better than unsigned. To be impartial, we have to use the compiler's optimizing capabilities. But two variants of the optimized code often get too different to easily demonstrate their difference. We managed to create something like a simple example after six tries. But the sample is far from ideal because instead of the code containing the unnecessary conversions of data types discussed above, it shows that the compiler can build more efficient code when using size_t. Consider this code, which arranges array items in the reverse order:

unsigned arraySize;
...
for (unsigned i = 0; i < arraySize / 2; i++)
{
  float value = array[i];
  array[i] = array[arraySize - i - 1];
  array[arraySize - i - 1] = value;
}

The variables "arraySize" and "i" in the example have the type unsigned. You can easily replace it with size_t and compare a small fragment of assembler code shown in Table 1.

Table 1 - Comparing the 64-bit assembler code fragments using the types unsigned and size_t

Statement compiled: array[arraySize - i - 1] = value;

arraySize, i : unsigned              | arraySize, i : size_t
mov   eax, DWORD PTR arraySize$[rsp] | mov   rax, QWORD PTR arraySize$[rsp]
sub   eax, r11d                      | sub   rax, r11
sub   r11d, 1                        | add   r11, 1
add   eax, -1                        |
movss DWORD PTR [rbp + rax*4], xmm0  | movss DWORD PTR [rdi + rax*4 - 4], xmm0
…                                    | …

The compiler managed to build more concise code when using 64-bit registers. We are not claiming that the code created using the type unsigned (column 1) will be slower than the code using the type size_t (column 2); it is difficult to compare the speed of code execution on contemporary processors. But you can see in this example that the compiler produced shorter code when using 64-bit types.
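For reference, the size_t variant of the reversal loop might be written as a small function like this. It is a sketch only: the function name reverseInPlace is an assumption, and the loop body is the one from the example above.

```cpp
#include <cstddef>

// Reverses `array` in place. The size and the loop index are size_t, so
// they match the pointer width in both 32-bit and 64-bit builds.
void reverseInPlace(float* array, std::size_t arraySize) {
    for (std::size_t i = 0; i < arraySize / 2; i++) {
        float value = array[i];
        array[i] = array[arraySize - i - 1];
        array[arraySize - i - 1] = value;
    }
}
```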

Now let us consider an example showing the advantages of the types ptrdiff_t and size_t from the viewpoint of performance. For the purposes of demonstration, we will take a simple algorithm of calculating the minimum path length.

The function FindMinPath32 is written in classic 32-bit style with unsigned types. The function FindMinPath64 differs from it only in the way that all the unsigned types in it are replaced with size_t types. There are no other differences! Now let us compare the execution speeds of these two functions (Table 2).

Table 2 - The time of executing the functions FindMinPath32 and FindMinPath64

No. | Mode and function                                | Function's execution time
1   | 32-bit compilation mode. Function FindMinPath32  | 1
2   | 32-bit compilation mode. Function FindMinPath64  | 1.002
3   | 64-bit compilation mode. Function FindMinPath32  | 0.93
4   | 64-bit compilation mode. Function FindMinPath64  | 0.85

Table 2 shows reduced time relative to the speed of execution of the function FindMinPath32 on a 32-bit system. This table was developed for the purpose of clarity. The operation time of the FindMinPath32 function in the first line is 1 on a 32-bit system. This represents our baseline as a unit of measurement.

In the second line, we see that the operation time of the FindMinPath64 function is also 1 on a 32-bit system. No wonder, because the type unsigned coincides with the type size_t on a 32-bit system, and there is no difference between the FindMinPath32 and FindMinPath64 functions. A small deviation (1.002) only indicates a small error in measurements.

In the third line, we see a performance gain of 7%. We could well expect this result after recompiling the code for a 64-bit system.

The fourth line is of the most interest for us. The performance gain is 15%. By merely using the type size_t instead of unsigned, the compiler built more effective code that works even 8% faster!

This simple and obvious example shows how data that are not equal to the size of the machine word slow down algorithm performance. Mere replacement of the types int and unsigned with ptrdiff_t and size_t may result in a significant performance gain. This result applies first of all to those cases where these data types are used in index arrays, in address arithmetic and to arrange loops.
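One way to set up such a comparison yourself is to write the algorithm once as a template over the index type and instantiate it with unsigned and with size_t. The sketch below is not the FindMinPath code measured above; sumOfRowMinima is a hypothetical stand-in that uses the index type for loop counters and address arithmetic in the same way.

```cpp
#include <cstddef>
#include <vector>

// Sums the smallest element of each row of a row-major grid.
// `Index` is the type used for loop counters and for computing offsets.
// Assumes cols >= 1 and grid.size() >= rows * cols.
template <typename Index>
int sumOfRowMinima(const std::vector<int>& grid, Index rows, Index cols) {
    int total = 0;
    for (Index r = 0; r < rows; ++r) {
        int best = grid[r * cols];
        for (Index c = 1; c < cols; ++c) {
            const int v = grid[r * cols + c];
            if (v < best) {
                best = v;
            }
        }
        total += best;
    }
    return total;
}
```

Timing sumOfRowMinima<unsigned> against sumOfRowMinima<std::size_t> on a large grid in a 64-bit build shows whether the index type matters for your compiler and data; the results in Table 2 should not be assumed to carry over unchanged.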

Intrinsic functions

Intrinsic functions are special system-dependent functions that perform actions that cannot be performed at the C/C++ level of code or that perform these functions much more effectively. Actually, they let you get rid of inline assembler code because it is often undesirable or impossible to use.

Programs may use intrinsic functions to create faster code due to the lack of overhead expenses on calling common functions. The code size is a bit larger of course. MSDN gives a list of functions that can be replaced with their intrinsic versions. Examples of these are memcpy, strcmp, etc.

Besides automatic replacement of common functions with their intrinsic versions, you may use intrinsic functions explicitly in your code. This might be helpful due to these factors:

Inline assembler is not supported by the Visual C++ compiler in the 64-bit mode while intrinsic code is.

Intrinsic functions are simpler to use as they do not require knowledge of registers and other similar low-level constructs.

Intrinsic functions are updated in compilers while assembler code must be updated manually.

The built-in optimizer does not work with assembler code.

Intrinsic code is easier to port than assembler code.

Using intrinsic functions in automatic mode (with the help of the compiler switch) will give you some percentage of performance gain, and calling them explicitly ("manually") in your code helps even more. That is why using intrinsic functions is a good way to go.
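As an illustration of explicit use, here is a minimal sketch that sums an array of doubles with SSE2 intrinsics from <emmintrin.h>, falling back to a scalar loop when the compiler does not define __SSE2__. The function name sumDoubles is an assumption made for this example.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_add_pd, ...
#endif

// Sums `count` doubles. With SSE2, two values are accumulated per step
// using intrinsics; without it, only the scalar loop below runs.
double sumDoubles(const double* data, std::size_t count) {
    double total = 0.0;
    std::size_t i = 0;
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();
    for (; i + 2 <= count; i += 2) {
        acc = _mm_add_pd(acc, _mm_loadu_pd(data + i));  // unaligned load
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    for (; i < count; ++i) {  // leftover element, or all elements without SSE2
        total += data[i];
    }
    return total;
}
```

No registers are named anywhere in the code; the compiler allocates them and can still optimize around the intrinsic calls, which is the advantage over inline assembler described above.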

Alignment

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word-sized chunks (e.g., 4-byte chunks on 32-bit systems) or larger. Data alignment means putting the data at a memory offset equal to some multiple of the word size, which increases the system's performance due to the way the CPU handles memory. To align the data, it may be necessary to insert some meaningless bytes between the end of the last data structure and the start of the next, which is data structure padding.

For example, when the computer's word size is 4 bytes (a byte is 8 bits on most machines, but could be different on some systems), the data to be read should be at a memory offset that is some multiple of 4. When this is not the case, e.g., the data starts at the 14th byte instead of the 16th byte, then the computer has to read two 4-byte chunks and do some calculation before the requested data has been read, or it may generate an alignment fault. Even though the previous data structure ends at the 13th byte, the next data structure should start at the 16th byte. Two padding bytes are inserted between the two data structures to align the next data structure to the 16th byte.

Although data structure alignment is a fundamental issue for all modern computers, many computer languages and computer language implementations handle data alignment automatically.
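The padding a compiler inserts can be observed directly with offsetof and alignof. The following sketch uses a hypothetical struct, Record, with a 1-byte field followed by a 4-byte field.

```cpp
#include <cstddef>
#include <cstdint>

// A 1-byte field followed by a 4-byte field: the compiler inserts padding
// after `kind` so that `value` starts on a multiple of its alignment.
struct Record {
    char         kind;
    std::int32_t value;
};

// Number of padding bytes between the end of `kind` and the start of `value`.
constexpr std::size_t paddingBeforeValue() {
    return offsetof(Record, value) - (offsetof(Record, kind) + sizeof(char));
}
```

On common ABIs where std::int32_t is 4-byte aligned, paddingBeforeValue() evaluates to 3 and sizeof(Record) is 8 rather than 5.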

It is good in some cases to help the compiler by defining the alignment manually to enhance performance. For example, Streaming SIMD Extensions (SSE) data must be aligned on a 16-byte boundary. You may do this in the following way:

#include <emmintrin.h>

__declspec(align(16)) double init_val[2] = {3.14, 3.14};
__m128d vector_var = _mm_load_pd(init_val);
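The __declspec(align(16)) form is specific to the Visual C++ compiler. With the GCC- and Clang-based toolchains used for Android native code, the standard C++11 alignas specifier expresses the same request. A minimal sketch, in which the helper names isAligned16 and sumAlignedPair are assumptions for illustration:

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// True when `p` lies on a 16-byte boundary, as aligned SSE loads require.
bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Loads a 16-byte-aligned pair of doubles and returns their sum.
double sumAlignedPair() {
    alignas(16) double init_val[2] = {3.14, 3.14};  // standard C++11 alignment request
#if defined(__SSE2__)
    __m128d v = _mm_load_pd(init_val);  // aligned load: needs a 16-byte-aligned address
    alignas(16) double out[2];
    _mm_store_pd(out, v);               // aligned store back to memory
    return out[0] + out[1];
#else
    return init_val[0] + init_val[1];   // scalar fallback without SSE2
#endif
}
```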

Android Runtime

Android Runtime (ART) was developed by Google as a replacement for Dalvik. This runtime offers a number of new features that improve performance and smoothness of the Android platform and apps. ART was introduced in Android 4.4 KitKat; in Android 5.0 it will completely replace Dalvik. Unlike Dalvik, which uses a Just-In-Time (JIT) compiler (at runtime), ART uses Ahead-of-Time (AOT) compilation, meaning that ART compiles an application during its installation. As a result, the program executes faster, which also improves battery life.

For backward compatibility, ART uses the same byte code as Dalvik.

In addition to the potential speed increase, using ART can provide a second important benefit. As ART runs app machine code directly (native execution), it doesn't hit the CPU as hard as just-in-time code compiling on Dalvik. Less CPU usage results in less battery drain, which is a big plus for portable devices in general.

So why wasn't ART implemented earlier? Let's look at the downsides of Ahead-of-time (AOT) compilation. First, the generated machine code requires more space than the existing byte code. Second, the code is pre-compiled at install time, so the installation process takes a bit longer. Finally, it also corresponds to a larger memory footprint at execution time. This means that fewer apps can be run concurrently. When the first Android devices hit the market, memory and storage capacity were significantly smaller and presented a bottleneck for performance. This is the reason why a JIT approach was the preferred option at that time. Today, memory is much cheaper and thus more abundant, even on low-end devices, so ART is a logical step forward.

In perhaps the most important improvement, ART now compiles your application to native machine code when it is installed on a user’s device. This approach, known as ahead-of-time compilation, can yield large performance gains because the compilers are tuned for specific architectures (such as ARM, x86, or MIPS). It eliminates the need for just-in-time compilation each time an application is run. Thus it takes more time to install your application, but it will start faster when launched, because many tasks executed at runtime on the Dalvik VM, such as class and method verification, have already taken place.

Next, the ART team worked to optimize the garbage collector (GC). Instead of two pauses totaling about 10ms for each GC in Dalvik, you’ll see just one, usually under 2ms. They’ve also parallelized portions of the GC runs and optimized collection strategies to be aware of device states. For example, a full GC will run only when the phone is locked and responsiveness to user interaction is no longer important. This is a huge improvement for applications that are sensitive to dropping frames. Additionally, future versions of ART will include a compacting collector that will move chunks of allocated memory into contiguous blocks to reduce fragmentation and the need to kill older applications to allocate large memory regions.

Lastly, ART makes use of an entirely new memory allocator called Rosalloc (runs of slots allocator). Most modern systems use allocators based on Doug Lea’s design, which has a single global memory lock. In a multithreaded, object-oriented environment, this interferes with the garbage collector and other memory operations. In Rosalloc, smaller objects common in Java are allocated in a thread-local region without locking and larger objects have their own locks. Thus when your application attempts to allocate memory for a new object, it doesn’t have to wait while the garbage collector frees an unrelated region of memory.

Currently, Dalvik is the default runtime for Android devices and ART is optionally available on a number of Android 4.4 devices, such as Nexus phones, Google Play edition devices, Motorola phones running stock Android, and many other smartphones. ART is currently in development, and seeking developer and user feedback. ART will eventually replace Dalvik runtime once it becomes completely stable. Until then, users with compatible devices can switch from Dalvik to ART if they’re interested in trying out this new functionality and experience its performance.

To switch or enable ART, your device must be running Android 4.4 KitKat and be compatible with ART. You can easily turn on ART runtime from "Settings" -> "Developer options" -> "Runtime option". (Tip: If you can’t see Developer options in Settings, then go to "About phone", scroll down, and tap the Build number 7 times to enable developer options.) The phone will reboot and start optimizing the apps for ART, which can take around 15-20 minutes, depending on the number of apps installed on your phone. You will also notice an increase in the size of installed apps after enabling ART runtime.

Note: After switching to ART, when you reboot your device for the first time, it will optimize all the apps once again; which is kind of annoying.

As Dalvik is the default runtime on Android devices, some apps might not work on ART, though most existing apps are compatible with ART and should work fine. If you do experience bugs or app crashes with ART, it’s wise to switch back and stay with Dalvik.

Switching to ART on devices requires you to know where to find the switching option on the device. Google has hidden it under Settings. Fortunately, there is a trick to enable ART runtime on devices that are based on Android 4.4 KitKat.

Disclaimer: Before trying this, you should make a backup of your data. Intel won’t be responsible if your device gets bricked (won’t turn on regardless of what you try). Try it at your own risk!

Requires Root

Don’t try if you have WSM Tools installed as they don’t support ART.

To enable ART, carefully follow these steps:

Make sure your device is rooted.

Install ‘ES File Explorer’ from the Play store.

Open ES File Explorer, tap the menu icon from top left corner and select Tools.

In tools, enable the ‘Root Explorer’ option and grant full root access to ES explorer when prompted.

In ES explorer, open the Device (/) directory from Menu -> Local -> Device.

Go to the /data/property folder.

Open the persist.sys.dalvik.vm.lib file as Text and then select ES note editor.

Edit the file by selecting the edit option from top right corner.

Rename the line from libdvm.so to libart.so.

Go back to the persist.sys.dalvik.vm.lib file and select ‘Yes’ to save the file.

Then reboot the phone.

The phone will reboot now and start optimizing the apps for ART. It can take time to reboot depending on the number of apps installed on your device.

In case you want to revert back to Dalvik runtime, simply follow the above steps and rename the text in persist.sys.dalvik.vm.lib file to libdvm.so.

Conclusion

Google has released a 64-bit emulator image for the forthcoming Android L - but only for the Intel x86 chip. The new emulator will allow developers to build or optimize older apps for the upcoming Android L OS and its new 64-bit architecture. Moving to 64-bit increases the addressable memory space, and allows a larger number of registers and a new instruction set for developers, but 64-bit apps aren't necessarily faster.

Java apps automatically gain the benefits of 64-bit because their byte code will be run by the new ART VM, which is 64-bit. This also implies that no changes to pure Java apps are necessary. Those built on the Android NDK will need some optimization to include the x86_64 build target. Intel has advice on how to go about porting code that targets ARM to x86/x64. Using the new emulator, developers will only be able to create apps for Intel® Atom™ processor-based chips.

Intel has been providing developers with tools and good system support for Android, particularly its Intel® Hardware Accelerated Execution Manager (Intel® HAXM) and a range of Intel Atom OS images. Many Android programmers regularly test on emulated Intel architecture even though most of their deployment is to ARM devices. As well as the new emulator, there is a 64-bit upgrade to the HAXM accelerator, which should make using HAXM even more attractive. To quote Intel:

"This commitment is evident not only in the delivery of the industry’s first 64-bit emulator image for Intel architecture, and 64-bit Intel HAXM within the Android L Developer Preview SDK, but also in many other innovations along the way such as the first 64-bit kernel for Android KitKat earlier this year, the 64-bit Android Native Development Kit (NDK), and other 64-bit advancements over the last decade."

Could it be that a change to Intel architecture might happen as part of the change from 32-bit mobile to 64-bit mobile?

The Android SDK includes a virtual mobile device emulator that runs on your computer. The emulator lets you prototype, develop, and test Android applications without using a physical device. The Android emulator mimics all of the hardware and software features of a typical mobile device, except that it cannot place actual phone calls. It provides a variety of navigation and control keys, which you can "press" using your mouse or keyboard to generate events for your application. It also provides a screen in which your application is displayed, along with any other active Android applications.

To let you model and test your application more easily, the emulator utilizes Android Virtual Device (AVD) configurations. AVDs let you define certain hardware aspects of your emulated phone and allow you to create many configurations to test many Android platforms and hardware permutations. Once your application is running on the emulator, it can use the services of the Android platform to invoke other applications, access the network, play audio and video, store and retrieve data, notify the user, and render graphical transitions and themes.
