When something totally out of your control happens to your app that prevents your users from using it, are you able to do something about it?

We will analyze a recent problem with the Samsung A70: after the official update to Android 10, some applications started having problems connecting to their respective backend services.

For instance, Blizzard's Hearthstone, quite a big player on the Android market, also suffered from this issue. That is good news for us: the bigger the issue, the better the chance that it will get fixed. Otherwise such issues usually never get fixed, because pushing any change to a worldwide market is extremely costly (even a small fix requires extensive testing so that you don't trade one bugfix for new bugs affecting even more users).

Solving almost any problem in the IT world involves the same common steps:

1) Confirm that a problem actually exists

2) Find where the problem is

3) Isolate the problem out of your application logic (i.e. create a 'new app' that reproduces the problem)

4) Diagnose and debug the problem

5A) Create a fix or workaround for the problem

5B) Give up all hope (remember that your time costs money; at some point it might be cheaper to replace a few of your customers' devices than to keep pursuing a fix, or the revenue from those customers may be lower than the cost of fixing it)

OK, we know what to do, so let's jump in and do it.

Step 1 - Confirm that a problem actually exists

We download our app from the Google Play Store onto our regular development devices and... it just works. (Great, we have just confirmed that the problem does not affect 100% of the user base.)

We order the specific device online and wait two days for it to arrive. Let's hope the device we get has the issue. It has happened to us a few times before that the device we ordered was the correct model, but the issue did not reproduce. That usually happens when a device is clean and has none of the crapware that a device in daily use usually has.

The device arrives.

And the problem is reproduced! Sorry for blacking out some irrelevant stuff - I always wanted to feel like an FBI agent.

Step 2 - Find where the problem is.

In this case, this step is quite easy. We can clearly see that it is a connectivity problem with our backend services; the error is even mapped to a "Connection problem" dialog, so the application user can see it too.

Step 3 - Isolate the problem out of your application logic

This step is sometimes tricky, because isolating the code needed to reproduce the issue sometimes hides the issue (for example, there might be a race condition somewhere, and stripping out a bit of code can cause that race condition to never occur).

We copy our HTTP connection code into a new application:

import android.os.Build;
import android.util.Log;

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSocketFactory;

public class BackendConnection {
    private int counter = 0;
    private SSLSocketFactory m_sslSocketFactory;
    private HostnameVerifier m_hostnameVerifier;
    private String url;
    private String tag;
    private HttpURLConnection m_currentConnection;

    public BackendConnection(String tag) {
        this.tag = tag;
    }

    public void InitializeHost(String url) {
        this.url = url;
    }

    public void InitializeSecurity(SSLSocketFactory sslSocketFactory) {
        m_sslSocketFactory = sslSocketFactory;
    }

    public synchronized boolean execute(byte[] payload) {
        try {
            counter++;
            URL url = new URL(this.url);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            if ( null != m_sslSocketFactory ) {
                HttpsURLConnection huc = (HttpsURLConnection) connection;
                huc.setSSLSocketFactory(m_sslSocketFactory);
            }
            connection.setRequestProperty("Content-Type", "application/json");
            connection.setRequestProperty("charset", "utf-8");
            // android < 4.4
            if ( Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN_MR2 ) {
                connection.setRequestProperty("connection", "close");
            } else {
                System.setProperty("http.keepAlive", "false");
            }
            // Only announce a request body when there is one (avoids a NullPointerException on a null payload)
            if ( null != payload ) {
                connection.setDoOutput(true);
                connection.setFixedLengthStreamingMode(payload.length);
            }
            synchronized(this) {
                m_currentConnection = connection;
            }
            long start = System.currentTimeMillis();
            connection.connect();
            if ( null != payload ) {
                OutputStream os = connection.getOutputStream();
                os.write(payload);
            }
            // Read the whole response body
            DataInputStream isr = new DataInputStream(connection.getInputStream());
            ByteArrayOutputStream streamData = new ByteArrayOutputStream();
            do {
                byte[] buf = new byte[1024];
                int len = isr.read(buf);
                if ( len < 0 ) {
                    break;
                }
                streamData.write(buf, 0, len);
            } while(true);
            byte[] data = streamData.toByteArray();
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] time: %dms", tag, counter, (System.currentTimeMillis() - start)));
            connection.disconnect();
            return true;
        } catch (MalformedURLException ex) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } catch (UnknownHostException ex) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } catch (SSLPeerUnverifiedException ex) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } catch ( InterruptedIOException ex ) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } catch (IOException ex) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } catch (Exception ex) {
            Log.d("SOFTAX-SDK-HTTP", String.format("[%s:%d] exception: %s", tag, counter, ex.toString()));
            return false;
        } finally {
            synchronized ( this ) {
                m_currentConnection = null;
            }
        }
    }
}

Add a magic button to execute that code:

@Override
public void onClick(View v) {
    // Run the request off the UI thread
    Thread thd = new Thread(new Runnable() {
        @Override
        public void run() {
            BackendConnection connection = new BackendConnection("SAMSUNG");
            connection.InitializeHost(String.format("%s/samsunga70test1", host));
            connection.InitializeSecurity(sslFactory.get());
            connection.execute("{ \"hello\": \"kitty\"}".getBytes());
        }
    });
    thd.start();
}

And..

Success! We have isolated the code.

Step 4 - Diagnose and debug the problem.

This is usually the most time-consuming part of the job. I usually divide it into two parts.

1) Find out how to force the error to occur.

2) Debug only when you know that the error will occur. Remember, we will be debugging a problem in code we don't even have the source for, so we want to minimize the time wasted debugging when there is nothing to debug (it just works).

Notice that if we wait long enough, the error will occur. Great!
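Part 1 can also be automated. Below is a minimal sketch, assuming that the failure depends on how long the app sits idle between two requests (all we really know is that waiting long enough triggers it). `IdleProbe`, `firstFailingIdle` and the request callback are hypothetical names, not part of the test app above.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: finds the first idle period after which a request fails. */
final class IdleProbe {

    /**
     * Sends one warm-up request, then for each idle period: sleeps, sends the request again.
     * Returns the first idle period (in ms) after which the request failed, or -1 if none failed.
     */
    static long firstFailingIdle(BooleanSupplier request, long[] idleMillis) {
        request.getAsBoolean(); // warm-up request, result intentionally ignored
        for (long idle : idleMillis) {
            try {
                Thread.sleep(idle);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                throw new IllegalStateException("probe interrupted", ex);
            }
            if (!request.getAsBoolean()) {
                return idle; // the request failed after this much idle time
            }
        }
        return -1;
    }
}
```

On the device this could be called from a background thread as `IdleProbe.firstFailingIdle(() -> connection.execute(payload), new long[] {1_000, 10_000, 60_000})`; the idle values here are arbitrary examples.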

Let's debug now:

Let's see what throws this exception:

By doing that we get some hints.

A quick Google search for the okhttp source code turns up this link: https://github.com/square/okhttp

This library has quite a few major revisions:

remotes/origin/master

remotes/origin/okhttp_27

remotes/origin/okhttp_3.10.x

remotes/origin/okhttp_3.11.x

remotes/origin/okhttp_3.12.x

remotes/origin/okhttp_3.13.x

remotes/origin/okhttp_3.14.x

remotes/origin/okhttp_3.9.x

remotes/origin/okhttp_4.0.x

remotes/origin/okhttp_4.1.x

remotes/origin/okhttp_4.2.x

remotes/origin/okhttp_4.3.x

remotes/origin/okhttp_4.4.x

So let's try all of them and look for the RealConnection class. It seems that okhttp_27 is the closest fit.

From the stack trace we notice that the connectSocket method should be somewhere near line 1409, but the file contains only 407 lines of code in total, so either the code has been obfuscated in some way after compilation, or they simply included a big comment in their copy of the file.

Let's add empty lines so that line 141 becomes line 1409, and we will have an easier time reading what is actually happening.
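The padding does not have to be done by hand. Here is a minimal sketch, assuming all the extra lines can go directly above the target line (the stack trace only tells us the final line number, not where the vendor's extra lines really are). `LinePadder` and `pad` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: shifts a source line down to the line number seen in a stack trace. */
final class LinePadder {

    /**
     * Inserts blank lines in front of the 1-based line fromLine so that it ends up at toLine.
     * Lines above fromLine keep their original numbers.
     */
    static List<String> pad(List<String> source, int fromLine, int toLine) {
        if (fromLine < 1 || fromLine > source.size() || toLine < fromLine) {
            throw new IllegalArgumentException("invalid line numbers");
        }
        List<String> out = new ArrayList<>(source.subList(0, fromLine - 1));
        for (int i = 0; i < toLine - fromLine; i++) {
            out.add(""); // one blank line per missing line number
        }
        out.addAll(source.subList(fromLine - 1, source.size()));
        return out;
    }
}
```

For the numbers above, `LinePadder.pad(lines, 141, 1409)` inserts 1409 - 141 = 1268 blank lines. The result can be written back with `Files.write(path, padded)` after reading the file with `Files.readAllLines(path)`.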

Now let's add a breakpoint and try to force the error (remember part 1?).

Got ya!

Using the debugger's step into/step out features, we walk down the call tree to see which step causes the exception to be thrown. On the side we have the okHttp library source code (but the source lines are messed up, so we need to use our brains to pinpoint where we are; there are probably also some small changes to the code).

Be sure to remember that we are working on live HTTP connections: the server side will drop the connection if the HTTPS communication takes too long, and this might lead you in the wrong direction. After all, we want to know why this occurs, so maybe we can apply a small fix on the server side instead?

OK, so we set up a method breakpoint and it works... but the bug seems to be gone - a true Heisenbug :(

I guess we have a race condition here, and method breakpoints hide it because they slow execution down a lot. The Android Studio interface does not let you add a line breakpoint if you don't have the source code, so are we doomed?

Naah... Simply build & run your application, then create a file, paste in the okHttp sources, and add blank lines so that the exception line matches:

Now add a breakpoint by double-clicking and...

One step closer

Print its stack trace via evaluate expression ( e.printStackTrace(); ):

And another step

Let's look at libcore sources.

Let's repeat the trick of adding the source code and setting a breakpoint, and we get:

Looking into connectErrno we can see:

The debugger showed the value errno = 115, so let's see what EINPROGRESS is worth:

Which looks like a dead end. So let's recover the sources from the device.

We use adb in shell mode to search for the library containing IoBridge:

a70q:/system $ grep -R 'IoBridge' *
Binary file apex/com.android.runtime.release/javalib/core-libart.jar matches
Binary file apex/com.android.runtime.release/javalib/core-oj.jar matches
Binary file app/Bluetooth/Bluetooth.apk matches
etc/preloaded-classes:libcore.io.IoBridge
Binary file framework/services.jar matches
Binary file framework/framework.jar matches
Binary file framework/wifi-service.jar matches
Binary file priv-app/MediaProvider/MediaProvider.apk matches

PS C:\work\tmp> adb pull /system/apex/com.android.runtime.release/javalib/core-libart.jar
/system/apex/com.android.runtime.release/javalib/core-libart.jar: 1 file pulled, 0 skipped. 25.9 MB/s (3319326 bytes in 0.122s)

Now let's decompile it into smali with apktool:

PS C:\work\tmp> java -jar .\apktool_2.4.1.jar d -r .\core-libart.jar
I: Using Apktool 2.4.1 on core-libart.jar
I: Baksmaling classes.dex...
I: Copying assets and libs...
I: Copying unknown files...
I: Copying original files...

And this is what we get:

Which looks like a line:

So it looks like we drop into native code there, but only under the timeoutMs == 0 condition, so maybe we can manipulate that somehow? Let's move on to step 5.

Step 5 - Create a fix or workaround for the problem

Let's try this:

Add a method breakpoint on the connectErrno method of libcore.io.IoBridge.

And it looks like the value we set as ConnectTimeout is passed down to IoBridge... Let's do some testing (remember part 1 of step 4, forcing the error?) and we have a SUCCESSFUL WORKAROUND!
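A minimal sketch of what such a workaround could look like, assuming it amounts to giving the connection a non-zero connect timeout so that the timeoutMs == 0 branch is never taken. The class name, the method name and the 15-second value are illustrative assumptions, not the values used in the production app.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: opens connections that never use the default (zero) connect timeout. */
final class TimeoutWorkaround {

    // Assumed value; any positive timeout avoids the zero (infinite) default of HttpURLConnection.
    static final int CONNECT_TIMEOUT_MS = 15_000;

    static HttpURLConnection open(String url) {
        try {
            // openConnection() only creates the object; no network traffic happens yet
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setConnectTimeout(CONNECT_TIMEOUT_MS);
            return connection;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

In BackendConnection the equivalent change would be a single `connection.setConnectTimeout(...)` call before `connection.connect()`.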

Now apply the same patch to your production app, test it there, and see that it works as well!