Mobile app data extraction at scale

Apr 27, 2017

Extracting data from mobile data sources isn't new, but the usual ways of doing it don't scale easily.

So, how do we do it? Suppose you want to extract data from a mobile application. Let's say we have the APK of an Android application and we want to extract 500,000 data points (UI screens) a day. In what ways can you do that, and at what cost?

Well, when I was first introduced to this challenge, the first thing that popped into my head was reverse engineering the application.
Find out how the client communicates with the server, which protocol they use, and how they exchange messages.
While this might seem like the most scalable and cheapest solution, it only solves the problem for one application (one data source). What if we want to repeat the process with another application? What if the API changes? It is also very hard to estimate the effort it will take.

So, to try this, I spun up an Android emulator, installed the APK, connected it to a proxy and started to observe the data.
Although all the communication went over HTTPS, with mitmproxy and a few hours of playing with certificates I was able to observe the traffic between the client and the server, and even managed to simulate the calls to the server.
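To give a feel for the observation step, here is a minimal sketch of a mitmproxy script that records every call the app makes. It assumes the emulator already routes its traffic through mitmproxy and trusts its certificate; the file name log_calls.py and the seen_calls list are names I made up for illustration, not something from my original setup.

```python
# Save as log_calls.py (hypothetical name) and run with: mitmdump -s log_calls.py
# mitmproxy looks for module-level hook functions such as `response`.

seen_calls = []  # (method, url, status) tuples collected during the session


def response(flow):
    """Called by mitmproxy once for every completed HTTP(S) response."""
    entry = (
        flow.request.method,        # e.g. "GET" or "POST"
        flow.request.pretty_url,    # full URL, using the Host header
        flow.response.status_code,  # e.g. 200
    )
    seen_calls.append(entry)
    print(*entry)
```

Browsing through the app's screens while this runs gives you a list of endpoints to start investigating. It does not solve the harder part, which is reproducing sessions and cookies.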

When I started to dig deeper (sessions, cookies, how to generate them, and so on), things became much more complicated.

Conclusions: reverse engineering is easy to get started with and seems to be the most scalable and cheapest way to do it. Nevertheless, it can take many long days, the development costs are unpredictable, and you can't always see the end of the road.

A different approach is to use tools like Appium or Selendroid. If you are familiar with Selenium, these are its equivalent for the mobile world.
You write the scenario you want to test (in our use case, scrape) and then run the test script automatically over and over.

I decided to try Appium together with the Android emulator.
Android emulators were long known in the mobile development community as impossible tools to work with. However, since the release of the x86 emulators things have started to work smoothly, and it even feels like the application runs faster on my laptop than on the physical device itself.

So I wrote the data extraction script. It took 2 minutes to complete when running on my laptop with the standard Android emulator.
Remember scale? We want to extract 500,000 data points a day and each data point takes 2 minutes, which means 500,000 * 2 = 1,000,000 minutes of emulator time a day.

To save you the calculation: a day has only 1,440 minutes, so we need ~700 Android emulators running simultaneously 24/7.
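For anyone who wants to plug in their own numbers, the calculation can be sketched like this. The function name is mine, and it assumes every emulator is busy around the clock with no idle time or failures, which is optimistic.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1,440


def emulators_needed(data_points_per_day: int, minutes_per_point: float) -> int:
    """Emulators required if each one works non-stop, 24/7."""
    total_minutes = data_points_per_day * minutes_per_point
    return math.ceil(total_minutes / MINUTES_PER_DAY)


print(emulators_needed(500_000, 2))  # 695, which I round up to ~700
```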

Later I created a Docker container with Ubuntu 16.04 + Appium + the Android x86 emulator and started to test how many of those I could run simultaneously.
Thanks to the x86 image, which knows how to leverage the machine's hardware, each script still took 2 minutes. I managed to run multiple emulators simultaneously, one for each physical core on my laptop (Intel i7, 16 GB), with each emulator consuming 1 core and 1 GB of memory.
So, assuming we can use 1 CPU per emulator, we will need 700 CPUs.
That's a lot, and it requires physical machines, which in turn are expensive to manage.

Conclusions: physical hardware brings good performance, but it is hard to manage at large scale.

So what do you do to avoid managing physical hardware?
You go to the public cloud (AWS).
However, when I took this approach to the cloud, things worked totally differently, and this is where the fun begins.

Although Docker, Linux and AWS work pretty well together, with an Android emulator inside they don't. Remember, AWS EC2 provides you with a VM, and the Android emulator is another VM on top of it. To benefit from hardware acceleration when using the x86 Android emulator, the host machine has to expose this capability. However, Amazon and the other public clouds don't expose it; they use it themselves to serve us virtual machines. Hence I wasn't even able to start the Android x86 emulator there.
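A quick way to see whether a Linux host exposes hardware acceleration is to look for the KVM device, /dev/kvm. The check below is a minimal sketch with a helper name of my own choosing. It only tells you whether the device is present and accessible to the current user, not whether the emulator will actually perform well.

```python
import os


def kvm_available(device_path: str = "/dev/kvm") -> bool:
    """Return True if the KVM device exists and the current user can read and write it."""
    return os.path.exists(device_path) and os.access(device_path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    print("KVM available:", kvm_available())
```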

More info on this here: https://www.ravellosystems.com/blog/android-emulator-on-amazon-ec2-and-google-cloud/

So what do we do now?
Enter Ravello.
In short, what the Ravello solution gives us is nested virtualization, or KVM support on the host machine, while running on the public cloud.
This gave me the ability to run x86 Android emulators in the cloud.
So I decided to give it a try. It worked, but in terms of performance the scraping script took 3 times longer than on physical machines, and it only got worse when I tried to launch more emulators in parallel.

Conclusions: the Ravello solution works, but its performance wasn't sufficient.

Another solution on the public cloud is Genymotion, which provides an Android image (AMI) for EC2.
So instead of getting an Ubuntu/Windows VM you get an Android VM. Cool, huh?

In short, this looks like the best solution I was able to find that runs on the public cloud. Using the Android image on a t2.small instance (1 core and 2 GB of memory), I managed to run the scraping script the same way it runs on physical hardware.
The downside of this solution is its cost: each instance, together with the image, costs $0.148 per hour, which at large scale (700 Android emulators) becomes quite expensive.
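To put "quite expensive" into numbers, here is a small sketch of the bill for 700 instances at $0.148 per hour. It assumes on-demand pricing with no discounts and one emulator per instance; the 30-day month is just a convenient round figure.

```python
def fleet_cost_usd(instances: int, hourly_rate_usd: float, hours: float) -> float:
    """Total cost of running `instances` machines for `hours` hours each."""
    return instances * hourly_rate_usd * hours


daily = fleet_cost_usd(700, 0.148, 24)
monthly = fleet_cost_usd(700, 0.148, 24 * 30)

print(f"per day:     ${daily:,.2f}")    # $2,486.40
print(f"per 30 days: ${monthly:,.2f}")  # $74,592.00
```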

Conclusions: Genymotion works pretty well in the cloud and gave almost the same performance as running on a physical machine. However, it is very expensive at large scale.

Two more solutions were Nox and Bluestacks.
These products were developed especially for gamers, but that doesn't mean we can't use them.

So I spun up a t2.medium Windows VM on AWS EC2 and gave them a try.
Nox failed to install because the graphics card driver was outdated. Remember, we are running on a VM.
Anyway, even after I managed to overcome this, more obstacles were waiting along the road. At some point I gave up and tried Bluestacks.

The Bluestacks installation process went well and the performance was pretty good.
The downside was that I didn't manage to find a way to run more than one Bluestacks instance simultaneously in the cloud inside my VM, at least at the time this article was written. Also, for some reason our tested APK didn't work very well on it. I'm guessing that's because Bluestacks runs in some kind of tablet mode.

Conclusions: Bluestacks worked beyond expectations when running on a virtual machine. It's free and even detectable through ADB, meaning we are able to run the Appium tests on it. The downsides are that it runs only on Windows/Mac, you can run only 1 instance at a time, and it works only in tablet mode.
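"Detectable through ADB" means the emulator shows up in the output of `adb devices`, which is what Appium needs in order to drive it. As a rough sketch, the ready devices can be picked out of that output like this. The function names are mine, and it assumes `adb` is on the PATH.

```python
import subprocess


def parse_adb_devices(output: str) -> list:
    """Return the serials of devices that `adb devices` reports as ready."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Ready devices look like "<serial>\tdevice"; the header line,
        # daemon messages, and "offline"/"unauthorized" entries are skipped.
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def connected_devices() -> list:
    """Run `adb devices` and parse its output (requires adb to be installed)."""
    result = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True)
    return parse_adb_devices(result.stdout)
```

If the list comes back empty, Appium has nothing to connect to, whatever the emulator window shows.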

One last solution, still premature but worth following, is http://anbox.io/ (Android in a box). After testing it, I found it very premature, in a pre-alpha stage, but it could be the perfect solution in the future.

When choosing one of the emulator solutions, there are a few optimizations that can help speed up data extraction. To name a few: if possible, use deep links and landing URLs, so the script jumps straight to the screen it needs; and when running on strong machines, the application can run faster than it does on a real device.
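As an illustration of the deep link idea, the sketch below builds the `adb shell am start` command that opens a URL directly in the app instead of tapping through the UI. The URL scheme, package name and emulator serial are placeholders I invented; whether an app accepts a given link depends entirely on the intent filters it declares.

```python
import shlex


def deep_link_command(url: str, package: str = None, serial: str = None) -> list:
    """Build the adb command that opens `url` through a VIEW intent."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # target one specific emulator or device
    # -W waits for the launch to complete before returning.
    cmd += ["shell", "am", "start", "-W",
            "-a", "android.intent.action.VIEW",
            "-d", url]
    if package:
        cmd.append(package)  # restrict the intent to this app
    return cmd


# Placeholder values, for illustration only. URLs containing characters such
# as "&" need extra quoting, because the device shell parses them again.
cmd = deep_link_command("exampleapp://items/42", "com.example.app", "emulator-5554")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from the scraping script right before the Appium steps for that screen.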

So, to summarize: when you want to extract data from a mobile application at scale, if reverse engineering works and fits your needs, take it. It is the cheapest and most scalable solution. For my use case it worked, but it didn't fit all of my needs.
The other solutions, which use Android emulators, aren't very scalable and as a result are expensive. However, if you decide to go for them, my recommendation is as follows:
if your data center is in the cloud, try Appium + Genymotion / Bluestacks (Windows); if you are running on physical hardware, the standard Android x86 emulator will do the job.

If you know of other solutions for scraping mobile applications at scale, you are more than welcome to share!

Written by Alon Rolnik