ZDNet published an interesting post two days ago titled “Two malicious Python libraries caught stealing SSH and GPG keys”, which sets the stage for what is coming in 2020 and onward.

And if you think you are safe because you recently procured a well-marketed commercial open source dependency scanner, that is exactly when you are most in danger: such tools lack the intelligence to track advanced infiltration patterns like this one.

The phrase “Think like an Attacker” is often abused in cyber security to encourage people and organizations to get inside the head of the groups which are targeting them.

Here’s what’s wrong with think like an attacker: most people have no clue how to do it. They don’t know what matters to an attacker. They don’t know how an attacker spends their day. They don’t know how an attacker approaches a problem. Lately, I’ve been challenging people to think like a professional chef. Most people have no idea how a chef spends their days, or how they approach a problem. They have no idea how to plan a menu, or how to cook a hundred or more dinners in an hour. ~ Adam Shostack

I’d strongly encourage everyone to pause and watch this entire presentation by Haroon Meer titled Learning the wrong lessons from Offense. Haroon’s presentations are often vendor-agnostic, honest, informative and downright fabulous.

Key takeaway: you cannot teach a defender to think like an attacker. As Haroon wisely states (quoting Richard Feynman’s Cargo Cult Science), we as defenders copy everything that we see the attacker do, then model detection in isolation (honeypots, adversarial modeling, situational awareness) without grasping the context that gives those actions their point.

Let’s now return to this incident and speculate about how the infiltrator organized their actions.

First principles thinking: Modeling the Infiltrator’s Mindset

Act 1: Prey Selection

Identify the most popular libraries on PyPI (the Python Package Index) via https://hugovk.github.io/top-pypi-packages/.

Why pick these two?

Both python-dateutil (7MM downloads) and jellyfish (387K downloads) are popular and ranked in the 30-day index: https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json
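The selection step can be sketched in a few lines. This is a minimal illustration, not the actor's actual tooling. It assumes the index file exposes a top-level "rows" list whose entries carry "project" and "download_count" keys; that schema is an assumption and may change. The sample payload is invented for illustration and only echoes the download figures quoted above; "some-niche-package" is a hypothetical name.

```python
import json

# Illustrative payload only. The "rows" / "project" / "download_count" keys
# are an assumption about the schema of top-pypi-packages-30-days.json.
SAMPLE_PAYLOAD = """
{
  "rows": [
    {"project": "jellyfish", "download_count": 387000},
    {"project": "python-dateutil", "download_count": 7000000},
    {"project": "some-niche-package", "download_count": 1200}
  ]
}
"""


def top_projects(payload: str, limit: int) -> list[str]:
    """Return the `limit` most-downloaded project names, highest first."""
    rows = json.loads(payload)["rows"]
    ranked = sorted(rows, key=lambda row: row["download_count"], reverse=True)
    return [row["project"] for row in ranked[:limit]]


if __name__ == "__main__":
    print(top_projects(SAMPLE_PAYLOAD, limit=2))
```

The same ranking is what a defender would want to load as the "packages worth impersonating" list when screening new dependencies.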

Act 2: Mimicry and Camouflage

Ambush predators are carnivores distinguished by an approach to hunting that relies on stealth or strategy rather than speed or strength. The hallmarks of this behavior are mimicry and camouflage, which deceive prey into believing predator activity is something else.

Applied to cyber security, this translates to typosquatting packages and domains.

In this case, the actor created two new packages in the package index PyPI: python3-dateutil and jelLyfish.

If you haven’t noticed, pause for a moment and pay attention to the naming convention.

python3-dateutil mimics python-dateutil by adding a version number to the prefix, and jelLyfish mimics jellyfish with a typo that can easily go unnoticed.
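The two naming tricks can be told apart mechanically. The sketch below is one possible way to do it, under stated assumptions: the lookalike table is a tiny hand-picked sample rather than a complete confusables list, and the function names and the labels returned by mimicry_kind are hypothetical.

```python
import re

# Characters that are easy to confuse at a glance. This table is a small,
# illustrative assumption, not an exhaustive confusables list.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "i": "l"})


def skeleton(name: str) -> str:
    """Fold case, lookalike characters, and separators into one canonical form."""
    folded = name.lower().translate(LOOKALIKES)
    return re.sub(r"[-_.]+", "-", folded)


def strip_version_affix(name: str) -> str:
    """Drop a Python version digit from the prefix, e.g. 'python3' -> 'python'."""
    return re.sub(r"^(python|py)[23]", r"\1", name.lower())


def mimicry_kind(candidate: str, target: str) -> str:
    """Classify how `candidate` imitates `target` (labels are hypothetical)."""
    if candidate == target:
        return "same"
    if skeleton(candidate) == skeleton(target):
        return "lookalike-characters"
    if strip_version_affix(candidate) == target.lower():
        return "version-affix"
    return "unrelated"
```

Comparing skeletons rather than raw strings is the design choice that matters here: two names that differ only in case or in an l/I/1 swap collapse to the same form, so the difference a human eye skips over becomes an exact match a program cannot miss.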

Applying the art of smart mimicry, the actor pretends to serve the community by porting dateutil to Python 3+ and calling the result python3-dateutil. The actor then camouflages jelLyfish as a transitive dependency of python3-dateutil, and it is jelLyfish that carries the seeded trojan pattern (encode-download-decode-exfiltrate-send).
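Camouflage through a transitive dependency works because most people only read the names they install directly. A minimal sketch of surfacing everything that rides along is shown here. The dependency graph is hand-written for illustration: "my-app" and "some-other-library" are hypothetical names, and the edges only mirror the relationship described above. In practice the graph would come from installed package metadata.

```python
from collections import deque

# Hypothetical dependency graph: "my-app" and "some-other-library" are
# invented names; the edges only mirror the relationship described in the text.
DEPENDENCIES = {
    "my-app": ["python3-dateutil", "some-other-library"],
    "python3-dateutil": ["jelLyfish"],
    "some-other-library": [],
    "jelLyfish": [],
}


def transitive_dependencies(graph: dict[str, list[str]], root: str) -> list[str]:
    """Return every package reachable from `root`, in breadth-first order."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        package = queue.popleft()
        for dependency in graph.get(package, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


if __name__ == "__main__":
    for name in transitive_dependencies(DEPENDENCIES, "my-app"):
        print(name)
```

The application never names jelLyfish anywhere, yet it appears in the output, which is the whole point of hiding the payload one level down.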

Attack graph

Act 3: Seeding awareness

Merely seeding a trojan package among a plethora of packages is insufficient. The majority of developers make choices and decisions using a package-index service in combination with a search engine query (Google, Bing, etc.) and, of course, Stack Overflow.

[Speculative account] The actor now pivots to Stack Overflow and creates an anonymous identity. Concealed behind this identity, the actor can pose a simple question.

What is the replacement for dateutil.parser in python3?

Thereafter the actor creates additional identities and answers this question with a hyperlink to the malicious python3-dateutil. The poser then upvotes the answers that contain this hyperlink.

Based on its ranking algorithm, a search engine would begin to index and score this question, and subsequent searches for “dateutil for python 3” would yield this Stack Overflow link.

Act 4: Lie in Wait

Now the actor lies in wait until a prey (data scientist/programmer/developer) incorporates the library into their software supply chain.

Act 5: Anti-Predator Adaptations

This is not a new attack technique.

In 2016, Nikolai Philipp Tschacher documented and proved this technique as a part of his bachelor thesis.

Why haven’t commercial security vendors assessed and incorporated his research to improve detection techniques?

Some statistics from Nikolai’s thesis on the uploaded packages and their installations:

214 different typo packages uploaded in total across three different package repositories

92 average installations per package

The standard deviation of installations per package is 433 and thus relatively high

The most installed package (urllib2) received 3929 unique installations in almost 2 weeks (284 average installations per day)

The most installed package per day was bs4 with 366 unique daily installations on average

The least installed package had only one installation (probably by a mirror or crawler)

Takeaways

For engineers: Be informed and pay more attention before depending on any library. Use your own judgement and assess all transitive dependencies as well.

For engineers: Incorporate a formal verification process as part of your code reviews and CI pipeline to ensure that such libraries are not imported.

For engineers: Identify all trigger functions in your application’s call path to assess attack surface exposure.

For vendors: Stop spending dollars on marketing your product and start paying attention to improving detection techniques. An obvious one is to simply check the string edit distance of packages being installed. If you see ‘pip install xyz’, compare ‘xyz’ to the top 500 PyPI packages and see if it is close to, but not the same as, an existing package. Bonus points if the difference is in a lookalike character such as 0/O, l/L/I, etc.

For publishers/vendors: Avoid selling FUD and provide insights and solutions that a customer can incorporate to address these emerging threats.

For customers: Ask vendors hard questions to ensure that such detection is provable.
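The edit distance check suggested to vendors is small enough to sketch in full. This is a minimal illustration under stated assumptions, not a production detector: POPULAR stands in for the top 500 PyPI packages, the threshold of 2 is an arbitrary choice, and the function names are hypothetical.

```python
# Stand-in for the top 500 PyPI packages (assumption: a real check would
# load the full ranked list instead of this two-item sample).
POPULAR = ["python-dateutil", "jellyfish"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suspicious_matches(requested: str, popular: list[str], max_distance: int = 2) -> list[str]:
    """Return popular packages that `requested` is close to but not equal to."""
    if requested in popular:
        return []  # An exact match is the real package, not an imitation.
    return [name for name in popular if edit_distance(requested, name) <= max_distance]


if __name__ == "__main__":
    for requested in ["python3-dateutil", "jelLyfish", "jellyfish"]:
        print(requested, suspicious_matches(requested, POPULAR))
```

A check like this could run in a CI step over every name in a requirements file and fail the build on any non-empty result, which gives engineers the verification gate described above without waiting for a vendor to ship it.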

Update:

List of other packages in the ecosystem with traits of mimicry and camouflage since 2017