In this article, I’d like to introduce the libbash project. I’ll explain what libbash can do with some examples. At the end of this article, a benchmark result is given for egencache (Portage), instruo (Paludis) and instruo reimplemented with our library.

I’ve been planning to write this article for a long time. The project started last year as a GSoC project proposed by Petteri Räty. Nathan Eloe did great work and managed to build an Abstract Syntax Tree (AST) for a given shell script. This year, again through GSoC, I’ll work as a student on the runtime part, or, put simply, on making the library capable of running shell scripts. I’ve been contributing to this project since March 2011 and find it really amazing.

Libbash will enable programs to use Abstract Syntax Trees (ASTs) to parse and interpret shell scripts directly instead of using regular expressions. Most bash 3.2 syntax will be supported. This will greatly benefit programs both outside and inside Gentoo, including Portage/Paludis and repoman.

For instance, you have /etc/conf.d/net, which is essentially a shell script. Libbash will tell you which variables and functions it defines and what values the variables will hold after the script is interpreted. It also allows you to use compound statements and shell builtins in the script. We plan to support common bash 3.2 syntax, except for features related to the interactive shell and to executing external processes. Currently it lacks a lot of functionality (of course 🙂), but it is beginning to take shape and can do some real work.
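To make the contrast with regular expressions concrete, here is a minimal Python sketch of the naive approach that an interpreter-based library is meant to replace. The config text and every name in it (`CONF`, `ASSIGNMENT`, `regex_variables`) are made up for illustration and have nothing to do with libbash's own API.

```python
import re

# Hypothetical config content, loosely modeled on /etc/conf.d/net.
CONF = '''\
base=192.168.1
config_eth0="${base}.10/24"
if [ -n "$base" ]; then
    auto_eth0=true
fi
'''

# Naive approach: match NAME=VALUE only at the very start of a line.
ASSIGNMENT = re.compile(r'^([A-Za-z_][A-Za-z0-9_]*)=(.*)$', re.MULTILINE)


def regex_variables(text):
    """Return {name: raw right-hand side} for unindented assignments."""
    return {name: value.strip('"') for name, value in ASSIGNMENT.findall(text)}


found = regex_variables(CONF)
for name, value in found.items():
    print(f"{name}={value}")
```

The pattern sees `config_eth0` only as the unexpanded text `${base}.10/24`, and it misses `auto_eth0` entirely because that assignment sits inside an `if` block. Getting both right requires actually interpreting the script, which is the gap the library aims to fill.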

Let me show you how we handle /etc/conf.d/net with the library at hand.

$ ./variable_printer /etc/conf.d/net
auto_eth0=true
auto_eth1=false
auto_myxjtu2=false
auto_ppp0=false
auto_qiaomuf=true
config_eth0=202.xxx.xxx.xxx/24 192.168.14.xxx/24 192.168.4.xxx/24
config_eth1=dhcp
......

variable_printer is a utility program linked against our library. It prints all the non-local variables defined in /etc/conf.d/net, including arrays. Actually, we can do much more than that. For example, function definitions, variable expansion and command substitution are supported (although their functionality is not complete yet). If you need to analyze bash scripts, this library should be helpful.

Our goal for this summer is to support Portage metadata generation, so let me give you an example of it.

You may already know what Portage metadata is. It is used to speed up searches and the building of dependency trees. You can find it under $PORTDIR/metadata/cache and regenerate it by executing ‘egencache --update’. We have a utility that generates metadata for a given ebuild:

$ ECLASSDIR=scripts ./metadata_generator scripts/sunpinyin-2.0.3-r1.ebuild
dev-db/sqlite:3 dev-util/pkgconfig foo/bar
dev-db/sqlite:3
0
http://sunpinyin.googlecode.com/files/.tar.gz http://open-gram.googlecode.com/files/dict.utf8.tar.bz2 http://open-gram.googlecode.com/files/lm_sc.t3g.arpa.tar.bz2
http://sunpinyin.googlecode.com
LGPL-2.1 CDDL
SunPinyin is a SLM (Statistical Language Model) based IME
~amd64 ~x86
foo
1
compile install postinst unpack

This ebuild was modified to inherit foo.eclass, which was written for testing purposes. Because some features are still missing, the content is not exactly the same as the one under $PORTDIR, but the format should be the same now. (I removed unnecessary blank lines for better readability.)

You may wonder why we need it as we already have egencache. The problem with egencache and some other Portage utilities is that the performance is not good. The overhead of forking bash and sourcing eclasses costs a lot of time. With libbash, the overhead can be avoided. Here’s a benchmark test for egencache, instruo and instruo reimplemented with libbash:
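The cost of starting a shell can be estimated on any machine with a few lines of Python. This is only a rough sketch: it launches a shell that does nothing, so it measures process start-up alone and not the sourcing of eclasses, and the helper name `seconds_per_spawn` is made up for this example.

```python
import shutil
import subprocess
import time

# Prefer bash; fall back to sh so the sketch still runs where bash is absent.
SHELL = shutil.which("bash") or shutil.which("sh")


def seconds_per_spawn(count):
    """Average wall-clock time to start a shell that runs the no-op ':'."""
    start = time.perf_counter()
    for _ in range(count):
        subprocess.run([SHELL, "-c", ":"], check=True)
    return (time.perf_counter() - start) / count


per_spawn = seconds_per_spawn(20)
print(f"about {per_spawn * 1000:.2f} ms per shell start-up")
```

Multiplying the per-start figure by the number of ebuilds in a tree gives a lower bound on what a fork-per-ebuild design pays before any real work is done.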

Environment:
Linux puma 2.6.37-gentoo-r4 #1 SMP Fri May 13 14:44:26 CST 2011 x86_64 Intel(R) Xeon(R) CPU E5405 @ 2.00GHz GenuineIntel GNU/Linux
CFLAGS & CXXFLAGS: -march=core2 -g -O2 -pipe -mtune=generic
CPU Freq governor: performance

time egencache --jobs=1 --update --cache-dir=meta_egencache
real 95m49.598s
user 52m8.223s
sys 16m26.867s

time INSTRUO_THREADS=1 instruo -D /usr/portage/ -o meta_paludis
real 123m0.811s
user 54m6.507s
sys 39m7.614s

time ./instruo -D /usr/portage/ -o meta_libbash 2>error
real 1m24.977s
user 1m18.070s
sys 0m6.555s

time pmaint regen /usr/portage 1
real 30m23.433s
user 10m21.820s
sys 6m2.990s

Thanks to ferringb for mentioning pmaint (based on the result, pmaint is the fastest metadata generation tool for now). Thanks to nirbheek and Ford_Prefect for reminding me of the kernel cache. Every command is now run 4 times, and the result is the mean running time with the first run excluded. egencache and instruo were run single-threaded because our implementation of instruo is single-threaded. Note that /usr/portage/metadata/cache was removed each time before egencache was run.
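The averaging rule described here (run several times, drop the first run, take the mean of the rest) can be written down in a few lines of Python. The sample numbers are placeholders for illustration, not measurements from this benchmark, and the function name is made up.

```python
from statistics import mean


def warm_mean(samples):
    """Mean of all runs except the first, which may include cold-cache cost."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to discard the first one")
    return mean(samples[1:])


# Placeholder wall-clock times in seconds for four runs of one command.
runs = [90.0, 85.0, 84.0, 86.0]
print(warm_mean(runs))
```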

I thought egencache would be slower since it generates two different metadata formats, but it turns out that writing metadata is not the bottleneck. The kernel cache has little impact on metadata generation, as there’s little time difference between the first and second runs for all three commands.

Although our time will grow as we implement more features (we currently ignore the statements that we can’t handle), the result looks good. Our implementation of instruo doesn’t cheat: we just embedded our code that reads variable values from an ebuild into the original instruo implementation.
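To put numbers on "looks good", the wall-clock ("real") figures from the benchmark can be turned into speedup factors. The short Python sketch below does only that arithmetic. The dictionary keys are labels chosen for this example, and the times are copied from the benchmark output.

```python
import re

# "real" times copied from the benchmark above.
TIMINGS = {
    "egencache": "95m49.598s",
    "instruo": "123m0.811s",
    "instruo + libbash": "1m24.977s",
    "pmaint": "30m23.433s",
}


def to_seconds(stamp):
    """Convert a `time`-style value such as '95m49.598s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


baseline = to_seconds(TIMINGS["instruo + libbash"])
speedups = {name: to_seconds(value) / baseline for name, value in TIMINGS.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x the libbash-based run time")
```

On these figures the libbash-based instruo is roughly 68 times faster than egencache, 87 times faster than stock instruo and 21 times faster than pmaint, with the caveat already mentioned that unsupported statements are currently skipped.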

The main reason for the performance gain is that we don’t have to fork a huge number of bash processes. In addition, the ASTs of eclasses are cached while generating metadata. Without the AST cache, we need 30 minutes to generate the ebuild metadata.
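The caching idea is easy to illustrate outside the library. The Python sketch below is not libbash code: `parse_eclass` is a stand-in that only counts how often it is called, and the ebuild and eclass names are made up. It shows why parsing each eclass once and reusing the result pays off when many ebuilds inherit the same few eclasses.

```python
from functools import lru_cache

parse_calls = 0  # counts how often the (pretend) expensive parse really runs


@lru_cache(maxsize=None)
def parse_eclass(name):
    """Stand-in for building an AST from an eclass; cached per eclass name."""
    global parse_calls
    parse_calls += 1
    return ("AST", name)  # placeholder for a real syntax tree


def load_inherited(inherited):
    """Fetch the (cached) ASTs of every eclass an ebuild inherits."""
    return [parse_eclass(name) for name in inherited]


# Hypothetical ebuilds and the eclasses they inherit.
EBUILDS = {
    "app-a-1.0": ["eutils", "toolchain-funcs"],
    "app-b-2.1": ["eutils"],
    "app-c-0.3": ["eutils", "toolchain-funcs", "python"],
}

for inherited in EBUILDS.values():
    load_inherited(inherited)

print(f"{parse_calls} parses served {sum(map(len, EBUILDS.values()))} inherit requests")
```

Each distinct eclass is parsed once no matter how many ebuilds inherit it, so the parsing cost scales with the number of eclasses and not with the number of ebuilds.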

git repository: http://git.overlays.gentoo.org/gitweb/?p=proj/libbash.git;a=summary.