Date: Fri, 29 Jun 2012 19:49:21 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com, john-users@...ts.openwall.com
Subject: John the Ripper 1.7.9-jumbo-6

Hi,

We've released John the Ripper 1.7.9-jumbo-6 earlier today. This is a "community-enhanced" version, which includes many contributions from JtR community members - in fact, that's what it primarily consists of.

It's been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot has been added to jumbo since then. Even though it's just a one digit change in the version number, this is in fact the biggest single jumbo update we've made so far. It appears that between -5 and -6 the source code grew by over 1 MB, or by over 40,000 lines of code (and that's not including lines that were changed as opposed to added).

The biggest new thing is integrated GPU support, both CUDA and OpenCL - although for a subset of the hash and non-hash types only, not for all that are supported on CPU. (Also, it is efficient only for so-called "slow" hashes now, and for the "non-hashes" that we chose to support on GPU. For "fast" hashes, it is just a development milestone, albeit a desirable one as well.) The other biggest new thing is the addition of support for many more "non-hashes" and hashes (see below).

You may download John the Ripper 1.7.9-jumbo-6 at the usual place:

http://www.openwall.com/john/

With so many changes, even pushing this release out was difficult. Despite the statement that "jumbo is buggy by definition", we did try to eliminate as many bugs as we reasonably could - but after a week of mad testing and bug-fixing, I chose to release the tree as-is, only documenting the remaining known bugs (below and in doc/BUGS). Still, we ended up posting over 1200 messages to john-dev in June - even though in prior months we did not even hit 500. Indeed, we did run plenty of tests and fix plenty of bugs, which you won't see in this release.
I've included a lengthy description of some of the changes below, and below that I'll add some benchmark results that I find curious (such as for bcrypt on CPU vs. GPU).

Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by commit count:

magnum
Dhiru Kholia
Frank Dittrich
JimF (Jim Fougeron)
myrice (Dongdong Li)
Claudio Andre
Lukas Odzioba
Solar Designer
Sayantan Datta
Samuele Giovanni Tonon
Tavis Ormandy
bartavelle (Simon Marechal)
Sergey V
bizonix
Robert Veznaver
Andras

New non-hashes:
* Mac OS X keychains [OpenMP] (Dhiru)
  - based on research from extractkeychain.py by Matt Johnston
* KeePass 1.x files [OpenMP] (Dhiru)
  - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
    http://gitorious.org/kppy/kppy
* Password Safe [OpenMP, CUDA, OpenCL] (Dhiru, Lukas)
* ODF files [OpenMP] (Dhiru)
* Office 2007/2010 documents [OpenMP] (Dhiru)
  - office2john is based on test-dump-msole.c by Jody Goldberg and OoXmlCrypto.cs by Lyquidity Solutions Limited
* Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP] (Dhiru)
  - based on FireMaster and FireMasterLinux
    http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
* RAR -p mode encrypted archives (magnum)
  - RAR -hp mode was supported previously, now both modes are

New challenge/responses, MACs:
* WPA-PSK [OpenMP, CUDA, OpenCL] (Lukas, Solar)
  - CPU code is loosely based on Aircrack-ng
    http://www.aircrack-ng.org
    http://openwall.info/wiki/john/WPA-PSK
* VNC challenge/response authentication [OpenMP] (Dhiru)
  - based on VNCcrack by Jack Lloyd
    http://www.randombit.net/code/vnccrack/
* SIP challenge/response authentication [OpenMP] (Dhiru)
  - based on SIPcrack by Martin J. Muench
* HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512 (magnum)

New hashes:
* IBM RACF [OpenMP] (Dhiru)
  - thanks to Nigel Pentland (author of CRACF) and Main Framed for providing algorithm details, sample code, sample RACF binary database, test vectors
* sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL] (magnum, Lukas, Claudio)
  - previously supported in 1.7.6+ only via "generic crypt(3)" interface
* sha256crypt (SHA-crypt) [OpenMP, CUDA] (magnum, Lukas)
  - previously supported in 1.7.6+ only via "generic crypt(3)" interface
* DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP] (magnum)
* Django 1.4 [OpenMP] (Dhiru)
* Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP] (magnum)
* WoltLab Burning Board 3 [OpenMP] (Dhiru)
* New EPiServer default (based on SHA-256) [OpenMP] (Dhiru)
* GOST R 34.11-94 [OpenMP] (Dhiru, Sergey V, JimF)
* MD4 support in "dynamic" hashes (user-configurable) (JimF)
  - previously, only MD5 and SHA-1 were supported in "dynamic"
* Raw-SHA1-LinkedIn (raw SHA-1 with first 20 bits zeroed) (JimF)

Alternate implementations for previously supported hashes:
* Faster raw SHA-1 (raw-sha1-ng, password length up to 15) (Tavis)

OpenMP support in new formats:
* Mac OS X keychains (Dhiru)
* KeePass 1.x files (Dhiru)
* Password Safe (Lukas)
* ODF files (Dhiru)
* Office 2007/2010 documents (Dhiru)
* Mozilla Firefox, Thunderbird, SeaMonkey master passwords (Dhiru)
* WPA-PSK (Solar)
* VNC challenge/response authentication (Dhiru)
* SIP challenge/response authentication (Dhiru)
* IBM RACF (Dhiru)
* DragonFly BSD SHA-256 and SHA-512 based hashes (magnum)
* Django 1.4 (Dhiru)
* Drupal 7 $S$ phpass-like (based on SHA-512) (magnum)
* WoltLab Burning Board 3 (Dhiru)
* New EPiServer default (based on SHA-256) (Dhiru)
* GOST R 34.11-94 (Dhiru, JimF)

OpenMP support for previously supported hashes that lacked it:
* Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
* DES-based tripcodes (Solar)
* Invision Power Board 2.x salted MD5 (magnum)
* HTTP Digest access authentication MD5 (magnum)
* MySQL (old) (Solar)

CUDA support for:
* phpass MD5-based "portable hashes" (Lukas)
* md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
* sha512crypt (glibc 2.7+ SHA-crypt) (Lukas)
* sha256crypt (glibc 2.7+ SHA-crypt) (Lukas)
* Password Safe (Lukas)
* WPA-PSK (Lukas)
* Raw SHA-224, raw SHA-256 [inefficient] (Lukas)
* MSCash (DCC) [not working reliably yet] (Lukas)
* MSCash2 (DCC2) [not working reliably yet] (Lukas)
* Raw SHA-512 [not working reliably yet] (myrice)
* Mac OS X 10.7 salted SHA-512 [not working reliably yet] (myrice)
  - we have already identified the problem with the above two, and a post 1.7.9-jumbo-6 fix should be available shortly - please ask on john-users if interested in trying it out

OpenCL support for:
* phpass MD5-based "portable hashes" (Lukas)
* md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
* sha512crypt (glibc 2.7+ SHA-crypt) (Claudio)
  - suitable for NVIDIA cards, faster than the CUDA implementation above
    http://openwall.info/wiki/john/OpenCL-SHA-512
* bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes) (Sayantan)
  - pre-configured for AMD Radeon HD 7970, will likely fail on others unless WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and opencl/bf_kernel.cl; the achieved level of performance is CPU-like (bcrypt is known to be somewhat GPU-unfriendly - a lot more than SHA-512)
    http://openwall.info/wiki/john/GPU/bcrypt
* MSCash2 (DCC2) (Sayantan)
  - with optional and experimental multi-GPU support as a compile-time hack (even AMD+NVIDIA mix), by editing init() in opencl_mscash2_fmt.c
* Password Safe (Lukas)
* WPA-PSK (Lukas)
* RAR (magnum)
* MySQL 4.1 double-SHA-1 [inefficient] (Samuele)
* Netscape LDAP salted SHA-1 (SSHA) [inefficient] (Samuele)
* NTLM [inefficient] (Samuele)
* Raw MD5 [inefficient] (Dhiru, Samuele)
* Raw SHA-1 [inefficient] (Samuele)
* Raw SHA-512 [not working properly yet] (myrice)
* Mac OS X 10.7 salted SHA-512 [not working properly yet] (myrice)
  - we have already identified the problem with the above two, and a post 1.7.9-jumbo-6 fix should be available shortly - please ask on john-users if interested in trying it out

Several of these require byte-addressable store (any NVIDIA card, but only 5000 series or newer if AMD/ATI). Also, OpenCL kernels for "slow" hashes/non-hashes (e.g. RAR) may cause "ASIC hang" on certain AMD/ATI cards with recent driver versions. We'll try to address these issues in a future version.

AMD XOP (Bulldozer) support added for:
* Many hashes based on MD4, MD5, SHA-1 (Solar)

Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
* Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
* Invision Power Board 2.x salted MD5 (magnum)
* HTTP Digest access authentication MD5 (magnum)
* SAP CODVN B (BCODE) MD5 (magnum)
* SAP CODVN F/G (PASSCODE) SHA-1 (magnum)
* Oracle 11 (magnum)

Other optimizations:
* Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2 (magnum)
* Prefer CommonCrypto over OpenSSL on Mac OS X 10.7 (Dhiru)
* New SSE2 intrinsics code for SHA-1 (JimF, magnum)
* Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled in the compiler at build time) to implement some bit rotates for MD5, SHA-1 (Solar)
* Assorted optimizations for raw SHA-1 and HMAC-MD5 (magnum)
* In RAR format, added inline storing of RAR data in JtR input file when the original file is small enough (magnum)
* Added use of the bitslice DES implementation for tripcodes (Solar)
* Raw-MD5-unicode made "thick" again (that is, not building upon "dynamic"), using much faster code (magnum)
* Assorted performance tweaks in "salted-sha1" (SSHA) (magnum)
* Added functions for larger hash tables to several formats (magnum, Solar)

Other assorted enhancements:
* linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda, linux-*-opencl, macosx-x86-64-opencl make targets (magnum et al.)
* linux-*-native make targets (pass -march=native to gcc) (magnum)
* New option: --dupe-suppression (for wordlist mode) (magnum)
* New option: --loopback[=FILE] (implies --dupe-suppression) (magnum)
* New option: --max-run-time=N for graceful exit after N seconds (magnum)
* New option: --log-stderr (magnum)
* New option: --regenerate-lost-salts=N for cracking hashes where we do not have the salt and essentially need to crack it as well (JimF)
* New unlisted option: --list (for bash completion, GUI, etc.) (magnum)
* --list=[encodings|opencl-devices] (magnum)
* --list=cuda-devices (Lukas)
* --list=format-details (Frank)
* --list=subformats (magnum)
* New unlisted option: --length=N for reducing maximum plaintext length of a format, mostly for testing purposes (magnum)
* Enhanced parameter syntax for --markov: may refer to a configuration file section, may specify the start and/or end in percent of total (Frank)
* Make incremental mode restore ETA figures (JimF)
* In "dynamic", support NUL octets in constants (JimF)
* In "salted-sha1" (SSHA), support any salt length (magnum)
* Use comment and home directory fields from PWDUMP-style input (magnum)
* Sort the format names list in "john" usage output alphabetically (magnum)
* New john.conf options subsection "MPI" (magnum)
* New john.conf config item CrackStatus under Options:Jumbo (magnum)
* \xNN escape sequence to specify arbitrary characters in rules (JimF)
* New rule command _N to reject a word unless it is of length N (JimF)
* Extra wordlist rule sections: Extra, Single-Extra, Jumbo (magnum)
* Enhanced "Double" external mode sample (JimF)
* Source $JOHN/john.local.conf by default (magnum)
* Many format and algorithm names have been changed for consistency (Solar)
* When intrinsics are in use, the reported algorithm name now tells which ones (SSE2, AVX, or XOP) (Solar)
* benchmark-unify: a Perl script to unify benchmark output of different versions of JtR for use with relbench (Frank)
* Per-benchmark speed ratio output added to relbench (Frank)
* bash completion for JtR (to install: "sudo make bash-completion") (Frank)
* New program: raw2dyna (helper to convert raw hashes to "dynamic") (JimF)
* New program: pass_gen.pl (generates hashes from plaintexts) (JimF, magnum)
* Many code changes made, many bugs fixed, many new bugs introduced (all)

Now the promised benchmarks. Here's 1.7.9-jumbo-5 to 1.7.9-jumbo-6 overall speed change on one core in FX-8120 (should be 4.0 GHz turbo), after running through benchmark-unify and relbench (yet about 50 of the new version's benchmark results could not be directly compared against results of the previous version, and thus are excluded):

Number of benchmarks:           151
Minimum:                        0.84668 real, 0.84668 virtual
Maximum:                        10.92416 real, 10.92416 virtual
Median:                         1.10800 real, 1.10800 virtual
Median absolute deviation:      0.12531 real, 0.12369 virtual
Geometric mean:                 1.26217 real, 1.26284 virtual
Geometric standard deviation:   1.47239 real, 1.47274 virtual

Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):

Number of benchmarks:           151
Minimum:                        0.94616 real, 0.48341 virtual
Maximum:                        24.19709 real, 4.29610 virtual
Median:                         1.17609 real, 1.05964 virtual
Median absolute deviation:      0.17436 real, 0.11465 virtual
Geometric mean:                 1.35493 real, 1.17097 virtual
Geometric standard deviation:   1.71505 real, 1.36577 virtual

These show that overall we do indeed have a speedup, and that's without any GPU stuff.
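For readers unfamiliar with the summary statistics above, here is a minimal Python sketch of how a median, geometric mean, and geometric standard deviation can be computed from a list of per-benchmark speed ratios. This is not relbench's actual code (relbench is a Perl script); the function name summarize, the sample ratios, and the choice of a population (rather than sample) standard deviation of the logarithms are all assumptions made for illustration.

```python
import math
import statistics


def summarize(ratios):
    """Summarize speed ratios (new c/s divided by old c/s).

    Assumption: the geometric standard deviation is taken as exp() of the
    population standard deviation of the log-ratios; relbench may differ.
    """
    logs = [math.log(r) for r in ratios]
    mean_log = sum(logs) / len(logs)
    sd_log = math.sqrt(sum((x - mean_log) ** 2 for x in logs) / len(logs))
    return {
        "median": statistics.median(ratios),
        "geometric_mean": math.exp(mean_log),
        "geometric_sd": math.exp(sd_log),
    }


if __name__ == "__main__":
    # Hypothetical ratios, not taken from the actual benchmark run.
    sample = [0.85, 1.0, 1.1, 1.25, 2.0]
    for name, value in summarize(sample).items():
        print(f"{name}: {value:.5f}")
```

The geometric mean suits ratios because they combine multiplicatively: a 2x speedup on one format and a 0.5x slowdown on another average out to 1.0, whereas an arithmetic mean would report 1.25.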
Also curious is the speedup due to OpenMP in 1.7.9-jumbo-6 (same version in both cases), on the same CPU (8 threads):

Number of benchmarks:           202
Minimum:                        0.76235 real, 0.09553 virtual
Maximum:                        30.51791 real, 3.81904 virtual
Median:                         1.01479 real, 0.98287 virtual
Median absolute deviation:      0.02747 real, 0.03514 virtual
Geometric mean:                 1.71441 real, 0.77454 virtual
Geometric standard deviation:   2.08823 real, 1.50966 virtual

The 30x maximum speedup (with only 8 threads) is indeed abnormal; it is for:

Ratio:  30.51791 real, 3.81904 virtual  SIP MD5:Raw

We'll correct the non-OpenMP performance for SIP in the next version. For the rest, the maximum speedup is 6.13x for SSH, which is great (considering that the CPU clock rate reduces with more threads running, and that this is a 4-module CPU rather than a true 8-core). Here are the top 10 OpenMP performers (excluding SIP):

Ratio:  6.13093 real, 0.77210 virtual   SSH RSA/DSA (one 2048-bit RSA and one 1024-bit DSA key):Raw
Ratio:  6.05882 real, 0.75737 virtual   NTLMv2 C/R MD4 HMAC-MD5:Many salts
Ratio:  6.04342 real, 0.75548 virtual   LMv2 C/R MD4 HMAC-MD5:Many salts
Ratio:  5.92830 real, 0.74108 virtual   GOST R 34.11-94:Raw
Ratio:  5.81605 real, 0.73986 virtual   sha256crypt (rounds=5000):Raw
Ratio:  5.65289 real, 0.70523 virtual   sha512crypt (rounds=5000):Raw
Ratio:  5.63333 real, 0.72034 virtual   Drupal 7 $S$ SHA-512 (x16385):Raw
Ratio:  5.56435 real, 0.69937 virtual   OpenBSD Blowfish (x32):Raw
Ratio:  5.50484 real, 0.69682 virtual   Password Safe SHA-256:Raw
Ratio:  5.49613 real, 0.68814 virtual   Sybase ASE salted SHA-256:Many salts

The worst regression is for:

Ratio:  0.76235 real, 0.09553 virtual   LM DES:Raw

It is known that our current LM hash code does not scale well, and is very fast even with one thread (close to the bottleneck of the current interface). It is in fact better not to use OpenMP for LM hashes yet, or to keep the thread count low (e.g., 4 would behave better than 8).
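One way to read the "real" ratios above is as parallel efficiency: the speedup divided by the thread count. The Python sketch below does that arithmetic for a few of the listed results. The helper name parallel_efficiency is made up for illustration, and the observation that the "virtual" ratios land close to the real ratio divided by 8 is just something visible in the numbers, not a statement about how relbench derives them.

```python
def parallel_efficiency(real_ratio, threads):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return real_ratio / threads


if __name__ == "__main__":
    threads = 8
    # (format name, real speedup ratio) copied from the list above
    results = [
        ("SSH RSA/DSA", 6.13093),
        ("sha512crypt (rounds=5000)", 5.65289),
        ("OpenBSD Blowfish (x32)", 5.56435),
        ("LM DES", 0.76235),
    ]
    for name, ratio in results:
        eff = parallel_efficiency(ratio, threads)
        print(f"{name}: {ratio:.2f}x on {threads} threads, efficiency {eff:.1%}")
```

Note that 8 threads on this CPU do not mean 8 independent cores at the single-thread clock rate, so an efficiency of 70% to 77% here is closer to ideal than the raw figure suggests.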
The low median and mean speedup are because many hashes still lack OpenMP support - mostly the "fast" ones, where we'd bump into the bottleneck anyway. We might deal with this later. For "slow" hashes, the speedup with OpenMP is close to perfect (5x to 6x for this CPU).

Now to the new stuff. The effect of XOP (make linux-x86-64-xop):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    204600 c/s real, 25625 c/s virtual

-5 achieved at most:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    158208 c/s real, 19751 c/s virtual

with "make linux-x86-64i" (icc precompiled SSE2 intrinsics), and only:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    141312 c/s real, 17664 c/s virtual

with "make linux-x86-64-xop" because it did not yet use XOP for MD5 (nor for MD4 and SHA-1), only knowing how to use it for DES (which it did). So we got an over 20% speedup due to XOP here.

Similarly, for raw SHA-1, the best result with -5 was:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
Raw:    13067K c/s real, 13067K c/s virtual

whereas -6 does, with JimF's and magnum's optimizations and with XOP:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    23461K c/s real, 23698K c/s virtual

and with Tavis' contribution:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28024K c/s real, 28024K c/s virtual

So that's an over 2x speedup if we can accept the length 15 limit, or an almost 80% speedup otherwise.
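The percentages quoted above follow directly from the "c/s real" figures. As a rough sketch, the Python below pulls the real rate out of a benchmark line and computes the relative speedup. The helper names real_rate and speedup_percent are invented for this example, and treating the "K" suffix as a factor of 1000 is an assumption about the output format rather than something stated here.

```python
import re

# Matches e.g. "204600 c/s real" or "13067K c/s real"
RATE = re.compile(r"(\d+)(K?) c/s real")


def real_rate(line):
    """Extract the 'real' c/s figure from a john benchmark result line."""
    match = RATE.search(line)
    if match is None:
        raise ValueError(f"no 'c/s real' figure in: {line!r}")
    value = int(match.group(1))
    # Assumption: a trailing K means thousands.
    return value * 1000 if match.group(2) else value


def speedup_percent(old, new):
    """Relative speedup of new over old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    md5_old = real_rate("Raw: 158208 c/s real, 19751 c/s virtual")
    md5_new = real_rate("Raw: 204600 c/s real, 25625 c/s virtual")
    print(f"md5crypt, -5 (icc SSE2i) vs -6 (XOP): {speedup_percent(md5_old, md5_new):.1f}%")

    sha1_old = real_rate("Raw: 13067K c/s real, 13067K c/s virtual")
    sha1_new = real_rate("Raw: 23461K c/s real, 23698K c/s virtual")
    sha1_ng = real_rate("Raw: 28024K c/s real, 28024K c/s virtual")
    print(f"raw-sha1: {speedup_percent(sha1_old, sha1_new):.1f}%")
    print(f"raw-sha1-ng: {sha1_ng / sha1_old:.2f}x")
```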
Note: all of the raw SHA-1 benchmarks above are for one CPU core, not for the entire chip (no OpenMP for fast hashes like this yet, but there's MPI and there are always separate process invocations...)

To more important stuff, sha512crypt on CPU vs. GPU:

For reference, here's what we would get with the previous version, using the glibc implementation of SHA-crypt:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
Many salts:     1518 c/s real, 189 c/s virtual
Only one salt:  1515 c/s real, 189 c/s virtual

Now we also have a builtin implementation, although it nevertheless uses OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet - adding that and making use of SIMD would provide much additional speedup, this is a to-do item for us):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
Raw:    2045 c/s real, 256 c/s virtual

So it is about 35% faster. Let's try GPUs, first GTX 570 1600 MHz (a card that is vendor-overclocked to that frequency):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
Raw:    3833 c/s real, 3833 c/s virtual

Another 2x speedup here, but that's still not it. Let's see:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
Building the kernel, this could take a while
Local work size (LWS) 512, global work size (GWS) 7680
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    11405 c/s real, 11349 c/s virtual

And now this is it - Claudio's OpenCL code is really good on NVIDIA, giving us a 5.5x speedup over CPU. (SHA-512 is not as GPU-friendly as e.g. MD5, but is friendly enough for some decent speedup.)
Let's also try AMD Radeon HD 7970 (normally a faster card), at stock clocks:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Building the kernel, this could take a while
Elapsed time: 17 seconds
Local work size (LWS) 32, global work size (GWS) 16384
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    5144 c/s real, 3276K c/s virtual

Not as much luck here yet. Finally, for comparison and to show how any one of the three OpenCL devices may be accessed from john's command-line with --platform and --device options, the same OpenCL code on the CPU:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1 -dev=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 1: AMD FX(tm)-8120 Eight-Core Processor
Local work size (LWS) 1, global work size (GWS) 1024
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    1850 c/s real, 233 c/s virtual

This shows that the code is indeed pretty efficient - almost reaching OpenSSL's specialized code speed.

Now to bcrypt. This CPU is pretty good at it:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
Raw:    5300 c/s real, 664 c/s virtual

(FWIW, with overclocking I was able to get this to about 5650 c/s, but not more - bumping into 125 W TDP. The above is at stock clocks.)

This is for "$2a$05" or only 32 iterations, which is used as baseline for benchmarks for historical reasons. Actual systems often use "$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.

Anyway, the reference cracking speed for bcrypt above is higher than the speed for sha512crypt on the same CPU (with the current code at least, which admittedly can be optimized much further). Can we make it even higher on a GPU?
Maybe, but not yet, not with the current code:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4143 c/s real, 238933 c/s virtual

user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
AMD Overdrive(TM) enabled

Default Adapter - AMD Radeon HD 7900 Series
                  New Core Peak   : 1225
                  New Memory Peak : 1375

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    5471 c/s real, 358400 c/s virtual

It's only with a 30% overclock that the high-end GPU gets to the same level of performance as the 2-3 times cheaper CPU. BTW, the GPU stays cool with this overclock (73 C with stock cooling when running bf-opencl for a while), precisely because we have to heavily under-utilize it due to it not having enough local memory to accommodate as many parallel bcrypt computations as we'd need for full occupancy and to hide memory access latencies. Maybe more optimal code will achieve better results, though.

The NVIDIA card also has no luck competing with the CPU at bcrypt yet:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    1137 c/s real, 1137 c/s virtual

Some tuning could provide better numbers, but they stay a lot lower than the CPU's and HD 7970's anyway (for the current code).
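To put the bcrypt figures into perspective for real-world cost settings, recall the note earlier that "$2a$05" means 32 iterations and that "$2a$08" and "$2a$10" are 8 and 32 times slower. The Python sketch below scales a "$2a$05" benchmark figure to other cost settings. The function name bcrypt_rate is made up, and the model assumes the rate is simply inversely proportional to the iteration count (2 to the power of the cost), ignoring any fixed per-hash overhead.

```python
def bcrypt_rate(baseline_rate, cost, baseline_cost=5):
    """Estimate c/s at a given bcrypt cost from a rate measured at baseline_cost.

    Assumption: time per hash doubles with each cost increment, so the rate
    halves; fixed setup overhead is ignored.
    """
    return baseline_rate / 2 ** (cost - baseline_cost)


if __name__ == "__main__":
    # Real c/s figures at "$2a$05" quoted in this message
    devices = {
        "FX-8120 (8 threads)": 5300,
        "HD 7970 (stock)": 4143,
        "GTX 570": 1137,
    }
    for name, rate in devices.items():
        for cost in (5, 8, 10):
            print(f"{name} at $2a${cost:02d}: about {bcrypt_rate(rate, cost):.0f} c/s")
```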
Some other GPU benchmarks where I think we achieve decent performance (not exactly the best, but on par with competing tools that had GPU support for longer):

GTX 570 1600 MHz:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
Raw:    510171 c/s real, 507581 c/s virtual

HD 7970 925 MHz (stock):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal Work Group Size:256
Kernel Execution Speed (Higher is better):1.403044
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    92467 c/s real, 92142 c/s virtual

1225 MHz:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal Work Group Size:128
Kernel Execution Speed (Higher is better):1.856949
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    121644 c/s real, 121644 c/s virtual

(would overheat if actually used? this is not bcrypt anymore)

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
Optimal keys per crypt 32768
(to avoid this test on next run, put "rar_GWS = 32768" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 64, Global worksize (GWS) 32768
Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
Raw:    4380 c/s real, 4334 c/s virtual

The HD 7970 card is back to stock clocks here:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal keys per crypt 65536
(to avoid this test on next run, put "rar_GWS = 65536" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 64, Global worksize (GWS) 65536
Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
Raw:    7162 c/s real, 468114 c/s virtual

WPA-PSK, on CPU:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
Raw:    1980 c/s real, 247 c/s virtual

(no SIMD yet; could do several times faster). CUDA:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
Raw:    32385 c/s real, 16695 c/s virtual

OpenCL on the faster card (stock clock):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Max local work size 256
Optimal local work size = 256
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
Raw:    55138 c/s real, 42442 c/s virtual

27x speedup over CPU here, although presumably the CPU code is further from optimal.

...Hey, what are you doing here? That message was way too long, you couldn't possibly read this far. I'll just presume you scrolled to the end. There's good stuff you have missed above, so please scroll up. ;-)

As usual, feedback is welcome on the john-users list. I realize that we're currently missing usage instructions for much of the new stuff, so please just ask on john-users - and try to make your questions specific. That way, code contributors will also be prompted/forced to contribute documentation, and we'll get it under doc/ and on the wiki - in fact, you can contribute to that too.

Alexander
