Diapolo












further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13

July 01, 2011, 05:59:34 PM

Last edit: February 25, 2012, 02:25:39 PM by Diapolo

If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).





Important (2012-01-13) : The FASTLOOP=False parameter is not needed anymore, because FASTLOOP defaults to false in this version. Update : FASTLOOP=True works now, I uploaded a fixed version!



Important : since OpenCL SDK / Runtime version 2.6 AMD updated their OpenCL compiler, so that some older kernels and optimizations in them seem to not work anymore or are not needed anymore. In order to reflect this change I had to edit the kernel performance section of this thread.



Important : since version 2011-08-27 you don't need to supply the BFI_INT switch anymore. If your HW supports it, it's enabled automatically. To disable it use BFI_INT=false.



Important : since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.



Important : since version 2011-07-17 a modified version of __init__.py (for the Phoenix miner) is included in this package and has to be used! The kernel won't work with other miners unless they are modified, see kernel.cl for further info.





This is the preferred switch for Phoenix with phatk_dia in order to achieve comparable performance:

Code: -k phatk AGGRESSION=12 VECTORS2 WORKSIZE=128

Download version 2012-01-13 : http://www.mediafire.com/?xzk6b1yvb24r4dg

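Download version 2011-12-21: http://www.mediafire.com/?r3n2m5s2y2b32d9

Download version 2011-08-27: http://www.mediafire.com/?697r8t2pdk419ji

Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788

Download version 2011-07-17: http://www.mediafire.com/?4zxdd5557243has

Download version 2011-07-11: http://www.mediafire.com/?k404b6lqn8vu6z6

Download version 2011-07-07: http://www.mediafire.com/?o7jfp60s7xefrg4

Download version 2011-07-06: http://www.mediafire.com/?f8b8q3w5u5p0ln0

Download version 2011-07-03: http://www.mediafire.com/?xlkcc08jvp5a43v

Download version 2011-07-01: http://www.mediafire.com/?5jmt7t0e83k3eox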



Kernel performance (BFI_INT / VECTORS2 / WORKSIZE=128 / SDK 2.6 / APP KernelAnalyzer 1.11 - Cal 11.12 profile):

HD5870

2011-08-20: 22 GPR / 1427 ALU OPs / 66 CF OPs

2011-08-27: 22 GPR / 1426 ALU OPs / 66 CF OPs

2011-12-21: 20 GPR / 1400 ALU OPs / 66 CF OPs

2012-01-13: 21 GPR / 1394 ALU OPs / 67 CF OPs



HD6970

2011-08-20: 21 GPR / 1687 ALU OPs / 66 CF OPs

2011-08-27: 23 GPR / 1688 ALU OPs / 68 CF OPs

2011-12-21: 21 GPR / 1687 ALU OPs / 66 CF OPs

2012-01-13: 20 GPR / 1687 ALU OPs / 66 CF OPs





Kernel performance (BFI_INT / VECTORS2 / SDK 2.5 / APP KernelAnalyzer 1.9 - Cal 11.7 profile):

HD5870

original phatk 1.X: 1393 ALU OPs

2011-07-01: 1389 ALU OPs

2011-07-03: 1385 ALU OPs

2011-07-06: 1380 ALU OPs

2011-07-07: 1380 ALU OPs

2011-07-11: 1378 ALU OPs

2011-07-17: 1376 ALU OPs

2011-08-04 (pre-release): 1368 ALU OPs

2011-08-11: 1364 ALU OPs

2011-08-27: 1363 ALU OPs (30 less compared to original phatk 1.X)

HD6970

original phatk 1.X: 1707 ALU OPs

2011-07-01: 1710 ALU OPs

2011-07-03: 1706 ALU OPs

2011-07-06: 1702 ALU OPs

2011-07-07: 1702 ALU OPs

2011-07-11: 1701 ALU OPs

2011-07-17: 1699 ALU OPs

2011-08-04 (pre-release): 1689 ALU OPs

2011-08-11: 1687 ALU OPs

2011-08-27: 1687 ALU OPs (20 less compared to original phatk 1.X)





changelog:



2012-01-13

Kernel:

- modified: Disclaimer is now the same as in original Phoenix package

- removed: all (u) typecasts in front of scalars, where vectors and scalars were used together because per OpenCL definition this is not needed

- removed: all () brackets around n in the #define parts of the kernel

- removed: S0(), which is now again merged into s0()

- removed: brackets around the commands in t1W(), t1(), t2() and W() were removed, to allow the compiler to reorder these

- added: W() function missed a ; at its end

- added: init variable B1addK6 used in round 6 to save an add -> THX to DiabloD3

- added: a (uint) typecast in front of get_local_id() and get_group_id() calls, because return value could be 64 bits long, which is not wanted

- modified: replaced all ma() + s0() or s0() + ma() calls with t2()

- modified: round 6 now uses the new B1addK6 variable

- modified: reordered W[] calculation for rounds 32, 91 and 92

- modified: rounds 121, 122 and 123 to not compute Vals[4], Vals[5] and Vals[6], because they are not needed for final computation of Vals[7] -> THX to jhajduk

- modified: removed + H[7] from round 124 and use -0xec9fcd13 to check for valid nonces

- added: result_r124 variable to take the result of the last round 124, this saves a few ALU OPs on VLIW5 GPUs

Python Init:

- modified: replaced spaces with tabs in the source code formatting (I really dislike this part in Python ^^)

- modified: a few comments and commands were reformatted for better readability or to be better understandable

- modified: FASTLOOP parameter now defaults to False, which means you don't need to supply FASTLOOP=False anymore

- removed: OUTPUT_SIZE is not used anymore so all references to it were removed

- modified: changed REVISION to 122

- modified: moved the WORKSIZE checks below the part that determines whether vectors are used and which ones; this takes into account that the global worksize passed to the kernel is influenced by vector usage and vector size (currently the use of FASTLOOP can break this, because of the "dynamic" number of iterations)

- added: some debug info about worksize and pyOpenCL is displayed at the start

- added: B1 + K[6] is passed as new kernel parameter

- modified: made enqueue_read_buffer() / enqueue_write_buffer() blocking and removed finish() after the read, as per AMD's recommendations to minimize API overhead
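
The -0xec9fcd13 constant from the round 124 entry above can be cross-checked with a few lines of Python. This is only a sketch under assumptions that are not stated in the changelog: that round 124 is round 60 of the second hash (124 - 64), that the standard SHA-256 values K[60] = 0x90befffa and H[7] = 0x5be0cd19 apply, and that a share candidate needs the last hash word to be zero. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# Standard SHA-256 constants (assumed to be the ones folded into the check).
K60 = 0x90BEFFFA   # round constant K[60]
H7 = 0x5BE0CD19    # initial hash value H[7]


def folded_constant():
    """K[60] + H[7] modulo 2^32, the value that no longer needs to be added."""
    return (K60 + H7) & MASK32


def is_candidate_unfolded(result_r124):
    """Check with both additions still performed."""
    return (result_r124 + K60 + H7) & MASK32 == 0


def is_candidate_folded(result_r124):
    """Same check, comparing against the negated, precomputed sum instead."""
    return result_r124 == (-folded_constant()) & MASK32


if __name__ == "__main__":
    print(hex(folded_constant()))              # the folded sum
    print(hex((-folded_constant()) & MASK32))  # value a candidate must equal
```

Both checks accept exactly the same inputs; the folded one just trades two additions for a comparison against a constant.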



2011-08-27:

Kernel:

- added: code path for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)

- removed: BITALIGN option from the kernel, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)

- modified: non BFI_INT Ch() function, which was broken in 2011-08-11 -> THX to Vince

- modified: kernel output buffer is now an ulong array and not an uint array

- removed: OUTPUT_SIZE argument is not passed and used in the kernel anymore

- modified: WORKSIZEx4, WORKSIZEx3 and WORKSIZEx2 arguments were merged into WORKSIZExVECSIZE

- modified: removed, reordered and added some brackets and type-casting stuff in the kernel

- modified: restored command order for round 108 - 123 to free a GPR

- modified: added H[7] into round 124 calculation

- modified: changed the checking for positive nonces again to cover the H[7] change

- modified: writing of nonces to output now uses 1 write for Vec2 and max. 2 writes for Vec4, because 2x uints are now encoded into 1x ulong

Python Init:

- added: code for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)

- removed: BITALIGN option from the Python init, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)

- added: detection of maximum supported WORKSIZE per Device, which is used if no WORKSIZE is supplied, if supplied WORKSIZE > max. supported WORKSIZE or if WORKSIZE is not a power of 2

- added: code to decode the ulong from the output buffer into 2x uint and process the results

- modified: comments, code formatting and line breaks for better readability

- modified: output buffer size is now the WORKSIZE -> THX to Phateus
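
Two of the Python Init items above (the WORKSIZE fallback and decoding the ulong output) can be sketched roughly like this. It is not the code from __init__.py: the function names are hypothetical, and which half of the ulong holds which nonce is an assumption.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... and False for everything else."""
    return n > 0 and (n & (n - 1)) == 0


def choose_worksize(requested, device_max):
    """Fall back to the device maximum in the three cases listed above.

    requested  -- WORKSIZE given by the user, or None if not supplied
    device_max -- maximum work-group size reported for the device
    """
    if requested is None:
        return device_max
    if requested > device_max or not is_power_of_two(requested):
        return device_max
    return requested


def decode_output_word(word):
    """Split one 64-bit output entry into two 32-bit values.

    Returns (low, high); the mapping to vector components is assumed.
    """
    low = word & 0xFFFFFFFF
    high = (word >> 32) & 0xFFFFFFFF
    return low, high


if __name__ == "__main__":
    print(choose_worksize(None, 256))
    print(choose_worksize(96, 256))
    print([hex(v) for v in decode_output_word(0x1122334455667788)])
```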



2011-08-11:

- modified: reverted a former change to the Ma() function to save an ALU OP for 69XX cards

- added: S0() and S1() functions, which help the compiler -> THX Phateus

- modified: a few brackets and layout of all helper functions for better readability and compatibility

- added: t2() function, which is (s0(n) + ma(n)) and saves a few GPRs -> THX Phateus and myself (had this in earlier, but removed it sometime ^^)

- modified: changed layout of kernel definition for better readability

- modified: all values which for example had a 10u now have a 10U (uppercase) to be consistent in the whole kernel

- modified: modified round 94 W calculation for better performance

- modified: round 108 - 123 now consists of 2 W() blocks followed by 2 sharoundW() blocks to save a GPR

- modified: changed the checking for positive nonces again to never create an invalid share and lower ALU OP usage
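
For readers who do not have the kernel open, this is roughly what t2() computes, written with the standard SHA-256 definitions. It is a sketch: the kernel macros take a round index n and read the working variables from Vals[], while the version below takes the three values as explicit arguments.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def s0(a):
    """SHA-256 big sigma 0."""
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def ma(a, b, c):
    """SHA-256 majority function."""
    return (a & b) ^ (a & c) ^ (b & c)


def t2(a, b, c):
    """The second temporary of a SHA-256 round: s0(a) + ma(a, b, c)."""
    return (s0(a) + ma(a, b, c)) & MASK32


if __name__ == "__main__":
    print(hex(t2(1, 0, 0)))
```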



2011-08-04 (pre-release):

- added: user Vince into disclaimer -> THX Vince :)

- added: kernel is now able to work with 4-component vectors (switch VECTORS4) -> THX to Phateus

- modified: to use 2-component vectors I renamed the switch VECTORS to VECTORS2

- added: __attribute__((reqd_work_group_size(WORKSIZE, 1, 1))) -> THX to Phateus

- added: constants PreW31 and PreW32, which store P2() + P4() for round 31 and 32 -> THX to Phateus

- renamed - modified: W17_2 is now PreW19, W2 is now PreW18, PreVal4addT1 is now PreVal4 (= PreVal4 + T1), state0subT1 is now PreVal0 (= Preval4 + state0)

- modified: base is now declared as u to save the addition of uint2(0, 1) or uint4(0, 1, 2, 3) for W_3 init -> THX to Phateus

- modified: nonce calculation now uses the local Work-Item ID, the group ID and the WORKSIZE instead of only the global Work-Item ID -> THX to Phateus

- added: saved a multiplication by passing WORKSIZEx2 and WORKSIZEx4 constants to the kernel

- modified: calculation for W[18 - O] was optimized so that P2(18) is only calculated for x component (if Vectors are used), because x and y only differ in the LSB and afterwards Bit 14 and 25 are rotated for W[18 - O].y -> THX to Phateus

- modified: saved an addition for Vals[0] init, because of the change to PreVal0

- modified: reordered code for round 4 - 95 to optimize for less ALU OPs used -> THX Phateus and myself ^^

- modified: ordering of variables in additions for Round 124 was changed to optimize for less ALU OPs used

- modified: rewrote the part where nonces are checked, if they are positive and where they are written into output buffer

(saves 2 global writes per work-item and saves additional ALU OPs)

- modified: changed variables W_3, P2_18_x, P2_18 and nonce into a constant

- modified: changed code formatting for rounds 4 - 124 for better readability

- removed: some comments to clean up the code
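
The P2(18) trick from the list above can be checked outside the kernel. The sketch below assumes that P2(18) is the standard SHA-256 small sigma 0 applied to the nonce word, and that the x component of the nonce is even, so that y = x + 1 differs from x only in the LSB. Under those assumptions the value for y is the value for x with bits 14 and 25 flipped, because the shifted-out term drops the LSB and the two rotations move it to exactly those positions.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def sigma0(x):
    """SHA-256 small sigma 0 (rotr 7, rotr 18, shr 3)."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


# The LSB ends up at bit 25 (rotr 7) and bit 14 (rotr 18); shr 3 drops it.
FLIP = (1 << 25) | (1 << 14)


def sigma0_pair(x_even):
    """Compute sigma0 once for x and derive the value for x + 1 with one XOR."""
    sx = sigma0(x_even)
    return sx, sx ^ FLIP


if __name__ == "__main__":
    x = 0x12345678
    print(sigma0_pair(x)[1] == sigma0(x + 1))
```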



2011-07-17:

- added: offset for W[] array to reduce its size -> THX to user Vince

- modified: function t1() renamed to t1W() / function sharound() renamed to sharoundW()

- added: function t1() and sharound() which are used where the W[] addition can be left out, because W[] == 0 (I guess the compiler already does this optimization, but doesn't hurt) -> THX to user Vince

- modified: P1() - P4() and W() to make use of the offset

- modified: quite a few kernel parameters have new values or were added (mixed ideas from User Vince with own ones)

C1addK5: C1addK5 = C1 + K[5]: C1addK5 = C1 + 0x59f111f1

D1: D1 = D1 + K[4] + W[4]: D1 = D1 + 0xe9b5dba5 + 0x80000000U

W2: W2 + W16 in P1(): W2 = P1(18) + P4(18)

W17_2: 0x80000000U in P2() = 0x11002000 + W17 in P1(): W17_2 = P1(19) + P2(19)

PreValaddT1: PreValaddT1 = PreVal4 + T1

T1substate0: T1substate0 = T1 - substate0

- added: variable W_3, which stores the first value formerly held in W[3]

- added: Temp variable used to speed up calculation for rounds 4 and 5

- modified: changed round 3 so that it's more efficient (uses: Vals[0] and Vals[4])

- modified: W[0] - W[14] are now kind of hard-coded or left out, where they were 0

- modified: optimized P1(18) + P2(18) + P4(18)

- modified: optimized P1(19) + P2(19) + P4(19)

- modified: optimized round 4 + 5

- modified: rounds 6 - 14 and 73 - 78 now use new sharound() without W[] addition

- modified: offset added for all parts, where W[] is used

- modified: W_3 is used as result instead of W[3] (W[3] is still used to generate random position in output buffer) -> THX to user Vince



2011-07-11:

- modified: constant H[7] has a new value (saves an addition in round 124)

- modified: non BFI_INT Ch() function now uses OpenCL built-in bitselect

- modified: reordered W[] calculations for round 18 - 30, 87 and 94

- modified: reordered calculation for round 5

- modified: W[] calculation for round 80 - 86 is now a block before sharound() is called

- removed: K[60] from round 124 (because of new H[7] value)
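
Why bitselect can stand in for Ch() is easy to show in Python. The bitselect() below follows the OpenCL definition (a result bit comes from a where the matching bit of c is 0, otherwise from b); the argument order used for Ch is my assumption about how it would be called, not a quote from kernel.cl.

```python
MASK32 = 0xFFFFFFFF


def bitselect(a, b, c):
    """Python model of OpenCL bitselect(a, b, c) for 32-bit values."""
    return ((a & ~c) | (b & c)) & MASK32


def ch_reference(x, y, z):
    """SHA-256 choose function as written in the standard."""
    return ((x & y) ^ (~x & z)) & MASK32


def ch_bitselect(x, y, z):
    """Choose via bitselect: take y where x is 1, z where x is 0."""
    return bitselect(z, y, x)


if __name__ == "__main__":
    args = (0x510E527F, 0x9B05688C, 0x1F83D9AB)
    print(ch_reference(*args) == ch_bitselect(*args))
```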



2011-07-07:

- removed: some large comments in the source were removed

- modified: Ma() function is now unique in the kernel, no matter if BFI_INT is used or not -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw

- added: Ch() function which uses OpenCL bitselect() command (but it's not active, so you are free to try it) -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw

- modified: u W[128] is replaced with u W[124] because no more than 124 values are used

- modified: initialisation for Vals[0], Vals[3], Vals[4] and Vals[7] is now processed in other places to save some unneeded writes to these variables

- fixed: some hex values, which were used in vector additions are now properly type-casted, which hopefully restores AMD APP SDK 2.1 compatibility

- modified: rounds 3, 4 and 5 were modified for better performance (guess this can be tuned, if I have a working KernelAnalyzer)
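
The changelog does not say which form the unified Ma() takes, so the following only illustrates that several common formulations of the SHA-256 majority function agree bit for bit. The select-based form is the one that maps well onto BFI_INT / bitselect style instructions; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def ma_reference(x, y, z):
    """SHA-256 majority function as written in the standard."""
    return (x & y) ^ (x & z) ^ (y & z)


def ma_and_or(x, y, z):
    """Variant with one operation less: (x & z) | (y & (x | z))."""
    return (x & z) | (y & (x | z))


def ma_select(x, y, z):
    """Select form: where x and z agree take x, otherwise take y."""
    mask = x ^ z
    return ((x & ~mask) | (y & mask)) & MASK32


if __name__ == "__main__":
    args = (0x6A09E667, 0xBB67AE85, 0x3C6EF372)
    print(ma_reference(*args) == ma_and_or(*args) == ma_select(*args))
```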



2011-07-06:

- modified: H[] constants were reordered (2 were not used because of earlier mods)

- added: ulong L constant (its value doesn't fit into an uint)

- modified: new Ma() for non BFI_INT capable cards, should be faster -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw

- removed: t1W()

- modified: t1() reordered function calls for better performance

- modified: W() reordered function calls for better performance

- modified: sharound() removed writing to t1, now t1() is called twice, which makes this function FASTER (OpenCL compiler optimization)

- removed: sharound2() (if needed W() + sharound() is used instead)

- removed: partround() not needed because of another solution for round 3 and 124

- removed: t1 and t1W variables

- modified: rounds 3, 19, 30, 81, 87, 94 and 124 were modified for better performance



2011-07-03:

- removed: t2(), w(n), r0(x), r1(x), R0(n) and R1(n)

- renamed - modified: R(x) to W(x) plus now uses P1, P2, P3 and P4 directly

- modified: P1(x) and P2(x) to not use R1(x - 2), R0(x - 15) but do that directly

- modified: SHA rounds 31, 32, 47 - 61, 86, 87, 114 - 119 now use sharound2() instead of W() + sharound()

- modified: reordered code for SHA rounds 66 - 94 -> saw no decrease in performance -> better readability

- modified: SHA rounds 18, 19, 20, 80, 93, 94 now use a simpler calculation because of removed zero additions

--> 1x P1(x), 2x P2(x), 4x P3(x) and 2x P4(x) were removed which should give a little MHash/sec boost

- modified: sharound() so that a double execution of t1() is avoided -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw

- added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()



2011-07-01:

Code:

Vals[7] = 0xb0edbdd0 + K[0] + W[64] + 0x08909ae5U; -> Vals[7] = 0xfc08884d + W[64];

Vals[3] = 0xa54ff53a + 0xb0edbdd0 + K[0] + W[64]; -> Vals[3] = 0x198c7e2a2 + W[64];

- removed the Vals[7] += H[7] addition and replaced the final if-statements in the Kernel

- reordered some W[n] = statements to remove some unneeded additions

- replaced all additions like 64 + 5 with the corresponding integer value (guess it was in there for readability reasons, so here it got worse :D)

- removed some unneeded brackets

- re-formatted for better readability
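
The two precalculated constants in the 2011-07-01 entry can be reproduced in Python, assuming K[0] is the standard SHA-256 round constant 0x428a2f98. Note that the Vals[3] sum needs 33 bits when written out in full; in 32-bit arithmetic only the low 32 bits take part in the addition. The variable names are just for illustration.

```python
MASK32 = 0xFFFFFFFF

K0 = 0x428A2F98  # standard SHA-256 round constant K[0] (assumed)

# Vals[7]: three constant terms folded into one.
VALS7_CONST = (0xB0EDBDD0 + K0 + 0x08909AE5) & MASK32

# Vals[3]: the unreduced sum and what remains of it in 32 bits.
VALS3_FULL = 0xA54FF53A + 0xB0EDBDD0 + K0
VALS3_CONST = VALS3_FULL & MASK32

if __name__ == "__main__":
    print(hex(VALS7_CONST))
    print(hex(VALS3_FULL), hex(VALS3_CONST))
```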



If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).



Thanks,

Dia

Liked my former work for Bitcoin Core? Drop me a donation via: 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x