Phi2 dirty release

DumaxFr · Jun 10, 2018 · bb94ece
1 parent 5e02fcf
commit bb94ece
Showing 14 changed files with 81 additions and 76 deletions.
diff --git a/README.md b/README.md
@@ -4,16 +4,22 @@ Based on Christian Buchner's & Christian H.'s CUDA project, no more active o
 
 Check the [README.txt](README.txt) for the additions
 
+If you appreciate the work done for this release, please consider a small donation to one of these addresses:
+
+LUX : LWcYLSx37F37gHjbmvhwQPQ8PhyDYibmpr
+RVN : RQXpsvSaVrGYo4tyGityWDNBQMFcnqANyj
+BTC : 1AtQXFbnzYTsjzy2bzSH6nPGxqZ32NG42T
+
+You may also reward the previous "forker":
+
 BTC donation address: 1AJdfCpLWPNoAMDfHF1wD5y8VgKSSTHxPo (tpruvot)
 
 A part of the recent algos were originally written by [djm34](https://github.com/djm34) and [alexis78](https://github.com/alexis78)
 
-This variant was tested and built on Linux (ubuntu server 14.04, 16.04, Fedora 22 to 25)
+This variant was tested and built on Windows 10 and Visual Studio 2015 (community edition)
 It is also built for Windows 7 to 10 with VStudio 2013, to stay compatible with Windows 7 and Vista.
 
-Note that the x86 releases are generally faster than x64 ones on Windows, but that tend to change with the recent drivers.
-
-The recommended CUDA Toolkit version was the [6.5.19](http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.19_windows_general_64.exe), but some light algos could be faster with the version 7.5 and 8.0 (like lbry, decred and skein).
+Note that x86 releases will no longer be provided.
 
 About source code dependencies
 ------------------------------
@@ -24,7 +30,7 @@ This project requires some libraries to be built :
 - Curl (prebuilt for win)
 - pthreads (prebuilt for win)
 
-The tree now contains recent prebuilt openssl and curl .lib for both x86 and x64 platforms (windows).
+The tree now contains recent prebuilt openssl and curl .lib for x64 platforms (windows).
 
 To rebuild them, you need to clone this repository and its submodules :
     git clone https://github.com/peters/curl-for-windows.git compat/curl-for-windows

diff --git a/README.txt b/README.txt
@@ -1,11 +1,16 @@
 
-ccminer 2.2.5 (Feb 2018)                            "x16r algo"
+ccminer dumax-0.9.0 (June 10th 2018)                "Phi2 algo"
 ---------------------------------------------------------------
 
 ***************************************************************
 If you find this tool useful and like to support its continuous
           development, then consider a donation.
 
+DumaxFr@github:
+  LUX  : LWcYLSx37F37gHjbmvhwQPQ8PhyDYibmpr
+  RVN  : RQXpsvSaVrGYo4tyGityWDNBQMFcnqANyj
+  BTC  : 1AtQXFbnzYTsjzy2bzSH6nPGxqZ32NG42T
+
 tpruvot@github:
   BTC  : 1AJdfCpLWPNoAMDfHF1wD5y8VgKSSTHxPo
   DCR  : DsUCcACGcyP8McNMRXQwbtpDxaVUYLDQDeU
@@ -102,8 +107,8 @@ its command line interface and options.
                           neoscrypt   use to mine FeatherCoin, Trezarcoin, Orbitcoin, etc
                           nist5       use to mine TalkCoin
                           penta       use to mine Joincoin / Pentablake
-                          phi         use to mine LUXCoin
-                          phi2d       use to mine LUXCoin after fork
+                          phi         use to mine Folm
+                          phi2        use to mine LUXCoin
                           polytimos   use to mine Polytimos
                           quark       use to mine Quarkcoin
                           qubit       use to mine Qubit
@@ -219,13 +224,11 @@ Wildkeccak specific:
 
 >>> Examples <<<
 
+Example for Ravencoin mining on thetechnicalspool.com with all nvidia gpus in your system
+    ccminer -a x16r -o stratum+tcp://thetechnicalspool.com:3636 -u <<username.worker>> -p <<workerpassword>>
 
-Example for Heavycoin Mining on heavycoinpool.com with a single gpu in your system
-    ccminer -t 1 -a heavy -o stratum+tcp://stratum01.heavycoinpool.com:5333 -u <<username.worker>> -p <<workerpassword>> -v 8
-
-
-Example for Heavycoin Mining on hvc.1gh.com with a dual gpu in your system
-    ccminer -t 2 -a heavy -o stratum+tcp://hvcpool.1gh.com:5333/ -u <<WALLET>> -p x -v 8
+Example for Ravencoin mining on thetechnicalspool.com with a single gpu out of several in your system (gpu index from 0 to x)
+    ccminer -a x16r -o stratum+tcp://thetechnicalspool.com:3636 -u <<username.worker>> -p <<workerpassword>> -d 1
 
 
 Example for Fuguecoin solo-mining with 4 gpu's in your system and a Fuguecoin-wallet running on localhost
@@ -274,15 +277,25 @@ I plan to add a json format later, if requests are formatted in json too..
 >>> Additional Notes <<<
 
 This code should be running on nVidia GPUs ranging from compute capability
-3.0 up to compute capability 5.2. Support for Compute 2.0 has been dropped
+5.2 up to compute capability 6.1 (7.0). Support for Compute 3.0 has been dropped
 so we can more efficiently implement new algorithms using the latest hardware
 features.
 
+
 >>> RELEASE HISTORY <<<
-  Feb. 2017       v2.2.5
+
+  June 10th 2018  dumax-0.9.0 (initial fork release)
+                  Moved to Visual Studio 2015 and CUDA 9.1
+                  Improved x16r
+                  Added x16s (same kernels as x16r)
+                  Added Phi2 algo (unoptimized)
+                  Improved Phi, x17
+
+>>> RELEASE HISTORY by tpruvot <<<
+  Feb. 2018       v2.2.5
                   New x16r algo
 
-  Jan. 04th 2017  v2.2.4
+  Jan. 04th 2018  v2.2.4
                   Improve lyra2v2
                   Higher keccak default intensity
                   Drop SM 2.x support by default, for CUDA 9 and more recent
@@ -602,6 +615,4 @@ Source code is included to satisfy GNU GPL V3 requirements.
 
 With kind regards,
 
-   Christian Buchner ( [email protected] )
-   Christian H. ( Chris84 )
-   Tanguy Pruvot ( tpruvot@github )
+   Stephane Duma ( DumaxFr@github )
diff --git a/algos.h b/algos.h
@@ -110,7 +110,7 @@ static const char *algo_names[] = {
 	"nist5",
 	"penta",
 	"phi",
-	"phi2d",
+	"phi2",
 	"polytimos",
 	"quark",
 	"qubit",

diff --git a/bench.cpp b/bench.cpp
@@ -81,7 +81,7 @@ void algo_free_all(int thr_id)
 	free_nist5(thr_id);
 	free_pentablake(thr_id);
 	free_phi(thr_id);
-	free_phi2d(thr_id);
+	free_phi2(thr_id);
 	free_polytimos(thr_id);
 	free_quark(thr_id);
 	free_qubit(thr_id);

diff --git a/ccminer.cpp b/ccminer.cpp
@@ -269,7 +269,7 @@ Options:\n\
 			nist5       NIST5 (TalkCoin)\n\
 			penta       Pentablake hash (5x Blake 512)\n\
 			phi         Lux, Folm ...\n\
-			phi2d       Dirty Lux Phi2\n\
+			phi2        Dirty Lux Phi2\n\
 			polytimos   Politimos\n\
 			quark       Quark\n\
 			qubit       Qubit\n\
@@ -2430,7 +2430,7 @@ static void *miner_thread(void *userdata)
 			rc = scanhash_phi(thr_id, &work, max_nonce, &hashes_done);
 			break;
 		case ALGO_PHI2D:
-			rc = scanhash_phi2d(thr_id, &work, max_nonce, &hashes_done);
+			rc = scanhash_phi2(thr_id, &work, max_nonce, &hashes_done);
 			break;
 		case ALGO_POLYTIMOS:
 			rc = scanhash_polytimos(thr_id, &work, max_nonce, &hashes_done);

diff --git a/ccminer.vcxproj b/ccminer.vcxproj
@@ -395,7 +395,7 @@
     <ClCompile Include="neoscrypt\neoscrypt-cpu.c" />
     <ClInclude Include="neoscrypt\cuda_vectors.h" />
     <ClInclude Include="phi\cuda_phi.h" />
-    <ClInclude Include="phi\cuda_phi2d.h" />
+    <ClInclude Include="phi\cuda_phi2.h" />
     <ClInclude Include="phi\cuda_phi2_skein512.h" />
     <ClInclude Include="x11\cuda_x11_simd512_sm2.cuh" />
     <ClInclude Include="x16x\cuda_x16x.h" />
@@ -553,7 +553,7 @@
     <CudaCompile Include="phi\cuda_phi2_br_streebog_echo512.cu" />
     <CudaCompile Include="phi\cuda_phi2_lyra2.cu" />
     <CudaCompile Include="phi\cuda_phi2_skein512.cu" />
-    <CudaCompile Include="phi\phi2d.cu" />
+    <CudaCompile Include="phi\phi2.cu" />
     <CudaCompile Include="scrypt\blake.cu" />
     <CudaCompile Include="scrypt\keccak.cu" />
     <CudaCompile Include="scrypt\sha256.cu" />

diff --git a/ccminer.vcxproj.filters b/ccminer.vcxproj.filters
@@ -620,8 +620,8 @@
     <ClInclude Include="Algo512\cuda_b_hamsi512.h" />
     <ClInclude Include="Algo512\cuda_b_cubehash512.h" />
     <ClInclude Include="Algo512\cuda_b_jh512.h" />
-    <ClInclude Include="phi\cuda_phi2d.h" />
     <ClInclude Include="phi\cuda_phi2_skein512.h" />
+    <ClInclude Include="phi\cuda_phi2.h" />
   </ItemGroup>
   <ItemGroup>
     <CudaCompile Include="cuda.cpp">
@@ -1010,10 +1010,10 @@
     <CudaCompile Include="Algo512\cuda_b_hamsi512.cu" />
     <CudaCompile Include="Algo512\cuda_b_cubehash512.cu" />
     <CudaCompile Include="Algo512\cuda_b_jh512.cu" />
-    <CudaCompile Include="phi\phi2d.cu" />
     <CudaCompile Include="phi\cuda_phi2_skein512.cu" />
     <CudaCompile Include="phi\cuda_phi2_lyra2.cu" />
     <CudaCompile Include="phi\cuda_phi2_br_streebog_echo512.cu" />
+    <CudaCompile Include="phi\phi2.cu" />
   </ItemGroup>
   <ItemGroup>
     <Image Include="res\ccminer.ico">

diff --git a/miner.h b/miner.h
@@ -292,7 +292,7 @@ extern "C" {
     extern int scanhash_nist5(int thr_id, struct work *work, uint32_t max_nonce, unsigned long *hashes_done);
     extern int scanhash_pentablake(int thr_id, struct work *work, uint32_t max_nonce, unsigned long *hashes_done);
     extern int scanhash_phi(int thr_id, struct work* work, uint32_t max_nonce, unsigned long *hashes_done);
-    extern int scanhash_phi2d(int thr_id, struct work* work, uint32_t max_nonce, unsigned long *hashes_done);
+    extern int scanhash_phi2(int thr_id, struct work* work, uint32_t max_nonce, unsigned long *hashes_done);
     extern int scanhash_polytimos(int thr_id, struct work* work, uint32_t max_nonce, unsigned long *hashes_done);
     extern int scanhash_quark(int thr_id, struct work *work, uint32_t max_nonce, unsigned long *hashes_done);
     extern int scanhash_qubit(int thr_id, struct work* work, uint32_t max_nonce, unsigned long *hashes_done);
@@ -358,7 +358,7 @@ extern "C" {
     extern void free_nist5(int thr_id);
     extern void free_pentablake(int thr_id);
     extern void free_phi(int thr_id);
-    extern void free_phi2d(int thr_id);
+    extern void free_phi2(int thr_id);
     extern void free_polytimos(int thr_id);
     extern void free_quark(int thr_id);
     extern void free_qubit(int thr_id);

diff --git a/phi/cuda_phi2d.h b/phi/cuda_phi2.h
rename from phi/cuda_phi2d.h
rename to phi/cuda_phi2.h
diff --git a/phi/cuda_phi2_lyra2.cu b/phi/cuda_phi2_lyra2.cu
@@ -4,12 +4,15 @@
  * DumaxFr@github 2018 - Dual Lyra2 for Phi2
  */
 
+#include <stdio.h>
 
 #include "cuda_helper.h"
 #include "cuda_vector_uint2x4.h"
 
-
-#define PHI2LYRA2_TPB64 32
+// Limited by shared mem max capacity (TPB x 1536 bytes <= 48 KB)
+// 48 KB preferred, to fit twice in the 96 KB max shared memory on sm52 & sm61
+#define PHI2LYRA2_TPB64_MAIN 32
+#define PHI2LYRA2_TPB64_LDST 128
 
 #ifdef __INTELLISENSE__
 /* just for vstudio code colors */
@@ -147,7 +150,7 @@ static void round_lyra(uint2x4* s)
 }
 
 __device__ __forceinline__
-static void reduceDuplex(uint2 state[4], uint32_t thread, const uint32_t threads)
+static void reduceDuplex(uint2 state[4], const uint32_t thread, const uint32_t threads)
 {
 	uint2 state1[3];
 
@@ -175,7 +178,7 @@ static void reduceDuplex(uint2 state[4], uint32_t thread, const uint32_t threads
 }
 
 __device__ __forceinline__
-static void reduceDuplexRowSetup(const int rowIn, const int rowInOut, const int rowOut, uint2 state[4], uint32_t thread, const uint32_t threads)
+static void reduceDuplexRowSetup(const int rowIn, const int rowInOut, const int rowOut, uint2 state[4], const uint32_t thread, const uint32_t threads)
 {
 	uint2 state1[3], state2[3];
 
@@ -226,7 +229,7 @@ static void reduceDuplexRowt(const int rowIn, const int rowInOut, const int rowO
 		LD4S(state1, rowIn, i, thread, threads);
 		LD4S(state2, rowInOut, i, thread, threads);
 
-#pragma unroll
+        #pragma unroll
 		for (int j = 0; j < 3; j++)
 			state[j] ^= state1[j] + state2[j];
 
@@ -255,7 +258,7 @@ static void reduceDuplexRowt(const int rowIn, const int rowInOut, const int rowO
 
 		LD4S(state1, rowOut, i, thread, threads);
 
-#pragma unroll
+        #pragma unroll
 		for (int j = 0; j < 3; j++)
 			state1[j] ^= state[j];
 
@@ -318,19 +321,9 @@ static void reduceDuplexRowt_8(const int rowInOut, uint2* state, const uint32_t
 		state[j] ^= last[j];
 }
 
-//__constant__ uint2x4 blake2b_IV[2] = {
-//	0xf3bcc908lu, 0x6a09e667lu,
-//	0x84caa73blu, 0xbb67ae85lu,
-//	0xfe94f82blu, 0x3c6ef372lu,
-//	0x5f1d36f1lu, 0xa54ff53alu,
-//	0xade682d1lu, 0x510e527flu,
-//	0x2b3e6c1flu, 0x9b05688clu,
-//	0xfb41bd6blu, 0x1f83d9ablu,
-//	0x137e2179lu, 0x5be0cd19lu
-//};
 
 __global__
-__launch_bounds__(64, 1)
+__launch_bounds__(PHI2LYRA2_TPB64_LDST, 8)
 void cuda_phi2_lyra2_gpu_hash_32p1_1(const uint32_t threads, const uint2* const __restrict__ g_hash) {
 
 	const uint32_t thread = (blockDim.x * blockIdx.x + threadIdx.x);
@@ -369,7 +362,7 @@ void cuda_phi2_lyra2_gpu_hash_32p1_1(const uint32_t threads, const uint2* const
 }
 
 __global__
-__launch_bounds__(64, 1)
+__launch_bounds__(PHI2LYRA2_TPB64_LDST, 8)
 void cuda_phi2_lyra2_gpu_hash_32p2_1(const uint32_t threads, const uint2* const __restrict__ g_hash) {
 
 	const uint32_t thread = (blockDim.x * blockIdx.x + threadIdx.x);
@@ -408,7 +401,7 @@ void cuda_phi2_lyra2_gpu_hash_32p2_1(const uint32_t threads, const uint2* const
 }
 
 __global__
-__launch_bounds__(PHI2LYRA2_TPB64, 1)
+__launch_bounds__(PHI2LYRA2_TPB64_MAIN, 1)
 void cuda_phi2_lyra2_gpu_hash_32_2(const uint32_t threads) {
 
 	const uint32_t thread = blockDim.y * blockIdx.x + threadIdx.y;
@@ -453,7 +446,7 @@ void cuda_phi2_lyra2_gpu_hash_32_2(const uint32_t threads) {
 }
 
 __global__
-__launch_bounds__(64, 1)
+__launch_bounds__(PHI2LYRA2_TPB64_LDST, 8)
 void cuda_phi2_lyra2_gpu_hash_32p1_3(const uint32_t threads, uint2 *g_hash) {
 
 	const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
@@ -479,7 +472,7 @@ void cuda_phi2_lyra2_gpu_hash_32p1_3(const uint32_t threads, uint2 *g_hash) {
 }
 
 __global__
-__launch_bounds__(64, 1)
+__launch_bounds__(PHI2LYRA2_TPB64_LDST, 8)
 void cuda_phi2_lyra2_gpu_hash_32p2_3(const uint32_t threads, uint2 *g_hash) {
 
 	const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
@@ -513,20 +506,20 @@ void cuda_phi2_lyra2_cpu_init(uint64_t *d_matrix) {
 
 __host__
 void cuda_phi2_lyra2_cpu_hash_32x2(const uint32_t threads, uint32_t *d_hash) {
-	uint32_t tpb = PHI2LYRA2_TPB64;
+	uint32_t tpb = PHI2LYRA2_TPB64_MAIN;
 
 	dim3 grid1((threads * 4 + tpb - 1) / tpb);
 	dim3 block1(4, tpb >> 2);
 
-	dim3 grid2((threads + 64 - 1) / 64);
-	dim3 block2(64);
+	dim3 grid2((threads + PHI2LYRA2_TPB64_LDST - 1) / PHI2LYRA2_TPB64_LDST);
+	dim3 block2(PHI2LYRA2_TPB64_LDST);
 
 	cuda_phi2_lyra2_gpu_hash_32p1_1 <<< grid2, block2 >>> (threads, (uint2*)d_hash);
-	cuda_phi2_lyra2_gpu_hash_32_2 <<< grid1, block1, 24 * (8 - 0) * sizeof(uint2) * tpb >>> (threads);
+	cuda_phi2_lyra2_gpu_hash_32_2 <<< grid1, block1, 192 * sizeof(uint2) * tpb >>> (threads);
 	cuda_phi2_lyra2_gpu_hash_32p1_3 <<< grid2, block2 >>> (threads, (uint2*)d_hash);
 
 	cuda_phi2_lyra2_gpu_hash_32p2_1 <<< grid2, block2 >>> (threads, (uint2*)d_hash);
-	cuda_phi2_lyra2_gpu_hash_32_2 <<< grid1, block1, 24 * (8 - 0) * sizeof(uint2) * tpb >>> (threads);
+	cuda_phi2_lyra2_gpu_hash_32_2 <<< grid1, block1, 192 * sizeof(uint2) * tpb >>> (threads);
 	cuda_phi2_lyra2_gpu_hash_32p2_3 <<< grid2, block2 >>> (threads, (uint2*)d_hash);
 
 }
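
Note on the launch configuration in cuda_phi2_lyra2_cpu_hash_32x2: the sizes chosen above can be checked with a few lines of host-side arithmetic. The sketch below is not part of the commit. It assumes `uint2` is 8 bytes (two 32-bit words, modelled here by a stand-in `Uint2` struct), and the names `kTpbMain`, `kTpbLdst`, `sharedBytes`, `mainGrid` and `ldstGrid` are hypothetical helpers that mirror `PHI2LYRA2_TPB64_MAIN`, `PHI2LYRA2_TPB64_LDST` and the expressions in the host function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for CUDA's uint2 so the arithmetic can be checked without nvcc
// (assumption: two 32-bit words, 8 bytes, as in the CUDA vector type).
struct Uint2 { uint32_t x, y; };
static_assert(sizeof(Uint2) == 8, "Uint2 must match the size of CUDA's uint2");

// Hypothetical mirrors of the two macros introduced by the diff.
constexpr uint32_t kTpbMain = 32;   // PHI2LYRA2_TPB64_MAIN
constexpr uint32_t kTpbLdst = 128;  // PHI2LYRA2_TPB64_LDST

// Ceiling division, the pattern used for both grids in the host function.
constexpr uint32_t ceilDiv(uint32_t n, uint32_t d) { return (n + d - 1) / d; }

// Dynamic shared memory requested for the main kernel:
// 192 * sizeof(uint2) * tpb, i.e. 1536 bytes per thread of the block.
constexpr std::size_t sharedBytes(uint32_t tpb) {
    return std::size_t{192} * sizeof(Uint2) * tpb;
}

// Main kernel: block1 is (4, tpb >> 2), so each hash is spread over 4 threads
// and the grid covers threads * 4 work items in blocks of tpb.
constexpr uint32_t mainGrid(uint32_t threads) { return ceilDiv(threads * 4, kTpbMain); }

// Load/store kernels: one thread per hash, blocks of kTpbLdst threads.
constexpr uint32_t ldstGrid(uint32_t threads) { return ceilDiv(threads, kTpbLdst); }

// With tpb = 32 the request is exactly 48 KB, so two blocks fit in 96 KB.
static_assert(sharedBytes(kTpbMain) == 48 * 1024, "main kernel uses 48 KB per block");
static_assert(2 * sharedBytes(kTpbMain) <= 96 * 1024, "two blocks fit in 96 KB");
```

For a hypothetical batch of 2^20 hashes this gives 131072 blocks for the main kernel and 8192 blocks for each load/store kernel. The 48 KB figure is also why the per-thread cost in the comment near the macros works out to 1536 bytes (192 × 8).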