Welcome to the Sanmayce downloads page.
Click on image above to download the package Monstrous_Jesters_revision_C.zip 160 MB (168,299,369 bytes).
For reference: http://www.sanmayce.com/Downloads/index.html#Jesters
'Monstrous Jesters' benchmark package short overview:
This is my latest 32bit/64bit CPU/RAM benchmark package (strstr-showdown included), distributed as an NSIS installer.
File: Monstrous_Jesters.exe
Size: 158 MB (165,927,681 bytes)
Size unpacked: 513 MB
Size needed: 2000 MB
After installation, 7 shortcuts (tests) are placed on Desktop/Programs.
All tests are written in C (sources included) and compiled with the latest Intel 12.1 and Microsoft 16 optimizing compilers.
The MEMMEM (strstr-showdown) test takes about 21 minutes to complete on a Core2Duo E7500 at 2.93GHz.
To obtain decent results, stop all concurrent processes before running the tests.
Also enable 100% computing power.
Railgun homepage: http://www.sanmayce.com/Railgun/index.html
Railgun article: http://www.codeproject.com/KB/cpp/Railgun_Quadruplet.aspx
Well, there are some additional tests (Intel 12.1 and Microsoft 16 executables included):
- lzpre, an LZ77 32bit/64bit [de]compressor written by Matt Mahoney;
- Yappy, an LZ 32bit/64bit [de]compressor written by IronPeter;
- Knight tour benchmark: finds the first 9,000,000 tours (at a rate of about 1 billion jumps per minute); in effect it stresses only the CPU clock;
- Quicksort 32bit/64bit, used to sort 200,000,000+ pointers (pointing to 7-byte chunks);
- NEW! (since rev. B) qpress - Fastest LZ Decompressor, uses 2/4/6/8/12/24/32/48 threads, written by Lasse Reinhold;
Note: The qpress sources are not included; they are downloadable from Lasse's site: www.quicklz.com
- NEW! (since rev. C) ZPAQ, a 64bit multi-threaded [de]compressor, one of the strongest crunchers, written by Matt Mahoney.
The shot below: Revision B just installed:
The shot below: Revision B results on my laptop T7500 2200MHz:
Results for 'Monstrous Jesters' revision B on my laptop T7500 2200MHz (4MB L2 cache) 4GB dual channel DDR2 667MHz using Windows 7 64bit:
Test #1: MEMMEM
OSHO.TXT:
SHORT-SHOWDOWN_Intel_O3_64bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2725KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2708288/6416464496
BNDM_64 49 i.e. average performance: 2524KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2779920/6213485968
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 2122KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 1880784/8251788448
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 2352KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2701232/6466619104
]
strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox_64bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2689KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2708288/6416464496
BNDM_64 49 i.e. average performance: 2414KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2779920/6213485968
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 1737KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 1880784/8251788448
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 2565KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2701232/6466619104
]
strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox_32bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2947KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2708288/6416464496
BNDM_64 49 i.e. average performance: 2201KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2779920/6213485968
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 1593KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 1880784/8251788448
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 2958KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2701232/6466619104
]
hs_alt_HuRef_chr1.fa:
SHORT-SHOWDOWN_Intel_O3_64bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2711KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2634368/7091550000
BNDM_64 49 i.e. average performance: 3535KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2806144/6595760528
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 2636KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 2540592/9256480624
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 2397KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2691888/7089590528
]
strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox_64bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2868KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2634368/7091550000
BNDM_64 49 i.e. average performance: 3397KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2806144/6595760528
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 2266KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 2540592/9256480624
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 2592KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2691888/7089590528
]
strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox_32bit.exe:
[
Railgun_Quadruplet_7Tridentx64 49 i.e. average performance: 2977KB/clock
Railgun_Quadruplet_7Tridentx64 49 total Skip-Performance/Iterations: 2634368/7091550000
BNDM_64 49 i.e. average performance: 3131KB/clock
BNDM_64 49 total Skip-Performance/Iterations: 2806144/6595760528
Railgun_Quadruplet_7Elsiane 49 i.e. average performance: 2052KB/clock
Railgun_Quadruplet_7Elsiane 49 total Skip-Performance/Iterations: 2540592/9256480624
Railgun_Quadruplet_7Hasherezade 49 i.e. average performance: 3035KB/clock
Railgun_Quadruplet_7Hasherezade 49 total Skip-Performance/Iterations: 2691888/7089590528
]
Test #2: LZ Yappy
Yappy_Intel_32bit_O3.exe: comp 29.9 MB/s uncomp 512.5 MB/s
Yappy_Intel_32bit_Ox.exe: comp 33.1 MB/s uncomp 513.0 MB/s
Yappy_Microsoft_32bit_Ox.exe: comp 32.3 MB/s uncomp 527.1 MB/s
Test #3: qpress
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 2
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 505MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 4
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 505MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 6
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 505MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 8
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 486MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 12
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 467MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 24
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 450MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 32
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 467MB/s
Kazuya_PTHREADed: DEFAULT_THREAD_COUNT: 48
Kazuya_PTHREADed: Decompression RAM-to-RAM performance: 332MB/s
Test #4: LZMM
lzpre2_32bit_Microsoft_Ox.exe: 29.25 sec
lzpre2_x64_Intel_O3.exe: 26.74 sec
lzpre2_x64_Microsoft_Ox.exe: 27.10 sec
Test #5: Quicksort
Simplicius_Simplicissimus_Septupleton_Intel_32bit_v12_Ox.exe:
Sort took: 196062 clocks
Decompression to RAM without Dumping to DRIVE performance: 174943 KB/s or 170 MB/s
Benchmarking 'memcpy' by copying 197MB (OSHO.TXT size) ten times ...
Simplicius says for 'memcpy' performance: 1802 MB/s
Simplicius_Simplicissimus_Septupleton_Microsoft_32bit_v16_Ox.exe:
Sort took: 220819 clocks
Decompression to RAM without Dumping to DRIVE performance: 212247 KB/s or 207 MB/s
Benchmarking 'memcpy' by copying 197MB (OSHO.TXT size) ten times ...
Simplicius says for 'memcpy' performance: 1418 MB/s
Test #6: Knight Tours
Knight-tour_Microsoft_V16_32bit_Ox.exe: 218.13 seconds
Knight-tour_Intel_V12_32bit_Ox.exe: 227.73 seconds
I initiated a thread on a cool (COLD yes) overclock maniacs forum at:
http://www.overclockaholics.com/forums/showthread.php?t=5132
Enjoy!
Kaze, 2012-Mar-22
Click on image above to view a PDF booklet (project 'Gamera' revision 15 at a glance).
For reference: http://www.sanmayce.com/Downloads/index.html#Gamera
Wanna try whether your browser is well written? Click on image above to view (or rather download for off-line browsing) a very long HTM page containing 'Married With Children' comedy saga phrase-checked.
Table Note1: Corpus 'Gamera' statistics (given in bold) obtained with Leprechaun_x-leton revision 15.
Table Note2: Corpus 'Gamera' was ripped as one TAR file of 34,273,505,280 bytes consisting of 562,504 TXT files.
Table Note3: Corpus 'Wikipedia' was ripped as one file, enwiki-20120403-pages-articles.xml, of 37,430,769,961 bytes consisting of 607,687,347 lines (longest line: 795,359).
Table Note4: Two bugs were fixed in Leprechaun_x-leton revision 15FIXFIX (downloadable at the bottom of section 2), so the entire table must be remade: the number of x-grams is significantly bigger. The remake will take maybe a few weeks. For example, on a laptop with an HDD and a Merom 2166MHz (1758MB RAM used), ripping all the 879,557,846 distinct 4-grams takes 64 passes * 1800s = ~32 hours (the exact stats: Total time: 120382 second(s), Total performance: 23,365P/s i.e. phrases per second).
(Where a cell holds two values, the first is for corpus 'Gamera' and the second for corpus 'Wikipedia'.)
x-grams Order | x-grams Total Number | x-grams Distinct Number | x-grams Distinct Size | Total memory needed for one pass
1-grams | 4,589,933,215 / 4,963,095,154 | 9,181,275 / 12,475,645 | 207,307,606 bytes / 271,221,937 bytes | 720,042KB / 950,731KB
2-grams | 3,835,376,293 / 3,883,397,331 | 124,669,942 / 187,975,215 | 3,267,897,913 bytes / 4,861,307,858 bytes | 10,746,363KB / 15,994,665KB
3-grams | 3,286,567,380 / 3,160,801,770 | 477,829,381 / 625,323,984 | 14,630,478,827 bytes / 19,354,345,361 bytes | 40,435,332KB / 52,701,578KB
4-grams | 2,812,845,037 / 2,683,841,486 | 879,557,846 / 1,019,691,522 | 30,931,053,072 bytes / 36,790,039,795 bytes | 87,861,963KB / 101,441,772KB
5-grams | 1,587,143,109 | 726,496,853 | 28,676,806,441 bytes | 82,888,636KB
6-grams | 1,197,407,768 | 652,099,162 | 28,994,002,731 bytes | 83,523,982KB
7-grams | 889,495,472 | 524,279,771 | 26,052,174,976 bytes | 74,669,323KB
8-grams | 1,439,078,375 / 1,530,957,484 | 812,576,024 / 1,035,095,633 | 45,963,360,220 bytes / 60,845,501,063 bytes | 128,842,779KB / 163,445,441KB
9-grams | 465,990,400 | 296,741,735 | 17,911,340,011 bytes | 51,404,184KB
Rip Note1: What is the purpose of all these x-grams? Simply to make phrase-checking (with ranking) possible. The first English 4-gram phrase-checker, named Graphein (revision 2, powered by 800,000,000+ Gamera corpus 4-grams), is about to emerge soon. This ~14GB package will allow, in two steps (copying the TXT file(s) into a specific folder and executing a desktop shortcut), auto-loading into NOTEPAD the resulting TXT file containing 4-grams familiar and unfamiliar to the Gamera corpus. An example follows in the form of a PDF booklet (26 A4 pages):
'Crisis' Does NOT Equal 'Danger' Plus 'Opportunity' by Victor H. Mair 2+4-grammed.pdf.
Also a quick look at the results of searching for the *having_been* pattern among the 812,576,024 8-grams:
Kazuya_Gamera_8-grams_requests_a_small_part_'HAVING_BEEN'.log
Rip Note2:
In the next Leprechaun_x-leton revision 16 (respectively Graphein revision 3) a dump of all slots (including B-trees) will be added, thus enabling fast linear phrase-checking (against a given corpus) that is bound by IOPS ONLY, i.e. the latency of the drive will be the only bottleneck (no dependence on CPU or physical RAM).
Rip Note3: The following B-tree heights show (roughly) how many I/O operations (seeks i.e. repositionings) are needed to find an x-gram when the virtual memory 'Z/z' option is used:
Order 1:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 2
Order 2:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 3
Order 3:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 4
Order 4:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 5
Order 5:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 5
Order 6:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 4
Order 7:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 4
Order 8:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 5
Order 9:
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 4
With the slowest drives (that is, HDDs) it takes at most 5 I/O operations (512-byte packets) to find any x-gram, i.e. about 5 x 10 milliseconds = 50 milliseconds, which gives a rate of 20+ x-grams per second for all orders.
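The arithmetic above can be sketched as follows. This is only an illustration, assuming one seek per CORONA level and the 10 ms HDD seek time used in the text; the name lookups_per_second is hypothetical and is not part of Leprechaun.

```python
def lookups_per_second(tree_height: int, seek_ms: float) -> float:
    """Worst-case x-gram lookups per second if every B-tree level costs one seek."""
    if tree_height <= 0 or seek_ms <= 0:
        raise ValueError("tree_height and seek_ms must be positive")
    return 1000.0 / (tree_height * seek_ms)


# CORONA levels per order, copied from the report above.
CORONA_LEVELS = {1: 2, 2: 3, 3: 4, 4: 5, 5: 5, 6: 4, 7: 4, 8: 5, 9: 4}

if __name__ == "__main__":
    for order, height in sorted(CORONA_LEVELS.items()):
        rate = lookups_per_second(height, 10.0)  # 10 ms per seek is the assumed HDD latency
        print(f"Order {order}: at least {rate:.0f} x-grams/s")
```

The deepest trees (5 levels) give the 20 x-grams per second floor; shallower orders do somewhat better under the same assumption.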
Rip Note4: The most suitable drive (for the virtual memory 'Z/z' options) under heavy, mostly RANDOM-access loads is the Fusion-io ioDrive SSD series (SLC/MLC, PCIe): the ioDrive Duo 320GB allows 238,000 mixed (75/25 r/w) IOPS (512 bytes), and the ioDrive Octal 5.12TB offers 729,000 75/25 mix IOPS (512 bytes)!!! Not impressed? Compare these with the super-poor 100 IOPS that the fastest HDD can deliver.
Rip Note5: Currently corpus 'Gamera' holds fewer than 1 billion distinct x-grams for each order, so sorting an x-gram wordlist takes roughly 1,000,000,000*8bytes (one 64bit pointer for each x-gram) of physical RAM, or in short: 8GB.
Rip Note6: Console log showing how 7-grams were being ripped from corpus 'Gamera': gamera_7-grams_log.txt shows 64 passes executed - for one pass needed memory is 74,669,323KB, pshaw!
Virtual memory allocated on SSD vs Multi-pass mode using Physical memory:
Ripping 'Gamera' corpus on laptop (CPU Intel T7500 plus Samsung 64GB SSD 470 series), one fourth completed:
E:\_Gamera>Leprechaun_x-leton_32bit_quadrupleton_4passes.exe _Gamera.tar.lst _Gamera.tar.4 22020096 Z
Leprechaun_quadrupleton (Fast-In-Future Greedy n-gram-Ripper), rev. 15FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 4-grams (4-word phrases) with length 13..51 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision 4 passes are to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 4:
Size of input file with files for Leprechauning: 52
Allocating HASH memory 134,217,793 bytes ... OK
Allocating/ZEROing 22,548,578,318 bytes swap file ... OK
Size of Input TEXTual file: 34,273,505,280
|; 00,002,086P/s; Phrase count: 2,812,845,037 of them 219,872,420 distinct; Done: 64/64
Bytes per second performance: 25,418B/s
Phrases per second performance: 2,086P/s
Time for putting phrases into trees: 1348366 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,002,907P/s
Time for shaking phrases from trees: 151247 second(s)
Leprechaun: Current pass done.
Leprechaun report:
Number Of Hash Collisions(Distinct WORDs - Number Of Trees): 215,678,116
Number Of Trees(GREATER THE BETTER): 4,194,304
Number Of LEAFs(littler THE BETTER) not counting ROOT LEAFs: 161,182,602
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 5
Used value for third parameter in KB: 22,020,096
Use next time as third parameter: 21,964,121
Total Attempts to Find/Put WORDs into B-trees order 3: 2,235,409,809
Hard Disk Sentinel:
Total Data Read ~9,002,571 MB
Total Data Write ~3,025,453 MB
Quick notes:
- The gradual IOPS performance degradation (as the swap file, i.e. the virtual memory pool, gets bigger) scares me; I had greater expectations.
- 2,000 phrases per second (which equals several times more IOPS) is definitely poor: compared to an HDD's 100 phrases per second it is only 20 times faster, while I expected 100 times.
- Bad bad bad: 417 hours to complete 25%. The obvious next attempt should use a 27bit hash table (i.e. 134,217,728 slots x 8bytes = 1024MB) instead of a 24bit one (i.e. 16,777,216 slots x 8bytes = 128MB), so that the 879,557,846 4-grams map to 134,217,728 slots, or about 7:1, not bad at all.
- Having the SSD size limitation on top of all other limitations makes the pain uglier, so at least the SSD should be big enough to house the (one-pass) swap file, in this case 87,861,963KB for 4-grams and 128,842,779KB for 8-grams.
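The sizing figures in these notes can be rechecked with a few lines. This is a sketch only: the function names are hypothetical, and 8 bytes per slot is the value quoted in the notes.

```python
def hash_table_bytes(bits: int, slot_bytes: int = 8) -> int:
    """Size in bytes of a hash table with 2**bits slots."""
    return (1 << bits) * slot_bytes


def keys_per_slot(distinct_keys: int, bits: int) -> float:
    """Average number of distinct keys that land in one slot (one B-tree)."""
    return distinct_keys / (1 << bits)


if __name__ == "__main__":
    four_grams = 879_557_846  # distinct Gamera 4-grams, from the table
    for bits in (24, 27):
        size_mb = hash_table_bytes(bits) // 2**20
        print(f"{bits}bit: {size_mb}MB, {keys_per_slot(four_grams, bits):.2f} keys per slot")
    # Pass #1 timing from the log: putting phrases into trees + shaking them out.
    print(f"Pass #1 took {(1_348_366 + 151_247) / 3600:.0f} hours")
```

Going from 24 to 27 bits cuts the average load per tree by a factor of 8, which is the point of the proposed change.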
Bottom-lines:
- On random 75%R/25%W loads SSDs are only 20 times faster than HDDs. To avoid further pain, buy/use an SSD big enough to house the entire Virtual-Memory-Pool: in my case 123GB, which needs a 256GB drive.
- Single-pass mode using Virtual memory allocated on SSD is still inferior to Multi-pass mode using Physical memory.
- My personal choice would be a 512GB SSD; the rest have no future (for serious tasks).
Leprechaun_x-leton:
A free, open-source and demonically fast phrase ripper.
The long-awaited (by me, he-he) Super-Leprechaun: pure 64bit addressing in both the 32bit and 64bit code, with an adjustable hash table and MULTI-PASS MODE.
Powerful and fast: unlimited (in practice) vocabulary size, rips at a rate of 1+ million phrases per second.
Creates unique x-gram phrases, allowing a vocabulary consisting of x (1..10) COLLOCATIVE words to be analyzed.
Slow-Rip (in one pass) for BILLIONS of x-grams (on SSD). An external B-trees technique is used, currently with 16,777,216 B-Trees of order 3.
For 5 billion 4-grams it needs about 633GB (136bytes*5,000,000,000) of external memory with 30 microseconds seek-time, though!
The era of 640GB physical/internal RAM has commenced; having such a computer (30 nanoseconds seek-time internal memory) will unleash the beast.
Yes, we are in limbo (that is, nowadays 640MB main memory), i.e. in between the oldadays 640KB and the newadays 640GB.
The nifty thing about MPM (Multi-Pass Mode): it allows ripping the whole electronic English on a simple PC (with HDD and 4GB RAM).
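One way a multi-pass mode can keep memory bounded is hash partitioning: the whole input is re-read in every pass, but each pass only counts the phrases whose hash falls into its own partition. The sketch below illustrates that general idea; it is an assumption for illustration, not a description of Leprechaun's actual code, and zlib.crc32 merely stands in for whatever hash the tool uses.

```python
import zlib
from collections import Counter
from typing import Callable, Iterable, Iterator


def rip_in_passes(read_phrases: Callable[[], Iterable[str]], passes: int) -> Iterator[Counter]:
    """Yield one Counter per pass; each phrase is counted in exactly one pass."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    for current in range(passes):
        counts: Counter = Counter()
        for phrase in read_phrases():  # the whole input is re-read in every pass
            if zlib.crc32(phrase.encode("utf-8")) % passes == current:
                counts[phrase] += 1
        yield counts  # the caller can dump this partition and free the memory


if __name__ == "__main__":
    sample = ["which_do_you_want", "which_has_become_a", "which_do_you_want"]
    for number, part in enumerate(rip_in_passes(lambda: sample, 4), start=1):
        print(f"Pass #{number} of 4: {dict(part)}")
```

Each pass needs memory for roughly 1/passes of the distinct phrases, at the price of reading the input once per pass.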
...
0,000,003 worried_about_whether_the
0,000,001 worried_about_their_small
0,000,001 worried_about_the_rest
0,000,001 worried_about_their_antipathy
0,000,001 worried_about_thoughts_circulating
0,000,001 worried_about_the_destination
0,000,004 worried_about_the_man
0,000,001 worried_about_god_s
0,000,001 worried_about_others_if
0,000,001 worried_about_rolls_royces
0,000,001 worried_about_anything_the
...
0,000,013 which_has_become_a
0,000,001 which_gives_life_the
0,000,001 which_for_centuries_they
0,000,002 which_formally_he_is
0,000,001 which_extends_both_to
0,000,001 which_everything_would_be
0,000,001 which_everything_false_is
0,000,001 which_everything_was_destroyed
0,000,001 which_exploited_the_whole
0,000,001 which_fell_on_the
0,000,001 which_dogo_is_missing
0,000,001 which_emperor_was_born
0,000,003 which_do_you_want
0,000,001 which_does_contain_the
0,000,002 which_does_not_grow
0,000,003 which_develops_into_awareness
0,000,001 which_dissolves_all_paradoxes
...
0,000,001 continuously_brag_that_you
0,000,001 continuously_cripple_the_research
0,000,001 continuously_ask_for_attention
0,000,001 continuously_available_to_the
0,000,001 continuous_showering_of_blessings
0,000,001 continuous_trouble_during_my
0,000,001 continuously_and_you_will
0,000,001 continuously_being_pulled_downwards
0,000,002 continuously_escaping_from_yourself
0,000,001 continuous_motion_won_t
0,000,001 continuous_oxygen_is_needed
0,000,001 continuous_fear_that_i
0,000,001 continuous_hunger_for_war
0,000,001 continuous_conflict_within_you
...
0,000,001 is_as_strongly_convinced
0,000,001 is_as_mindless_as
0,000,002 is_as_nonexistential_as
0,000,001 is_as_powerless_as
0,000,001 is_as_imperfect_as
0,000,002 is_as_inconceivable_as
0,000,003 is_as_if_for
0,000,001 is_as_distant_as
0,000,002 is_as_enlightened_as
0,000,002 is_as_formless_as
...
0,000,001 indignantly_at_this_suggestion
0,000,001 indira_gandhi_both_were
0,000,001 indira_s_congress_party
0,000,001 indirect_way_of_indicating
0,000,002 indistinguishable_from_the_patriarchs
0,000,004 individual_is_a_reality
0,000,001 individuality_of_an_enlightened
0,000,002 infinite_energy_within_you
0,000,001 indifferent_where_ordinarily_you
0,000,002 indigestion_was_your_real
0,000,001 indicative_of_a_fear
0,000,001 indicative_of_a_healthy
...
All-in-all: IT is capable of ripping the whole electronic English at once while counting the repetitions of each x-gram.
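A minimal sketch of what ripping and counting looks like, producing lines in the same shape as the excerpts above. It rests on assumptions: words are taken to be lowercased runs of a-z (which matches samples such as worried_about_god_s), any x-gram containing a word longer than 31 letters is simply dropped, and the 13..51 chars limit from the log is not modeled. The names rip_xgrams, rank_prefix and format_line are hypothetical.

```python
import re
from collections import Counter


def rip_xgrams(text: str, order: int = 4, max_word_len: int = 31) -> Counter:
    """Count every run of `order` consecutive words, joined with underscores."""
    words = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for i in range(len(words) - order + 1):
        window = words[i:i + order]
        if all(len(word) <= max_word_len for word in window):
            counts["_".join(window)] += 1
    return counts


def rank_prefix(n: int) -> str:
    """Format a count as in the dumps above, e.g. 3 -> '0,000,003'."""
    digits = f"{n:07d}"
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    return ",".join(reversed(groups))


def format_line(count: int, xgram: str) -> str:
    return f"{rank_prefix(count)} {xgram}"


if __name__ == "__main__":
    text = "Worried about the man. Worried about the man!"
    for xgram, count in rip_xgrams(text).items():
        print(format_line(count, xgram))
```

A Counter held in RAM only works for small inputs; the external B-trees and multi-pass mode described above exist precisely because the real corpora do not fit.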
For reference: http://www.sanmayce.com/Downloads/index.html#Leprechaun
Click on image above to download the package Dumbino_r1.7z 442 MB (463,605,386 bytes).
Note1: This bona-fide artist comes from 'FUNHOUSE' video-clip, VIVA.
Note2: Dumbino (former 'Graphein') is an open-source and free English phrase-checker.
Note3: Dumbino features sub-linear phrase-checking performance, regardless of corpus size (currently googlebooks-eng-us-all-4gram-20090715 140,222,335 4-grams). My corpora are richer.
For short overview: Dumbino - a console word (x-gram) checker_at-a-glance.pdf
For reference: http://www.sanmayce.com/Downloads/index.html#Dumbino
Click on image above to download the package _KAZE_OWL-package_huge_mix_of_1-grams.7z 198 MB (207,752,676 bytes).
Note1: OWL is an open-source and free English spell-checker with ranking (not yet, currently a word-explorer only).
Note2: OWL uses 20 wordlists with total 20,761,385 distinct words.
For short overview: OWL-package_Longest-word-in-English.pdf
For reference: http://www.sanmayce.com/Downloads/index.html#OWL
Click on image above to download the package OWL-package_revision_A.exe.zip 92.2 MB (96,709,454 bytes).
Note: This is an NSIS Windows installer; it needs 100MB of free disk space.
Note1: A screenshot showing OWL package r.A in action under Windows 7.
Note2: Wanna enrich OWL corpus (currently made of 20 wordlists) by sharing another HQ one? Feel free to write me at: sanmayce@sanmayce.com, perhaps adding some dozens of HQ wordlists will enhance the rank depth.
Latest OWL corpus
This mix/compilation of 23 English wordlists is made with one purpose only: to house all English words under one roof.
- What is the name of this corpus of corpora?
- OWL.
- What are the terms of use?
- NONE. It is 100% free package.
- What is the definition for 'word' there?
- Words (in OWL corpus) are clusters of a-z letters (26 in total) with length from 1 to 31 inclusive. No symbols other than alphabetical are allowed.
- How many words are there?
- The total number of distinct words/lines is 21,082,463 and the longest line is 42.
- What is the size of OWL corpus?
- 469,105,213 bytes _KAZE_huge_mix_of_1-grams.occ-wrd.sorted.txt
- What role do the numbers preceding the words play?
- They say how many corpora contain the word, e.g. 0,000,003 means 3. The higher the number the greater the rank i.e. the bigger the rank the better.
- What wordlists are in use?
- The following 23 wordlists:
016,720,903 bytes; 01,607,640 words/lines: keithv_com_wlist_match1.wrd.sorted
004,801,669 bytes; 00,427,397 words/lines: OWW.wrd.sorted
005,502,365 bytes; 00,458,970 words/lines: RIDYHEW_The_RIDiculouslY_Huge_English_Wordlist.wrd.sorted
005,645,902 bytes; 00,514,105 words/lines: WORDLIST_source_18_various_wordlists.wrd.sorted
000,720,733 bytes; 00,075,801 words/lines: Dictionary of American English.pdf.wrd.sorted
000,182,603 bytes; 00,019,859 words/lines: Dictionary of American Idioms and Phrasal Verbs.pdf.wrd.sorted
000,215,579 bytes; 00,023,128 words/lines: Dictionary of Contemporary Slang.pdf.wrd.sorted
000,889,414 bytes; 00,087,466 words/lines: EuroDict XP 3.0 _ MacroMagic41r_r02_DOS.wrd.sorted
001,779,419 bytes; 00,174,978 words/lines: HERITAGE.wrd.sorted
000,355,146 bytes; 00,038,917 words/lines: dictionary of historical slang.pdf.wrd.sorted
000,398,554 bytes; 00,043,749 words/lines: Longman Dictionary of American English, Special Edition.pdf.wrd.sorted
000,695,541 bytes; 00,065,316 words/lines: mthesaur.wrd.sorted
000,233,582 bytes; 00,024,435 words/lines: OXFORD Collocations Dictionary.wrd.sorted
000,268,457 bytes; 00,029,733 words/lines: The Oxford Dictionary of Slang.wrd.sorted
000,388,308 bytes; 00,038,936 words/lines: The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd.sorted
002,651,685 bytes; 00,260,733 words/lines: SOED.wrd.sorted
000,411,462 bytes; 00,044,668 words/lines: The Routledge Dictionary of Modern American Slang.pdf.wrd.sorted
000,333,541 bytes; 00,034,773 words/lines: Websters-Dictionary-of-English-Usage.pdf.wrd.sorted
000,384,499 bytes; 00,038,676 words/lines: Webster's New Dictionary of Synonyms (1984).pdf.wrd.sorted
000,740,179 bytes; 00,074,993 words/lines: RHW_mpron.wrd.sorted
146,465,487 bytes; 12,475,645 words/lines: enwiki-20120403-pages-articles.wrd.sorted
115,494,856 bytes; 09,181,275 words/lines: _Gamera_r15.wrd.sorted
046,515,064 bytes; 04,434,936 words/lines: googlebooks-eng-all-1gram-20090715.wrd.sorted
- How different are the wordlists from one another?
- They cover very different contexts/areas, thus a lot. The numbers of words with rank 3[+]/4[+]/5[+]/6[+]/7[+]/8[+] are 2,097,527/1,058,261/519,140/333,057/229,834/163,372 respectively.
The file format is ASCII text:
D:\_KAZE_OWL-package_huge_mix_of_1-grams>type _KAZE_huge_mix_of_1-grams.occ-wrd.sorted.txt|more
0,000,023 zone
0,000,023 zip
0,000,023 youth
0,000,023 yourself
0,000,023 yours
0,000,023 your
0,000,023 younger
0,000,023 young
0,000,023 you
0,000,023 york
0,000,023 yield
0,000,023 yet
0,000,023 yesterday
0,000,023 yes
0,000,023 yellow
0,000,023 year
0,000,023 yard
0,000,023 yank
0,000,023 wrongdoing
0,000,023 wrong
0,000,023 written
0,000,023 writing
0,000,023 writer
0,000,023 write
0,000,023 wrinkle
0,000,023 wreck
0,000,023 wrapped
0,000,023 wrap
0,000,023 wound
0,000,023 would
0,000,023 worthy
0,000,023 worthless
0,000,023 worth
0,000,023 worst
0,000,023 worse
0,000,023 worry
0,000,023 worried
0,000,023 worn
0,000,023 worm
0,000,023 world
0,000,023 works
0,000,023 working
0,000,023 worker
0,000,023 worked
0,000,023 work
0,000,023 word
0,000,023 wool
0,000,023 wooden
0,000,023 wood
0,000,023 wonderful
...
0,000,012 ingeminate
0,000,012 ingathering
0,000,012 infrequence
0,000,012 infrangible
0,000,012 infract
0,000,012 infotainment
0,000,012 informers
0,000,012 infomercial
0,000,012 infold
0,000,012 influenceable
0,000,012 inflorescence
0,000,012 inflicts
0,000,012 inflectional
0,000,012 inflammability
0,000,012 infinitude
0,000,012 infiltrated
0,000,012 infighting
0,000,012 infelicitously
0,000,012 infecund
0,000,012 infectivity
0,000,012 infectiously
0,000,012 infarction
0,000,012 infantine
0,000,012 infanta
0,000,012 infallibly
0,000,012 inextensible
0,000,012 inexorability
0,000,012 inexhaustibly
0,000,012 inevitableness
0,000,012 inestimably
...
0,000,011 goethe
0,000,011 goeth
0,000,011 godwin
0,000,011 godling
0,000,011 godlikeness
0,000,011 goddaughter
0,000,011 godawful
0,000,011 gobsmacked
0,000,011 goblets
0,000,011 gobelin
0,000,011 gobby
0,000,011 gobbled
0,000,011 goatsucker
0,000,011 goatlike
0,000,011 goatfish
0,000,011 goalpost
0,000,011 goalless
0,000,011 gnosis
0,000,011 gnomonic
0,000,011 gnaws
0,000,011 gnawn
...
- How about new revisions i.e. additional wordlists?
- It is up to you, feel free to email me (at sanmayce@sanmayce.com) your Rich High-Quality wordlist.
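The ranking rule from the Q&A above (rank = number of wordlists containing the word, words being 1 to 31 letters a-z) can be sketched as below. This is an illustration under assumptions, not the script used to build the OWL corpus: lowercasing the input is my assumption, and the names rank_words, rank_prefix and dump_lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

WORD = re.compile(r"[a-z]{1,31}")  # the 'word' definition given in the Q&A


def rank_words(wordlists: Iterable[Iterable[str]]) -> Counter:
    """Rank of a word = number of wordlists that contain it at least once."""
    ranks: Counter = Counter()
    for wordlist in wordlists:
        distinct = {word.strip().lower() for word in wordlist}  # lowercasing is an assumption
        ranks.update(word for word in distinct if WORD.fullmatch(word))
    return ranks


def rank_prefix(n: int) -> str:
    """Format a rank as in the dump above, e.g. 23 -> '0,000,023'."""
    digits = f"{n:07d}"
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    return ",".join(reversed(groups))


def dump_lines(ranks: Counter) -> List[str]:
    """Lines sorted in descending order, matching the sample (zone, zip, youth, ...)."""
    return sorted((f"{rank_prefix(rank)} {word}" for word, rank in ranks.items()), reverse=True)


if __name__ == "__main__":
    lists = [["zone", "zip"], ["zone", "Zip", "o'clock"], ["zone"]]
    print("\n".join(dump_lines(rank_words(lists))))
```

Because the rank prefix has a fixed width, sorting the finished lines in reverse order puts the highest ranks first.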
Enjoy!
2012 Aug 31, Kaze
The ZIPed ASCII text: _KAZE_huge_mix_of_1-grams.occ-wrd.sorted.txt.zip 79.3 MB (83,192,555 bytes), click here.
For reference: http://www.sanmayce.com/Downloads/index.html#RANKEDENGLISHWORDS
Note: Only 'on' and 'down' are included so far, the PDF includes also 384 prepositional words/phrases which I saw here-and-there.
For reference: http://www.sanmayce.com/Downloads/index.html#IDIOM
Note: KAZE_English_phrase_list_r2-.pdf 4.59 MB (4,815,489 bytes), click here.
For reference: http://www.sanmayce.com/Downloads/index.html#IDIOMATIC
Note: LZ_predator is a console prompt (a command line tool) text context dumper - it/he slices for you citations/excerpts from a given LZMM (a compressed text) file. In short: Contexts/Citations/Excerpts Maker/Dumper/Slicer.
Package: LZ_predator_r3+_Windows_Linux.zip 75.1 MB (78,811,387 bytes), click here.
Log: console_LOG.txt 10.4 KB (10,657 bytes), click here.
Dump1: OSHO_contexts_Lao.txt 2.83 MB (2,968,503 bytes), click here.
Dump2: OSHO_contexts_Tao.txt 1.34 MB (1,409,278 bytes), click here.
Three slices/contexts:
Context #0,000,000,016 (680bytes or less long) holding the 'Tao' pattern found at line #0,000,023,546:
[...ns: you have accumulated much rubbish and when you meditate that rubbish starts
disappearing, falling away.
AND I FEEL MYSELF A STUPID CHILD.
That is the way, the way to the kingdom of God. Lao Tzu says, 'Be like an idiot in this world so
that you can understand the illogical ways of Tao.' Jesus says, 'Be like a child -- because only those
who are like children will be able to enter into the kingdom of God.' Don't be worried about those
things; the non-essential is dropping away. Feel happy and grateful. Once the rubbish has dropped,
the real will arise; non-essential gone, the esse...] /OSHO.TXT (197MB) discourses/
Context #0,000,000,017 (680bytes or less long) holding the 'Tao' pattern found at line #0,000,028,892:
[... world, means the wheel. It moves in
the same groove. You come and go, and you do much -- to no avail. Where do you miss? You miss
in the first step.
The nature of the mind is repetition, and the nature of life is no repetition. Life is always new,
ALWAYS. Newness is the nature of life, Tao; nothing is old, cannot be. Life never repeats, it simply
becomes new every day, new every moment -- and mind is old; hence mind and life never meet.
Mind simply repeats, life never repeats -- how can mind and life meet? That's why philosophy never
understands life.
The whole effort ...] /OSHO.TXT (197MB) discourses/
Context #0,000,000,020 (680bytes or less long) holding the 'Tao' pattern found at line #0,000,030,115:
[...upt in the market -- and that fragile root is broken. Then
you go on wandering and wandering; then there is no coming back, then you never touch reality.
This is the state of the madman, and the normal man is different only in degree.
And what is the state of a buddha, an enlightened man, a man of Tao, of understanding,
awareness? He is deeply rooted in reality, he never wanders from it -- just the opposite of a madman.
You are in the middle. From that middle either you can move towards being a madman or you
can move towards being a buddha. It is up to you. Don't give much energy to thoughts...] /OSHO.TXT (197MB) discourses/
For reference: http://www.sanmayce.com/Downloads/index.html#OSHO
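The slicing shown in the three contexts above can be sketched as follows. This is a rough illustration, assuming the text is already decompressed into memory, that a context is a window of at most 680 bytes roughly centered on the match, and that line numbers are 1-based; the name contexts is hypothetical and the LZMM decompression step is left out entirely.

```python
from typing import Iterator, Tuple


def contexts(text: bytes, pattern: bytes, max_len: int = 680) -> Iterator[Tuple[int, bytes]]:
    """Yield (line number, excerpt) for every occurrence of pattern in text."""
    if not pattern or len(pattern) > max_len:
        raise ValueError("pattern must be non-empty and no longer than max_len")
    half = (max_len - len(pattern)) // 2
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos < 0:
            return
        lo = max(0, pos - half)
        hi = min(len(text), lo + max_len)
        line_no = text.count(b"\n", 0, pos) + 1  # 1-based line holding the match
        yield line_no, text[lo:hi]
        start = pos + len(pattern)


if __name__ == "__main__":
    sample = b"alpha\nbeta Tao gamma\ndelta Tao\n"
    for number, (line_no, excerpt) in enumerate(contexts(sample, b"Tao", max_len=9), start=1):
        print(f"Context #{number} holding the 'Tao' pattern found at line #{line_no}: [...{excerpt!r}...]")
```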
Note1: Galadriel is a console (command line) fuzzy text dumper: it/she suggests strings/lines similar to your string by searching a given text file. In short: a fuzzy line dumper. The current revision (2) uses 16 threads for both parsing and searching.
Note2: _Kaze_Levenshtein_Galadriel.zip 172 MB (180,495,915 bytes), click here.
Note3: I started a thread in order to benchmark it, titled: FAST 'on the fly' fuzzy string matching console tool written in C.
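For readers unfamiliar with the Wagner-Fischer algorithm that Galadriel names in its banner, here is a minimal single-threaded sketch of the Levenshtein distance plus a naive line filter. It is not Galadriel's code: the suggest function and its max_distance parameter are hypothetical, and no claim is made about what Galadriel's own command line arguments mean.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the row axis
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest(query: str, lines: Iterable[str], max_distance: int) -> List[str]:
    """Return the lines within max_distance edits of query, in input order."""
    return [line for line in lines if levenshtein(query, line) <= max_distance]


if __name__ == "__main__":
    print(suggest("psychedelized", ["psychedelic", "psychedelized", "zone"], 3))
```

The cost is proportional to len(a) * len(b) per compared line, which is why scanning tens of millions of x-grams benefits from the multi-threading mentioned in Note1.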
// Test on my 'Bonboniera' laptop T7500 2200MHz, 2/2 cores/threads, 2x2GB dual channel DDR2 667MHz, Windows 7 64bit:
E:\_Kaze_Levenshtein_Galadriel>dir/og/oe
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3
Directory of E:\_Kaze_Levenshtein_Galadriel
01/28/2013 07:51 PM <DIR> .
01/28/2013 07:51 PM <DIR> ..
01/28/2013 05:09 AM <DIR> Galadriel_logo
01/28/2013 07:52 PM 187 Galadriel_compile_Intel.bat
01/28/2013 07:52 PM 2,572 TESTbigrams.bat
01/28/2013 07:52 PM 26 makeEXE.bat
01/28/2013 07:52 PM 79,620,218 4andabove_Gamera.tar.2.sorted.bsc
01/28/2013 07:52 PM 26,593 Galadriel.c
01/28/2013 07:52 PM 87,524 Galadriel_r2-.c
01/28/2013 07:52 PM 744,167 Galadriel_r2-.cod
01/28/2013 07:52 PM 61,305 Galadriel.cod
01/28/2013 07:52 PM 387,072 Galadriel_r2-_HEXADECAD-Threads_IntelV12_32bit.exe
01/28/2013 07:52 PM 459,776 Galadriel_r2-_HEXADECAD-Threads_IntelV12_64bit.exe
01/28/2013 07:52 PM 90,112 Galadriel_r2-_MONAD-Thread_IntelV12_32bit.exe
01/28/2013 07:52 PM 100,352 Galadriel_r2-_MONAD-Thread_IntelV12_64bit.exe
01/28/2013 07:52 PM 58,880 Galadriel.exe
01/28/2013 07:52 PM 598,528 GRAFFITH_r2++_Graphein_2.3.0_Intel_12.1_32bit.exe
01/28/2013 07:52 PM 4,096 Timer.exe
01/28/2013 07:52 PM 1,566 KAZE prompt.lnk
01/28/2013 07:52 PM 889,537,624 4andabove_Gamera.tar.2.sorted
01/28/2013 07:52 PM 722 README.txt
01/28/2013 07:52 PM 3,869,529 MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
01/28/2013 07:52 PM 53,460,640 googlebooks-eng-all-1gram-20120701_5038456_words.wrd
20 File(s) 1,029,111,489 bytes
3 Dir(s) 19,616,112,640 bytes free
E:\_Kaze_Levenshtein_Galadriel>Galadriel.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-21.
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 2,065,650 xgrams/s
E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_MONAD-Thread_IntelV12_32bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 48 KB/clock
Galadriel: Performance: 1,974 xgrams/clock
Kernel Time = 0.655 = 3%
User Time = 17.222 = 96%
Process Time = 17.877 = 99%
Global Time = 17.915 = 100%
E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_HEXADECAD-Threads_IntelV12_32bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 73 KB/clock
Galadriel: Performance: 2,989 xgrams/clock
Kernel Time = 1.357 = 11%
User Time = 22.058 = 184%
Process Time = 23.415 = 195%
Global Time = 11.948 = 100%
E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_MONAD-Thread_IntelV12_64bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 54 KB/clock
Galadriel: Performance: 2,202 xgrams/clock
Kernel Time = 0.468 = 2%
User Time = 15.631 = 96%
Process Time = 16.099 = 99%
Global Time = 16.124 = 100%
E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_HEXADECAD-Threads_IntelV12_64bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 81 KB/clock
Galadriel: Performance: 3,281 xgrams/clock
Kernel Time = 1.294 = 11%
User Time = 20.841 = 178%
Process Time = 22.136 = 189%
Global Time = 11.683 = 100%
E:\_Kaze_Levenshtein_Galadriel>
Okay, I expected 2 threads to offer much more than they did: the gain over a single thread is only (2,989-1,974)/1,974*100=51% for 32bit code and (3,281-2,202)/2,202*100=49% for 64bit code. A few lines remain unoptimized...
Please send me your results at sanmayce@sanmayce.com; it is quite a benchmark - who can run it in its 'native' 16 threads mode?
For reference: http://www.sanmayce.com/Downloads/index.html#GALADRIEL
Note1: Kazahana is a console (command-line) tool, a mix of GRAFFITH and GALADRIEL: an exact, wildcard and fuzzy text searcher/dumper. It dumps the strings/lines that match or resemble your string in a given text file. In short: Fast line Dumper. The current revision (1-+) uses 16 threads for both parsing and searching.
Note2: Download package: _Kaze_Kazahana.zip 97 MB (102,184,620 bytes).
Note3: I started a discussion thread in order to benchmark it, titled: FAST 'on the fly' fuzzy string matching console tool written in C.
// Test on my 'Bonboniera' laptop T7500 2200MHz, 2/2 cores/threads, 2x2GB dual channel DDR2 667MHz, Windows 7 64bit:
E:\_Kaze_Kazahana>dir 4andabove_Gamera.tar.2.sorted
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3
Directory of E:\_Kaze_Kazahana
02/07/2013 12:14 AM 889,537,624 4andabove_Gamera.tar.2.sorted
2 File(s) 43,043,184,331 bytes
0 Dir(s) 14,405,668,864 bytes free
E:\_Kaze_Kazahana>timer "Kazahana_r1-+_HEXADECAD-Threads_IntelV12.exe" ramjet 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-+, copyleft Kaze 2013-Feb-06.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 7MB ... OK
|; 00,000,195,583 bytes/clock
Kazahana: Total/Checked/Dumped xgrams: 35,116,064/35,116,064/49
Kazahana: Performance: 189 KB/clock
Kazahana: Performance: 7,653 xgrams/clock
Kazahana: Done.
Kernel Time = 0.967 = 19%
User Time = 8.049 = 166%
Process Time = 9.016 = 186%
Global Time = 4.844 = 100%
E:\_Kaze_Kazahana>timer grep\grep ramjet 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
0,000,083 bussard_ramjet
0,000,051 the_ramjet
0,000,048 the_ramjets
0,000,046 a_ramjet
0,000,031 a_scramjet
0,000,027 the_scramjet
0,000,026 bussard_ramjets
0,000,018 interstellar_ramjet
0,000,014 ramjet_engine
0,000,012 scramjet_powered
0,000,012 ramjet_is
0,000,011 scramjet_engines
0,000,011 scramjet_engine
0,000,011 ramjet_engines
0,000,010 ramjets_were
0,000,010 combustion_ramjet
0,000,009 ramjet_and
0,000,008 ramjet_controls
0,000,007 combustion_ramjets
0,000,006 water_ramjet
0,000,006 scramjet_technology
0,000,006 ramjets_on
0,000,006 ramjet_will
0,000,006 ramjet_speeds
0,000,006 ramjet_ship
0,000,006 ramjet_rocket
0,000,006 ramjet_in
0,000,006 mode_scramjet
0,000,005 scramjets_can
0,000,005 ramjet_to
0,000,005 ramjet_scramjet
0,000,005 ramjet_operation
0,000,005 of_scramjets
0,000,005 of_scramjet
0,000,005 of_ramjets
0,000,005 by_ramjets
0,000,005 and_ramjets
0,000,005 and_ramjet
0,000,004 scramjet_to
0,000,004 scramjet_s
0,000,004 scramjet_is
0,000,004 scramjet_intake
0,000,004 ramjet_was
0,000,004 ramjet_a
0,000,004 raking_ramjets
0,000,004 or_scramjet
0,000,004 expander_ramjets
0,000,004 ejector_ramjet
0,000,004 a_turboramjet
Kernel Time = 0.483 = 9%
User Time = 4.368 = 86%
Process Time = 4.851 = 95%
Global Time = 5.062 = 100%
E:\_Kaze_Kazahana>grep\grep.exe -V
GNU grep 2.5.4
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
E:\_Kaze_Kazahana>Kazahana_r1-+_HEXADECAD-Threads_IntelV12
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-+, copyleft Kaze 2013-Feb-06.
Usage: Kazahana [AtMostLevenshteinDistance] string textualfile
Note1: There are three regimes: exact, wildcards and fuzzy searches. First two kick in when 2 parameters are given, fuzzy when 3.
Note2: What decides whether exact or wildcards? Of course presence of at least one wildcard. To see exact search see Example #4.
Note3: Exact search hits with 'Railgun_Quadruplet_7'.
Note4: Incoming string is automatically lowercased for exact and wildcards searches i.e. they both are case insensitive.
Note5: Incoming string could be up to 21168/126 chars for exact&wildcards/Levenshtein respectively.
Note6: Incoming textualfile could be bigger than 4GB.
Note7: Each line should end with [CR]LF, that is Windows or/and UNIX style.
Note8: The dump goes to Kazahana.txt file.
Note9: Seven wildcards are available:
wildcard '*' any character(s) or empty,
wildcard '@'/'#' any character {or empty}/{and not empty},
wildcard '^'/'$' any ALPHA character {or empty}/{and not empty},
wildcard '|'/'~' any NON-ALPHA character {or empty}/{and not empty}.
Example1: E:\>Kazahana 0 ramjet MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example2: E:\>Kazahana 3 psychedlicize MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example3: E:\>Kazahana "psyched^^^^^^ize^" MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example4: E:\>Kazahana "metal fatigue" enwiki-20121201-pages-articles.xml
Example5: E:\>Kazahana "out^^^^^^^^^^^^^ize*" MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
E:\>type Kazahana.txt
[out^^^^^^^^^^^^^ize*] outhyperbolize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
[out^^^^^^^^^^^^^ize*] outsize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
[out^^^^^^^^^^^^^ize*] outsized /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
[out^^^^^^^^^^^^^ize*] outstrategize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
[out^^^^^^^^^^^^^ize*] outtyrannize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
E:\_Kaze_Kazahana>
As the Timer output above shows, just two threads are enough to outspeed 'grep' (4.844 s against 5.062 s of Global Time). The data was cached, though; for non-cached data it takes more threads.
Please send me your results at sanmayce@sanmayce.com; it is quite a benchmark - who can run it in its 'native' 16 threads mode?
For reference: http://www.sanmayce.com/Downloads/index.html#KAZAHANA