Monday, December 24, 2012

sftype (super fast type) - type speed training/test game

I quickly put this program together in Game Maker 8.1. It helps you measure and improve your typing speed. My best so far is about 73 WPM.
The source code is included.

For improving your typing speed also check out:

Friday, August 10, 2012

How to monitor a directory for changes with ReadDirectoryChangesW

Here's how to use ReadDirectoryChangesW without knowing all of this and without using this overkill C++ class library. To make the call blocking instead of async, pass NULL instead of &o. Note: This only returns one change; you have to step through a buffer of FILE_NOTIFY_INFORMATION records if you need to get more:

HANDLE hDirectory;
union {
} fni;

int checkAutoCompile() {
    DWORD b;

    if (!hDirectory) {
        hDirectory = CreateFile(".", 
        o.hEvent = CreateEvent(0,0,0,0);
    }
    if (fni.i.Action != 0) {
        wprintf(L"action %d, b: %d, %s\n", fni.i.Action, b, fni.i.FileName);
        fni.i.Action = 0;
    }

    return 0;
}
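
Stepping through a buffer that holds more than one change could look roughly like the sketch below. It is portable on purpose: NotifyHeader is a stand-in I made up for the fixed part of FILE_NOTIFY_INFORMATION (three 32-bit fields followed by the UTF-16 file name, whose length is given in bytes and which is not null-terminated). parse_changes, Change and append_record are hypothetical names too. On Windows you would cast to the real struct instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in (assumption) for the fixed-size head of FILE_NOTIFY_INFORMATION.
struct NotifyHeader {
    std::uint32_t NextEntryOffset; // bytes from this record to the next, 0 for the last
    std::uint32_t Action;          // e.g. 1 = added, 2 = removed, 3 = modified
    std::uint32_t FileNameLength;  // length of the name in BYTES, not characters
};

struct Change {
    std::uint32_t action;
    std::u16string name;
};

// Walks every record in a buffer as filled in by ReadDirectoryChangesW.
inline std::vector<Change> parse_changes(const unsigned char* buf, std::size_t size)
{
    assert(buf != nullptr || size == 0);
    std::vector<Change> out;
    std::size_t pos = 0;
    while (size - pos >= sizeof(NotifyHeader)) {
        NotifyHeader h;
        std::memcpy(&h, buf + pos, sizeof h);
        const std::size_t nameStart = pos + sizeof h;
        if (h.FileNameLength > size - nameStart) break; // truncated record

        Change c;
        c.action = h.Action;
        c.name.assign(h.FileNameLength / 2, u'\0');
        if (!c.name.empty())
            std::memcpy(&c.name[0], buf + nameStart, c.name.size() * 2);
        out.push_back(c);

        // 0 marks the last record; anything pointing past the buffer is bogus.
        if (h.NextEntryOffset == 0 || h.NextEntryOffset > size - pos) break;
        pos += h.NextEntryOffset;
    }
    return out;
}

// Helper for trying this out without Windows: appends one record to buf.
inline void append_record(std::vector<unsigned char>& buf, std::uint32_t action,
                          const std::u16string& name, bool last)
{
    NotifyHeader h;
    h.Action = action;
    h.FileNameLength = static_cast<std::uint32_t>(name.size() * 2);
    std::size_t len = sizeof h + h.FileNameLength;
    len = (len + 3) & ~static_cast<std::size_t>(3); // records are DWORD-aligned
    h.NextEntryOffset = last ? 0 : static_cast<std::uint32_t>(len);

    const std::size_t start = buf.size();
    buf.resize(start + len, 0);
    std::memcpy(&buf[start], &h, sizeof h);
    if (!name.empty())
        std::memcpy(&buf[start + sizeof h], name.data(), h.FileNameLength);
}
```

The important part is the loop: advance by NextEntryOffset until it is 0, and take the name length from FileNameLength instead of looking for a terminator.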

Friday, July 20, 2012

Extended Terrain XYZ now free and open source

This is a 3D Terrain Editor I created back in 2007 with Dark BASIC Pro (free now, anybody should be able to compile this). Along with standard heightmap terrain editor features, it also allows shifting the vertices to create simple overhangs and caves.


Download (7.8 MB) - compiled executable and instructions for compiling included (ReadMe.txt) (26.5 MB) (2.1 MB) - you need these if you want to build from source

Old Information: link

Saturday, June 23, 2012

Simple Multitouch on Windows

Quickly coded a DLL (packaged as a Game Maker extension here, but it can be used with any HWND) that gives a simple interface to multitouch handling on Windows.

All source files included.

Friday, June 15, 2012

NVIDIA GPU-accelerated Path / SVG Vector Graphics Rendering

This definitely needs attention:

(video with link without slides:
Try viewing, for example, this SVG image in your browser; it will definitely kill it. This particular image renders at 35 FPS on my GPU (GTX 260) in NVIDIA's SVG viewer, which uses their quite new (June 2011) GL_NV_path_rendering OpenGL extension. Remarkable.
The GPU implementation doesn't implement the filters - blurring in this example - (neither does IE). I'll try to add these, I don't think they'll slow it down too much.


Note: Many browsers claim to have GPU accelerated rendering, but this mostly only applies to the final composition of different elements (though some do have GPU text rendering).

Edit: After looking at the specification and some samples, I see that this is actually pretty complex: You can define various intermediate targets as well as inputs and outputs for filters. The easiest way to do this would of course be to render every step to a texture, then apply a shader that represents the predefined effect (one of just a handful) and repeat. However, this can be optimized, e.g. a blur and an offset pass can be done in one step. Applying such optimizations will require smart runtime shader compilation... let's see what we can do. It would also be interesting to implement the path rendering with OpenGL and shaders in WebGL to provide an alternative to the current browsers' software SVG rendering.
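
To show what I mean by doing a blur and an offset pass in one step, here is a small CPU stand-in, not shader code. It works on a single row of one-channel pixels and uses a box blur, which is my simplification. All names (Row, offset_pass, blur_pass, blur_offset_fused) are made up for this sketch. The fused version folds the offset into the blur's sample position and so never needs the intermediate buffer.

```cpp
#include <cassert>
#include <vector>

typedef std::vector<int> Row; // one row of single-channel pixels

// Reads a pixel; everything outside the row counts as transparent black (0).
inline int at(const Row& r, int i)
{
    return (i >= 0 && i < static_cast<int>(r.size())) ? r[i] : 0;
}

// Naive chain, pass 1: shift the row by dx pixels into a new buffer.
inline Row offset_pass(const Row& in, int dx)
{
    Row out(in.size());
    for (int i = 0; i < static_cast<int>(in.size()); ++i)
        out[i] = at(in, i - dx);
    return out;
}

// Naive chain, pass 2: box blur with the given radius.
inline Row blur_pass(const Row& in, int radius)
{
    assert(radius >= 0);
    Row out(in.size());
    for (int i = 0; i < static_cast<int>(in.size()); ++i) {
        int sum = 0;
        for (int k = -radius; k <= radius; ++k)
            sum += at(in, i + k);
        out[i] = sum / (2 * radius + 1);
    }
    return out;
}

// Fused pass: same result as blur_pass(offset_pass(in, dx), radius),
// but the source is sampled directly at the shifted position.
inline Row blur_offset_fused(const Row& in, int radius, int dx)
{
    assert(radius >= 0);
    const int n = static_cast<int>(in.size());
    Row out(in.size());
    for (int i = 0; i < n; ++i) {
        int sum = 0;
        for (int k = -radius; k <= radius; ++k) {
            const int j = i + k;  // position in the virtual offset buffer
            if (j >= 0 && j < n)  // that buffer would have been clipped to the row
                sum += at(in, j - dx);
        }
        out[i] = sum / (2 * radius + 1);
    }
    return out;
}
```

On the GPU the saving would be one render target and one full-screen pass per fused pair. The clipping check is there because the intermediate texture would have had the same bounds.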

Edit 2: Not even browser/software svg renderer makers bothered to implement all of the specification, I couldn't find a browser or program that supports enable-background and renders this correctly (maybe OpenVG would):

Should look like this:

Another edit: Opera does! Looks like it's 100% compliant.

Tuesday, June 05, 2012

Some Java Bit Functions and easy syntax highlighting of various languages in HTML

Use and for easy syntax highlighting in your blogger posts. (I tried this with that but it doesn't seem to work).

Now the Java stuff (easily portable to C, C++ and many other languages):

    // Returns bits a to b (b >= a) of n as a String in normal order (msb first).
    // The lsb is bit 0, the msb is bit 63.
    public static String getBits(long n, int a, int b) {
        String bs="";
        for (int i = a; i <= b; i++)
            bs = (1 & (n >> i)) + bs;
        return bs;
    }

    // Calculates floored log2 by finding position of highest set bit. 
    // Ignores very highest bit of long (63rd). Returns 0 for n == 0.
    public static int log2(long n) {
        long m = 1L << 62; int i = 62;
        while ((n & m) == 0 && i > 0) {m = m >> 1; i--;}
        return i;
    }

    /**
     * Extracts the n-th b-bit number (n counts from 0). E.g. extractBitDigit(0xff00, 8, 1) == 0xff. Useful for e.g. Radixsort.
     */
    public static long extractBitDigit(long value, int b, int n) {
        long m = (
                ((1L << b) - 1) // Mask (1L, because an int shift overflows for b >= 31)
                << (n*b)); // move to n-th position
        return  (m &
                value) // apply
                >>> (n*b); // shift down (unsigned, so the sign bit is not dragged along)
    }
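
Since I said these port easily, here is roughly what a C++ version could look like. This is a sketch, not a line-by-line translation: it uses unsigned 64-bit values, so all 64 bits are usable and the shifts need no special care. The names get_bits, log2_floor and extract_bit_digit are just my picks for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bits a..b of n (bit 0 = lsb) as text, msb first.
inline std::string get_bits(std::uint64_t n, int a, int b)
{
    assert(0 <= a && a <= b && b <= 63);
    std::string bs;
    for (int i = b; i >= a; --i)
        bs += ((n >> i) & 1) ? '1' : '0';
    return bs;
}

// Floored log2: position of the highest set bit. n must not be 0.
inline int log2_floor(std::uint64_t n)
{
    assert(n != 0);
    int i = 0;
    while (n >>= 1)
        ++i;
    return i;
}

// The n-th b-bit digit of value, counting n from 0 at the low end.
inline std::uint64_t extract_bit_digit(std::uint64_t value, int b, int n)
{
    assert(b >= 1 && b <= 64 && n >= 0 && n * b < 64);
    // Shifting a 64-bit value by 64 is undefined, so handle b == 64 separately.
    const std::uint64_t mask =
        (b == 64) ? ~std::uint64_t(0) : ((std::uint64_t(1) << b) - 1);
    return (value >> (n * b)) & mask;
}
```

For a radix sort with 8-bit digits you would call extract_bit_digit(key, 8, pass) with pass running from 0 to 7.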


Thursday, May 31, 2012

Fastest way to zero out memory - stream past cache with movntdq/_mm_stream_si128

As a rule of thumb, this technique is only beneficial if the buffer is larger than half the largest level cache.
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <intrin.h>

typedef unsigned long long ull;

ull tsc;
int clk;

// Stream 64 bytes to DRAM, bypassing the caches. _p must be 16-byte aligned.
template<typename T>
inline void memstream(T *_p, const __m128i& i)
{
    char* p = (char*)_p;
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

inline void serialize() {int a[4]; __cpuid(a,0);}

inline void starttimer() {clk = clock(); serialize(); tsc = __rdtsc();}

inline void stoptimer(const char* n) {
    serialize(); ull tsc2 = __rdtsc(); tsc = tsc2 - tsc;  clk = clock() - clk;
    printf("%-10s %6i msec, %10I64i clocks, \n", n, clk, tsc); 
}

int main()
{
    ull tsc = __rdtsc();

    ull tsc2 = __rdtsc();

    printf("rdtsc overhead: %I64i ticks\n", tsc2 - tsc);

    // Allocate 1600 MB of RAM (400 M ints of 4 bytes each)
    const int cnt = 1024*1024*400; 
    size_t sz = cnt * sizeof(int);
    int *ary = (int*)malloc(sz+64);
    // Align on 64 bytes (size_t is pointer-sized, so this also works in a 64-bit build)
    size_t p = (size_t)ary; p += 64; p = p & ~(size_t)63; ary = (int*)p;

    // Can set it to anything, limitation is memory bandwidth anyways.
    __m128i zero = {0};

    // Run a few times to get accurate results
    do {
        starttimer();
        for (int i = 0; i < cnt; i++)
            ary[i] = 0;
        stoptimer("ary[i] = 0"); // 900 with msvc, 358 with intel

        starttimer();
        for (int i = 0; i < cnt; i += 16)
            memstream(&ary[i], zero);
        stoptimer("memstream"); // 358 with both

        starttimer();
        memset(ary, 0, sz);
        stoptimer("memset"); // 940

    } while(getchar());

    // Ensure array not optimized away
    printf("%i\n", ary[rand()]);
    return 0;
}

Visual C++ (2010):
ary[i] = 0    962 msec, 2309504493 clocks,
memstream     362 msec,  869966541 clocks,
memset        949 msec, 2276475597 clocks,

ary[i] = 0    958 msec, 2298528630 clocks,
memstream     365 msec,  876925611 clocks,
memset        960 msec, 2303561970 clocks,

ary[i] = 0    943 msec, 2263462995 clocks,
memstream     363 msec,  870981457 clocks,
memset        968 msec, 2321940605 clocks,

Generated assembly:
00C71076  xor         eax,eax  
00C71078  mov         ecx,19000000h  
00C7107D  mov         edi,esi  
00C7107F  rep stos    dword ptr es:[edi]  

00C71081  lea         eax,[esi+20h]  
00C71084  mov         ecx,1900000h  
00C71089  lea         esp,[esp]  

00C71090  movntdq     xmmword ptr [eax-20h],xmm0  
00C71095  movntdq     xmmword ptr [eax-10h],xmm0  
00C7109A  movntdq     xmmword ptr [eax],xmm0  
00C7109E  movntdq     xmmword ptr [eax+10h],xmm0  
00C710A3  add         eax,40h  
00C710A6  dec         ecx  
00C710A7  jne         main+90h (0C71090h)  

00C710A9  push        64000000h  
00C710AE  push        ecx  
00C710AF  push        esi  
00C710B0  call        memset (0C75910h) 
Note: MSVC compiles the ary[i] = 0 loop to an inline rep stos (the same code the Intel compiler emits for memset), while its own memset remains a library call.

Intel C++ (for vs 2010)
ary[i] = 0    435 msec, 1044279793 clocks,
memstream     357 msec,  855606474 clocks,
memset        928 msec, 2226664378 clocks,

ary[i] = 0    361 msec,  866360328 clocks,
memstream     371 msec,  890675064 clocks,
memset        933 msec, 2238792639 clocks,

ary[i] = 0    368 msec,  882105732 clocks,
memstream     358 msec,  858516585 clocks,
memset        959 msec, 2301478323 clocks,

Generated assembly:
00071088  movntdq     xmmword ptr [esi+eax*4],xmm0  
0007108D  add         eax,4  
00071090  cmp         eax,19000000h  
00071095  jb          main+88h (71088h)  

00071097  movdqa      xmm0,xmmword ptr [___xt_z+24h (79160h)]  
0007109F  mov         eax,ebx 
000710A1  mov         edx,eax  
000710A3  inc         eax  
000710A4  shl         edx,6  
000710A7  cmp         eax,1900000h  
000710AC  movntdq     xmmword ptr [edx+esi],xmm0  
000710B1  movntdq     xmmword ptr [edx+esi+10h],xmm0  
000710B7  movntdq     xmmword ptr [edx+esi+20h],xmm0  
000710BD  movntdq     xmmword ptr [edx+esi+30h],xmm0  
000710C3  jb          main+0A1h (710A1h)  

000710C5  mov         edi,esi  
000710C7  xor         eax,eax  
000710C9  mov         ecx,19000000h  // = 1024*1024*400
000710CE  rep stos    dword ptr es:[edi]  

Note: There is no call to memset here; it is inlined as rep stos, which the library memset uses as well, so that makes no big difference. The Intel compiler is smart enough to use movntdq (= _mm_stream_si128) by itself for the plain ary[i] = 0 loop.

 rep stos: Fill (E)CX doublewords at ES:[(E)DI] with EAX; (E)DI advances by 4 per store (upwards while the direction flag is clear) and (E)CX counts down to 0.
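
The rule of thumb from the top of this post could be wrapped up like the sketch below: small buffers go through memset, large ones through streaming stores. zero_buffer and kStreamThreshold are names I made up, and the threshold assumes an 8 MB last-level cache, so adjust it for your CPU. Unlike memstream above, it handles unaligned starts and odd sizes, and it simply falls back to memset where SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

// Assumption: half of an 8 MB last-level cache. Tune this per machine.
const std::size_t kStreamThreshold = 4u * 1024 * 1024;

// Zeroes n bytes at dst. Buffers of at least `threshold` bytes are written
// with non-temporal stores (movntdq), everything else with memset.
inline void zero_buffer(void* dst, std::size_t n,
                        std::size_t threshold = kStreamThreshold)
{
    assert(dst != nullptr || n == 0);
#ifdef HAVE_SSE2_STREAM
    if (n >= threshold) {
        unsigned char* p = static_cast<unsigned char*>(dst);

        // Clear the unaligned head normally until p is 16-byte aligned.
        std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(p) & 15)) & 15;
        if (head > n) head = n;
        std::memset(p, 0, head);
        p += head;
        n -= head;

        const __m128i zero = _mm_setzero_si128();
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128(reinterpret_cast<__m128i*>(p), zero);
        _mm_sfence(); // order the streamed stores before any later stores

        std::memset(p, 0, n); // tail of fewer than 16 bytes
        return;
    }
#endif
    std::memset(dst, 0, n);
}
```

Where exactly the break-even point lies is something you have to measure with the timer code above; the default here is only a starting value.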

Tuesday, May 29, 2012

No installers

When publishing software, make it installerless and portable, meaning it shouldn't write its settings to the registry but to a user-specifiable folder (a subfolder of its own folder by default).

Also, when zipping things up, don't put the files directly in the root of the archive, which forces people to use the "extract to folder..." option instead of "extract here". It also stops people from downloading your software to a temporary folder, opening it in a zip viewer, and dragging and dropping it somewhere else, because they first have to create a folder wherever they want to put it.

Instead, create a single folder in the zip and put the files there. Download some of my software to see what I mean.

GnuCalc - A good commandline calculator

Download: Get
(place readline5.dll in GnuCalc bin folder).

For integer only, "set /a 2+2" is good enough.


Monday, May 14, 2012

highgui.h, highgui.dll download

fatal error C1083: Cannot open include file: 'highgui.h': No such file or directory

Just came across some OpenGL Tutorials which depend on OpenCV Lib's Image loading. OpenCV is a huge library with tons of other stuff you won't need for compiling these samples, so I went ahead and repacked just everything you need for compiling and running these samples (only tested simpleGLUT-Texturing). (only vs10x86 binaries and required headers included)

When compiling these samples, make sure to set Linker > General > Output File back to default and remove any input libraries, then add the ones included here (e.g. by just drag and dropping them into the source file area of vs10) and copy the dlls in bin to your project dir.

Thursday, April 05, 2012

Leg'oh - Free Lego Jump and Run Game

Posted another old game I made back in 2005, Leg'oh 1:

Predecessor to Leg'oh 2!:

Btw. when publishing something: Make sure to include an installerless version and put the files not directly into a zip but a single folder, such that "extract here" and drag and drop can be used to extract your program.