Threads, Vertex Dispatch Performance and the Vertex Array Range Extension on SGI Linux Visual Workstations 230, 330, 550

By Jaya Kanajan, SGI

H1. Introduction

In a previous article, I mentioned multi threaded renderers and dispatch performance:

"Dispatch performance limits the rate at which geometry can be sent to the graphics subsystem. This can be affected by the driver implementation, processor speed, sustained bus transfer speed, or the graphics hardware. The current implementation is highly optimized for single threaded renderers. This means that in order to achieve the maximum performance in both immediate mode and display list mode, one must only call glXMakeCurrent from a single thread. Multi-processed renderers are fine and do not incur any performance penalty. Multi threaded renderers will drop to a slower dispatch path and can be up to 6 times slower in immediate mode dispatch. The reason for this is due to the nature of thread specific storage on Linux."

In this article, I'll suggest simple methods to improve multi threaded rendering performance.

Multi threaded renderers dispatch more slowly because of the overhead of thread identification, which is paid on every dispatch. The remedy is to reduce the number of dispatches. For example, rather than calling glVertex3f 10 million times, it can be preferable to gather all vertices together in a vertex array and call glDrawArrays once. This is much more efficient because thread identification then happens only once per glDrawArrays call. Using this technique, a multi threaded renderer can avoid the penalty of thread identification.

H1. Details

On the 230, 330, and 550 Linux Visual Workstations, the optimal scheme for fast dispatch of large numbers of vertices is the vertex array range extension. This allows the OpenGL driver to provide the application with an optimal allocator for vertices.
The application can also give the driver hints about the priority the allocator should assign to vertex buffers relative to textures, and about how frequently the buffer will be read or written. The vertex array range extension introduces constraints on vertex buffer size and index range, while relaxing sequentiality requirements. This allows the OpenGL driver to optimize the transfer of vertex data to the graphics hardware.

H1. Conclusions

Developers seeking high performance for their OpenGL applications should consider multi processed renderers as a safe choice. If necessary, multi threaded renderers are also a viable alternative, provided they take advantage of OpenGL driver features such as the vertex array range extension.

There is also interest in hardware evaluation of higher order surfaces such as NURBS. I am interested in ISV and developer feedback on this issue. Would you prefer to represent your objects using higher order surfaces and have OpenGL take on the task of vertex evaluation/generation?

H1. Example Application

The following example is a multi threaded renderer. One context/thread renders to the left half of a window using immediate mode, while the other context/thread renders to the right half using the vertex array range extension. On a quiescent 933 MHz 550 with VR7 graphics, the left half renders at 7 million triangles per second, compared with 18 million triangles per second for the right half.
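The listing below prints vertices per second, while the rates quoted above are in triangles per second. One way to relate the two is the fact that a triangle strip of n vertices contains n - 2 triangles. The helper names here (strip_triangles, triangles_per_second) are hypothetical and do not appear in the listing; this is a minimal sketch of the arithmetic, not necessarily how the quoted figures were derived.

```c
/* A triangle strip of nverts vertices contains nverts - 2 triangles
 * (and none at all when there are fewer than 3 vertices). */
long strip_triangles(long nverts)
{
    return nverts < 3 ? 0 : nverts - 2;
}

/* Triangle rate for nstrips strips of nverts vertices each,
 * all drawn within `seconds` of wall-clock time. */
double triangles_per_second(long nverts, long nstrips, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)strip_triangles(nverts) * (double)nstrips / seconds;
}
```

For strips as long as the ones used in the listing (tens of thousands of vertices), the triangle rate and the vertex rate are nearly identical, since the two missing triangles per strip are negligible.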
multithreadedrenderer.c
===============================================================================
/*
 * jaya@sgi.com
 * vertex dispatch odometer for multiple threads
 */
/* expose extension prototypes where the GL headers gate them behind this macro */
#define GL_GLEXT_PROTOTYPES 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <pthread.h>
#include <X11/Xlib.h>
#include <GL/glx.h>

#if 1 /* until this gets into glxext.h */
void *glXAllocateMemoryNV(GLsizei size, GLfloat readfreq, GLfloat writefreq, GLfloat priority);
#endif

static int fincount = 0;        /* number of finished render threads, guarded by mtx */
GLXContext ctxt1, ctxt2;
Display *dpy;
GLuint w = 300, h = 300, x = 0, y = 0;
Window win;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
float *vbuffer;
GLint maxvert, numiter;

/* wall-clock time in nanoseconds */
long long now(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000000LL + (long long)tv.tv_usec * 1000LL;
}

/*
 * Fill vbuf with 2*number 2D vertices that alternate between a top edge
 * running from (a,b) towards (c,d) and a bottom edge running from (e,f)
 * towards (g,h), suitable for drawing as a single triangle strip.
 */
void horizstrip(int a, int b, int c, int d, int e, int f, int g, int h,
                int number, float *vbuf)
{
    double topdivlen, botdivlen, topdroprate, botdroprate;
    int i;

    topdivlen   = (double)(c - a) / number;
    botdivlen   = (double)(g - e) / number;
    topdroprate = (double)(d - b) / number;
    botdroprate = (double)(h - f) / number;
    for (i = 0; i < number; i++) {
        *(vbuf++) = (float)((double)a + (double)i * topdivlen);
        *(vbuf++) = (float)((double)b + (double)i * topdroprate);
        *(vbuf++) = (float)((double)e + (double)i * botdivlen);
        *(vbuf++) = (float)((double)f + (double)i * botdroprate);
    }
}

XVisualInfo *getVisual(Display *dpy)
{
    int attribs[] = { GLX_RGBA, None };

    return glXChooseVisual(dpy, DefaultScreen(dpy), attribs);
}

Window makewindow(Display *dpy, int w, int h, int x, int y, XVisualInfo *vis)
{
    XSetWindowAttributes winattribs;
    Colormap cmap;
    Window win;

    cmap = XCreateColormap(dpy, RootWindow(dpy, vis->screen), vis->visual, AllocNone);
    winattribs.colormap = cmap;
    winattribs.border_pixel = 0;
    winattribs.background_pixel = 0;
    win = XCreateWindow(dpy, RootWindow(dpy, vis->screen), x, y, w, h, 0,
                        vis->depth, InputOutput, vis->visual,
                        CWColormap | CWBorderPixel | CWBackPixel, &winattribs);
    XStoreName(dpy, win, "Multi Threaded Renderer test");
    XSelectInput(dpy, win, ExposureMask | StructureNotifyMask | KeyReleaseMask);
    XMapWindow(dpy, win);
    XFlush(dpy);
    return win;
}

/* thread 1: immediate mode, left half of the window */
void *immediatemode(void *arg)
{
    float *imvbuffer, *tmpbuf;
    double seconds;
    long long before, after;
    int numvert = 64 * 1024, i = 0, j = 0;

    (void)arg;
    fprintf(stderr, "starting immediatemode thread\n");
    glXMakeCurrent(dpy, win, ctxt1);
    glViewport(0, 0, w / 2, h);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-1, 1, -1, 1, -1, +1);
    glScissor(0, 0, w / 2, h);
    glEnable(GL_SCISSOR_TEST);
    glClearColor(0, 0, 0, 0);
    glClear(GL_COLOR_BUFFER_BIT);
    glDisable(GL_SCISSOR_TEST);

    imvbuffer = (float *)malloc(numvert * 2 * sizeof(float));
    if (imvbuffer == NULL) {
        fprintf(stderr, "malloc failed\n");
        exit(1);
    }
    tmpbuf = imvbuffer;
    horizstrip(-1, 1, 1, 1, -1, 0, 1, 0, numvert / 2, imvbuffer);
    glColor3f(0, 0, 1);

    before = now();
    for (i = 0; i < numiter; i++) {
        glBegin(GL_TRIANGLE_STRIP);
        for (j = 0; j < numvert; j++) {
            glVertex2fv(imvbuffer);     /* one dispatch per vertex */
            imvbuffer += 2;
        }
        imvbuffer = tmpbuf;
        glEnd();
    }
    glFinish();     /* wait for the pipeline to drain, as the vertex array path does */
    after = now();

    seconds = (double)(after - before) / 1000000000.0;
    fprintf(stderr, "%d vertices in %4.2lf seconds %4.2lf vps (imm mode)\n",
            numvert * numiter, seconds, (double)(numvert * numiter) / seconds);
    fprintf(stderr, "done 1\n");
    free(tmpbuf);

    pthread_mutex_lock(&mtx);
    fincount++;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* thread 2: vertex array range extension, right half of the window */
void *vertexarraymode(void *arg)
{
    long long before, after;
    double seconds;
    int i;

    (void)arg;
    fprintf(stderr, "starting vertexarray mode\n");
    glXMakeCurrent(dpy, win, ctxt2);
    glViewport(w / 2, 0, w / 2, h);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-1, 1, -1, 1, -1, +1);
    glScissor(w / 2, 0, w / 2, h);
    glEnable(GL_SCISSOR_TEST);
    glClearColor(0, 0, 0, 0);
    glClear(GL_COLOR_BUFFER_BIT);
    glDisable(GL_SCISSOR_TEST);

    glGetIntegerv(GL_MAX_VERTEX_ARRAY_RANGE_ELEMENT_NV, &maxvert);
    fprintf(stderr, "maximum number of vertices %d\n", maxvert);

    /* let the driver allocate the vertex buffer, then declare it as the array range */
    vbuffer = (float *)glXAllocateMemoryNV(maxvert * 2 * sizeof(float), 0, 0, 1);
    if (vbuffer == NULL) {
        fprintf(stderr, "glXAllocateMemoryNV failed\n");
        exit(1);
    }
    glVertexArrayRangeNV(maxvert * 2 * sizeof(float), vbuffer);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    glEnableClientState(GL_VERTEX_ARRAY);
    horizstrip(-1, 1, 1, 1, -1, 0, 1, 0, maxvert / 2, vbuffer);
    glColor3f(1, 0, 0);
    glVertexPointer(2, GL_FLOAT, 0, vbuffer);

    before = now();
    for (i = 0; i < numiter; i++) {
        glDrawArrays(GL_TRIANGLE_STRIP, 0, maxvert);    /* one dispatch per strip */
    }
    glFinish();
    after = now();

    seconds = (double)(after - before) / 1000000000.0;
    fprintf(stderr, "%d vertices in %4.2lf seconds %4.2lf vps (var mode)\n",
            maxvert * numiter, seconds, (double)(maxvert * numiter) / seconds);
    fprintf(stderr, "done 2\n");

    pthread_mutex_lock(&mtx);
    fincount++;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thid1, thid2;
    XVisualInfo *vis;

    numiter = (argc > 1) ? atoi(argv[1]) : 1000;
    fprintf(stderr, "X thread support (1 is good) = %d\n", (int)XInitThreads());
    dpy = XOpenDisplay(NULL);
    if (dpy == NULL) {
        fprintf(stderr, "cannot open display\n");
        return 1;
    }
    vis = getVisual(dpy);
    win = makewindow(dpy, w, h, x, y, vis);
    ctxt1 = glXCreateContext(dpy, vis, 0, GL_TRUE);
    ctxt2 = glXCreateContext(dpy, vis, 0, GL_TRUE);

    pthread_create(&thid1, NULL, immediatemode, NULL);
    pthread_create(&thid2, NULL, vertexarraymode, NULL);

    /* pthread_cond_wait must be called with the mutex held */
    pthread_mutex_lock(&mtx);
    while (fincount < 2) {
        pthread_cond_wait(&cv, &mtx);
    }
    pthread_mutex_unlock(&mtx);

    sleep(2); /* leave window hanging for a little while */
    return 0;
}
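
The dispatch-count argument from the introduction can also be illustrated without OpenGL. The following is a minimal sketch, not the driver's implementation: fake_context, fake_make_current, fake_vertex2fv and fake_draw_arrays are hypothetical names, and C11 _Thread_local storage stands in for whatever thread specific storage the driver uses. It only counts how often the calling thread's context has to be looked up; it does not measure what a lookup costs.

```c
/* Hypothetical per-thread record standing in for a driver's current context. */
typedef struct {
    long lookups;   /* times the thread's context had to be found */
    long vertices;  /* vertices accepted so far */
} fake_context;

/* Assumption: one "current context" pointer per thread. */
static _Thread_local fake_context *current_ctx;

/* Bind ctx to the calling thread and reset its counters. */
void fake_make_current(fake_context *ctx)
{
    ctx->lookups = 0;
    ctx->vertices = 0;
    current_ctx = ctx;
}

/* Every entry point starts by identifying the caller's context. */
static fake_context *lookup_current(void)
{
    fake_context *ctx = current_ctx;

    ctx->lookups++;
    return ctx;
}

/* Immediate mode style: one lookup for every vertex. */
void fake_vertex2fv(const float *v)
{
    fake_context *ctx = lookup_current();

    (void)v;
    ctx->vertices += 1;
}

/* Vertex array style: one lookup for the whole batch. */
void fake_draw_arrays(const float *v, long count)
{
    fake_context *ctx = lookup_current();

    (void)v;
    ctx->vertices += count;
}
```

Sending four vertices through fake_vertex2fv leaves lookups at 4, while a single fake_draw_arrays call with a count of 4 accepts the same four vertices with lookups at 1. With the 64K-vertex strips used in the listing above, the same ratio is 65536 to 1 per strip.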