Bjoern Knafla

Parallelization + AI + Gamedev Consulting 

Parallelization of Game AI - Presentation from the Paris Game AI Conference '09

In June 2009 I gave an overview and motivation for Parallelization of Game AI at the Paris Game AI Conference '09. The conference was a blast - great discussions with many interesting people and inspiring presentations - look out for the next installment, I will do so for sure!

Alex and Petra Champandard from AiGameDev.com, the main organizers and master-minds behind the Paris Game AI Conferences, just published the video and audio recordings of my presentation and put it online for AiGameDev.com premium members. Petra did a great job editing the recordings and put in a lot of hard work to synchronize my talk with the downloadable slides.

Video and slides at AiGameDev.com:

Slides with speaker notes can also be found at bjoernknafla.com:

Slides-movie and brief presentation recap

I find it extremely awful to watch and listen to myself / my presentation. I worked very hard on the talk and created slides that were full of animations to explain the presented parallelization concepts - making the slides a kind of script to follow during speaking. A script that needs accurate transitions and explanations under time pressure to stay inside the presentation time slot. This rigid structure to follow in combination with my nervousnes lead to lots of "ahs" and "errs" and let me struggle with my English speaking skills. To amend for th is, I created a quicktime movie from the slides showing the actual animations and added a coarse talk outline below. This in combination with the speaker note extended slides and the video recording at AiGameDev.com should give you much of the useful information of my talk.

(download)

Introduction

  • Why we need parallel programming: parallel hardware, performance.
  • Parallel programming isn't easy - but no rocket-science either:
    • parallel programming errors: race conditions, deadlocks,
    • performance problems: contention,
    • be aware that order of parallel instruction execution isn't deterministic.
  • Parallel hardware dictates how to program for performance:
    • maximize throughput - to fight memory latency,
    • maximize memory locality - to save bandwidth,
    • minimize (false) sharing - to prevent contention,
    • combined: balance maximization of locality vs. minimization of sharing.
  • Artificial intelligence (AI) model referenced in the talk: agent-based AI.

References:

Parallelization approaches

Asynchronous calls for pathfinding and other service calculations

  • Run computations asynchronously.
  • Better memory locality as async computations work on their own data.
  • Control iterations and when to merge result if determinism is important.
  • Uncontrollable scaling with raw threads (over-, under-subscription of hardware and performance penalties for numerous thread creation and destruction per main loop cycle, and context switching).
  • Take away: don't use raw threads but task pools or other constructs to express concurrency but not manage it by hand.

References:

  • James Reinders, Intel Threading Building Blocks, O'Reilly, 2007 (cross-platform C++ task pool by Intel)

Parallel agents

  • Unorganized parallel execution of agents needs synchronization of shared data.
  • Unstructured memory accesses and unorganized syncing:
    • saturates memory bandwidth,
    • is bad for data locality,
    • is not deterministic, and
    • potentially leads to contention and performance degradation.
  • Game entity management is ownership management (think netword and distributed programming).
  • Defer all non-read accesses: collect write requests and deterministically resolve conflicts (e.g. based on entity ids).
  • Identify data:
    • dependencies,
    • reading and writing access,
    • public data readable by all entities,
    • private data only accessible by entity owning it.
  • Execute entity updates in two stages:
    1. collection stage: all entities read public data (from last game world state) in parallel (without side effects),
    2. barrier synchronization waits on all agents to finish their collection stage, followed by the
    3. instance update stage: all entities update their private entity data in parallel.
  • Think about double buffering for public game/entity states.
  • Hint: for determinism and if needed use a random number generator per entity instance.
  • Stage internal: unstructured (read-)access to data can result in bad locality and eventually saturates the memory bus (bandwidth and latency problems).

Reference:

Data parallel systems

  • Each collection stage of parallel agents is different because of different agent state and surrounding game world.
  • Each agent works on different aspects of its behavior in individual order and duration: gameplay logic, sensing, pathfinding, terrain analysis, etc.
  • As already shown - this is bad for data locality and memory bandwidth usage.
  • Slice agents into aspects and group all aspects of the same type into systems.
  • Result: no single update function per agent but an update function per aspect/system.
  • Independent systems can run in parallel to each other.
  • Integration of agents aspects computations is defered based on the order of system execution per main loop cycle.
  • Systems can internally optimize their data: enables efficient parallel processing.
  • Systems run through stages:
    1. Collect data and update requests.
    2. Prepare parallel computation on data and update requests.
    3. Compute and transform aspect data in parallel (data-parallel computation).
    4. Publish results (pull, or push, or message passing, double buffering).
  • Systems enable lean data structures with good locality.
  • Data-parallel computations typically scale well with increasing processor core counts.
  • Order-irrelevant data processing enables determinism.
  • Systems might need internal low-level parallelism and synchronization.

References:

Conclusion

Know your parallel hardware, program for throughput, organize memory access, balance maximization of locality vs. minimization of sharing. Use the parallelization approaches as guides, and always remember: it is easy to introduce hard to find parallel errors - therefore: KISS - keep it short and simple.

References and further reading:

Outlook

I am planning a number of blog articles to delve more deeply into the parallelization of games and game AI, and to take a look at parallelization approaches I had to drop from the talk because of time constraints, like:

If you have got questions, feedback, criticism, then feel free to leave a comment or contact me via email or twitter. I am looking forward to hearing from you!

 

Bjoern

Loading mentions Retweet
Filed under  //   game ai   gamedev   parallelization  

Comments [4]

Parallelization in Games: Codemasters Colin McRae Dirt 2

I am always on the hunt for gamedev industry parallelization approaches deployed in computer and video games. Typically I scribble down some bullet points about game parallelization infos I find to use as a reference later on - so: why not share it on the web? Here we start:

Todays issue of the Intel Visual Adrenaline Magazine (issue 4, 2009, web, pdf) contains the article The Muddy Beauty of DiRT* 2*: Where rally realism meets the raw road by Jon Jordan about Codemasters Colin McRae DiRT 2 ralley/driving game with some bits of technical details provided by Codemasters Gareth Thomas (senior graphics programmer).

DIRT2_morocco_04aDIRT2_morocco_04a picture by Codemasters, on Flickr

Following is a short overview of the parallelization info revealed in the article:

  • Physics engine runs at minimum of 60 Hz - critical parts of it are simulated at 1,000 Hz
  • PlayStation 3 and Xbox 360 as lead platforms, single- and multi-core PCs are also supported
  • A tweaked version of Codemasters EGO game engine provides the main multi-threading and parallelization framework
  • Task parallelism: rendering happens on one core, game logic and physics on another/other core(s)
  • Tasks can be split off to other cores if available (It is unclear to me if the rendering, logic and physics tasks are meant here, or if smaller tasks from within these systems are described to allow scaling of performance or eye-candy with the number of cores)
  • Processes are parallelized depending on the number of cores and threads of the host platform
  • Worker maps specify how many threads the different subsystems use and the mapping between threads and actual cores
  • XML files describe the worker maps for different platforms and parallel hardware configurations based on profiling
  • In combination with different graphics resolutions for different systems, and enabling and disabling certain features like real-time environment maps on cars or dynamic shadows, worker maps enable hand-tuned scaling from lower-end to top-of-the-line machines
  • Occlusion culling is done via offline pre-computed potentially-visible-set tables and real-time lookup into these tables
  • At the beginning of a rendering frame visibility queries for main scene, shadows, and reflections are invoked on multiple threads. Results determine terrain and graphical object resultion/level-of detail and which hidden objects to cull

Summary and comments

Intel Visual Adrenaline Magazine articles typically don't delve deeply into technical details - and this article is no exception.

From the information that is revealed it looks like Codemasters EGO engine is a classical task parallel solution which models certain functional components of games (e.g. graphics, physics, game logic, etc.) as tasks. These tasks are statically mapped to specific threads/cores while satisfying inter-task (data-) dependencies. The mapping is controlled via XML files called worker maps.

I would have liked to get more information how the different tasks are distributed to cores and threads to learn if the parallelization approach described could scale to increasing core counts or where the scaling limits lie. The worker maps and the description of the task parallel system indicate a strong thinking in threads and a coarse-grained approach to parallelism (whole functional components/tasks are subject of threading). While this gains strong control of hardware-usage (which task to map to which core/thread) on todays multi-core machines it often also serverly limits scalability with increasing core counts.

In addition to the platform-static threading, other operations (e.g. visibility queries) can be invoked on available service threads. Depending on the am ount of operations/queries this can easily scale with core/thread count. The article doesn't explain if the queries are handled as jobs and entered into a job pool which then schedules them to available worker threads created at game start-time and destroyed at game-end-time, or if threads are created on the fly per query (which isn't advisable as thread creation and destruction is quite computationally expensive when done too often per main loop cycle).

I would be delighted if anyone with more legally shareable insight into the parallelization of Codemasters EGO engine could comment on the open questions and perhaps provide even more detailed insights into the technology behind the physics sim heavy and visually spectacular DiRT 2 game.

 

Bjoern

Loading mentions Retweet
Filed under  //   game   gamedev   parallelization  

Comments [0]

Integration of Posterous with web presence

I am beginning to blog and don't want to administrate my blogging software/server myself (greetings to the currently running attack on non-updated WordPress blogs). Because of its ease of use for short blog posts and its (constantly growing) integration with many other web services (like Twitter) I decided to test Posterous.
 
My plan: integrate Posterous as transparently as possible into my tiny web presence. Posterous allows to run your blog at your own domain instead of just under <yoursitename>.posterous.com - they call it custom domains.
 
And here I hit a stumbling block: I am also using my website and blog as an outlet for my freelancing and consulting in the area of computer and video game parallelization and my countries law requires to show an imprint with contact infos. Where to show this imprint? And where to put my short about info?

If I point the main A record of my registrar to Posterous I can't think of a direct way to access the imprint and about pages I am currently hosting at my registrar and web hoster anymore.
 
Currently I am using a sub-domain blog.bjoernknafla.com and let its A record point to Posterous and store my about and imprint pages under the www. "sub"-domain.
 
But I would like the blog to be the focal point of my web presence. A blog is the most dynamic part of most websites and the part that will therefore bait google the most.

  • I could point the main A record to my Posterous site (bjoernknafla.posterous.com) and add two new Posterous sites which only show the about and imprint info, e.g.: bjoernknafla_about.posterous.com and bjoernknafla_imprint .posterous.com. However I would like it better if I could directly add (static) sub-pages to my main Posterous site without adding new site-names to .posterous.com (without polluting the .posterous.com "namespace" and without the minimal risk that these names are already taken).
  • Or should I just write a blog entry for the about and imprint stuff and link to these posts from my Posterous profile visible from the main blog site?


How do other Posterous users handle this? Do you use Posterous mainly as a distribution list to streaming posts to different other web-services, but it isn't your central site?

Will be interesting to see what Posterous themes will add to the integration picture.
 
Bjoern

Loading mentions Retweet
Filed under  //   web services  

Comments [0]

Context, context, context - and locality - how not to lose your mind in parallel times

Alex Champandard from AiGameDev.com ranted about the negative sides of Singletons on Twitter:

Been thinking about the best argument against Singletons. It's down to assumptions, they make an ass out of your code now and bite me later.

There have been lots of discussions and battles regarding the pros and cons of Singletons and I don't want to incite them here again. Singleton's are a tool to use and as every tool they have their positive and negative sides.

It started with a Singleton

From a "parallelization of legacy code" perspective, Singleton's are problematic - like most globals.

When looking at code - data structures, classes, (member) functions - you can't spot the usage of globals and Singletons just from the interface. Their use is hidden inside the implementation, buried deep in the bowels of code.

Why is that bad? Just by looking at functions and methods I can't determine if they have side effects or not. I can't easily see and understand the data-relationships - if and when data is read and written. I have no idea if data is shared. But sharing needs synchronization, synchronization hinders scalability and can lead to contention, and contention kills performance (read Joe Duffy's Some performance implications of CAS operations and Herb Sutter's Sharing is the Root of All Contention).

C and C++, the programming languages most prevalent in game programming, don't offer direct support for contracts or aspect-oriented programming like other languages might do - therefore contracts or guarantees in which way data is shared, or for pureness (side-effect freeness), or explanations how to use the functions and data concurrently, need to be expressed by hand and coding style.

Ok, this is my reason for disliking Singletons: they hide manipulation of (shared) global state which can easily introduce race conditions under concurrent execution.

Embrace context and locality

And now to the core point - and question - of this post: How to live without Singletons or globals?

Hidden sharing and side effects

In my experience, easy to parallelize code (the combination of data and instructions) typically shows the context under which it runs very explicitly. All data manipulated by functions or methods is passed into it as arguments - the user controls and introduces sharing explicitly and there are no hidden side effects on global or even instance state.

"Easy" parallelization in this context (pun intended, oh well...) means parallelizing applications without producing hard to find errors or errors that wait a long time till they strike in the worst moment possible (parallel or threading errors are like sleepers...).

By the way: calling C's rand, or IO functions, or even malloc, free, or C++'s new and delete, all fall into the category of global application state side effects. This category also includes object-oriented code with non-abstract super classes that carry state - because from a fast look at the interface of an inherited class it isn't clear if a method call depends on state stored somewhere deep down in the inheritance hierarchy.

Context parameters versus enormous parameter lists

But..., but code that receives all the data and state via arguments has unwieldy numbers of arguments... and everything needs to be passed into it - data, allocators, IO functionality, logging and tracing facilities - such code isn't easy to read or write (even with code completion in todays editors)...

Currently I can think of three - no - four ways to mitigate this parameter-count-increase problem:

  1. Group data, allocators, functions, etc. in context structures - Allan Kelly described this pattern and coined it Encapsulated Context . The se context groups reduce the number of function signature parameters.

    I used this technique when refactoring OpenSteer for parallelization with OpenMP (ParCo2007 - paper and slides) to gain more control over the data the functions and methods were reading and writing. Even more I learned from reading about its use in Insomniac Game's SPU shaders, see Joe Valenzuela's slides about SPU gameplay from GDC 2009 and look out for struct global_funcs_t.

  2. Do less per function call and minimize the need for context. For example, ask yourself:
    • Does a function really need to allocate memory or could the function caller allocate the memory and hand it to the function?
    • Do I need to actually store values in a concurrent container data structure or could I write a lookaside data structure that controls concurrency in combination with a non-concurrent container that holds the actual values?
    • Do I really want to call rand which alters the global sequence of random numbers and leads to non-determinism under concurrency as the order of calls isn't controllable anymore?
    Minimization of context means to solve only very clear-cut problems per function and to model the data like a glove that fits the real problem to solve.

    The less context a function or a member function needs, the less aspects (memory management, file IO, concurrency-control, access to global or singleton service facilities, access and manipulation of instance data) it straddles, the easier it is to understand and the less likely it hides side effects.

    The data feed into such a function or method can be controlled locally - the way it works can be understood locally - I don't need to understand a whole cluster of systems and its dependencies and internal data flow. The result: the user can control the context of execution as locally or as globally as needed - local, like in: local to a block inside a function, or local to a thread calling the function directly or indirectly.

  3. Keep the function callstack shallow - the less functions or methods are called directly or indirectly from inside another function, the less parameters needed by the nested function calls need to be passed to them, the less parameters the encapsulating function needs.

    One way to keep callstacks shallow is by deferring computations. Don't run every calculation directly but collect commands and (batch) execute them at a later point in time - typically at a point in time where the callstack is less deep than inside the function that created the deferred computation commands.

    But this is a topic for its own article(s).

  4. Ok, the fourth solution: ask yourself, do I really understand the problem and how to solve it, or am I creating overly complex abstractions which need a lot of facilities, services, data for its context because I really don't grasp the problem at hand?

    Reads and feels like I am dodging the question, doesn't it. Well, this is taken nearly one-to-one from presentations of Mike Acton and from discussions with him - and I am still struggling and fighting to truly grasp its impact and fully deploy its invaluable insight.

The return of instance data

Even with all of these techniques often many context parameters are needed. Calling one function directly or indirectly from inside another function results in the need to add parameters to the encapsulating function to hand the context down (at least if the context is truly global).

I carve in - an objects instance fields define state and context, too. Instance field use isn't directly visible from the signature of class member functions, therefore only introduce object state with thought and ponder its pros and cons:

  • The number of member function parameters can be minimized, versus
  • manipulation of instance state isn't save under concurrency and needs to be synchronized - or mindfully controlled.

Qualify member functions that don't have side effects on their instance state via const. And, honestly, don't use mutable - mutable data can hide behind const-proclaiming interfaces, but truly needs synchronization - which isn't obvious from the const-qualified method signature you call.

Concurrency needs control of context

Ok, one last point - I promise - under sequential execution, a context (be it controlled via function arguments or via object state) is used from the point it is set until it is changed again. Therefore contexts are used in serial intervals inside sequential apps. Contexts can be stacked, can build hierarchies, e.g. a certain context (memory allocators, resource managers) is set for a whole computer game. It is refined for the currently loaded level (level assets, game logic), and the context for a specific entity inside this level is added on top (entity state).

In concurrent or parallel applications there are many streams of instructions at once - so to say: there are multi-dimensional intervals of context-use. Here we need to differentiate:

  • context per local block/scope,
  • context of (thread-)local function calls,
  • context per instance lifetime (globally per app, or inside a local thread scope),
  • global context used by all instruction streams/threads,
  • context per thread.

All these different global and local context can form hierarchies - even graphs - too.

Matteo Frigo from CilkArts (just bought by Intel) describes a related phenomenon in The thorny problem of the cactus stack.

I am still searching and experimenting with ways to explicitly express and control the scope and kind of context of (member) functions, to keep sharing and side effects under control while minimizing the need for frivolously long function parameter lists.

Perhaps Haskell or other functional programming languages with explicit constructs to introduce non-determinism and side effects, or contract- and aspect-oriented programming systems offer techniques that can be mapped to C and C++.

The journey goes on - with context and locality as my guides.


Bjoern

 

PS: When I started working with C - and departing from C++ with its templates, classes, operator overloading, etc. - I always wondered how to write reusable containers without resorting to void* all the time. Then I stumbled over Mike Acton's Multi-threaded Optimization by Example slides which describe a single thread reader, single thread writer look-aside helper structure for bounded ring buffer's - and I saw a glimpse of the solution: store your data exactly like you want/need it and coordinate concurrent access and manipulation of your data with the look-aside control structure. Decoupling at its best.

Loading mentions Retweet
Filed under  //   parallelization   programming  

Comments [0]