Tuesday, 6 September 2011

Know your middleware … source code please!

 

It sounds so nice and easy: instead of spending time and money on developing features that basically every game needs anyway, we license suitable middleware. Just plug it in and you get awesome physics, astounding AI and the best graphics … and we do not need nearly as many programmers! Well …

What does middleware mean for us?

Middleware is an “easy” way for us to add features to our game that we do not want to develop on our own, or do not have the resources or time to develop. It allows us to concentrate our resources on the things that make our games great and special. And how many new rigid body physics simulation implementations does this world need anyway?

So I guess pretty much every game production of a certain scale uses middleware to some extent.

What does middleware mean for a development team?

When I started at YAGER we still had our own engine. The engine was used to build our first game (YAGER) and for some of the demos we did afterwards. While working on new pitches we soon realized that we needed to add middleware to keep up with the development in the industry. Animation and physics were among the first things to add. At some point we moved to a complete engine (Unreal Engine) because we did not want to spend so many resources in developing standard tools like a level editor or a build system. That decision was not taken lightly as we loved to work with our own technology.

After that the programming team grew mainly in three areas: gameplay, AI, and consoles/performance. The team was a lot smaller than it would have been if we had continued with our own tech. It was a team that was focused on getting our demos and then the production of our game project done. Some publishers even seem to see it as a big plus if your team uses proven middleware. In short I think middleware is the right way to go if you want to focus your team on doing games instead of tech. And there is always enough tech to develop in addition to the middleware anyway.

But enough introduction – I would like to share the most important lesson we learned while using very different types of middleware.

Know your middleware!

Currently we use the Unreal Engine 3, an AI middleware for pathbuilding, a sound middleware and a few additional middleware packages that are part of the Unreal Engine 3.

As in my last article, I will mainly focus on the high level issues. If you have more specific questions just comment on this article or contact me via email. If I get enough interest in a specific topic I could very well make this my next article.

You have to know your middleware!

One of the main mistakes I have seen when using middleware is treating it like a black box that does magic things for you that are described in some documentation. This almost always leads to a lot of bugs caused by using it wrongly or a bit differently than intended by the developers of the middleware. For whole engines this is especially true. The Unreal Engine is a very powerful beast but if you use it wrongly it can become very slow, need a lot of memory or have all sorts of “under the hood” problems. The reason for this is quite simply that the engine is developed for the games Epic is making. For those games Epic implements certain features or content in a certain way, and the engine works very well and is pretty easy to handle when you do your stuff in a similar fashion. Of course you can do things differently, but the amount of work it takes to get that running smoothly is significantly higher and you lose the benefit that “Epic’s way” is proven and bugfixed by multiple game productions.

So the key here is to actually know how your middleware works and how it is supposed to be used in as much detail as you possibly can. Whenever you implement something in your game based on or using a middleware you should take the time to find out how exactly the middleware works in that area. If you have the source code, read it! Do easy experiments! Try to find out where the limits are by testing extreme cases.

Imagine you are planning on having ragdolls in your game. Who would not want that? Find out how complex it is to set them up. How many of them can you have in a level? How fast do they go to sleep? …

We found the answers to questions like that way too often by having the feature in the game and then seeing the problems. This may sound obvious but way too often middleware is just explored on the surface and by reading the documentation alone. This is even worse when you use a whole engine. In that case it is even more important to really understand how the things you base your game on work.

The worst of it all is that you are usually hit by problems with your middleware close to an important milestone (vertical slices, demos …). Why? Because it is exactly at these times that your team tries to get the most out of the middleware and tries to push it to its limits. So if you run into an issue at that point you face the following situation: You try to fix a problem in or with a system you do not know well enough and, if you are in a really bad spot, you do not even have the source code for it. You rely on trial and error and on the support of your middleware provider, and that might be hard to get on a Saturday night 20 hours before your delivery date. If you had studied the middleware in depth beforehand, it is very likely that at least some of the extreme cases your team is trying now would already have been covered or ruled out by your early experiments, and even if not, you would be a lot better prepared for fixing issues because you know the system.

An example: As explained above we are using a middleware to generate our pathdata for the AI. As long as the environments we were doing this for were nice, flat and blocky like in our early test levels we did not encounter any issues. The problem is: our game is not like that. The levels have a lot of uneven ground and geometry that is not blocky at all. Even then the pathdata generation still worked fine in 90% of the cases. But again and again we encountered situations where the pathbuilding algorithm would produce very strange results and the documentation was not able to explain why. The best step then would have been to check the source code to find out how exactly the parameters we could tweak would influence the result, but we did not have that. So we needed to use trial and error and the support to find out how to handle these situations. Of course we were hit by these issues shortly before delivering an important press demo.

So having licensed a middleware that saves you time and development risks is not the end of the story. It still means you need to know it deeply. Getting to know it will cost a lot of time, but it is time you have to spend or you will very likely regret it at some point. The example above brings me to my next point:

Source Code Please!

Ask any programmer on the planet for the best source of documentation of a software system and you will have a high chance of getting this answer: The source code. I am not saying that we do not need well written documentation. On the contrary! The better the documentation the less dependent we are on the source code, and non-programmers can usually only use the documentation and/or examples anyway. But when things get complicated and you need to know exactly how a system behaves there is no better place to look than the source code. We have both situations in our project (middleware with and without source code). Unreal Engine comes with full source code and you even compile it for your game. This is really the best situation and I guess for full engines the only workable one. I cannot count the number of times we needed that source access to understand and debug and sometimes even fix issues we had.

Another example: We had issues with the texture streaming on the PS3. The textures seemed to stream in a totally random fashion. It sometimes even looked like they would stream in the background before they would stream for foreground objects. As we have the source code for the game engine we could debug the issue and found a bug in the engine code that we could easily fix and report to Epic. If I remember correctly they had also found that issue shortly before that and had fixed it already. When reading the Epic mailing list I can see that this happens again and again, that licensees find bugs, fix them and report that fix.

And another example: Our game would crash very infrequently somewhere in the physics middleware code and we had no idea why. For this we had no source code access. In the end we found out that under certain very rare conditions our gameplay code would send NaNs into the middleware and that made it crash somewhere pretty deep. It took some time to track down that problem and once we knew it was pretty easy to fix. It was not even a bug in the middleware but our own fault. With source code access we would have been a lot faster with identifying that issue.
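A guard at the boundary between gameplay code and a closed middleware is one cheap way to catch this kind of problem early. The following is a minimal sketch in Python, not the code from our project: `assert_finite` and `apply_impulse_checked` are hypothetical names, and `apply_impulse` stands in for whatever call the middleware actually offers.

```python
import math


def assert_finite(values, label):
    """Return the values as a tuple, or raise if any component is NaN or infinite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            raise ValueError(f"{label}[{index}] is not finite: {value!r}")
    return tuple(values)


def apply_impulse_checked(apply_impulse, body_id, impulse):
    """Validate the impulse vector before handing it to the middleware.

    apply_impulse is a placeholder for the real middleware call (an assumption).
    """
    apply_impulse(body_id, assert_finite(impulse, "impulse"))
```

With a check like this the failure shows up in your own code, with a readable message, instead of somewhere deep inside a library you cannot step through.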

So this to all middleware providers:

Give us the source code!

We license middleware because we need somebody else to write and maintain the code for the features we do not want to write and maintain on our own. As you can see above having the source makes your middleware a lot more valuable to us. We can understand your middleware a lot better and we can debug issues that are caused by misusing it a lot easier, we can identify problems in your code and by that improve your product and if really necessary we can fix a bug ourselves and then simply give the fix to you. I can perfectly understand the fear that somebody might steal your technology but I guess everybody that works in the software development industry knows: Writing code (or as you might fear – stealing it) is one thing – maintaining it usually is much more work and that is what we license.

When we consider middleware these days not getting full source code access is a pretty big minus and for some middleware like AI we would not accept that again. The risk is simply too high.

Monday, 22 August 2011

The TechCabal

 

I am a member of the TechCabal. Usually this does not involve any arcane rituals or strange outfits. We are a group of higher ranking technical guys that guide the project on the technical side. How and why did we create this group?

Once a team reaches a certain size some decisions cannot be made directly by the people in the trenches doing the work anymore. This lesson usually is learned the hard way when a team grows from a small developer to a full AAA project team like we did. The easy direct lines of communication one is so used to simply do not work anymore. People need to step up and start leading and leading means making decisions. So either you hire some experienced leads or you promote people from the midst of the team to lead them now - Problem solved!

Problem solved? The team grows further, the decisions become bigger. You start to create more specialized teams: a gameplay development team, an AI development team, a performance/console team, a technical art team, a build/tools team… Now you have decisions and problems that are bigger than any of these specialized teams and you also need to coordinate these teams somehow.

What now? Another lead, a lead for the leads! Now that person makes all the really big decisions and handles the coordination, an enormous amount of responsibility for a single person, a lot of pressure. That person by pure necessity and time limitation would be pretty much removed from working the real metal, from the trenches of the actual development. We still would like to have him have all the knowledge that you get by working in the trenches to have his decisions be properly influenced by it. Maybe we could find some kind of Uber-genius for that.

Personally, I do not believe in the Uber-genius that can fill such a role, or at least I have not met such a person, and even if such a person existed you would need at least two for your team because even an Uber-genius needs to take vacation or will be sick. To me this means we have to find a different solution.

Our solution to the problem is the TechCabal.

So what is this TechCabal? The TechCabal is a group of tech guys that meet at least once a week to discuss the current technical issues and questions of the project. These questions range from questions of coordination of the different technical groups of the team to making far reaching technical decisions.

I have to give the credits for this idea to where they belong. Hendrik our Technical Project Lead came up with that idea and together with him the members of the group shaped it.

Currently the TechCabal includes the Technical Director, the Technical Project Lead, the Lead Programmer, the Lead Technical Artist and the Heads of the gameplay and performance/console programming teams.

Instead of relying on a single very great person we share the responsibility for the big and not so big technical decisions among this group. When we discuss in this group we try to keep title and rank outside. We try to discuss absolutely honestly and that sometimes means getting loud and shouting at each other. But as the saying goes: What happens in the TechCabal stays in the TechCabal!

We found that decisions that are achieved in honest discussions are usually a lot better than if somebody has to make them alone. Do not get me wrong. This is not democracy. Usually we do not end up with a compromise. We end up with clear decisions. Most of the time once you have listened to all viewpoints of the involved parties you end up with a good decision that is not the smallest part of the whole that we could barely agree on. I guess a reason for that might be that technical problems seem to have good and bad solutions. The whole process only works as long as the members of the group trust each other and political bullshit stays out of it. So yes that means that they need to get along well with each other, which is to some extent a weak spot of this way of making decisions.

Once the decisions have been made in the group, the group makes sure that they are followed. Individual members are tasked with carrying out individual decisions of the TechCabal.

A group that usually only meets once a week cannot make all the decisions that come up in the development process, and it should not. Just as a Lead should not try to make decisions for the guys in his team that they can make without him (micromanagement), the TechCabal should not try to take over the work and the decisions that the Leads should be making. Remember, it was set up to fulfill the need for the Uber-genius, not to replace the Leads.

How do we find out what to discuss in the TechCabal? This is not easy and we have no strict rules. Basically every member of the group decides what to bring up in the TechCabal. For me some of my decisions or problems feel big enough to inform the TechCabal about the decisions I have made and some decisions are big enough that I do not want to make them alone. It is a matter of experience and trust that brings the right topics into the TechCabal. Of course this means that we sometimes discuss things that are actually too small and sometimes things that should have been decided by the group are not. It is a learning process and we get better over time and there is also a positive side effect to discussing big and small problems alike. It gives even somebody as high up in the hierarchy as the Technical Director a pretty good idea of what happens in the trenches of the development of the project and this is something that is hard to achieve by writing reports.

Who should be in the TechCabal? There are two factors that influence who should be in the group. The first is technical and organizational knowledge. You need to have all the guys in the group that have the knowledge you need to make fast decisions. For us that means that we have 3 programmers (Lead Programmer, the Head of gameplay programming and the Head of console/performance programming) and 2 Technical Artists (the Technical Project Lead and the Lead Technical Artist). With these guys we make sure to have all the needed technical knowledge and the needed information about the current state of the project available in our discussions. The second important factor is a person’s place in the hierarchy. We simply need to make sure that all the people needed are available when it comes to making decisions. In our case the Technical Director is the person that brings the authority and his knowledge about company wide technical questions. If he were not there we would sometimes have to get certain decisions signed off externally, which would slow the whole process down. If we find out that we constantly miss certain pieces of information we add somebody that has that knowledge to the group. One thing you need to take into account though when setting up a group like the TechCabal is size. Make it too big and decisions will be very hard to make and meetings will take far too long. Make it too small and you will not get enough opinions, ideas and knowledge for it to work efficiently. We found our current setup working very well. Six guys with very different backgrounds bring six different views and are still fast enough to make decisions.

How does the TechCabal interact with the usual hierarchy in the team? Very well! The TechCabal includes decision makers from very different levels of the hierarchy but includes all top level guys. So the decisions made do not need to be signed off by some higher level in the hierarchy. This also means that if the TechCabal for some reason would fail to make a decision that can still be done in the usual way using the hierarchy in the team. I cannot remember a single instance where that was necessary so far. The classic hierarchy even helps the TechCabal in the communication with the team. Carrying out project wide big scale decisions is usually something the Technical Project Lead or even the Technical Director is tasked with. If issues become more specific to smaller parts of the team it is the specific lead or head that is taking the responsibility.

I think the TechCabal is a good example for how making decisions in a group of experts will get you better decisions and will provide a more failsafe way for decision making. This can also be used by other groups. I could perfectly imagine a design cabal, an art cabal. Just look out for situations where the decision making of a single person spans a huge area, has a big impact on the whole project and requires a multitude of opinions and facts to be taken into account. This might be a good place to have a group instead of that one person.

Wednesday, 27 July 2011

How to improve build stability

 

Build stability is always an important topic for us but once a game production has entered the production phase in earnest the stability of the game and the tools becomes one of the more important aspects for the tech team. The simple reason for this is that the number of people relying on this is highest at that point and any time these people have to wait for a bugfix or missing tools potentially means a lot of money wasted. So keeping your build as stable as possible is important.

And now for the bad news: I do not have the “This Solves All Our Problems” recipe. I want to share some of the measures we have applied in our projects. If you have taken other measures to ensure build stability please tell me. I am always interested in doing more.

Iteration time rules

Having a stable build is very important – yes, but you cannot ruin the iteration time for your team. There will always be that level designer that requests a small feature, a small change or simply needs a critical bug fix really fast (usually yesterday) to finish the mission for the next milestone. You do not want him to wait for a week for that change. With 10, 20 or even more engineers working on your code base at the same time the chance is high that there is always at least one that has added a bug that makes it impossible to release the next engine version to the team at least if you do not take some measures that help to keep the build stable. The problem of course is that the measures you take cannot add so much overhead that they become a reason for slow iteration times. So everything you do needs to strike a balance between overhead and improved build stability.

Automated build systems


CIS – continuous integration server: you need this! It is bad enough if “real” bugs trouble your build – it is far worse if simple bugs destroy it. Ever come into the office in the morning to find out that you cannot compile the game? A typo, a file that had not been checked in, a bad merge? How many people lost how much time during this one morning? This is totally avoidable. The main function of our CIS is to continuously build the engine whenever somebody checks in a change. This makes sure that the engine and tools at least compile. Of course we also run a few easy and fast smoke tests that also make sure that you can at least start the engine.
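The core of such a check-in loop is small. Here is a minimal sketch in Python, meant as an illustration and not as a description of any real CIS product: `run_pipeline` and the step names are made up, and each step is assumed to be a callable that returns True on success.

```python
def run_pipeline(steps):
    """Run the named steps in order and stop at the first one that fails.

    steps: list of (name, callable) pairs; each callable returns True on success.
    Returns (True, None) if everything passed, otherwise (False, failed_step_name).
    """
    for name, step in steps:
        try:
            succeeded = step()
        except Exception:
            # A step that blows up counts as a failed step, not as a broken server.
            succeeded = False
        if not succeeded:
            return False, name
    return True, None
```

A real setup would put "compile engine", "compile tools" and "start engine and load a test map" into `steps`, trigger the run on every check-in, and mail the name of the failed step to whoever checked in last.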

But you can do even more. During the day the focus is on getting the engine built as fast as possible and running smoke tests. During the night we can do a lot more. We run automated tests to get statistics for memory usage and performance in test levels and game levels. These statistics are made available as graphs on an internal website. These graphs are an enormous help to recognize and track down sudden jumps in either performance or memory as well as gradual development. Together with good check in comments (see below) you can prevent this from breaking the game before it actually becomes a problem or you at least recognize the problem very fast and efficiently (without TAs or programmers spending time to find out why MissionXY is not running any more).
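Spotting a sudden jump does not have to be left to somebody looking at a graph. A minimal sketch, assuming the nightly run produces one number per changelist (for example peak memory in MB); `find_jumps`, the 10% threshold and the sample data are all assumptions for illustration:

```python
def find_jumps(samples, threshold=0.10):
    """Report changelists where a metric rose sharply compared to the previous sample.

    samples: list of (changelist, value) pairs, oldest first.
    threshold: relative increase that counts as a jump (0.10 means 10%).
    Returns a list of (changelist, previous_value, value).
    """
    jumps = []
    for (_, previous), (changelist, current) in zip(samples, samples[1:]):
        if previous > 0 and (current - previous) / previous > threshold:
            jumps.append((changelist, previous, current))
    return jumps


# Hypothetical nightly peak memory readings in MB.
nightly_memory = [(100, 400.0), (101, 404.0), (102, 480.0), (103, 482.0)]
suspects = find_jumps(nightly_memory)
```

Here `suspects` contains only changelist 102. Note that this only catches sudden jumps; a slow creep would need a comparison against an older baseline instead of the previous night.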

When I talk about automated tests I guess I have to talk about unit tests as well. I have some experience with them, though I have to admit that most of it is about how not to do it. We integrated a unit testing framework into the Unreal Engine on the Unreal Script and Kismet level pretty early in the production process. We started to use it for the AI code mostly, as this was mainly written by us and not relying too much on middleware code (except pathfinding). The main mistake we made was that we ended up actually doing integration tests, and maintaining those takes a lot of time. For some time we even made it part of the process to have “unit tests” for every feature we did. At some point we were spending more and more time on fixing tests which were failing because of changes in other systems and not because of bugs in the tested code – so we stopped doing it. For the next projects I want to do actual unit tests to test critical parts of our code. Integration tests are something that should be used for finished features that are not very likely to change a lot, and I guess that means you have to keep them for a later time in the production. If you have experience with successfully applying either I would like to hear about it.
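To make the difference concrete: an actual unit test exercises one small piece of logic with no level, no engine and no other system involved, so it can only fail when that piece of logic changes. The sketch below is in Python with the standard `unittest` module rather than Unreal Script, and `pick_nearest_cover` is an invented example function, not code from our game.

```python
import math
import unittest


def pick_nearest_cover(agent_pos, cover_points, max_distance):
    """Return the cover point closest to agent_pos within max_distance, or None."""
    best, best_distance = None, None
    for point in cover_points:
        distance = math.dist(agent_pos, point)
        if distance <= max_distance and (best is None or distance < best_distance):
            best, best_distance = point, distance
    return best


class PickNearestCoverTest(unittest.TestCase):
    def test_returns_nearest_point_in_range(self):
        self.assertEqual(pick_nearest_cover((0, 0), [(5, 0), (2, 0)], 10), (2, 0))

    def test_returns_none_when_everything_is_too_far(self):
        self.assertIsNone(pick_nearest_cover((0, 0), [(50, 0)], 10))
```

Saved as a file, this runs with `python -m unittest`. A test that spawns an AI in a map and waits for it to reach cover would be an integration test, and that is the kind that kept breaking for us whenever an unrelated system changed.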

Peer reviews

This is one of the best tools in our belt to improve build stability. It does not only give you a substantial improvement in build stability it also fosters communication within the team and distributes knowledge (win – win – win).
The idea is pretty simple: Whenever someone wants to check in a change he needs to get this change reviewed by one of his colleagues. Of course this will only work if it is taken seriously. The goal of a review should be that the reviewer has a good understanding of what the change is, how and why it was done. There are no dumb questions during a review. If you do not understand something while you do a review, ask. This goes especially for seniors or leads that sometimes might feel they should not ask dumb questions. If you think you need yet another opinion, get it. You may and should criticize style and details. Ask for additional or improved comments if you think they might help. This is not only about making sure the change works, it is about sharing ideas and knowledge as well.

So what do you get in the end? Reviews will easily spot obvious issues or problems with the idea of how to solve the issue at hand. They will rarely spot really intricate bugs or side effects. By that it will remove quite a number of bugs that would have been found later by the automated systems, by the QA or even worse by somebody trying to use a broken tool. What you also will get is people learning from each other, people looking into parts of the system they would not see usually. At least 2 people know the change that has been made in detail, so people getting sick or leaving the company becomes less of an issue. You get a culture of talking about your work and making sure work is actually done before the checkin (it is pretty embarrassing if obvious flaws are discovered by your reviewer in a piece of code that you actually considered worth checking in). People in your team talk, they develop a common language, they understand weaknesses and strengths of the team members.

A few things to keep in mind to make peer reviews work:
- it costs time – make sure everybody knows that this is time well spent and factor it into your estimates
- every checkin is reviewed – a lot of mistakes are made with “easy” or “small” checkins
- people should be available for a review – nothing is as annoying as not being able to check in just because nobody has time, so you should have a damn good reason to refuse a review
- add the information about who did the review to the checkin comment – reviews will be taken a lot more seriously that way and if you hunt a bug caused by a checkin you know the two guys you should talk to
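The last point is easy to enforce with a small check. A minimal sketch, assuming a made-up convention where the checkin comment contains a line starting with `Reviewed-by:`; the convention and the `reviewer_of` helper are assumptions for illustration, not a feature of any source control system:

```python
import re

# Hypothetical convention: one line of the comment reads "Reviewed-by: <name>".
REVIEWER_PATTERN = re.compile(r"^Reviewed-by:[ \t]*(\S.*)$", re.MULTILINE)


def reviewer_of(comment):
    """Return the reviewer named in a checkin comment, or None if there is none."""
    match = REVIEWER_PATTERN.search(comment)
    return match.group(1).strip() if match else None
```

A check like this could run before a checkin is accepted and reject comments for which `reviewer_of` returns None.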

Checkin comments


It might not be very clear initially how checkin comments can improve build stability because once the bug is checked in it is in. Good checkin comments make it a lot easier to track down an issue. Applying a structured format makes it even easier. Just imagine you sitting in front of the screen scanning through a list of 100 checkin comments to find out which change could cause your AI getting stuck while trying to vault over a cover. The easier it is to read the information and the better the information is in it the faster you will be. We fixed quite a lot of our “hard” bugs that way.
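What a structured format buys you is that the list can be filtered by a machine before a human reads it. A minimal sketch, assuming a hypothetical convention where every comment starts with an area tag in square brackets such as `[AI]`; the tag scheme, `filter_by_area` and the sample data are made up for illustration:

```python
def filter_by_area(checkins, area):
    """Keep the checkins whose comment starts with the given area tag.

    checkins: list of (changelist, comment) pairs.
    area: tag name without brackets, matched case-insensitively.
    """
    tag = f"[{area.upper()}]"
    return [
        (changelist, comment)
        for changelist, comment in checkins
        if comment.upper().startswith(tag)
    ]


# Hypothetical checkin history.
history = [
    (201, "[AI] Tweak cover vault distance check"),
    (202, "[Tools] Fix crash in level editor export"),
    (203, "[ai] Rework stuck detection"),
]
ai_changes = filter_by_area(history, "AI")
```

Here `ai_changes` keeps changelists 201 and 203, so a list of 100 candidates shrinks to the handful that touch the system you are debugging.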

But actually checkin comments (if they are well done) have even more uses. You can subscribe to your source control system (we use Perforce) to get an automated mail for every checkin in areas you are interested in and stay up to date with what is checked in by whom. This is not only a useful tool for a Lead, it is also interesting for other programmers, QA or producers to know what actually is checked in.

Testbuilds

This is something that is not easy to do and it requires substantial inhouse QA resources and some additional tool support. The basic idea is again pretty simple: Before you check in a change that you are not so sure about – test it. I guess everybody knows this bad feeling when changing something in a very old part of the code that also touches a lot of other code (maybe the guy who originally wrote it is not even there any more or you have to change code in your middleware). You are just not sure about the side effects and yes there is no automated testing around that part of the code. Basically the only way to find out what your change does besides what you intend it to do is testing it. The best people you have for testing are QA people (some of our QA guys find the strangest bugs and more importantly reliable repros for really hard ones - amazing). So the idea is to create a local build of the game (or representative part of it) and send that to the QA team to test your change. While you are waiting you can shelve your change and continue with something else. To make this a viable option you need really great tools to make the whole process as easy as possible. We are using the Unreal Engine with their build tools. It is easy to create a local build of the game for any platform using the Unreal Frontend. This tool is used to cook the game for the platform you need it for. Out of this tool we can push a build on to a central server (the prop server). The QA can get this build by using a simple web frontend and have it copied to their PC or XBox. Yes you could cook a build, copy it into some network folder and write a mail to the QA where to find it. But the easier the whole process is the more likely it is that people are actually using it and do not find excuses to not do it or get frustrated because they have to. We also established a bit of a strict workflow around it to make the whole process even smoother.

Even applying all of the things I explain above perfectly will not give you zero bugs but it will allow you to spend your time on the important and interesting bugs and what is even more important – adding cool shit to your game.

To not lengthen this lengthy blog entry further I kept the individual parts pretty short. If you are interested in the details of how we exactly do certain things – let me know. I could make one of my next blog entries cover this in more detail.

Things to look at next

Static code analysis - I saw a pretty interesting talk about this at GameFest 2011 in London and after that I hoped to try this pretty soon on our project. After John Carmack’s keynote at QuakeCon this has just become a lot more interesting. If you are interested watch the second half of his 90min talk.

Sunday, 17 July 2011

Starting to blog about gamedev

I am working as a Lead Programmer for YAGER Development in Berlin (www.yager.de). After about 8 years with the company and almost 5 years in the current project I thought I might have a few things to share about developing games. This is going to be my first blog so I am not totally sure how this is going to work out. I am as excited as I hope those who read this blog are.
Some facts about my professional background:
I joined the company while they were finishing up the work on the first game: Yager on the first XBox. I was then part of the team that developed the PC version of the game. This time was followed by a number of demos and prototypes we did to get our next contract. After some time we got a deal with 2K Games to work on “Spec Ops: The Line” (www.specopstheline.com). Working on a Triple-A project like “Spec Ops: The Line” for about 5 years teaches you a lot of things. I want to share some of the ideas and insights that I got in this time and that I still get this way.