We have a transaction table like this:
table txn ( date_created timestamp not null, date_confirmed timestamp null, ... )
Records are inserted with date_confirmed=null, using null to mean it has not been confirmed.
The majority of transactions are confirmed within a few minutes and date_confirmed is updated.
If disk space is not allocated for null values on insertion, will this lead to fragmentation? Is it better to insert a 'date-that-means-unconfirmed' instead?
It depends (doesn't it always :-), but I would say that if most rows will eventually have the nullable column date_confirmed column (and any other nullable columns) set to a non-null value then it is likely best to insert the row into the table with a non-null value.
Inserting rows with null values and the changing the null values to non-null will cause the row to grow and if there is insufficient free space on the page where the row is stored then the row will be split (or moved and a stub left behind in its original spot) and thus fragmenting the row. Exactly how much fragmentation occurs will depend on:
I.e. it really depends on how may splits are going to occur per page.
For example, if it is always the case that when you insert a row with a null value and you then set to a non-null value before any other row is inserted then it will likely be the case that doing this sequence will lead to at most one split row per page (i.e. the last row inserted on the page) and therefore this may not be that big of a deal ... but how "big" of a deal it is will depend on the number of rows per page (which depends on the size of the page and the size of a typical row).
However if you are inserting rows at a rapid rate (relative to the setting of the columns to non-null values) then it could be the case that each page is filled to its limit with rows containing nulls ... so when you set the nullable columns to non-null values there may not be sufficient space on the page to hold the now-larger-row and hence this could lead to every row on the page being split. This would be bad.
So, net result is that if you know that the column is going to have a non-null value then save yourself the grief and make the column non-nullable and initialize it to a out-of-band "date-that-means-unconfirmed" value. Declaring the column as "not null" has the bonus that the optimizer may be able to use this information to possibly choose more efficient plans when the column is used as a predicate in your queries.