Quota design documentation (2.6.36)

Intro

The current Linux quota implementation was written about 10 years ago. It satisfies only basic SMP correctness semantics (deadlock avoidance). The code relies on several global locks to protect its data, and as a result its SMP scalability is awful: quota is almost useless on systems with more than 8 CPUs. This project aims to rewrite the quota code to make it more scalable, and this paper is a guide for accomplishing that task. The paper is divided into three parts:

Current locks and locking rules
It is almost impossible to work out the real locking rules from the code comments, so I've collected an up-to-date list of locks and a table of the locks used by each quota function.
Definition of most contended locks
A chart of the locks that suffer the most contention, with the problem each one causes.
Proposed locking changes
This part was moved to patchseries page.

Current locks and locking rules (2.6.36)

Locks

dq_list_lock
protects all lists with quotas and quota formats, and also the dqstats structure containing statistics about the lists (madness).
dq_data_lock
protects the data in dq_dqb and in the mem_dqinfo structures, and also guards the consistency of dquot->dq_dqb with inode->i_blocks and i_bytes.
dq_state_lock
protects modifications of quota state (on quotaon and quotaoff)
dqonoff_sem
protects against quotaon/quotaoff races
dqptr_sem (per sb rw semaphore)
Any operation that works on dquots via inode pointers must hold it.
dq_lock (per dquot mutex)
A dquot is locked only while it is being read into memory.
dqio_mutex (per sb mutex)
Per sb I/O mutex.
i_mutex (per quota file)
Every I/O operation requires this mutex; with journalled quota, every operation results in a quota write.

Functions and lock table

 *table legend* 
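  - lst :: dq_list_lock
  - st :: dq_state_lock
  - ptr :: dqptr_sem
  - lk :: dq_lock
  - io :: dqio_mutex
  - On Of :: dqonoff_sem
  - dat :: dq_data_lock
  - drt :: mark_quota_dirty() is called from this function.
  - number :: lock order
  - g :: guarded by this lock, caller must hold it.
  - INLK :: inode_lock
  - in ptr column ::  WR = down_write, rd = down_read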
  - List manipulation in lst column
       *First char == an action, 
         - a :: add in to a list              
         - r :: remove from a list
         - t :: traverse a list
       *Second char == a list           
         - i :: inuse_list
         - d :: dirty_list
         - f :: free_list
         - F :: tofree_head (dq_free) in remove_inode_dquot_ref()

|---------------------------+-----+----+-----+----+----+----+---+---|
|                           |     |    |     |    |    | On | d | d |
|                           |     |    |     | dq | dq | Of | a | r |
| func name                 | lst | st | ptr | lk | io |    | t | t |
|---------------------------+-----+----+-----+----+----+----+---+---|
| wait_on_dquot             |     |    |     |  1 |    |    |   |   |
| dquot_mark_dquot_dirty    | ad  |    |     |    |    |    |   |   |
| clear_dquot_dirty         | grd |    |     |    |    |    |   |   |
| dquot_acquire             |     |    |     |  1 |  2 |    |   |   |
| dquot_commit  cl drt      | 2rd |    |     |    |  1 |    |   |   |
| dquot_release             |     |    |     |  1 |  2 |    |   |   |
| invalidate_dquots INLK    |     |    |     |    |    |    |   |   |
| dquot_scan_active (ocfs2) | 2ti |    |     |    |    |  1 |   |   |
| vfs_quota_sync ->wr_dq    | 2td |    |     |    |    |  1 |   |   |
| shrink_dqcache_memory     | dif |    |     |    |    |    |   |   |
| dqput lst_l: dq_count     | 1   |    |     |    |    |    |   |   |
| dqget lst_l: dqhash       | 1   |  2 |     |    |    |    |   |   |
| dquot_initialize ->dqget  |     |    | ?WR |    |    |    |   |   |
| add_dquot_ref / INLK      |     |    |     |    |    |  g |   |   |
| remove_inode_dquot_ref    | 1aF |    | gWR |    |    |    |   |   |
| remove_dquot_ref /  INLK  |     |    |     |    |    |    |   |   |
| drop_dquot_ref /          |     |    | WR  |    |    |    |   |   |
| dquot_drop   ->dqput      |     |    | WR  |    |    |    |   |   |
| dquot_transfer ->dqget    |     |    | WR2 |    |    |    | 1 | 1 |
| dquot_commit_info         |     |    |     |    |  1 |    |   |   |
| vfs_quota_disable         |     |  2 |     |    |    |  1 |   |   |
| vfs_load_quota_inode      |     |  3 | WR2 |    |  3 |  1 |   |   |
|                           |     |    |     |    |    |    |   |   |
| __dquot_alloc_space [as]  |     |    |     |    |    |    | 1 |   |
| dquot_alloc_space  ->as   |     |    | rd  |    |    |    |   | 1 |
| dquot_reserve_space ->as  |     |    | rd  |    |    |    |   |   |
| dquot_alloc_inode         |     |    | rd  |    |    |    | 1 | 1 |
| dquot_claim_space         |     |    | rd  |    |    |    | 1 | 1 |
| dquot_release_rsrv_spc    |     |    | rd  |    |    |    | 1 |   |
| dquot_free_space          |     |    | rd  |    |    |    | 1 | 1 |
| dquot_free_inode          |     |    | rd  |    |    |    | 1 | 1 |
|---------------------------+-----+----+-----+----+----+----+---+---|
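
To show how the order numbers in a row are meant to be read, here is a minimal userspace sketch of two rows: dquot_acquire (dq_lock first, then dqio_mutex) and dquot_commit (dqio_mutex first, then dq_list_lock to remove the dquot from the dirty list). The model_* functions, the lock_id enum and the trace array are hypothetical illustration helpers, not kernel code, and the trace is only meaningful when called from a single thread.

```c
#include <assert.h>
#include <pthread.h>

enum lock_id { DQ_LOCK, DQIO_MUTEX, DQ_LIST_LOCK, NLOCKS };

static pthread_mutex_t locks[NLOCKS] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};

/* Acquisition order of the most recent model call (single-threaded use). */
static enum lock_id trace[8];
static int trace_len;

static void take(enum lock_id id)
{
    pthread_mutex_lock(&locks[id]);
    assert(trace_len < 8);
    trace[trace_len++] = id;
}

static void drop(enum lock_id id)
{
    pthread_mutex_unlock(&locks[id]);
}

/* Row "dquot_acquire": dq_lock column = 1, dqio_mutex column = 2. */
void model_dquot_acquire(void)
{
    trace_len = 0;
    take(DQ_LOCK);
    take(DQIO_MUTEX);
    /* ... the dquot would be read from the quota file here ... */
    drop(DQIO_MUTEX);
    drop(DQ_LOCK);
}

/* Row "dquot_commit": dqio_mutex column = 1, lst column = "2rd". */
void model_dquot_commit(void)
{
    trace_len = 0;
    take(DQIO_MUTEX);
    take(DQ_LIST_LOCK);
    /* ... the dquot would be removed from the dirty list here ... */
    drop(DQ_LIST_LOCK);
    /* ... the dquot would be written to the quota file here ... */
    drop(DQIO_MUTEX);
}
```

After model_dquot_acquire() the trace holds DQ_LOCK then DQIO_MUTEX; after model_dquot_commit() it holds DQIO_MUTEX then DQ_LIST_LOCK. Taken together the two rows imply the nesting dq_lock, then dqio_mutex, then dq_list_lock.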

Most contended locks chart

i_mutex
This is the most annoying lock in the case of journalled quota. Just think about it: when several concurrent users perform writes, they all have to go to sleep on i_mutex.
vfs_dq_alloc_space()->mark_quota_dirty()->write_quot()
dqio_mutex
Same as the previous one: this is a per sb I/O mutex.
dqptr_sem
Currently this is acquired for write during vfs_dq_init(). In fact, many VFS functions call this callback, for example open (for write), truncate, unlink, link, etc.
dq_data_lock
The problem with this lock is that it is global. It only needs to protect a given dquot and the inode's bytes struct.
dq_list_lock
Again, the lock is global; proper code reorganization will help.
dq_state_lock
Why do we need it? We already have dqonoff_sem. This is overkill.
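
The remark about dq_data_lock in the chart above can be made concrete with a small userspace sketch. It contrasts charging space under one global lock with charging under a lock embedded in each dquot. Everything here is an assumption made for illustration: dquot_model, its lock field, and both charge_* functions are hypothetical, pthread mutexes stand in for kernel locks, and the per-dquot variant is not taken from the actual proposed changes, which live on the patchseries page.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a dquot with an embedded lock. */
struct dquot_model {
    pthread_mutex_t lock;      /* hypothetical per-dquot data lock */
    long long       curspace;  /* bytes currently charged */
};
#define DQUOT_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/* Stand-in for the global dq_data_lock. */
pthread_mutex_t dq_data_lock = PTHREAD_MUTEX_INITIALIZER;

/* Global scheme: a charge to ANY dquot serializes on the same lock. */
void charge_global(struct dquot_model *dq, long long bytes)
{
    assert(dq != NULL && bytes >= 0);
    pthread_mutex_lock(&dq_data_lock);
    dq->curspace += bytes;
    pthread_mutex_unlock(&dq_data_lock);
}

/* Per-dquot scheme: only callers charging the SAME dquot contend. */
void charge_per_dquot(struct dquot_model *dq, long long bytes)
{
    assert(dq != NULL && bytes >= 0);
    pthread_mutex_lock(&dq->lock);
    dq->curspace += bytes;
    pthread_mutex_unlock(&dq->lock);
}
```

Both functions produce the same accounting result. The difference is only in who waits: with charge_global(), writers for unrelated users block each other, while with charge_per_dquot() they do not. A real change would also have to keep the dquot consistent with the inode's byte counters, which this sketch ignores.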