    ae01f84b
    powerpc: Optimise per cpu accesses on 64bit · ae01f84b
    Anton Blanchard authored
    
    
    Now that we dynamically allocate the paca array, it takes an extra load
    whenever we want to access another cpu's paca. One place we do that a lot
    is per cpu variables. A simple example:
    
    DEFINE_PER_CPU(unsigned long, vara);
    unsigned long test4(int cpu)
    {
    	return per_cpu(vara, cpu);
    }
    
    This takes 4 loads, 5 if you include the actual load of the per cpu variable:
    
        ld r11,-32760(r30)  # load address of paca pointer
        ld r9,-32768(r30)   # load link address of percpu variable
        sldi r3,r29,9       # get offset into paca (each entry is 512 bytes)
        ld r0,0(r11)        # load paca pointer
        add r3,r0,r3        # paca + offset
        ld r11,64(r3)       # load paca[cpu].data_offset
    
        ldx r3,r9,r11       # load per cpu variable
    
    If we remove the ppc64 specific per_cpu_offset(), we get the generic one
    which indexes into a statically allocated array. This removes one load and
    one add:
    
        ld r11,-32760(r30)  # load address of __per_cpu_offset
        ld r9,-32768(r30)   # load link address of percpu variable
        sldi r3,r29,3       # get offset into __per_cpu_offset (each entry 8 bytes)
        ldx r11,r11,r3      # load __per_cpu_offset[cpu]
    
        ldx r3,r9,r11       # load per cpu variable
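The two lookups compared above can be sketched in plain user-space C. This is an illustrative model only, not the kernel's code: `struct paca_sketch`, `per_cpu_offset_table`, `setup_offsets()` and the tiny `NR_CPUS` value are hypothetical stand-ins, and the struct layout (512-byte entries with `data_offset` at byte 64) is taken from the comments in the assembly listings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 4 /* hypothetical; kept small for illustration */

/* Simplified stand-in for one paca entry: 512 bytes, data_offset at byte 64. */
struct paca_sketch {
    char pad0[64];
    unsigned long data_offset;
    char pad1[512 - 64 - sizeof(unsigned long)];
};

_Static_assert(sizeof(struct paca_sketch) == 512, "entry must be 512 bytes");
_Static_assert(offsetof(struct paca_sketch, data_offset) == 64,
               "data_offset must sit at byte 64");

/* Dynamically allocated, so reading the pointer itself costs a load. */
static struct paca_sketch *paca;

/* Statically allocated; stands in for the generic __per_cpu_offset array. */
static unsigned long per_cpu_offset_table[NR_CPUS];

/* Old scheme: load paca pointer, add cpu * 512, load data_offset. */
static unsigned long offset_via_paca(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return paca[cpu].data_offset;
}

/* New scheme: a single indexed load from the static array. */
static unsigned long offset_via_table(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return per_cpu_offset_table[cpu];
}

/* Fill both structures with the same made-up offsets. Returns 0 on success. */
static int setup_offsets(void)
{
    paca = calloc(NR_CPUS, sizeof(*paca));
    if (paca == NULL)
        return -1;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        unsigned long off = (unsigned long)cpu * 0x10000UL;
        paca[cpu].data_offset = off;
        per_cpu_offset_table[cpu] = off;
    }
    return 0;
}
```

Both functions return the same value for a given cpu; the difference is only in how many dependent loads the compiler has to emit to reach it.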
    
    Having all the offsets in one array also helps when iterating over a per cpu
    variable across a number of cpus, such as in the scheduler. Previously we
    needed to load one paca cacheline to calculate each per cpu offset. Now
    16 (128 / sizeof(long)) per cpu offsets fit in each cacheline.
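The cacheline arithmetic can be checked with a small C sketch. The constants come from the text (128-byte cachelines, 512-byte paca entries, 8-byte offsets); the function names are hypothetical, and the array case assumes the offset array starts on a cacheline boundary, which the commit message does not state.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES  128 /* cacheline size used in the text */
#define PACA_ENTRY_BYTES 512 /* size of one paca entry, per the assembly comments */
#define OFFSET_BYTES     8   /* sizeof(long) on ppc64; fixed here so the result
                                does not depend on the host */

/* How many per cpu offsets fit in one cacheline. */
static size_t offsets_per_cacheline(void)
{
    return CACHELINE_BYTES / OFFSET_BYTES;
}

/*
 * Cachelines touched when reading data_offset from ncpus paca entries.
 * Each entry spans several cachelines, so every cpu costs a distinct line.
 */
static size_t lines_touched_paca(size_t ncpus)
{
    assert(PACA_ENTRY_BYTES >= CACHELINE_BYTES);
    return ncpus;
}

/*
 * Cachelines touched when reading ncpus consecutive offsets from one array.
 * Assumption: the array is cacheline aligned; otherwise one more line
 * may be touched.
 */
static size_t lines_touched_table(size_t ncpus)
{
    return (ncpus * OFFSET_BYTES + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}
```

Under these assumptions, walking 64 cpus touches 64 cachelines with the paca scheme and 4 with the array.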
    
    Signed-off-by: Anton Blanchard <anton@samba.org>
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>