Date:	Mon, 16 May 2016 19:08:12 +0200
From:	Ingo Molnar <>
Subject: [GIT PULL] scheduler changes for v4.7

Linus,

Please pull the latest sched-core-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-for-linus

   # HEAD: ef0491ea17f8019821c7e9c8e801184ecf17f85a ARM: Hide finish_arch_post_lock_switch() from modules

 - massive CPU hotplug rework (Thomas Gleixner)
 - improve migration fairness (Peter Zijlstra)
 - CPU load calculation updates/cleanups (Yuyang Du)
 - cpufreq updates (Steve Muckle)
 - nohz optimizations (Frederic Weisbecker)
 - switch_mm() micro-optimization on x86 (Andy Lutomirski)
 - ... lots of other enhancements, fixes and cleanups.

 Thanks,

	Ingo

------------------>

Alexander Shishkin (1):
      perf/core, sched: Don't use clock function pointer to determine clock

Andy Lutomirski (5):
      sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h
      sched/core: Add switch_mm_irqs_off() and use it in the scheduler
      x86/mm: Build arch/x86/mm/tlb.c even on !SMP
      x86/mm, sched/core: Uninline switch_mm()
      x86/mm, sched/core: Turn off IRQs in switch_mm()

Anton Blanchard (1):
      sched/cpuacct: Check for NULL when using task_pt_regs()

Daniel Lezcano (2):
      sched/clock: Remove pointless test in cpu_clock/local_clock
      sched/clock: Make local_clock()/cpu_clock() inline

Davidlohr Bueso (1):
      sched/core: Fix comment typo in wake_q_add()

Dietmar Eggemann (2):
      sched/fair: Remove stale power aware scheduling comments
      sched/fair: Fix comment in calculate_imbalance()

Dongsheng Yang (1):
      sched/cpuacct: Split usage accounting into user_usage and sys_usage

Frederic Weisbecker (3):
      sched/fair: Gather CPU load functions under a more conventional namespace
      sched/fair: Correctly handle nohz ticks CPU load accounting
      sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates

Ingo Molnar (1):
      mm/mmu_context, sched/core: Fix mmu_context.h assumption

Matt Fleming (1):
      sched/fair: Update rq clock before updating nohz CPU load

Morten Rasmussen (1):
      sched/fair: Correct unit of load_above_capacity

Muhammad Falak R Wani (1):
      sched/core: Remove unused variable

Peter Zijlstra (10):
      sched/core: Move task_rq_lock() out of line
      sched/core: Introduce 'struct rq_flags'
      locking/lockdep, sched/core: Implement a better lock pinning scheme
      sched/core: Enable increased load resolution on 64-bit kernels
      sched/hotplug: Move sync_rcu to be with set_cpu_active(false)
      sched/fair: Move record_wakee()
      sched/fair: Prepare to fix fairness problems on migration
      sched/core: Kill sched_class::task_waking to clean up the migration logic
      sched/fair: Fix fairness issue on migration
      sched/fair: Clean up scale confusion

Peter Zijlstra (Intel) (1):
      sched: Allow per-cpu kernel threads to run on online && !active

Rabin Vincent (1):
      sched/debug: Don't dump sched debug info in SysRq-W

Srikar Dronamraju (2):
      sched/fair: Reset nr_balance_failed after active balancing
      sched/fair: Fix asym packing to select correct CPU

Steve Muckle (3):
      sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()
      sched/fair: Do not call cpufreq hook unless util changed
      sched/fair: Call cpufreq hook in additional paths

Steven Rostedt (2):
      sched/core: Add preempt checks in preempt_schedule() code
      ARM: Hide finish_arch_post_lock_switch() from modules

Thomas Gleixner (14):
      sched: Make set_cpu_rq_start_time() a built in hotplug state
      sched: Allow hotplug notifiers to be setup early
      sched: Consolidate the notifier maze
      sched: Move sched_domains_numa_masks_clear() to DOWN_PREPARE
      sched/hotplug: Convert cpu_[in]active notifiers to state machine
      sched/migration: Move prepare transition to SCHED_STARTING state
      sched/migration: Move calc_load_migrate() into CPU_DYING
      sched/migration: Move CPU_ONLINE into scheduler state
      sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
      sched/hotplug: Make activate() the last hotplug step
      sched/fair: Make ilb_notifier an explicit call
      sched: Make hrtick_notifier an explicit call
      sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
      sched/core: Provide a tsk_nr_cpus_allowed() helper

Tim Chen (1):
      sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels

Vik Heyndrickx (1):
      sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems

Wanpeng Li (3):
      sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
      sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernels
      sched/nohz: Fix affine unpinned timers mess

Xunlei Pang (1):
      sched/deadline: Fix a bug in dl_overflow()

Yuyang Du (6):
      sched/fair: Update comments after a variable rename
      sched/fair: Initiate a new task's util avg to a bounded value
      sched/fair: Generalize the load/util averages resolution definition
      sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE
      sched/fair: Add detailed description to the sched load avg metrics
      sched/fair: Optimize sum computation with a lookup table

Zhao Lei (1):
      sched/cpuacct: Show all possible CPUs in cpuacct output

 Documentation/trace/ftrace.txt     |  10 +-
 arch/arm/include/asm/mmu_context.h |   3 +
 arch/powerpc/kernel/smp.c          |   2 +-
 arch/s390/kernel/smp.c             |   2 +-
 arch/x86/events/core.c             |   2 +-
 arch/x86/include/asm/mmu_context.h | 101 +----
 arch/x86/mm/Makefile               |   3 +-
 arch/x86/mm/tlb.c                  | 116 ++++++
 include/linux/cpu.h                |  18 -
 include/linux/cpuhotplug.h         |   2 +
 include/linux/cpumask.h            |   6 +-
 include/linux/lockdep.h            |  23 +-
 include/linux/mmu_context.h        |   7 +
 include/linux/sched.h              | 124 +++++-
 kernel/cpu.c                       |  32 +-
 kernel/locking/lockdep.c           |  71 +++-
 kernel/sched/clock.c               |  48 +--
 kernel/sched/core.c                | 749 +++++++++++++++++++++----------------
 kernel/sched/cpuacct.c             | 147 ++++++--
 kernel/sched/cpudeadline.c         |   4 +-
 kernel/sched/cpupri.c              |   4 +-
 kernel/sched/deadline.c            |  55 +--
 kernel/sched/debug.c               |  10 +-
 kernel/sched/fair.c                | 513 ++++++++++++++++---------
 kernel/sched/idle_task.c           |   2 +-
 kernel/sched/loadavg.c             |  11 +-
 kernel/sched/rt.c                  |  38 +-
 kernel/sched/sched.h               | 140 +++----
 kernel/sched/stop_task.c           |   2 +-
 kernel/time/tick-sched.c           |   9 +-
 mm/mmu_context.c                   |   2 +-
 31 files changed, 1329 insertions(+), 927 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt

index f52f297cb406..9857606dd7b7 100644

--- a/Documentation/trace/ftrace.txt

+++ b/Documentation/trace/ftrace.txt

@@ -1562,12 +1562,12 @@ Doing the same with chrt -r 5 and function-trace set.

<idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit

<idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit

<idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit

- <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit

- <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz

+ <idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit

+ <idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz

<idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock

- <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz

- <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load

- <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz

+ <idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz

+ <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update

+ <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz

<idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock

<idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit

<idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h

index fa5b42d44985..3cc14dd8587c 100644

--- a/arch/arm/include/asm/mmu_context.h

+++ b/arch/arm/include/asm/mmu_context.h

@@ -15,6 +15,7 @@



#include <linux/compiler.h>

#include <linux/sched.h>

+#include <linux/preempt.h>

#include <asm/cacheflush.h>

#include <asm/cachetype.h>

#include <asm/proc-fns.h>

@@ -66,6 +67,7 @@ static inline void check_and_switch_context(struct mm_struct *mm,

cpu_switch_mm(mm->pgd, mm);

}



+#ifndef MODULE

#define finish_arch_post_lock_switch \

finish_arch_post_lock_switch

static inline void finish_arch_post_lock_switch(void)

@@ -87,6 +89,7 @@ static inline void finish_arch_post_lock_switch(void)

preempt_enable_no_resched();

}

}

+#endif /* !MODULE */



#endif /* CONFIG_MMU */



diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c

index 8cac1eb41466..55c924b65f71 100644

--- a/arch/powerpc/kernel/smp.c

+++ b/arch/powerpc/kernel/smp.c

@@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)

smp_ops->give_timebase();



/* Wait until cpu puts itself in the online & active maps */

- while (!cpu_online(cpu) || !cpu_active(cpu))

+ while (!cpu_online(cpu))

cpu_relax();



return 0;

diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c

index 40a6b4f9c36c..7b89a7572100 100644

--- a/arch/s390/kernel/smp.c

+++ b/arch/s390/kernel/smp.c

@@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)

pcpu_attach_task(pcpu, tidle);

pcpu_start_fn(pcpu, smp_start_secondary, NULL);

/* Wait until cpu puts itself in the online & active maps */

- while (!cpu_online(cpu) || !cpu_active(cpu))

+ while (!cpu_online(cpu))

cpu_relax();

return 0;

}
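
Note: both wait loops drop the cpu_active() check for the same reason:
after the hotplug rework below, cpu_active is only set in the final
CPUHP_AP_ACTIVE step, which runs after __cpu_up() has returned, so
spinning on it here could never succeed. The new contract, as a sketch
(my summary, not part of the patch):

	while (!cpu_online(cpu))	/* set by the AP during bringup */
		cpu_relax();
	/* cpu_active(cpu) becomes true later, via sched_cpu_activate() */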

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c

index 041e442a3e28..dd39fde66b54 100644

--- a/arch/x86/events/core.c

+++ b/arch/x86/events/core.c

@@ -2177,7 +2177,7 @@ void arch_perf_update_userpage(struct perf_event *event,

* cap_user_time_zero doesn't make sense when we're using a different

* time base for the records.

*/

- if (event->clock == &local_clock) {

+ if (!event->attr.use_clockid) {

userpg->cap_user_time_zero = 1;

userpg->time_zero = data->cyc2ns_offset;

}

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h

index 84280029cafd..396348196aa7 100644

--- a/arch/x86/include/asm/mmu_context.h

+++ b/arch/x86/include/asm/mmu_context.h

@@ -115,103 +115,12 @@ static inline void destroy_context(struct mm_struct *mm)

destroy_context_ldt(mm);

}



-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,

- struct task_struct *tsk)

-{

- unsigned cpu = smp_processor_id();

+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,

+ struct task_struct *tsk);



- if (likely(prev != next)) {

-#ifdef CONFIG_SMP

- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

- this_cpu_write(cpu_tlbstate.active_mm, next);

-#endif

- cpumask_set_cpu(cpu, mm_cpumask(next));

-

- /*

- * Re-load page tables.

- *

- * This logic has an ordering constraint:

- *

- * CPU 0: Write to a PTE for 'next'

- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.

- * CPU 1: set bit 1 in next's mm_cpumask

- * CPU 1: load from the PTE that CPU 0 writes (implicit)

- *

- * We need to prevent an outcome in which CPU 1 observes

- * the new PTE value and CPU 0 observes bit 1 clear in

- * mm_cpumask. (If that occurs, then the IPI will never

- * be sent, and CPU 0's TLB will contain a stale entry.)

- *

- * The bad outcome can occur if either CPU's load is

- * reordered before that CPU's store, so both CPUs must

- * execute full barriers to prevent this from happening.

- *

- * Thus, switch_mm needs a full barrier between the

- * store to mm_cpumask and any operation that could load

- * from next->pgd. TLB fills are special and can happen

- * due to instruction fetches or for no reason at all,

- * and neither LOCK nor MFENCE orders them.

- * Fortunately, load_cr3() is serializing and gives the

- * ordering guarantee we need.

- *

- */

- load_cr3(next->pgd);

-

- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

-

- /* Stop flush ipis for the previous mm */

- cpumask_clear_cpu(cpu, mm_cpumask(prev));

-

- /* Load per-mm CR4 state */

- load_mm_cr4(next);

-

-#ifdef CONFIG_MODIFY_LDT_SYSCALL

- /*

- * Load the LDT, if the LDT is different.

- *

- * It's possible that prev->context.ldt doesn't match

- * the LDT register. This can happen if leave_mm(prev)

- * was called and then modify_ldt changed

- * prev->context.ldt but suppressed an IPI to this CPU.

- * In this case, prev->context.ldt != NULL, because we

- * never set context.ldt to NULL while the mm still

- * exists. That means that next->context.ldt !=

- * prev->context.ldt, because mms never share an LDT.

- */

- if (unlikely(prev->context.ldt != next->context.ldt))

- load_mm_ldt(next);

-#endif

- }

-#ifdef CONFIG_SMP

- else {

- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

- BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

-

- if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {

- /*

- * On established mms, the mm_cpumask is only changed

- * from irq context, from ptep_clear_flush() while in

- * lazy tlb mode, and here. Irqs are blocked during

- * schedule, protecting us from simultaneous changes.

- */

- cpumask_set_cpu(cpu, mm_cpumask(next));

-

- /*

- * We were in lazy tlb mode and leave_mm disabled

- * tlb flush IPI delivery. We must reload CR3

- * to make sure to use no freed page tables.

- *

- * As above, load_cr3() is serializing and orders TLB

- * fills with respect to the mm_cpumask write.

- */

- load_cr3(next->pgd);

- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

- load_mm_cr4(next);

- load_mm_ldt(next);

- }

- }

-#endif

-}

+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

+ struct task_struct *tsk);

+#define switch_mm_irqs_off switch_mm_irqs_off



#define activate_mm(prev, next) \

do { \

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile

index f98913258c63..62c0043a5fd5 100644

--- a/arch/x86/mm/Makefile

+++ b/arch/x86/mm/Makefile

@@ -2,7 +2,7 @@

KCOV_INSTRUMENT_tlb.o := n



obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \

- pat.o pgtable.o physaddr.o gup.o setup_nx.o

+ pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o



# Make sure __phys_addr has no stackprotector

nostackp := $(call cc-option, -fno-stack-protector)

@@ -12,7 +12,6 @@ CFLAGS_setup_nx.o := $(nostackp)

CFLAGS_fault.o := -I$(src)/../include/asm/trace



obj-$(CONFIG_X86_PAT) += pat_rbtree.o

-obj-$(CONFIG_SMP) += tlb.o



obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o



diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c

index fe9b9f776361..5643fd0b1a7d 100644

--- a/arch/x86/mm/tlb.c

+++ b/arch/x86/mm/tlb.c

@@ -28,6 +28,8 @@

* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi

*/



+#ifdef CONFIG_SMP

+

struct flush_tlb_info {

struct mm_struct *flush_mm;

unsigned long flush_start;

@@ -57,6 +59,118 @@ void leave_mm(int cpu)

}

EXPORT_SYMBOL_GPL(leave_mm);



+#endif /* CONFIG_SMP */

+

+void switch_mm(struct mm_struct *prev, struct mm_struct *next,

+ struct task_struct *tsk)

+{

+ unsigned long flags;

+

+ local_irq_save(flags);

+ switch_mm_irqs_off(prev, next, tsk);

+ local_irq_restore(flags);

+}

+

+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

+ struct task_struct *tsk)

+{

+ unsigned cpu = smp_processor_id();

+

+ if (likely(prev != next)) {

+#ifdef CONFIG_SMP

+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

+ this_cpu_write(cpu_tlbstate.active_mm, next);

+#endif

+ cpumask_set_cpu(cpu, mm_cpumask(next));

+

+ /*

+ * Re-load page tables.

+ *

+ * This logic has an ordering constraint:

+ *

+ * CPU 0: Write to a PTE for 'next'

+ * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.

+ * CPU 1: set bit 1 in next's mm_cpumask

+ * CPU 1: load from the PTE that CPU 0 writes (implicit)

+ *

+ * We need to prevent an outcome in which CPU 1 observes

+ * the new PTE value and CPU 0 observes bit 1 clear in

+ * mm_cpumask. (If that occurs, then the IPI will never

+ * be sent, and CPU 0's TLB will contain a stale entry.)

+ *

+ * The bad outcome can occur if either CPU's load is

+ * reordered before that CPU's store, so both CPUs must

+ * execute full barriers to prevent this from happening.

+ *

+ * Thus, switch_mm needs a full barrier between the

+ * store to mm_cpumask and any operation that could load

+ * from next->pgd. TLB fills are special and can happen

+ * due to instruction fetches or for no reason at all,

+ * and neither LOCK nor MFENCE orders them.

+ * Fortunately, load_cr3() is serializing and gives the

+ * ordering guarantee we need.

+ *

+ */

+ load_cr3(next->pgd);

+

+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

+

+ /* Stop flush ipis for the previous mm */

+ cpumask_clear_cpu(cpu, mm_cpumask(prev));

+

+ /* Load per-mm CR4 state */

+ load_mm_cr4(next);

+

+#ifdef CONFIG_MODIFY_LDT_SYSCALL

+ /*

+ * Load the LDT, if the LDT is different.

+ *

+ * It's possible that prev->context.ldt doesn't match

+ * the LDT register. This can happen if leave_mm(prev)

+ * was called and then modify_ldt changed

+ * prev->context.ldt but suppressed an IPI to this CPU.

+ * In this case, prev->context.ldt != NULL, because we

+ * never set context.ldt to NULL while the mm still

+ * exists. That means that next->context.ldt !=

+ * prev->context.ldt, because mms never share an LDT.

+ */

+ if (unlikely(prev->context.ldt != next->context.ldt))

+ load_mm_ldt(next);

+#endif

+ }

+#ifdef CONFIG_SMP

+ else {

+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

+ BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

+

+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {

+ /*

+ * On established mms, the mm_cpumask is only changed

+ * from irq context, from ptep_clear_flush() while in

+ * lazy tlb mode, and here. Irqs are blocked during

+ * schedule, protecting us from simultaneous changes.

+ */

+ cpumask_set_cpu(cpu, mm_cpumask(next));

+

+ /*

+ * We were in lazy tlb mode and leave_mm disabled

+ * tlb flush IPI delivery. We must reload CR3

+ * to make sure to use no freed page tables.

+ *

+ * As above, load_cr3() is serializing and orders TLB

+ * fills with respect to the mm_cpumask write.

+ */

+ load_cr3(next->pgd);

+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

+ load_mm_cr4(next);

+ load_mm_ldt(next);

+ }

+ }

+#endif

+}

+

+#ifdef CONFIG_SMP

+

/*

* The flush IPI assumes that a thread switch happens in this order:

* [cpu0: the cpu that switches]

@@ -353,3 +467,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)

return 0;

}

late_initcall(create_tlb_single_page_flush_ceiling);

+

+#endif /* CONFIG_SMP */

diff --git a/include/linux/cpu.h b/include/linux/cpu.h

index f9b1fab4388a..21597dcac0e2 100644

--- a/include/linux/cpu.h

+++ b/include/linux/cpu.h

@@ -59,25 +59,7 @@ struct notifier_block;

* CPU notifier priorities.

*/

enum {

- /*

- * SCHED_ACTIVE marks a cpu which is coming up active during

- * CPU_ONLINE and CPU_DOWN_FAILED and must be the first

- * notifier. CPUSET_ACTIVE adjusts cpuset according to

- * cpu_active mask right after SCHED_ACTIVE. During

- * CPU_DOWN_PREPARE, SCHED_INACTIVE and CPUSET_INACTIVE are

- * ordered in the similar way.

- *

- * This ordering guarantees consistent cpu_active mask and

- * migration behavior to all cpu notifiers.

- */

- CPU_PRI_SCHED_ACTIVE = INT_MAX,

- CPU_PRI_CPUSET_ACTIVE = INT_MAX - 1,

- CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,

- CPU_PRI_CPUSET_INACTIVE = INT_MIN,

-

- /* migration should happen before other stuff but after perf */

CPU_PRI_PERF = 20,

- CPU_PRI_MIGRATION = 10,



/* bring up workqueues before normal notifiers and down after */

CPU_PRI_WORKQUEUE_UP = 5,

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h

index 5d68e15e46b7..386374d19987 100644

--- a/include/linux/cpuhotplug.h

+++ b/include/linux/cpuhotplug.h

@@ -8,6 +8,7 @@ enum cpuhp_state {

CPUHP_BRINGUP_CPU,

CPUHP_AP_IDLE_DEAD,

CPUHP_AP_OFFLINE,

+ CPUHP_AP_SCHED_STARTING,

CPUHP_AP_NOTIFY_STARTING,

CPUHP_AP_ONLINE,

CPUHP_TEARDOWN_CPU,

@@ -16,6 +17,7 @@ enum cpuhp_state {

CPUHP_AP_NOTIFY_ONLINE,

CPUHP_AP_ONLINE_DYN,

CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30,

+ CPUHP_AP_ACTIVE,

CPUHP_ONLINE,

};



diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h

index 40cee6b77a93..e828cf65d7df 100644

--- a/include/linux/cpumask.h

+++ b/include/linux/cpumask.h

@@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present)

static inline void

set_cpu_online(unsigned int cpu, bool online)

{

- if (online) {

+ if (online)

cpumask_set_cpu(cpu, &__cpu_online_mask);

- cpumask_set_cpu(cpu, &__cpu_active_mask);

- } else {

+ else

cpumask_clear_cpu(cpu, &__cpu_online_mask);

- }

}



static inline void
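
Note: set_cpu_online() no longer touches the active mask; with this
series cpu_active is driven solely by the scheduler's hotplug states
(see the kernel/cpu.c and kernel/sched/core.c changes below).
Conceptually, assuming sched_cpu_activate() does the late activation
(a sketch, not patch code):

	set_cpu_online(cpu, true);	/* early, during AP bringup */
	...				/* remaining hotplug states run */
	set_cpu_active(cpu, true);	/* late, from sched_cpu_activate() */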

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h

index d10ef06971b5..fb7d87e45fbe 100644

--- a/include/linux/lockdep.h

+++ b/include/linux/lockdep.h

@@ -356,8 +356,13 @@ extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);

extern void lockdep_clear_current_reclaim_state(void);

extern void lockdep_trace_alloc(gfp_t mask);



-extern void lock_pin_lock(struct lockdep_map *lock);

-extern void lock_unpin_lock(struct lockdep_map *lock);

+struct pin_cookie { unsigned int val; };

+

+#define NIL_COOKIE (struct pin_cookie){ .val = 0U, }

+

+extern struct pin_cookie lock_pin_lock(struct lockdep_map *lock);

+extern void lock_repin_lock(struct lockdep_map *lock, struct pin_cookie);

+extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);



# define INIT_LOCKDEP .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,



@@ -373,8 +378,9 @@ extern void lock_unpin_lock(struct lockdep_map *lock);



#define lockdep_recursing(tsk) ((tsk)->lockdep_recursion)



-#define lockdep_pin_lock(l) lock_pin_lock(&(l)->dep_map)

-#define lockdep_unpin_lock(l) lock_unpin_lock(&(l)->dep_map)

+#define lockdep_pin_lock(l) lock_pin_lock(&(l)->dep_map)

+#define lockdep_repin_lock(l,c) lock_repin_lock(&(l)->dep_map, (c))

+#define lockdep_unpin_lock(l,c) lock_unpin_lock(&(l)->dep_map, (c))



#else /* !CONFIG_LOCKDEP */



@@ -427,8 +433,13 @@ struct lock_class_key { };



#define lockdep_recursing(tsk) (0)



-#define lockdep_pin_lock(l) do { (void)(l); } while (0)

-#define lockdep_unpin_lock(l) do { (void)(l); } while (0)

+struct pin_cookie { };

+

+#define NIL_COOKIE (struct pin_cookie){ }

+

+#define lockdep_pin_lock(l) ({ struct pin_cookie cookie; cookie; })

+#define lockdep_repin_lock(l, c) do { (void)(l); (void)(c); } while (0)

+#define lockdep_unpin_lock(l, c) do { (void)(l); (void)(c); } while (0)



#endif /* !LOCKDEP */
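
Note: a minimal usage sketch of the new cookie API (illustration only;
it mirrors the ttwu_queue() hunk in kernel/sched/core.c below):

	struct pin_cookie cookie;

	raw_spin_lock(&rq->lock);
	cookie = lockdep_pin_lock(&rq->lock);	/* rq->lock must stay held */
	/* ... code that must not drop rq->lock ... */
	lockdep_unpin_lock(&rq->lock, cookie);	/* pass back the same cookie */
	raw_spin_unlock(&rq->lock);

The 16 bits of randomness in the cookie make a mismatched unpin show
up as a corrupted pin count instead of silently balancing the counter.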



diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h

index 70fffeba7495..a4441784503b 100644

--- a/include/linux/mmu_context.h

+++ b/include/linux/mmu_context.h

@@ -1,9 +1,16 @@

#ifndef _LINUX_MMU_CONTEXT_H

#define _LINUX_MMU_CONTEXT_H



+#include <asm/mmu_context.h>

+

struct mm_struct;



void use_mm(struct mm_struct *mm);

void unuse_mm(struct mm_struct *mm);



+/* Architectures that care about IRQ state in switch_mm can override this. */

+#ifndef switch_mm_irqs_off

+# define switch_mm_irqs_off switch_mm

+#endif

+

#endif
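
Note: with the fallback define above, generic code can call
switch_mm_irqs_off() unconditionally; the context_switch() hunk in
kernel/sched/core.c below then reads, in effect (sketch, annotations
mine):

	/* in context_switch(), rq->lock held, IRQs already disabled: */
	if (!mm) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm_irqs_off(oldmm, mm, next);

Architectures that don't provide switch_mm_irqs_off() fall back to
plain switch_mm() here; x86 provides it and thereby skips the
local_irq_save()/restore pair done by its switch_mm() wrapper (see
arch/x86/mm/tlb.c above).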

diff --git a/include/linux/sched.h b/include/linux/sched.h

index 52c4847b05e2..38526b67e787 100644

--- a/include/linux/sched.h

+++ b/include/linux/sched.h

@@ -178,9 +178,11 @@ extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load);

extern void calc_global_load(unsigned long ticks);



#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)

-extern void update_cpu_load_nohz(int active);

+extern void cpu_load_update_nohz_start(void);

+extern void cpu_load_update_nohz_stop(void);

#else

-static inline void update_cpu_load_nohz(int active) { }

+static inline void cpu_load_update_nohz_start(void) { }

+static inline void cpu_load_update_nohz_stop(void) { }

#endif



extern void dump_cpu_task(int cpu);

@@ -372,6 +374,15 @@ extern void cpu_init (void);

extern void trap_init(void);

extern void update_process_times(int user);

extern void scheduler_tick(void);

+extern int sched_cpu_starting(unsigned int cpu);

+extern int sched_cpu_activate(unsigned int cpu);

+extern int sched_cpu_deactivate(unsigned int cpu);

+

+#ifdef CONFIG_HOTPLUG_CPU

+extern int sched_cpu_dying(unsigned int cpu);

+#else

+# define sched_cpu_dying NULL

+#endif



extern void sched_show_task(struct task_struct *p);



@@ -935,9 +946,19 @@ enum cpu_idle_type {

};



/*

+ * Integer metrics need fixed point arithmetic, e.g., sched/fair

+ * has a few: load, load_avg, util_avg, freq, and capacity.

+ *

+ * We define a basic fixed point arithmetic range, and then formalize

+ * all these metrics based on that basic range.

+ */

+# define SCHED_FIXEDPOINT_SHIFT 10

+# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+

+/*

* Increase resolution of cpu_capacity calculations

*/

-#define SCHED_CAPACITY_SHIFT 10

+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT

#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)



/*

@@ -1199,18 +1220,56 @@ struct load_weight {

};



/*

- * The load_avg/util_avg accumulates an infinite geometric series.

- * 1) load_avg factors frequency scaling into the amount of time that a

- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the

- * aggregated such weights of all runnable and blocked sched_entities.

- * 2) util_avg factors frequency and cpu scaling into the amount of time

- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].

- * For cfs_rq, it is the aggregated such times of all runnable and

+ * The load_avg/util_avg accumulates an infinite geometric series

+ * (see __update_load_avg() in kernel/sched/fair.c).

+ *

+ * [load_avg definition]

+ *

+ * load_avg = runnable% * scale_load_down(load)

+ *

+ * where runnable% is the time ratio that a sched_entity is runnable.

+ * For cfs_rq, it is the aggregated load_avg of all runnable and

* blocked sched_entities.

- * The 64 bit load_sum can:

- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with

- * the highest weight (=88761) always runnable, we should not overflow

- * 2) for entity, support any load.weight always runnable

+ *

+ * load_avg may also take frequency scaling into account:

+ *

+ * load_avg = runnable% * scale_load_down(load) * freq%

+ *

+ * where freq% is the CPU frequency normalized to the highest frequency.

+ *

+ * [util_avg definition]

+ *

+ * util_avg = running% * SCHED_CAPACITY_SCALE

+ *

+ * where running% is the time ratio that a sched_entity is running on

+ * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable

+ * and blocked sched_entities.

+ *

+ * util_avg may also factor frequency scaling and CPU capacity scaling:

+ *

+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%

+ *

+ * where freq% is the same as above, and capacity% is the CPU capacity

+ * normalized to the greatest capacity (due to uarch differences, etc).

+ *

+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)

+ * themselves are in the range of [0, 1]. To do fixed point arithmetics,

+ * we therefore scale them to as large a range as necessary. This is for

+ * example reflected by util_avg's SCHED_CAPACITY_SCALE.

+ *

+ * [Overflow issue]

+ *

+ * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities

+ * with the highest load (=88761), always runnable on a single cfs_rq,

+ * and should not overflow as the number already hits PID_MAX_LIMIT.

+ *

+ * For all other cases (including 32-bit kernels), struct load_weight's

+ * weight will overflow first before we do, because:

+ *

+ * Max(load_avg) <= Max(load.weight)

+ *

+ * Then it is the load_weight's responsibility to consider overflow

+ * issues.

*/

struct sched_avg {

u64 last_update_time, load_sum;

@@ -1871,6 +1930,11 @@ extern int arch_task_struct_size __read_mostly;

/* Future-safe accessor for struct task_struct's cpus_allowed. */

#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)



+static inline int tsk_nr_cpus_allowed(struct task_struct *p)

+{

+ return p->nr_cpus_allowed;

+}

+

#define TNF_MIGRATED 0x01

#define TNF_NO_GROUP 0x02

#define TNF_SHARED 0x04

@@ -2303,8 +2367,6 @@ extern unsigned long long notrace sched_clock(void);

/*

* See the comment in kernel/sched/clock.c

*/

-extern u64 cpu_clock(int cpu);

-extern u64 local_clock(void);

extern u64 running_clock(void);

extern u64 sched_clock_cpu(int cpu);



@@ -2323,6 +2385,16 @@ static inline void sched_clock_idle_sleep_event(void)

static inline void sched_clock_idle_wakeup_event(u64 delta_ns)

{

}

+

+static inline u64 cpu_clock(int cpu)

+{

+ return sched_clock();

+}

+

+static inline u64 local_clock(void)

+{

+ return sched_clock();

+}

#else

/*

* Architectures can set this to 1 if they have specified

@@ -2337,6 +2409,26 @@ extern void clear_sched_clock_stable(void);

extern void sched_clock_tick(void);

extern void sched_clock_idle_sleep_event(void);

extern void sched_clock_idle_wakeup_event(u64 delta_ns);

+

+/*

+ * As outlined in clock.c, provides a fast, high resolution, nanosecond

+ * time source that is monotonic per cpu argument and has bounded drift

+ * between cpus.

+ *

+ * ######################### BIG FAT WARNING ##########################

+ * # when comparing cpu_clock(i) to cpu_clock(j) for i != j, time can #

+ * # go backwards !! #

+ * ####################################################################

+ */

+static inline u64 cpu_clock(int cpu)

+{

+ return sched_clock_cpu(cpu);

+}

+

+static inline u64 local_clock(void)

+{

+ return sched_clock_cpu(raw_smp_processor_id());

+}

#endif



#ifdef CONFIG_IRQ_TIME_ACCOUNTING
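
Note: to make the fixed point convention above concrete, a worked
example (my own, not from the patch): with SCHED_FIXEDPOINT_SHIFT == 10
all ratios are scaled by 1024, so

	50% runnable		-> encoded as 0.50 * 1024 = 512
	25% running on a full-capacity CPU
				-> util_avg = 0.25 * SCHED_CAPACITY_SCALE = 256

and a nice-0 task that is always runnable contributes
load_avg = scale_load_down(NICE_0_LOAD) = 1024.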

diff --git a/kernel/cpu.c b/kernel/cpu.c

index 3e3f6e49eabb..d948e44c471e 100644

--- a/kernel/cpu.c

+++ b/kernel/cpu.c

@@ -703,21 +703,6 @@ static int takedown_cpu(unsigned int cpu)

struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

int err;



- /*

- * By now we've cleared cpu_active_mask, wait for all preempt-disabled

- * and RCU users of this state to go away such that all new such users

- * will observe it.

- *

- * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might

- * not imply sync_sched(), so wait for both.

- *

- * Do sync before park smpboot threads to take care the rcu boost case.

- */

- if (IS_ENABLED(CONFIG_PREEMPT))

- synchronize_rcu_mult(call_rcu, call_rcu_sched);

- else

- synchronize_rcu();

-

/* Park the smpboot threads */

kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);

smpboot_park_threads(cpu);

@@ -923,8 +908,6 @@ void cpuhp_online_idle(enum cpuhp_state state)



st->state = CPUHP_AP_ONLINE_IDLE;



- /* The cpu is marked online, set it active now */

- set_cpu_active(cpu, true);

/* Unpark the stopper thread and the hotplug thread of this cpu */

stop_machine_unpark(cpu);

kthread_unpark(st->thread);

@@ -1236,6 +1219,12 @@ static struct cpuhp_step cpuhp_ap_states[] = {

.name = "ap:offline",

.cant_stop = true,

},

+ /* First state is scheduler control. Interrupts are disabled */

+ [CPUHP_AP_SCHED_STARTING] = {

+ .name = "sched:starting",

+ .startup = sched_cpu_starting,

+ .teardown = sched_cpu_dying,

+ },

/*

* Low level startup/teardown notifiers. Run with interrupts

* disabled. Will be removed once the notifiers are converted to

@@ -1274,6 +1263,15 @@ static struct cpuhp_step cpuhp_ap_states[] = {

* The dynamically registered state space is here

*/



+#ifdef CONFIG_SMP

+ /* Last state is scheduler control setting the cpu active */

+ [CPUHP_AP_ACTIVE] = {

+ .name = "sched:active",

+ .startup = sched_cpu_activate,

+ .teardown = sched_cpu_deactivate,

+ },

+#endif

+

/* CPU is fully up and running. */

[CPUHP_ONLINE] = {

.name = "online",

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c

index 78c1c0ee6dc1..68bc6a654ca3 100644

--- a/kernel/locking/lockdep.c

+++ b/kernel/locking/lockdep.c

@@ -45,6 +45,7 @@

#include <linux/bitops.h>

#include <linux/gfp.h>

#include <linux/kmemcheck.h>

+#include <linux/random.h>



#include <asm/sections.h>



@@ -3585,7 +3586,35 @@ static int __lock_is_held(struct lockdep_map *lock)

return 0;

}



-static void __lock_pin_lock(struct lockdep_map *lock)

+static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock)

+{

+ struct pin_cookie cookie = NIL_COOKIE;

+ struct task_struct *curr = current;

+ int i;

+

+ if (unlikely(!debug_locks))

+ return cookie;

+

+ for (i = 0; i < curr->lockdep_depth; i++) {

+ struct held_lock *hlock = curr->held_locks + i;

+

+ if (match_held_lock(hlock, lock)) {

+ /*

+ * Grab 16bits of randomness; this is sufficient to not

+ * be guessable and still allows some pin nesting in

+ * our u32 pin_count.

+ */

+ cookie.val = 1 + (prandom_u32() >> 16);

+ hlock->pin_count += cookie.val;

+ return cookie;

+ }

+ }

+

+	WARN(1, "pinning an unheld lock\n");

+ return cookie;

+}

+

+static void __lock_repin_lock(struct lockdep_map *lock, struct pin_cookie cookie)

{

struct task_struct *curr = current;

int i;

@@ -3597,7 +3626,7 @@ static void __lock_pin_lock(struct lockdep_map *lock)

struct held_lock *hlock = curr->held_locks + i;



if (match_held_lock(hlock, lock)) {

- hlock->pin_count++;

+ hlock->pin_count += cookie.val;

return;

}

}

@@ -3605,7 +3634,7 @@ static void __lock_pin_lock(struct lockdep_map *lock)

WARN(1, "pinning an unheld lock\n");

}



-static void __lock_unpin_lock(struct lockdep_map *lock)

+static void __lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie cookie)

{

struct task_struct *curr = current;

int i;

@@ -3620,7 +3649,11 @@ static void __lock_unpin_lock(struct lockdep_map *lock)

if (WARN(!hlock->pin_count, "unpinning an unpinned lock\n"))

return;



- hlock->pin_count--;

+ hlock->pin_count -= cookie.val;

+

+	if (WARN((int)hlock->pin_count < 0, "pin count corrupted\n"))

+ hlock->pin_count = 0;

+

return;

}

}

@@ -3751,24 +3784,44 @@ int lock_is_held(struct lockdep_map *lock)

}

EXPORT_SYMBOL_GPL(lock_is_held);



-void lock_pin_lock(struct lockdep_map *lock)

+struct pin_cookie lock_pin_lock(struct lockdep_map *lock)

{

+ struct pin_cookie cookie = NIL_COOKIE;

unsigned long flags;



if (unlikely(current->lockdep_recursion))

- return;

+ return cookie;



raw_local_irq_save(flags);

check_flags(flags);



current->lockdep_recursion = 1;

- __lock_pin_lock(lock);

+ cookie = __lock_pin_lock(lock);

current->lockdep_recursion = 0;

raw_local_irq_restore(flags);

+

+ return cookie;

}

EXPORT_SYMBOL_GPL(lock_pin_lock);



-void lock_unpin_lock(struct lockdep_map *lock)

+void lock_repin_lock(struct lockdep_map *lock, struct pin_cookie cookie)

+{

+ unsigned long flags;

+

+ if (unlikely(current->lockdep_recursion))

+ return;

+

+ raw_local_irq_save(flags);

+ check_flags(flags);

+

+ current->lockdep_recursion = 1;

+ __lock_repin_lock(lock, cookie);

+ current->lockdep_recursion = 0;

+ raw_local_irq_restore(flags);

+}

+EXPORT_SYMBOL_GPL(lock_repin_lock);

+

+void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie cookie)

{

unsigned long flags;



@@ -3779,7 +3832,7 @@ void lock_unpin_lock(struct lockdep_map *lock)

check_flags(flags);



current->lockdep_recursion = 1;

- __lock_unpin_lock(lock);

+ __lock_unpin_lock(lock, cookie);

current->lockdep_recursion = 0;

raw_local_irq_restore(flags);

}

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c

index fedb967a9841..e85a725e5c34 100644

--- a/kernel/sched/clock.c

+++ b/kernel/sched/clock.c

@@ -318,6 +318,7 @@ u64 sched_clock_cpu(int cpu)



return clock;

}

+EXPORT_SYMBOL_GPL(sched_clock_cpu);



void sched_clock_tick(void)

{

@@ -363,39 +364,6 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)

}

EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);



-/*

- * As outlined at the top, provides a fast, high resolution, nanosecond

- * time source that is monotonic per cpu argument and has bounded drift

- * between cpus.

- *

- * ######################### BIG FAT WARNING ##########################

- * # when comparing cpu_clock(i) to cpu_clock(j) for i != j, time can #

- * # go backwards !! #

- * ####################################################################

- */

-u64 cpu_clock(int cpu)

-{

- if (!sched_clock_stable())

- return sched_clock_cpu(cpu);

-

- return sched_clock();

-}

-

-/*

- * Similar to cpu_clock() for the current cpu. Time will only be observed

- * to be monotonic if care is taken to only compare timestampt taken on the

- * same CPU.

- *

- * See cpu_clock().

- */

-u64 local_clock(void)

-{

- if (!sched_clock_stable())

- return sched_clock_cpu(raw_smp_processor_id());

-

- return sched_clock();

-}

-

#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */



void sched_clock_init(void)

@@ -410,22 +378,8 @@ u64 sched_clock_cpu(int cpu)



return sched_clock();

}

-

-u64 cpu_clock(int cpu)

-{

- return sched_clock();

-}

-

-u64 local_clock(void)

-{

- return sched_clock();

-}

-

#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */



-EXPORT_SYMBOL_GPL(cpu_clock);

-EXPORT_SYMBOL_GPL(local_clock);

-

/*

* Running clock - returns the time that has elapsed while a guest has been

* running.
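
Note: per the "Remove pointless test in cpu_clock/local_clock" commit
in the shortlog, sched_clock_cpu() already performs the stable-clock
check itself, so the new inline wrappers in <linux/sched.h> above are
behaviourally equivalent to the functions removed here. The caveat
from the moved comment, restated as a sketch (mine):

	u64 t0 = cpu_clock(cpu);
	/* ... work ... */
	u64 t1 = cpu_clock(cpu);	/* same cpu: t1 - t0 is a sane delta */

	/* but cpu_clock(i) - cpu_clock(j), i != j, may go backwards */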

diff --git a/kernel/sched/core.c b/kernel/sched/core.c

index d1f7149f8704..404c0784b1fc 100644

--- a/kernel/sched/core.c

+++ b/kernel/sched/core.c

@@ -33,7 +33,7 @@

#include <linux/init.h>

#include <linux/uaccess.h>

#include <linux/highmem.h>

-#include <asm/mmu_context.h>

+#include <linux/mmu_context.h>

#include <linux/interrupt.h>

#include <linux/capability.h>

#include <linux/completion.h>

@@ -170,6 +170,71 @@ static struct rq *this_rq_lock(void)

return rq;

}



+/*

+ * __task_rq_lock - lock the rq @p resides on.

+ */

+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

+ __acquires(rq->lock)

+{

+ struct rq *rq;

+

+ lockdep_assert_held(&p->pi_lock);

+

+ for (;;) {

+ rq = task_rq(p);

+ raw_spin_lock(&rq->lock);

+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {

+ rf->cookie = lockdep_pin_lock(&rq->lock);

+ return rq;

+ }

+ raw_spin_unlock(&rq->lock);

+

+ while (unlikely(task_on_rq_migrating(p)))

+ cpu_relax();

+ }

+}

+

+/*

+ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.

+ */

+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)

+ __acquires(p->pi_lock)

+ __acquires(rq->lock)

+{

+ struct rq *rq;

+

+ for (;;) {

+ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);

+ rq = task_rq(p);

+ raw_spin_lock(&rq->lock);

+ /*

+ * move_queued_task() task_rq_lock()

+ *

+ * ACQUIRE (rq->lock)

+ * [S] ->on_rq = MIGRATING [L] rq = task_rq()

+ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);

+ * [S] ->cpu = new_cpu [L] task_rq()

+ * [L] ->on_rq

+ * RELEASE (rq->lock)

+ *

+ * If we observe the old cpu in task_rq_lock, the acquire of

+ * the old rq->lock will fully serialize against the stores.

+ *

+ * If we observe the new cpu in task_rq_lock, the acquire will

+ * pair with the WMB to ensure we must then also see migrating.

+ */

+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {

+ rf->cookie = lockdep_pin_lock(&rq->lock);

+ return rq;

+ }

+ raw_spin_unlock(&rq->lock);

+ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

+

+ while (unlikely(task_on_rq_migrating(p)))

+ cpu_relax();

+ }

+}

+

#ifdef CONFIG_SCHED_HRTICK

/*

* Use HR-timers to deliver accurate preemption points.

@@ -249,29 +314,6 @@ void hrtick_start(struct rq *rq, u64 delay)

}

}



-static int

-hotplug_hrtick(struct notifier_block *nfb, unsigned long action, void *hcpu)

-{

- int cpu = (int)(long)hcpu;

-

- switch (action) {

- case CPU_UP_CANCELED:

- case CPU_UP_CANCELED_FROZEN:

- case CPU_DOWN_PREPARE:

- case CPU_DOWN_PREPARE_FROZEN:

- case CPU_DEAD:

- case CPU_DEAD_FROZEN:

- hrtick_clear(cpu_rq(cpu));

- return NOTIFY_OK;

- }

-

- return NOTIFY_DONE;

-}

-

-static __init void init_hrtick(void)

-{

- hotcpu_notifier(hotplug_hrtick, 0);

-}

#else

/*

* Called to set the hrtick timer state.

@@ -288,10 +330,6 @@ void hrtick_start(struct rq *rq, u64 delay)

hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),

HRTIMER_MODE_REL_PINNED);

}

-

-static inline void init_hrtick(void)

-{

-}

#endif /* CONFIG_SMP */



static void init_rq_hrtick(struct rq *rq)

@@ -315,10 +353,6 @@ static inline void hrtick_clear(struct rq *rq)

static inline void init_rq_hrtick(struct rq *rq)

{

}

-

-static inline void init_hrtick(void)

-{

-}

#endif /* CONFIG_SCHED_HRTICK */



/*

@@ -400,7 +434,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)

* wakeup due to that.

*

* This cmpxchg() implies a full barrier, which pairs with the write

- * barrier implied by the wakeup in wake_up_list().

+ * barrier implied by the wakeup in wake_up_q().

*/

if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))

return;

@@ -499,7 +533,10 @@ int get_nohz_timer_target(void)

rcu_read_lock();

for_each_domain(cpu, sd) {

for_each_cpu(i, sched_domain_span(sd)) {

- if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) {

+ if (cpu == i)

+ continue;

+

+ if (!idle_cpu(i) && is_housekeeping_cpu(i)) {

cpu = i;

goto unlock;

}

@@ -1085,12 +1122,20 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)

static int __set_cpus_allowed_ptr(struct task_struct *p,

const struct cpumask *new_mask, bool check)

{

- unsigned long flags;

- struct rq *rq;

+ const struct cpumask *cpu_valid_mask = cpu_active_mask;

unsigned int dest_cpu;

+ struct rq_flags rf;

+ struct rq *rq;

int ret = 0;



- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

+

+ if (p->flags & PF_KTHREAD) {

+ /*

+ * Kernel threads are allowed on online && !active CPUs

+ */

+ cpu_valid_mask = cpu_online_mask;

+ }



/*

* Must re-check here, to close a race against __kthread_bind(),

@@ -1104,22 +1149,32 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,

if (cpumask_equal(&p->cpus_allowed, new_mask))

goto out;



- if (!cpumask_intersects(new_mask, cpu_active_mask)) {

+ if (!cpumask_intersects(new_mask, cpu_valid_mask)) {

ret = -EINVAL;

goto out;

}



do_set_cpus_allowed(p, new_mask);



+ if (p->flags & PF_KTHREAD) {

+ /*

+ * For kernel threads that do indeed end up on online &&

+ * !active we want to ensure they are strict per-cpu threads.

+ */

+ WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&

+ !cpumask_intersects(new_mask, cpu_active_mask) &&

+ p->nr_cpus_allowed != 1);

+ }

+

/* Can the task run on the task's current CPU? If so, we're done */

if (cpumask_test_cpu(task_cpu(p), new_mask))

goto out;



- dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);

+ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);

if (task_running(rq, p) || p->state == TASK_WAKING) {

struct migration_arg arg = { p, dest_cpu };

/* Need help from migration thread: drop lock and wait. */

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

tlb_migrate_finish(p->mm);

return 0;

@@ -1128,12 +1183,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,

* OK, since we're going to drop the lock immediately

* afterwards anyway.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, rf.cookie);

rq = move_queued_task(rq, p, dest_cpu);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, rf.cookie);

}

out:

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



return ret;

}

@@ -1317,8 +1372,8 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)

*/

unsigned long wait_task_inactive(struct task_struct *p, long match_state)

{

- unsigned long flags;

int running, queued;

+ struct rq_flags rf;

unsigned long ncsw;

struct rq *rq;



@@ -1353,14 +1408,14 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)

* lock now, to be *sure*. If we're wrong, we'll

* just go back and repeat.

*/

- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

trace_sched_wait_task(p);

running = task_running(rq, p);

queued = task_on_rq_queued(p);

ncsw = 0;

if (!match_state || p->state == match_state)

ncsw = p->nvcsw | LONG_MIN; /* sets MSB */

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



/*

* If it changed from the expected state, bail out now.

@@ -1434,6 +1489,25 @@ EXPORT_SYMBOL_GPL(kick_process);



/*

* ->cpus_allowed is protected by both rq->lock and p->pi_lock

+ *

+ * A few notes on cpu_active vs cpu_online:

+ *

+ * - cpu_active must be a subset of cpu_online

+ *

+ * - on cpu-up we allow per-cpu kthreads on the online && !active cpu,

+ * see __set_cpus_allowed_ptr(). At this point the newly online

+ * cpu isn't yet part of the sched domains, and balancing will not

+ * see it.

+ *

+ * - on cpu-down we clear cpu_active() to mask the sched domains and

+ * avoid the load balancer to place new tasks on the to be removed

+ * cpu. Existing tasks will remain running there and will be taken

+ * off.

+ *

+ * This means that fallback selection must not select !active CPUs.

+ * And can assume that any active CPU must be online. Conversely

+ * select_task_rq() below may allow selection of !active CPUs in order

+ * to satisfy the above rules.

*/

static int select_fallback_rq(int cpu, struct task_struct *p)

{

@@ -1452,8 +1526,6 @@ static int select_fallback_rq(int cpu, struct task_struct *p)



/* Look for allowed, online CPU in same node. */

for_each_cpu(dest_cpu, nodemask) {

- if (!cpu_online(dest_cpu))

- continue;

if (!cpu_active(dest_cpu))

continue;

if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))

@@ -1464,8 +1536,6 @@ static int select_fallback_rq(int cpu, struct task_struct *p)

for (;;) {

/* Any allowed, online CPU? */

for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {

- if (!cpu_online(dest_cpu))

- continue;

if (!cpu_active(dest_cpu))

continue;

goto out;

@@ -1515,8 +1585,10 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)

{

lockdep_assert_held(&p->pi_lock);



- if (p->nr_cpus_allowed > 1)

+ if (tsk_nr_cpus_allowed(p) > 1)

cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);

+ else

+ cpu = cpumask_any(tsk_cpus_allowed(p));



/*

* In order not to call set_task_cpu() on a blocking task we need

@@ -1604,8 +1676,8 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl

/*

* Mark the task runnable and perform wakeup-preemption.

*/

-static void

-ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

+static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,

+ struct pin_cookie cookie)

{

check_preempt_curr(rq, p, wake_flags);

p->state = TASK_RUNNING;

@@ -1617,9 +1689,9 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

* Our task @p is fully woken up and running; so its safe to

* drop the rq->lock, hereafter rq is only used for statistics.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

p->sched_class->task_woken(rq, p);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, cookie);

}



if (rq->idle_stamp) {

@@ -1637,17 +1709,23 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

}



static void

-ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)

+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,

+ struct pin_cookie cookie)

{

+ int en_flags = ENQUEUE_WAKEUP;

+

lockdep_assert_held(&rq->lock);



#ifdef CONFIG_SMP

if (p->sched_contributes_to_load)

rq->nr_uninterruptible--;

+

+ if (wake_flags & WF_MIGRATED)

+ en_flags |= ENQUEUE_MIGRATED;

#endif



- ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);

- ttwu_do_wakeup(rq, p, wake_flags);

+ ttwu_activate(rq, p, en_flags);

+ ttwu_do_wakeup(rq, p, wake_flags, cookie);

}



/*

@@ -1658,17 +1736,18 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)

*/

static int ttwu_remote(struct task_struct *p, int wake_flags)

{

+ struct rq_flags rf;

struct rq *rq;

int ret = 0;



- rq = __task_rq_lock(p);

+ rq = __task_rq_lock(p, &rf);

if (task_on_rq_queued(p)) {

/* check_preempt_curr() may use rq clock */

update_rq_clock(rq);

- ttwu_do_wakeup(rq, p, wake_flags);

+ ttwu_do_wakeup(rq, p, wake_flags, rf.cookie);

ret = 1;

}

- __task_rq_unlock(rq);

+ __task_rq_unlock(rq, &rf);



return ret;

}

@@ -1678,6 +1757,7 @@ void sched_ttwu_pending(void)

{

struct rq *rq = this_rq();

struct llist_node *llist = llist_del_all(&rq->wake_list);

+ struct pin_cookie cookie;

struct task_struct *p;

unsigned long flags;



@@ -1685,15 +1765,19 @@ void sched_ttwu_pending(void)

return;



raw_spin_lock_irqsave(&rq->lock, flags);

- lockdep_pin_lock(&rq->lock);

+ cookie = lockdep_pin_lock(&rq->lock);



while (llist) {

p = llist_entry(llist, struct task_struct, wake_entry);

llist = llist_next(llist);

- ttwu_do_activate(rq, p, 0);

+ /*

+ * See ttwu_queue(); we only call ttwu_queue_remote() when

+ * its a x-cpu wakeup.

+ */

+ ttwu_do_activate(rq, p, WF_MIGRATED, cookie);

}



- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

raw_spin_unlock_irqrestore(&rq->lock, flags);

}



@@ -1777,9 +1861,10 @@ bool cpus_share_cache(int this_cpu, int that_cpu)

}

#endif /* CONFIG_SMP */



-static void ttwu_queue(struct task_struct *p, int cpu)

+static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)

{

struct rq *rq = cpu_rq(cpu);

+ struct pin_cookie cookie;



#if defined(CONFIG_SMP)

if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {

@@ -1790,9 +1875,9 @@ static void ttwu_queue(struct task_struct *p, int cpu)

#endif



raw_spin_lock(&rq->lock);

- lockdep_pin_lock(&rq->lock);

- ttwu_do_activate(rq, p, 0);

- lockdep_unpin_lock(&rq->lock);

+ cookie = lockdep_pin_lock(&rq->lock);

+ ttwu_do_activate(rq, p, wake_flags, cookie);

+ lockdep_unpin_lock(&rq->lock, cookie);

raw_spin_unlock(&rq->lock);

}



@@ -1961,9 +2046,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

p->sched_contributes_to_load = !!task_contributes_to_load(p);

p->state = TASK_WAKING;



- if (p->sched_class->task_waking)

- p->sched_class->task_waking(p);

-

cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

if (task_cpu(p) != cpu) {

wake_flags |= WF_MIGRATED;

@@ -1971,7 +2053,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

}

#endif /* CONFIG_SMP */



- ttwu_queue(p, cpu);

+ ttwu_queue(p, cpu, wake_flags);

stat:

if (schedstat_enabled())

ttwu_stat(p, cpu, wake_flags);

@@ -1989,7 +2071,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

* ensure that this_rq() is locked, @p is bound to this_rq() and not

* the current task.

*/

-static void try_to_wake_up_local(struct task_struct *p)

+static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie)

{

struct rq *rq = task_rq(p);



@@ -2006,11 +2088,11 @@ static void try_to_wake_up_local(struct task_struct *p)

* disabled avoiding further scheduler activity on it and we've

* not yet picked a replacement task.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

raw_spin_unlock(&rq->lock);

raw_spin_lock(&p->pi_lock);

raw_spin_lock(&rq->lock);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, cookie);

}



if (!(p->state & TASK_NORMAL))

@@ -2021,7 +2103,7 @@ static void try_to_wake_up_local(struct task_struct *p)

if (!task_on_rq_queued(p))

ttwu_activate(rq, p, ENQUEUE_WAKEUP);



- ttwu_do_wakeup(rq, p, 0);

+ ttwu_do_wakeup(rq, p, 0, cookie);

if (schedstat_enabled())

ttwu_stat(p, smp_processor_id(), 0);

out:

@@ -2381,7 +2463,8 @@ static int dl_overflow(struct task_struct *p, int policy,

u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;

int cpus, err = -1;



- if (new_bw == p->dl.dl_bw)

+ /* !deadline task may carry old deadline bandwidth */

+ if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))

return 0;



/*

@@ -2420,12 +2503,12 @@ extern void init_dl_bw(struct dl_bw *dl_b);

*/

void wake_up_new_task(struct task_struct *p)

{

- unsigned long flags;

+ struct rq_flags rf;

struct rq *rq;



- raw_spin_lock_irqsave(&p->pi_lock, flags);

/* Initialize new task's runnable average */

init_entity_runnable_average(&p->se);

+ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);

#ifdef CONFIG_SMP

/*

* Fork balancing, do it here and not earlier because:

@@ -2434,8 +2517,10 @@ void wake_up_new_task(struct task_struct *p)

*/

set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));

#endif

+ /* Post initialize new task's util average when its cfs_rq is set */

+ post_init_entity_util_avg(&p->se);



- rq = __task_rq_lock(p);

+ rq = __task_rq_lock(p, &rf);

activate_task(rq, p, 0);

p->on_rq = TASK_ON_RQ_QUEUED;

trace_sched_wakeup_new(p);

@@ -2446,12 +2531,12 @@ void wake_up_new_task(struct task_struct *p)

* Nothing relies on rq->lock after this, so its fine to

* drop it.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, rf.cookie);

p->sched_class->task_woken(rq, p);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, rf.cookie);

}

#endif

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

}



#ifdef CONFIG_PREEMPT_NOTIFIERS

@@ -2713,7 +2798,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)

*/

static __always_inline struct rq *

context_switch(struct rq *rq, struct task_struct *prev,

- struct task_struct *next)

+ struct task_struct *next, struct pin_cookie cookie)

{

struct mm_struct *mm, *oldmm;



@@ -2733,7 +2818,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

atomic_inc(&oldmm->mm_count);

enter_lazy_tlb(oldmm, next);

} else

- switch_mm(oldmm, mm, next);

+ switch_mm_irqs_off(oldmm, mm, next);



if (!prev->mm) {

prev->active_mm = NULL;

@@ -2745,7 +2830,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

* of the scheduler it's an obvious special-case), so we

* do an early lockdep release here:

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

spin_release(&rq->lock.dep_map, 1, _THIS_IP_);



/* Here we just switch the register state and the stack. */

@@ -2867,7 +2952,7 @@ EXPORT_PER_CPU_SYMBOL(kernel_cpustat);

*/

unsigned long long task_sched_runtime(struct task_struct *p)

{

- unsigned long flags;

+ struct rq_flags rf;

struct rq *rq;

u64 ns;



@@ -2887,7 +2972,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)

return p->se.sum_exec_runtime;

#endif



- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

/*

* Must be ->curr _and_ ->on_rq. If dequeued, we would

* project cycles that may never be accounted to this

@@ -2898,7 +2983,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)

p->sched_class->update_curr(rq);

}

ns = p->se.sum_exec_runtime;

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



return ns;

}

@@ -2918,7 +3003,7 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);

update_rq_clock(rq);

curr->sched_class->task_tick(rq, curr, 0);

- update_cpu_load_active(rq);

+ cpu_load_update_active(rq);

calc_global_load_tick(rq);

raw_spin_unlock(&rq->lock);



@@ -2961,6 +3046,20 @@ u64 scheduler_tick_max_deferment(void)



#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \

defined(CONFIG_PREEMPT_TRACER))

+/*

+ * If the value passed in is equal to the current preempt count

+ * then we just disabled preemption. Start timing the latency.

+ */

+static inline void preempt_latency_start(int val)

+{

+ if (preempt_count() == val) {

+ unsigned long ip = get_lock_parent_ip();

+#ifdef CONFIG_DEBUG_PREEMPT

+ current->preempt_disable_ip = ip;

+#endif

+ trace_preempt_off(CALLER_ADDR0, ip);

+ }

+}



void preempt_count_add(int val)

{

@@ -2979,17 +3078,21 @@ void preempt_count_add(int val)

DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=

PREEMPT_MASK - 10);

#endif

- if (preempt_count() == val) {

- unsigned long ip = get_lock_parent_ip();

-#ifdef CONFIG_DEBUG_PREEMPT

- current->preempt_disable_ip = ip;

-#endif

- trace_preempt_off(CALLER_ADDR0, ip);

- }

+ preempt_latency_start(val);

}

EXPORT_SYMBOL(preempt_count_add);

NOKPROBE_SYMBOL(preempt_count_add);



+/*

+ * If the value passed in equals to the current preempt count

+ * then we just enabled preemption. Stop timing the latency.

+ */

+static inline void preempt_latency_stop(int val)

+{

+ if (preempt_count() == val)

+ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());

+}

+

void preempt_count_sub(int val)

{

#ifdef CONFIG_DEBUG_PREEMPT

@@ -3006,13 +3109,15 @@ void preempt_count_sub(int val)

return;

#endif



- if (preempt_count() == val)

- trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());

+ preempt_latency_stop(val);

__preempt_count_sub(val);

}

EXPORT_SYMBOL(preempt_count_sub);

NOKPROBE_SYMBOL(preempt_count_sub);



+#else

+static inline void preempt_latency_start(int val) { }

+static inline void preempt_latency_stop(int val) { }

#endif
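
[ A minimal sketch, not from the patch, of how the factored-out helpers
  pair up around a preemption point; preempt_schedule_common() below uses
  exactly this shape. ]

        /* preempt_count goes 0 -> 1: latency timing starts */
        preempt_disable_notrace();
        preempt_latency_start(1);

        __schedule(true);

        /* preempt_count is about to return to 0: latency timing stops */
        preempt_latency_stop(1);
        preempt_enable_no_resched_notrace();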



/*

@@ -3065,7 +3170,7 @@ static inline void schedule_debug(struct task_struct *prev)

* Pick up the highest-prio task:

*/

static inline struct task_struct *

-pick_next_task(struct rq *rq, struct task_struct *prev)

+pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)

{

const struct sched_class *class = &fair_sched_class;

struct task_struct *p;

@@ -3076,20 +3181,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev)

*/

if (likely(prev->sched_class == class &&

rq->nr_running == rq->cfs.h_nr_running)) {

- p = fair_sched_class.pick_next_task(rq, prev);

+ p = fair_sched_class.pick_next_task(rq, prev, cookie);

if (unlikely(p == RETRY_TASK))

goto again;



/* assumes fair_sched_class->next == idle_sched_class */

if (unlikely(!p))

- p = idle_sched_class.pick_next_task(rq, prev);

+ p = idle_sched_class.pick_next_task(rq, prev, cookie);



return p;

}



again:

for_each_class(class) {

- p = class->pick_next_task(rq, prev);

+ p = class->pick_next_task(rq, prev, cookie);

if (p) {

if (unlikely(p == RETRY_TASK))

goto again;

@@ -3143,6 +3248,7 @@ static void __sched notrace __schedule(bool preempt)

{

struct task_struct *prev, *next;

unsigned long *switch_count;

+ struct pin_cookie cookie;

struct rq *rq;

int cpu;



@@ -3176,7 +3282,7 @@ static void __sched notrace __schedule(bool preempt)

*/

smp_mb__before_spinlock();

raw_spin_lock(&rq->lock);

- lockdep_pin_lock(&rq->lock);

+ cookie = lockdep_pin_lock(&rq->lock);



rq->clock_skip_update <<= 1; /* promote REQ to ACT */



@@ -3198,7 +3304,7 @@ static void __sched notrace __schedule(bool preempt)



to_wakeup = wq_worker_sleeping(prev);

if (to_wakeup)

- try_to_wake_up_local(to_wakeup);

+ try_to_wake_up_local(to_wakeup, cookie);

}

}

switch_count = &prev->nvcsw;

@@ -3207,7 +3313,7 @@ static void __sched notrace __schedule(bool preempt)

if (task_on_rq_queued(prev))

update_rq_clock(rq);



- next = pick_next_task(rq, prev);

+ next = pick_next_task(rq, prev, cookie);

clear_tsk_need_resched(prev);

clear_preempt_need_resched();

rq->clock_skip_update = 0;

@@ -3218,9 +3324,9 @@ static void __sched notrace __schedule(bool preempt)

++*switch_count;



trace_sched_switch(preempt, prev, next);

- rq = context_switch(rq, prev, next); /* unlocks the rq */

+ rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */

} else {

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

raw_spin_unlock_irq(&rq->lock);

}
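
[ Illustration only: the contract of cookie-based pinning. lockdep_pin_lock()
  now returns a cookie, and every unpin/repin must present that same cookie,
  so lockdep can catch mismatched pin/unpin pairs that the old boolean
  scheme could not. ]

static void example_pinned_section(struct rq *rq)
{
        struct pin_cookie cookie;

        raw_spin_lock(&rq->lock);
        cookie = lockdep_pin_lock(&rq->lock);   /* rq->lock must stay held */

        /*
         * To legitimately drop the lock mid-way, lift the pin first and
         * restore it afterwards with the same cookie; dl_task_timer()
         * below does exactly this around push_dl_task().
         */
        lockdep_unpin_lock(&rq->lock, cookie);
        /* ... a callee may drop and re-take rq->lock here ... */
        lockdep_repin_lock(&rq->lock, cookie);

        lockdep_unpin_lock(&rq->lock, cookie);
        raw_spin_unlock(&rq->lock);
}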



@@ -3287,8 +3393,23 @@ void __sched schedule_preempt_disabled(void)

static void __sched notrace preempt_schedule_common(void)

{

do {

+ /*

+ * Because the function tracer can trace preempt_count_sub()

+ * and it also uses preempt_enable/disable_notrace(), if

+ * NEED_RESCHED is set, the preempt_enable_notrace() called

+ * by the function tracer will call this function again and

+ * cause infinite recursion.

+ *

+ * Preemption must be disabled here before the function

+ * tracer can trace. Break up preempt_disable() into two

+ * calls. One to disable preemption without fear of being

+ * traced. The other to still record the preemption latency,

+ * which can also be traced by the function tracer.

+ */

preempt_disable_notrace();

+ preempt_latency_start(1);

__schedule(true);

+ preempt_latency_stop(1);

preempt_enable_no_resched_notrace();



/*

@@ -3340,7 +3461,21 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)

return;



do {

+ /*

+ * Because the function tracer can trace preempt_count_sub()

+ * and it also uses preempt_enable/disable_notrace(), if

+ * NEED_RESCHED is set, the preempt_enable_notrace() called

+ * by the function tracer will call this function again and

+ * cause infinite recursion.

+ *

+ * Preemption must be disabled here before the function

+ * tracer can trace. Break up preempt_disable() into two

+ * calls. One to disable preemption without fear of being

+ * traced. The other to still record the preemption latency,

+ * which can also be traced by the function tracer.

+ */

preempt_disable_notrace();

+ preempt_latency_start(1);

/*

* Needs preempt disabled in case user_exit() is traced

* and the tracer calls preempt_enable_notrace() causing

@@ -3350,6 +3485,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)

__schedule(true);

exception_exit(prev_ctx);



+ preempt_latency_stop(1);

preempt_enable_no_resched_notrace();

} while (need_resched());

}

@@ -3406,12 +3542,13 @@ EXPORT_SYMBOL(default_wake_function);

void rt_mutex_setprio(struct task_struct *p, int prio)

{

int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;

- struct rq *rq;

const struct sched_class *prev_class;

+ struct rq_flags rf;

+ struct rq *rq;



BUG_ON(prio > MAX_PRIO);



- rq = __task_rq_lock(p);

+ rq = __task_rq_lock(p, &rf);



/*

* Idle task boosting is a nono in general. There is one

@@ -3487,7 +3624,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)

check_class_changed(rq, p, prev_class, oldprio);

out_unlock:

preempt_disable(); /* avoid rq from going away on us */

- __task_rq_unlock(rq);

+ __task_rq_unlock(rq, &rf);



balance_callback(rq);

preempt_enable();

@@ -3497,7 +3634,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)

void set_user_nice(struct task_struct *p, long nice)

{

int old_prio, delta, queued;

- unsigned long flags;

+ struct rq_flags rf;

struct rq *rq;



if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)

@@ -3506,7 +3643,7 @@ void set_user_nice(struct task_struct *p, long nice)

* We have to be careful, if called from sys_setpriority(),

* the task might be in the middle of scheduling on another CPU.

*/

- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

/*

* The RT priorities are set via sched_setscheduler(), but we still

* allow the 'normal' nice value to be set - but as expected

@@ -3537,7 +3674,7 @@ void set_user_nice(struct task_struct *p, long nice)

resched_curr(rq);

}

out_unlock:

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

}

EXPORT_SYMBOL(set_user_nice);



@@ -3834,11 +3971,11 @@ static int __sched_setscheduler(struct task_struct *p,

MAX_RT_PRIO - 1 - attr->sched_priority;

int retval, oldprio, oldpolicy = -1, queued, running;

int new_effective_prio, policy = attr->sched_policy;

- unsigned long flags;

const struct sched_class *prev_class;

- struct rq *rq;

+ struct rq_flags rf;

int reset_on_fork;

int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;

+ struct rq *rq;



/* may grab non-irq protected spin_locks */

BUG_ON(in_interrupt());

@@ -3933,13 +4070,13 @@ static int __sched_setscheduler(struct task_struct *p,

* To be able to change p->policy safely, the appropriate

* runqueue lock must be held.

*/

- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);



/*

* Changing the policy of the stop threads is a very bad idea

*/

if (p == rq->stop) {

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

return -EINVAL;

}



@@ -3956,7 +4093,7 @@ static int __sched_setscheduler(struct task_struct *p,

goto change;



p->sched_reset_on_fork = reset_on_fork;

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

return 0;

}

change:

@@ -3970,7 +4107,7 @@ static int __sched_setscheduler(struct task_struct *p,

if (rt_bandwidth_enabled() && rt_policy(policy) &&

task_group(p)->rt_bandwidth.rt_runtime == 0 &&

!task_group_is_autogroup(task_group(p))) {

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

return -EPERM;

}

#endif

@@ -3985,7 +4122,7 @@ static int __sched_setscheduler(struct task_struct *p,

*/

if (!cpumask_subset(span, &p->cpus_allowed) ||

rq->rd->dl_bw.bw == 0) {

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

return -EPERM;

}

}

@@ -3995,7 +4132,7 @@ static int __sched_setscheduler(struct task_struct *p,

/* recheck policy now with rq lock held */

if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {

policy = oldpolicy = -1;

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

goto recheck;

}



@@ -4005,7 +4142,7 @@ static int __sched_setscheduler(struct task_struct *p,

* is available.

*/

if ((dl_policy(policy) || dl_task(p)) && dl_overflow(p, policy, attr)) {

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

return -EBUSY;

}



@@ -4050,7 +4187,7 @@ static int __sched_setscheduler(struct task_struct *p,



check_class_changed(rq, p, prev_class, oldprio);

preempt_disable(); /* avoid rq from going away on us */

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



if (pi)

rt_mutex_adjust_pi(p);

@@ -4903,10 +5040,10 @@ SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,

{

struct task_struct *p;

unsigned int time_slice;

- unsigned long flags;

+ struct rq_flags rf;

+ struct timespec t;

struct rq *rq;

int retval;

- struct timespec t;



if (pid < 0)

return -EINVAL;

@@ -4921,11 +5058,11 @@ SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,

if (retval)

goto out_unlock;



- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

time_slice = 0;

if (p->sched_class->get_rr_interval)

time_slice = p->sched_class->get_rr_interval(rq, p);

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



rcu_read_unlock();

jiffies_to_timespec(time_slice, &t);

@@ -5001,7 +5138,8 @@ void show_state_filter(unsigned long state_filter)

touch_all_softlockup_watchdogs();



#ifdef CONFIG_SCHED_DEBUG

- sysrq_sched_debug_show();

+ if (!state_filter)

+ sysrq_sched_debug_show();

#endif
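
[ For context: state_filter is 0 for SysRq-T (dump all tasks) and
  TASK_UNINTERRUPTIBLE for SysRq-W, so with this check the sched debug
  dump only accompanies the full task listing, per the "Don't dump sched
  debug info in SysRq-W" change in the shortlog. ]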

rcu_read_unlock();

/*

@@ -5163,6 +5301,8 @@ int task_can_attach(struct task_struct *p,



#ifdef CONFIG_SMP



+static bool sched_smp_initialized __read_mostly;

+

#ifdef CONFIG_NUMA_BALANCING

/* Migrate current task p to target_cpu */

int migrate_task_to(struct task_struct *p, int target_cpu)

@@ -5188,11 +5328,11 @@ int migrate_task_to(struct task_struct *p, int target_cpu)

*/

void sched_setnuma(struct task_struct *p, int nid)

{

- struct rq *rq;

- unsigned long flags;

bool queued, running;

+ struct rq_flags rf;

+ struct rq *rq;



- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);

queued = task_on_rq_queued(p);

running = task_current(rq, p);



@@ -5207,7 +5347,7 @@ void sched_setnuma(struct task_struct *p, int nid)

p->sched_class->set_curr_task(rq);

if (queued)

enqueue_task(rq, p, ENQUEUE_RESTORE);

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);

}

#endif /* CONFIG_NUMA_BALANCING */



@@ -5223,7 +5363,7 @@ void idle_task_exit(void)

BUG_ON(cpu_online(smp_processor_id()));



if (mm != &init_mm) {

- switch_mm(mm, &init_mm, current);

+ switch_mm_irqs_off(mm, &init_mm, current);

finish_arch_post_lock_switch();

}

mmdrop(mm);

@@ -5271,6 +5411,7 @@ static void migrate_tasks(struct rq *dead_rq)

{

struct rq *rq = dead_rq;

struct task_struct *next, *stop = rq->stop;

+ struct pin_cookie cookie;

int dest_cpu;



/*

@@ -5302,8 +5443,8 @@ static void migrate_tasks(struct rq *dead_rq)

/*

* pick_next_task assumes pinned rq->lock.

*/

- lockdep_pin_lock(&rq->lock);

- next = pick_next_task(rq, &fake_task);

+ cookie = lockdep_pin_lock(&rq->lock);

+ next = pick_next_task(rq, &fake_task, cookie);

BUG_ON(!next);

next->sched_class->put_prev_task(rq, next);



@@ -5316,7 +5457,7 @@ static void migrate_tasks(struct rq *dead_rq)

* because !cpu_active at this point, which means load-balance

* will not interfere. Also, stop-machine.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

raw_spin_unlock(&rq->lock);

raw_spin_lock(&next->pi_lock);

raw_spin_lock(&rq->lock);

@@ -5377,127 +5518,13 @@ static void set_rq_offline(struct rq *rq)

}

}



-/*

- * migration_call - callback that gets triggered when a CPU is added.

- * Here we can start up the necessary migration thread for the new CPU.

- */

-static int

-migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)

+static void set_cpu_rq_start_time(unsigned int cpu)

{

- int cpu = (long)hcpu;

- unsigned long flags;

struct rq *rq = cpu_rq(cpu);



- switch (action & ~CPU_TASKS_FROZEN) {

-

- case CPU_UP_PREPARE:

- rq->calc_load_update = calc_load_update;

- account_reset_rq(rq);

- break;

-

- case CPU_ONLINE:

- /* Update our root-domain */

- raw_spin_lock_irqsave(&rq->lock, flags);

- if (rq->rd) {

- BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));

-

- set_rq_online(rq);

- }

- raw_spin_unlock_irqrestore(&rq->lock, flags);

- break;

-

-#ifdef CONFIG_HOTPLUG_CPU

- case CPU_DYING:

- sched_ttwu_pending();

- /* Update our root-domain */

- raw_spin_lock_irqsave(&rq->lock, flags);

- if (rq->rd) {

- BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));

- set_rq_offline(rq);

- }

- migrate_tasks(rq);

- BUG_ON(rq->nr_running != 1); /* the migration thread */

- raw_spin_unlock_irqrestore(&rq->lock, flags);

- break;

-

- case CPU_DEAD:

- calc_load_migrate(rq);

- break;

-#endif

- }

-

- update_max_interval();

-

- return NOTIFY_OK;

-}

-

-/*

- * Register at high priority so that task migration (migrate_all_tasks)

- * happens before everything else. This has to be lower priority than

- * the notifier in the perf_event subsystem, though.

- */

-static struct notifier_block migration_notifier = {

- .notifier_call = migration_call,

- .priority = CPU_PRI_MIGRATION,

-};

-

-static void set_cpu_rq_start_time(void)

-{

- int cpu = smp_processor_id();

- struct rq *rq = cpu_rq(cpu);

rq->age_stamp = sched_clock_cpu(cpu);

}



-static int sched_cpu_active(struct notifier_block *nfb,

- unsigned long action, void *hcpu)

-{

- int cpu = (long)hcpu;

-

- switch (action & ~CPU_TASKS_FROZEN) {

- case CPU_STARTING:

- set_cpu_rq_start_time();

- return NOTIFY_OK;

-

- case CPU_DOWN_FAILED:

- set_cpu_active(cpu, true);

- return NOTIFY_OK;

-

- default:

- return NOTIFY_DONE;

- }

-}

-

-static int sched_cpu_inactive(struct notifier_block *nfb,

- unsigned long action, void *hcpu)

-{

- switch (action & ~CPU_TASKS_FROZEN) {

- case CPU_DOWN_PREPARE:

- set_cpu_active((long)hcpu, false);

- return NOTIFY_OK;

- default:

- return NOTIFY_DONE;

- }

-}

-

-static int __init migration_init(void)

-{

- void *cpu = (void *)(long)smp_processor_id();

- int err;

-

- /* Initialize migration for the boot CPU */

- err = migration_call(&migration_notifier, CPU_UP_PREPARE, cpu);

- BUG_ON(err == NOTIFY_BAD);

- migration_call(&migration_notifier, CPU_ONLINE, cpu);

- register_cpu_notifier(&migration_notifier);

-

- /* Register cpu active notifiers */

- cpu_notifier(sched_cpu_active, CPU_PRI_SCHED_ACTIVE);

- cpu_notifier(sched_cpu_inactive, CPU_PRI_SCHED_INACTIVE);

-

- return 0;

-}

-early_initcall(migration_init);

-

static cpumask_var_t sched_domains_tmpmask; /* sched_domains_mutex */



#ifdef CONFIG_SCHED_DEBUG

@@ -6645,10 +6672,10 @@ static void sched_init_numa(void)

init_numa_topology_type();

}



-static void sched_domains_numa_masks_set(int cpu)

+static void sched_domains_numa_masks_set(unsigned int cpu)

{

- int i, j;

int node = cpu_to_node(cpu);

+ int i, j;



for (i = 0; i < sched_domains_numa_levels; i++) {

for (j = 0; j < nr_node_ids; j++) {

@@ -6658,51 +6685,20 @@ static void sched_domains_numa_masks_set(int cpu)

}

}



-static void sched_domains_numa_masks_clear(int cpu)

+static void sched_domains_numa_masks_clear(unsigned int cpu)

{

int i, j;

+

for (i = 0; i < sched_domains_numa_levels; i++) {

for (j = 0; j < nr_node_ids; j++)

cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);

}

}



-/*

- * Update sched_domains_numa_masks[level][node] array when new cpus

- * are onlined.

- */

-static int sched_domains_numa_masks_update(struct notifier_block *nfb,

- unsigned long action,

- void *hcpu)

-{

- int cpu = (long)hcpu;

-

- switch (action & ~CPU_TASKS_FROZEN) {

- case CPU_ONLINE:

- sched_domains_numa_masks_set(cpu);

- break;

-

- case CPU_DEAD:

- sched_domains_numa_masks_clear(cpu);

- break;

-

- default:

- return NOTIFY_DONE;

- }

-

- return NOTIFY_OK;

-}

#else

-static inline void sched_init_numa(void)

-{

-}

-

-static int sched_domains_numa_masks_update(struct notifier_block *nfb,

- unsigned long action,

- void *hcpu)

-{

- return 0;

-}

+static inline void sched_init_numa(void) { }

+static void sched_domains_numa_masks_set(unsigned int cpu) { }

+static void sched_domains_numa_masks_clear(unsigned int cpu) { }

#endif /* CONFIG_NUMA */



static int __sdt_alloc(const struct cpumask *cpu_map)

@@ -7092,13 +7088,9 @@ static int num_cpus_frozen; /* used to mark begin/end of suspend/resume */

* If we come here as part of a suspend/resume, don't touch cpusets because we

* want to restore it back to its original state upon resume anyway.

*/

-static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,

- void *hcpu)

+static void cpuset_cpu_active(void)

{

- switch (action) {

- case CPU_ONLINE_FROZEN:

- case CPU_DOWN_FAILED_FROZEN:

-

+ if (cpuhp_tasks_frozen) {

/*

* num_cpus_frozen tracks how many CPUs are involved in suspend

* resume sequence. As long as this is not the last online

@@ -7108,35 +7100,25 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,

num_cpus_frozen--;

if (likely(num_cpus_frozen)) {

partition_sched_domains(1, NULL, NULL);

- break;

+ return;

}

-

/*

* This is the last CPU online operation. So fall through and

* restore the original sched domains by considering the

* cpuset configurations.

*/

-

- case CPU_ONLINE:

- cpuset_update_active_cpus(true);

- break;

- default:

- return NOTIFY_DONE;

}

- return NOTIFY_OK;

+ cpuset_update_active_cpus(true);

}



-static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,

- void *hcpu)

+static int cpuset_cpu_inactive(unsigned int cpu)

{

unsigned long flags;

- long cpu = (long)hcpu;

struct dl_bw *dl_b;

bool overflow;

int cpus;



- switch (action) {

- case CPU_DOWN_PREPARE:

+ if (!cpuhp_tasks_frozen) {

rcu_read_lock_sched();

dl_b = dl_bw_of(cpu);



@@ -7148,19 +7130,120 @@ static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,

rcu_read_unlock_sched();



if (overflow)

- return notifier_from_errno(-EBUSY);

+ return -EBUSY;

cpuset_update_active_cpus(false);

- break;

- case CPU_DOWN_PREPARE_FROZEN:

+ } else {

num_cpus_frozen++;

partition_sched_domains(1, NULL, NULL);

- break;

- default:

- return NOTIFY_DONE;

}

- return NOTIFY_OK;

+ return 0;

}



+int sched_cpu_activate(unsigned int cpu)

+{

+ struct rq *rq = cpu_rq(cpu);

+ unsigned long flags;

+

+ set_cpu_active(cpu, true);

+

+ if (sched_smp_initialized) {

+ sched_domains_numa_masks_set(cpu);

+ cpuset_cpu_active();

+ }

+

+ /*

+ * Put the rq online, if not already. This happens:

+ *

+ * 1) In the early boot process, because we build the real domains

+ * after all cpus have been brought up.

+ *

+ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the

+ * domains.

+ */

+ raw_spin_lock_irqsave(&rq->lock, flags);

+ if (rq->rd) {

+ BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));

+ set_rq_online(rq);

+ }

+ raw_spin_unlock_irqrestore(&rq->lock, flags);

+

+ update_max_interval();

+

+ return 0;

+}

+

+int sched_cpu_deactivate(unsigned int cpu)

+{

+ int ret;

+

+ set_cpu_active(cpu, false);

+ /*

+ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU

+ * users of this state to go away such that all new such users will

+ * observe it.

+ *

+ * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might

+ * not imply sync_sched(), so wait for both.

+ *

+ * Do the sync before parking the smpboot threads to take care of the RCU boost case.

+ */

+ if (IS_ENABLED(CONFIG_PREEMPT))

+ synchronize_rcu_mult(call_rcu, call_rcu_sched);

+ else

+ synchronize_rcu();

+

+ if (!sched_smp_initialized)

+ return 0;

+

+ ret = cpuset_cpu_inactive(cpu);

+ if (ret) {

+ set_cpu_active(cpu, true);

+ return ret;

+ }

+ sched_domains_numa_masks_clear(cpu);

+ return 0;

+}

+

+static void sched_rq_cpu_starting(unsigned int cpu)

+{

+ struct rq *rq = cpu_rq(cpu);

+

+ rq->calc_load_update = calc_load_update;

+ account_reset_rq(rq);

+ update_max_interval();

+}

+

+int sched_cpu_starting(unsigned int cpu)

+{

+ set_cpu_rq_start_time(cpu);

+ sched_rq_cpu_starting(cpu);

+ return 0;

+}

+

+#ifdef CONFIG_HOTPLUG_CPU

+int sched_cpu_dying(unsigned int cpu)

+{

+ struct rq *rq = cpu_rq(cpu);

+ unsigned long flags;

+

+ /* Handle pending wakeups and then migrate everything off */

+ sched_ttwu_pending();

+ raw_spin_lock_irqsave(&rq->lock, flags);

+ if (rq->rd) {

+ BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));

+ set_rq_offline(rq);

+ }

+ migrate_tasks(rq);

+ BUG_ON(rq->nr_running != 1);

+ raw_spin_unlock_irqrestore(&rq->lock, flags);

+ calc_load_migrate(rq);

+ update_max_interval();

+ nohz_balance_exit_idle(cpu);

+ hrtick_clear(rq);

+ return 0;

+}

+#endif
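
[ Illustration only, not part of the patch: callbacks of this shape are
  what the new hotplug state machine consumes. The scheduler callbacks
  above are wired up statically in the cpuhp state tables; the runtime
  registration API from the same rework, shown here with assumed names
  for the example functions, would be used roughly like this from an
  initcall. ]

static int example_cpu_online(unsigned int cpu)
{
        pr_info("cpu %u coming up\n", cpu);
        return 0;       /* a non-zero return rolls back the bringup */
}

static int example_cpu_offline(unsigned int cpu)
{
        pr_info("cpu %u going down\n", cpu);
        return 0;
}

static int example_register(void)
{
        return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "sched:example",
                                 example_cpu_online, example_cpu_offline);
}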

+

void __init sched_init_smp(void)

{

cpumask_var_t non_isolated_cpus;

@@ -7182,12 +7265,6 @@ void __init sched_init_smp(void)

cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);

mutex_unlock(&sched_domains_mutex);



- hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);

- hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);

- hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

-

- init_hrtick();

-

/* Move init over to a non-isolated CPU */

if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0)

BUG();

@@ -7196,7 +7273,16 @@ void __init sched_init_smp(void)



init_sched_rt_class();

init_sched_dl_class();

+ sched_smp_initialized = true;

+}

+

+static int __init migration_init(void)

+{

+ sched_rq_cpu_starting(smp_processor_id());

+ return 0;

}

+early_initcall(migration_init);

+

#else

void __init sched_init_smp(void)

{

@@ -7331,8 +7417,6 @@ void __init sched_init(void)

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)

rq->cpu_load[j] = 0;



- rq->last_load_update_tick = jiffies;

-

#ifdef CONFIG_SMP

rq->sd = NULL;

rq->rd = NULL;

@@ -7351,12 +7435,13 @@ void __init sched_init(void)



rq_attach_root(rq, &def_root_domain);

#ifdef CONFIG_NO_HZ_COMMON

+ rq->last_load_update_tick = jiffies;

rq->nohz_flags = 0;

#endif

#ifdef CONFIG_NO_HZ_FULL

rq->last_sched_tick = 0;

#endif

-#endif

+#endif /* CONFIG_SMP */

init_rq_hrtick(rq);

atomic_set(&rq->nr_iowait, 0);

}

@@ -7394,7 +7479,7 @@ void __init sched_init(void)

if (cpu_isolated_map == NULL)

zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);

idle_thread_set_boot_cpu();

- set_cpu_rq_start_time();

+ set_cpu_rq_start_time(smp_processor_id());

#endif

init_sched_fair_class();



@@ -7639,10 +7724,10 @@ void sched_move_task(struct task_struct *tsk)

{

struct task_group *tg;

int queued, running;

- unsigned long flags;

+ struct rq_flags rf;

struct rq *rq;



- rq = task_rq_lock(tsk, &flags);

+ rq = task_rq_lock(tsk, &rf);



running = task_current(rq, tsk);

queued = task_on_rq_queued(tsk);

@@ -7674,7 +7759,7 @@ void sched_move_task(struct task_struct *tsk)

if (queued)

enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);



- task_rq_unlock(rq, tsk, &flags);

+ task_rq_unlock(rq, tsk, &rf);

}

#endif /* CONFIG_CGROUP_SCHED */



@@ -7894,7 +7979,7 @@ static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)

static int sched_rt_global_constraints(void)

{

unsigned long flags;

- int i, ret = 0;

+ int i;



raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);

for_each_possible_cpu(i) {

@@ -7906,7 +7991,7 @@ static int sched_rt_global_constraints(void)

}

raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);



- return ret;

+ return 0;

}

#endif /* CONFIG_RT_GROUP_SCHED */



diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c

index 4a811203c04a..41f85c4d0938 100644

--- a/kernel/sched/cpuacct.c

+++ b/kernel/sched/cpuacct.c

@@ -25,11 +25,22 @@ enum cpuacct_stat_index {

CPUACCT_STAT_NSTATS,

};



+enum cpuacct_usage_index {

+ CPUACCT_USAGE_USER, /* ... user mode */

+ CPUACCT_USAGE_SYSTEM, /* ... kernel mode */

+

+ CPUACCT_USAGE_NRUSAGE,

+};

+

+struct cpuacct_usage {

+ u64 usages[CPUACCT_USAGE_NRUSAGE];

+};

+

/* track cpu usage of a group of tasks and its child groups */

struct cpuacct {

struct cgroup_subsys_state css;

/* cpuusage holds pointer to a u64-type object on every cpu */

- u64 __percpu *cpuusage;

+ struct cpuacct_usage __percpu *cpuusage;

struct kernel_cpustat __percpu *cpustat;

};



@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)

return css_ca(ca->css.parent);

}



-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);

+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);

static struct cpuacct root_cpuacct = {

.cpustat = &kernel_cpustat,

.cpuusage = &root_cpuacct_cpuusage,

@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)

if (!ca)

goto out;



- ca->cpuusage = alloc_percpu(u64);

+ ca->cpuusage = alloc_percpu(struct cpuacct_usage);

if (!ca->cpuusage)

goto out_free_ca;



@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)

kfree(ca);

}



-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)

+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,

+ enum cpuacct_usage_index index)

{

- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);

+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);

u64 data;



+ /*

+ * We allow index == CPUACCT_USAGE_NRUSAGE here to read

+ * the sum of usages.

+ */

+ BUG_ON(index > CPUACCT_USAGE_NRUSAGE);

+

#ifndef CONFIG_64BIT

/*

* Take rq->lock to make 64-bit read safe on 32-bit platforms.

*/

raw_spin_lock_irq(&cpu_rq(cpu)->lock);

- data = *cpuusage;

+#endif

+

+ if (index == CPUACCT_USAGE_NRUSAGE) {

+ int i = 0;

+

+ data = 0;

+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)

+ data += cpuusage->usages[i];

+ } else {

+ data = cpuusage->usages[index];

+ }

+

+#ifndef CONFIG_64BIT

raw_spin_unlock_irq(&cpu_rq(cpu)->lock);

-#else

- data = *cpuusage;

#endif



return data;

@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)



static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)

{

- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);

+ struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);

+ int i;



#ifndef CONFIG_64BIT

/*

* Take rq->lock to make 64-bit write safe on 32-bit platforms.

*/

raw_spin_lock_irq(&cpu_rq(cpu)->lock);

- *cpuusage = val;

+#endif

+

+ for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)

+ cpuusage->usages[i] = val;

+

+#ifndef CONFIG_64BIT

raw_spin_unlock_irq(&cpu_rq(cpu)->lock);

-#else

- *cpuusage = val;

#endif

}



/* return total cpu usage (in nanoseconds) of a group */

-static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)

+static u64 __cpuusage_read(struct cgroup_subsys_state *css,

+ enum cpuacct_usage_index index)

{

struct cpuacct *ca = css_ca(css);

u64 totalcpuusage = 0;

int i;



- for_each_present_cpu(i)

- totalcpuusage += cpuacct_cpuusage_read(ca, i);

+ for_each_possible_cpu(i)

+ totalcpuusage += cpuacct_cpuusage_read(ca, i, index);



return totalcpuusage;

}



+static u64 cpuusage_user_read(struct cgroup_subsys_state *css,

+ struct cftype *cft)

+{

+ return __cpuusage_read(css, CPUACCT_USAGE_USER);

+}

+

+static u64 cpuusage_sys_read(struct cgroup_subsys_state *css,

+ struct cftype *cft)

+{

+ return __cpuusage_read(css, CPUACCT_USAGE_SYSTEM);

+}

+

+static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)

+{

+ return __cpuusage_read(css, CPUACCT_USAGE_NRUSAGE);

+}

+

static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,

u64 val)

{

struct cpuacct *ca = css_ca(css);

- int err = 0;

- int i;

+ int cpu;



/*

* Only allow '0' here to do a reset.

*/

- if (val) {

- err = -EINVAL;

- goto out;

- }

+ if (val)

+ return -EINVAL;



- for_each_present_cpu(i)

- cpuacct_cpuusage_write(ca, i, 0);

+ for_each_possible_cpu(cpu)

+ cpuacct_cpuusage_write(ca, cpu, 0);



-out:

- return err;

+ return 0;

}



-static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)

+static int __cpuacct_percpu_seq_show(struct seq_file *m,

+ enum cpuacct_usage_index index)

{

struct cpuacct *ca = css_ca(seq_css(m));

u64 percpu;

int i;



- for_each_present_cpu(i) {

- percpu = cpuacct_cpuusage_read(ca, i);

+ for_each_possible_cpu(i) {

+ percpu = cpuacct_cpuusage_read(ca, i, index);

seq_printf(m, "%llu ", (unsigned long long) percpu);

}

seq_printf(m, "

");

return 0;

}



+static int cpuacct_percpu_user_seq_show(struct seq_file *m, void *V)

+{

+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_USER);

+}

+

+static int cpuacct_percpu_sys_seq_show(struct seq_file *m, void *V)

+{

+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_SYSTEM);

+}

+

+static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)

+{

+ return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_NRUSAGE);

+}

+

static const char * const cpuacct_stat_desc[] = {

[CPUACCT_STAT_USER] = "user",

[CPUACCT_STAT_SYSTEM] = "system",

@@ -191,7 +253,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)

int cpu;

s64 val = 0;



- for_each_online_cpu(cpu) {

+ for_each_possible_cpu(cpu) {

struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);

val += kcpustat->cpustat[CPUTIME_USER];

val += kcpustat->cpustat[CPUTIME_NICE];

@@ -200,7 +262,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)

seq_printf(sf, "%s %lld

", cpuacct_stat_desc[CPUACCT_STAT_USER], val);



val = 0;

- for_each_online_cpu(cpu) {

+ for_each_possible_cpu(cpu) {

struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);

val += kcpustat->cpustat[CPUTIME_SYSTEM];

val += kcpustat->cpustat[CPUTIME_IRQ];

@@ -220,10 +282,26 @@ static struct cftype files[] = {

.write_u64 = cpuusage_write,

},

{

+ .name = "usage_user",

+ .read_u64 = cpuusage_user_read,

+ },

+ {

+ .name = "usage_sys",

+ .read_u64 = cpuusage_sys_read,

+ },

+ {

.name = "usage_percpu",

.seq_show = cpuacct_percpu_seq_show,

},

{

+ .name = "usage_percpu_user",

+ .seq_show = cpuacct_percpu_user_seq_show,

+ },

+ {

+ .name = "usage_percpu_sys",

+ .seq_show = cpuacct_percpu_sys_seq_show,

+ },

+ {

.name = "stat",

.seq_show = cpuacct_stats_show,

},

@@ -238,10 +316,17 @@ static struct cftype files[] = {

void cpuacct_charge(struct task_struct *tsk, u64 cputime)

{

struct cpuacct *ca;

+ int index = CPUACCT_USAGE_SYSTEM;

+ struct pt_regs *regs = task_pt_regs(tsk);

+

+ if (regs && user_mode(regs))

+ index = CPUACCT_USAGE_USER;



rcu_read_lock();

+

for (ca = task_ca(tsk); ca; ca = parent_ca(ca))

- *this_cpu_ptr(ca->cpuusage) += cputime;

+ this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;

+

rcu_read_unlock();

}
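
[ One invariant worth noting: summing the two buckets reproduces the old
  aggregate counter, which is what passing CPUACCT_USAGE_NRUSAGE to
  cpuacct_cpuusage_read() achieves above. A sketch, not part of the
  patch: ]

static u64 example_total_usage(struct cpuacct_usage *u)
{
        return u->usages[CPUACCT_USAGE_USER] +
               u->usages[CPUACCT_USAGE_SYSTEM];
}

[ Userspace sees the split as the new cpuacct.usage_user, cpuacct.usage_sys,
  cpuacct.usage_percpu_user and cpuacct.usage_percpu_sys files, next to the
  existing cpuacct.usage and cpuacct.usage_percpu. ]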



diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c

index 5a75b08cfd85..5be58820465c 100644

--- a/kernel/sched/cpudeadline.c

+++ b/kernel/sched/cpudeadline.c

@@ -103,10 +103,10 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

const struct sched_dl_entity *dl_se = &p->dl;



if (later_mask &&

- cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {

+ cpumask_and(later_mask, cp->free_cpus, tsk_cpus_allowed(p))) {

best_cpu = cpumask_any(later_mask);

goto out;

- } else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&

+ } else if (cpumask_test_cpu(cpudl_maximum(cp), tsk_cpus_allowed(p)) &&

dl_time_before(dl_se->deadline, cp->elements[0].dl)) {

best_cpu = cpudl_maximum(cp);

if (later_mask)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c

index 981fcd7dc394..11e9705bf937 100644

--- a/kernel/sched/cpupri.c

+++ b/kernel/sched/cpupri.c

@@ -103,11 +103,11 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,

if (skip)

continue;



- if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)

+ if (cpumask_any_and(tsk_cpus_allowed(p), vec->mask) >= nr_cpu_ids)

continue;



if (lowest_mask) {

- cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);

+ cpumask_and(lowest_mask, tsk_cpus_allowed(p), vec->mask);



/*

* We have to ensure that we have at least one bit

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c

index 686ec8adf952..fcb7f0217ff4 100644

--- a/kernel/sched/deadline.c

+++ b/kernel/sched/deadline.c

@@ -134,7 +134,7 @@ static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

{

struct task_struct *p = dl_task_of(dl_se);



- if (p->nr_cpus_allowed > 1)

+ if (tsk_nr_cpus_allowed(p) > 1)

dl_rq->dl_nr_migratory++;



update_dl_migration(dl_rq);

@@ -144,7 +144,7 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

{

struct task_struct *p = dl_task_of(dl_se);



- if (p->nr_cpus_allowed > 1)

+ if (tsk_nr_cpus_allowed(p) > 1)

dl_rq->dl_nr_migratory--;



update_dl_migration(dl_rq);
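
[ The tsk_cpus_allowed()/tsk_nr_cpus_allowed() helpers used throughout
  these hunks are introduced elsewhere in the series; to the best of my
  knowledge they are thin accessors, roughly: ]

#define tsk_cpus_allowed(tsk)   (&(tsk)->cpus_allowed)

static inline int tsk_nr_cpus_allowed(struct task_struct *p)
{
        return p->nr_cpus_allowed;
}

[ Routing every reader through one accessor gives a single place to hook
  when the allowed-CPU semantics change, e.g. for per-cpu kernel threads
  during hotplug (see the shortlog). ]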

@@ -591,10 +591,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

struct sched_dl_entity,

dl_timer);

struct task_struct *p = dl_task_of(dl_se);

- unsigned long flags;

+ struct rq_flags rf;

struct rq *rq;



- rq = task_rq_lock(p, &flags);

+ rq = task_rq_lock(p, &rf);



/*

* The task might have changed its scheduling policy to something

@@ -670,14 +670,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

* Nothing relies on rq->lock after this, so it's safe to drop

* rq->lock.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, rf.cookie);

push_dl_task(rq);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, rf.cookie);

}

#endif



unlock:

- task_rq_unlock(rq, p, &flags);

+ task_rq_unlock(rq, p, &rf);



/*

* This can free the task_struct, including this hrtimer, do not touch

@@ -717,10 +717,6 @@ static void update_curr_dl(struct rq *rq)

if (!dl_task(curr) || !on_dl_rq(dl_se))

return;



- /* Kick cpufreq (see the comment in linux/cpufreq.h). */

- if (cpu_of(rq) == smp_processor_id())

- cpufreq_trigger_update(rq_clock(rq));

-

/*

* Consumed budget is computed considering the time as

* observed by schedulable tasks (excluding time spent

@@ -736,6 +732,10 @@ static void update_curr_dl(struct rq *rq)

return;

}



+ /* kick cpufreq (see the comment in linux/cpufreq.h). */

+ if (cpu_of(rq) == smp_processor_id())

+ cpufreq_trigger_update(rq_clock(rq));

+

schedstat_set(curr->se.statistics.exec_max,

max(curr->se.statistics.exec_max, delta_exec));



@@ -966,7 +966,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)



enqueue_dl_entity(&p->dl, pi_se, flags);



- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)

+ if (!task_current(rq, p) && tsk_nr_cpus_allowed(p) > 1)

enqueue_pushable_dl_task(rq, p);

}



@@ -1040,9 +1040,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)

* try to make it stay here, it might be important.

*/

if (unlikely(dl_task(curr)) &&

- (curr->nr_cpus_allowed < 2 ||

+ (tsk_nr_cpus_allowed(curr) < 2 ||

!dl_entity_preempt(&p->dl, &curr->dl)) &&

- (p->nr_cpus_allowed > 1)) {

+ (tsk_nr_cpus_allowed(p) > 1)) {

int target = find_later_rq(p);



if (target != -1 &&

@@ -1063,7 +1063,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)

* Current can't be migrated, useless to reschedule,

* let's hope p can move out.

*/

- if (rq->curr->nr_cpus_allowed == 1 ||

+ if (tsk_nr_cpus_allowed(rq->curr) == 1 ||

cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)

return;



@@ -1071,7 +1071,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)

* p is migratable, so let's not schedule it and

* see if it is pushed or pulled somewhere else.

*/

- if (p->nr_cpus_allowed != 1 &&

+ if (tsk_nr_cpus_allowed(p) != 1 &&

cpudl_find(&rq->rd->cpudl, p, NULL) != -1)

return;



@@ -1125,7 +1125,8 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,

return rb_entry(left, struct sched_dl_entity, rb_node);

}



-struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)

+struct task_struct *

+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)

{

struct sched_dl_entity *dl_se;

struct task_struct *p;

@@ -1140,9 +1141,9 @@ struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)

* disabled avoiding further scheduler activity on it and we're

* being very careful to re-start the picking loop.

*/

- lockdep_unpin_lock(&rq->lock);

+ lockdep_unpin_lock(&rq->lock, cookie);

pull_dl_task(rq);

- lockdep_pin_lock(&rq->lock);

+ lockdep_repin_lock(&rq->lock, cookie);

/*

* pull_rt_task() can drop (and re-acquire) rq->lock; this

* means a stop task can slip in, in which case we need to

@@ -1185,7 +1186,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)

{

update_curr_dl(rq);



- if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)

+ if (on_dl_rq(&p->dl) && tsk_nr_cpus_allowed(p) > 1)

enqueue_pushable_dl_task(rq, p);

}



@@ -1286,7 +1287,7 @@ static int find_later_rq(struct task_struct *task)

if (unlikely(!later_mask))

return -1;



- if (task->nr_cpus_allowed == 1)

+ if (tsk_nr_cpus_allowed(task) == 1)

return -1;



/*

@@ -1392,7 +1393,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)

if (double_lock_balance(rq, later_rq)) {

if (unlikely(task_rq(task) != rq ||

!cpumask_test_cpu(later_rq->cpu,

- &task->cpus_allowed) ||

+ tsk_cpus_allowed(task)) ||

task_running(rq, task) ||

!dl_task(task) ||

!task_on_rq_queued(task))) {

@@ -1432,7 +1433,7 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)



BUG_ON(rq->cpu != task_cpu(p));

BUG_ON(task_current(rq, p));

- BUG_ON(p->nr_cpus_allowed <= 1);

+ BUG_ON(tsk_nr_cpus_allowed(p) <= 1);



BUG_ON(!task_on_rq_queued(p));

BUG_ON(!dl_task(p));

@@ -1471,7 +1472,7 @@ static int push_dl_task(struct rq *rq)

*/

if (dl_task(rq->curr) &&

dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&

- rq->curr->nr_cpus_allowed > 1) {

+ tsk_nr_cpus_allowed(rq->curr) > 1) {

resched_curr(rq);

return 0;

}

@@ -1618,9 +1619,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)

{

if (!task_running(rq, p) &&

!test_tsk_need_resched(rq->curr) &&

- p->nr_cpus_allowed > 1 &&

+ tsk_nr_cpus_allowed(p) > 1 &&

dl_task(rq->curr) &&

- (rq->curr->nr_cpus_allowed < 2 ||

+ (tsk_nr_cpus_allowed(rq->curr) < 2 ||

!dl_entity_preempt(&p->dl, &rq->curr->dl))) {

push_dl_tasks(rq);

}

@@ -1724,7 +1725,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)



if (task_on_rq_queued(p) && rq->curr != p) {

#ifdef CONFIG_SMP

- if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)

+ if (tsk_nr_cpus_allowed(p) > 1 && rq->dl.overloaded)

queue_push_tasks(rq);

#else

if (dl_task(rq->curr))

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c

index 4fbc3bd5ff60..cf905f655ba1 100644

--- a/kernel/sched/debug.c

+++ b/kernel/sched/debug.c

@@ -626,15 +626,16 @@ do { \

#undef P

#undef PN



-#ifdef CONFIG_SCHEDSTATS

-#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);

-#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);

-

#ifdef CONFIG_SMP

+#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);

P64(avg_idle);

P64(max_idle_balance_cost);

+#undef P64

#endif



+#ifdef CONFIG_SCHEDSTATS

+#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);

+

if (schedstat_enabled()) {

P(yld_count);

P(sched_count);

@@ -644,7 +645,6 @@ do { \

}



#undef P

-#undef P64

#endif

spin_lock_irqsave(&sched_debug_lock, flags);

print_cfs_stats(m, cpu);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

index e7dd0ec169be..218f8e83db73 100644

--- a/kernel/sched/fair.c

+++ b/kernel/sched/fair.c

@@ -204,7 +204,7 @@ static void __update_inv_weight(struct load_weight *lw)

* OR

* (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT

*

- * Either weight := NICE_0_LOAD and lw \e prio_to_wmult[], in which case

+ * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case

* we're guaranteed shift stays positive because inv_weight is guaranteed to

* fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.

*

@@ -682,17 +682,68 @@ void init_entity_runnable_average(struct sched_entity *se)

sa->period_contrib = 1023;

sa->load_avg = scale_load_down(se->load.weight);

sa->load_sum = sa->load_avg * LOAD_AVG_MAX;

- sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);

- sa->util_sum = sa->util_avg * LOAD_AVG_MAX;

+ /*

+ * At this point, util_avg won't be used in select_task_rq_fair anyway

+ */

+ sa->util_avg = 0;

+ sa->util_sum = 0;

/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */

}



+/*

+ * With new tasks being created, their initial util_avgs are extrapolated

+ * based on the cfs_rq's current util_avg:

+ *

+ * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

+ *

+ * However, in many cases, the above util_avg does not give a desired

+ * value. Moreover, the sum of the util_avgs may be divergent, such

+ * as when the series is a harmonic series.

+ *

+ * To solve this problem, we also cap the util_avg of successive tasks to

+ * only 1/2 of the left utilization budget:

+ *

+ * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

+ *

+ * where n denotes the nth task.

+ *

+ * For example, the simplest series from the beginning would look like:

+ *

+ * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...

+ * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...

+ *

+ * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)

+ * if util_avg > util_avg_cap.

+ */

+void post_init_entity_util_avg(struct sched_entity *se)

+{

+ struct cfs_rq *cfs_rq = cfs_rq_of(se);

+ struct sched_avg *sa = &se->avg;

+ long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

+

+ if (cap > 0) {

+ if (cfs_rq->avg.util_avg != 0) {

+ sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;

+ sa->util_avg /= (cfs_rq->avg.load_avg + 1);

+

+ if (sa->util_avg > cap)

+ sa->util_avg = cap;

+ } else {

+ sa->util_avg = cap;

+ }

+ sa->util_sum = sa->util_avg * LOAD_AVG_MAX;

+ }

+}
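
[ A worked example of the clamping, with numbers chosen for illustration:
  with SCHED_CAPACITY_SCALE = 1024, a cfs_rq whose util_avg is 512 and
  whose load_avg is 511 gives cap = (1024 - 512) / 2 = 256. A new NICE_0
  task with load.weight = 1024 extrapolates to 512 * 1024 / (511 + 1) =
  1024, which exceeds the cap, so its util_avg is clamped to 256 - the
  second step of the halving series in the comment above. ]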

+

static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);

static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);

#else

void init_entity_runnable_average(struct sched_entity *se)

{

}

+void post_init_entity_util_avg(struct sched_entity *se)

+{

+}

#endif



/*

@@ -2437,10 +2488,12 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)

updat