Recently, while building Chromium on a 48-core/96-thread server with the LLVM 12 toolchain (Clang, lld) and LTO enabled, I noticed that while linking the final chromium binary, lld saturated all CPUs, but most of that CPU time was spent in kernel mode (in htop, more than half of each CPU bar was red). I therefore profiled lld and found out why it burns so much CPU time in the kernel.

perf top

The perf top command shows the percentage of CPU time each function of a process consumes. -p/--pid= selects the process to profile; without it, perf top samples the whole system (including kernel threads).

Running perf top without specifying a PID shows that the biggest CPU consumer is the kernel's native_queued_spin_lock_slowpath function, located in kernel/locking/qspinlock.c in the kernel source.

The other hot functions are the kernel's futex_wake, _raw_spin_lock, and futex_wait_setup, plus libc's __pthread_mutex_lock, __pthread_mutex_timedlock, __pthread_mutex_unlock, malloc, memcpy, memcmp, free, etc.

Digging through the kernel sources together with the information from perf top yields the kernel-side (Linux 5.13) call stack:

native_queued_spin_lock_slowpath
_raw_spin_lock
futex_wake/futex_wait
do_futex
__x64_sys_futex
do_syscall_64
entry_SYSCALL_64_after_hwframe

And the libc (musl libc 1.2.2) call stacks (note: some of these are macros or inline functions):

__syscall(SYS_futex,...)
__futexwait/__wake
__lock/__unlock
malloc/free

__syscall(SYS_futex,...)
pthread_mutex_lock/pthread_mutex_timedlock/pthread_mutex_unlock

Replacing musl's mallocng with jemalloc

Although mallocng, introduced in musl 1.2.0, performs considerably better than the old malloc implementation, its source shows that in multithreaded programs every malloc/free still unconditionally takes and releases a lock, which hurts allocation performance under concurrency. jemalloc is a well-known high-performance malloc library; substituting it for musl's built-in implementation resolves the slowdown caused by lld's heavy allocation traffic.

The following command makes lld load jemalloc automatically:

sudo patchelf --add-needed libjemalloc.so.2 /usr/bin/lld

perf record

After swapping the malloc library there were still a large number of kernel spinlock calls, so further investigation was needed.

The live perf top view is neither precise nor easy to read; FlameGraph can render the samples as a much more intuitive graph:

perf record -F 200 -p $PID --call-graph lbr
perf script -i perf.data > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > out.svg

This produces the following graph:

flamegraph

The image flamegraph generates is interactive: click here to open the SVG in a new tab.

From the flame graph we can finally pin down the call stack inside llvm.

So why is lld so slow here?

In llvm::FPPassManager::runOnFunction, only this one line calls getPassName():

llvm::TimeTraceScope PassScope("RunPass", FP->getPassName());

According to llvm::TimeTraceScope's documentation: "The TimeTraceScope is a helper class to call the begin and end functions of the time trace profiler. When the object is constructed, it begins the section; and when it is destroyed, it stops it. If the time profiler is not initialized, the overhead is a single branch."

The class itself has tiny overhead, but computing its constructor argument is expensive, and when the profiler is not enabled that argument is simply thrown away. Clearly this should use a closure, or some other form of lazy evaluation.

Let's keep digging into why getPassName() ends up taking a lock.

Most subclasses of llvm::Pass override the getPassName() method, for example:

  StringRef getPassName() const override { return "Loop Pass Manager"; }

But subclasses that do not override it fall back to the default implementation:

StringRef Pass::getPassName() const {
  AnalysisID AID = getPassID();
  const PassInfo *PI = PassRegistry::getPassRegistry()->getPassInfo(AID);
  if (PI)
    return PI->getPassName();
  return "Unnamed pass: implement Pass::getPassName()";
}

Note the PassRegistry::getPassRegistry() call here; its code is:

static ManagedStatic<PassRegistry> PassRegistryObj;
PassRegistry *PassRegistry::getPassRegistry() {
  return &*PassRegistryObj;
}

PassRegistry is a lock-guarded global object, defined as:

/// PassRegistry - This class manages the registration and intitialization of
/// the pass subsystem as application startup, and assists the PassManager
/// in resolving pass dependencies.
/// NOTE: PassRegistry is NOT thread-safe.  If you want to use LLVM on multiple
/// threads simultaneously, you will need to use a separate PassRegistry on
/// each thread.
class PassRegistry {
  mutable sys::SmartRWMutex<true> Lock;

  /// PassInfoMap - Keep track of the PassInfo object for each registered pass.
  using MapType = DenseMap<const void *, const PassInfo *>;
  MapType PassInfoMap;

  using StringMapType = StringMap<const PassInfo *>;
  StringMapType PassInfoStringMap;

  std::vector<std::unique_ptr<const PassInfo>> ToFree;
  std::vector<PassRegistrationListener *> Listeners;

public:
  PassRegistry() = default;
  ~PassRegistry();

  /// getPassRegistry - Access the global registry object, which is
  /// automatically initialized at application launch and destroyed by
  /// llvm_shutdown.
  static PassRegistry *getPassRegistry();

  /// getPassInfo - Look up a pass' corresponding PassInfo, indexed by the pass'
  /// type identifier (&MyPass::ID).
  const PassInfo *getPassInfo(const void *TI) const;

  /// getPassInfo - Look up a pass' corresponding PassInfo, indexed by the pass'
  /// argument string.
  const PassInfo *getPassInfo(StringRef Arg) const;

The maps PassRegistry uses are not thread-safe, so they are protected with a sys::SmartRWMutex, and every getPassInfo() call therefore has to lock and unlock:

const PassInfo *PassRegistry::getPassInfo(const void *TI) const {
  sys::SmartScopedReader<true> Guard(Lock);
  return PassInfoMap.lookup(TI);
}

SmartRWMutex is ultimately a wrapper over std::shared_timed_mutex (plus a few other lock implementations).

Fix

Looking at llvm::TimeTraceScope's code, its constructors are:

/// The TimeTraceScope is a helper class to call the begin and end functions
/// of the time trace profiler.  When the object is constructed, it begins
/// the section; and when it is destroyed, it stops it. If the time profiler
/// is not initialized, the overhead is a single branch.
struct TimeTraceScope {

  TimeTraceScope() = delete;
  TimeTraceScope(const TimeTraceScope &) = delete;
  TimeTraceScope &operator=(const TimeTraceScope &) = delete;
  TimeTraceScope(TimeTraceScope &&) = delete;
  TimeTraceScope &operator=(TimeTraceScope &&) = delete;

  TimeTraceScope(StringRef Name) {
    if (getTimeTraceProfilerInstance() != nullptr)
      timeTraceProfilerBegin(Name, StringRef(""));
  }
  TimeTraceScope(StringRef Name, StringRef Detail) {
    if (getTimeTraceProfilerInstance() != nullptr)
      timeTraceProfilerBegin(Name, Detail);
  }
  TimeTraceScope(StringRef Name, llvm::function_ref<std::string()> Detail) {
    if (getTimeTraceProfilerInstance() != nullptr)
      timeTraceProfilerBegin(Name, Detail);
  }
  ~TimeTraceScope() {
    if (getTimeTraceProfilerInstance() != nullptr)
      timeTraceProfilerEnd();
  }
};

Clearly we can pass a closure instead of a StringRef as the second parameter, Detail.

Digging further into the source of timeTraceProfilerBegin:

struct llvm::TimeTraceProfiler {
  void begin(std::string Name, llvm::function_ref<std::string()> Detail) {
    Stack.emplace_back(steady_clock::now(), TimePointType(), std::move(Name),
                       Detail());
  }
};

void llvm::timeTraceProfilerBegin(StringRef Name, StringRef Detail) {
  if (TimeTraceProfilerInstance != nullptr)
    TimeTraceProfilerInstance->begin(std::string(Name),
                                     [&]() { return std::string(Detail); });
}

void llvm::timeTraceProfilerBegin(StringRef Name,
                                  llvm::function_ref<std::string()> Detail) {
  if (TimeTraceProfilerInstance != nullptr)
    TimeTraceProfilerInstance->begin(std::string(Name), Detail);
}

void llvm::timeTraceProfilerEnd() {
  if (TimeTraceProfilerInstance != nullptr)
    TimeTraceProfilerInstance->end();
}

Clearly, even when a StringRef is passed, it has already been computed and is just copied into a closure anyway. So the fix is to convert every TimeTraceScope construction to the closure form; then, when LLVM's own profiler is not in use, the string is never computed at all.

After converting all these call sites to closures (click here for the patch file), lld's performance improved dramatically: link time dropped from tens of CPU-hours to one CPU-hour. The flame graph after the fix:

flamegraph

Click here to open the SVG in a new tab.

The locks that previously consumed so much CPU time are gone.