Overview

Computing infrastructure for problem solving with deep learning

The core technologies of Preferred Networks (PFN), especially deep learning, require enormous computing power. To perform this vast number of computations efficiently, we currently operate our own computer clusters, more commonly known as supercomputers. Our computer clusters are named MN followed by a series number: MN-1, MN-2, and MN-3.

Infrastructure

(Updated May 2020)

MN-3

MN-3 is PFN’s first computer cluster that uses MN-Core, a processor developed in-house specifically for deep learning. MN-3 started operating in May 2020. PFN plans to expand MN-Core-powered computer clusters in multiple phases. The first of these, MN-3a, has already been completed with the following configuration.

MN-3a is made up of 1.5 “zones,” each of which consists of 32 compute nodes (MN-Core Servers) tightly coupled by two MN-Core Direct Connect Switches, for a total of 48 compute nodes.

Configuration of the MN-3a cluster:

  • 48 compute nodes (MN-Core Servers)
  • 4 interconnect nodes (MN-Core Direct Connect Switches)
  • 5 100GbE Ethernet switches

Configuration of each MN-3a node:

MN-Core Server

  • MN-Core: MN-Core Board x 4
  • CPU: Intel Xeon 8260M two-way (48 physical cores)
  • Memory: 384GB DDR4
  • Storage Class Memory: 3TB Intel Optane DC Persistent Memory
  • Network: MN-Core DirectConnect (112Gbps) x 2, Mellanox ConnectX-6 (100GbE) x 2, on-board (10GbE) x 2

MN-2

MN-2 is the first GPU cluster built and managed solely by PFN. Operating since July 2019, MN-2 adopts RoCEv2 (RDMA over Converged Ethernet v2) to interconnect GPU servers over standard Ethernet.

This makes it possible to carry both high-speed communication with the storage servers and the collective communications used in deep learning over the same network, efficiently and with low latency.
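
As an illustration only, and not PFN’s actual training code, the sketch below shows the kind of collective communication such a network carries: a data-parallel gradient all-reduce issued through PyTorch’s NCCL backend, which can move data over RoCE-capable NICs via RDMA. The environment variables and interface name are hypothetical tuning knobs, not MN-2’s configuration.

    import os
    import torch
    import torch.distributed as dist

    # Illustrative NCCL settings; the interface name is hypothetical and
    # not taken from MN-2's actual configuration.
    os.environ.setdefault("NCCL_IB_DISABLE", "0")        # allow RDMA (RoCE/InfiniBand) transports
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # NIC used for bootstrap traffic

    # Rank and world size are supplied by the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # The collective that dominates data-parallel deep learning:
    # sum gradients across all workers, then average.
    grad = torch.randn(1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()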

Specifications of the MN-2 cluster:

  • 128 GPU servers (compute nodes)
  • 32 CPU servers (compute nodes)
  • 24 storage servers
  • 18 100GbE Ethernet switches

Specifications of each compute node on MN-2:

GPU server

  • GPU: NVIDIA V100 SXM x 8
  • CPU: Intel Xeon 6254 two-way (36 physical cores)
  • Memory: 384GB DDR4
  • Network: Mellanox ConnectX-4 (100GbE) x 4, on-board (10GbE) x 2

CPU server

  • CPU: Intel Xeon 6254 two-way (36 physical cores)
  • Memory: 384GB DDR4
  • Network: Mellanox ConnectX-4 (100GbE) x 2, on-board (10GbE) x 2

MN-1, MN-1b

MN-1 is a GPU computer cluster that NTT Communications operates exclusively for PFN. The MN-1 cluster has two generations: MN-1 operating since September 2017 and MN-1b operating since July 2018.

The configuration of each MN-1 cluster is as follows:

  • MN-1: 128 GPU servers (NVIDIA P100 x 8, FDR 56Gbps InfiniBand x 2)
  • MN-1b: 64 GPU servers (NVIDIA V100 x 8, EDR 100Gbps InfiniBand x 2)

MN-1 milestones:

MN-1b milestones:

Computing Sites

  • NTT Communications Datacenter in Tokyo
    • MN-1, MN-1b
    • Operated under a service contract
  • Simulator Building at Yokohama Institute for Earth Sciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
    • MN-2, MN-3
    • Independently operated by PFN on tenanted premises

Middleware

PFN’s computer clusters use Kubernetes, open-source software for managing containerised applications, as a core technology. Combined with PFN’s own schedulers and front-end tools, this provides a computing platform for efficient machine learning and deep learning research.
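
As a rough sketch only, since PFN’s in-house schedulers and front-end tools are not public, the following example uses the standard Kubernetes Python client to submit a batch Job requesting eight GPUs, which is roughly the shape of workload such a platform schedules. The image name, job name, and namespace are hypothetical.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (illustrative; an in-cluster
    # component would call config.load_incluster_config() instead).
    config.load_kube_config()

    # One training container asking the scheduler for 8 GPUs on a single node.
    container = client.V1Container(
        name="train",
        image="registry.example.com/trainer:latest",  # hypothetical image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "8"},
        ),
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="example-training-job"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
            ),
        ),
    )

    # Hand the Job to Kubernetes; the cluster scheduler places it on a node with free GPUs.
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)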