
Genomics at a Glance – Part 2 of 2

Second part of a two-part series looking at the role of HPC in genomics and DNA sequencing.

Variant Calling Benchmark -- Not Only Human


The variant calling process refers to the identification of a nucleotide difference from a reference sequence at a given position in an individual genome or transcriptome. It includes single nucleotide polymorphisms (SNPs), insertions/deletions (indels) and structural variants. One of the most popular variant calling applications is the GenomeAnalysisTK (GATK) from the Broad Institute. GATK is often combined with BWA to compose a variant calling workflow focused on SNPs and indels. After we published the Dell HPC System for Genomics White Paper last year, there were significant changes in GATK: the key variant calling step, UnifiedGenotyper, is no longer recommended in the Best Practices. Hence, here we recreate the BWA-GATK pipeline according to the currently recommended practice and test whole genome sequencing data from mammals and plants in addition to human whole genome sequencing data. This is part of Dell's effort to help customers estimate their infrastructure needs for various genomics workloads by providing a comprehensive benchmark.

Variant Analysis for Whole Genome Sequencing data

System

The detailed configuration is described in the Dell HPC System for Genomics White Paper; the system configuration and software are summarized in Table 1.

Table 1 Server configuration and software

Component | Detail
Server | 40 x PowerEdge FC430 in FX2 chassis
Processor | Dual Intel Xeon E5-2695 v3, 14 cores each (1,120 cores total)
Memory | 128 GB: 8 x 16 GB RDIMM, 2133 MT/s, dual rank, x4 data width
Storage | 480 TB IEEL (Lustre)
Interconnect | InfiniBand FDR
OS | Red Hat Enterprise Linux 6.6
Cluster management tool | Bright Cluster Manager 7.1
Short sequence aligner | BWA 0.7.2-r1039
Variant analysis | GATK 3.5
Utilities | sambamba 0.6.0, samtools 1.2.1

BWA-GATK pipeline

The current version of GATK is 3.5, and the workflow tested here was obtained from the workshop 'GATK Best Practices and Beyond'. The workshop introduces a new workflow with three phases.

  • Best Practices Phase 1: Pre-processing
  • Best Practices Phase 2A: Calling germline variants
  • Best Practices Phase 2B: Calling somatic variants
  • Best Practices Phase 3: Preliminary analyses

Here we tested Phase 1, Phase 2A and Phase 3 for the germline variant calling pipeline. The details of the commands used in the benchmark are listed below.

Phase 1. Pre-processing

Step 1. Aligning and Sorting

bwa mem -c 250 -M -t [number of threads] -R '@RG\tID:noID\tPL:illumina\tLB:noLB\tSM:bar' [reference chromosome] [read fastq 1] [read fastq 2] | samtools view -bu - | sambamba sort -t [number of threads] -m 30G --tmpdir [path/to/temp] -o [sorted bam output] /dev/stdin

Step 2. Mark and Remove Duplicates

sambamba markdup -t [number of threads] --remove-duplicates --tmpdir=[path/to/temp] [input: sorted bam output] [output: bam without duplicates]

Step 3. Generate Realigning Targets

 java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt [number of threads] -R [reference chromosome] -o [target list file] -I [bam without duplicates] -known [reference vcf file]

Step 4. Realigning around InDel

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T IndelRealigner -R [reference chromosome] -I [bam without duplicates] -targetIntervals [target list file] -known [reference vcf file] -o [realigned bam]

Step 5. Base Recalibration

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [realigned bam] -known [reference vcf file] -o [recalibrated data table]

Step 6. Print Recalibrated Reads - Optional

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T PrintReads -nct [number of threads] -R [reference chromosome] -I [realigned bam] -BQSR [recalibrated data table] -o [recalibrated bam]

Step 7. After Base Recalibration - Optional

java -d64 -Xms4g -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct [number of threads] -l INFO -R [reference chromosome] -I [recalibrated bam] -known [reference vcf file] -o [post recalibrated data table]

Step 8. Analyze Covariates - Optional

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R [reference chromosome] -before [recalibrated data table] -after [post recalibrated data table] -plots [recalibration report pdf] -csv [recalibration report csv]

Phase 2. Variant Discovery – Calling germline variants

Step 1. Haplotype Caller

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct [number of threads] -R [reference chromosome] -ERC GVCF -BQSR [recalibrated data table] -L [reference vcf file] -I [recalibrated bam] -o [gvcf output]

Step 2. GenotypeGVCFs

java -d64 -Xms8g -Xmx30g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt [number of threads] -R [reference chromosome] -V [gvcf output] -o [raw vcf]

Phase 3. Preliminary Analyses

Step 1. Variant Recalibration

java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R [reference chromosome] --input [raw vcf] -an QD -an DP -an FS -an ReadPosRankSum -U LENIENT_VCF_PROCESSING --mode SNP --recal_file [raw vcf recalibration] --tranches_file [raw vcf tranches]

Step 2. Apply Recalibration

java -d64 -Xms512m -Xmx2g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R [reference chromosome] -input [raw vcf] -o [recalibrated filtered vcf] --ts_filter_level 99.97 --tranches_file [raw vcf tranches] --recal_file [raw vcf recalibration] --mode SNP -U LENIENT_VCF_PROCESSING

Job Scheduling

Torque/Maui was used to manage the large number of jobs required to process the sequencing samples simultaneously. The optional steps 6, 7 and 8 in Phase 1 were not included in the benchmark because Step 6, PrintReads, took 12.5 hours with 9 threads for the Bos taurus sample (18 hours with a single thread). These optional steps are not required, but they are useful for reporting purposes; if necessary, they can be run as a side workflow to the main procedure. Each job was assigned 9 cores when 120 jobs were processed concurrently, and 13 cores for the 80-concurrent-job tests.
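As an illustration of how such a job might be submitted, the sketch below shows a Torque/PBS batch script that reserves 9 cores on one node for a single sample and runs the first pipeline step; the queue name, file paths and sample name are hypothetical, and the remaining pipeline steps would follow in the same script.

#PBS -N bwa-gatk-sample01
#PBS -q batch
#PBS -l nodes=1:ppn=9
#PBS -l walltime=48:00:00
cd $PBS_O_WORKDIR
# Phase 1, Step 1 for one sample (hypothetical file names)
bwa mem -c 250 -M -t 9 -R '@RG\tID:noID\tPL:illumina\tLB:noLB\tSM:sample01' \
    ref/genome.fa sample01_R1.fastq.gz sample01_R2.fastq.gz \
  | samtools view -bu - \
  | sambamba sort -t 9 -m 30G --tmpdir /scratch/tmp -o sample01.sorted.bam /dev/stdin
# ...subsequent steps (markdup, realignment, BQSR, HaplotypeCaller) follow here

Each such script would be submitted with qsub, with Maui scheduling the 80 or 120 concurrent jobs across the cluster.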

Data

In addition to the benchmark for human whole genome sequencing data published in the white paper, we gathered cow, pig, two sub-species of rice (japonica and indica) and corn reference genomes from Illumina's iGenomes site and the Ensembl database. Fortunately, reference variant call data exist in the standard VCF file format for human, cow and pig. Variant data for japonica rice were obtained from the 3000 Rice Genomes project on AWS and were modified to follow the standard VCF format. However, the chromosome coordinates in this VCF file do not match the actual reference chromosome sequences, and we were not able to find a matching version of the reference variant information in public databases. For indica rice and corn, we gathered variant information from Ensembl and converted it into a compatible VCF format. Whole genome sequencing data were obtained from the European Nucleotide Archive (ENA); the ENA Sample IDs in Table 2 are the identifiers that allow the sequence data to be retrieved from that site. Although it is not ideal to test an identical input across a large number of processes, it is not feasible to obtain a large number of similar samples from public databases.
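One quick way to detect the coordinate problem described above is to compare each variant's position against the chromosome lengths recorded in the reference index (the .fai file written by samtools faidx); the sketch below drops records that fall outside the chromosome ranges, with hypothetical file names.

samtools faidx reference.fa    # writes reference.fa.fai with chromosome names and lengths
awk 'NR==FNR { len[$1]=$2; next }            # first file: the .fai index
     /^#/    { print; next }                 # keep VCF header lines
     ($1 in len) && ($2 <= len[$1])' \
    reference.fa.fai variants.vcf > variants.inrange.vcf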

Table 2 WGS test data for the different species. *x2 indicates paired-end reads. ƚThe Test ID column lists the identifiers for the sequence data used throughout the tests.

Species | Test IDƚ | ENA Sample ID | Sample Base Count | Single File Size x2* | Reference Genome Size (bp) | Depth of Coverage | Number of Variants in Ref
Homo sapiens (human) | Hs1 | ERR091571 | 42,710,459,638 | 17 GB x2 | 3,326,743,047 | 13x | 3,152,430
Homo sapiens (human) | Hs2 | ERR194161 | 171,588,070,386 | 54 GB x2 | 3,326,743,047 | 52x | 3,152,430
Bos taurus (cow) | Bt1 | SRR1706031 | 82,272,305,762 | 35 GB x2 | 2,649,685,036 | 31x | 93,347,258
Bos taurus (cow) | Bt2 | SRR1805809 | 32,681,063,800 | 12 GB x2 | 2,649,685,036 | 12x | 93,347,258
Sus scrofa (pig) | Ss1 | SRR1178925 | 41,802,035,944 | 19 GB x2 | 3,024,658,544 | 14x | 52,573,286
Sus scrofa (pig) | Ss2 | SRR1056427 | 24,901,150,040 | 10 GB x2 | 3,024,658,544 | 8x | 52,573,286
Oryza sativa (rice), japonica | Osj | SRR1450198 | 49,676,959,200 | 22 GB x2 | 374,424,240 | 132x | 19,409,227
Oryza sativa (rice), indica | Osi | SRR3098100 | 12,191,702,544 | 4 GB x2 | 411,710,190 | 30x | 4,538,869
Zea mays (corn) | Zm | SRR1575496 | 36,192,217,200 | 14 GB x2 | 3,233,616,351 | 11x | 51,151,183

Benchmark Results

Data Quality

After mapping and sorting the sequence input files, quality statistics were obtained from the output files of Phase 1, Step 1. The SRR1706031 sample comes from a bovine gut metagenomics study and, as expected, did not map well onto the Bos taurus UMD3.1 reference genome from Ensembl; the majority of DNA from the bovine gut is foreign and has a different sequence composition.
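For reference, the mapping statistics in Table 3 come from running 'samtools flagstat' on the sorted BAM produced in Phase 1, Step 1; a minimal example with a hypothetical file name is:

samtools flagstat sample01.sorted.bam > sample01.flagstat.txt
# reports total QC-passed reads, mapped reads (%), reads paired in sequencing,
# and properly paired reads (%), as summarized in Table 3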

Table 3 Mapping qualities of sequence read data, obtained using 'samtools flagstat'. 'Total QC-passed reads' is the number of reads that passed the sequencing quality criteria. Of all QC-passed reads, the number actually mapped to a reference genome, and its percentage, is in the 'Mapped reads (%)' column. 'Paired in sequencing' is the number of paired reads properly paired by the sequencer. Of the reads properly paired in sequencing, the number mapped to a reference genome as pairs is listed in 'Properly paired (%) in mapping'.

Species | Sequencing Reads | Test ID | Total QC-passed reads | Mapped reads (%) | Paired in sequencing | Properly paired (%) in mapping
Human | ERR091571 | Hs1 | 424,118,221 | 421,339,198 (99.34%) | 422,875,838 | 412,370,120 (97.52%)
Human | ERR194161 | Hs2 | 1,691,135,957 | 1,666,486,126 (98.54%) | 1,686,908,514 | 1,621,073,394 (96.10%)
Cow | SRR1706031 | Bt1 | 813,545,863 | 29,291,792 (3.60%) | 813,520,998 | 28,813,072 (3.54%)
Cow | SRR1805809 | Bt2 | 327,304,866 | 316,654,265 (96.75%) | 326,810,638 | 308,600,196 (94.43%)
Pig | SRR1178925 | Ss1 | 416,854,287 | 379,784,341 (91.11%) | 413,881,544 | 344,614,170 (83.26%)
Pig | SRR1056427 | Ss2 | 249,096,674 | 228,015,545 (91.54%) | 246,546,040 | 212,404,874 (86.15%)
Rice | SRR1450198 | Osj | 499,853,597 | 486,527,154 (97.33%) | 496,769,592 | 459,665,726 (92.53%)
Rice | SRR3098100 | Osi | 97,611,519 | 95,332,114 (97.66%) | 96,759,544 | 86,156,978 (89.04%)
Corn | SRR1575496 | Zm | 364,636,704 | 358,393,982 (98.29%) | 361,922,172 | 315,560,140 (87.19%)

The rest of the samples aligned to their reference genomes with high quality: more than 80% of the reads paired in sequencing were properly mapped as pairs on the reference genomes.

It is also important to check the level of mismatches in the alignment results. The estimated variation in the human genome is one base in every 1,200 to 1,500 bases, which amounts to roughly 3 million base differences between any two randomly chosen people. However, as shown in Table 4, the results do not quite match this estimate. Ideally, about 36 million mismatches would be expected in the Hs1 data set, since it covers the human reference genome 13 times. The observed mismatch rate is considerably higher than that estimate, so at least one out of every two variants reported by the sequencing might be an error.
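One way to produce a per-mismatch-count breakdown like Table 4 is to read the NM (edit distance) tag that BWA writes for each mapped read; the sketch below tallies mapped reads by NM value with samtools and awk. It assumes the NM tag is present and, strictly speaking, NM also counts indels, so treat the output as an approximation rather than the exact method used here.

samtools view -F 4 sample01.sorted.bam \
  | awk '{ nm=-1; for (i=12; i<=NF; i++) if ($i ~ /^NM:i:/) { nm=substr($i,6); break }
           if (nm>=0) count[nm]++ }
         END { for (k in count) print k, count[k] }' \
  | sort -n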

Table 4 The number of reads mapped perfectly on a reference genome and the number of reads mapped partially

(The columns 'One mm' through 'Five mm' give the number of reads mapped with that many mismatches.)

Test ID | Depth | Mapped reads | Perfect match (%) | One mm (%) | Two mm (%) | Three mm (%) | Four mm (%) | Five mm (%)
Hs1 | 13x | 421,339,198 | 328,815,216 (78.0) | 53,425,338 (12.7) | 13,284,425 (3.2) | 6,842,191 (1.6) | 5,140,438 (1.2) | 4,082,446 (1.0)
Hs2 | 52x | 1,666,486,126 | 1,319,421,905 (79.2) | 201,568,633 (12.1) | 47,831,915 (2.9) | 24,862,727 (1.5) | 19,052,800 (1.1) | 15,568,114 (0.9)
Bt1 | 31x | 29,291,792 | 25,835,536 (88.2) | 2,684,650 (9.2) | 338,781 (1.2) | 147,841 (0.5) | 89,706 (0.3) | 70,789 (0.24)
Bt2 | 12x | 316,654,265 | 158,463,463 (50.0) | 68,754,190 (21.7) | 29,544,252 (9.3) | 17,337,205 (5.5) | 12,639,289 (4.0) | 10,015,029 (3.2)
Ss1 | 14x | 379,784,341 | 228,627,231 (60.2) | 69,912,403 (18.4) | 29,142,572 (7.7) | 16,701,248 (4.4) | 11,036,852 (2.9) | 7,652,513 (2.0)
Ss2 | 8x | 228,015,545 | 112,216,441 (49.2) | 53,739,562 (23.6) | 25,132,226 (11.0) | 13,874,636 (6.1) | 8,431,144 (3.7) | 5,375,833 (2.4)
Osj | 132x | 486,527,154 | 208,387,077 (42.8) | 113,948,043 (23.4) | 61,697,586 (12.7) | 37,520,642 (7.7) | 23,761,302 (4.9) | 15,370,422 (3.2)
Osi | 30x | 95,332,114 | 54,462,837 (57.1) | 17,325,526 (18.2) | 8,190,929 (8.6) | 5,146,096 (5.4) | 3,394,245 (3.6) | 2,322,355 (2.4)
Zm | 11x | 358,393,982 | 150,686,819 (42.1) | 82,912,817 (23.1) | 44,823,583 (12.5) | 28,375,226 (7.9) | 19,093,235 (5.3) | 12,503,856 (3.5)

Time Measurement


Total run time is the elapsed wall time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. Time measurement for each step is from the latest completion time of the previous step to the latest completion time of the current step as described in Figure 1.
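As a sketch of how these times can be captured in practice, each job can append a completion timestamp for every step to a shared log, and the latest completion time per step can be extracted afterwards; the log format and file names below are assumptions, not the scripts actually used.

# inside each per-sample job script, after a step finishes:
echo "$(date +%s) sample01 aligning_sorting" >> steps.log
# after all jobs complete, report the latest completion time (epoch seconds) per step:
awk '{ if ($1 > latest[$3]) latest[$3] = $1 } END { for (s in latest) print s, latest[s] }' steps.log

Subtracting the latest completion time of the previous step from that of the current step then gives the per-step times reported in Table 5.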

The running time for each data set is summarized in Table 5. Clearly the input sizes, that is, the size of the sequence read files and of the reference genome, are the major factors affecting the running time. The reference genome size is the main driver of the 'Aligning & Sorting' step, while the size of the variant reference has the largest effect on the 'HaplotypeCaller' step.

Table 5 Running time for the BWA-GATK pipeline

Test ID (species) | Hs1 (human) | Hs2 (human) | Bt1 (cow) | Bt2 (cow) | Ss1 (pig) | Ss2 (pig) | Osj (rice, japonica) | Osi (rice, indica) | Zm (corn)
Depth of Coverage | 13x | 52x | 31x | 12x | 14x | 8x | 132x | 30x | 11x
Total read size, gzip compressed (GB) | 34 | 108 | 70 | 22 | 38 | 20 | 44 | 8 | 28
Number of samples run concurrently | 80 | 80 | 120 | 80 | 120 | 80 | 120 | 80 | 80

Run Time (hours)
Aligning & Sorting | 3.93 | 15.79 | 7.25 | 5.77 | 7.53 | 3.04 | 9.50 | 1.18 | 11.16
Mark/Remove Duplicates | 0.66 | 2.62 | 3.45 | 0.73 | 1.07 | 0.27 | 1.27 | 0.12 | 0.72
Generate Realigning Targets | 0.29 | 1.08 | 3.12 | 1.57 | 0.47 | 0.27 | 0.22 | 0.05 | 0.26
Realign around InDel | 2.50 | 8.90 | 4.00 | 3.15 | 2.87 | 1.83 | 7.37 | 1.25 | 3.18
Base Recalibration | 1.82 | 6.80 | 1.39 | 1.96 | 2.37 | 1.01 | 3.16 | 0.36 | 1.91
HaplotypeCaller | 4.52 | 10.28 | 2.75 | 9.33 | 26.21 | 14.65 | 8.95 | 1.77 | 16.72
GenotypeGVCFs | 0.03 | 0.03 | 0.20 | 0.05 | 0.34 | 0.06 | 1.12 | 0.01 | 0.04
Variant Recalibration | 0.67 | 0.37 | 0.32 | 0.86 | 0.58 | 0.56 | 0.92 | 0.04 | 0.46
Apply Recalibration | 0.04 | 0.04 | 0.03 | 0.06 | 0.03 | 0.08 | 0.03 | 0.01 | 0.05
Total Run Time | 14.5 | 45.9 | 22.5 | 23.5 | 41.5 | 21.8 | 32.5 | 4.78 | 34.5
Number of Genomes per Day | 133 | 42 | 128 | 82 | 69 | 88 | 89 | 402 | 56

Discussion

The current version, GATK 3.5, runs considerably slower than version 2.8-1, which we tested for our white paper. In particular, HaplotypeCaller in the new workflow took 4.52 hours, whereas UnifiedGenotyper in the older version took about 1 hour. Despite the significant slow-down, the GATK team believes HaplotypeCaller produces better results that are worth the roughly five-times-longer run.

There are data issues for the non-human species. As shown in Table 5, Hs1 and Ss1 have similar input sizes but show a large difference in running time. The longer running times for non-human species can be explained by the quality of the reference data. Aligning and sorting takes more than twice as long in the other mammals, and it becomes worse in plants; plant genomes are known to contain large numbers of repeat sequences, which make the mapping process difficult. It is important to note that the shorter HaplotypeCaller time for rice does not reflect the real running time, since the size of the reference variant file was reduced significantly due to the chromosome length/position mismatches in the data. All variant records outside the chromosome ranges were removed, but position mismatches were used without correction; the smaller reference variant file and the incorrect position information made the HaplotypeCaller running time shorter. Corn's reference data is no better in terms of accuracy for this benchmark. These data errors are the major cause of the longer processing times.

Nonetheless, the results shown here can serve as good reference points for worst-case running times. Once the reference data are cleaned up by researchers, the overall running time for other mammals should be similar to that of Hs1 in Table 5, with proper scaling for input size. However, it is hard to estimate accurate running times for non-human species at this moment.

Introducing 100Gbps with Intel® Omni-Path Fabric in HPC


 By Munira Hussain, Deepthi Cherlopalle

This blog introduces the Intel® Omni-Path Fabric, a cluster network fabric used for inter-node application, management and storage communication in High Performance Computing (HPC). It is part of the new technology behind the Intel® Scalable System Framework, based on IP from the QLogic TrueScale and Cray Aries acquisitions. The goal of Omni-Path is to eventually meet the performance and scalability demands of exascale data centers.

Dell provides a complete validated and supported solution offering, which includes the Networking H-Series Fabric switches and Host Fabric Interface (HFI) adapters. The Omni-Path HFI is a PCIe Gen3 x16 adapter capable of 100 Gbps unidirectional bandwidth per port; the card has 4 lanes running at 25 Gbps each.

HPC Program Overview with Omni-Path:

The current solution program is based on Red Hat Enterprise Linux 7.2 (kernel version 3.10.0-327.el7.x86_64). The Intel Fabric Suite (IFS) drivers are integrated into the current software solution stack, Bright Cluster Manager 7.2, which helps deploy, provision, install and configure an Omni-Path cluster seamlessly.

The following Dell servers support Intel® Omni-Path Host Fabric Interface (HFI) cards:

PowerEdge R430, PowerEdge R630, PowerEdge R730, PowerEdge R730XD, PowerEdge R930, PowerEdge C4130, and PowerEdge C6320

The management and monitoring of the fabric is done using the Fabric Manager (FM) GUI available from Intel®. The FM GUI provides in-depth analysis and a graphical overview of fabric health, including a detailed breakdown of port status and mapping as well as investigative reports on errors.

 Figure 1: Fabric Manager GUI

The IFS tools include various debugging and management utilities such as opareports, opainfo, opaconfig, opacaptureall, opafabricinfoall, opapingall, opafastfabric, etc. These help capture a snapshot of the fabric and troubleshoot issues. The host-based subnet manager service, known as opafm, is also included with IFS and can scale to thousands of nodes.

The Fabric relies on the PSM2 libraries to provide optimal performance. The IFS package provides precompiled versions of the open source OpenMPI and MVAPICH2 MPI along with some of the micro-benchmarks such as OSU and IMB used to test Bandwidth and Latency measurements of the cluster.

Basic Performance Benchmarking Results:

The performance numbers below were taken on a Dell PowerEdge R630 server. The server configuration consisted of dual-socket Intel® Xeon® E5-2697 v4 processors @ 2.3 GHz with 18 cores each, and 8 x 16 GB of memory @ 2400 MT/s. The BIOS version was 2.0.2, and the system profile was set to Performance.

OSU micro-benchmarks were used to determine latency, with the latency tests run in a ping-pong fashion. HPC applications need low latency and high throughput. As shown in Figure 2, the back-to-back latency is 0.77 µs and the switch latency is 0.9 µs, which is on par with industry standards.

Figure 2: OSU Latency - E5-2697 v4

Figure 3 below shows the OSU uni-directional and bi-directional bandwidth results with the OpenMPI-1.10-hfi version. At 4 MB, the uni-directional bandwidth is around 12.3 GB/s and the bi-directional bandwidth is around 24.3 GB/s, which is on par with the theoretical peak values.

Figure 3: OSU Bandwidth – E5-2697 v4
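For context, latency and bandwidth figures of this kind are typically collected by running the OSU micro-benchmarks between two hosts across the fabric; a minimal sketch using the precompiled Open MPI shipped with IFS is shown below, with the host names and binary locations as placeholders.

mpirun -np 2 -host node001,node002 ./osu_latency   # ping-pong latency
mpirun -np 2 -host node001,node002 ./osu_bw        # uni-directional bandwidth
mpirun -np 2 -host node001,node002 ./osu_bibw      # bi-directional bandwidth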

Conclusion:

 

Omni-Path Fabric adds value to the HPC solution. It is a technology that integrates well as the high-speed fabric needed for designing flexible reference architectures to meet growing computational demand. Users can benefit from the open source fabric tools like FM GUI, Chassis Viewer and FastFabric that are packaged with IFS. The solution is automated and validated with Bright Cluster Manager 7.2 on Dell servers.

More details on how Omni-Path performs in other domains are available here. That document describes the key features of Intel® Omni-Path Fabric technology and provides reference performance data collected on various commercial and open source applications.

Scaling behavior of short read sequence aligners


This blog explores the scaling behavior of the three most popular short read sequence aligners: BWA-mem, Bowtie and Bowtie2. Alignment is the first step of all Next Generation Sequencing (NGS) data analysis and is typically the most time-consuming step in any NGS data analysis pipeline.

It is clear that using more cores speeds up the alignment process, but, as you already know, parallelization comes at a cost. The added complexity, perhaps an order of magnitude more, from multiple instruction streams and the data flowing between them can easily overshadow the speed gained by parallelizing the process. Hence, it is not wise to blindly use all the cores in a compute node, especially when overall throughput is what matters. Identifying a sweet spot for the optimal number of cores per alignment process while maximizing overall throughput is not an easy task in a complex workflow.

Table 1 Server configuration and software

Component | Detail
Server | PowerEdge FC430 in FX2 chassis, 2 sockets
Processor | Dual Intel Xeon E5-2695 v3, 14 cores each (28 physical cores total)
Memory | 128 GB: 8 x 16 GB RDIMM, 2133 MT/s, dual rank, x4 data width
Storage | 480 TB IEEL (Lustre)
Interconnect | InfiniBand FDR
OS | Red Hat Enterprise Linux 6.6
Cluster management tool | Bright Cluster Manager 7.1
Short sequence aligners | BWA 0.7.2-r1039, Bowtie 1.1.2, Bowtie 2.2.6

Table 1 summarizes the system used for the tests here. Hyperthreading was not enabled for these tests, although it helps improve overall performance.

In Figure 1, BWA running times were measured for paired-end read data of different sizes. The size of the sequence read data is represented here in million fragments (MF); for example, 2MF means the sequence data consists of two fastq files, each containing two million sequence reads. One read of each pair is in one fastq file, and the corresponding paired read is in the other. The sweet spot for BWA-mem is roughly 4 to 16 cores, depending on the sequence read file size. Typically the sequence read size is larger than 10 million fragments, so the 2MF and 10MF results are not realistic; however, larger sequence read data follow the behavior of the smaller input sizes as well. In the speedup chart, the solid blue line labeled 'ideal' represents the theoretical speedup obtained by increasing the number of cores.
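A simple way to locate such a sweet spot on a given system is to time the aligner over a range of thread counts on the same input, as in the sketch below; the file names are placeholders, and the alignment output is discarded so that only the run time is measured.

for t in 1 2 4 8 14 16 28; do
    /usr/bin/time -f "threads=$t elapsed=%e s" \
        bwa mem -t $t ref/genome.fa reads_R1.fastq reads_R2.fastq > /dev/null
done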

  

The Bowtie results in Figure 2 show a sweet spot similar to the BWA-mem results, although the running time is slightly faster than BWA-mem's. It is notable that Bowtie and Bowtie2 are sensitive to read length, while BWA-mem shows more consistent scaling behavior regardless of the sequence read length.

 

Although Bowtie2 is even faster than Bowtie, it is more sensitive to the length of the sequence reads, as shown in Figure 3. In fact, the total number of nucleotides could be a better metric for estimating the running time of a given sequence read data set.

In practice, there are more factors to consider in maximizing overall throughput. One critical factor is the bandwidth of the available storage when exploiting the sweet spot instead of using all the cores in a compute node. For example, using 14 cores per alignment process instead of 28 doubles the number of samples processed simultaneously. However, twice as many processes will also saturate the limited storage bandwidth, and over-subscribed storage will significantly slow down all of the processes running at the same time.

Also, these aligners are typically not used alone in a pipeline. They are frequently tied to a file conversion and a sorting process, since subsequent analysis tools require the alignment results sorted either by chromosome coordinate or by read name. The most popular approach is to use pipes and redirection to avoid writing multiple intermediate files. However, this practice makes optimization harder, since it generally requires more computational resources. More detailed optimization of NGS pipelines in this respect will be discussed in upcoming blogs.

Application Performance Study on Intel Broadwell EX processors


Author: Yogendra Sharma, Ashish Singh, September 2016 (HPC Innovation Lab)

This blog describes a performance analysis of a PowerEdge R930 server powered by four Intel Xeon E7-8890 v4 @ 2.2 GHz processors (code-named Broadwell-EX). The primary objective of this blog is to compare the performance of HPL, STREAM and a few scientific applications, ANSYS Fluent and WRF, with the previous-generation Intel Xeon E7-8890 v3 @ 2.5 GHz processor, code-named Haswell-EX. The configurations used for this study are listed below.

Component | R930 with Haswell-EX | R930 with Broadwell-EX
Platform | PowerEdge R930 | PowerEdge R930
Processor | 4 x Intel Xeon E7-8890 v3 @ 2.5 GHz (18 cores), 45 MB L3 cache, 165 W | 4 x Intel Xeon E7-8890 v4 @ 2.2 GHz (24 cores), 60 MB L3 cache, 165 W
Memory | 1024 GB = 64 x 16 GB DDR4 @ 2400 MHz RDIMMs | 1024 GB = 32 x 32 GB DDR4 @ 2400 MHz RDIMMs

BIOS Settings
BIOS | Version 1.0.9 | Version 2.0.1
Processor Settings > Logical Processors | Disabled | Disabled
Processor Settings > QPI Speed | Maximum Data Rate | Maximum Data Rate
Processor Settings > System Profile | Performance | Performance

Software and Firmware
Operating System | RHEL 6.6 x86_64 | RHEL 7.2 x86_64
Intel Compiler | Version 15.0.2 | Version 16.0.3
Intel MKL | Version 11.2 | Version 11.3
Intel MPI | Version 5.0 | Version 5.1.3

Benchmark and Applications
LINPACK | V2.1 from MKL 11.2 | V2.1 from MKL 11.3
STREAM | v5.10, Array Size 1800000000, Iterations 100 | v5.10, Array Size 1800000000, Iterations 100
WRF | v3.5.1, Input Data Conus12KM, Netcdf-4.3.1.1 | v3.8, Input Data Conus12KM, Netcdf-4.4.0
ANSYS Fluent | v15, Input Data: truck_poly_14m, sedan_4m, aircraft_2m | v16, Input Data: truck_poly_14m, sedan_4m, aircraft_2m

Table 1: Details of server and HPC applications used with Haswell-EX and Broadwell-EX processors

____________________________________________________________________________________________________________________________________

In this section of the blog, we compare benchmark numbers for the two generations of processors on the same server platform, the PowerEdge R930, as well as the performance of Broadwell-EX processors with different CPU profiles and memory snoop modes, namely Home Snoop (HS) and Cluster-on-Die (COD).

The High Performance Linpack (HPL) benchmark is a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, a common task in engineering. HPL was run on both PowerEdge R930 servers (with Broadwell-EX and with Haswell-EX) with a block size of NB=192 and a problem size of N=340992.
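For reference, N and NB are set in the HPL.dat file read by the xhpl binary; the fragment below sketches only the relevant lines, and the 2 x 2 process grid shown is an illustrative placeholder rather than the grid actually used for these runs.

1            # of problems sizes (N)
340992       Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
2            Ps
2            Qs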

  

Figure 1: Comparing HPL Performance across BIOS profiles      Figure 2: Comparing HPL Performance over two generations of processors

Figure 1 depicts the performance of the PowerEdge R930 server with Broadwell-EX processors for the different BIOS options. Home Snoop (HS) mode performs better than Cluster-on-Die (COD) under both the Performance and DAPC system profiles. Figure 2 compares the performance of four-socket Intel Xeon E7-8890 v3 and Intel Xeon E7-8890 v4 servers. HPL showed a 47% performance improvement with four Intel Xeon E7-8890 v4 processors in the R930 compared to four Intel Xeon E7-8890 v3 processors. This was due to the ~33% increase in the number of cores plus a 13% increase from the new, improved versions of the Intel compiler and Intel MKL.

The STREAM benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.
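For reference, a STREAM binary with the array size listed in Table 1 is typically built and run along the following lines with the Intel compiler; the three arrays of 1.8 billion doubles total roughly 43 GB, so the medium memory model is needed, and the flags and thread count here are illustrative rather than the exact build used.

icc -O3 -qopenmp -mcmodel=medium -shared-intel \
    -DSTREAM_ARRAY_SIZE=1800000000 -DNTIMES=100 stream.c -o stream
export OMP_NUM_THREADS=96   # e.g. one thread per physical core on 4 x 24 cores
./stream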

 

  

Figure 3: Comparing STREAM Performance across BIOS profiles   Figure 4: Comparing STREAM Performance over two generations of processors

 

As per Figure 3, the memory bandwidth of the PowerEdge R930 server with Intel Broadwell-EX processors is the same across the different BIOS profiles. Figure 4 shows the memory bandwidth of both Intel Xeon Broadwell-EX and Intel Xeon Haswell-EX processors in the PowerEdge R930 server. Haswell-EX and Broadwell-EX support DDR3 and DDR4 memory respectively, while the platform in this configuration supports a memory frequency of 1600 MT/s for both generations of processors. Because the PowerEdge R930 platform supports the same memory frequency for both, both generations of Intel Xeon processors deliver the same memory bandwidth of 260 GB/s in this server.

The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture that facilitates parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers, and can generate atmospheric simulations using real data or idealized conditions. We used the CONUS12km and CONUS2.5km benchmark datasets for this study. CONUS12km is a single-domain, small-size benchmark (a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain from October 24, 2001) with a 72-second time step. CONUS2.5km is a single-domain, large-size benchmark (the latter 3 hours of a 9-hour, 2.5 km resolution case over the CONUS domain from June 4, 2005) with a 15-second time step. WRF decomposes the domain into tasks or patches, and each patch can be further decomposed into tiles that are processed separately; by default there is only one tile per run. If the single tile is too large to fit into the cache of the CPU and/or core, computation slows down due to WRF's memory bandwidth sensitivity. To reduce the tile size, the number of tiles can be increased by defining "numtile = x" in the input file or by setting the environment variable "WRF_NUM_TILES = x". For both CONUS 12km and CONUS 2.5km, the number of tiles was chosen for best performance, which was 56.
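As a sketch, the tile count can be set either through the environment before launching WRF or in the &domains section of namelist.input; note that WRF documentation spells the namelist option numtiles, and the launch line below is a placeholder.

export WRF_NUM_TILES=56      # or add: numtiles = 56   under &domains in namelist.input
mpirun -np <ranks> ./wrf.exe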

  Figure 5: Comparing WRF Performance across BIOS profiles

Figure 5 compares the WRF datasets across the different BIOS profiles. With the CONUS 12KM data, all BIOS profiles perform equally well because of the smaller data size, while for CONUS 2.5KM the Perf.COD profile (Performance system profile with Cluster-on-Die snoop mode) gives the best performance. As per Figure 5, the Cluster-on-Die snoop mode performs 2% better than Home Snoop mode, while the Performance system profile gives 1% better performance than DAPC.

 Figure 6: Comparing WRF Performance over two generations of processors

Figure 6 shows the performance comparison between Intel Xeon Haswell-EX and Intel Xeon Broadwell-EX processors with PowerEdge R930 server. As shown in the graph, Broadwell-EX performs 24% better than Haswell-EX for CONUS 12KM data set and 6% better for CONUS 2.5KM.

ANSYS Fluent is a computational fluid dynamics (CFD) software tool. Fluent includes well-validated physical modeling capabilities to deliver fast and accurate results across the widest range of CFD and multi physics applications.

Figure 7: Comparing Fluent Performance across BIOS profiles

We used three different datasets for Fluent with 'Solver Rating' (higher is better) as the performance metric. Figure 7 shows that all three datasets performed 4% better with the Perf.COD BIOS profile (Performance system profile with Cluster-on-Die snoop mode) than with the others, while the DAPC.HS profile (DAPC system profile with Home Snoop mode) showed the lowest performance. For all three datasets, the COD snoop mode performs 2% to 3% better than Home Snoop mode, and the Performance system profile performs 2% to 4% better than DAPC. The behaviour of Fluent is consistent across all three datasets.

Figure 8: Comparing Fluent Performance over two generations of processors

 

As shown above in Figure 8, for all the test cases Fluent showed a 13% to 27% performance improvement on the PowerEdge R930 with Broadwell-EX compared to the PowerEdge R930 with Haswell-EX.

________________________________________________________________________________________________

 

Conclusion:

Overall, the Broadwell-EX processor makes the PowerEdge R930 server more powerful and more efficient. With Broadwell-EX, HPL performance increases roughly in line with the increase in the number of cores compared to Haswell-EX. There is also an increase in performance for real applications, depending on the nature of their computation. So it can be a good upgrade choice for those running compute-hungry applications.

 


Deep Learning Performance with P100 GPUs


Authors: Rengan Xu and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. October 2016

Introduction to Deep Learning and P100 GPU

Deep Learning (DL), an area of Machine Learning, has achieved significant progress in recent years. Its application area includes pattern recognition, image classification, Natural Language Processing (NLP), autonomous driving and so on. Deep learning attempts to learn multiple levels of features of the input large data sets with multi-layer neural networks and make predictive decision for the new data. This indicates two phases in deep learning: first, the neural network is trained with large number of input data; second, the trained neural network is used to test/inference/predict the new data. Due to the large number of parameters (the weight matrix connecting neurons in different layers and the bias in each layer, etc.) and training set size, the training phase requires tremendous amounts of computation power.

To approach this problem, we utilize accelerators which include GPU, FPGA and DSP and so on. This blog focuses on GPU accelerator. GPU is a massively parallel architecture that employs thousands of small but efficient cores to accelerate the computational intensive tasks. Especially, NVIDIA® Tesla® P100™ GPU uses the new Pascal™ architecture to deliver very high performance for HPC and hyperscale workloads. In PCIe-based servers, P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. And in NVLink™-optimized servers, P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers. P100 is also equipped with High Bandwidth Memory 2 (HBM2) which offers higher bandwidth than the traditional GDDR5 memory. Therefore, the high compute capability and high memory bandwidth make GPU an ideal candidate to accelerate deep learning applications.

Deep Learning Frameworks and Dataset

In this blog, we present the performance and scalability of P100 GPUs with different deep learning frameworks on a cluster. Three deep learning frameworks were chosen: NVIDIA's fork of Caffe (NV-Caffe), MXNet and TensorFlow. Caffe is a well-known and widely used deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. It focuses more on image classification and supports multiple GPUs within a node but not across nodes. MXNet, jointly developed by collaborators from multiple universities and companies, is a lightweight, portable and flexible deep learning framework designed for both efficiency and flexibility; it scales to multiple GPUs within a node and across nodes. TensorFlow, developed by Google's Brain team, is a library for numerical computation using data flow graphs. TensorFlow also supports multiple GPUs and can scale to multiple nodes.

All of the three deep learning frameworks we chose are able to perform the image classification task. With this in mind, we chose the well-known ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2012 dataset. This training dataset contains 1281167 training images and 50000 validation images. All images are grouped into 1000 categories or classes. Another reason we chose ILSVRC 2012 dataset is that its workload is large enough to perform long time training and it is a benchmark dataset used by many deep learning researchers.

Testing Methodology

This blog quantifies the performance of deep learning frameworks using NVIDIA’s P100-PCIe GPU and Dell’s PowerEdge C4130 server architecture. Figure 1 shows the testing cluster. The cluster includes one head node which is Dell’s PowerEdge R630 and four compute nodes which are Dell’s PowerEdge C4130. All nodes are connected by an InfiniBand network and they share disk storage through NFS. Each compute node has 2 CPUs and 4 P100-PCIe GPUs. All of the four compute nodes have the same configurations. Table 1 shows the detailed information about the hardware configuration and software used in every compute node.

Figure 1: Testing Cluster for Deep Learning

 

Table 1: Hardware configuration and software details

Platform | PowerEdge C4130 (configuration G)
Processor | 2 x Intel Xeon E5-2690 v4 @ 2.6 GHz (Broadwell)
Memory | 256 GB DDR4 @ 2400 MHz
Disk | 9 TB HDD
GPU | P100-PCIe with 16 GB GPU memory
Node interconnect | Mellanox ConnectX-4 VPI (EDR 100 Gb/s InfiniBand)
InfiniBand switch | Mellanox SB7890

Software and Firmware
Operating System | RHEL 7.2 x86_64
Linux Kernel | 3.10.0-327.el7
BIOS | Version 2.1.6
CUDA version and driver | CUDA 8.0 (361.77)
NCCL | Version 1.2.3
cuDNN | Version 5.1.3
Intel Compiler | Version 2017.0.098
Python | 2.7.5

Deep Learning Frameworks
NV-Caffe | Version 0.15.13
Intel-Caffe | Version 1.0.0-rc3
MXNet | Version 0.7.0
TensorFlow | Version 0.11.0-rc2

We measured the metrics of both images/sec and training time.

Images/sec measures training speed, while training time is the wall-clock time including training, I/O operations and other overhead. The images/sec number was obtained from "samples/sec" in the MXNet and TensorFlow output log files. NV-Caffe reports "M s/N iter", meaning that M seconds were taken to process N iterations (batches), so images/sec was calculated as batch_size*N/M. The batch size is the number of training samples in one forward/backward pass through all layers of a neural network. The images/sec number was averaged across all iterations to account for deviations.
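As a small worked example of that conversion, the snippet below turns an "M s/N iter" figure into images/sec for a given batch size; the numbers are placeholders rather than measured values.

batch_size=512; M=19.2; N=20   # hypothetical: 20 iterations took 19.2 seconds
awk -v b=$batch_size -v m=$M -v n=$N 'BEGIN { printf "%.1f images/sec\n", b * n / m }'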

The training time was obtained from “Time cost” in MXNet output logs. For NV-Caffe and TensorFlow, their output log files contained the wall-clock timestamps during the whole training. So the time difference from the start to the end of the training was calculated as the training time.

Since NV-Caffe does not support distributed training, it was not run on multiple nodes. The MXNet framework was able to run on multiple nodes; the caveat was that by default it could only use the 10 Gb/s Ethernet interface on the compute nodes, so the performance was not as high as expected. To solve this issue, we manually changed its source code so that the high-speed InfiniBand interface (EDR 100 Gb/s) was used instead. Training with TensorFlow on multiple nodes ran, but with poor performance; the reason is still under investigation.

Table 2 shows the input parameters used in the different deep learning frameworks. In all frameworks, neural network training requires many epochs or iterations; whether the term epoch or iteration is used is determined by each framework. An epoch is a complete pass through all samples in a given dataset, while one iteration processes only one batch of samples. Therefore, the relationship between iterations and epochs is: epochs = (iterations*batch_size)/training_samples. Each framework only needs either epochs or iterations, so the other parameter can easily be determined by this formula. Since our goal was to measure the performance and scalability of Dell's server and not to train an end-to-end image classification model, the training was a subset of the full model training that was large enough to reflect performance, so we chose a smaller number of epochs or iterations that could finish in a reasonable time. Although only partial training was performed, the training speed (images/sec) remained relatively constant over this period.
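For example, applying the formula above to the ILSVRC 2012 training set, the number of iterations corresponding to one epoch at a given batch size can be computed as follows (a worked example, not a setting taken from the runs):

training_samples=1281167; batch_size=256
awk -v s=$training_samples -v b=$batch_size \
    'BEGIN { printf "iterations per epoch = %d\n", (s + b - 1) / b }'   # prints 5005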

The batch size is one of the hyperparameters the user needs to tune when training a neural network model with mini-batch Stochastic Gradient Descent (SGD). The batch sizes in the table are commonly used values; whether they are optimal for model accuracy is left for future work. For all neural networks in all frameworks, we increased the batch size proportionally with the number of GPUs. At the same time, the number of iterations was adjusted so that the total number of samples processed was fixed no matter how many GPUs were used. Since an epoch is independent of batch size, its value was not changed when a different number of GPUs was used. For MXNet GoogleNet, there was a runtime error if different batch sizes were used for different numbers of GPUs, so we used a constant batch size. The learning rate is another hyperparameter that needs tuning; in this experiment, the default value in each framework was used.

 

Table 2: Input parameters used in different deep learning frameworks

Framework and network | Device(s) | Batch size | Image shape | Iterations/Epochs
NV-Caffe GoogleNet | CPU | 128 | 224 | 4000 iterations
NV-Caffe GoogleNet | 1 P100 | 128 | 224 | 4000 iterations
NV-Caffe GoogleNet | 2 P100 | 256 | 224 | 2000 iterations
NV-Caffe GoogleNet | 4 P100 | 512 | 224 | 1000 iterations
TensorFlow Inception-V3 | 1 P100 | 64 | 299 | 4000 iterations
TensorFlow Inception-V3 | 2 P100 | 128 | 299 | 2000 iterations
TensorFlow Inception-V3 | 4 P100 | 256 | 299 | 1000 iterations
MXNet GoogleNet | 1-16 P100 | 144 | 256 | 1 epoch
MXNet Inception-BN | 1 P100 | 64 | 224 | 1 epoch
MXNet Inception-BN | 2 P100 | 128 | 224 | 1 epoch
MXNet Inception-BN | 4 P100 | 256 | 224 | 1 epoch
MXNet Inception-BN | 8 P100 | 256 | 224 | 1 epoch
MXNet Inception-BN | 12 P100 | 256 | 224 | 1 epoch
MXNet Inception-BN | 16 P100 | 256 | 224 | 1 epoch

 

Performance Evaluation

Figure 2 shows the training speed (images/sec) and training time (wall-clock time) of the GoogleNet neural network in NV-Caffe using P100 GPUs. The training speed increased as the number of P100 GPUs increased and, as a result, the training time decreased. The CPU result in Figure 2 was obtained with Intel-Caffe on two Intel Xeon E5-2690 v4 (14-core Broadwell) processors within one node. We chose Intel-Caffe for the pure CPU test because it has better CPU optimizations than NV-Caffe. From Figure 2, we can see that 1 P100 GPU is ~5.3x and 4 P100 GPUs are ~19.7x faster than the Broadwell-based CPU server. Since NV-Caffe does not yet support distributed training, we only ran it on up to 4 P100 GPUs in one node.

 

Figure 2: The training speed and time of GoogleNet in NV-Caffe using P100 GPUs

Figures 3 and 4 show the training speed and time of the GoogleNet and Inception-BN neural networks in MXNet using P100 GPUs. In both figures, 8 P100 used 2 nodes, 12 P100 used 3 nodes and 16 P100 used 4 nodes. As both figures show, MXNet scaled well in training speed and training time as more P100 GPUs were used. As mentioned in the Testing Methodology section, using the Ethernet interfaces on all nodes would have impacted the training speed and training time significantly, since I/O would not have been fast enough to feed the GPU computations; based on our observations, the training speed with Ethernet was only half of that with the InfiniBand interfaces. In both MXNet and TensorFlow, the CPU implementation was extremely slow and we believe not CPU-optimized, so we did not compare their P100 performance with CPU performance.

Figure 3: The training speed and time of GoogleNet in MXNet using P100 GPUs

Figure 4: The training speed and time of Inception-BN in MXNet using P100 GPUs

Figure 5 shows the training speed and training time of Inception-V3 neural network in TensorFlow using P100 GPUs. Similar to NV-Caffe and MXNet, TensorFlow also showed good scalability in training speed when more P100 GPUs were used. The training with TensorFlow on multiple nodes was able to run but with poor performance. So that result was not shown here and the reason is still under investigation.

Figure 5: The training speed and time of Inception-V3 in TensorFlow using P100 GPUs

Figure 6 shows the speedup when using multiple P100 GPUs in the different deep learning frameworks and neural networks. The purpose of this figure is to demonstrate the speedup within each framework as more GPUs are used, not to compare the frameworks with each other, since their input parameters were different. When using 4 P100 GPUs for NV-Caffe GoogleNet and TensorFlow Inception-V3, we observed speedups of up to 3.8x and 3.0x, respectively. For MXNet, 16 P100 GPUs achieved a 13.5x speedup in GoogleNet and a 14.7x speedup in Inception-BN, which are close to the ideal 16x. In particular, we observed linear speedup when using 8 and 12 P100 GPUs with the Inception-BN neural network.

 

Figure 6: Speedup of multiple P100 GPUs in different DL frameworks and networks

In practice, a real user application can take days or weeks to train a model. Although our benchmark cases run in a few minutes or a few hours, they are just small snapshots of the much longer runs that would be needed to really train a network. For example, training a real application might take 90 epochs of 1.2M images. A Dell C4130 with P100 GPUs can turn in results in less than a day, while a CPU takes more than a week; that is the real benefit to end users. The effect for a real use case is saving weeks of time per run, not seconds.

Conclusions and Future Work

Overall, we observed great speedup and scalability in neural network training when multiple P100 GPUs were used in Dell’s PowerEdge C4130 server and multiple server nodes were used. The training speed increased and the training time decreased as the number of P100 GPUs increased. From the results shown, it is clear that Dell’s PowerEdge C4130 cluster is a powerful tool for significantly speeding up neural network training.

In future work, we will try the P100 for NVLink-optimized servers with the same deep learning frameworks, neural networks and dataset to see how much performance improvement can be achieved. This blog used the PowerEdge C4130 configuration G, in which only GPU 1 and GPU 2, and GPU 3 and GPU 4, have peer-to-peer access. In the future, we will try C4130 configuration B, in which all four GPUs connected to one socket have peer-to-peer access, and check the performance impact of that configuration. We will also investigate the impact of hyperparameters (e.g. batch size and learning rate) on both training performance and model accuracy. The reason for the slow training performance with TensorFlow on multiple nodes will also be examined.

 

Understanding the Role of Dell EMC Isilon SmartConnect in Genomics Workloads


Coming together with EMC has opened many new opportunities for the Dell EMC HPC Team to develop high-performance computing and storage solutions for the Life Sciences. Our lab recently stood up a ‘starter' 3 node Dell EMC Isilon X410 cluster. As a loyal user of the Isilon X210 in a previous role, I couldn’t wait to start profiling genomics applications using the X410 with Dell EMC HPC System for Life Sciences.

Because our current Isilon X410 storage cluster is currently fixed at the 3 node minimum, we aren’t set up yet to evaluate the scalability of the X410 with genomics workflows. We will tackle this work once our lab receives additional X nodes and the new the Isilon All-Flash node (formerly project Nitro).

 In the meantime, I wanted to understand how the Isilon storage behaves relative to other storage solutions and decided to focus on the role of Isilon SmartConnect.

Through a single host name, SmartConnect enables client connection load balancing and dynamic network file system (NFS) failover and failback of client connections across storage nodes to provide optimal utilization of the Isilon cluster resources.

 Without the need to install client-side drivers, administrators can easily manage a large and growing number of clients and ensure in the event of a system failure, in-flight reads and writes will successfully finish without failing. 

 Traditional storage systems with two-way failover typically sustain a minimum 50 percent degradation in performance when a storage head fails, as all clients must fail over to the remaining head. With Isilon SmartConnect, clients are evenly distributed across all remaining nodes in the cluster during failover, helping to ensure minimal performance impact.

 To test this concept, I ran the GATK pipeline varying the number of samples and compute nodes without and with SmartConnect enabled on the Isilon storage cluster.

The configuration of our current lab environment and whole human genome sequencing data used for this evaluation are listed below. 

Table 1 System configuration, software, and data

Dell EMC HPC System for Life Sciences
Server | 40 x PowerEdge C6320
Processor | 2 x Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
Memory | 128 GB at 2400 MT/s
Interconnect | 10GbE NIC and switch for accessing Isilon; Intel Omni-Path fabric

Software
Operating System | Red Hat Enterprise Linux 7.2
BWA | 0.7.2-r1039
Samtools | 1.2.1
Sambamba | 0.6.0
GATK | 3.5

Benchmark Data
ERR091571 | 10x whole human genome sequencing data from an Illumina HiSeq 2000; total number of reads = 211,437,919

As noted earlier, the Isilon in our environment is currently set up in a minimum 3 node configuration. The current generation of Isilon is scalable up to 144 nodes. As you add additional Isilon nodes, the aggregate memory, throughput, and IOPS scale linearly. For a deep dive on Isilon and OneFS file system, see this technical white paper

The Isilon storage cluster in our lab is summarized in Table 2. The Isilon storage is mounted on each compute node, up to 40 nodes, through NFS (version 3) over a 10GbE network.
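For reference, each compute node mounts the cluster through the SmartConnect zone name with an NFSv3 mount roughly like the one below; the zone name, export path and mount point are placeholders.

mount -t nfs -o vers=3 isilon-sc.example.com:/ifs/data /mnt/isilon

Because the zone name is resolved by SmartConnect, each new client mount can be answered with a different node IP address (round robin in our configuration), which is how client connections get spread across the three X410 nodes.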

Table 2: Current Isilon configuration

Dell EMC Isilon
Server | 3 x X410
Processor | 2 x Intel Xeon E5-2640 v2 @ 2.00 GHz, 16 cores
Memory | 256 GB at 2400 MT/s
Back-end networking | 2 x IB QDR links
Front-end networking | 2 x 1GbE ports and 2 x 10GbE SFP+ ports
Storage capacity | 30 x 4 TB HDDs, 120 TB (usable)

Software
Operating System | OneFS 8.0
Isilon SmartConnect | Round-robin mode

Table 3 summarizes all the tests we performed. To mimic a storage environment without proper load balancing, all the tests were performed without SmartConnect enabled except the concurrent 120 sample run.

Each pipeline (job) runs with one sample and uses 11 cores on a single compute node. A maximum three pipelines run concurrently on a single compute node. The tests were performed up to 40 nodes and 120 samples.

The detailed running times for each sub-step of the BWA-GATK pipeline are included in Table 3. The Aligning & Sorting and HaplotypeCaller steps are the bottlenecks in the pipeline, but Aligning & Sorting is more sensitive to the number of samples. In this benchmark, GenotypeGVCFs is not a bottleneck because identical sequence data were used for all concurrent pipelines; in real analyses, where a large number of different samples is used, GenotypeGVCFs becomes the major bottleneck.

Table 3 Test results for the BWA-GATK pipeline without and with Isilon SmartConnect enabled (run times in hours)

Number of samples (data size) | 3 | 15 | 30 | 60 | 90 | 120 | 120 with SmartConnect
Number of compute nodes | 1 | 5 | 10 | 20 | 30 | 40 | 40
Aligning & Sorting | 3.48 | 3.54 | 3.64 | 4.21 | 4.94 | 5.54 | 4.69
Mark/Remove Duplicates | 0.46 | 0.49 | 0.79 | 1.27 | 2.52 | 3.07 | 1.84
Generate Realigning Targets | 0.19 | 0.18 | 0.19 | 0.19 | 0.18 | 0.20 | 0.18
Realign around InDel | 2.22 | 2.20 | 2.24 | 2.27 | 2.26 | 2.27 | 2.29
Base Recalibration | 1.17 | 1.18 | 1.19 | 1.18 | 1.16 | 1.13 | 1.17
HaplotypeCaller | 4.13 | 4.35 | 4.39 | 4.34 | 4.31 | 4.32 | 4.29
GenotypeGVCFs | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
Variant Recalibration | 0.58 | 0.50 | 0.53 | 0.55 | 0.57 | 0.53 | 0.41
Apply Recalibration | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
Total Running Time (hrs) | 12.3 | 12.5 | 13.0 | 14.1 | 16.0 | 17.1 | 15.0
Number of Genomes per Day | 6 | 29 | 53 | 96 | 129 | 156 | 185

As shown in Table 3, beyond 30 samples on 10 compute nodes the total running time of BWA-GATK began to increase, and it continued to climb as the number of compute nodes doubled without SmartConnect enabled. Also starting at 30 samples, we saw jobs begin to fail, presumably due to unbalanced client connections and the inability to fail those connections over and back.

However, when we enabled SmartConnect using the default round-robin settings, we saw a significant improvement on the total run time and daily sample throughput.

As expected, SmartConnect, maximized performance by keeping client connections balanced across all three Isilon storage nodes. In the three X410 configuration with SmartConnect enabled, the 120 samples processed with 40 compute nodes showed a 14% speed-up and 19% increased daily sample throughput.

This test also suggests a starting point identifying the number client connections per Isilon node for this genomics workflow. In our case, adding one additional X410 to the Isilon storage cluster for each 15 additional compute nodes may be a reasonable place to start. 

As we add additional Isilon nodes to our cluster, we will perform additional studies to refine recommendations for the number of client connections per Isilon node for this genomics workflow. We’ll also take a deeper dive with the advanced SmartConnect load balancing options like CPU utilization, connection count, and network throughput. The Isilon SmartConnect White Paper provides a detailed summary and examples for each of these modes. 

If you are using Bright Cluster Manager and need some tips to set it up with Isilon SmartConnect, read this post.

Cryo-EM in HPC with KNL


By Garima Kochhar and Kihoon Yoon. Dell EMC HPC Innovation Lab. October 2016

This blog presents performance results for the 2D alignment and 2D classification phases of the Cryo-electron microscopy (Cryo-EM) data processing workflow using the new Intel Knights Landing architecture, and compares these results to the performance of the Intel Xeon E5-2600 v4 family. A quick description of Cryo-EM and the different phases in the process of reconstructing 3D molecular structures with electron microscopy is provided below, followed by the specific tests conducted in this study and the performance results.

Cryo-EM allows molecular samples to be studied in near-native states and down to nearly atomic resolutions. Studying the 3D structure of these biological specimens can lead to new insights into their functioning and interactions, especially with proteins and nucleic acids, and allows structural biologists to examine how alterations in their structures affect their functions. This information can be used in system biology research to understand the cell signaling network which is part of a complex communication system. This communication system controls fundamental cell activities and actions to maintain normal cell homeostasis. Errors in the cellular signaling process can lead to diseases such as cancer, autoimmune disorders, and diabetes. Studying the functioning of the proteins responsible for an illness enables a biologist to develop specific drugs that can interact with the protein effectively, thus improving the efficacy of treatment. 

The workflow from the time a molecular sample is created to the creation of a 3D model of its molecular structure involves multiple steps. These steps are briefly (and simplistically!) described below.

  1. Samples of the molecule (protein, enzyme, etc.) are purified and concentrated in a solution.
  2. This sample is placed on an electron microscope grid and plunge-frozen. This forms a very thin layer of vitreous ice that surrounds and immobilizes the sample in its near-native state.
  3. The frozen sample is now placed in the microscope for imaging.
  4. The output of the microscope consists of many large image files and across multiple fields of view (many terabytes of data).
  5. Due to the low energy beams used in Cryo-EM (to avoid damaging the structures being studied), the images produced by the microscope have a bad signal-to-noise ratio. To improve the results, the microscope takes multiple images for each field of view.  Motion-correction techniques are then applied to allow the multiple images of the same molecule to be added together into an image with less noise.
  6. The next step is a manual process of picking “good-looking” molecule images from a few fields of view.
  7. The frozen sample consists of many molecules in many different orientations. The resultant Cryo-EM images therefore also consist of images, or shadows, of the particle from different angles. So, the next step is a 2D alignment phase to uniformly orient the images by image rotation and translation.
  8. Next a 2D classification phase searches through these oriented images and sorts them into “classes”, grouping images that have the same view.
  9. After alignment and classification, there should be multiple collections of images, where each collection contains images showing a view of the molecule from the same angle and showing the same shape of the molecule (a “class”).  The images in a class are now combined into a composite image that provides a higher quality representation of that shape.
  10. Finally a 3D reconstruction of the molecule is built from all the composite 2D images.
  11. This 3D model can then be handed back to the structural biologist for further analysis, visualization, etc.

As is now clear, the Cryo-EM processing workflow must handle a lot of data, requires rich compute algorithms and considerable compute power for the 2D and 3D phases, and must move data efficiently across the multiple phases in the workflow. Our goal is to design a complete HPC system that can support the Cryo-EM workflow from start to finish and is optimized for performance, energy efficiency and data efficiency.

 

Performance Tests and Configuration

Focusing for now on the 2D phases of the workflow, this blog presents results for steps #7 and #8 listed above - the 2D alignment and 2D classification phases. Two software packages in this domain, ROME and RELION, were benchmarked on the Knights Landing (KNL, code name for the Intel Xeon Phi 7200 family) and Broadwell (BDW, code name for the Intel Xeon E5-2600 v4 family) processors.

The tests were run on systems with the following configuration.

Broadwell-based systems
  Server: 12 x Dell PowerEdge C6320
  Processor: Intel Xeon E5-2697 v4, 18 cores per socket, 2.3 GHz
  Memory: 128 GB at 2400 MT/s
  Interconnect: Intel Omni-Path fabric

KNL-based systems
  Server: 12 x Dell PowerEdge C6320p
  Processor: Intel Xeon Phi 7230, 64 cores, 1.3 GHz
  Memory: 96 GB at 2400 MT/s
  Interconnect: Intel Omni-Path fabric

Software
  Operating System: Red Hat Enterprise Linux 7.2
  Compilers: Intel 2017, 17.0.0.098 Build 20160721
  MPI: Intel MPI 5.1.3
  ROME: 1.0a
  RELION: 1.4

Benchmark Datasets

  RING11_ALL (Set1, Inflammasome data): 16,306 images of NLRC4/NAIP2 inflammasome, 250^2 pixels each
  DATA6 (Set4, RP-a): 57,001 images of proteasome regulatory particles (RP), 160^2 pixels each
  DATA8 (Set2, RP-b): 35,407 images of proteasome regulatory particles (RP), 160^2 pixels each

 

ROME

ROME performs the 2D alignment (step #7 above) and the 2D classification (step #8 above) in two separate phases called the MAP phase and the SML phase respectively. For our tests we used “-k” for MAP equal to 50 (i.e. 50 initial classes) and “-k” for SML equal to 1000 (i.e. 1000 final 2D classes).

The first set of graphs below, Figure 1 and Figure 2, shows the performance of the SML phase on KNL. The compute portion of the SML phase scales linearly as more KNL systems are added to the test bed, from 1 to 12 servers, as shown in Figure 1. The total time to run, shown in Figure 2, scales slightly less than linearly and includes an I/O component as well as the compute component. The test bed used in this study did not have a parallel file system and used just the local disks on the KNL servers. Future work for this project includes evaluating the impact of adding a Lustre parallel file system to this test bed and its effect on the total time for SML.
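One simple way to see why total time trails the compute-only scaling is a toy model in which the compute portion scales linearly with server count while the I/O portion stays roughly fixed (there is no parallel file system in this test bed). The times below are invented solely to illustrate that behavior; they are not the measured values behind Figures 1 and 2.

```python
# Toy strong-scaling model: compute time shrinks linearly with the number of
# KNL servers, while the I/O time stays roughly constant (local disks, no
# parallel file system). All times are illustrative, not measured values.

COMPUTE_1NODE_MIN = 120.0   # hypothetical single-server SML compute time (minutes)
IO_MIN = 10.0               # hypothetical I/O time per run (minutes)

for nodes in (1, 2, 4, 8, 12):
    compute = COMPUTE_1NODE_MIN / nodes
    total = compute + IO_MIN
    speedup_compute = COMPUTE_1NODE_MIN / compute
    speedup_total = (COMPUTE_1NODE_MIN + IO_MIN) / total
    print(f"{nodes:2d} servers: compute speedup {speedup_compute:4.1f}x, "
          f"total speedup {speedup_total:4.1f}x")
```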

Figure 1 - ROME SML scaling on KNL, compute time

Figure 2 - ROME SML scaling on KNL, total time

The next set of graphs compare the ROME SML performance on KNL and Broadwell. Figure 3, Figure 4 and Figure 5 plot the compute time for SML on 1 to 12 servers. The black circle on the graph shows the improvement in KNL runtime when compared to BDW. For all three datasets that were benchmarked, KNL is about 3x faster than BDW. Note we’re comparing one single-socket KNL server to a dual-socket Broadwell server, so this is a server to server comparison (not socket to socket). KNL is 3x faster than BDW across different numbers of servers, showing that ROME SML scales well on Omni-Path on both KNL and BDW, but the absolute compute time on KNL is 3x faster irrespective of the number of servers in test.

Considering total time to run on KNL versus BDW, we measured KNL to be 2.4x to 3.4x faster than BDW at all node counts. Specifically, DATA6 is ~2.4x faster on KNL, DATA8 is 3x faster on KNL and RING11_ALL is 3.4x faster on KNL when considering total time to run. As mentioned before, the total time includes an I/O component, and one of the next steps in this study is to evaluate the performance improvement from adding a parallel file system to the test bed.

Figure 3 - DATA8 ROME SML on KNL and BDW

Figure 4 - DATA6 ROME SML on KNL and BDW.

  

Figure 5 - RING11_ALL ROME SML on KNL and BDW

 

RELION

RELION accomplishes the 2D alignment and classification steps mentioned above in one phase. Figure 6 shows our preliminary results for RELION on KNL, across 12 servers and on two of the test datasets. The “--K” parameter for RELION was set to 300, i.e., 300 classes for 2D classification. There are several things still to be tried here – the impact of a parallel file system on RELION (as we discussed for ROME earlier) and dataset sensitivity to the parallel file system. Additionally, we plan to benchmark RELION on Broadwell, across different node counts and with different input parameters.

Figure 6 - RELION 2D alignment and classification on KNL

Next Steps

The next steps in this project include adding a parallel file system to measure the impact on the workflow, tuning the test parameters for ROME MAP, SML and RELION, and testing on more datasets. We also plan to measure the power consumption of the cluster when running Cryo-EM workloads to analyze performance per watt and performance per dollar metrics for KNL.


Dell China Receives AI Innovation Award


Innovation Award of Artificial Intelligence Technology and Practice presented to Dell China by CCF (China Computer Federation)

Dell China has been honored with an “Innovation Award of Artificial Intelligence in Technology & Practice” in recognition of Dell’s collaboration with the Institute of Automation, Chinese Academy of Sciences (CASIA) in establishing the Artificial Intelligence and Advanced Computing Joint-Lab. The advanced computing platform was jointly unveiled by Dell China and CASIA in November 2015, and the AI award was presented by the Technical Committee of High Performance Computing (TCHPC), China Computer Federation (CCF), at the China HPC 2016 conference in Xi’an City, Shaanxi Province, China, on October 27, 2016. About a half dozen additional awards were presented at HPC China, an annual national conference on high performance computing organized by TCHPC. However, Dell China was the only vendor to receive an award in the emerging field of artificial intelligence in HPC.

The Artificial Intelligence and Advanced Computing Joint-Lab’s focus is on research and applications of new computing architectures in brain information processing and artificial intelligence, including cognitive function simulation, deep learning, brain computer simulation, and related new computing systems. The lab also supports innovation and development of brain science and intelligence technology research, promoting Chinese innovation and breakthroughs at the forefront of science, and working to produce and industrialize these core technologies in accordance with market and industry development needs.

CASIA, a leading AI research organization in China, has huge requirements for computing and storage, and the new advanced computing platform — designed and set up by engineers and professors from Dell and CASIA — is just the tip of the iceberg with respect to CASIA’s research requirements. It features leading Dell HPC systems components designed by the Dell USA team, including servers, storage, networking and software, as well as leading global HPC partner products, including Intel CPU, NVIDIA GPU, Mellanox IB Network and Bright Computing software. The Dell China Services team implemented installation and deployment of the system, which was completed in February 2016.

The November 3, 2015, unveiling ceremony for the Artificial Intelligence and Advanced Computing Joint-Lab was held in Beijing. Marius Haas, Chief Commercial Officer and President, Enterprise Solutions of Dell; Dr. Chenhong Huang, President of Dell Greater China; and Xu Bo, Director of CASIA attended the ceremony and addressed the audience.

“As a provider of end-to-end solutions and services, Dell has been focusing on and promoting the development of frontier science and technologies, and applying the latest technologies to its solutions and services to help customers achieve business transformation and meet their ever-changing demands,” Haas said at the unveiling. “We’re glad to cooperate with CASIA in artificial intelligence, which once again shows Dell’s commitment to China’s market and will drive innovation in China’s scientific research.”

“Dell is well-positioned to provide innovative end-to-end solutions. Under the new 4.0 strategy of ‘In China, For China’, we will strengthen the cooperation with Chinese research institutes and advance the development of frontier technologies,” Huang explained. “Dell’s cooperation with CASIA represents a combination of computing and scientific research resources, which demonstrates a major trend in artificial intelligence and industrial development.”

China is a role model for emerging market development and practice sharing for other emerging countries. Partnering with CASIA and other strategic partners is Dell’s way of embracing the “Internet+” national strategy, promoting Chinese innovation and breakthroughs at the forefront of science.

“China’s strategy in innovation-driven development raises the bar for scientific research institutes. The fast development of information technologies in recent years also brings unprecedented challenges to CASIA,” added Bo. “CASIA always has intelligence technologies in mind as their main focus of strategic development. The cooperation with Dell China on the lab will further the computing advantages of the Institute of Automation, strengthen the integration between scientific research and industries, and advance artificial intelligence innovation.”

Dell China is looking forward to continued cooperation with CASIA in driving artificial intelligence across many more fields, such as meteorology, biology and medical research, transportation, and manufacturing.

Advancing HPC: A Closer Look at Cool New Tech


Dell EMC has partnered with Scientific Computing, publisher of HPC Source, and NVIDIA to produce an exclusive high performance computing supplement that takes a look at some of today’s cool new HPC technologies, as well as some of the work being done to extend HPC capabilities and opportunities.

This special publication, “New Technologies in HPC,” highlights topics such as innovative technologies in HPC and the impact they are having on the industry, HPC trends to watch, and advancing science with AI. It also looks at how organizations are extending supercomputing with cloud, machine learning technologies for the modern data center, and getting started with deep learning.

This digital supplement can be viewed on-screen or downloaded as a PDF.

Taking our dive into new HPC technologies a bit deeper, we also brought together technology experts Paul Teich, Principal Analyst at TIRIAS Research, and Will Ramey, Senior Product Manager for GPU Computing at NVIDIA, for a live, interactive discussion with contributing editor Tim Studt: “Accelerate Your Big Data Strategy with Deep Learning.”

Paul and Will share their unique perspectives on where artificial intelligence is leading the next wave of industry transformation, helping companies go from data deluge to data-hungry. They provide insights on how organizations can accelerate their big data strategies with deep learning, the fastest growing field in AI, and discuss how, by using data-driven algorithms powered by GPU accelerators, companies can get faster insights, see dynamic correlations, and achieve actionable knowledge about their business.

For those who couldn't make the live broadcast, it is available for on-demand viewing.

To learn more about HPC at Dell EMC, join the Dell EMC HPC Community at www.Dellhpc.org, or visit us online at www.Dell.com/hpc and www.HPCatDell.com.

With Blazing Speed: Some of the Fastest Systems on the Planet are Powered by Dell EMC


MIT Lincoln Laboratory Supercomputing Center created a 1 petaflop system in less than a month to further research in autonomous systems, device physics and machine learning.

Twice each year, the TOP500 list ranks the 500 most powerful general-purpose computer systems known. In the present list, released at the SC16 conference in Salt Lake City, UT, computers in common use for high-end applications are ranked by their performance on the LINPACK Benchmark. Sixteen of these world-class systems are powered by Dell EMC. Collectively, these customers are accomplishing amazing results, continually innovating and breaking new ground to solve the biggest, most important challenges of today and tomorrow while also making major contributions to the advancement of HPC.

Here are just a few examples:

  • Texas Advanced Computing Center/University of Texas
    Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P and
    Stampede-KNL - Intel S7200AP Cluster, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path
    The Texas Advanced Computing Center (TACC), a Dell EMC HPC Innovation Center, designs and operates some of the world's most powerful computing resources. The Center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. TACC supports the University of Texas System and National Science Foundation researchers with the newest version of their Stampede high-performance computing cluster.

  • MIT Lincoln Laboratory
    TX-Green - S7200AP Cluster, Intel Xeon Phi 7210 64C 1.3GHz, Intel Omni-Path
    MIT Lincoln Laboratory Supercomputing Center (LLSC) supports research and development aimed at solutions to problems that are critical to the Nation. The research spans diverse fields such as space observations, robotic vehicles, communications, cyber security, machine learning, sensor processing, electronic devices, bioinformatics, and air traffic control. LLSC addresses the supercomputing needs of thousands of MIT scientists and engineers by providing interactive, on-demand supercomputing and big data capabilities with zero carbon footprint.

  • Centre for High Performance Computing, South Africa         
    Lengau - PowerEdge C6320, Xeon E5-2690v3 12C 2.6GHz, Infiniband FDR
    The Centre for High Performance Computing (CHPC), a Dell EMC HPC Innovation Center, is part of South Africa’s Council for Scientific and Industrial Research and hosts the fastest computer on the African continent. CHPC’s new Dell EMC-powered Lengau system will enable new opportunities and avenues in cutting-edge research, such as building the world's largest radio telescope, and will provide the computational capacity to build the private sector and non-academic user base of CHPC to help spur national economic growth.

  • University of Florida
    HiperGator 2.0 - PowerEdge C6320, Xeon E5-2698v3 16C 2.3GHz, Infiniband
    The University of Florida’s HiPerGator 2.0 system performs complex calculations and data analyses for researchers and scholars at UF and their collaborators worldwide. It is helping researchers find life-saving drugs and get them from the computer to the clinic more quickly, make more accurate, decades-long weather forecasts and improve body armor for troops.

  • Ohio Supercomputer Center
    Owens - Dell PowerEdge C6320/R730, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR
    The Ohio Supercomputer Center empowers a wide array of groundbreaking innovation and economic development activities in the fields of bioscience, advanced materials, data exploitation and other areas of state focus by providing a powerful high performance computing, research and educational cyberinfrastructure for a diverse statewide/regional constituency.

  • Dell EMC HPC Innovation Lab
    Zenith - Dell PowerEdge C6320 & Dell PowerEdge R630, Xeon E5-2697v4 18C 2.3GHz, Intel Omni-Path
    The Dell EMC HPC Innovation Lab is dedicated to HPC research, development and innovation. Its engineers are meeting real-life, workload-specific challenges through collaboration with the global HPC research community and are publishing whitepapers on their research findings. They are utilizing the lab’s world-class Infrastructure to characterize performance behavior and to test and validate upcoming technologies. The Dell EMC HPC Innovation Lab is also an OpenHPC R&D contributor.

HPC Community Honors Dell EMC with Highly Coveted HPCwire 2016 Editor’s Choice Awards

Ed Turkel, customer Karen Green (CRC) with Tom Tabor receiving HPCwire's Editors Choice Award for Best Use of High Performance Data Analytics
Ed Turkel (HPC Sr. Strategist) with Tom Tabor (Tabor Communications CEO) receiving HPCwire's Editors Choice Award for Top Five Vendors to Watch

Just before the kick-off of the opening gala for the SC16 international supercomputing conference, HPCwire unveiled the winners of the 2016 HPCwire Editors’ Choice Awards. Each year, this awards program recognizes the best and the brightest developments that have happened in high performance computing over the past 12 months. Selected by a panel of HPCwire editors and thought leaders in HPC, these awards are highly coveted as prestigious recognition of achievements by the HPC community.

Traditionally revealed and presented each year to kick off the Supercomputing Conference (SC16), which showcases high performance computing, networking, storage, and data analysis, the awards are an annual feature of the publication and spotlight outstanding breakthroughs and achievements in HPC.

Tom Tabor, CEO of Tabor Communications, the publisher of HPCwire, announced the list of winners in Salt Lake City, UT.

“From thought leaders to end users, the HPCwire readership reaches and engages every corner of the high performance computing community,” said Tabor. “Receiving their recognition signifies community support across the entire HPC space, as well as the breadth of industries it serves.”

Dell EMC was honored to be presented with two 2016 HPCwire Editors’ Choice Awards:

Best Use of High Performance Data Analytics:
The Best Use of High Performance Data Analytics award was presented to UNC-Chapel Hill Institute of Marine Sciences (IMS) and Coastal Resilience Center of Excellence (CRC), Renaissance Computing Institute (RENCI), and Dell EMC. UNC-Chapel Hill IMS and CRC work with the Dell EMC-powered RENCI Hatteras Supercomputer to predict dangerous coastal storm surges, including Hurricane Matthew, a long-lived, powerful and deadly tropical cyclone which became the first Category 5 Atlantic hurricane since 2007.

Top 5 Vendors to Watch
Dell EMC was recognized by the 2016 HPCwire Editors’ Choice Awards panel, along with Fujitsu, HPE, IBM and NVIDIA, as one of the Top 5 Vendors to Watch in high performance computing. As the only true end-to-end solutions provider in the HPC market, Dell EMC is committed to serving customer needs. And with the combination of Dell, EMC and VMware, we are a leader in the technology of today, with the world’s greatest franchises in servers, storage, virtualization, cloud software and PCs. Looking forward, we will occupy a very strong position in the most strategic areas of technology of tomorrow: digital transformation, software defined data center, hybrid cloud, converged infrastructure, mobile and security.

To learn more about HPC at Dell EMC, join the Dell EMC HPC Community at www.Dellhpc.org, or visit us online at www.Dell.com/hpc and www.HPCatDell.com.

Entering a New Arena in Computing


 

December 2016 – HPC Innovation Lab

In order to build a balanced cluster ecosystem and eliminate bottlenecks, powerful and dense server node configurations are essential to support parallel computing. The challenge is to provide maximum compute power with efficient I/O subsystem performance, including memory and networking. Workloads that demand this kind of intense compute power include advanced parallel algorithms in research, life sciences and financial applications, alongside traditional computing.

 

Dell PowerEdge C6320p

The introduction of the Dell EMC PowerEdge C6320p platform, one of the densest, highest core-count platform offerings in HPC solutions, provides a leap in this direction.

The PowerEdge C6320p platform is Dell EMC’s first self-bootable Intel Xeon Phi platform; the previously available versions of Intel Xeon Phi were PCIe adapters that had to be plugged into a host system. From the core perspective, it supports up to 72 processing cores, with each core supporting two vector processing units capable of AVX-512 instructions. This increases floating point throughput for workloads that can use the wider vector instructions, unlike Intel Xeon® v4 processors, which support up to AVX2 instructions. The Intel Xeon Phi in the Dell EMC C6320p also features 16GB of fast on-package MCDRAM stacked next to the processor. The MCDRAM benefits applications that are sensitive to memory bandwidth, and is in addition to the six channels of DDR4 memory hosted on the server. Being a single-socket server, the C6320p provides a lower power consumption compute node compared to traditional two-socket nodes in HPC.

The following table shows platform differences as we compare the current Dell EMC PowerEdge C6320 and Dell EMC PowerEdge C6320p server offerings in HPC.

 

 

Component | C6320 | C6320p
Server form factor | 2U chassis with four sleds | 2U chassis with four sleds
Processors | Dual socket | Single socket
Processor model | Intel Xeon | Intel Xeon Phi
Max cores in a sled | Up to 44 physical cores, 88 logical cores (with two Intel Xeon E5-2699 v4: 2.2 GHz, 55MB, 22 cores, 145W) | Up to 72 physical cores, 288 logical cores (with the Intel Xeon Phi 7290: 16GB MCDRAM, 1.5GHz, 72 cores, 245W)
Theoretical DP FLOPS per sled | 1.26 TFLOPS | 2.9 TFLOPS
DIMM slots | 16 DDR4 DIMM slots | 6 DDR4 DIMM slots + on-package 16GB MCDRAM
MCDRAM BW (Memory mode) | N/A | ~475-490 GB/s
DDR4 BW | ~135 GB/s | ~90 GB/s
Provisioning fabric | Dual port 1Gb/10GbE | Single port 1GbE
High-speed fabric | Intel Omni-Path Fabric (100Gbps) or Mellanox InfiniBand (100Gbps) | Intel Omni-Path Fabric (100Gbps) or on-board Mellanox InfiniBand (100Gbps)
Storage | Up to 24 x 2.5” or 12 x 3.5” HDDs | 6 x 2.5” HDDs per node + internal 1.8” SSD option for boot
Management (integrated Dell EMC Remote Access Controller) | Dedicated and shared iDRAC8 | Dedicated and shared iDRAC8

Table 1: Comparing the C6320 and C6320p offerings in HPC
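As a rough illustration of where these theoretical peak numbers come from, the sketch below applies the standard formula (sockets x cores x DP FLOP/cycle x clock). The AVX clock values used here are assumptions chosen for illustration, since sustained vector clocks are lower than the nominal base clocks; they are not vendor-published figures for these exact SKUs.

```python
# Rough sketch: theoretical peak double-precision FLOPS per sled.
# The clock values below are illustrative assumptions (AVX clocks are
# typically lower than the nominal base clock), not vendor-published figures.

def peak_dp_tflops(sockets, cores_per_socket, dp_flops_per_cycle, clock_ghz):
    """Peak DP TFLOPS = sockets * cores * FLOP/cycle * clock (GHz) / 1000."""
    return sockets * cores_per_socket * dp_flops_per_cycle * clock_ghz / 1000.0

# C6320 sled: dual Xeon E5-2699 v4, 22 cores each, 2 AVX2 FMA units
# -> 16 DP FLOP/cycle/core; assumed AVX clock ~1.8 GHz.
c6320 = peak_dp_tflops(sockets=2, cores_per_socket=22,
                       dp_flops_per_cycle=16, clock_ghz=1.8)

# C6320p sled: one Xeon Phi 7290, 72 cores, 2 AVX-512 VPUs
# -> 32 DP FLOP/cycle/core; assumed AVX clock ~1.3 GHz.
c6320p = peak_dp_tflops(sockets=1, cores_per_socket=72,
                        dp_flops_per_cycle=32, clock_ghz=1.3)

print(f"C6320  sled peak: ~{c6320:.2f} TFLOPS")   # ~1.27 TFLOPS
print(f"C6320p sled peak: ~{c6320p:.2f} TFLOPS")  # ~3.0 TFLOPS
```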

 

Dell EMC Supported HPC Solution:

Dell EMC offers a complete, tested, verified and validated solution on the C6320p servers. This is based on Bright Cluster Manager 7.3 with RHEL 7.2, including specific highly recommended kernel and security updates, and it will also support the upcoming RHEL 7.3 operating system. The solution provides automated deployment, configuration, management and monitoring of the cluster. It also integrates recommended Intel performance tweaks, as well as the required software drivers and other development toolkits to support the Intel Xeon Phi programming model.

The solution provides the latest networking support for both InfiniBand and Intel Omni-Path Fabric. It also includes Dell EMC-supported System Management tools that are bundled to provide customers with the ease of cluster management on Dell EMC hardware.

*Note: As a continuation of this blog, follow-on micro-level benchmarking and application studies will be published for the C6320p.

 

References:

1)      http://www.dell.com/us/business/p/poweredge-c6320p/pd?ref=PD_OC

2)      http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/cryo-em-in-hpc-with-knl

 

 

 

 

 

System benchmark results on KNL – STREAM and HPL.

By Garima Kochhar. HPC Innovation Lab. January 2016. The Intel Xeon Phi bootable processor (architecture codenamed “Knights Landing” – KNL) is ready for prime time. The HPC Innovation Lab has had access to a few engineering test units...(read more)

HPCG Performance study with Intel KNL


By Ashish Kumar Singh. January 2017 (HPC Innovation Lab)

This blog presents an in-depth analysis of the High Performance Conjugate Gradient (HPCG) benchmark on the Intel Xeon Phi processor, which is based on the architecture codenamed “Knights Landing”. The analysis was performed on the PowerEdge C6320p platform with the new Intel Xeon Phi 7230 processor.

Introduction to HPCG and Intel Xeon Phi 7230 processor

The HPCG benchmark constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in a 3D domain, such that the equation at point (i, j, k) depends on its own value and those of its 26 surrounding neighbors. The global domain computed by the benchmark is (NRx·Nx) x (NRy·Ny) x (NRz·Nz), where Nx, Ny and Nz are the dimensions of the local subgrid assigned to each MPI process, and the number of MPI ranks is NR = NRx x NRy x NRz. These values can be defined in the hpcg.dat file or passed as command line arguments.

The HPCG benchmark is based on a conjugate gradient solver, where the pre-conditioner is a three-level hierarchical multi-grid (MG) method with Gauss-Seidel smoothing. The algorithm starts with MG and contains Symmetric Gauss-Seidel (SymGS) and Sparse Matrix-Vector multiplication (SPMV) routines for each level. Because the data is distributed across nodes, both SymGS and SPMV require data from neighboring processes, which is provided by the preceding Exchange Halos routine. The residual, which should fall below 10^-6, is computed locally by the Dot Product (DDOT) routine, and an MPI_Allreduce follows DDOT to complete the global reduction. WAXPBY simply updates a vector with the sum of two scaled vectors: it calculates the output vector by scaling the input vectors with constants and adding the values at the same index. So HPCG has four computational blocks (SPMV, SymGS, WAXPBY and DDOT) and two communication blocks (MPI_Allreduce and Exchange Halos).
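To make the roles of the four computational blocks concrete, here is a minimal single-process NumPy sketch of SPMV, WAXPBY, DDOT and one symmetric Gauss-Seidel sweep. It is only an illustration of the operations HPCG times; the real benchmark uses a distributed sparse matrix, a multi-grid hierarchy and halo exchanges.

```python
import numpy as np

def spmv(A, x):
    """Matrix-vector multiplication: y = A @ x (sparse in the real benchmark)."""
    return A @ x

def waxpby(alpha, x, beta, y):
    """Scaled vector addition: w = alpha*x + beta*y."""
    return alpha * x + beta * y

def ddot(x, y):
    """Local dot product; HPCG follows this with an MPI_Allreduce."""
    return float(np.dot(x, y))

def symgs(A, x, b):
    """One symmetric Gauss-Seidel sweep (forward then backward)."""
    n = len(b)
    D = np.diag(A)
    for i in range(n):                      # forward sweep
        x[i] = (b[i] - A[i, :] @ x + D[i] * x[i]) / D[i]
    for i in reversed(range(n)):            # backward sweep
        x[i] = (b[i] - A[i, :] @ x + D[i] * x[i]) / D[i]
    return x

# Tiny diagonally dominant example
n = 8
A = np.eye(n) * 4 - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = symgs(A, np.zeros(n), b)
r = waxpby(1.0, b, -1.0, spmv(A, x))        # residual r = b - A*x
print("residual norm after one SymGS sweep:", np.sqrt(ddot(r, r)))
```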

The Intel Xeon Phi processor is a new generation of processors in the Intel Xeon Phi family. Previous generations of Intel Xeon Phi were available as coprocessors in a PCIe card form factor and required a host Intel Xeon processor. The Intel Xeon Phi 7230 contains 64 cores at a 1.3GHz base frequency (1.5GHz turbo) and 32MB of L2 cache. It supports up to 384GB of DDR4-2400MHz memory and the AVX-512 instruction set. The Intel Xeon Phi processor also includes 16GB of on-package MCDRAM memory with a sustained memory bandwidth of up to ~480GB/s as measured by the STREAM benchmark. The Intel Xeon Phi 7230 delivers up to ~1.8 TFLOPS of double precision HPL performance.

This blog showcases the performance of the HPCG benchmark on the Intel KNL processor and compares it to that of the Intel Broadwell E5-2697 v4 processor. The Intel Xeon Phi cluster comprises one PowerEdge R630 head node and 12 PowerEdge C6320p compute nodes, while the Intel Xeon processor cluster includes one PowerEdge R720 head node and 12 PowerEdge R630 compute nodes. All compute nodes are connected by Intel Omni-Path at 100Gb/s, and each cluster shares the storage of its head node over NFS. The detailed cluster information is listed below in Table 1. All HPCG tests on Intel Xeon Phi were performed with the BIOS set to “quadrant” cluster mode and “Memory” memory mode.

Table 1: Cluster hardware and software details (testbed configuration)

  

HPCG Performance analysis with Intel KNL

Choosing the right problem size for HPCG should follow two rules: the problem size should be large enough not to fit in the caches of the device, and it should occupy a significant fraction of main memory, at least a quarter of the total. For HPCG performance characterization, we chose local domain dimensions of 128^3, 160^3, and 192^3 with an execution time of t=30 seconds. The local domain dimension defines the global domain dimension as (NRx·Nx) x (NRy·Ny) x (NRz·Nz), where Nx=Ny=Nz is the local dimension and NR is the number of MPI processes.
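A quick way to reason about these choices is to compute the global domain and a rough lower bound on the matrix footprint for a given local grid and rank layout. The sketch below assumes 27 nonzeros per row with 8-byte values and 8-byte indices, and a 2 x 2 x 1 rank layout for the 4 MPI processes used in our single-node runs; the actual HPCG footprint is larger because of the vectors and coarse multi-grid levels.

```python
# Illustrative sizing sketch for HPCG: global domain for a given local grid and
# MPI layout, plus a rough lower bound on matrix memory (27 nonzeros/row with
# 8-byte values + 8-byte column indices, top multigrid level only).

def hpcg_sizing(nx, ny, nz, nrx, nry, nrz):
    gx, gy, gz = nrx * nx, nry * ny, nrz * nz        # global dimensions
    rows = gx * gy * gz                               # one equation per grid point
    matrix_bytes = rows * 27 * (8 + 8)                # values + indices
    return (gx, gy, gz), rows, matrix_bytes / 2**30   # GiB

for local in (128, 160, 192):
    # 4 MPI ranks per KNL node, assumed to be laid out as a 2 x 2 x 1 grid
    dims, rows, gib = hpcg_sizing(local, local, local, 2, 2, 1)
    print(f"local {local}^3 -> global {dims}, {rows/1e6:.1f}M rows, "
          f">= ~{gib:.1f} GiB for the top-level matrix alone")
```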

Figure 1: HPCG Performance comparison with multiple local dimension grid size

As shown in Figure 1, a local dimension grid size of 160^3 gives the best performance of 48.83 GFLOPS. A problem size bigger than 128^3 allows for more parallelism, and 160^3 still fits inside the MCDRAM while 192^3 does not. All of these tests were carried out with 4 MPI processes and 32 OpenMP threads per MPI process on a single Intel KNL server.

Figure 2: HPCG performance comparison with multiple execution time.

Figure 2 demonstrates HPCG performance for multiple execution times with a grid size of 160^3 on a single Intel KNL server. As the graph shows, HPCG performance does not change with execution time, so execution time does not appear to be a factor for HPCG performance. This means we may not need to spend hours or days benchmarking large clusters, which saves both time and power. Note, however, that for an official submission, for example to the TOP500 ranking list, the execution time reported in the output file must be at least 1800 seconds.

Figure 3: Time consumed by HPCG computational routines.

Figure 3 shows the time consumed by each computational routine from 1 to 12 KNL nodes; the time spent in each routine is reported in the HPCG output file, as shown in Figure 4. As the graph shows, HPCG spends most of its time in the compute-intensive SymGS pre-conditioning and in the sparse matrix-vector multiplication (SPMV). The vector update phase (WAXPBY) consumes much less time than SymGS, and the residual calculation (DDOT) takes the least time of the four computational routines. Because the local grid size is the same across all multi-node runs, the time spent in each of the four compute kernels is approximately the same for every multi-node run. The output file shown in Figure 4 reports the performance of all four computational routines; in it, MG includes both SymGS and SPMV.

Figure 4: A slice of HPCG output file

Performance Comparison

Here is the multi-node HPCG performance comparison between the Intel Xeon E5-2697 v4 @2.3GHz (Broadwell) processor and the Intel Xeon Phi 7230 (KNL) processor, both with the Intel Omni-Path interconnect.

  

Figure 5: HPCG performance comparison between Intel Xeon Broadwell processor and Intel Xeon Phi processor

Figure 5 shows the HPCG performance comparison between dual 18-core Intel Broadwell processors and one 64-core Intel Xeon Phi processor. The dots in Figure 5 show the performance acceleration of the KNL servers over the dual-socket Broadwell servers. For a single node, HPCG performs 2.23x better on KNL than on Broadwell, and for multi-node runs KNL also shows more than a 100% performance increase over the Broadwell nodes. With 12 Intel KNL nodes, HPCG scales out well and delivers up to ~520 GFLOPS.

Conclusion

Overall, HPCG shows ~2x higher performance with the Intel KNL processor on the PowerEdge C6320p than on the Intel Broadwell processor server, and its performance scales out well as nodes are added. The PowerEdge C6320p platform is therefore a strong choice for HPC applications like HPCG.

Reference:

https://software.sandia.gov/hpcg/doc/HPCG-Specification.pdf

http://www.hpcg-benchmark.org/custom/index.html?lid=158&slid=281


Application Performance on P100-PCIe GPUs


Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Feb 2017

Introduction to P100-PCIe GPU

This blog describes the performance analysis of NVIDIA® Tesla® P100™ GPUs on a cluster of Dell PowerEdge C4130 servers. There are two types of P100 GPUs: PCIe-based and SXM2-based. In the PCIe-based server, GPUs are connected by PCIe buses and one P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. In the P100-SXM2 server, GPUs are connected by NVLink and one P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers, i.e. P100-PCIe. We have already analyzed the P100 performance for several deep learning frameworks in this blog. The objective of this blog is to evaluate the performance of HPL, LAMMPS, NAMD, GROMACS, HOOMD-blue, Amber, ANSYS Mechanical and RELION. The hardware configuration of the cluster is the same as in the deep learning blog. Briefly, we used a cluster of four C4130 nodes; each node has dual Intel Xeon E5-2690 v4 CPUs and four NVIDIA P100-PCIe GPUs, and all nodes are connected with EDR InfiniBand. Table 1 shows the detailed information about the hardware and software used in every compute node.

 

Table 1: Experiment Platform and Software Details

Platform: PowerEdge C4130 (configuration G)
Processor: 2 x Intel Xeon E5-2690 v4 @ 2.6GHz (Broadwell)
Memory: 256GB DDR4 @ 2400MHz
Disk: 9TB HDD
GPU: P100-PCIe with 16GB GPU memory
Node interconnect: Mellanox ConnectX-4 VPI (EDR 100Gb/s InfiniBand)
InfiniBand switch: Mellanox SB7890

Software and Firmware
  Operating System: RHEL 7.2 x86_64
  Linux Kernel Version: 3.10.0-327.el7
  BIOS: Version 2.3.3
  CUDA version and driver: CUDA 8.0.44 (375.20)
  OpenMPI compiler: Version 2.0.1
  GCC compiler: 4.8.5
  Intel Compiler: Version 2017.0.098

Applications
  HPL: hpl_cuda_8_ompi165_gcc_485_pascal_v1
  LAMMPS: Lammps-30Sep16
  NAMD: NAMD_2.12_Source
  GROMACS: 2016.1
  HOOMD-blue: 2.1.2
  Amber: 16update7
  ANSYS Mechanical: 17.0
  RELION: 2.0.3


High Performance Linpack (HPL)

HPL is a parallel application that measures how fast computers solve a dense n by n system of linear equations using LU decomposition with partial row pivoting, and it is designed to be run at very large scale. The HPL runs on this cluster use double precision floating point operations. Figure 1 shows the HPL performance on the tested P100-PCIe cluster. It can be seen that 1 P100 is 3.6x faster than 2 x E5-2690 v4 CPUs. HPL also scales very well with more GPUs within a node or across nodes. Recall that 4 P100 are within one server, so 8, 12 and 16 P100 are in 2, 3 and 4 servers. 16 P100 GPUs deliver a speedup of 14.9x compared to 1 P100.

Note that the overall efficiency is calculated as: HPL efficiency = rMax / (CPU rPeak + GPU rPeak), where rPeak is the highest theoretical FLOPS result that could be achieved at base clock, and the number reported by HPL is rMax, the real performance that can be achieved. HPL cannot be run at the max boost clock; it typically runs at some clock in between, but on average closer to the base clock than to the max boost clock, which is why we used the base clock for the rPeak calculation. Although we also included the CPU rPeak in the efficiency calculation, when running HPL on P100 we set DGEMM_SPLIT=1.0, which means the CPU is not really contributing to the DGEMM, but only handling other overhead, so it is not actually contributing many FLOPS. Although we observed that the CPUs stayed fully utilized, they were just handling the overhead and data movement to keep the GPUs fed. What matters most for the P100 results is that rMax is really big.
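For reference, here is a small sketch of that efficiency arithmetic. The rMax value is a placeholder and the rPeak terms use assumed base clocks and per-GPU peaks, so this only illustrates the formula, not the measured numbers behind Figure 1.

```python
# Sketch of the efficiency formula used above:
#   HPL efficiency = rMax / (CPU rPeak + GPU rPeak)
# The rMax value below is a placeholder, and the rPeak terms use assumed
# base clocks, so this only illustrates the arithmetic, not our measured data.

def hpl_efficiency(rmax_tflops, cpu_rpeak_tflops, gpu_rpeak_tflops):
    return rmax_tflops / (cpu_rpeak_tflops + gpu_rpeak_tflops)

# CPU rPeak: dual E5-2690 v4 (14 cores, assumed 2.6 GHz base, 16 DP FLOP/cycle)
cpu_rpeak = 2 * 14 * 16 * 2.6 / 1000           # ~1.16 TFLOPS

# GPU rPeak: 4 x P100-PCIe at an assumed base-clock DP peak of ~4.0 TFLOPS each
gpu_rpeak = 4 * 4.0                             # ~16 TFLOPS

rmax = 12.0                                     # hypothetical measured rMax (TFLOPS)
print(f"efficiency ~ {hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak):.1%}")
```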

Figure 1: HPL performance on P100-PCIe

 
NAMD

NAMD (for NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the application’s output log is “days/ns” (lower is better), but its inverse, “ns/day”, is used in our plots since that is what most molecular dynamics users focus on. The average of all occurrences of this value in the output log was used. Figure 2 shows the performance within 1 node. It can be seen that the performance with 2 P100 is better than with 4 P100. This is probably because of the communication among different CPU threads: this application launches a set of worker threads that handle the computation and communication threads that handle the data communication, and as more GPUs are used, more communication threads are used and more synchronization is needed. In addition, based on the profiling result from NVIDIA’s CUDA profiler, nvprof, with 1 P100 the GPU computation takes less than 50% of the whole application time. According to Amdahl’s law, the speedup with more GPUs is limited by the other ~50% of the work that is not accelerated by the GPU. Based on this observation, we further ran this application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node), and the result is shown in Figure 3. The result shows that no matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than dual CPUs.
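A quick Amdahl's-law estimate makes this limit concrete. The ~50% GPU fraction comes from the nvprof observation above; treating the remaining work as completely unaccelerated is a simplification, so the numbers below are an ideal upper bound rather than a prediction.

```python
# Amdahl's-law sketch: if only a fraction p of the single-GPU runtime is GPU
# work that scales with more GPUs, the overall speedup is bounded by 1/(1-p).

def amdahl_speedup(p, n):
    """Ideal speedup when a fraction p of the work is accelerated n-fold."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.5  # GPU compute is ~50% of the application time with 1 P100 (from nvprof)
for gpus in (1, 2, 4):
    print(f"{gpus} GPU(s): ideal speedup <= {amdahl_speedup(p, gpus):.2f}x")
print(f"upper bound with infinitely many GPUs: {1/(1 - p):.1f}x")
```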

 Figure 2: NAMD Performance within 1 P100-PCIe node

 Figure 3: NAMD Performance across Nodes

GROMACS

GROMACS (for GROningen MAchine for Chemical Simulations) primarily simulates biochemical molecules (bonded interactions), but because of its efficiency in calculating non-bonded interactions (atoms not linked by covalent bonds), the user base is expanding to non-biological systems. Figure 4 shows the performance of GROMACS on CPUs, K80 GPUs and P100-PCIe GPUs. Since one K80 card has two internal GPUs, from now on one K80 always refers to both internal GPUs rather than one of the two. When testing with K80 GPUs, the same P100-PCIe-based servers were used; the CPUs and memory were kept the same and the only difference is that the P100-PCIe GPUs were replaced with K80 GPUs. In all tests, there were four GPUs per server and all GPUs were utilized; for example, the 3-node data point is with 3 servers and 12 total GPUs. The performance of P100-PCIe is 4.2x – 2.8x faster than CPU from 1 node to 4 nodes, and is 1.5x – 1.1x faster than K80 from 1 node to 4 nodes.

  Figure 4: GROMACS Performance on P100-PCIe

LAMMPS

LAMMPS (for Large-scale Atomic/Molecular Massively Parallel Simulator) is a classic molecular dynamics code, capable of simulations of solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale. The dataset we used was LJ (Lennard-Jones liquid benchmark), which contains 512,000 atoms. There are two GPU implementations in LAMMPS: the GPU library version and the Kokkos version. In our experiments, we used the Kokkos version since it was much faster than the GPU library version.

Figure 5 shows LAMMPS performance on CPUs and P100-PCIe GPUs. Using 16 P100 GPUs is 5.8x faster than using 1 P100. The reason this application does not scale linearly is that the data transfer time (CPU->GPU, GPU->CPU and GPU->GPU) increases as more GPUs are used, even though the computation part decreases linearly. The data transfer time increases because this application requires data communication among all of the GPUs used. However, configuration G only allows Peer-to-Peer (P2P) access for two pairs of GPUs: GPU 1 - GPU 2 and GPU 3 - GPU 4. GPU 1/2 cannot communicate with GPU 3/4 directly; if that communication is needed, the data must go through the CPU, which slows the communication. Configuration B eases this issue as it allows P2P access among all four GPUs within a node. The comparison between configuration G and configuration B is shown in Figure 6. By running LAMMPS on a configuration B server with 4 P100, the performance metric “timesteps/s” improved to 510 compared to 505 in configuration G, a 1% improvement. The improvement is not significant because data communication takes less than 8% of the whole application time when running on configuration G with 4 P100. Figure 7 also compares the performance of P100-PCIe with that of CPUs and K80 GPUs for this application. It shows that within 1 node, 4 P100-PCIe are 6.6x faster than 2 E5-2690 v4 CPUs and 1.4x faster than 4 K80 GPUs.

 Figure 5: LAMMPS Performance on P100-PCIe

 

  

Figure 6 : Comparison between Configuration G and Configuration B


Figure 7: LAMMPS Performance Comparison

HOOMD-blue

HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general purpose molecular dynamics simulator. Figure 8 shows the HOOMD-blue performance; note that the y-axis is in logarithmic scale. We observe that 1 P100 is 13.4x faster than dual CPUs. The speedup of 2 P100 compared to 1 P100 is a reasonable 1.5x. However, from 4 P100 to 16 P100 the speedup over 1 P100 is only 2.1x to 3.9x, which is low. The reason is that, similar to LAMMPS, this application also involves a lot of communication among all of the GPUs used. Based on the analysis for LAMMPS, using configuration B should reduce this communication bottleneck significantly. To verify this, we ran the same application again on a configuration B server. With 4 P100, the performance metric “hours for 10e6 steps” was reduced to 10.2 compared to 11.73 in configuration G, a 13% performance improvement, and the speedup compared to 1 P100 improved from 2.1x to 2.4x.

 Figure 8: HOOMD-blue Performance on CPU and P100-PCIe

Amber

Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Figure 9 shows the performance of Amber on CPUs and P100-PCIe. It can be seen that 1 P100 is 6.3x faster than dual CPUs, and 2 P100 GPUs are 1.2x faster than 1 P100. However, the performance drops significantly when 4 or more GPUs are used. The reason is that, similar to LAMMPS and HOOMD-blue, this application heavily relies on P2P access, but configuration G only supports it between two pairs of GPUs. We verified this by again testing this application on a configuration B node. As a result, the performance with 4 P100 improved to 791 ns/day compared to 315 ns/day in configuration G, a 151% performance improvement and a speedup of 2.5x. But even in configuration B, the multi-GPU scaling is still not good. This is because when Amber multi-GPU support was originally designed, the PCIe bus speed was Gen2 x16 and the GPUs were C1060s or C2050s. The current Pascal generation GPUs are >16x faster than the C1060s, while the PCIe bus speed has only increased by 2x (PCIe Gen2 x16 to Gen3 x16) and InfiniBand interconnects by about the same amount. The Amber website explicitly states that “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance is.

  Figure 9: Amber Performance on CPU and P100-PCIe

ANSYS Mechanical

 ANSYS® Mechanical software is a comprehensive finite element analysis (FEA) tool for structural analysis, including linear, nonlinear dynamic, hydrodynamic and explicit studies. It provides a complete set of element behaviors, material models and equation solvers for a wide range of mechanical design problems. The finite element method is used to solve partial differential equations, which is a compute- and memory-intensive task. Our testing focused on the Power Supply Module (V17cg-1) benchmark, a medium-sized job for iterative solvers and a good test of memory bandwidth. Figure 10 shows the performance of ANSYS Mechanical on CPUs and P100-PCIe. Within a node, 4 P100 are 3.8x faster than dual CPUs, and with 4 nodes, 16 P100 are 2.3x faster than 8 CPUs. The figure also shows that the performance scales well with more nodes: the speedup with 4 nodes is 2.8x compared to 1 node.

 Figure 10: ANSYS Mechanical Performance on CPU and P100-PCIe

RELION

RELION (for REgularised LIkelihood OptimisatioN) is a program that employs an empirical Bayesian approach to the refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM). Figure 11 shows the performance of RELION on CPUs and P100-PCIe; note that the y-axis is in logarithmic scale. It demonstrates that 1 P100 is 8.8x faster than dual CPUs. From the figure we also notice that it does not scale well beyond 4 P100 GPUs. Because of the long execution time, we did not profile this application, but it is possible that the reason for the weak scaling is similar to that of LAMMPS, HOOMD-blue and Amber.

 Figure 11: RELION Performance on CPU and P100-PCIe


Conclusions and Future Work


In this blog, we presented and analyzed the performance of different applications on Dell PowerEdge C4130 servers with P100-PCIe GPUs. Of the tested applications, HPL, GROMACS and ANSYS Mechanical benefit from the balanced CPU-GPU configuration G, because they do not require P2P access among GPUs. However, LAMMPS, HOOMD-blue, Amber (and possibly RELION) rely on P2P access. Therefore, with configuration G, they scale well up to 2 P100 GPUs but scale weakly with 4 or more P100 GPUs. With configuration B they scale better than with G at 4 GPUs, so configuration B is more suitable and recommended for applications implemented with P2P accesses.

In the future work, we will run these applications on P100-SXM2 and compare the performance difference between P100-PCIe and P100-SXM2.



Deep Learning Inference on P40 GPUs


Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017

Introduction to P40 GPU and TensorRT

Deep Learning (DL) has two major phases: training and inference/testing/scoring. The training phase builds a deep neural network (DNN) model with an existing large amount of data, and the inference phase uses the trained model to make predictions from new data. Inference can be done in the data center, in embedded systems, and in automotive and mobile devices, among others. Usually inference must respond to user requests as quickly as possible (often in real time). To meet the low-latency requirement of inference, NVIDIA® launched the Tesla® P4 and P40 GPUs. Aside from high floating point throughput and efficiency, both GPUs introduce two new optimized instructions designed specifically for inference computations: the 8-bit integer (INT8) 4-element vector dot product (DP4A) and the 16-bit 2-element vector dot product (DP2A) instructions. Deep learning researchers have found that FP16 is able to achieve the same inference accuracy as FP32, and many applications only require INT8 or lower precision to maintain acceptable inference accuracy. Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while P40 delivers a peak of 47.0 INT8 TIOP/s. This blog only focuses on the P40 GPU.
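To make the new instruction concrete, the snippet below emulates in plain Python what a single DP4A operation computes: a dot product of two 4-element signed 8-bit vectors accumulated into a 32-bit integer (DP2A is the analogous 2-element, 16-bit variant). This is only a functional illustration, not how the GPU executes the instruction.

```python
# Functional emulation of the INT8 DP4A operation: dot product of two
# 4-element signed 8-bit vectors, accumulated into a 32-bit integer.
# One such instruction replaces four multiplies and four adds, which is
# where the INT8 throughput advantage over FP32 comes from.

def dp4a(a, b, acc):
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in a + b), "operands must be int8"
    return acc + sum(x * y for x, y in zip(a, b))

acc = 0
acc = dp4a([1, -2, 3, 4], [5, 6, -7, 8], acc)
print(acc)  # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4
```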

TensorRT™, previously called GIE (GPU Inference Engine), is a high performance deep learning inference engine for production deployment of deep learning applications that maximizes inference throughput and efficiency. TensorRT provides users the ability to take advantage of the fast reduced-precision instructions provided in the Pascal GPUs. TensorRT v2 supports the INT8 reduced-precision operations that are available on the P40.

Testing Methodology

This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge C4130 server, which is equipped with 4 Tesla P40 GPUs. Since TensorRT is only available for Ubuntu OS, all the experiments were done on Ubuntu. Table 1 shows the hardware and software details. The inference benchmark we used was giexec from the TensorRT sample codes, which uses synthetic images filled with random non-zero values to simulate real images. Two classic neural networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner), which is much deeper and more complicated than AlexNet.

We measured the inference performance in images/sec, the number of images that can be processed per second. To measure the performance improvement of the current generation GPU, the P40, we also compared its performance with the previous generation GPU, the M40. The most important goal of this testing is to measure the inference performance in INT8 mode, compared to FP32 mode. The P40 uses the new Pascal architecture and supports the new INT8 instructions, while the previous generation M40 uses the Maxwell architecture and does not support INT8 instructions. The theoretical INT8 and FP32 performance of both the M40 and P40 is shown in Table 2. We measured FP32 performance on both devices, and both FP32 and INT8 on the P40.

Table 1: Hardware configuration and software details

Platform: PowerEdge C4130 (configuration G)
Processor: 2 x Intel Xeon E5-2690 v4 @ 2.6GHz (Broadwell)
Memory: 256GB DDR4 @ 2400MHz
Disk: 400GB SSD
GPU: 4 x Tesla P40 with 24GB GPU memory

Software and Firmware
  Operating System: Ubuntu 14.04
  BIOS: 2.3.3
  CUDA and driver version: 8.0.44 (375.20)
  TensorRT Version: 2.0 EA


Table 2: Comparison between Tesla M40 and P40

 

Metric | Tesla M40 | Tesla P40
INT8 (TIOP/s) | N/A | 47.0
FP32 (TFLOP/s) | 6.8 | 11.8


Performance Evaluation

In this section, we will present the inference performance with TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a node. We will also compare the performance of P40 with M40. Lastly we will show the performance impact when using different batch sizes.

Figure 1 shows the inference performance with the TensorRT library for both GoogLeNet and AlexNet. We can see that INT8 mode is ~3x faster than FP32 for both neural networks. This is expected since the theoretical speedup of INT8 compared to FP32 is 4x if only multiplications are performed and no other overhead is incurred. However, there are kernel launches, occupancy limits, data movement and math other than multiplications, so the observed speedup is about 3x.


Figure 1: Inference performance with TensorRT library

Dell’s PowerEdge C4130 supports up to 4 GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on one GPU. Figure 2 and Figure 3 show the multi-GPU inference performance for GoogLeNet and AlexNet, respectively. When using multiple GPUs, linear speedup was achieved for both neural networks. This is because each GPU processes its own images and there is no communication or synchronization among the GPUs.


Figure 2: Multi-GPU inference performance with TensorRT GoogLeNet


Figure 3: Multi-GPU inference performance with TensorRT AlexNet

To highlight the performance advantage of the P40 GPU and its native support for INT8, we compared the inference performance of the P40 with the previous generation GPU M40. The results are shown in Figure 4 and Figure 5 for GoogLeNet and AlexNet, respectively. In FP32 mode, P40 is 1.7x faster than M40, and the INT8 mode in P40 is 4.4x faster than FP32 mode in M40.


Figure 4: Inference performance comparison between P40 and M40


Figure 5: Inference performance comparison between P40 and M40

Deep learning inference can be applied in different scenarios. Some scenarios require a large batch size and some require no batching at all (i.e. batch size is 1). Therefore we also measured the performance with different batch sizes; the result is shown in Figure 6. Note that the purpose here is not to compare the performance of GoogLeNet and AlexNet, but to check how the performance changes with different batch sizes for each neural network. It can be seen that without batch processing the inference performance is very low, because the GPU is not assigned enough work to keep it busy. The larger the batch size, the higher the inference performance, although the rate of increase slows. At a batch size of 4096, GoogLeNet stopped running because the required GPU memory for this neural network exceeded the GPU memory limit, while AlexNet was able to run because it is a less complicated neural network than GoogLeNet and therefore requires less GPU memory. So the largest batch size is only limited by GPU memory.
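The shape of the curve in Figure 6 can be reasoned about with a toy latency model: each batch pays a roughly fixed launch/setup overhead plus a per-image compute cost, so throughput rises with batch size and then flattens. The constants below are made up purely to illustrate that trend; they are not measured P40 numbers.

```python
# Toy model of inference throughput vs. batch size:
#   time(batch) = fixed_overhead + per_image_cost * batch
#   throughput  = batch / time(batch)
# The constants are illustrative only; they are not measured P40 numbers.

FIXED_OVERHEAD_MS = 1.0   # kernel launches, setup, data movement per batch
PER_IMAGE_MS = 0.05       # amortized compute time per image

def throughput(batch):
    latency_ms = FIXED_OVERHEAD_MS + PER_IMAGE_MS * batch
    return batch / (latency_ms / 1000.0)   # images per second

for batch in (1, 8, 64, 512, 4096):
    print(f"batch {batch:5d}: ~{throughput(batch):,.0f} images/sec")
```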


Figure 6: Inference performance with different batch sizes

Conclusions and Future Work

In this blog, we presented the deep learning inference performance with the NVIDIA® TensorRT library on P40 and M40 GPUs. The INT8 mode on the P40 is about 3x faster than FP32 mode on the P40 and 4.4x faster than FP32 mode on the previous generation GPU M40. Multiple GPUs increase inference performance linearly because there is no communication or synchronization between them. We also noticed that a higher batch size leads to higher inference performance, and that the largest batch size is limited only by GPU memory size. In future work, we will evaluate the inference performance with real-world deep learning applications.


Virtualized HPC Performance with VMware vSphere 6.5 on a Dell PowerEdge C6320 Cluster


This article presents performance comparisons of several typical MPI applications — LAMMPS, WRF, OpenFOAM, and STAR-CCM+ — running on a traditional, bare-metal HPC cluster versus a virtualized cluster running VMware’s vSphere virtualization platform. The tests were performed on a 32-node, EDR-connected Dell PowerEdge C6320 cluster, located in the Dell EMC HPC Innovation Lab in Austin, Texas. In addition to performance results, virtual cluster architecture and configuration recommendations for optimal performance are described.

Why HPC virtualization

Interest in HPC virtualization and cloud has grown rapidly. While much of the interest stems from the general value of cloud technologies, there are specific benefits of virtualizing HPC and supporting it in a cloud environment, such as centralized operation, cluster resource sharing, research environment reproducibility, multi-tenant data security, fault isolation and resiliency, dynamic load balancing, and efficient power management. Figure 1 illustrates several HPC virtualization benefits.

Despite the potential benefits of moving HPC workloads to a private, public, or hybrid cloud, performance concerns have been a barrier to adoption. We focus here on the use of on-premises, private clouds for HPC — environments in which appropriate tuning can be applied to deliver maximum application performance. HPC virtualization performance is primarily determined by two factors: hardware virtualization support and virtual infrastructure capability. With advances in both VMware vSphere and x86 microprocessor architecture, throughput applications can generally run at close to full speed in the VMware virtualized environment — with less than 5% performance degradation compared to native, and often just 1 – 2% [1]. MPI applications by nature are more challenging, requiring sustained and intensive communication between nodes, which makes them sensitive to interconnect performance. With our continued performance optimization efforts, we see decreasing overheads running these challenging HPC workloads [2], and this blog post presents some MPI results as examples.

Figure 1: Illustration of several HPC virtualization benefits

Testbed Configuration

As illustrated in Figure 2, the testbed consists of 32 Dell PowerEdge C6320 compute nodes and one management node. vCenter [3], the vSphere management component, as well as NFS and DNS, run in virtual machines (VMs) on the management node. VMware DirectPath I/O [4] (passthrough mode) is used to give the guest OS (the operating system running inside a VM) direct access to the EDR InfiniBand device; this shortens the message delivery path by bypassing the network virtualization layer and delivers the best performance. Native tests were run using CentOS on each host, while virtual tests were run with the VMware ESXi hypervisor on each host hosting a single virtual machine running the same CentOS version.
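Enabling DirectPath I/O itself is done through the vSphere client, but the target device first has to be identified on each host. A minimal sketch from the ESXi shell is shown below; the command is standard esxcli, while the grep pattern and context size are assumptions that may need adjusting for the adapter's reported vendor string.

# List PCI devices on the host and locate the ConnectX-4 EDR adapter to mark for passthrough;
# the lines preceding the match include the device's PCI address.
esxcli hardware pci list | grep -i -B 14 mellanox

The PCI address reported for the adapter is the one selected for passthrough in the vSphere client, after which the device is assigned directly to the compute VM on that host.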

Figure 2: Testbed Virtual Cluster Architecture

Table 1 shows all cluster hardware and software details, and Table 2 shows a summary of BIOS and vSphere settings.

Table 1: Cluster Hardware and Software Details

Hardware
  Platform: Dell PowerEdge C6320
  Processor: Dual 10-core Intel Xeon E5-2660 v3 @ 2.6GHz (Haswell)
  Memory: 128GB DDR4
  Interconnect: Mellanox ConnectX-4 VPI adapter card; EDR InfiniBand (100Gb/s)

Software
  VMware vSphere
    ESXi hypervisor: 6.5
    vCenter management server: 6.5
  BIOS, Firmware and OS
    BIOS: 1.0.3
    Firmware: 2.23.23.21
    OS distribution (virtual and native): CentOS 7.2
    Kernel: 3.10.0-327.el7.x86_64
  OFED and MPI
    OFED: MLNX_OFED_LINUX-3.4-1.0.0.0
    Open MPI (LAMMPS, WRF and OpenFOAM): 1.10.5a1
    Intel MPI (STAR-CCM+): 5.0.3.048
  Benchmarks
    LAMMPS: v20Jul16
    WRF: v3.8.1
    OpenFOAM: v1612+
    STAR-CCM+: v11.04.012


Table 2: BIOS and vSphere Settings

BIOS settings
  Hardware-assisted virtualization: Enabled
  Power profile: Performance Per Watt (OS)
  Logical processor: Enabled
  Node interleaving: Disabled (default)

vSphere settings
  ESXi power policy: Balanced (default)
  DirectPath I/O: Enabled for EDR InfiniBand
  VM size: 20 virtual CPUs, 100GB memory
  Virtual NUMA topology (vNUMA): Auto-detected (default)
  Memory reservation: Fully reserved
  CPU scheduler affinity: None (default)

Results

Figures 3-6 show native versus virtual performance ratios with the settings in Table 2 applied. A value of 1.0 means that virtual performance is identical to native. Applications were benchmarked using a strong-scaling methodology: problem sizes remained constant as job sizes were scaled. In the figure legends, 'nXnpY' indicates a test run on X nodes using a total of Y MPI ranks. Benchmark problems were selected to achieve reasonable parallel efficiency at the largest scale tested. All MPI processes were mapped consecutively from node 1 to node 32.
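For example, the n32np640 data points correspond to runs of the shape sketched below. This assumes Open MPI 1.10, a hostfile listing 20 slots per node in node order, and a placeholder binary; with Open MPI's default by-slot mapping, ranks 0-19 land on the first node, ranks 20-39 on the second, and so on, giving the consecutive node placement described above.

# hostfile lists the compute nodes in order, 20 slots each:
#   node001 slots=20
#   ...
#   node032 slots=20
mpirun -np 640 --hostfile hostfile --bind-to core ./wrf.exe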

As can be seen from the results, the majority of tests show degradations under 5%, though overheads increase with scale. At the largest scale tested (n32np640), performance degradation varies by application and benchmark problem, with the largest degradation seen with the LAMMPS atomic fluid benchmark (25%) and the smallest with the STAR-CCM+ EmpHydroCyclone_30M case (6%). Single-node STAR-CCM+ results are anomalous and currently under study. As we continue our performance optimization work, we expect to report better and more scalable results in the future.

Figure 3: LAMMPS native vs. virtual performance. Higher is better.

Figure 4: WRF native vs. virtual performance. Higher is better.

 

Figure 5: OpenFOAM native vs. virtual performance. Higher is better.

Figure 6: STAR-CCM+ native vs. virtual performance. Higher is better.

Best Practices

The following configurations are suggested to achieve optimal virtual performance for HPC. For more comprehensive vSphere performance guidance, please see [5] and [6].

BIOS:

  • Enable hardware-assisted virtualization features, e.g. Intel VT (a quick verification from the ESXi shell is sketched after this list).
  • Enable logical processors. Although hyper-threading usually does not help HPC performance, enable it and configure the virtual CPUs (vCPUs) of a VM to each use a physical core, leaving the extra logical cores free for ESXi hypervisor helper threads.
  • Configure BIOS settings to allow ESXi the most flexibility in using power management features; to let ESXi control power saving, set the power profile to the OS-controlled option (the "Performance Per Watt (OS)" profile listed in Table 2).
  • Leave node interleaving disabled so that the ESXi hypervisor can detect the NUMA topology and apply NUMA optimizations.
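One quick way to confirm from the ESXi shell that the hyper-threading and hardware virtualization settings took effect is sketched below; the command is standard esxcli, but the exact field names in its output may differ slightly between ESXi releases.

# Reports CPU package/core/thread counts, the hyper-threading state, and a hardware
# virtualization (HV) support code for the host.
esxcli hardware cpu global get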

vSphere:

  • Configure EDR InfiniBand in DirectPath I/O mode for each VM
  • Properly size VMs:

MPI workloads are CPU-heavy and can make use of all cores, so they require a large VM. At the same time, CPU or memory overcommitment would greatly degrade performance. In our tests, each VM is configured with 20 vCPUs, using all physical cores, and 100 GB of fully reserved memory, leaving some physical memory free to accommodate ESXi hypervisor memory overhead (the corresponding .vmx entries are sketched after this list).

  • ESXi power management policy:

There are four ESXi power management policies: "High Performance", "Balanced" (default), "Low Power" and "Custom". Although the "High Performance" policy can slightly improve the performance of latency-sensitive workloads, it prevents the system from entering C/C1E states; when the load is low enough for Turbo Boost to engage, this reduces the Turbo headroom that idle cores would otherwise free up. The "Balanced" policy reduces host power consumption while having little or no impact on performance, so the default is recommended.

  • Virtual NUMA

Virtual NUMA (vNUMA) exposes NUMA topology to the guest OS, allowing NUMA-aware OSes and applications to make efficient use of the underlying hardware. This is an out-of-the-box feature in vSphere.
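As a concrete illustration of the VM sizing, reservation, and vNUMA settings above, the .vmx entries below reflect the configuration used here. This is a minimal sketch: the key names follow standard vSphere .vmx conventions (memory values in MB) but should be treated as assumptions, and in practice these settings are normally made through vCenter or the vSphere client rather than by editing the file directly.

numvcpus = "20"             # one vCPU per physical core (2 sockets x 10 cores)
memsize = "102400"          # 100 GB of VM memory
sched.mem.min = "102400"    # fully reserve the VM's memory (reservation in MB)
# The virtual NUMA topology is left at its default so that ESXi auto-detects it and mirrors the host.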

Conclusion and Future Work

Virtualization holds promise for HPC, offering new capabilities and increased flexibility beyond what is available in traditional, unvirtualized environments. These benefits are only useful, however, if high performance can be maintained. In this short post, we have shown that performance degradation for a range of common MPI applications can generally be kept under 10%, with larger slowdowns in some cases at the highest scale tested. With throughput applications running at very close to native speed, and with the results shown here, it is clear that virtualization can be a viable and useful approach for a variety of HPC use cases. As we continue to analyze and address the remaining sources of performance overhead, the value of the approach will only continue to expand.

If you have any technical questions regarding VMware HPC virtualization, please feel free to contact us!

Acknowledgements

These results were produced in collaboration with our Dell Technologies colleagues in the Dell EMC HPC Innovation Lab, who have given us access to the compute cluster used to produce these results and to continue our analysis of the remaining performance overheads.

References

  1. J. Simons, E. DeMattia, and C. Chaubal, "Virtualizing HPC and Technical Computing with VMware vSphere," VMware Technical White Paper, http://www.vmware.com/files/pdf/techpaper/vmware-virtualizing-hpc-technical-computing-with-vsphere.pdf.
  2. N. Zhang and J. Simons, "Performance of RDMA and HPC Applications in Virtual Machines using FDR InfiniBand on VMware vSphere," VMware Technical White Paper, http://www.vmware.com/files/pdf/techpaper/vmware-fdr-ib-vsphere-hpc.pdf.
  3. vCenter Server for vSphere Management, VMware Documentation, http://www.vmware.com/products/vcenter-server.html.
  4. DirectPath I/O, VMware Documentation, http://tpub-review.eng.vmware.com:8080/vsphere-65/index.jsp#com.vmware.vsphere.networking.doc/GUID-BF2770C3-39ED-4BC5-A8EF-77D55EFE924C.html.
  5. VMware Performance Team, "Performance Best Practices for VMware vSphere 6.0," VMware Technical White Paper, https://www.vmware.com/content/***/digitalmarketing/vmware/en/pdf/techpaper/vmware-perfbest-practices-vsphere6-0-white-paper.pdf.
  6. Bhavesh Davda, "Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs," VMware Technical White Paper, http://www.vmware.com/techpapers/2011/best-practices-for-performance-tuning-of-latency-s-10220.html.

Na Zhang is a member of the technical staff working on HPC within VMware's Office of the CTO. Her current focus is the performance and solutions of HPC virtualization. Na holds a Ph.D. in Applied Mathematics from Stony Brook University, where her research focused on the design and analysis of parallel algorithms for large- and multi-scale simulations running on supercomputers.

Dell EMC HPC System for Research - Keeping it fresh


Dell EMC has announced an update to the PowerEdge C6320p modular server, introducing support for the Intel® Xeon Phi x200 processor with Intel Omni-Path™ fabric integration (KNL-F).  This update is a processor-only change, which means that changes to the PowerEdge C6320p motherboard were not required.  New purchases of the PowerEdge C6320p server can be configured with KNL or KNL-F processors.  For customers utilizing Omni-Path as a fabric, the KNL-F processor will improve cost and power efficiencies, as it eliminates the need to purchase and power discrete Omni-Path adapters.  Figure 1, below, illustrates the conceptual design differences between the KNL and KNL-F solutions.

Late last year, we introduced the Dell EMC PowerEdge C6320p server, which delivers a high-performance compute node based on the Intel Xeon Phi processor (KNL). This server is optimized for HPC workloads, supporting highly parallel codes with up to 72 out-of-order cores in a compact, half-width 1U package. High-speed fabric options include InfiniBand or Omni-Path, ideal for data-intensive computational applications such as life sciences and weather simulations.

Figure 1: Functional design view of KNL and KNL-F Omni-Path support.

As seen in the figure, the integrated fabric option eliminates the dependency on dual x16 PCIe lanes on the motherboard and allows a denser configuration, with two QSFP connectors on a single carrier circuit board. For continued support of both processors, the PowerEdge C6320p server retains the PCIe signals to the PCIe slots. Inserting the KNL-F processor disables these signals and exposes a connector supporting two QSFP ports, carried on an optional adapter that uses the same PCIe x16 slot for power.

Additional improvements to the PowerEdge C6320p server include support for 64GB LRDIMMs, raising memory capacity to 384GB (six DIMM slots × 64GB), and support for the LSI 2008 RAID controller via the PCIe x4 mezzanine slot.

Current HPC solution offers from Dell EMC

Dell EMC offers several HPC solutions optimized for customer usage and priorities.  Domain-specific HPC compute solutions from Dell EMC include the following scalable options:

  • HPC System for Life Sciences – A customizable and scalable system optimized for the needs of researchers in the biological sciences.
  • HPC System for Manufacturing – A customizable and scalable system designed and configured specifically for engineering and manufacturing solutions including design simulation, fluid dynamics, or structural analysis.
  • HPC System for Research – A highly configurable and scalable platform for supporting a broad set of HPC-related workloads and research users.

For HPC storage needs, Dell EMC offers two high performance, scalable, and robust options:

  • Dell EMC HPC Lustre Storage - This enterprise solution handles big data and high-performance computing demands with a balanced configuration — designed for parallel input/output — and no single point of failure.
  • Dell EMC HPC NFS Storage Solution – Provides high-throughput, flexible, reliable, and hassle-free storage.

Summary

The Dell EMC HPC System for Research, an ideal HPC platform for IT administrators serving diverse and expanding user demands, now supports KNL-F, with its improved cost and power efficiencies, eliminating the need to purchase and power discrete Omni-Path adapters. 

Dell EMC is the industry leader in HPC computing, and we are committed to delivering increased capabilities and performance in partnership with Intel and other technology leaders in the HPC community.   To learn more about Dell EMC HPC solutions and services, visit us online.

http://www.dell.com/en-us/work/learn/high-performance-computing

http://en.community.dell.com/techcenter/high-performance-computing/

www.dellhpc.org/

Dell EMC HPC Systems - SKY is the limit

Munira Hussain, HPC Innovation Lab, July 2017. This is an announcement about the Dell EMC HPC refresh that introduces support for 14th Generation servers based on the new Intel® Xeon® Processor Scalable Family (micro-architecture also known...(read more)