CS:APP3e (Computer Systems: A Programmer's Perspective, 3rd edition) Chapter 5 Homework


**5.13**

A.

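B. Given the latency of floating-point addition, the lower bound on CPE should be 3.

C. Given the latency of integer addition, the lower bound on CPE should be 1.

D. From the data-flow graph in A: although floating-point multiplication takes 5 cycles, it carries no data dependence between iterations. Each iteration's multiplication does not need the result of the previous one, so the multiplications can proceed independently. The addition, however, does depend on the previous result (sum = sum + product), so the loop's critical path is the chain of additions. Floating-point addition has a latency of 3 cycles, so the CPE is 3.00.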
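To make the dependence concrete, here is a minimal sketch of the inner-product loop that part D reasons about. It assumes plain `double` arrays in place of the book's vector abstraction, and the name `inner_basic` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop analyzed in 5.13: plain arrays
 * replace the book's vector type. */
double inner_basic(const double *u, const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The multiply uses only u[i] and v[i], so it does not wait on
         * earlier iterations. The add needs the previous sum, so the
         * additions form one serial chain. */
        sum = sum + u[i] * v[i];
    }
    return sum;
}
```

Only the update of `sum` links one iteration to the next, which is why the addition latency, not the multiplication latency, sets the bound.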


**5.14**

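A. As analyzed in 5.13, the critical path is a single addition, and integer addition has a latency of 1 cycle, so the lower bound on CPE is 1.
Update: I misread the problem; it does not ask only about the 6*1 integer case. 跳跳熊12138 pointed this out, and the answer has been corrected.
The answer given by 跳跳熊12138 follows:

The code for this problem performs n additions and n multiplications, where n is the data size. The CPE is lowest when the addition and multiplication functional units are all fully pipelined, in which case both operations reach their throughput bounds. For integers, the throughput bound is 0.5 for addition and 1.0 for multiplication, so CPE = max{0.5, 1.0} = 1.0. For floating point, the throughput bound is 1.0 for addition and 0.5 for multiplication, so CPE = max{1.0, 0.5} = 1.0. In both cases the lower bound on CPE is 1.0.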

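B. "6 * 1 loop unrolling" only reduces the number of loop iterations (which is why the integer CPE drops; the book calls this loop "overhead"). It neither reduces the number of memory reads and writes nor lets more operations be pipelined, so the floating-point version still cannot break through the CPE lower bound set by the critical path.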
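The point in part B can be seen in a minimal sketch of 6 × 1 unrolling. It assumes plain `double` arrays rather than the book's vector type, and the name `inner_6x1` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical 6 x 1 unrolling over plain double arrays. */
double inner_6x1(const double *u, const double *v, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    /* Six elements per iteration, but still a single accumulator:
     * the expression evaluates left to right, so each add waits on
     * the one before it. */
    for (; i + 5 < n; i += 6) {
        sum = sum + u[i] * v[i] + u[i+1] * v[i+1] + u[i+2] * v[i+2]
                  + u[i+3] * v[i+3] + u[i+4] * v[i+4] + u[i+5] * v[i+5];
    }

    /* Handle the leftover elements one at a time. */
    for (; i < n; i++) {
        sum = sum + u[i] * v[i];
    }
    return sum;
}
```

The loop test and index update now run once per six elements, but the six additions into `sum` remain sequential, so the chain of floating-point additions is as long as before.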


**5.15**

/* 6 * 6 loop unrolling */
/* ... omitted ... */
data_t sum1 = (data_t) 0;
data_t sum2 = (data_t) 0;
data_t sum3 = (data_t) 0;
data_t sum4 = (data_t) 0;
data_t sum5 = (data_t) 0;
data_t sum6 = (data_t) 0;

/* Run only while six elements remain, so udata[i+5] stays in bounds */
for(i = 0; i + 5 < length; i += 6)
{
  sum1 = sum1 + udata[i] * vdata[i];	/* independent of one another, so they can be pipelined */
  sum2 = sum2 + udata[i+1] * vdata[i+1];
  sum3 = sum3 + udata[i+2] * vdata[i+2];
  sum4 = sum4 + udata[i+3] * vdata[i+3];
  sum5 = sum5 + udata[i+4] * vdata[i+4];
  sum6 = sum6 + udata[i+5] * vdata[i+5];
}

/* Finish any remaining elements */
for(; i < length; ++i)
{
  sum1 = sum1 + udata[i] * vdata[i];
}

*dest = sum1 + sum2 + sum3 + sum4 + sum5 + sum6;

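Although the operations can now be pipelined, the floating-point adder has an issue time of 1 cycle and a capacity of 1, so at most I/C = 1 addition completes per clock cycle. The lower bound on CPE is therefore 1.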


**5.16**

/* 6 * 1a loop unrolling */
/* ... omitted ... */
data_t sum = (data_t) 0;

/* Run only while six elements remain, so udata[i+5] stays in bounds */
for(i = 0; i + 5 < length; i += 6)
{
  /* The parenthesized products are summed first, off the critical path; only the final add touches sum */
  sum = sum + (udata[i] * vdata[i] + udata[i+1] * vdata[i+1] + udata[i+2] * vdata[i+2] + udata[i+3] * vdata[i+3] + udata[i+4] * vdata[i+4] + udata[i+5] * vdata[i+5]);
}

/* Finish any remaining elements */
for(; i < length; ++i)
{
  sum = sum + udata[i] * vdata[i];
}

*dest = sum;

**5.17**

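#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#define K sizeof(unsigned long)

void *word_memset(void *s, int c, size_t n)
{
  unsigned char *schar = s;
  size_t cnt = 0;

  /* Byte-level writes until the destination is K-aligned (or n runs out) */
  while (cnt < n && (uintptr_t)schar % K != 0)
  {
    *schar++ = (unsigned char)c;
    cnt++;
  }

  /* Build a word holding K copies of the low-order byte of c */
  unsigned long word = 0;
  for (size_t i = 0; i < K; ++i)
  {
    word <<= CHAR_BIT;
    word += (unsigned char)c;
  }

  /* Word-level writes while a whole word still fits */
  unsigned long *slong = (unsigned long *)schar;
  while (n - cnt >= K)
  {
    *slong++ = word;
    cnt += K;
  }

  /* Byte-level writes for the remaining tail */
  schar = (unsigned char *)slong;
  while (cnt < n)
  {
    *schar++ = (unsigned char)c;
    cnt++;
  }
  return s;
}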
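As an aside, the fill word can also be built without a loop. The sketch below uses a multiplication trick; it assumes 8-bit bytes, and the helper name `replicate_byte` is hypothetical, not part of the exercise.

```c
/* Hypothetical helper: returns a word whose every byte equals the
 * low-order byte of c. Assumes 8-bit bytes. */
unsigned long replicate_byte(int c)
{
    /* ~0UL / 0xFF is 0x0101...01: a 1 in the low bit of each byte. */
    unsigned long ones = ~0UL / 0xFFUL;

    /* Multiplying by a value in 0..255 copies it into every byte,
     * with no carries between bytes. */
    return ones * (unsigned char)c;
}
```

Whether this is faster than the shift loop depends on the compiler and the machine; the loop runs only once per call, so the difference is likely to be small.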

**5.18**

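There is more than one valid answer; here I apply 10 × 10 loop unrolling to the "direct evaluation" version.

The bottleneck of the original function is the statement xpwr = xpwr * x, a data dependence through multiplication. The book gives K >= L*C (page 540), where L is the latency and C is the capacity. For floating-point multiplication these are 5 and 2, so K is chosen as 10 here.

Also, when K is large the compiler may well run out of registers and have to keep local variables on the stack (they are loaded into the cache at run time), which costs some performance.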
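For reference, the "direct evaluation" routine being optimized has roughly the following shape. This is a sketch reconstructed from the description above, not a verbatim copy of the book's listing, and the name `poly_direct` is hypothetical.

```c
/* Hypothetical baseline: evaluates a[0] + a[1]*x + ... + a[degree]*x^degree. */
double poly_direct(const double a[], double x, long degree)
{
    double result = a[0];
    double xpwr = x;            /* holds x^i at the top of iteration i */
    for (long i = 1; i <= degree; i++) {
        result += a[i] * xpwr;
        xpwr = xpwr * x;        /* each power depends on the previous one */
    }
    return result;
}
```

Comparing its output with `faster_poly` on a few inputs is a quick sanity check. The two may differ in the last few bits, because the unrolled version adds the terms in a different order.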

double faster_poly(double a[], double x, long degree)
{
	long i;
	double result1 = a[0];
	double result2 = 0;
	double result3 = 0;
	double result4 = 0;
	double result5 = 0;
	double result6 = 0;
	double result7 = 0;
	double result8 = 0;
	double result9 = 0;
	double result10 = 0;

	double xpwr1 = x;
	double xpwr2 = xpwr1 * x;
	double xpwr3 = xpwr2 * x;
	double xpwr4 = xpwr3 * x;
	double xpwr5 = xpwr4 * x;
	double xpwr6 = xpwr5 * x;
	double xpwr7 = xpwr6 * x;
	double xpwr8 = xpwr7 * x;
	double xpwr9 = xpwr8 * x;
	double xpwr10 = xpwr9 * x;
	double x10 = xpwr10;

	for (i = 1; (i+9) <= degree; i += 10)
	{
		result1 += a[i] * xpwr1;
		result2 += a[i+1] * xpwr2;
		result3 += a[i+2] * xpwr3;
		result4 += a[i+3] * xpwr4;
		result5 += a[i+4] * xpwr5;
		result6 += a[i+5] * xpwr6;
		result7 += a[i+6] * xpwr7;
		result8 += a[i+7] * xpwr8;
		result9 += a[i+8] * xpwr9;
		result10 += a[i+9] * xpwr10;

		xpwr1 *= x10;
		xpwr2 *= x10;
		xpwr3 *= x10;
		xpwr4 *= x10;
		xpwr5 *= x10;
		xpwr6 *= x10;
		xpwr7 *= x10;
		xpwr8 *= x10;
		xpwr9 *= x10;
		xpwr10 *= x10;
	}
	for (; i <= degree; ++i)
	{
		result1 += a[i] * xpwr1;
		xpwr1 *= x;
	}
	
	result1 += result2;
	result1 += result3;
	result1 += result4;
	result1 += result5;
	result1 += result6;
	result1 += result7;
	result1 += result8;
	result1 += result9;
	result1 += result10;
	return result1;
}

**5.19**

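The bottleneck is the statement val = val + a[i] (the book also adds last_val, but the idea is the same), a data dependence through addition. The book gives K >= L*C (page 540), where L is the latency and C is the capacity. For floating-point addition these are 3 and 1, so 3*1a is chosen here.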
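The serial dependence described above is easiest to see in a plain baseline. The following is a minimal sketch, not the book's listing; the name `psum_baseline` is hypothetical.

```c
/* Hypothetical baseline prefix sum: p[i] = a[0] + ... + a[i]. */
void psum_baseline(const float a[], float p[], long n)
{
    float val = 0;
    for (long i = 0; i < n; i++) {
        val = val + a[i];   /* needs the previous val: serial chain of adds */
        p[i] = val;
    }
}
```

Every element costs one addition on the critical path here. The 3*1a version keeps only one addition per three elements on that path and computes the partial sums of a[i], a[i+1], a[i+2] off it.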

void faster_psum1a(float a[], float p[], long n)
{
	long i;
	float val = 0;
	for (i = 0; (i+2) < n; i += 3)
	{
		float tmp1 = a[i];
		float tmp2 = tmp1 + a[i+1];
		float tmp3 = tmp2 + a[i+2];
		
		p[i] = val + tmp1;
		p[i+1] = val + tmp2;
		p[i+2] = val = val + tmp3;
	}
	for (; i < n; ++i)
	{
		val += a[i];
		p[i] = val;
	}
}

