[Reading Notes] Computer Organization and Design: The Hardware/Software Interface (1)


Preface:

Computer Organization and Design: The Hardware/Software Interface is one of the classic introductory textbooks on computer organization. Its pacing is brisk without being stressful, its content rich without being bloated, and its language plain and accessible without pretension, making it an excellent first textbook for students new to computer organization.

This series of reading-notes posts mainly records my thoughts and takeaways from studying and reading. Being reading notes, they will not cover everything; that would make them transcription notes. Every post in the series aims to be concise, with the ultimate goal that, with the book set aside, the notes alone are enough to recall the essential points of each chapter.

So that my future self can understand what I write now, each post will be bilingual (English + Chinese) and divided into five modules: Knowledge Points Summary, Difficult Points Summary, Assignment Solutions, Reflection, and Appendix.

  • Knowledge Points Summary: summarizes the material covered in each reading.
  • Difficult Points Summary: summarizes the pitfalls I hit while reading and trying to understand the material.
  • Assignment Solutions: some texts include exercises; I will try to work through all of them and post my answers and interpretations here. Corrections from fellow readers are welcome.
  • Reflection: my own thoughts from reading go here; this module is not restricted to the book's content, so anything may appear.
  • Appendix: everything that does not fit the four modules above, including but not limited to further reading and cited sources.

In my studies ahead, I hope to stay focused, stay excited, stay humble, and stay curious. Learning never ends, and the road is long.


 Chapter 1: Computer Abstractions and Technology

======================================

English:

1.1 Introduction:

【Knowledge Points Summary】:

1). Classes of Computing Applications:

    • Personal Computers : for individual use; about 35 years old as a class.
    • Servers : larger machines carrying large workloads, with greater computing, storage, and I/O capability.
      • Supercomputers
      • Cloud Computing: Warehouse Scale Computers (WSCs), SaaS (Software as a Service)
    • Embedded Systems : the widest range of applications; low tolerance for failure.
    • Personal Mobile Devices (PMDs)

2). The performance of a program is influenced by :

    • The Algorithms used in the program:
      • Determines: the number of source-level statements and the number of I/O operations executed.
      • Not covered in this book. See my other post, "Six Fundamental Subjects of Computer Science: Recommended Textbook List."
    • The Software Systems used to create and translate the program into machine instructions:
      • Components : Programming Language, Compiler, Instruction Set Architecture.
      • Determines : the number of machine instructions for each source-level statement.
      • Discussed in : Chapters 2 and 3.
    • The effectiveness of the Computer in executing those instructions:
      • Components : Processor, Memory system, I/O system (hardware and operating system).
      • Determines : how fast instructions and I/O operations can be executed.
      • Discussed in : Chapters 4, 5, and 6.

 

1.2 Eight Great Ideas in Computer Architecture:

  • Moore's Law:
    • Integrated circuit (IC) resources double every 18-24 months.
    • Stated by Gordon Moore in 1965.
  • Use Abstraction to Simplify Design:
    • Hide low-level design details beneath higher-level designs.
  • Make the Common Case Fast:
    • Enhance performance by optimizing the common case of the problem.
  • Performance via Parallelism:
    • A significant topic covered in this book.
  • Performance via Pipelining:
    • A particular pattern of parallelism.
  • Performance via Prediction:
    • Guess and start working rather than waiting until you know for sure.
    • Assumes that recovering from a misprediction is not expensive and that prediction is relatively accurate.
  • Hierarchy of Memories:
    • Different memory technologies have different speeds and prices, which must be balanced.
    • The fastest, smallest, and most expensive memory per bit sits at the top of the hierarchy.
    • Through careful hierarchy design and implementation, main-memory accesses can appear almost as fast as cache accesses.
  • Dependability via Redundancy:
    • Make systems dependable by including redundant components that:
      • take over when a failure occurs.
      • help detect failures.

 

1.3 Below Your Program:

1). Hardware/Software Hierarchical Layers:

    • Hierarchy = (Applications Software (Systems Software (Hardware)))
    • Systems Software:
      • sits between the hardware and applications software.
      • provides commonly useful services.
      • includes:
        • Operating Systems:
          • handle basic input/output operations.
          • allocate storage and memory.
          • provide protected sharing of the computer among multiple applications using it simultaneously.
        • Compilers:
          • translate a program in a high-level language into instructions that the hardware can execute.
        • Loaders, assemblers, linkers.
    • Program execution pathway:
      • Electronic hardware is controlled by electric signals.
      • On/off electric signals correspond to 1/0.
      • Assembler: translates the symbolic version of instructions (assembly language) into the binary version (machine language).
      • Every basic computer instruction (like add, subtract) requires one line of assembly language.
      • High-level language program ==(Compiler)==> Assembly language program ==(Assembler)==> Binary machine language program

 

1.4 Under the Covers:

1). Computer Hardware Organization:

    • Functions: inputting data, outputting data, processing data, storing data.
    • Components:
      • Input: feeds data in (Details: Chapters 5, 6)
      • Output: sends data out (Details: Chapters 5, 6)
      • Memory: stores data (Details: Chapter 5)
      • Datapath: (Details: Chapters 3, 4, 6 and Appendix C)
        • where data goes and is modified.
        • performs arithmetic operations.
        • a collection of functional units, such as arithmetic logic units or multipliers, that perform data-processing operations, plus registers and buses.
      • Control:
        • according to the instructions of the program, sends the signals that determine the operations of the datapath, memory, input, and output.
        • (Details: Chapters 4, 6 and Appendix C)
    • Datapath + Control = Processor

2). I/O devices:

    • Output devices:
      • Liquid crystal displays (LCDs):
        • used on mobile devices to get a thin, low-power display.
        • Active matrix display: an LCD that uses a transistor to control the transmission of light at each individual pixel.
        • Bit map:
          • the matrix of pixels, represented as a matrix of bits.
          • normally ranging in size from 1024 x 768 to 2048 x 1536.
        • Raster refresh buffer (frame buffer):
          • stores the bit map that represents the image.
          • the bit pattern per pixel is read out to the graphics display at the refresh rate.
    • Input devices:
      • Touchscreen:
        • Implementation: capacitive sensing.
          • The glass (an insulator) is coated with a transparent conductor.
          • People are electrical conductors, so a fingertip touching the screen distorts the screen's electrostatic field.
          • The capacitance changes where the screen is touched.
          • Allows multiple simultaneous touches.
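A worked example of the bit map sizes quoted above (the 24-bit color depth is my assumption for illustration; actual frame buffers vary in bits per pixel):

```python
# Frame buffer size for a bit-mapped display: width * height * bits-per-pixel.

def frame_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes of frame-buffer memory needed to hold one full bit map."""
    return width * height * bits_per_pixel // 8

# The two display sizes quoted above, at an assumed 24 bits (3 bytes) per pixel:
small = frame_buffer_bytes(1024, 768, 24)
large = frame_buffer_bytes(2048, 1536, 24)
print(small, large)    # 2359296 9437184 (bytes)
print(large / 2**20)   # 9.0 (MiB)
```

The quadrupling of pixel count from 1024 x 768 to 2048 x 1536 quadruples the memory the frame buffer must hold and read out every refresh.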

3). Inside the Box:

    • Chips: integrated circuits, devices combining dozens to millions of transistors.
    • Central Processing Unit (CPU): datapath + control.
    • Memory:
      • the storage area in which programs are kept while they are running, along with the data those programs need.
      • DRAM (dynamic random access memory):
        • built as an integrated circuit.
        • provides random access to any location.
        • RAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.
        • access times are 50 - 70 ns; cost 5 - 10 $/GiB (in 2012).
        • requires the data to be refreshed periodically in order to retain it.
        • one transistor and one capacitor for every bit of data.
        • volatile memory.
    • Cache memory:
      • inside or alongside the CPU.
      • uses a different memory technology, SRAM (static random access memory):
        • faster, smaller, more expensive.
        • does not need to be refreshed: its transistors hold the data as long as power is not cut off.
        • six transistors for every bit of data.
        • volatile memory.
      • Caching addresses the imbalance between the processor's high computing speed and main memory's (DRAM's) low access speed.
      • Cache memory acts as a buffer for the DRAM main memory.
    • Instruction set architecture: (Details: Chapter 2 and Appendix A)
      • the abstract interface between hardware and software.
      • Application binary interface (ABI):
        • the combination of the basic instruction set and the operating system interface (I/O operations + memory allocation + low-level system functions) presented to the application programmer.

4). Storing data:

    • Distinguished by technology:
      • volatile memory:
        • storage, such as DRAM and SRAM, that retains data only while it is receiving power.
        • once power is cut off, the data is lost.
      • nonvolatile memory:
        • retains data even in the absence of a power source.
        • used to store data and programs between runs.
    • Distinguished by role:
      • main memory (primary memory):
        • holds programs and data while they are running.
        • volatile; built from DRAM.
      • secondary memory:
        • holds programs and data between runs.
        • nonvolatile:
          • Magnetic disk (hard disk):
            • composed of rotating platters coated with a magnetic recording material.
            • a rotating mechanical device.
            • Access time: 5 - 20 milliseconds.
            • Cost: 0.05 - 0.10 $/GiB (in 2012).
          • Flash memory (used in PMDs):
            • Access time: 5 - 50 microseconds.
            • Cost: 0.75 - 1.00 $/GiB (in 2012).
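Using the 2012 figures above, a small script contrasts the cost of a terabyte at each level of the hierarchy (the midpoint prices below are my own interpolation of the quoted ranges, for illustration only):

```python
# Cost of 1 TiB (1024 GiB) of each memory technology, using the book's
# 2012 $/GiB figures; midpoints of the quoted ranges are my own choice.
COST_PER_GIB = {
    "DRAM":  7.5,    # quoted range: 5 - 10 $/GiB
    "flash": 0.875,  # quoted range: 0.75 - 1.00 $/GiB
    "disk":  0.075,  # quoted range: 0.05 - 0.10 $/GiB
}

for tech, dollars_per_gib in COST_PER_GIB.items():
    print(f"{tech}: ${dollars_per_gib * 1024:,.2f} per TiB")
```

DRAM works out roughly two orders of magnitude pricier per byte than disk, which is exactly the cost/speed imbalance the memory hierarchy is designed to balance.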

5). Communicating with other computers:

    • Networks:
      • Functions: communication, resource sharing, nonlocal access.
      • Types:
        • Ethernet: used in a Local Area Network (LAN), a network of computers.
        • Internet: used in a Wide Area Network (WAN), a network of networks.
      • Cost: depends on communication speed and distance.
      • Wireless technologies:
        • IEEE standard 802.11

 

1.5 Technologies for Building Processors and Memory:

1). Components:

    • silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuits ==> very large-scale integrated (VLSI) circuits ==> chips

2). Manufacturing process:

    • silicon crystal ingot ==(Sliced)==> wafers ==(Diced)==> dies (chips) ==(Bonding)==> CPU package
    • Defects: microscopic flaws that cause the failure of a die.
    • Yield:
      • the percentage of good dies out of the total number of dies on the wafer.
      • Cost per die = Cost per wafer / (Dies per wafer * Yield)
      • Dies per wafer ≈ Wafer area / Die area
      • Yield = 1 / (1 + Defects per area * Die area / 2)²  ----- based on empirical observations of yields at IC factories.
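The three formulas above chain together; a minimal sketch (the wafer cost, areas, and defect density below are invented for illustration, not from the book):

```python
# Die-cost model from Section 1.5 (illustrative numbers, not the book's).

def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximation: ignores dies lost around the wafer's round edge."""
    return wafer_area / die_area

def yield_rate(defects_per_area: float, die_area: float) -> float:
    """Empirical model: Yield = 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(cost_per_wafer: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    good_dies = dies_per_wafer(wafer_area, die_area) * yield_rate(defects_per_area, die_area)
    return cost_per_wafer / good_dies

# Hypothetical process: 70,000 mm^2 wafer costing $5,000,
# 100 mm^2 dies, 0.02 defects per mm^2.
print(yield_rate(0.02, 100.0))                      # 0.25
print(cost_per_die(5000.0, 70000.0, 100.0, 0.02))   # ~28.57 dollars
```

Note the quadratic penalty in the yield formula: halving the die area more than halves the fraction of dies lost to defects, which is one reason smaller dies are so much cheaper.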

 

1.6 Performance:

1). Definitions:

    • Response Time (Execution Time): the total time required for the computer to complete a task.
    • Throughput (bandwidth): the number of tasks completed per unit time.
    • "Computer X is n times faster than computer Y" means:
      • Performance_X / Performance_Y = n
      • Execution Time_Y / Execution Time_X = n
    • Time measurement:
      • Elapsed Time (system performance): total time to complete a task.
      • CPU execution time: the actual time the CPU spends computing for a specific task.
        • user CPU time (CPU performance): CPU time spent in the program itself.
        • system CPU time: CPU time spent in the operating system performing tasks on behalf of the program.

 

2). Clock Cycle:

    • the time for one clock period, usually of the processor clock, which runs at a constant rate.
    • a measure of how fast the hardware can perform basic functions.
    • the clock period is the length of each clock cycle, e.g., 250 picoseconds.
    • clock rate = 1 / clock period; e.g., 1 / (250 × 10⁻¹² s) = 4 GHz.

3). Performance measurement equations:

    • CPU execution time for a program:

= CPU clock cycles for a program * Clock period

= CPU clock cycles for a program / clock rate

    • CPU clock cycles = Instructions for a program * Average clock cycles per instruction (CPI) 
    • CPU execution time:

= Instruction count * CPI * Clock cycle time

= Instruction count * CPI / Clock rate

    • Time = Seconds/Program = Instructions/Program * Clock cycles/Instruction * Seconds/Clock cycle
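The equations above can be checked with a small script (the instruction counts, CPIs, and clock rates below are invented for illustration):

```python
# CPU performance equation from Section 1.6:
#   CPU time = Instruction count * CPI / Clock rate

def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds of CPU execution time for a program."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10 billion instructions, CPI of 2.0, 4 GHz clock.
t_a = cpu_time(10e9, 2.0, 4e9)   # 5.0 seconds
# The same program on a machine with a better CPI but a slower clock.
t_b = cpu_time(10e9, 1.2, 3e9)   # 4.0 seconds

# "B is n times faster than A" uses the ratio of execution times:
print(t_a, t_b, t_a / t_b)       # 5.0 4.0 1.25
```

This also illustrates why no single factor decides performance: machine B wins despite the slower clock, because its lower CPI more than compensates.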

4). Performance factors:

    • Instruction count:
      • the number of instructions executed for the program.
      • can be measured by:
        • software tools that profile the execution,
        • a simulator of the architecture,
        • or hardware counters.
    • Clock cycles per instruction (CPI):
      • the average number of clock cycles per instruction.
      • can be measured with hardware counters. (Q: any other method to measure CPI?)
      • Q: What determines the number of clock cycles per instruction? What is the difference between implementation and architecture? (see Chapters 4, 5)
    • Clock cycle time:
      • published as part of the documentation for a computer.
      • Q: What determines the clock cycle time?

5). Performance of a program:

    • depends on:
      • Algorithm:
        • affects: instruction count, possibly CPI.
      • Programming Language:
        • affects: instruction count, CPI.
      • Compiler:
        • affects: instruction count, CPI.
      • Instruction Set Architecture:
        • affects: instruction count, CPI, clock rate.

 

6). Some processors fetch and execute multiple instructions per clock cycle, so the CPI can be less than 1.0.

7).

    • To save energy or temporarily boost performance, today's processors can vary their clock rates.
    • E.g., the Intel Core i7 will temporarily increase its clock rate by about 10% until the chip gets too warm; Intel calls this Turbo mode.
    • So we need to use the average clock rate for a program.

 

1.7 The Power Wall

1). Clock rate and power consumption are correlated.

2). The practical power a computer can draw is limited by the cooling available for commodity microprocessors.

3). Energy is another important measure for computers such as PMDs and warehouse scale computers.

4). The dominant IC technology is CMOS (complementary metal oxide semiconductor).

5). Dynamic Energy:

    • the energy consumed when transistors switch states from 0 to 1 (or 1 to 0).
    • the primary source of energy consumption in CMOS.
    • Energy ∝ Capacitive load × Voltage² — the energy of a pulse during the logic transition 0 -> 1 -> 0 or 1 -> 0 -> 1.
    • Power required per transistor = 1/2 × C × V² × f
      • f:
        • the frequency switched.
        • a function of the clock rate.
      • C:
        • the capacitive load of each transistor.
        • a function of both the number of transistors connected to an output (called fanout) and the technology (which determines the capacitance of wires and transistors).
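A rough sketch of the dynamic-power formula above (the capacitance, voltage, and frequency values are invented for illustration; a real chip aggregates this over billions of transistors):

```python
# Dynamic power of CMOS switching: P = 1/2 * C * V^2 * f

def dynamic_power(capacitive_load_farads: float, voltage: float,
                  frequency_hz: float) -> float:
    """Watts dissipated by switching a capacitive load at a given frequency."""
    return 0.5 * capacitive_load_farads * voltage ** 2 * frequency_hz

# Hypothetical chip-level aggregate: 1e-8 F effective load, 1.0 V, 3 GHz.
p_old = dynamic_power(1e-8, 1.0, 3e9)   # 15.0 W
# Lowering the supply voltage by 15% cuts power quadratically:
p_new = dynamic_power(1e-8, 0.85, 3e9)
print(p_old, p_new, p_new / p_old)      # ratio = 0.85**2 = 0.7225
```

The quadratic dependence on V is why voltage scaling was the main lever against the power wall: a modest voltage drop buys a large power saving, until voltages can no longer be lowered safely.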

 

1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

1). Definition: multiple processors per chip.

2). Parallelism: improves performance by increasing throughput.

3). Challenges:

    • Scheduling
    • Load balancing
    • Time for synchronization
    • Overhead of communication between the parties

1.9 Benchmarking the Intel Core i7

Q: What is a benchmark? Who publishes them, how are they computed, and how are they used?

1.10 Fallacies and Pitfalls

1). Pitfalls: Expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.

    • Amdahl's Law (阿姆達爾定律):
      • Execution time after improvement = Execution time affected by improvement / Amount of improvement + Execution Time unaffacted
      • is a rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.
      • use to evaluate potential enhancement. 
      • use to argue for practical limits to the number of parallel processors. (see chapter 6 for detail)
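Amdahl's Law as a small sketch (the 100-second program and its 80/20 split below are a made-up example):

```python
# Amdahl's Law: time_after = affected / improvement + unaffected

def time_after_improvement(affected: float, improvement: float,
                           unaffected: float) -> float:
    """Execution time after speeding up only the 'affected' portion."""
    return affected / improvement + unaffected

# Hypothetical program: 100 s total, of which 80 s is in the part
# we can speed up by 4x; 20 s is untouched.
t = time_after_improvement(80.0, 4.0, 20.0)
print(t, 100.0 / t)   # 40.0 seconds -> only a 2.5x overall speedup, not 4x

# Even with a near-infinite speedup of the affected part,
# the 20 s floor caps the overall gain at 5x:
print(100.0 / time_after_improvement(80.0, 1e12, 20.0))
```

This is the quantitative form of the pitfall: the unimproved 20% dominates once the improved part is fast enough, which is also the argument for practical limits on parallel processor counts.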

======================================

Chinese (translated):

1.1 Introduction:

[Knowledge Points Summary]:

1). Classes of computers:

    • Personal Computers: for private use; about 35 years of history.
    • Servers: greater computing capacity, storage capacity, and I/O workload.
      • Supercomputers: enormous computing power, commonly used for scientific data processing, weather forecasting, etc.
      • Cloud Computing: Warehouse Scale Computers (WSCs), Software as a Service (SaaS)
    • Embedded Computers: the widest range of uses; low tolerance for failure (meaning dependability is a primary goal).
    • Personal Mobile Devices (PMDs): phones, iPads.

2). Factors that influence program performance:

    • Algorithm:
      • Affects: the solution approach, the number of source-level statements (a fuzzy measure; fewer lines of code does not necessarily mean more efficient), the number of I/O operations.
      • Reading: outside the scope of this book; see my other post, "Six Fundamental Subjects of Computer Science: Recommended Textbook List."
    • The software systems used to write and translate the program:
      • Includes: the programming language, the compiler, the instruction set architecture.
      • Affects: the number of machine instructions the computer must execute for a given piece of source code.
      • Reading: Chapters 2 and 3 of this book.
    • How effectively the computer executes the instructions:
      • Includes: the processor, the memory system, the I/O system (hardware and operating system).
      • Affects: how fast instructions execute, how fast memory is accessed, how fast I/O operations run.
      • Reading: Chapters 4, 5, and 6 of this book.

 

1.2: Eight Great Ideas in Computer Architecture:

  • Moore's Law:
    • At constant cost, the number of components that fit on an integrated circuit doubles every 18-24 months.
    • Stated by Gordon Moore in 1965.
  • Use abstraction to simplify design:
    • Hide low-level design details from the upper layers. Abstracting complex details into simple representations makes the design easier to understand and build on.
  • Make the common case fast to improve performance.
  • Performance via parallelism:
    • A major focus of this book.
  • Performance via pipelining:
    • A particular pattern of parallelism.
  • Performance via prediction:
    • Guess and start the next task early to improve performance. (Guessing sounds unreliable, but it is used heavily at the hardware level, as we will see.)
    • Prediction pays off only when recovering from a misprediction is affordable and predictions are reasonably accurate.
  • Hierarchy of memories:
    • Different memory technologies have different costs; a layered memory hierarchy balances access cost against performance.
    • Small, fast, high-performance modules such as caches sit at the top of the hierarchy.
    • Through the hierarchy, programmers can effectively use main memory as if it were as fast as cache, improving program performance.
  • Dependability via redundancy:
    • Building failure-handling and failure-detection components into a system greatly improves its dependability.

 

1.3 The Things Hiding Below Your Program:

1). Hardware/software hierarchical layers:

    • Hierarchy = (Applications software (Systems software (Hardware)))
    • Systems software:
      • Operating System:
        • Handles basic input/output operations.
        • Allocates and schedules storage and memory.
        • Provides protected sharing of the computer's resources among multiple applications running simultaneously.
      • Compiler:
        • Translates a high-level programming language into machine instructions.
      • Loaders, assemblers, linkers.
    • The path a program takes from source code to something the hardware can execute:
      • Hardware is controlled and communicates only through electric signals.
      • The simplest electric signal: on/off = binary 0/1.
      • Assembler: translates assembly language (machine instructions written in simple symbols) into binary machine language (strings of 0s and 1s the machine understands, i.e., convertible into electric signals).
      • Each primitive machine instruction (e.g., add, subtract) requires one assembly language statement.
      • Program in a high-level language (e.g., C, Java) ==(Compiler)==> assembly language program ==(Assembler)==> binary machine language program

 

1.4 The Things Inside the Computer:

1). Computer hardware organization:

    • Functions: data input, data output, data processing, data storage.
    • Components:
      • Input
      • Output
      • Datapath: performs all numeric operations.
      • Control: according to the program's instructions, sends signals that direct the datapath, input/output, and memory.
      • Memory
    • Control + Datapath = Processor (covered in detail in Chapter 4)

2). Input/Output:

3). Hardware inside the box:

    • Chips: integrated circuits, built from millions of transistors.
    • Processor: datapath + control.
    • Memory:
      • where programs and their data are kept while running.
      • Dynamic random access memory (DRAM):
        • built from transistors.
        • volatile: data is lost when power is cut.
        • "random access" (RAM) means accessing any location in memory takes about the same amount of time.
        • access time: 50 - 70 ns; cost: 5 - 10 $/GiB (2012 figures).
        • each bit of data requires one transistor and one capacitor.
        • stores data in capacitors, so it must be refreshed periodically or the data is lost.
    • Cache memory:
      • Static random access memory (SRAM):
        • faster, smaller, more expensive.
        • each bit of data requires six transistors.
        • needs no refresh.
        • volatile: data is lost when power is cut.
      • The CPU computes (processes data) so fast that lookups in DRAM-based main memory simply cannot keep up. To solve this and avoid wasting scarce CPU resources, SRAM-based caching was introduced: a level of storage between main memory and the CPU.
      • All modern CPUs have multiple levels of cache. The details run deep; perhaps a separate post on caches later.
    • Instruction set:
      • the interface between hardware and software; the set of instructions that drive the hardware.
      • Application binary interface (ABI)

4). Storing data:

  • Volatile storage.
  • Nonvolatile storage:
    • where programs and data are kept while waiting to run.
    • Examples:
      • Magnetic disk (hard disk)
      • Flash memory

5). Communication between computers:

    • Computer networks:
      • Functions: communication between computers, resource sharing, remote access.
      • Types:
        • Ethernet: a local area network (LAN) technology; a network of computers.
        • Internet: a wide area network (WAN) technology; a network of networks.

 

1.5 Manufacturing Technologies for Processors and Memory:

1). Components:

    • silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuits (ICs) ==> very large-scale integrated (VLSI) circuits ==> chips

2). Manufacturing process:

    • silicon crystal ingot ==(slice)==> wafers ==(dice)==> dies (chips) ==(package)==> processors/memory
    • defects
    • yield

 ======================================

[Difficult Points Summary]:

1. The total instruction count of a compiled program's assembly code is fixed, so if the same assembly program runs on two different processors, is the CPI the same?

  • No.
  • Premise: the same assembly program can execute on two different processors, which means the two processors implement the same instruction set but differ in circuit implementation (i.e., in microarchitecture).
  • Since the circuit implementations differ, the same instruction exercises different circuit paths when executed; different paths take different amounts of time, so the same instruction needs different numbers of clock cycles. Therefore processors with the same instruction set but different microarchitectures have different CPIs on the same assembly program.

2. CPUs with different instruction sets necessarily have different implementations. Even when two CPUs follow the same instruction set, their implementations (different circuit designs realizing the same instruction, e.g., 1001) can still differ; that is, the architectures can differ, and that difference affects performance. Different processor architectures can also call for different instruction sets, which affects performance too. In one sentence: the instruction set shapes the processor architecture; if the instruction sets are incompatible, the machine instructions are incompatible.

3. Processors with different instruction sets cannot run the same assembly program, because the assembly statements do not match the instruction set.

4. Do CPUs with different architectures use different compilers?

  • Yes. A compiler targets an instruction set and faces a programming language. At compile time it must know which CPU the program will run on.

5. Do CPUs with different architectures use different assembly languages?

  • Of course. Assembly statements map one-to-one onto instructions in the instruction set. If the implemented instruction sets differ, the assembly languages certainly differ.

5.2. What is the relationship among processor architecture, instruction set, and assembly language?

  • See this Zhihu answer: https://www.zhihu.com/question/23474438

6. How is cross-platform compilation handled? How is cross-platform execution handled?

7. How does an instruction set work? That is, how is the instruction set realized at the hardware level? A program is compiled to assembly, and assembly is translated to machine instructions; how does the CPU read those instructions and turn them into the corresponding on/off signals executed in transistors?

  • See this Zhihu answer: https://www.zhihu.com/question/62173438/answer/195436918

 

---恢復內容結束---

筆記前言:

《Computer Organization and Design: The Hardware/Software Interface》,中文譯名,《計算機組成與設計:硬件/軟件接口》,是計算機組成原理的經典入門教材之一。節奏緊湊又不緊張,內容充實又不冗長,語言表述朴實易懂又不故作高深,是一本非常適合初次接觸計算機組成原理的學生閱讀的入門教材。

讀書筆記系列博客是主要是記錄我學習和閱讀中的心得和體會。既然是讀書筆記,肯定不會面面俱到,那就成了抄書筆記了。所有筆記系列博客力求言簡意賅,終極目標是拋開書,單靠閱讀筆記就能聯想起每一章對應的精華要點。

為了使未來的我能夠讀懂現在的我所寫下的文字,每篇讀書筆記將使用雙語敘述(英文+中文),並且都將分為五個模塊:知識點總結,疑難點總結,習題練習,我的思考,附錄。

  • 知識點總結 (Knowledge Points Summary):主要是總結每篇閱讀到的知識內容。
  • 疑難點總結 (Difficult Points Summary):主要是總結我在閱讀和理解時踩過的坑。
  • 習題練習 (Assignment Solutions):有些文章(書籍)會有習題,我盡量將所有的習題都做一遍並把自己的答案和見解放在這里。歡迎各位同僚查漏補缺。
  • 我的思考 (Reflection):我自己閱讀和理解中的心得和體會會放在這里,這一模塊不會拘泥於閱讀內容,什么想法都有可能。
  • 附錄 (Appendix):所有不在上面四個模塊的內容都會放在這里,包括但不僅限於拓展閱讀、引用來源。

在今后的學習生活中,希望自己能保持專注,保持興奮,保持謙遜,保持好奇。學無止境,任重而道遠吶。


 Chapter 1: Computer Abstraction and Technology  (第一章:計算機概述和技術)

======================================

English:

1.1 Introduction:

【Knowledge Points Summary】:

1). Classes of Computing Applications:

    • Personal Computers : individual use, 35 years-old.
    • Servers : larger, carrying large workloads, greater computing, storage and I/O capability. 
      • Supercomputers
      • Cload Computing: Warehous Scale Computers (WSCs)1, SaaS (Software as a Service)2 
    • Embedded Systems : widest range of application, low tolerance of failure. 
    • Personal Mobile Devices (PMD)

2). The performance of a program is influenced by :

    • The Algorithms used in the program:
      • Determines: the number of source-level statements, the number of I/O operations executed.
      • Not covered in this book. Please see my another blog of "Six Fundamental Subjects of Computer Science : Recommendation Textbooks List."
    • The Software Systems used to create and translate the program into machine instructions.
      • Components : Programming Language, Compiler, Instruction Set Architecture.
      • Determines : the number of computer instructions for each source-level statement.
      • Discussed at : Chapter 2 and 3.  
    • The effectiveness of the Computer in executing those instructions.
      • Components : Processor, Memory system, I/O system (hardware and operation system).
      • Determines : How fast instructions and I/O operations can be executed.
      • Discussed at : Chatper 4, 5 and 6.

 

1.2 Eight Great Idea in Computer Architecture:

  • Moore's Law:
    • The integrated circuit (IC) resources doubles everey 18-24 months.
    • Stated by Gordon Moore in 1965.
  • Use Abstraction to Simplify Design:
    • Hidden low-level design detail to high-level design.
  • Make the Common Case Fast:
    • Enhance program performance by optimizing the common case of the problem.
  • Performance via Parallelism:
    • Significant topic covered in this book
  • Performance via Pipeplining:
    • A particular pattern of parallelism.
  • Performance via Prediction:
    • Guessing and starting work rather than until knowing for sure.
    • Assume recovering from misprediction is not expensive and prediction is relatively accurate.
  • Hierarchy of Memories:
    • Different memory techniques have different efficience and price, needed to be balanced.
    • The fastest, smallest and most expensive memory per bit at the top of the hierarchy.
    • Through special hierarchy architect and implementation, we can simulate the main memory access as fast as cashes.
  • Dependability via Redundancy:
    • Making system dependable by including redundant components:
      • take over failure occurs.
      • help detect failure.

 

1.3 Below your program:

1). Hardware/Software Hierarchical Layers:

    • Hierarchy = (Applications Software (Systems Softwares (Hardware)))
    • Systems Software:
      • sitting between the hardware and applications software.
      • providing commonly useful services.
      • including:
        • Operation Systems: 
          • Handler basic input/output operations
          • Allocating storage and memory
          • providing for protected sharing of the computer among multiple applications using it simultaneously.
        • Compilers:
          • Translate high-level language of a program into instructions that the hardware can execute.
        • 3loaders,  assemblers, linker
    • Program exeution pathway:
      • Electronic hardware is controled by electric signals.
      • On/Off electric signal corresponding to 1/0.
      • Assembler: Translate symbolic version of instruction (assembly language) into binary version (machine language). 
      • Every basic computer instruction (like add, minus) require one line of assembly language statement.
      • High-level language program ==(Compiler)==> Assembly Language program ==(Assembler)==> Binary machine language program

 

1.4 Under Covers:

1). Computer Hardware Organization:

    • Functions: Inputting data, Outputting data, Processing data, Storing data.
    • Components: 
      • Input: Feed data  (Detail: Chapter 5, 6)
      • Output: Output data (Detail: Chapter 5, 6)
      • Memory: Storing data (Detail: Chapter 5)
      • Datapath:   (Detail: Chapter 3, 4, 6 and Appendix C)
        • where data goes and is modified.
        • performs arithmetic operations.
        • " is collection of functional units such as: arithmatic logic units or multipliers, that perform data processing operations, registers, and buses." R1         
      • Control:
        • According to the instruction of the program, sending the signals that determine the operations of the datapath, memory and output .
        • Detials: Chapter 4, 6 and appendix C)
    • Datapath + Control = Processor

2). I/O devices:

    • Output devices:
      • Liquid crestal displays (LCDs)
        • used on mobile devices to get a thin, low-power display.
        • Active matrix display: a LCDs using a transistor to control the transmission of light at each individual pixel.
        • Bit Map:
          • the matrix of pixels, represented as a matrix of bits.
          • Normally ranging in size from 1024 x 768 to 2048 x 1536
        • Raster Refresh Buffer (frame buffer):
          • storing the bit map (which is used to represent the image) 
          • The bit pattern per pixel is read out to the graphics display at the refresh rate.
    • Input devices: 
      • Touchscreen:
        • Implementation:
          • Capacitive Sensing.
          • People are electrical conductors.
          • if an insulator (glass) is covered with a transparent conductor (human finger), touching distors the electrostatic field of the screen.
          • Capacitance chages because of touching.
          • Allowing multiple touches simultaneously.

3). Inside box:

    • Chips: Integrated circuit, a device combining dozens to millions of transistors.
    • Central Processor Unit (CPU): datapath + controls
    • Memory:
      • the storage area in which programs are kept when they are runnning and data need by the running program.
      • DRAM (dynamic random access memory):
        • built as an integrated circuit.
        • provide random access to any location.
        • RAM means that memory accesses take basically the same amount of time no matter what portin of the memory is read.
        • access times are 50 - 70 ns and cost 5 - 10 $/GiB.
        • requiring the data to be refreshed periodically in order to retain the data.
        • One transistor and one capacitor for every bit of data.
        • volatile memory.
    • Cache memory:
      • Inside or along side the CPU
      • Using Different memory techonlogy: SRAM (static random access memory):
        • faster, smaller, more expensive
        • dose not need to be refreshed as the transistors inside would continue to hold the data as long as the power is not cut off
        • 6 transitors for every bit of data
        • volatile memory.
      • Cache technique is used to solve the imbalance between the high computing speed of processor and low access speed of the main memory (DRAM). 
      • Cache memory acts as a buffer for the DRAM memory
    • Instruction set architecture: (Details: in Chapter 2 and Appendix A)
      • abstract interface between hardware and software.
      • Application binary interface (ABI):
        • Combination of the basic instruction set and operation system interface (I/O operation + memory allocation + low-level system functions)  for application programmer.

4). Storing data:

    • Distinguish by techniques used:
      • volatile memory:
        • Storage, such as DRAM, SRAM, that retains data only if it is receiving power.
        • Once power is cut off, data would be lost.
      • nonvolatile memory
        • retains data even in teh absence of a power source.
        • used to store data and programs between runs.
    • Distinguish by memory type:
      • main memory (primary memory):
        • memory used to hold program and data while running.
        • volatile memory. Consists of DRAM and SRAM 5.
      • secondary memory:
        • memory used to hold program and data between run.
        • nonvolatile memory:
          • Magnetic Disk (hard disk):
            • composed of rotating platters coated with a magnetic recording material
            • rotating mechanical devices
            • Access time: 5 - 20 milliseconds
            • Cost: 0.05 - 0.10 $/GiB (in 2012)
          • Flash memory(used in PMDs):
            • Access time: 5 - 50 microsecords
            • Cost: 0.75 - 1.00 $/GiB (in 2012)

5). Communicating with other computers:

    • Networks:
      • Functions: Communitation, resource sharing, nonlocal access.
      • Type:
        • Ethernet: used in Local Area Network (LAN), a network of computers.
        • Internet: used in Wide Area Network (WAN),  a network of networks.
      • Cost: depends on speed of communication, length
      • Wireless technologies:
        • IEEE standard 802.11

 

1.5 Technologies for Building Processor and Memory:

1). Components:

    • silicon ==> semiconductor ==> transisitors (on/off switch) ==> integrated circuit ==> Very large-scale integrated (VLSI) circuit. ==> chip

2). Manufacturing process:

    • silicon crystal ingot ==(Sliced)==> wafers ==(Diced)==> dies (chips) ==(Bonding)==> CPU package
    • defects: a microscopic flaw result in the failure of the die.
    • Yeild:
      • the percentage of good die from the total number of dies on the wafer. 
      • Cost per die = Cost per wafer / Dies per wafer * yield
      • Dies per Wafer ~= Wafer area / Die area
      • Yield = 1 / (1 + (Defects per area * Die area/2))2  ----- based on empirical observations of yields at IC factories.

 

1.6 Performance:

1). Definition:

    • Responce Time (Execution Time): the total time required for the computer to complete a task.
    • Throughput (bandwidth): the number of tasks completed per unit time.
    • Define: "Computer X is n times faster than computer Y" :
      • = Performancex / PerformanceY = n
      • = Execution TimeY / Execution Time= n
    • Time measurement:
      • Elapsed Time (System Performance): Total time to complete a task
      • CPU execution time: the actual time the cpu spends computing for a specific task
        • user execution time (CPU performance): cpu time spent in the program
        • system execution time: cpu time spent in the operating system performing tasks on behalf of the program

 

2). Clock Cycle:

    • is the time for one clock period, usually of the processor clock, which runs at a constant rate.
    • a measure of how fast the hardware can perform basic functions.
    • e.g., a clock period of 250 picoseconds is the length of each clock cycle
    • clock rate = 1 / clock period = 1 / (250 * 10^-12 s) = 4 GHz
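The period-to-rate conversion above is a one-line reciprocal; a minimal sketch:

```python
def clock_rate_hz(period_seconds: float) -> float:
    """Clock rate is the reciprocal of the clock period."""
    return 1.0 / period_seconds

rate = clock_rate_hz(250e-12)   # 250 ps period -> 4 GHz
```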

3). Performance measurement equations:

    • CPU execution time for a program:

= CPU clock cycles for a program * Clock period

= CPU clock cycles for a program / clock rate

    • CPU clock cycles = Instructions for a program * Average clock cycles per instruction (CPI) 
    • CPU execution time:

= Instruction count * CPI * Clock cycle time

= Instruction count * CPI / Clock rate

    • Time = Seconds/Program = Instructions/Program  *  Clock cycles/Instruction  *  Seconds/Clock cycle
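The classic CPU-time equation above translates directly into code. The program parameters in the example are hypothetical, chosen only to make the arithmetic visible:

```python
def cpu_time_seconds(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

def times_faster(time_x: float, time_y: float) -> float:
    """'X is n times faster than Y' means n = Execution Time_Y / Execution Time_X."""
    return time_y / time_x

# Hypothetical program: 10^9 instructions, CPI 2.0, on a 4 GHz clock.
t = cpu_time_seconds(10**9, 2.0, 4e9)   # 0.5 seconds
```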

4). Performance factors:

    • Instruction count: 
      • instructions executed for the program
      • can be measured by:
        • using software tools that profile the execution
        • or by using a simulator of the architecture.
        • or use hardware counter
    • Clock cycles per Instruction (CPI): 
      • average number of clock cycles per instruction
      • can be measured with a hardware counter. (Q: any other method to measure CPI?)
      • Q: What determines the number of clock cycles per instruction? What is the difference between Implementation and Architecture? (see Chapters 4, 5)
    • Clock cycle time: 
      • published as part of the documentation for a computer.
      • Q: What determines the clock cycle time?

5). Performance of a program:

    • depends on:
      • Algorithm: 
        • affects: Instruction Count, possibly CPI
      • Programming Language:
        • affects: Instruction Count, CPI
      • Compiler:
        • affects: Instruction Count, CPI
      • Instruction Set Architecture:
        • affects: Instruction Count, CPI, clock rate

 

6). Some processors fetch and execute multiple instructions per clock cycle, so CPI can be < 1.0. 

7).

    • To save energy or temporarily boost performance, today's processors can vary their clock rates. 
    • E.g., the Intel Core i7 will temporarily increase its clock rate by about 10% until the chip gets too warm. Intel calls this Turbo mode.
    • So we need to use the average clock rate for a program.

 

1.7 The Power Wall

1). Clock rate and power consumption are correlated.

2). The practical power a computer can use is limited by the cooling capacity of commodity microprocessors.

3). Energy is another important measure, especially for computers such as PMDs and warehouse-scale computers.

4). The dominant IC technology is called CMOS (complementary metal oxide semiconductor).

5). Dynamic Energy:

    • energy consumed when transistors switch states from 0 to 1 (or 1 to 0).
    • the primary source of energy consumption in CMOS.
    • Energy ∝ Capacitive load * Voltage^2  ----- the energy of a pulse during the logic transition 0 -> 1 -> 0 or 1 -> 0 -> 1.
    • Power required per transistor = 1/2 * C * V^2 * f
      • f:
        • frequency switched
        • is a function of the clock rate.
      • C:
        • capacitive load of each transistor
        • is a function of both the number of transistors connected to an output (called fanout) and the technology (which determines the capacitance of wires and transistors).
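The power relationship above can be sketched in a few lines; the capacitance, voltage, and frequency values below are illustrative, not taken from a real chip:

```python
def dynamic_power_watts(capacitive_load_f: float, voltage_v: float, freq_hz: float) -> float:
    """Dynamic power per transistor = 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load_f * voltage_v ** 2 * freq_hz

# Voltage enters squared: halving V from 1.2 V to 0.6 V at the same C and f
# cuts dynamic power by 4x, which is why voltage scaling mattered so much.
p_high = dynamic_power_watts(1e-15, 1.2, 3e9)
p_low = dynamic_power_watts(1e-15, 0.6, 3e9)
```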

 

1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

1). Definition: multiple processors per chip.

2). Parallelism: Improve performance by increasing throughput.

3). Challenge:

    • Scheduling
    • Load balance
    • Time for Synchronization
    • Overhead for communication between the parties.

1.9 Benchmarking the Intel Core i7

Q: What is a benchmark? Who publishes benchmarks, and how are they calculated? How are they used? 

1.10 Fallacies and Pitfalls

1). Pitfall: Expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.

    • Amdahl's Law:
      • Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
      • a rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.
      • used to evaluate potential enhancements. 
      • used to argue for practical limits to the number of parallel processors. (see Chapter 6 for details)
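Amdahl's Law is easy to state in code. The 80/20 split in the example is an invented workload, used only to show the diminishing returns:

```python
def time_after_improvement(affected: float, unaffected: float, speedup: float) -> float:
    """Amdahl's Law: new time = (affected time / speedup) + unaffected time."""
    return affected / speedup + unaffected

def overall_speedup(parallel_fraction: float, n_processors: float) -> float:
    """Overall speedup when only a fraction of the work benefits from n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_processors)

# 80 s of a 100 s program is improvable; a 4x improvement there
# yields 80/4 + 20 = 40 s, not 25 s.
t = time_after_improvement(80.0, 20.0, 4.0)
```

Note that even with an enormous number of processors, a program that is 50% serial can never run more than 2x faster; this is the argument for practical limits on parallel processors.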

======================================

Chinese (translated):

1.1 Introduction:

【Knowledge Points Summary】:

1). Classes of computers:

    • Personal Computers: individual use, about 35 years of history.
    • Servers: greater computing capacity, storage capacity, and input/output load.
      • Supercomputers: massive computing power, commonly used for scientific data processing, weather forecasting, etc.
      • Cloud Computing: Warehouse Scale Computers (WSCs)1, Software as a Service (SaaS)2
    • Embedded Computers: the widest range of applications, low tolerance of failure (which means dependability is the top priority).
    • Personal Mobile Devices (PMD): mobile phones, iPad.

2). Factors that influence program performance:

    • Algorithm:
      • Affects: the approach to solving the problem, the number of source-level statements (this is fuzzy; fewer lines of code does not automatically mean higher efficiency), the number of I/O operations.
      • Reading: not covered in this book; see my other blog post, "Six Fundamental Subjects of Computer Science: Recommended Textbooks List."
    • The software systems used to write and compile the program:
      • Includes: the programming language, the compiler, the instruction set architecture.
      • Affects: the number of machine instructions the computer must execute for the corresponding code.
      • Reading: Chapters 2 and 3 of this book.
    • The efficiency with which the computer executes instructions:
      • Includes: the processor, the memory system, the I/O system (hardware and operating system).
      • Affects: the speed of executing instructions, of memory access, and of I/O operations.
      • Reading: Chapters 4, 5, and 6 of this book.

 

1.2 Eight Great Ideas in Computer Architecture:

  • Moore's Law:
    • At constant cost, the number of components that can fit on an integrated circuit doubles every 18-24 months.
    • Proposed by Gordon Moore in 1965.
  • Use abstraction to simplify design:
    • Hide low-level details from the layers above. Abstracting complex details into simple representations makes an architecture easier to understand and to build on.
  • Make the common case fast to improve performance.
  • Improve performance via parallelism:
    • A focus of this book.
  • Improve performance via pipelining:
    • A widely used pattern of parallelism.
  • Improve performance via prediction:
    • Guess and start the next task early to improve performance. (I know guessing sounds unreliable, but it is in fact used heavily at the low level; we will encounter it later.)
    • Prediction rests on the premises that the cost of recovering from a misprediction is acceptable and that predictions are relatively accurate.
  • Hierarchy of memories:
    • Different memory technologies have different costs, so a layered memory hierarchy is used to balance the cost and performance of memory access.
    • Small, fast modules with high storage performance, such as caches, sit at the top of the hierarchy.
    • Through the hierarchy, programmers can use main memory as if it were cache, improving program performance.
  • Improve dependability via dedicated modules (redundancy):
    • Building failure-handling and failure-monitoring modules into a system design can greatly improve its dependability.

 

1.3 Below Your Program:

1). Hardware/software hierarchy:

    • Layering = (application software (systems software (hardware)))
    • Systems software:
      • Operating System:
        • Handles basic input/output operations.
        • Allocates and schedules storage and memory use.
        • Provides protected sharing of a computer's resources among multiple application processes using it at the same time.
      • Compiler:
        • Translates programs in high-level programming languages into machine instructions.
      • Loader3, assembler, linker
      • From source code to something the hardware can execute:
        • Hardware is controlled by, and communicates through, electrical signals only.
        • The simplest electrical signal: on/off = binary 0/1.
        • Assembler: translates assembly language (machine instructions written in simple symbolic form) into binary machine language (a language of 0s and 1s that the machine understands, i.e., that can be turned into electrical signals).
        • Each primitive machine instruction (e.g., add, subtract) takes one assembly-language statement.
        • Program in a high-level language (e.g., C, Java) ==(compiler)==> assembly-language program ==(assembler)==> binary machine-language program

 

1.4 Under the Covers:

1). Computer hardware organization:

    • Functions: inputting data, outputting data, processing data, storing data
    • Components:
      • input
      • output
      • datapath: performs all arithmetic operations
      • control: following the program's instructions, signals the datapath, input/output, and memory to perform the corresponding operations.
      • memory
    • control + datapath = processor (covered in detail in Chapter 4)

2). Input/output:

3). Hardware components inside the computer:

    • Chip: an integrated circuit, built from millions of transistors
    • Processor: datapath + control
    • Memory:
      • where data and programs are kept while a program is running
      • Dynamic Random Access Memory (DRAM):
        • built from transistors
        • volatile: contents are lost when power is removed.
        • Random access (RAM) means accessing any location in memory takes the same time.
        • Access time: 50-70 nanoseconds; cost: 5-10 $/GiB (2012 figures)
        • Each bit of storage takes one transistor and one capacitor.
        • Because the storage is capacitive, it must be refreshed periodically or the stored data is lost.
    • Cache memory:
      • Static Random Access Memory (SRAM):
        • faster, smaller, more expensive
        • Each bit of storage takes six transistors
        • needs no refresh
        • volatile: contents are lost when power is removed
      • The CPU computes (processes data) so fast that data lookups in DRAM-based main memory simply cannot keep up. To avoid wasting scarce CPU resources, SRAM is used as a level of storage between main memory and the CPU.
      • All modern CPUs come with multiple levels of cache. There is much more to say about this; I may write a separate blog post on caches someday.
    • Instruction set:
      • the interface between hardware and software; the set of instructions that control the hardware.
      • Application Binary Interface (ABI)

4). Storing data:

  • Volatile storage
  • Non-volatile storage:
    • where data and programs are kept while waiting to run.
    • Examples:
      • magnetic disk (hard disk)
      • flash memory

5). Communicating with other computers:

    • Computer networks:
      • Functions: communication between computers, resource sharing, remote access.
      • Types:
        • Ethernet: a local area network (LAN) technology; a network of connected computers.
        • Internet: a wide area network (WAN) technology; a network of networks.

 

1.5 Technologies for Building Processors and Memory:

1). Components:

    • silicon ==> semiconductor ==> transistors (on/off switches) ==> integrated circuit (IC) ==> very large-scale integrated (VLSI) circuit ==> chip

2). Manufacturing process:

    • silicon crystal ingot ==(sliced)==> wafers ==(diced)==> dies (chips) ==(packaged)==> processor/memory
    • defects
    • yield

 ======================================

【Difficult Points Summary】:

1. The total number of machine instructions in a compiled program's assembly code is fixed, so when the same assembly program runs on different processors, is the CPI the same?

  • No.
  • Premise: the same assembly program can run on two different processors. That means the two processors follow the same instruction set and differ only in the circuit implementation of that instruction set (i.e., the processor's microarchitecture).
  • Since the circuit implementations differ, the same instruction exercises different circuit paths when executed; different paths take different amounts of time, so the number of clock cycles (i.e., the time) needed to execute the same instruction differs. Therefore processors with different implementations (same ISA, different microarchitecture) have different CPIs for the same assembly program.
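This point can be made concrete with invented numbers: the same compiled program (so the same instruction count) on two hypothetical processors, A and B, that share an ISA but differ in microarchitecture:

```python
def exec_time(instructions: int, cpi: float, clock_rate_hz: float) -> float:
    """Execution time = instruction count * CPI / clock rate."""
    return instructions * cpi / clock_rate_hz

PROGRAM = 2 * 10**9                      # same binary => same instruction count
time_a = exec_time(PROGRAM, 2.0, 4e9)    # microarchitecture A: CPI 2.0 at 4 GHz
time_b = exec_time(PROGRAM, 1.2, 3e9)    # microarchitecture B: CPI 1.2 at 3 GHz
# B wins despite its lower clock rate, because its CPI is lower.
```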

2. CPUs with different instruction sets necessarily have different implementations (circuit designs). Even when two CPUs follow the same instruction set, their implementations (different circuit designs realizing the same instruction, say 1001) can still differ; that is, their microarchitectures can differ, and this difference affects performance. Likewise, different processor architectures may need different instruction sets, which also affects performance. In one sentence: the instruction set shapes the processor architecture; if the instruction sets are incompatible, the machine instructions are incompatible.

3. Processors with different instruction sets cannot run the same assembly program, because the program's assembly statements do not match the instruction set.

4. Do CPUs with different architectures use different compilers?

  • Yes. A compiler is built on an instruction set and faces a programming language. At compile time it needs to know which CPU the program will run on.

5. Do CPUs with different architectures use different assembly languages?

  • Of course. Assembly-language statements correspond one-to-one with instruction-set instructions. If the instruction sets the architectures implement differ, the assembly languages must differ.

5.2. What is the relationship among processor architecture, instruction set, and assembly language?

  • See this Zhihu answer: https://www.zhihu.com/question/23474438

6. How is cross-platform compilation done? How is cross-platform execution done?

7. How does an instruction set work? That is, how is an instruction set implemented at the hardware level? A program is compiled into assembly, and assembly is translated into machine instructions; how does the CPU read those instructions and turn them into the corresponding on/off signals executed in the transistors?

  • See this Zhihu answer: https://www.zhihu.com/question/62173438/answer/195436918

 

