Keywords: blktrace, blk tracer, blkparse, block trace events, BIO.
This chapter is only a record of methods and tools for optimizing Block-layer I/O performance.
It does not analyze the Block layer in detail, nor does it expand on how the tools are used in practice or how their results are analyzed.
I will fill that in when a suitable opportunity arises.
1. Introduction to blktrace
The Block I/O framework as a whole can be divided into three layers: VFS, the Block layer, and the I/O device driver.
The Linux kernel provides several means of tracing Block-layer operations: the blk tracer, the blktrace/blkparse/btt tools for capture and analysis, or the block-related trace events.
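As a rough sketch of the three entry points (assuming debugfs is mounted at /sys/kernel/debug, and using /dev/sda6 as an example device):
# 1. The blk tracer inside ftrace
echo blk > /sys/kernel/debug/tracing/current_tracer
# 2. The blktrace/blkparse user-space tools
sudo blktrace -d /dev/sda6 -o - | blkparse -i -
# 3. The block trace events
echo 1 > /sys/kernel/debug/tracing/events/block/enable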
2. The blk Tracer
The blk tracer is one of the tracers provided by Linux ftrace; it mainly traces operations in the Linux Block layer.
2.1 Using the blk tracer
Run the script as sh blk_tracer.sh 4k 512 fsync to capture the corresponding case.
#!/bin/bash
echo $1 $2 $3
#Prepare for the test
echo 1 > /sys/block/mmcblk0/trace/enable
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/events/enable
echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
echo blk > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
#Run the test case
if [ $3 = 'fsync' ]; then
dd bs=$1 count=$2 if=/dev/zero of=out_empty conv=fsync
else
dd bs=$1 count=$2 if=/dev/zero of=out_empty
fi
#Capture the test log
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/block/mmcblk0/trace/enable
cat /sys/kernel/debug/tracing/trace > test_$1_$2_$3.txt
echo nop > /sys/kernel/debug/tracing/current_tracer
echo > /sys/kernel/debug/tracing/trace
2.2 Registering the blk tracer
First, register_tracer registers the blk tracer with ftrace; the core print function is blk_tracer_print_line.
static int __init init_blk_tracer(void)
{
	...
	if (register_tracer(&blk_tracer) != 0) {
		pr_warning("Warning: could not register the block tracer\n");
		unregister_ftrace_event(&trace_blk_event);
		return 1;
	}
	return 0;
}

static struct tracer blk_tracer __read_mostly = {
	.name		= "blk",
	.init		= blk_tracer_init,
	.reset		= blk_tracer_reset,
	.start		= blk_tracer_start,
	.stop		= blk_tracer_stop,
	.print_header	= blk_tracer_print_header,
	.print_line	= blk_tracer_print_line,
	.flags		= &blk_tracer_flags,
	.set_flag	= blk_tracer_set_flag,
};

static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
{
	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
		return TRACE_TYPE_UNHANDLED;

	return print_one_line(iter, true);
}
print_one_line is the core of the printing path; its main job is to select the appropriate print function according to the current struct blk_io_trace->action.
static enum print_line_t print_one_line(struct trace_iterator *iter, bool classic)
{
	struct trace_seq *s = &iter->seq;
	const struct blk_io_trace *t;
	u16 what;
	int ret;
	bool long_act;
	blk_log_action_t *log_action;

	t = te_blk_io_trace(iter->ent);
	what = t->action & ((1 << BLK_TC_SHIFT) - 1);	/* what: which operation type this record is */
	long_act = !!(trace_flags & TRACE_ITER_VERBOSE);
	log_action = classic ? &blk_log_action_classic : &blk_log_action;

	if (t->action == BLK_TN_MESSAGE) {
		ret = log_action(iter, long_act ? "message" : "m");
		if (ret)
			ret = blk_log_msg(s, iter->ent);
		goto out;
	}

	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))	/* what == 0 means an unknown operation type */
		ret = trace_seq_printf(s, "Unknown action %x\n", what);
	else {
		ret = log_action(iter, what2act[what].act[long_act]);	/* first print the common prefix via blk_log_action_classic or blk_log_action */
		if (ret)
			ret = what2act[what].print(s, iter->ent);	/* then call the action-specific print function for the details */
	}
out:
	return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
}

static const struct {
	const char *act[2];
	int (*print)(struct trace_seq *s, const struct trace_entry *ent);
} what2act[] = {
	[__BLK_TA_QUEUE]	= {{ "Q",  "queue" },	     blk_log_generic },
	[__BLK_TA_BACKMERGE]	= {{ "M",  "backmerge" },    blk_log_generic },
	[__BLK_TA_FRONTMERGE]	= {{ "F",  "frontmerge" },   blk_log_generic },
	[__BLK_TA_GETRQ]	= {{ "G",  "getrq" },	     blk_log_generic },
	[__BLK_TA_SLEEPRQ]	= {{ "S",  "sleeprq" },	     blk_log_generic },
	[__BLK_TA_REQUEUE]	= {{ "R",  "requeue" },	     blk_log_with_error },
	[__BLK_TA_ISSUE]	= {{ "D",  "issue" },	     blk_log_generic },
	[__BLK_TA_COMPLETE]	= {{ "C",  "complete" },     blk_log_with_error },
	[__BLK_TA_PLUG]		= {{ "P",  "plug" },	     blk_log_plug },
	[__BLK_TA_UNPLUG_IO]	= {{ "U",  "unplug_io" },    blk_log_unplug },
	[__BLK_TA_UNPLUG_TIMER]	= {{ "UT", "unplug_timer" }, blk_log_unplug },
	[__BLK_TA_INSERT]	= {{ "I",  "insert" },	     blk_log_generic },
	[__BLK_TA_SPLIT]	= {{ "X",  "split" },	     blk_log_split },
	[__BLK_TA_BOUNCE]	= {{ "B",  "bounce" },	     blk_log_generic },
	[__BLK_TA_REMAP]	= {{ "A",  "remap" },	     blk_log_remap },
};
As fill_rwbs() shows, the RWBS characters denote the read/write type: W for BLK_TC_WRITE, S for BLK_TC_SYNC, R for BLK_TC_READ, N for BLK_TN_MESSAGE, and so on.
Combined with what2act, the exact meaning of each printed action can be determined; together with its print function, the rough meaning of each line becomes clear.
dd-844      [000] ....  1540.751312: 179,0  Q WS 32800 + 1 [dd] ------------- Q: queue; WS: synchronous write; blk_log_generic prints disk offset + sector count [process name]
dd-844      [000] ....  1540.751312: 179,0  G WS 32800 + 1 [dd] ------------- G: getrq
dd-844      [000] ....  1540.751312: 179,0  I WS 32800 + 1 [dd]
dd-844      [000] ....  1540.753937: 179,0  P  N [dd] ----------------------- P: plug; N: BLK_TN_MESSAGE
dd-844      [000] ....  1540.753937: 179,0  A WS 18079 + 1 <- (179,1) 18047
dd-844      [000] ....  1540.753937: 179,0  Q WS 18079 + 1 [dd]
dd-844      [000] ....  1540.753967: 179,0  G WS 18079 + 1 [dd]
dd-844      [000] ....  1540.753967: 179,0  I WS 2863 + 1 [dd]
dd-844      [000] ....  1540.753967: 179,0  I WS 18079 + 1 [dd]
dd-844      [000] ....  1540.753967: 179,0  U  N [dd] 2 --------------------- U: unplug_io
mmcqd/0-673 [000] ....  1540.753967: 179,0  D WS 2863 + 1 [mmcqd/0]
mmcqd/0-673 [000] ....  1540.753998: 179,0  D WS 18079 + 1 [mmcqd/0]
mmcqd/0-673 [000] ....  1540.757690: 179,0  C WS 2863 + 1 [0]
mmcqd/0-673 [000] ....  1540.760681: 179,0  C WS 18079 + 1 [0]
dd-844      [000] ....  1540.760742: 179,0  A  R 2864 + 1 <- (179,1) 2832 --- A: remap; offset + sector count <- source device (major,minor) and source offset
3. blktrace Usage and Internals
blktrace is a tool for tracing Block I/O in the Linux kernel; it works at the kernel's Block layer.
With this tool you can get detailed information about the I/O request queue, including the name of the reading/writing process, its PID, execution time, physical block numbers read or written, block sizes, and so on.
PS: the code shown is from Linux 3.4.110, but the experiments were run on an Ubuntu 4.4.0-31 kernel.
3.1 Using blktrace
Install blktrace:
sudo apt-get install blktrace
blktrace captures Block-I/O-related logs, and blkparse analyzes the binary logs that blktrace captures. btt can also be used on blktrace's binary logs, after merging them into a single stream with blkparse (shown below).
sudo blktrace -d /dev/sda6 -o - | blkparse -i -
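For a time-bounded capture, blktrace's -w option (run for the given number of seconds) fits the same pipeline, e.g.:
sudo blktrace -d /dev/sda6 -w 30 -o - | blkparse -i -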
Or run the tools separately:
sudo blktrace -d /dev/sda6 -o sda6 # produces the sda6.blktrace.N files
blkparse -i sda6.blktrace.* -o sda6.txt
blkparse -i sda6 -d sda6.bin # merge the per-CPU traces into a single binary stream
btt -i sda6.bin
3.2 How blktrace works
1. When blktrace runs, it creates a directory named after the device under /sys/kernel/debug/block and generates the files dropped, msg, and one traceN file per CPU (see the listing after this list).
2. blktrace configures the kernel via ioctl in order to capture the log.
3. blktrace binds one thread per CPU in the system to collect the corresponding data.
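On a two-CPU machine, for instance, one would expect a listing roughly like this while blktrace is running (illustrative):
ls /sys/kernel/debug/block/sda6
# dropped  msg  trace0  trace1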
3.2.1 blktrace-related nodes of a Block device
Tracing blktrace with strace shows its flow: it opens the /dev/sda6 device and then issues BLKTRACESETUP and BLKTRACESTART via ioctl.
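The log below was captured with an invocation along these lines (the exact flags and duration are illustrative):
sudo strace -f -e trace=open,ioctl -o blktrace.strace blktrace -d /dev/sda6 -w 5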
open("/dev/sda6", O_RDONLY|O_NONBLOCK) = 3 statfs("/sys/kernel/debug", {f_type=0x64626720, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0... ioctl(3, BLKTRACESETUP, {act_mask=65535, buf_size=524288, buf_nr=4, start_lba=0, end_lba=0, pid=0}, {name="sda6"}) = 0... ioctl(3, BLKTRACESTART, 0x608c20) = 0
The flow is briefly analyzed below by walking through the code.
add_disk()
  -> blk_register_queue()
    -> blk_trace_init_sysfs()
      -> blk_trace_attr_group
        -> blk_trace_attrs
If CONFIG_BLK_DEV_IO_TRACE is defined, a trace directory is created under /sys/block/sda/sda6. Running blktrace switches its enable node on.
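A minimal sketch of driving tracing directly through these sysfs nodes (the attribute names come from blk_trace_attrs):
ls /sys/block/sda/sda6/trace
# act_mask  enable  end_lba  pid  start_lba
echo 1 > /sys/block/sda/sda6/trace/enable    # start tracing this partition
echo 0 > /sys/block/sda/sda6/trace/enable    # stop tracing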
3.2.2 Block device ioctl commands
The ioctl is issued against /dev/sda6; the ioctl code path is as follows.
def_blk_fops
  -> .unlocked_ioctl = block_ioctl
    -> blkdev_ioctl
      -> blk_trace_ioctl (BLKTRACESETUP/BLKTRACESTART/BLKTRACESTOP/BLKTRACETEARDOWN)
blk_trace_ioctl handles the four cases BLKTRACESETUP, BLKTRACESTART, BLKTRACESTOP, and BLKTRACETEARDOWN, which configure, start, stop, and tear down tracing respectively.
int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
{
	struct request_queue *q;
	int ret, start = 0;
	char b[BDEVNAME_SIZE];

	q = bdev_get_queue(bdev);
	if (!q)
		return -ENXIO;

	mutex_lock(&bdev->bd_mutex);

	switch (cmd) {
	case BLKTRACESETUP:
		bdevname(bdev, b);
		ret = blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
#if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64)
	case BLKTRACESETUP32:
		bdevname(bdev, b);
		ret = compat_blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
#endif
	case BLKTRACESTART:
		start = 1;
	case BLKTRACESTOP:
		ret = blk_trace_startstop(q, start);
		break;
	case BLKTRACETEARDOWN:
		ret = blk_trace_remove(q);
		break;
	default:
		ret = -ENOTTY;
		break;
	}

	mutex_unlock(&bdev->bd_mutex);
	return ret;
}
int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
		    struct block_device *bdev, char __user *arg)
{
	struct blk_user_trace_setup buts;
	int ret;

	ret = copy_from_user(&buts, arg, sizeof(buts));
	if (ret)
		return -EFAULT;

	ret = do_blk_trace_setup(q, name, dev, bdev, &buts);
	if (ret)
		return ret;

	if (copy_to_user(arg, &buts, sizeof(buts))) {
		blk_trace_remove(q);
		return -EFAULT;
	}
	return 0;
}

int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
		       struct block_device *bdev,
		       struct blk_user_trace_setup *buts)
{
	struct blk_trace *old_bt, *bt = NULL;
	struct dentry *dir = NULL;
	int ret, i;
	...
	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
	if (!bt)
		return -ENOMEM;

	ret = -ENOMEM;
	bt->sequence = alloc_percpu(unsigned long);
	if (!bt->sequence)
		goto err;

	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char));
	if (!bt->msg_data)
		goto err;

	ret = -ENOENT;

	mutex_lock(&blk_tree_mutex);
	if (!blk_tree_root) {
		blk_tree_root = debugfs_create_dir("block", NULL);	/* create the /sys/kernel/debug/block directory */
		if (!blk_tree_root) {
			mutex_unlock(&blk_tree_mutex);
			goto err;
		}
	}
	mutex_unlock(&blk_tree_mutex);

	dir = debugfs_create_dir(buts->name, blk_tree_root);	/* create the per-device directory under /sys/kernel/debug/block */
	if (!dir)
		goto err;

	bt->dir = dir;
	bt->dev = dev;
	atomic_set(&bt->dropped, 0);

	ret = -EIO;
	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
					       &blk_dropped_fops);	/* create the dropped node under /sys/kernel/debug/block/sda6 */
	if (!bt->dropped_file)
		goto err;

	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
	if (!bt->msg_file)
		goto err;

	bt->rchan = relay_open("trace", dir, buts->buf_size,
			       buts->buf_nr, &blk_relay_callbacks, bt);	/* open a /sys/kernel/debug/block/sda6/traceN relay channel for each CPU */
	...
	if (atomic_inc_return(&blk_probes_ref) == 1)
		blk_register_tracepoints();	/* register the same tracepoints as the trace events under /sys/kernel/debug/tracing/events/block */

	return 0;
err:
	blk_trace_free(bt);
	return ret;
}
blk_trace_startstop performs blktrace's start/stop switching; after stopping, the per-CPU relay channels are forcibly flushed.
int blk_trace_startstop(struct request_queue *q, int start)
{
	int ret;
	struct blk_trace *bt = q->blk_trace;
	...
	ret = -EINVAL;
	if (start) {
		if (bt->trace_state == Blktrace_setup ||
		    bt->trace_state == Blktrace_stopped) {
			blktrace_seq++;
			smp_mb();
			bt->trace_state = Blktrace_running;

			trace_note_time(bt);
			ret = 0;
		}
	} else {
		if (bt->trace_state == Blktrace_running) {
			bt->trace_state = Blktrace_stopped;
			relay_flush(bt->rchan);
			ret = 0;
		}
	}
	return ret;
}
BLKTRACETEARDOWN frees the buffers created at setup, removes the related file nodes, and unregisters the tracepoints.
int blk_trace_remove(struct request_queue *q)
{
	struct blk_trace *bt;

	bt = xchg(&q->blk_trace, NULL);
	if (!bt)
		return -EINVAL;

	if (bt->trace_state != Blktrace_running)
		blk_trace_cleanup(bt);

	return 0;
}

static void blk_trace_cleanup(struct blk_trace *bt)
{
	blk_trace_free(bt);
	if (atomic_dec_and_test(&blk_probes_ref))
		blk_unregister_tracepoints();
}
The print_one_line here differs slightly from the blk tracer's printing: this path calls blk_log_action to print the common-prefix information.
static int __init init_blk_tracer(void)
{
	if (!register_ftrace_event(&trace_blk_event)) {
		pr_warning("Warning: could not register block events\n");
		return 1;
	}
	...
}

static enum print_line_t blk_trace_event_print(struct trace_iterator *iter,
					       int flags, struct trace_event *event)
{
	return print_one_line(iter, false);
}
The blk tracer prints with blk_log_action_classic(); the difference between the two is that the classic variant additionally prints the CPU number, a seconds.nanoseconds timestamp, and the PID, whereas blk_log_action prints only the device numbers, the action, and the RWBS field.
static int blk_log_action_classic(struct trace_iterator *iter, const char *act)
{
	char rwbs[RWBS_LEN];
	unsigned long long ts  = iter->ts;
	unsigned long nsec_rem = do_div(ts, NSEC_PER_SEC);
	unsigned secs	       = (unsigned long)ts;
	const struct blk_io_trace *t = te_blk_io_trace(iter->ent);

	fill_rwbs(rwbs, t);

	return trace_seq_printf(&iter->seq,
				"%3d,%-3d %2d %5d.%09lu %5u %2s %3s ",
				MAJOR(t->device), MINOR(t->device), iter->cpu,
				secs, nsec_rem, iter->ent->pid, act, rwbs);
}

static int blk_log_action(struct trace_iterator *iter, const char *act)
{
	char rwbs[RWBS_LEN];
	const struct blk_io_trace *t = te_blk_io_trace(iter->ent);

	fill_rwbs(rwbs, t);
	return trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
				MAJOR(t->device), MINOR(t->device), act, rwbs);
}
3.3 Analyzing blkparse results
Parsing the blktrace capture with blkparse yields results broadly similar to the blk tracer's.
Note that this output has the CPU number in column 2 and the execution timestamp in column 4 (column 3 is a per-CPU sequence number).
8,6    1        1     0.000000000   173  P   N [jbd2/sda1-8]
8,6    1        2     0.000011134   173  U   N [jbd2/sda1-8] 1
8,0    1        3     4.000017080   570  A  WS 528813240 + 8 <- (8,6) 440921272
8,6    1        4     4.000018315   570  Q  WS 528813240 + 8 [jbd2/sda6-8]
8,6    1        5     4.000024476   570  G  WS 528813240 + 8 [jbd2/sda6-8]
8,6    1        6     4.000025282   570  P   N [jbd2/sda6-8]
8,0    1        7     4.000026610   570  A  WS 528813248 + 8 <- (8,6) 440921280
8,6    1        8     4.000026935   570  Q  WS 528813248 + 8 [jbd2/sda6-8]
8,6    1        9     4.000028253   570  M  WS 528813248 + 8 [jbd2/sda6-8]
8,0    1       10     4.000028895   570  A  WS 528813256 + 8 <- (8,6) 440921288
8,6    1       11     4.000029180   570  Q  WS 528813256 + 8 [jbd2/sda6-8]
8,6    1       12     4.000029585   570  M  WS 528813256 + 8 [jbd2/sda6-8]
8,0    1       13     4.000030182   570  A  WS 528813264 + 8 <- (8,6) 440921296
8,6    1       14     4.000030463   570  Q  WS 528813264 + 8 [jbd2/sda6-8]
8,6    1       15     4.000030845   570  M  WS 528813264 + 8 [jbd2/sda6-8]
8,0    1       16     4.000031406   570  A  WS 528813272 + 8 <- (8,6) 440921304
8,6    1       17     4.000031687   570  Q  WS 528813272 + 8 [jbd2/sda6-8]
8,6    1       18     4.000032066   570  M  WS 528813272 + 8 [jbd2/sda6-8]
...
CPU0 (sda6):
 Reads Queued:        2375,    16816KiB  Writes Queued:         909,     4648KiB
 Read Dispatches:     2329,    16632KiB  Write Dispatches:      310,     3876KiB
 Reads Requeued:         0               Writes Requeued:        15
 Reads Completed:       23,      132KiB  Writes Completed:        0,        0KiB
 Read Merges:           27,      108KiB  Write Merges:          489,     1960KiB
 Read depth:            32               Write depth:            31
 IO unplugs:           379               Timer unplugs:           0
CPU1 (sda6):
 Reads Queued:        2737,    12724KiB  Writes Queued:        1152,     5080KiB
 Read Dispatches:     2755,    12908KiB  Write Dispatches:      759,     5852KiB
 Reads Requeued:         0               Writes Requeued:       249
 Reads Completed:     5061,    29408KiB  Writes Completed:      945,     9728KiB
 Read Merges:            0,        0KiB  Write Merges:          759,     3044KiB
 Read depth:            32               Write depth:            31
 IO unplugs:           212               Timer unplugs:           1

Total (sda6):
 Reads Queued:        5112,    29540KiB  Writes Queued:        2061,     9728KiB
 Read Dispatches:     5084,    29540KiB  Write Dispatches:     1069,     9728KiB
 Reads Requeued:         0               Writes Requeued:       264
 Reads Completed:     5084,    29540KiB  Writes Completed:      945,     9728KiB
 Read Merges:           27,      108KiB  Write Merges:         1248,     5004KiB
 IO unplugs:           591               Timer unplugs:           1

Throughput (R/W): 2KiB/s / 0KiB/s
Events (sda6): 41303 entries
Skips: 0 forward (0 - 0.0%)
4. Block Trace Events
The block trace events and the blk tracer are tied together through blk_register_tracepoints() and the events defined in include/trace/events/block.h; the two correspond one to one.
The block events print an extra descriptive string, which makes them more readable: their meaning is easier to grasp than the output of the blk tracer or blktrace.
Under /sys/kernel/debug/tracing/events/block, the related events can each be enabled individually (see the example after the list):
block_bio_backmerge
block_bio_bounce
block_bio_complete
block_bio_frontmerge
block_bio_queue
block_bio_remap
block_getrq
block_plug
block_rq_abort
block_rq_complete
block_rq_insert
block_rq_issue
block_rq_remap
block_rq_requeue
block_sleeprq
block_split
block_unplug
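For example, to watch only request issue and completion (a minimal sketch, debugfs paths as above):
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe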
These event-name strings also make it easy to locate the corresponding code in the kernel.
Only by understanding the block-layer flow, together with the logs of these key events, can you see where the problem lies and then go optimize it.
# tracer: nop
#
# entries-in-buffer/entries-written: 14856/14856   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              dd-955   [000] ....  3287.902618: block_bio_remap: 179,0 W 472280 + 16 <- (179,1) 472248
              dd-955   [000] ....  3287.902618: block_bio_queue: 179,0 W 472280 + 16 [dd]
              dd-955   [000] ....  3287.902649: block_getrq: 179,0 W 472280 + 16 [dd]
              dd-955   [000] ....  3287.902649: block_plug: [dd]
              dd-955   [000] d...  3287.902649: block_rq_insert: 179,0 W 0 () 472280 + 16 [dd]
              dd-955   [000] d...  3287.902679: block_unplug: [dd] 1
         mmcqd/0-673   [000] d...  3287.902679: block_rq_issue: 179,0 W 0 () 472280 + 16 [mmcqd/0]
         mmcqd/0-673   [000] d...  3287.907440: block_rq_complete: 179,0 W () 472280 + 16 [0]
              dd-955   [000] ....  3287.907501: block_bio_remap: 179,0 WS 32800 + 1 <- (179,1) 32768
              dd-955   [000] ....  3287.907501: block_bio_queue: 179,0 WS 32800 + 1 [dd]
              dd-955   [000] ....  3287.907501: block_getrq: 179,0 WS 32800 + 1 [dd]
              dd-955   [000] d...  3287.907501: block_rq_insert: 179,0 WS 0 () 32800 + 1 [dd]
         mmcqd/0-673   [000] d...  3287.907532: block_rq_issue: 179,0 WS 0 () 32800 + 1 [mmcqd/0]
         mmcqd/0-673   [000] d...  3287.921631: block_rq_complete: 179,0 WS () 32800 + 1 [0]
              dd-955   [000] ....  3287.921631: block_bio_remap: 179,0 WS 2794 + 1 <- (179,1) 2762
              dd-955   [000] ....  3287.921631: block_bio_queue: 179,0 WS 2794 + 1 [dd]
              dd-955   [000] ....  3287.921631: block_getrq: 179,0 WS 2794 + 1 [dd]
              dd-955   [000] ....  3287.921661: block_plug: [dd]
              dd-955   [000] ....  3287.921661: block_bio_remap: 179,0 WS 18010 + 1 <- (179,1) 17978
              dd-955   [000] ....  3287.921661: block_bio_queue: 179,0 WS 18010 + 1 [dd]
              dd-955   [000] ....  3287.921661: block_getrq: 179,0 WS 18010 + 1 [dd]
              dd-955   [000] ....  3287.921661: block_bio_remap: 179,0 WS 2792 + 1 <- (179,1) 2760
              dd-955   [000] ....  3287.921661: block_bio_queue: 179,0 WS 2792 + 1 [dd]
              dd-955   [000] ....  3287.921661: block_getrq: 179,0 WS 2792 + 1 [dd]
              dd-955   [000] ....  3287.921661: block_bio_remap: 179,0 WS 18008 + 1 <- (179,1) 17976
              dd-955   [000] ....  3287.921661: block_bio_queue: 179,0 WS 18008 + 1 [dd]
              dd-955   [000] ....  3287.921692: block_getrq: 179,0 WS 18008 + 1 [dd]
              dd-955   [000] ....  3287.921692: block_bio_remap: 179,0 WS 2797 + 1 <- (179,1) 2765
              dd-955   [000] ....  3287.921692: block_bio_queue: 179,0 WS 2797 + 1 [dd]
              dd-955   [000] ....  3287.921692: block_getrq: 179,0 WS 2797 + 1 [dd]
              dd-955   [000] ....  3287.921692: block_bio_remap: 179,0 WS 18013 + 1 <- (179,1) 17981
              dd-955   [000] ....  3287.921692: block_bio_queue: 179,0 WS 18013 + 1 [dd]
5. Tuning Attempts
Filesystem performance tuning is mainly done through the nodes under two directories: /proc/sys/vm and /sys/block/sda/queue.
5.1 Changing the I/O scheduler
echo cfq > /sys/block/sda/queue/scheduler
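The available schedulers can be listed first; the current one is shown in brackets:
cat /sys/block/sda/queue/scheduler
# noop deadline [cfq]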
5.2 Tuning disk-related kernel parameters
/sys/block/sda/queue/scheduler: I/O scheduler selection (see 5.1).
/sys/block/sda/queue/nr_requests: disk queue depth. The default is only 128 requests; it can be raised to 512. That consumes more memory, but allows more read/write operations to be merged: individual requests get slower, while a larger volume can be read or written overall.
/sys/block/sda/queue/iosched/antic_expire: anticipation timeout, i.e. how long to wait for a new request near the previous read (a tunable of the anticipatory I/O scheduler).
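A quick sketch of checking and raising the queue depth as suggested above (note that antic_expire only exists under the anticipatory scheduler, which was removed in kernel 2.6.33):
cat /sys/block/sda/queue/nr_requests                  # default: 128
echo 512 | sudo tee /sys/block/sda/queue/nr_requests  # allow more merging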
/proc/sys/vm/dirty_ratio
This parameter controls the size of the filesystem write buffer as a percentage of system memory: once dirty data reaches this share of memory, processes start writing it out to disk. Increasing it devotes more memory to write buffering and can greatly improve write performance; but for sustained, constant writes, the value should be lowered. The default is typically 10. To increase it: echo 40 > /proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_background_ratio
This parameter controls when the pdflush threads flush to disk, again as a percentage of system memory: once dirty data reaches this share, pdflush begins writing it out. Increasing it uses more memory for write buffering and can greatly improve write performance; for sustained, constant writes, lower it. The default is typically 5. To increase it: echo 20 > /proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_writeback_centisecs
This parameter controls the wakeup interval of the kernel's dirty-data flush process, pdflush, in units of 1/100 s. The default is 500, i.e. 5 seconds. If the system writes continuously, lowering this value works better, flattening write peaks into more frequent, smaller writebacks: echo 200 > /proc/sys/vm/dirty_writeback_centisecs. If writes come in short bursts, each of modest size (tens of MB), and memory is plentiful, increase it instead: echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs
This parameter declares how "old" data in the kernel write buffer must be before pdflush considers writing it to disk, in units of 1/100 s. The default is 30000: data 30 seconds old is considered stale and will be flushed. For especially write-heavy loads, shrinking it a little helps, but not too much, since shrinking it too far makes I/O ramp up too quickly. A suggested value is 1500, i.e. data counts as old after 15 seconds: echo 1500 > /proc/sys/vm/dirty_expire_centisecs. Of course, if memory is large and writes are intermittent and small (say tens of MB each time), a larger value is better.
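Putting the example values above into one script (a sketch only; whether each value should be raised or lowered depends on your write pattern, as discussed):
#!/bin/bash
echo 40   > /proc/sys/vm/dirty_ratio                # larger write buffer
echo 20   > /proc/sys/vm/dirty_background_ratio     # background flush threshold
echo 200  > /proc/sys/vm/dirty_writeback_centisecs  # wake the flusher every 2 s
echo 1500 > /proc/sys/vm/dirty_expire_centisecs     # dirty data counts as old after 15 s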