Coursera Series: R Programming (Johns Hopkins University), Programming Assignment 3


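After a month of on-and-off study, this R course is almost over. Entering Week 4, the assignments felt harder and harder for a beginner like me. Fortunately, after several days of trial and error, I finally solved the whole thing.

This week's Assignment 3 works with a file from the U.S. Department of Health and Human Services called "outcome-of-care-measures.csv". It stores mortality rates for several common conditions at more than 4,000 hospitals across the U.S. states. Specifically, it covers 30-day mortality and readmission rates for heart attacks, heart failure, and pneumonia. Our task is to rank the hospitals within a state, or in every state, by the mortality rate for a given condition, so as to find the best hospital, the worst hospital, and the hospital ranked Nth.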

Task 1

Finding the best hospital in a state

Write a function called best that takes two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of "heart attack", "heart failure", or "pneumonia". Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties. If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in that set should be chosen (i.e. if hospitals "b", "c", and "f" are tied for best, then hospital "b" should be returned).

The function should use the following template.

best <- function(state, outcome) {
    ## Read outcome data
    ## Check that state and outcome are valid
    ## Return hospital name in that state with lowest 30-day death
    ## rate
}

The function should check the validity of its arguments. If an invalid state value is passed to best, the function should throw an error via the stop function with the exact message "invalid state". If an invalid outcome value is passed to best, the function should throw an error via the stop function with the exact message "invalid outcome".

Here is some sample output from the function.

> source("best.R")

> best("TX", "heart attack")

[1] "CYPRESS FAIRBANKS MEDICAL CENTER"

> best("TX", "heart failure")

[1] "FORT DUNCAN MEDICAL CENTER"

> best("MD", "heart attack")

[1] "JOHNS HOPKINS HOSPITAL, THE"

> best("MD", "pneumonia")

[1] "GREATER BALTIMORE MEDICAL CENTER"

> best("BB", "heart attack")

Error in best("BB", "heart attack") : invalid state

> best("NY", "hert attack")

Error in best("NY", "hert attack") : invalid outcome

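The first function is called best. Given a state and a disease, it returns the name of the best hospital in that state for treating that disease. As the assignment explains, "best" means the lowest 30-day mortality rate for that disease. If several hospitals are tied for the lowest rate, they are ranked alphabetically by name, and the one that comes first is returned.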

best <- function(state, outcome) {
    ## Read outcome data.
    data <- read.csv("outcome-of-care-measures.csv")

    ## Check the state: A is a logical vector, so sum(A) is 0
    ## when no row of data$State matches.
    A <- data$State == state
    if (!sum(A)) {
        stop("invalid state")
    }

    ## Check that outcome is one of the three supported diseases.
    disease_list <- c("heart attack", "heart failure", "pneumonia")
    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    ## Create the sub-data.frame for the specific state.
    StateData <- subset(data, State == state)

    ## Extract the hospital and rate columns from the data.
    if (outcome == "heart attack") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    ## Give the rate column a common name whatever the disease is.
    colnames(StateData) <- c("Hospital.Name", "Disease.Rate")

    ## Convert the rate column from Factor to numeric, and the name
    ## column to character, so that order() sorts by value.
    ## Entries that are not numbers become NA (with a coercion warning).
    StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
    StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])

    ## Order by rate; ties are broken alphabetically by hospital name.
    StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

    ## The first row is the best hospital.
    StateData[1, "Hospital.Name"]
}

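First, the data file is read into data. Then comes the check that state and outcome are valid. Here I use A <- data$State == state to build a logical vector. If state is not any state in the file, then A = FALSE, FALSE, …, FALSE and sum(A) = 0; if state is one of them, A contains one or more TRUE values and the sum is nonzero. I picked up this trick from the course slides, and I suspect it will come in handy again.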

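For outcome I could have done the same, but thanks to the many bloggers on cnblogs (博客園) I learned another trick. R has a very simple way to test whether an element a is in a vector or list A: a %in% A, which returns a logical value.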

The results of these two checks then go through an if test that decides whether to call stop().
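The two checks can be tried on a made-up data frame. This is only a minimal sketch: the `toy` data and the `check_args` helper are hypothetical names introduced for illustration and are not part of the assignment.

```r
# Toy stand-in for the real CSV; only the State column matters here.
toy <- data.frame(State = c("TX", "TX", "MD"), stringsAsFactors = FALSE)

# Hypothetical helper combining both validity checks.
check_args <- function(data, state, outcome) {
    A <- data$State == state             # logical vector, one entry per row
    if (!sum(A)) stop("invalid state")   # sum(A) counts the TRUE values
    if (!outcome %in% c("heart attack", "heart failure", "pneumonia")) {
        stop("invalid outcome")
    }
    invisible(TRUE)
}

n_tx <- sum(toy$State == "TX")    # 2
n_bb <- sum(toy$State == "BB")    # 0

# tryCatch() captures the error message without halting the script.
msg_state <- tryCatch(check_args(toy, "BB", "heart attack"),
                      error = function(e) conditionMessage(e))
msg_outcome <- tryCatch(check_args(toy, "TX", "hert attack"),
                        error = function(e) conditionMessage(e))
msg_state      # "invalid state"
msg_outcome    # "invalid outcome"
```

The state check could equally be written as state %in% data$State; the sum-of-logicals form is kept here because it is the one the post describes.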

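Next, subset() extracts the rows for the given state. Then, depending on outcome, I find the column for that disease and use StateData <- StateData[c("Hospital.Name", "Hospital.30.Day…")] to keep only two columns: the hospital name and the mortality rate for the disease.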
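The subset-then-select step can be sketched on a small hypothetical table. The sketch also shows one possible alternative to the if/else chain: a named character vector that maps each outcome to its column. The column names `Rate.HA`, `Rate.HF` and `Rate.PN` are shortened placeholders, not the real headers of the CSV file.

```r
# Hypothetical miniature of the data set (placeholder column names).
toy <- data.frame(
    Hospital.Name = c("A HOSP", "B HOSP", "C HOSP"),
    State         = c("TX", "MD", "TX"),
    Rate.HA       = c("14.2", "13.1", "Not Available"),
    Rate.HF       = c("11.0", "10.4", "9.8"),
    Rate.PN       = c("10.1", "7.4", "12.6"),
    stringsAsFactors = FALSE
)

# Named vector: outcome name -> column name.
rate_col <- c("heart attack"  = "Rate.HA",
              "heart failure" = "Rate.HF",
              "pneumonia"     = "Rate.PN")

outcome <- "heart failure"
state <- "TX"

# Keep the rows of one state, then only the two columns of interest.
StateData <- subset(toy, State == state)
StateData <- StateData[c("Hospital.Name", rate_col[[outcome]])]
colnames(StateData) <- c("Hospital.Name", "Disease.Rate")
StateData
```

With a lookup vector, the validity check and the column choice share one definition, since outcome %in% names(rate_col) tests the same set of names.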

The order() function

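The core of the task is sorting. R offers sort, rank and order, among others. sort(x) sorts the vector x and returns the sorted values. rank() returns the rank of each element of the vector. order() returns, for each rank in turn, the position in the vector of the element holding that rank.
A short piece of R gives a feel for it: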

> w <- c(97, 93, 85, 85, 32, NA, 99)

> w

[1] 97 93 85 85 32 NA 99

> order(w)

[1] 5 3 4 2 1 7 6

So by default order() puts NA last, as if it were the largest value.
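For comparison, here is a small sketch that runs all three functions on the same vector w. The na.last argument of order() is standard base R; the variable names are just for illustration.

```r
w <- c(97, 93, 85, 85, 32, NA, 99)

s <- sort(w)                        # sorted values; NA is dropped by default
r <- rank(w)                        # rank of each element; ties share the average rank
o <- order(w)                       # positions that would sort w; NA goes last
o_no_na <- order(w, na.last = NA)   # na.last = NA removes the NA position

s         # 32 85 85 93 97 99
r         # 5.0 4.0 2.5 2.5 1.0 NA 6.0
o         # 5 3 4 2 1 7 6
o_no_na   # 5 3 4 2 1 7
w[o]      # 32 85 85 93 97 99 NA
```

Indexing with the result of order(), as in w[o], is what turns the positions back into sorted values; the same idea carries over to the rows of a data frame.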

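Since order() returns the positions of the elements in sorted order, the pattern A[order(A$a), ] sorts the data frame A by its column a. order() can also take several columns at once, as in A[order(A$a, A$b, …), ]: it sorts by the first key, a, and uses the later keys only to break ties. An example: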

> x <- data.frame(foo = 1:8, State = c('TX','TX','TX','NY','NY','NY','CA','CA'), Country = c('a','a','b','e','e','f','m','n'), Site = c(1,6,1,1,3,1,8,5))

> x

  foo State Country Site

1   1    TX       a    1

2   2    TX       a    6

3   3    TX       b    1

4   4    NY       e    1

5   5    NY       e    3

6   6    NY       f    1

7   7    CA       m    8

8   8    CA       n    5

> x[order(x$Site),]

  foo State Country Site

1   1    TX       a    1

3   3    TX       b    1

4   4    NY       e    1

6   6    NY       f    1

5   5    NY       e    3

8   8    CA       n    5

2   2    TX       a    6

7   7    CA       m    8

> x[order(x$Site, x$Country),]

  foo State Country Site

1   1    TX       a    1

3   3    TX       b    1

4   4    NY       e    1

6   6    NY       f    1

5   5    NY       e    3

8   8    CA       n    5

2   2    TX       a    6

7   7    CA       m    8
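In the example above the second key happens not to change the result, because the Country values are already in order within each group of tied Site values. A minimal sketch with the same data frame x, using keys where the tie-break is visible (the choice of State and Site as keys is mine, for illustration):

```r
x <- data.frame(foo = 1:8,
                State = c("TX", "TX", "TX", "NY", "NY", "NY", "CA", "CA"),
                Country = c("a", "a", "b", "e", "e", "f", "m", "n"),
                Site = c(1, 6, 1, 1, 3, 1, 8, 5),
                stringsAsFactors = FALSE)

# Primary key State (alphabetical); ties broken by Site (ascending).
by_state_site <- x[order(x$State, x$Site), ]
by_state_site$foo        # 8 7 4 6 5 1 3 2

# A minus sign on a numeric key reverses that key only.
by_state_site_desc <- x[order(x$State, -x$Site), ]
by_state_site_desc$foo   # 7 8 5 4 6 2 1 3
```

Rows that are still tied after all keys (for example rows 4 and 6, both NY with Site 1) keep their original relative order, because order() is a stable sort.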

So, to sort by the Disease.Rate column while keeping hospitals in alphabetical order within ties, one way is to call order() like this:

StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

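But then a new problem appeared: sorting the data directly this way gives a wrong result. For example, take the state MD and the disease pneumonia. Running the code up to the sorted StateData gives the following.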

StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ] 
> StateData
                                 Hospital.Name Disease.Rate
1897                 CALVERT MEMORIAL HOSPITAL         10.1
1902            HOWARD COUNTY GENERAL HOSPITAL         10.1
1875               JOHNS HOPKINS HOSPITAL, THE         10.2
1906            LAUREL REGIONAL MEDICAL CENTER         10.6
1895               MEMORIAL HOSPITAL AT EASTON         10.6
1883         PENINSULA REGIONAL MEDICAL CENTER         10.6
1889      JOHNS HOPKINS BAYVIEW MEDICAL CENTER         10.7
1910                 ATLANTIC GENERAL HOSPITAL         10.8
1896                MARYLAND GENERAL  HOSPITAL         10.8
1904              DOCTORS'  COMMUNITY HOSPITAL         11.0
1909                  FORT WASHINGTON HOSPITAL         11.0
1880             WASHINGTON ADVENTIST HOSPITAL         11.0
1876                      SAINT AGNES HOSPITAL         11.1
1890             CHESTER RIVER HOSPITAL CENTER         11.2
1874                  MERCY MEDICAL CENTER INC         11.2
1886           MEDSTAR UNION MEMORIAL HOSPITAL         11.3
1872                 HARFORD MEMORIAL HOSPITAL         11.5
1908            SHADY GROVE ADVENTIST HOSPITAL         11.7
1905         SOUTHERN MARYLAND HOSPITAL CENTER         11.7
1885               ANNE ARUNDEL MEDICAL CENTER         12.0
1867                    MERITUS MEDICAL CENTER         12.5
1898                 NORTHWEST HOSPITAL CENTER         12.6
1911 VA MARYLAND HEALTHCARE SYSTEM - BALTIMORE         12.6
1887  WESTERN MARYLAND REGIONAL MEDICAL CENTER         12.6
1899      BALTIMORE WASHINGTON  MEDICAL CENTER         12.7
1868     UNIVERSITY OF MARYLAND MEDICAL CENTER         12.7
1901         EDWARD MCCREADY MEMORIAL HOSPITAL         12.9
1903           UPPER CHESAPEAKE MEDICAL CENTER         12.9
1869            PRINCE GEORGES HOSPITAL CENTER         13.0
1888             MEDSTAR SAINT MARY'S HOSPITAL         13.1
1881          GARRETT COUNTY MEMORIAL HOSPITAL         13.5
1894                    CIVISTA MEDICAL CENTER         14.2
1900          GREATER BALTIMORE MEDICAL CENTER          7.4
1907           MEDSTAR GOOD SAMARITAN HOSPITAL          8.4
1893                   MEDSTAR HARBOR HOSPITAL          9.2
1879    MEDSTAR FRANKLIN SQUARE MEDICAL CENTER          9.3
1882         MEDSTAR MONTGOMERY MEDICAL CENTER          9.3
1873               SAINT JOSEPH MEDICAL CENTER          9.5
1878                      BON SECOURS HOSPITAL          9.6
1870                       HOLY CROSS HOSPITAL          9.6
1892                   CARROLL HOSPITAL CENTER          9.7
1877               SINAI HOSPITAL OF BALTIMORE          9.7
1871               FREDERICK MEMORIAL HOSPITAL          9.8
1884                         SUBURBAN HOSPITAL          9.9
1891            UNION HOSPITAL OF CECIL COUNTY          9.9

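Interestingly, GREATER BALTIMORE MEDICAL CENTER in row 1900 has the lowest mortality rate, 7.4, and should come first. Yet order() placed the rates of 10 and above first and sorted the rates below 10 separately after them, so 7.4 was pushed far down the list. The cause turned out to be that StateData$Disease.Rate is a Factor, not a numeric vector. A factor's levels are sorted as text, where "10.1" comes before "7.4", and order() follows the level order, which is why the sort came out wrong. Likewise, StateData$Hospital.Name is not character but also a Factor. The fix is to convert both: StateData$Disease.Rate from Factor to numeric, and StateData$Hospital.Name from Factor to character.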
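The effect is easy to reproduce with four made-up rates stored as a factor. This is a minimal sketch; the values are chosen for illustration and are not read from the data file.

```r
# Rates stored as a factor, the way text columns arrived from read.csv()
# by default before R 4.0.
rates_f <- factor(c("10.1", "7.4", "9.9", "12.6"))

lev <- levels(rates_f)           # "10.1" "12.6" "7.4" "9.9"  (sorted as text)
codes <- as.integer(rates_f)     # 1 3 4 2  (level codes, which order() uses)

ord_factor <- order(rates_f)
as.character(rates_f)[ord_factor]   # "10.1" "12.6" "7.4" "9.9"

# Converting to numeric first gives the intended ordering.
rates_n <- as.numeric(as.character(rates_f))
ord_numeric <- order(rates_n)
rates_n[ord_numeric]             # 7.4 9.9 10.1 12.6
```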

Here is a small example.

> x

  foo State Country Site

1   1    TX       a    1

2   2    TX       a    6

3   3    TX       b    1

4   4    NY       e    1

5   5    NY       e    3

6   6    NY       f    1

7   7    CA       m    8

8   8    CA       n    5

> str(x$State)

 Factor w/ 3 levels "CA","NY","TX": 3 3 3 2 2 2 1 1

> str(as.character(x$State))

 chr [1:8] "TX" "TX" "TX" "NY" "NY" "NY" "CA" "CA"

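I made up the data frame x myself for practice. Sure enough, x$State shows up as a Factor and needs to be converted to character. (This is the behavior of R versions before 4.0, where data.frame() and read.csv() turned strings into factors by default.)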

Note that to convert a Factor to numeric you must first convert it to character, and then convert the character values to numeric.
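A short sketch of why the detour through character is needed. The factor f below is made up for illustration; the "Not Available" entry stands in for a hospital without data.

```r
f <- factor(c("7.4", "10.1", "9.9", "Not Available"))

# Direct conversion returns the internal level codes, not the rates.
codes <- as.numeric(f)                                     # 2 1 3 4

# Going through character recovers the values; text that is not a
# number becomes NA (R warns about the coercion, silenced here).
values <- suppressWarnings(as.numeric(as.character(f)))    # 7.4 10.1 9.9 NA

# An equivalent idiom that converts each level only once.
values2 <- suppressWarnings(as.numeric(levels(f))[f])
```

Both idioms give the same result; the second one is the form suggested in R's own help page for factor and can be a little faster on long columns.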

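So before applying order(), always make sure the column has the type you actually want to sort by.

That is why I added the following two lines before the sort.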

StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))

StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])

Finally, returning the first hospital gives us the best hospital we are looking for.
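A possible alternative to converting after the fact, sketched on inline text rather than the real file: ask read.csv() to keep strings as character and to read a given marker as missing, so that the rate column arrives as numeric. The three-column layout and the "Not Available" marker used below are assumptions made for this sketch.

```r
# Inline stand-in for the CSV file (simplified, hypothetical layout).
csv_text <- "Hospital.Name,State,Rate
A HOSP,TX,10.1
B HOSP,TX,Not Available
C HOSP,TX,7.4"

toy <- read.csv(text = csv_text,
                stringsAsFactors = FALSE,       # keep names as character
                na.strings = "Not Available")   # read this marker as NA

str(toy$Rate)   # numeric column with one NA

# No conversion needed: order() now sorts by value, with NA last.
toy <- toy[order(toy$Rate, toy$Hospital.Name), ]
best_name <- toy[1, "Hospital.Name"]
best_name       # "C HOSP"
```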

Task 2

Ranking hospitals by outcome in a state

Write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num). The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the ranking specified by the num argument. For example, the call
rankhospital("MD", "heart failure", 5)
would return a character vector containing the name of the hospital with the 5th lowest 30-day death rate for heart failure.


Here is some sample output from the function.

> source("rankhospital.R")
> rankhospital("TX", "heart failure", 4)
[1] "DETAR HOSPITAL NAVARRO"
> rankhospital("MD", "heart attack", "worst")
[1] "HARFORD MEMORIAL HOSPITAL"
> rankhospital("MN", "heart attack", 5000)
[1] NA

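The second function, rankhospital, asks for more than the first: besides the state and the disease, the input also includes a rank, num. The function must return the name of the hospital ranked num for that disease in that state. If the rank exceeds the number of hospitals, it returns NA. If hospitals have the same mortality rate for the disease, they are ranked alphabetically, and the one that comes first alphabetically is ranked ahead.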

rankhospital <- function(state, outcome, num = "best") {
    data <- read.csv("outcome-of-care-measures.csv")

    A <- data$State == state
    if (!sum(A)) {
        stop("invalid state")
    }

    disease_list <- c("heart attack", "heart failure", "pneumonia")
    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    StateData <- subset(data, State == state)

    if (outcome == "heart attack") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    colnames(StateData) <- c("Hospital.Name", "Disease.Rate")
    StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
    StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])
    StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

    ## Count the hospitals that have data. NA rates were sorted last,
    ## so keeping the first N rows drops the hospitals without data;
    ## any num larger than N then returns NA.
    N <- sum(!is.na(StateData$Disease.Rate))
    StateData <- StateData[seq_len(N), ]

    ## Specify the exact value for the num input.
    if (num == "best") {
        num <- 1
    } else if (num == "worst") {
        num <- max(N, 1)   # max() keeps the index valid when N is 0
    }

    StateData[num, "Hospital.Name"]
}

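Compared with best, the extra part here is the handling of num. num can mean first place, last place, or any particular rank. Note that "worst" does not correspond to the last row among all hospitals in the state, because quite a few hospitals report no mortality rate, that is, NA. Judging from the sample output in the assignment, hospitals with NA do not take part in the ranking, so "worst" corresponds to the last hospital that has data. That means we need to count how many values in Disease.Rate are not NA.

Here I compute it with N <- sum(!is.na(StateData$Disease.Rate)), which is simple and direct.

As for num exceeding the number of hospitals, no special handling is needed, because R returns NA automatically when the requested row does not exist. This is only safe if the rows with NA rates are removed before indexing; otherwise a num between N + 1 and the total number of rows would return a hospital that has no data.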
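The handling of num can be sketched on a toy state table in which one hospital has no data. The `pick` helper and the hospital names are hypothetical, introduced only for this illustration.

```r
# Toy state table, rates already numeric; D HOSP has no data.
StateData <- data.frame(
    Hospital.Name = c("D HOSP", "A HOSP", "C HOSP", "B HOSP"),
    Disease.Rate  = c(NA, 9.9, 12.6, 9.9),
    stringsAsFactors = FALSE
)
StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

# Hypothetical helper: turn "best", "worst" or a number into a hospital name.
pick <- function(df, num) {
    N <- sum(!is.na(df$Disease.Rate))   # hospitals that have data
    df <- df[seq_len(N), ]              # NA rows were sorted last; drop them
    if (identical(num, "best")) num <- 1
    if (identical(num, "worst")) num <- N
    df[num, "Hospital.Name"]            # NA when num > N
}

pick(StateData, "best")          # "A HOSP" (tie at 9.9 broken alphabetically)
pick(StateData, "worst")         # "C HOSP"
pick(StateData, 4)               # NA, although the table has 4 rows
StateData[4, "Hospital.Name"]    # "D HOSP": what plain indexing returns
```

The last two lines show the difference: without dropping the NA rows, rank 4 would name a hospital that was never ranked.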

Task 3

Ranking hospitals in all states

Write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num. For example the function call rankall("heart attack", "best") would return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates. The function should return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

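The third task is the rankall function. It is not tied to any particular state: given a disease and a rank, it must return a data frame holding, for every state, the name of the hospital with that rank for that disease.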

rankall <- function(outcome, num = "best") {
    data <- read.csv("outcome-of-care-measures.csv")

    disease_list <- c("heart attack", "heart failure", "pneumonia")
    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    ## Extract the hospital, state and rate columns from the data.
    if (outcome == "heart attack") {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    ## Give the rate column a common name whatever the disease is.
    colnames(data)[3] <- "Disease.Rate"

    ## Convert the rate column from Factor to numeric for ordering,
    ## and the hospital column to character.
    data[, "Disease.Rate"] <- as.numeric(as.character(data[, "Disease.Rate"]))
    data[, "Hospital.Name"] <- as.character(data[, "Hospital.Name"])

    ## Build a vector of all state names, ordered alphabetically.
    Statelist <- as.character(unique(data$State))
    Statelist <- sort(Statelist)

    Final <- data.frame()

    for (i in seq_along(Statelist)) {
        ## Create the sub-data.frame for the specific state.
        StateData <- subset(data, State == Statelist[i])

        ## Order by disease rate, then by hospital name.
        StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

        ## Keep only the N hospitals that have data (NA rates sort last).
        N <- sum(!is.na(StateData$Disease.Rate))
        StateData <- StateData[seq_len(N), ]

        ## Resolve num into a row index without overwriting num itself,
        ## so that "worst" is recomputed for every state.
        if (num == "best") {
            idx <- 1
        } else if (num == "worst") {
            idx <- max(N, 1)   # max() keeps the index valid when N is 0
        } else {
            idx <- num
        }

        ## Create one row of the final data.frame.
        Hospital <- StateData[idx, "Hospital.Name"]
        tmp <- data.frame(Hospital, Statelist[i])
        colnames(tmp) <- c("hospital", "state")
        Final <- rbind(Final, tmp)
    }

    Final
}

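This function has to return a data.frame rather than just a hospital name. So the data for each state must be processed separately, in the same way as in the previous function, and the state and the corresponding hospital name are then collected into the data.frame. The per-state processing is exactly the same as before; the only difference is that a loop is needed to go through every state. Here I build a vector holding each state name exactly once, using Statelist <- as.character(unique(data$State)). unique() removes repeated elements and keeps only the distinct ones, but watch the type: the result may still need converting afterwards. The loop then simply walks through the elements of Statelist.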
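As an alternative to the explicit loop, the per-state work can be expressed with split() and vapply(). This is only a sketch under stated assumptions: the toy table and the `rank_in_state` helper are hypothetical, and the rates are taken to be numeric already.

```r
# Hypothetical miniature data set; D HOSP has no data.
data <- data.frame(
    Hospital.Name = c("A HOSP", "B HOSP", "C HOSP", "D HOSP", "E HOSP"),
    State         = c("TX", "MD", "TX", "MD", "TX"),
    Disease.Rate  = c(10.1, 7.4, 9.9, NA, 12.6),
    stringsAsFactors = FALSE
)

# Hypothetical helper: name of the hospital ranked num within one state.
rank_in_state <- function(df, num) {
    df <- df[!is.na(df$Disease.Rate), ]                  # ranked hospitals only
    if (nrow(df) == 0) return(NA_character_)
    df <- df[order(df$Disease.Rate, df$Hospital.Name), ]
    idx <- if (identical(num, "best")) {
        1
    } else if (identical(num, "worst")) {
        nrow(df)
    } else {
        num
    }
    df[idx, "Hospital.Name"]                             # NA when idx > nrow(df)
}

by_state <- split(data, data$State)   # list of data frames, names sorted
hospitals <- vapply(by_state, rank_in_state, character(1), num = "worst")
result <- data.frame(hospital = unname(hospitals),
                     state = names(hospitals),
                     stringsAsFactors = FALSE)
result
```

Because num is an argument of the helper, each state resolves "worst" against its own count, and the data frame is built once instead of being grown with rbind() on every pass.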

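With the three functions written, most of the test cases come out right. But the "worst" case sometimes goes wrong. Checking the details, I found that R sometimes reports an error when it evaluates else if (num == "worst") in my code: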

> if(num == “best”){num<-1} else if(num == "worst"){num <- N} else {}

Error: unexpected input in "if(num == ?

I hope to get to the bottom of this later.
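Two likely causes can be reproduced in a few lines. First, the console line above wraps best in curly quotes (“ ”), which R's parser does not accept as string delimiters; that matches the "unexpected input" message. Second, inside a loop over states, reassigning num itself means the test num == "worst" can only succeed on the first pass. The sketch below uses made-up counts in `N_by_state`; it illustrates both pitfalls and is not a diagnosis of the original session.

```r
# 1. Curly quotes are not string delimiters, so parsing fails.
curly <- "num == \u201cbest\u201d"      # the text  num == “best”
parsed <- tryCatch(parse(text = curly), error = function(e) "parse error")
parsed                                  # "parse error"

straight <- tryCatch(parse(text = 'num == "best"'),
                     error = function(e) "parse error")
is.expression(straight)                 # TRUE

# 2. Reassigning num inside a loop: after the first pass num is a number,
#    so num == "worst" is never TRUE again.
N_by_state <- c(AK = 3, AL = 10)        # made-up counts of ranked hospitals
num <- "worst"
used_buggy <- numeric(0)
for (s in names(N_by_state)) {
    if (num == "worst") num <- N_by_state[[s]]
    used_buggy <- c(used_buggy, num)
}
used_buggy                              # 3 3  (AL wrongly reuses AK's count)

# Fix: leave num alone and resolve it into a fresh variable on each pass.
num <- "worst"
used_fixed <- numeric(0)
for (s in names(N_by_state)) {
    idx <- if (num == "worst") N_by_state[[s]] else num
    used_fixed <- c(used_fixed, idx)
}
used_fixed                              # 3 10
```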

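Finally, I have to complain a little: the user experience of Coursera's redesigned interface is honestly awful.

All in all, I went from "let's give it a try" to finding that R is really fun. This is also my first post on cnblogs (博客園). I hope to keep blogging like the experts here do, and to keep learning R.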

I am learning

April 26, 2016, Houston, USA

