Outliers: Detection Principles and Handling Methods

Reposted article. 2019-04-11 10:29. Tags: outlier detection / outliers / outlier / outlier handling / abnormal data

Before we tackle how to handle them, let’s quickly define what an outlier is. An outlier is any data point that is distinctly different from the rest of your data points. When you’re looking at a variable that is relatively normally distributed, you can think of outliers as anything that falls 3 or more standard deviations from its mean. While this will suffice as a working definition, keep in mind that there’s no golden rule for defining what an outlier is.
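
The three-standard-deviation working definition above can be sketched in a few lines of Python. This is only an illustration: the function name `zscore_outliers` and the order values are made up for the example, and the rule assumes the variable is roughly normally distributed.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying `threshold` or more standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) >= threshold * sd]

# Hypothetical order values: 29 typical orders and one very large one.
orders = [95, 100, 105] * 9 + [98, 102, 5000]
print(zscore_outliers(orders))  # [5000]
```

Note that the outlier itself inflates the mean and the standard deviation, so in very small samples a single extreme value may never reach the 3 SD mark.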

In general, outliers belong to one of two categories: a mistake in the data or a true outlier. The first type, a mistake in the data, could be as simple as typing 10000 rather than 100.00 – resulting in a big shift as we’re analyzing the data later on. The second type, a true outlier, would be something like finding Bill Gates in your dataset. His profile probably looks so different from the other people in your list that including him might skew your results. It’s important to distinguish these types because we’ll handle them differently in an analysis. It’s subjective. It’s up to you as the analyst to determine which data points are outliers in any given dataset.

Now, how do we deal with outliers? Here are four approaches:

1. Drop the outlier records.

In the case of Bill Gates, or another true outlier, sometimes it’s best to completely remove that record from your dataset to keep that person or event from skewing your analysis.

2. Cap your outlier data.

Another way to handle true outliers is to cap them. For example, if you’re using income, you might find that people above a certain income level behave in the same way as those with a lower income. In this case, you can cap the income value at a level that keeps that intact.

3. Assign a new value.

If an outlier seems to be due to a mistake in your data, you can try imputing a value. Common imputation methods include using the mean of a variable or utilizing a regression model to predict the missing value.

4. Try a transformation.

A different approach to true outliers could be to try creating a transformation of the data rather than using the data itself. For example, try creating a percentile version of your original field and working with that new field instead.
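
As a rough sketch, the four approaches might look like this in Python. The income figures and the `CAP` cutoff are hypothetical; in a real analysis the cutoff would come from domain knowledge, and imputation might use a regression model rather than the mean.

```python
import statistics

# Hypothetical incomes; the last record is the "Bill Gates" of this list.
incomes = [32_000, 41_000, 38_500, 45_000, 52_000, 47_500, 39_000, 2_000_000]
CAP = 150_000  # assumed cutoff, chosen only for illustration

# 1. Drop: remove records above the cutoff.
dropped = [x for x in incomes if x <= CAP]

# 2. Cap: keep every record but limit its value to the cutoff.
capped = [min(x, CAP) for x in incomes]

# 3. Assign a new value: replace the suspect entry with the mean of the rest.
typical_mean = statistics.fmean(dropped)
imputed = [x if x <= CAP else typical_mean for x in incomes]

# 4. Transform: work with percentile ranks instead of raw values.
def percentile_ranks(values):
    """Percentage of values less than or equal to each value."""
    n = len(values)
    return [100 * sum(v <= x for v in values) / n for x in values]

ranks = percentile_ranks(incomes)
print(ranks[-1])  # 100.0: the extreme income is just "the top rank"
```

After the percentile transformation the extreme record is still the largest, but it no longer sits orders of magnitude away from everyone else.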

Just how much an outlier affects your analysis depends, not surprisingly, on a few factors. One factor is dataset size. In a large dataset, each individual point carries less weight, so an outlier is less worrisome than the same data point would be in a smaller dataset. Another consideration is “how much” of an outlier a point might be – just how far out of line with the rest of your dataset a single point is. A point that is ten times as large as your upper boundary will do more damage than a point that is twice as large.

These are a few ways that we’ve found to help with outliers, but there are certainly others. I’d love to know – what has your experience been with outliers? Do you use any of the above methods?

-Caitlin Garrett, Statistical Analyst

https://conversionxl.com/blog/outliers/

One thing many people forget when dealing with data: outliers.

Even in a controlled online A/B test, your dataset may be skewed by extreme values. How do you deal with them? Do you trim them out, or is there some other way?

How do you even detect the presence of outliers and how extreme they are?

Especially if you’re optimizing your site for revenue, you should care about outliers. This post will dive into the nature of outliers in general, how to detect them, and then some popular methods for dealing with them.

What Are Outliers?

First, what exactly are outliers?

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

There is, of course, a degree of ambiguity here. Qualifying a data point as an anomaly leaves it up to the analyst or model to determine what is actually abnormal and what to do with such data points.

There are also different degrees of outliers:

Mild outliers lie beyond an inner fence on either side
Extreme outliers are beyond an outer fence

Why do outliers occur? According to Tom Bodenberg, chief economist and data consultant at market research firm Unity Marketing, “It can be the result of measurement or recording errors, or the unintended and truthful outcome resulting from the set’s definition.”

Outliers could contain valuable information, they could be meaningless aberrations caused by measurement and recording errors, or they could cause problems with repeatable A/B test results. So it’s important to question and analyze outliers in any case to see what their actual meaning is.

Why are they occurring, where, and what might the meaning be? The answer could differ from business to business, but it’s important to have the conversation rather than to ignore the data regardless of the significance.

The real question, though, is how do outliers affect your testing efforts?

How Outliers Affect A/B Testing

Though outliers will show up in many analysis situations, for the sake of conversion optimization, you should mostly be concerned about tests where you’re optimizing for revenue metrics like Average Order Value or Revenue Per Visitor.

You can easily imagine anecdotally how outliers could affect a single A/B test result. If not, here’s Taylor Wilson, Senior Optimization Analyst at Brooks Bell, explaining a few scenarios in which that could happen:

Taylor Wilson:

“In this particular situation, resellers were the culprit—customers who buy in bulk with the intention of reselling items later. Far from your typical customer, they place unusually large orders, paying little attention to the experience they’re in.

It’s not just resellers who won’t be truly affected by your tests. Depending on your industry, it could be very loyal customers, in-store employees who order off the site, or another group that exhibits out-of-the-ordinary behavior.”

Especially in data sets with low sample sizes, outliers can mess up your whole day.

As Dr. Julia Engelmann, Head of Data Analytics at Konversionkraft, mentioned in a CXL blog post, “Almost every online shop has them and usually they cause problems for the valid evaluation of a test: the bulk orderers.”

So this isn’t a rare, fringe problem. She shared this specific example of how including and excluding outliers can affect the results of a test, and ultimately the decision you make:

One problem outliers can cause in A/B tests, HiConversion noted, is that outliers tend not to be affected by the smaller UI changes that may affect a more fickle, mainstream population. Bulk orderers will push through your smaller usability changes in a way that your average visitor may not.

Their article outlined a case where outliers skewed the results of a test. Upon further analysis, the outlier segment was 75% return visitors and much more engaged than the average visitor.

Think your data is immune to outliers? Maybe it is, but probably not – and in any case, it’s best to know for sure. So how do you diagnose that on your own? That is to say, how do you detect outliers in your data?

How to Detect Outliers in Data

Data visualization is a core discipline for analysts and optimizers, not just to better communicate results with executives, but to explore the data more fully.

As such, outliers are often detected through graphical means, though you can also do so by a variety of statistical methods using your favorite tool (Excel and R will be referenced heavily here, though SAS, Python, etc. all work).

Two of the most common graphical ways of detecting outliers are the boxplot and the scatterplot. A boxplot is my favorite way.

You can see here that the blue circles are outliers, with the open circles representing mild outliers and closed circles representing extreme outliers:

It’s really easy to draw boxplots in R. Just use boxplot(x, horizontal = TRUE), where x is your data set.

Even better, you can use the boxplot.stats(x) function, where x is your data set, to get summary stats that include the list of outliers ($out).

You can also see these in a scatterplot, though I believe it’s a bit harder to tell clearly which points are extreme outliers and which are mild ones. A histogram can work as well.

You can also see outliers fairly easily in run charts, lag plots (a type of scatterplot), and line charts, depending on what type of data you’re working with.

Conversion expert Andrew Anderson also backs the value of graphs to determine the effect of outliers on data:

Andrew Anderson:

“The graph is your friend. One of the reasons that I look for 7 days of consistent data is that it allows for normalization against non normal actions, be it size or external influence.

The other thing is that if there are obvious non-normal action values, it is ok to normalize them to the average as long as it is done unilaterally and is done to not bias results. This is only done if it is obviously out of normal line and usually I will still run the test another 2-3 extra days just to make sure.”

As to the latter point on non-normal distributions, we’ll go into that a bit later in the article.

But is there a statistical way of detecting outliers, apart from just eyeballing it on a chart? Indeed, there are many ways to do so, the main two being a standard deviation approach and Tukey’s method.

In the latter, extreme outliers tend to lie more than 3.0 times the interquartile range below the first quartile or above the third quartile, and mild outliers lie between 1.5 times and 3.0 times the interquartile range below the first quartile or above the third quartile.
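
The fences just described can be sketched in Python as follows. The function name `tukey_outliers` and the sample data are illustrative assumptions, and quartiles can be computed under several conventions, so other tools (R's boxplot.stats, Excel's QUARTILE) may place the fences slightly differently.

```python
import statistics

def tukey_outliers(values):
    """Split values into mild and extreme outliers using Tukey's fences."""
    # "inclusive" is one of several quartile conventions (an assumption here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # beyond this: at least mild
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # beyond this: extreme
    mild, extreme = [], []
    for v in values:
        if v < outer[0] or v > outer[1]:
            extreme.append(v)
        elif v < inner[0] or v > inner[1]:
            mild.append(v)
    return mild, extreme

data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 60]
mild, extreme = tukey_outliers(data)
print(mild, extreme)  # [30] [60]
```

Because the fences are built from quartiles rather than the mean and standard deviation, the outliers themselves have little influence on where the fences fall.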

It’s pretty easy to highlight outliers in Excel. While there’s no built-in function for outlier detection, you can find the quartile values and go from there. Here’s a quick guide if you’re interested in doing that.

Strategies for Dealing with Outliers in Data

Should an outlier be removed from analysis? The answer, though seemingly straightforward, isn’t so simple.

There are many strategies for dealing with outliers in data, and depending on the situation and data set, any could be the right or the wrong way. In addition, most major testing tools have strategies for dealing with outliers, but they usually differ in how exactly they do so.

Because of that, it’s still important to do your own custom analysis with regard to outliers, even if your testing tool has its own default parameters. Not only can you trust your testing data better, but sometimes analysis of outliers produces its own insights that will help with optimization.

So let’s go over some common strategies:

Set Up a Filter in Your Testing Tool

Even though this has a small cost, filtering out outliers is worth doing because you can often discover significant effects that are simply “hidden” by outliers.

According to Himanshu Sharma at OptimizeSmart, if you are tracking revenue as a goal in your A/B testing tool, you should set up code that filters out abnormally large orders from your test results.

He says that you should look at your past analytics data to determine your average web order value, and set up filters with that in mind. In his example, suppose your website’s average order value over the last 3 months has been $150; then any order above $200 can be considered an outlier.

Then it’s all about writing a bit of code to stop the tool from passing that value. Here are some brief instructions on how to do that in Optimizely. The tl;dr is that you exclude values above a certain amount with code that looks something like this (for orders above $200):

if (priceInCents < 20000) {
  // Only report orders under $200 (20000 cents), so larger orders never reach the tool.
  window.optimizely = window.optimizely || [];
  window.optimizely.push(['trackEvent', 'orderComplete', {'revenue': priceInCents}]);
}

Remove or Change Outliers During Post-Test Analysis

Kevin Hillstrom, President of Mine That Data, explained why he will sometimes adjust outliers in tests…

Kevin Hillstrom:

“On average, what a customer spends is not normally distributed.

If you have an average order value of $100, most of your customers are spending $70, $80, $90, or $100, and you have a small number of customers spending $200, $300, $800, $1600, and one customer spending $29,000. If you have 29,000 people in the test panel, and one person spends $29,000, that’s $1 per person in the test.

That’s how much that one order skews things.”
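
The arithmetic in the quote can be checked in a few lines of Python. The $2.00 baseline is a hypothetical figure added only to show the relative size of the distortion; it is not from the quote.

```python
panel_size = 29_000      # people in the test panel (from the quote)
big_order = 29_000.00    # the single very large order (from the quote)

# Contribution of that one order to revenue per person in the panel.
lift_per_person = big_order / panel_size
print(lift_per_person)  # 1.0

# Hypothetical baseline: suppose everyone else averages $2.00 per person.
baseline = 2.00
with_outlier = (baseline * (panel_size - 1) + big_order) / panel_size
print(round(with_outlier, 2))  # 3.0
```

Under that assumed baseline, one customer out of 29,000 moves revenue per person from about $2 to about $3, which is more than many real test effects.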

One way to account for this is simply to remove outliers, or to trim your data set so as to exclude as many as you’d like.

This is really easy to do in Excel: a simple TRIMMEAN function will do the trick. The first argument is the array you’d like to manipulate, and the second argument is the fraction of data points to exclude in total, split evenly between the upper and lower extremities.

Trimming values in R is super easy, too. It’s built into the mean() function. So, say you have a mean that differs quite a bit from the median; that probably means you have some very large or small values skewing it. In that case, you can trim off a certain percentage of the data on both the large and small side. In R, it’s just mean(x, trim = .05), where x is your data set and .05 can be any fraction of your choosing between 0 and .5.

This process of using Trimmed Estimators is usually done to obtain a more robust statistic. By the way, the median is the most trimmed statistic, at 50% on both sides (which you can also do with the mean function in R – mean(x, trim = .5)).
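
For readers working in Python, a trimmed mean can be sketched with the standard library alone. The function below is written to mirror the behaviour of R's mean(x, trim = ...), where `trim` is the fraction removed from each end; the name `trimmed_mean` and the order values are illustrative assumptions.

```python
import math
import statistics

def trimmed_mean(values, trim=0.05):
    """Mean after dropping a fraction `trim` of observations from EACH end."""
    if not 0 <= trim <= 0.5:
        raise ValueError("trim must be between 0 and 0.5")
    if trim == 0.5:
        return statistics.median(values)  # fully trimmed: the median
    data = sorted(values)
    k = math.floor(len(data) * trim)  # observations removed per side
    return statistics.fmean(data[k:len(data) - k])

# Hypothetical order values with one bulk order.
orders = [70, 80, 90, 100, 100, 110, 120, 130, 140, 29_000]
print(trimmed_mean(orders, 0))    # 2994.0 (plain mean)
print(trimmed_mean(orders, 0.1))  # 108.75
print(trimmed_mean(orders, 0.5))  # 105.0 (median)
```

Trimming 10% from each side drops only two of the ten orders, yet the estimate falls from 2994 to about 109, close to the median.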

Most of the time in optimization, your outliers will be on the higher end because of bulk orderers. Given your knowledge of historical data, if you’d like to do a post-hoc trimming of values above a certain parameter, that’s very easy to do in R. If the name of my data set is “rivers” I can do this, given the knowledge that my data usually falls under 1210: rivers.low <- rivers[rivers<1210].

That creates a new variable only consisting of what I deem to be non-outlier values, and from there I can boxplot it, getting something like this:

Clearly there are fewer outlier values, though there are still a few. This will virtually always happen, no matter how many values you trim from the extremes.

You can also do this by removing values that are beyond three standard deviations from the mean. To do that, first you need to extract the raw data from your testing tool. Optimizely reserves this ability for their enterprise customers unless you ask support to help you.

Instead of taking real client data, just to demonstrate how to do this, I generated two random sequences of numbers with normal distributions (using =NORMINV(RAND(),C1,D1) where C1 is mean and D1 is SD, for reference). In “variation 1,” though, I added a few very high outliers, making variation 1 a “statistically significant” winner:

Then you can use conditional formatting to highlight those that are above 3 standard deviations. Chop those off:

And you have a different statistically significant winner:

My example is incredibly clean cut and probably simpler than you’ll deal with, but at least you can see how just a few very high values can throw things off (and one possible solution to that). If you want to play around with outliers using this fake data, click here to download the spreadsheet.

Change the Value of Outliers

Much of the debate on how to deal with outliers in data comes down to the following question: should you keep outliers, remove them, or change them to another value?

Essentially, instead of simply removing the outliers from the data, in this case you take your set of outliers and change their values to something more representative of your data set. It’s a small distinction, but important: when you trim data, the extreme values are discarded. When you use winsorized estimators (changing the values), extreme values are instead replaced by certain percentiles (the trimmed minimum and maximum).

Kevin Hillstrom mentioned in his podcast that he trims the top 5% or top 1% (depending on the business) of orders and changes the value (e.g. $29,000 to $800). As he says, “You are allowed to adjust outliers.”

Here’s how to do that in R.
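
The same idea can also be sketched in Python. This is a minimal illustration of winsorizing, not the method from any particular library: the function name `winsorize`, its parameters, and the order values are assumptions made for the example.

```python
import math

def winsorize(values, lower=0.0, upper=0.05):
    """Replace the lowest `lower` and highest `upper` fraction of values
    with the nearest value that is kept."""
    data = sorted(values)
    n = len(data)
    k_low = math.floor(n * lower)    # number of values replaced at the bottom
    k_high = math.floor(n * upper)   # number of values replaced at the top
    low_value = data[k_low]
    high_value = data[n - 1 - k_high]
    return [min(max(v, low_value), high_value) for v in values]

# Hypothetical test panel of 20 orders, one of them a $29,000 bulk order.
orders = [70, 80, 90, 100] * 4 + [200, 300, 800, 29_000]
adjusted = winsorize(orders, upper=0.05)
print(max(adjusted))  # 800
```

Unlike trimming, the record count stays at 20: the top 5% (one order here) is kept but its value is pulled down to the largest remaining value, $800.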

Consider the Underlying Distribution

Traditional methods to calculate confidence intervals assume that the data follows a normal distribution, but as we discussed above, with certain metrics like average revenue per visitor, that usually isn’t the way reality works.

In another section of Dr. Julia Engelmann’s wonderful article for our blog, she shared a graphic depicting this difference. The left graphic shows a perfect (theoretical) normal distribution. The number of orders fluctuates around a positive average value. In the example, most customers order five times. More or fewer orders arise less often.

The graphic to the right shows the bitter reality. Assuming an average conversion rate of 5%, 95% are customers who don’t buy. Most buyers have probably placed one or two orders, and there are a few customers who order an extreme quantity.

The distribution on the right side is known as a “right-skewed” distribution.

Essentially, the problem comes in when we assume that a distribution is normal but we’re actually working with something like a right-skewed distribution. Confidence intervals can no longer be reliably calculated.

With your average ecommerce site, let’s say at least 90% of customers will not buy anything. Therefore, the proportion of “zeros” in the data is extreme and the deviations in general are enormous, including extremities because of bulk orders.

In this case, it’s worth taking a look at the data using other methods than the t-test. (The Shapiro-Wilk test lets you test your data for normal distribution, by the way). All of these were suggested in this article:

1. Mann-Whitney U-Test

The Mann-Whitney U-Test is an alternative to the t-test when the data deviates greatly from the normal distribution.

2. Robust statistics

Methods from robust statistics are used when the data is not normally distributed or distorted by outliers. Here, average values and variances are calculated such that they are not influenced by unusually high or low values – which I sort of went into with winsorization above.

3. Bootstrapping

This so-called non-parametric procedure works independently of any distribution assumption and provides reliable estimates for confidence levels and intervals. At its core, it belongs to the resampling methods. They provide reliable estimates of the distribution of variables on the basis of the observed data through random sampling procedures.
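
To make the bootstrapping idea concrete, here is a minimal Python sketch of a percentile bootstrap interval for the difference in mean revenue per visitor between two variations. The function name, the number of resamples, the fixed seed, and the revenue data are all assumptions for illustration; production analyses often use more refined bootstrap variants.

```python
import random
import statistics

def bootstrap_diff_ci(a, b, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    alpha = (1 - confidence) / 2
    lo = diffs[int(alpha * n_resamples)]
    hi = diffs[int((1 - alpha) * n_resamples) - 1]
    return lo, hi

# Hypothetical revenue per visitor: mostly zeros, a few buyers, one bulk order.
control = [0] * 90 + [50] * 10
variant = [0] * 90 + [50] * 9 + [5000]
print(bootstrap_diff_ci(control, variant))
```

No normality assumption is made: the interval comes entirely from the spread of the resampled differences, so a single bulk order shows up as a very wide interval rather than a falsely confident result.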

Consider the Value of Mild Outliers

As mentioned, with Revenue Per Visitor the underlying distribution is often non-normal. It’s common for a few big buyers to skew the data set towards the extremes. When this is the case, outlier detection falls prey to predictable inaccuracies – it detects outliers far more often.

So there’s a chance that in your data analysis, you shouldn’t throw away outliers. Rather, you should segment them and analyze them deeper. What demographic, behavioral, firmographic traits correlate with their purchasing behavior and how can you run an experiment to tease out some causality there?

This is a question that runs deeper than simple A/B testing and is core to your customer acquisition, targeting, and segmentation efforts. I don’t want to go too deep here, but I want to say that for various marketing reasons, analyzing your highest value cohorts can bring profound insights as well.

No Matter What, Do Something

In any case, it helps to have a plan in place. As Dan Begley-Groth wrote on the RichRelevance blog:

Dan Begley-Groth:

“In order for a test to be statistically valid, all rules of the testing game should be determined before the test begins. Otherwise, we potentially expose ourselves to a whirlpool of subjectivity mid-test.

Should a $500 order only count if it was directly driven by attributable recommendations? Should all $500+ orders count if there are an equal number on both sides? What if a side is still losing after including its $500+ orders? Can they be included then?

By defining outlier thresholds prior to the test (for RichRelevance tests, three standard deviations from the mean) and establishing a methodology that removes them, both the random noise and subjectivity of A/B test interpretation is significantly reduced. This is key to minimizing headaches while managing A/B tests”

Whether you believe outliers don’t have a strong effect on your data and choose to leave them as is, or whether you want to trim the top and bottom 25% of your data, the important thing is that you’ve thought it through and have an active strategy. Being data-driven means considering anomalies like this, and to ignore them means you could be making decisions on faulty data.

Conclusion

Outliers are something not discussed often in testing, but depending on your business and what metric you’re optimizing, they could certainly be affecting your results.

As we’ve seen, one or two high values in a smaller sample size can totally skew a test, leading you to make a decision off of faulty data. No bueno.

For the most part, if your data is affected by these extreme cases, you can bound the input to a historically representative range of your data that excludes outliers. That could be a limit on the number of items (>3) or a lower or upper bound on your order value.

Another way, perhaps better in the long run, would be to export your post-test data and visualize it by various means. Determine on a case-by-case basis what the effect of the outliers was. And from there, decide whether you want to remove, change, or keep the outlier values.

Really, though, there are lots of ways to deal with outliers in data. It’s not a simple quick fix that works across the board, and that’s why the demand for good analysts continues to grow.
