How to detect and exclude outliers from linear regression?

[Just to follow-up on this question]:

I'm doing linear regression tests with SQL Anywhere's builtin OLAP functions. Say, for a very simplified example, I would assume a linear correlation between the columns x and y in a table MyTable.

So I would generate a linear function with

select REGR_COUNT(y, x) as cnt,
   round(REGR_SLOPE(y, x), 4) as slope,
   round(REGR_INTERCEPT(y, x), 4) as yIntercept,
   round(REGR_R2(my, x), 4) as fitness
from MyTable
where x > 0 and y > 0;

This works well generally. However, what would be a senseful method to exclude outliers?

A simple test for maximum/minimum values (or a ranking) seems inadequate as outliers would be defined as based on their value pairs, not just on the y value.

Currently, Im trying to use the above query as common table expression and then to check for those pairs that have a bigger deviation compared to the generated linear function:

with CTE_LR as
   (select REGR_COUNT(y, x) as cnt,
      round(REGR_SLOPE(y, x), 4) as slope,
      round(REGR_INTERCEPT(y, x), 4) as yIntercept,
      round(REGR_R2(my, x), 4) as fitness
   from MyTable
   where x > 0 and y > 0)
select x, y, round(slope * x + yIntercept, 4) as yCalc,
    abs(yCalc - y) as absDiff, abs(yCalc - y) / y as relDiff
from MyTable M, CTE_LR
where x > 0 and y > 0
order by relDiff desc, x, y;

However, this helps to detect outliers post-mortem, but obviously they have already influenced the linear regression. I could then build another regression without these outliers (say, those with a certain relative deviation) but that again might exclude the "wrong outliers" based on them being part of the previous regression.

Therefore I would like a way to exclude them beforehand. Is there a (not too complicated) way to do so?

olap linear-regression sa-12

asked 08 Jun '11, 12:26

Volker Barth
40.2k●361●550●822
accept rate: 34%

edited 08 Jun '11, 12:29

flat view

One Answer:

active answers oldest answers newest answers popular answers

The word percentile comes to my mind, exclude the <5% and the >95% percentile of the data range. See the PERCENT_RANK function and select the set for your regression based on the PERCENT_RANK.

permanent link

answered 08 Jun '11, 12:55

Martin
9.0k●130●169●257
accept rate: 14%

That seems helpful (and I haven't been aware of that variation of RANK()), but...

...the problem is I'm not looking for absolute values but maximum/minimum y values w.r.t. the according x values.

So basically it's a question how to do a "partition" (or a group by) over a continuous range of double values - and unfortunately not a range which is evenly filled with x values.

(09 Jun '11, 04:20) Volker Barth

you order by y and then use the Percent_Rank to skip the areas which you define as outliers

(09 Jun '11, 06:14) Martin

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

sa-12 ×438
olap ×19
linear-regression ×1

question asked: 08 Jun '11, 12:26

question was seen: 3,860 times

last updated: 09 Jun '11, 06:14

SAP SQL Anywhere

How to detect and exclude outliers from linear regression?

Follow this question

Related questions