## #Research Sharing# [The Ten Data Mining Algorithms Data Scientists Use Most]

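Compared with 2011, the top spots are still held by regression, clustering, decision trees, and visualization. New entries in the 2016 survey are K-nearest neighbors, principal component analysis (PCA), random forests, optimization, neural networks - deep learning, and singular value decomposition. The biggest declines include association rules, factor analysis, and survival analysis. Supervised learning algorithms are widely used. http://www.looooker.com/archives/34455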

The latest KDnuggets Poll asked:

Which methods/algorithms did you use in the past 12 months for an actual Data Science-related application?

Here are the results, based on 844 voters.

The top 10 algorithms and their share of voters are:

**Fig. 1: Top 10 algorithms used by Data Scientists.**

See full table of all algorithms at the end of the post.

The average respondent used 8.1 algorithms, a big increase vs a similar poll in 2011.

Comparing with the 2011 poll, "Algorithms for data analysis / data mining", we note that the top methods are still Regression, Clustering, Decision Trees/Rules, and Visualization. The biggest relative increases, measured by (pct2016 / pct2011 - 1), are for

- **Boosting**, up 40% to 32.8% share in 2016 from 23.5% share in 2011
- **Text Mining**, up 30% to 35.9% from 27.7%
- **Visualization**, up 27% to 48.7% from 38.3%
- **Time series/Sequence analysis**, up 25% to 37.0% from 29.6%
- **Anomaly/Deviation detection**, up 19% to 19.5% from 16.4%
- **Ensemble methods**, up 19% to 33.6% from 28.3%
- **SVM**, up 18% to 33.6% from 28.6%
- **Regression**, up 16% to 67.1% from 57.9%
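As a quick check on how the relative increase is measured, here is a minimal Python sketch that applies (pct2016 / pct2011 - 1) to a few of the shares quoted above. The function name `relative_change` and the dictionary layout are illustrative choices, not part of the original poll analysis.

```python
# Shares of voters (percent) quoted above, stored as (2011, 2016).
shares = {
    "Boosting": (23.5, 32.8),
    "Text Mining": (27.7, 35.9),
    "Visualization": (38.3, 48.7),
    "Regression": (57.9, 67.1),
}


def relative_change(pct_2011: float, pct_2016: float) -> float:
    """Return pct2016 / pct2011 - 1 as a fraction (0.40 means up 40%)."""
    return pct_2016 / pct_2011 - 1


for name, (p11, p16) in shares.items():
    # Format the fraction as a signed whole-number percentage.
    print(f"{name}: {relative_change(p11, p16):+.0%}")
```

Note that this is a relative change in share, not a difference in percentage points: Boosting rose by 9.3 points, which is about 40% of its 2011 share.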

The most popular among the new options added in 2016 are

- K-nearest neighbors, 46% share
- PCA, 43%
- Random Forests, 38%
- Optimization, 24%
- Neural networks - Deep Learning, 19%
- Singular Value Decomposition, 16%

The biggest declines are for

- Association rules, down 47% to 15.3% from 28.6%
- Uplift modeling, down 36% to 3.1% from 4.8% (that is a surprise, given strong results published)
- Factor Analysis, down 24% to 14.2% from 18.6%
- Survival Analysis, down 15% to 7.9% from 9.3%

The following table shows usage of different algorithm types (Supervised, Unsupervised, Meta, and Other) by employment type. We excluded the NA (4.5%) and Other (3%) employment types.

**Table 1: Algorithm usage by Employment Type**

Employment Type | % Voters | Avg Num Algorithms Used | % Used Supervised | % Used Unsupervised | % Used Meta | % Used Other Methods |
---|---|---|---|---|---|---|
Industry | 59% | 8.4 | 94% | 81% | 55% | 83% |
Government/Non-profit | 4.1% | 9.5 | 91% | 89% | 49% | 89% |
Student | 16% | 8.1 | 94% | 76% | 47% | 77% |
Academia | 12% | 7.2 | 95% | 81% | 44% | 77% |
All | | 8.3 | 94% | 82% | 48% | 81% |

We note that almost **everyone uses supervised learning algorithms**.

Government and Industry Data Scientists used **more different types of algorithms** than students or academic researchers, and **Industry Data Scientists were more likely to use Meta-algorithms**.

Next, we analyzed the usage of top 10 algorithms + Deep Learning by employment type.

**Table 2: Top 10 Algorithms + Deep Learning usage by Employment Type**

Algorithm | Industry | Government/Non-profit | Academia | Student | All |
---|---|---|---|---|---|
Regression | 71% | 63% | 51% | 64% | 67% |
Clustering | 58% | 63% | 51% | 58% | 57% |
Decision Trees/Rules | 59% | 63% | 38% | 57% | 55% |
Visualization | 55% | 71% | 28% | 47% | 49% |
K-NN | 46% | 54% | 48% | 47% | 46% |
PCA | 43% | 57% | 48% | 40% | 43% |
Statistics | 47% | 49% | 37% | 36% | 43% |
Random Forests | 40% | 40% | 29% | 36% | 38% |
Time series | 42% | 54% | 26% | 24% | 37% |
Text Mining | 36% | 40% | 33% | 38% | 36% |
Deep Learning | 18% | 9% | 24% | 19% | 19% |

To make the differences easier to see, we compute the algorithm bias for a particular employment type relative to average algorithm usage as Bias(Alg,Type)=Usage(Alg,Type)/Usage(Alg,All) - 1.
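The bias formula can be illustrated with a short Python sketch. The usage figures are copied from two rows of Table 2; the `usage` dictionary and the `bias` helper are names assumed for this example and do not come from the original analysis.

```python
# Usage shares (percent) copied from two rows of Table 2.
usage = {
    "Visualization": {
        "Industry": 55, "Government/Non-profit": 71,
        "Academia": 28, "Student": 47, "All": 49,
    },
    "Deep Learning": {
        "Industry": 18, "Government/Non-profit": 9,
        "Academia": 24, "Student": 19, "All": 19,
    },
}


def bias(alg: str, emp_type: str) -> float:
    """Bias(Alg,Type) = Usage(Alg,Type) / Usage(Alg,All) - 1."""
    return usage[alg][emp_type] / usage[alg]["All"] - 1


for alg, by_type in usage.items():
    for emp_type in by_type:
        if emp_type != "All":
            # Positive bias: used more than average by this employment type.
            print(f"{alg:15s} {emp_type:22s} {bias(alg, emp_type):+.2f}")
```

A bias of 0 means the employment type uses the algorithm at the overall average rate; a positive value means above-average usage and a negative value means below-average usage.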

**Fig. 2: Algorithm usage bias by Employment.**

We note that Industry Data Scientists are more likely to use Regression, Visualization, Statistics, Random Forests, and Time Series. Government/non-profit are more likely to use Visualization, PCA, and Time Series. Academic researchers are more likely to use PCA and Deep Learning. Students generally use fewer algorithms, but do more text mining and Deep Learning.

Next, we look at regional participation, which was representative of overall KDnuggets visitors.

Regional distribution of poll participants:

- US/Canada, 40%
- Europe, 32%
- Asia, 18%
- Latin America, 5.0%
- Africa/Middle East, 3.4%
- Australia/NZ, 2.2%

As in the 2011 poll, we combined Industry/Government into one group and Academic researchers/Students into a second group, and computed the "affinity" of each algorithm to Industry/Gov as

$$\text{Affinity}(\text{Alg}) = \frac{N(\text{Alg},\ \text{Ind\_Gov}) \,/\, N(\text{Alg},\ \text{Aca\_Stu})}{N(\text{Ind\_Gov}) \,/\, N(\text{Aca\_Stu})} - 1$$

Thus an algorithm with affinity 0 is used equally by Industry/Government and by Academic researchers/Students. The higher the IG affinity, the more "industrial" the algorithm; the lower it is, the more "academic" the algorithm.
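The affinity measure can be sketched in a few lines of Python. The respondent counts below are hypothetical and chosen only to make the arithmetic easy to follow, since the post reports percentages rather than raw counts; the function and parameter names are likewise assumptions for illustration.

```python
def affinity(n_alg_ind_gov: int, n_alg_aca_stu: int,
             n_ind_gov: int, n_aca_stu: int) -> float:
    """Industry/Government affinity of one algorithm.

    Ratio of the algorithm's users in the two groups, divided by the
    ratio of the group sizes, minus 1.
    """
    return (n_alg_ind_gov / n_alg_aca_stu) / (n_ind_gov / n_aca_stu) - 1


# Hypothetical group sizes: 600 Industry/Gov and 200 Academia/Student voters.
N_IND_GOV, N_ACA_STU = 600, 200

# Used in the same proportion by both groups (50% each): affinity 0.
print(affinity(300, 100, N_IND_GOV, N_ACA_STU))  # 0.0

# Used by 10% of Industry/Gov but 5% of Academia/Student: "industrial".
print(affinity(60, 10, N_IND_GOV, N_ACA_STU))    # 1.0

# Used by 5% of Industry/Gov but 10% of Academia/Student: "academic".
print(affinity(30, 20, N_IND_GOV, N_ACA_STU))    # -0.5
```

Dividing by the ratio of group sizes corrects for the fact that Industry/Government respondents outnumber Academic/Student respondents, so a raw count ratio alone would overstate industrial usage.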

The most "Industrial Algorithms" were:

- Uplift modeling, 2.01
- Anomaly Detection, 1.61
- Survival Analysis, 1.39
- Factor Analysis, 0.83
- Time series/Sequences, 0.69
- Association Rules, 0.5

While uplift modeling was again the most "industrial" algorithm, the surprising finding is that it is used by so few respondents: only 3.1%, the lowest of any algorithm in this poll.

The most "academic" algorithms were:

- Neural networks - regular, -0.35
- Naive Bayes, -0.35
- SVM, -0.24
- Deep Learning, -0.19
- EM, -0.17

The next figure shows all the algorithms and their Industry/Academic affinity.

**Fig. 3. KDnuggets Poll: Top Algorithms used by Data Scientists: Industry vs Academia**

The next table gives the details on all the algorithms, with columns

- N: rank according to share of usage
- Algorithm: algorithm name
- Type: S - Supervised, U - Unsupervised, M - Meta, Z - Other
- % of respondents who used this algorithm in the 2016 poll
- % of respondents who used this algorithm in the 2011 poll
- Change: (%2016 / %2011 - 1)
- Industry affinity, as explained above

**Table 3: KDnuggets 2016 Poll: Algorithms Used by Data Scientists**

N | Algorithm | Type | 2016 % used | 2011 % used | % Change | Industry Affinity |
---|---|---|---|---|---|---|
1 | Regression | S | 67% | 58% | 16% | 0.21 |
2 | Clustering | U | 57% | 52% | 8.7% | 0.05 |
3 | Decision Trees/Rules | S | 55% | 60% | -7.3% | 0.21 |
4 | Visualization | Z | 49% | 38% | 27% | 0.44 |
5 | K-nearest neighbors | S | 46% | | | 0.32 |
6 | PCA | U | 43% | | | 0.02 |
7 | Statistics | Z | 43% | 48% | -11.0% | 1.39 |
8 | Random Forests | S | 38% | | | 0.22 |
9 | Time series/Sequence analysis | Z | 37% | 30% | 25.0% | 0.69 |
10 | Text Mining | Z | 36% | 28% | 29.8% | 0.01 |
11 | Ensemble methods | M | 34% | 28% | 18.9% | -0.17 |
12 | SVM | S | 34% | 29% | 17.6% | -0.24 |
13 | Boosting | M | 33% | 23% | 40% | 0.24 |
14 | Neural networks - regular | S | 24% | 27% | -10.5% | -0.35 |
15 | Optimization | Z | 24% | | | 0.07 |
16 | Naive Bayes | S | 24% | 22% | 8.9% | -0.02 |
17 | Bagging | M | 22% | 20% | 8.8% | 0.02 |
18 | Anomaly/Deviation detection | Z | 20% | 16% | 19% | 1.61 |
19 | Neural networks - Deep Learning | S | 19% | | | -0.35 |
20 | Singular Value Decomposition | U | 16% | | | 0.29 |
21 | Association rules | Z | 15% | 29% | -47% | 0.50 |
22 | Graph / Link / Social Network Analysis | Z | 15% | 14% | 8.0% | -0.08 |
23 | Factor Analysis | U | 14% | 19% | -23.8% | 0.14 |
24 | Bayesian networks | S | 13% | | | -0.10 |
25 | Genetic algorithms | Z | 8.8% | 9.3% | -6.0% | 0.83 |
26 | Survival Analysis | Z | 7.9% | 9.3% | -14.9% | -0.15 |
27 | EM | U | 6.6% | | | -0.19 |
28 | Other methods | Z | 4.6% | | | -0.06 |
29 | Uplift modeling | S | 3.1% | 4.8% | -36.1% | 2.01 |

Title: Top Algorithms Used by Data Scientists

Author: Gregory Piatetsky

Source: kdnuggets.com

Link: http://www.kdnuggets.com/2016/09/poll-algorithms-used-data-scientists.html