Dimension Reduction in Categorical Data with Missing Values

Posted by user227290 on Stack Overflow, published 2010-05-14T21:50:21Z.

I have a regression model in which the dependent variable is continuous, but ninety percent of the independent variables are categorical (both ordered and unordered), and around thirty percent of the records have missing values (to make matters worse, they are missing randomly, without any pattern; more than forty-five percent of the records have at least one missing value). There is no a priori theory to guide the specification of the model, so one of the key tasks is dimension reduction before running the regression. While I am aware of several dimension-reduction methods for continuous variables, I am not aware of a similar statistical literature for categorical data (except, perhaps, correspondence analysis, which is basically a variation of principal component analysis on a frequency table). Let me also add that the dataset is of moderate size: 500,000 observations with 200 variables. I have two questions.
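For illustration only, here is a minimal sketch in R of multiple correspondence analysis (the multi-variable version of the correspondence analysis mentioned above), assuming the FactoMineR package is available; the toy data frame and all variable names are made up, not the real data.

    # Illustrative only: multiple correspondence analysis on a toy data frame.
    # Assumes the FactoMineR package; the data and names here are made up.
    library(FactoMineR)

    set.seed(1)
    toy <- data.frame(
      x1 = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
      x2 = factor(sample(c("low", "high"), 100, replace = TRUE)),
      x3 = factor(sample(c("u", "v", "w"), 100, replace = TRUE))
    )

    # MCA turns the factors into a small number of continuous scores,
    # which can then stand in for the original variables in the regression.
    mca    <- MCA(toy, ncp = 5, graph = FALSE)
    scores <- mca$ind$coord   # one row per observation, one column per retained dimension
    head(scores)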

  1. Is there a good statistical reference for dimension reduction of categorical data together with robust imputation? (I think the first issue is imputation and then dimension reduction.)
  2. This is linked to the implementation of the above problem. I have used R extensively and tend to rely heavily on the transcan and impute functions for continuous variables, and on a variation of the tree method to impute categorical values. I also have a working knowledge of Python, so if something suitable exists there I will use it. Any implementation pointers in Python or R will be of great help (a rough sketch of the R route I mean follows below). Thank you.
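Purely as a hedged sketch of the R route mentioned in question 2: Hmisc's aregImpute (a relative of the transcan/impute functions already in use) performs multiple imputation for mixed continuous and categorical data, and fit.mult.impute pools the regression fits over the completed datasets. The toy data, sample size, and variable names below are placeholders, not the real model.

    # Hedged sketch: multiple imputation for mixed data with Hmisc, then a pooled fit.
    # The toy data, sample size, and variable names are placeholders.
    library(Hmisc)

    set.seed(1)
    n   <- 200
    toy <- data.frame(
      y  = rnorm(n),
      x1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
      x2 = factor(sample(c("low", "high"), n, replace = TRUE))
    )
    toy$x1[sample(n, 30)] <- NA   # introduce missing values in the categorical predictors
    toy$x2[sample(n, 30)] <- NA

    # Multiple imputation (5 completed datasets) by additive regression with
    # predictive mean matching; factor variables are handled directly.
    imp <- aregImpute(~ y + x1 + x2, data = toy, n.impute = 5)

    # Fit the regression on each completed dataset and combine the estimates.
    fit <- fit.mult.impute(y ~ x1 + x2, lm, imp, data = toy)
    summary(fit)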
