SQL Server to PostgreSQL - Migration and design concerns
- by youwhut
Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:
I have an Articles table:
CREATE TABLE [dbo].[Articles](
    [server_ref] [int] NOT NULL,
    [article_ref] [int] NOT NULL,
    [article_title] [varchar](400) NOT NULL,
    [category_ref] [int] NOT NULL,
    [size] [bigint] NOT NULL
)
Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.
Importing:
Indexes are disabled on the Articles table.
For each dumped text file
Data is BULK copied to a temporary table.
Temporary table is updated.
Old data for the server is dropped from the Articles table.
Temporary table data is copied to Articles table.
Temporary table dropped.
Once this process is complete for all servers the indexes are built and the new database is copied to a web server.
I am reasonably happy with this process but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better. i.e. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.
Partitioning is a feature of SQL Server Enterprise but $$$ which is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred for the import process (drop data, insert data) and building indexes? Will the database grow by a huge amount?
The database currently stands at 200 GB and will grow. Copying this across the network is not ideal but it works. I am putting thought into changing the hardware structure of the system. The thought process of having an import server and a web server is so that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) can present reports. Maybe reducing the system down to one server would work to skip the copying across the network stage. This one server would have two versions of the database: one with the indexes for delivering reports and the other without for importing new data. The databases would swap daily. Thoughts?
This is a fantastic system, and believe it or not there is some method to my madness by giving it a big shake up.
UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.