string categorization strategies

Posted by Andrew Heath on Stack Overflow
Published on 2010-05-25T07:27:17Z Indexed on 2010/05/25 7:31 UTC

I'm the one-man dev team on a fledgling military history website. One aspect of the site is a catalog of ~1,200 individual battles, including the nations & formations (regiments, divisions, etc.) that took part.

The formation information (as well as the other battle info) was manually imported from a series of books by a 10-man volunteer team. The formations were listed in groups with varying formatting and abbreviation patterns. At the time I set up the data collection forms I couldn't think of a good way to process that data... and elected to store it all as strings in the MySQL database and sort it out later.

Well, "later" - as tends to happen - has arrived. :-)

Each battle has 2+ records in the database - one for each nation that participated. Each record has a formations text string listing the formations present, in whatever format the volunteer chose to enter them.

Some real examples:

  • 39th Grenadier Rgmt, 26th Volksgrenadier Division
  • 2nd Luftwaffe Field Division, 246th Infantry Division
  • 247th Rifle Division, 255th Tank Brigade
  • 2nd Luftwaffe Field Division, SS Cavalry Division
  • 28th Tank Brigade, 158th Rifle Division, 135th Rifle Division, 81st Tank Brigade, 242nd Tank Brigade
  • 78th Infantry Division
  • 3rd Kure Special Naval Landing Force, Tulagi Seaplane Base personnel
  • 1st Battalion 505th Infantry Regiment

The ultimate goal is for each individual force to have an ID, so that its participation can be traced throughout the battle database. Formation hierarchy, as in the final item above, 1st Battalion (of the) 505th Infantry Regiment, also needs to be preserved. In that case, 1st Battalion and 505th Infantry Regiment would be split into separate formations, but 1st Battalion would be flagged as belonging to the 505th.
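
A rough Python sketch of the splitting step described above, offered as one possible starting point rather than a finished parser. It assumes that commas separate independent formations and that a unit-type word (Battalion, Regiment, Brigade, Division) marks the end of a formation name. The UNIT_TYPES list and the function names are hypothetical and would need tuning against the real data.

```python
import re

# Hypothetical list of unit-type words that end a formation name.
UNIT_TYPES = ["Battalion", "Regiment", "Brigade", "Division"]
_UNIT_END = re.compile(r"\b(?:%s)\b" % "|".join(UNIT_TYPES))

# Optional connector such as "of the" or "(of the)" between child and parent.
_CONNECTOR = re.compile(r"^\(?of(?:\s+the)?\)?\s+", re.IGNORECASE)


def split_entry(entry):
    """Split one comma-free entry into a chain of formations, smallest first."""
    parts = []
    start = 0
    for match in _UNIT_END.finditer(entry):
        piece = entry[start:match.end()].strip()
        parts.append(_CONNECTOR.sub("", piece))
        start = match.end()
    tail = entry[start:].strip()
    if tail:
        if parts:
            # Trailing words with no unit type stay attached to the last name.
            parts[-1] = parts[-1] + " " + tail
        else:
            # No unit-type word at all: keep the entry as a single name.
            parts.append(tail)
    return parts


def parse_formations(text):
    """Return (names, links); each link is a (parent, child) pair of names."""
    names, links = [], []
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue
        chain = split_entry(entry)
        names.extend(chain)
        # Each formation in a chain is assumed to belong to the next one.
        for child, parent in zip(chain, chain[1:]):
            links.append((parent, child))
    return names, links


names, links = parse_formations("1st Battalion 505th Infantry Regiment")
print(names)
print(links)
```

Entries with no recognized unit-type word, such as "Tulagi Seaplane Base personnel", pass through as a single name, so they would still need a manual look. A bare name like "1st Battalion" is also not unique on its own; it probably has to be stored together with its parent before it can be given an ID.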

In database terms, I think I want to pull the formation field out of the current battle info table and create three new tables:

FORMATION
[id] [name]

FORMATION_HIERARCHY
[id] [parent] [child]

FORMATION_BATTLE
[f_id] [battle_id]
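
To make the three-table layout concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained). The column names follow the layout above; the constraints, the helper functions, and the battle IDs 17 and 42 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE formation (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE formation_hierarchy (
    id     INTEGER PRIMARY KEY,
    parent INTEGER NOT NULL REFERENCES formation(id),
    child  INTEGER NOT NULL REFERENCES formation(id),
    UNIQUE (parent, child)
);
CREATE TABLE formation_battle (
    f_id      INTEGER NOT NULL REFERENCES formation(id),
    battle_id INTEGER NOT NULL,
    PRIMARY KEY (f_id, battle_id)
);
""")


def add_formation(name):
    """Insert a formation and return its new ID."""
    cur = conn.execute("INSERT INTO formation (name) VALUES (?)", (name,))
    return cur.lastrowid


# Example rows; the battle IDs are invented for this sketch.
regiment = add_formation("505th Infantry Regiment")
battalion = add_formation("1st Battalion")
conn.execute(
    "INSERT INTO formation_hierarchy (parent, child) VALUES (?, ?)",
    (regiment, battalion),
)
conn.execute(
    "INSERT INTO formation_battle (f_id, battle_id) VALUES (?, ?)",
    (battalion, 17),
)
conn.execute(
    "INSERT INTO formation_battle (f_id, battle_id) VALUES (?, ?)",
    (regiment, 42),
)


def battles_for(formation_id):
    """Battle IDs where the formation or one of its direct children fought."""
    rows = conn.execute(
        """
        SELECT DISTINCT fb.battle_id
        FROM formation_battle AS fb
        WHERE fb.f_id = ?
           OR fb.f_id IN (
               SELECT child FROM formation_hierarchy WHERE parent = ?
           )
        ORDER BY fb.battle_id
        """,
        (formation_id, formation_id),
    ).fetchall()
    return [row[0] for row in rows]


print(battles_for(regiment))
print(battles_for(battalion))
```

The query only follows one level of hierarchy (direct children). Deeper chains, such as battalion, regiment, division, would need either repeated lookups in application code or a recursive query where the database supports one.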

It's simple to explain, but complicated to enact.

What I'm looking for from the SO community is just some tips on how best to tackle this problem. Ideally there's an established method for solving this that I'm not aware of. However, as a last resort, I could always code a classification framework and call my volunteers back to sort through 2,500+ records...
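
One possible shape for such a classification pass, sketched in Python under stated assumptions: expand known abbreviations, give each distinct normalized name an ID, and queue near-duplicates for a human to check instead of merging them automatically. The abbreviation table is hypothetical (only "Rgmt" appears in the samples above), as are the class name and the 0.9 similarity cutoff. The fuzzy comparison uses difflib.get_close_matches from the standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical abbreviation table; only "Rgmt" comes from the sample data.
ABBREVIATIONS = {
    "rgmt": "Regiment",
    "regt": "Regiment",
    "bn": "Battalion",
    "bde": "Brigade",
    "div": "Division",
}


def normalize(name):
    """Collapse whitespace and expand known abbreviations word by word."""
    words = []
    for word in name.split():
        key = word.rstrip(".").lower()
        words.append(ABBREVIATIONS.get(key, word))
    return " ".join(words)


def _numbers(name):
    """Digit runs in a name, e.g. ['158'] for '158th Rifle Division'."""
    return re.findall(r"\d+", name)


class FormationRegistry:
    """Hands out one ID per normalized name and records likely typos."""

    def __init__(self, cutoff=0.9):
        self.ids = {}      # normalized name -> ID
        self.review = []   # (new name, similar existing name) pairs
        self.cutoff = cutoff

    def get_id(self, raw_name):
        name = normalize(raw_name)
        if name not in self.ids:
            # Only compare names whose numbers agree, so that the 158th and
            # 135th Rifle Division are never reported as the same unit.
            candidates = [
                known for known in self.ids
                if _numbers(known) == _numbers(name)
            ]
            close = get_close_matches(
                name, candidates, n=1, cutoff=self.cutoff
            )
            if close:
                self.review.append((name, close[0]))
            self.ids[name] = len(self.ids) + 1
        return self.ids[name]


registry = FormationRegistry()
first = registry.get_id("39th Grenadier Rgmt")
second = registry.get_id("39th  Grenadier Regiment")
print(first == second)
```

The review list keeps the volunteers' job small: only the pairs the code is unsure about need a human decision, not every record. Restricting candidates to names with identical numbers is a design choice for this data, since numbered units often differ by nothing but their digits.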
