Matching shows to the files on the HPR server that they refer to needs work
The plan is to upload all files relating to a show to the Internet Archive (IA) as part of the HPR workflow. This makes them self-contained on the IA.
The script which collates everything and prepares metadata for upload has been enhanced to do this, but unfortunately working out which files belong to a show is not always simple.
Knowledge about which files belong to a show is held in the notes (and sometimes in the additional files themselves). For example, if a contributor sends in a show whose notes refer to other files, including HTML files which in turn refer to further files such as images, tracing all of the components requires work. A recursive parsing approach seems to be the only solution, and that is what the metadata-generating upload script uses. It will need help with older shows, though.
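To illustrate the recursive approach, here is a minimal Python sketch. It is not the actual upload script: the function names (`local_links`, `collect_show_files`) are hypothetical, and it assumes that the notes are HTML, that related files are linked through relative `href` or `src` attributes, and that only HTML files need to be parsed further.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from HTML text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def local_links(html_text):
    """Return relative links only (assumption: absolute URLs are external)."""
    parser = LinkCollector()
    parser.feed(html_text)
    result = []
    for link in parser.links:
        parts = urlparse(link)
        # Skip http:, mailto: and similar, and pure "#fragment" links
        if parts.scheme or parts.netloc or not parts.path:
            continue
        result.append(unquote(parts.path))
    return result


def collect_show_files(notes_path, seen=None):
    """Recursively gather every existing local file reachable from the notes."""
    seen = set() if seen is None else seen
    notes_path = Path(notes_path).resolve()
    # The 'seen' set stops loops where two pages link to each other
    if notes_path in seen or not notes_path.is_file():
        return seen
    seen.add(notes_path)
    if notes_path.suffix.lower() in (".html", ".htm"):
        text = notes_path.read_text(encoding="utf-8", errors="replace")
        for link in local_links(text):
            collect_show_files(notes_path.parent / link, seen)
    return seen
```

Calling `collect_show_files("hpr1234/index.html")` (an invented path) would return the set of files found by following links from that page. Links to files that do not exist are silently dropped here; a real script would probably want to report them, since they may be exactly the older-show cases that need manual help.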
In order to keep a tally of all of the files relating to shows it is suggested that a script be written to perform a scan across the entire HPR database and that the result be stored for future reference. It would make sense to incorporate this data in the PostgreSQL database currently being designed.
To that end a temporary SQLite database is being populated with sufficient data to assist with keeping records of the process of uploading all HPR shows to the IA, with their files, audio formats and so forth. A script to manage this database is being produced at the time of writing.
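As an illustration only, a tracking database of this kind could look something like the Python sketch below. The table and column names (`shows`, `show_files`, `kind`, `uploaded`) and the helper functions are assumptions made for the example, not the schema of the real database.

```python
import sqlite3

# Hypothetical schema: one row per show, one row per file belonging to it
SCHEMA = """
CREATE TABLE IF NOT EXISTS shows (
    show_id   INTEGER PRIMARY KEY,       -- HPR show number
    uploaded  INTEGER NOT NULL DEFAULT 0 -- 1 once the show is on the IA
);
CREATE TABLE IF NOT EXISTS show_files (
    show_id   INTEGER NOT NULL REFERENCES shows(show_id),
    path      TEXT NOT NULL,             -- path relative to the show area
    kind      TEXT NOT NULL,             -- e.g. 'audio', 'notes', 'asset'
    uploaded  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (show_id, path)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the tracking database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_file(conn, show_id, path, kind):
    """Note that a file belongs to a show; repeated calls are harmless."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO shows (show_id) VALUES (?)", (show_id,)
        )
        conn.execute(
            "INSERT OR IGNORE INTO show_files (show_id, path, kind) "
            "VALUES (?, ?, ?)",
            (show_id, path, kind),
        )


def mark_uploaded(conn, show_id, path):
    """Flag one file as having been uploaded to the IA."""
    with conn:
        conn.execute(
            "UPDATE show_files SET uploaded = 1 WHERE show_id = ? AND path = ?",
            (show_id, path),
        )


def pending_files(conn, show_id):
    """List the files of a show that have not been uploaded yet."""
    rows = conn.execute(
        "SELECT path FROM show_files "
        "WHERE show_id = ? AND uploaded = 0 ORDER BY path",
        (show_id,),
    )
    return [row[0] for row in rows]
```

Keeping a separate `uploaded` flag per file, rather than only per show, would allow a partly uploaded show to be resumed without repeating work.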
To assist with this process it's important to ensure that:
- Files related to a new show are either named appropriately or placed in an appropriately named directory, so that they are easy to identify as belonging to that show
- We rationalise all cases where older shows did not conform to this criterion
- We check that there are no "orphaned" files in the directory where all the shows are held. Tidy up, in other words!
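The orphan check in the last point could be sketched along the following lines in Python. This assumes a naming convention in which every file or top-level directory begins with `hpr` followed by the show number; that convention, the `SHOW_NAME` pattern and the `find_orphans` function are all hypothetical and would need adjusting to whatever rule is actually adopted.

```python
import re
from pathlib import Path

# Assumed convention: top-level names start with "hpr" plus the show number
SHOW_NAME = re.compile(r"^hpr(\d+)")


def find_orphans(root, known_shows):
    """Return files under root that cannot be tied to a known show number.

    known_shows is a set of integers, for example taken from the database.
    """
    root = Path(root)
    orphans = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Only the first path component is checked, so everything inside
        # a directory such as hpr1234/ is credited to show 1234
        match = SHOW_NAME.match(relative.parts[0])
        if match is None or int(match.group(1)) not in known_shows:
            orphans.append(relative)
    return orphans
```

For example, `find_orphans("/path/to/shows", {1, 2, 3})` (an invented path) would list every file whose name or containing directory does not correspond to shows 1 to 3. The result is only a list of candidates for review; nothing is deleted.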