SURAgrid in-person meeting notes
September 7 & 8, 2005
With updates from discussions in the SURAgrid call that followed
on September 12
September 7 - Install Fest
Many sites came ready to install; some just to learn and observe. Of those installing, some were using the time to get started, with the bonus of the availability of peer support. Others had already made progress in installation but had run into questions/problems at particular points. TACC and UVA were available to assist with node/portal presence and cross-certification, respectively. TACC also wanted feedback on features needed for the portal in the future. Note: UNCC, LSU and OleMiss had intended to come to the Install Fest as well as the next-day planning meeting but could not attend due to travel-related fall-out from Hurricane Katrina.
Learning/observing (not installing today):
- GMU – Has some general ?’s on Globus and related scheduler
- ODU – Interested in observing the steps involved in installation and cross-certification
- Vanderbilt – New to SURAgrid and desiring to learn more before proceeding
Here to work on both getting in portal and cross-certifying:
- UArk – Intending to cross-certify and get in portal today; also has some questions on lower level issues, e.g. configuring GRAM
- GSU – Intending to finish cross-certification and getting into portal; also would like to run an application
- UMich – Intending to make part of an MGRID cluster available but has some political issues with cross-certification since MGRID machines are using DOE certificates rather than from a local UMich CA
- USC – Intending to bring shared cluster & condor flock into portal, with installation being handled by those not at the meeting though. Also need to clarify state of earlier cross-certification
- TTU – Intending to cross-certify and get resources into the portal but needs information related to what ports to open for various SURAgrid nodes to get through TTU firewall
- UKY – Intending to get resources into the portal and cross-certify, with specific goal to run UNC SCOOP application in time for SURAgrid presentation at I2
- ULL – Intending to begin with a small cluster in the portal and cross-certify; also has some ?’s on Globus (e.g., some problems w/GRAM)
- TAMU – Intending to get shared cluster into portal, finish cross-certification, also interested in additional info on problems with GRAM and variations on MPI across SURAgrid machines
- UAB – Intending to update cross-certification and get resources into portal
TACC gave a tour of the portal to start us off, and Jim J. presented basics on cross-certification.
Update from September 12 SURAgrid call:
See slides from Jim for September 8. He had intended to send these for a sanity check before sending to the list but MFY jumped the gun ;-). Please send feedback to Jim if you have comments/changes for this presentation from the meeting.
Feedback on portal:
- Would be good to have some “You’ve made it” application when adding resources to the portal - some job that users can log in through the portal and run to verify their resources are in and ready.
- It would also be good to have some applications – simple O.K. but better than bin/date! – to verify ability to login to the portal and use others’ resources. -> "Little library of applications"
- Would like to be able to see some amount of history in the portal – what has run vs. what is running right now, particularly since traffic is likely to be intermittent as we are still getting applications involved.
- The whole machine shows in the portal (e.g., when figuring load) and in a static way. The amount available is actually determined by the allocation, though. Would be good to find a way to show what’s really available to SURAgrid users on shared machines.
- What is a SURAgrid account anyway? Should provide some clarification of this as part of portal (and other SURAgrid) documentation. One for future discussion on project planning on Sept. 8…
Feedback on Cross-certification:
- Document how to generate the certificate request (OpenSSL command)
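As a possible starting point for that documentation, the sketch below assembles an OpenSSL certificate request command. It assumes the request is made with an existing private key; the file names and the subject string are hypothetical placeholders, and the actual Bridge CA procedure may call for different options.

```python
import shlex


def csr_command(key_path, csr_path, subject):
    """Build an `openssl req` command that creates a certificate request.

    Assumes an existing private key at key_path; all paths and the
    subject are placeholders, not values from the Bridge CA documentation.
    """
    return [
        "openssl", "req",
        "-new",              # create a new certificate request
        "-key", key_path,    # sign the request with this existing key
        "-out", csr_path,    # where to write the request
        "-subj", subject,    # subject name, given on the command line
    ]


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    command = csr_command(
        "ca-key.pem", "ca-request.pem", "/O=Example University/CN=Example CA"
    )
    print(shlex.join(command))
```

Printing the command rather than running it keeps the sketch safe to try; the printed line can be pasted into a shell once the paths and subject are replaced with real values.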
Those that had not worked through Bridge CA or portal documentation began by doing that. Those with specific questions worked immediately with Jim, Warren or Ashok (and others that wanted to help).
Results: All but one site that came ready to install are now in the portal, some with 2 different resources: GSU (2), TACC (2), ULL, UKY, UVA, UAB, UArk, TAMU. Also see gridportal.sura.org.
Update from September 12 SURAgrid call:
Sites that are cross-certified are listed at https://www.pki.virginia.edu/nmi-bridge/certs. Currently includes UVA, UAB, TACC, LSU, USC, GSU, with UArk, UMich, TAMU in progress.
Next steps: (MFY thoughts after the meeting)
- TACC will be providing ongoing assistance to sites as needed to bring nodes into the resource monitor and successfully use them through the portal. Based on info in the current Table of Elements and SURAgrid calls to date, the next wave of installations to be pursued are: USC [this week or next, or rolls to early October] LSU, OleMiss, UNCC, GMU, UMich [on the way in]
- Would like to discuss the document version of SURAgrid node info (currently Table of SURAgrid Elements) – Do we need to maintain this now that nodes are coming online or is all info needed available through the portal, the install tracking that TACC will be doing and the applications tracking that Art is doing? If we decide to maintain it, need to post an updated version to the SURAgrid Web, ideally before the I2 meeting next week.
Update from September 12 SURAgrid call:
Several elements of the table are recorded in other areas/ways now – system specs in the portal, cross-cert status on Jim’s Bridge CA pages, apps info into application description template as they become “real” for running on SURAgrid. However, table still provides a useful aggregate view and also a view of things that are in progress. MFY will examine the table for any potential changes, update for references to the portal where applicable and circulate for another update to be posted to the SURAgrid Web site.
Note since the call – this is now done. Copy on SURAgrid Web is the latest one.
September 8 – Project Planning
Grid Building - Facilitated by MFY, input from all
- Further specification of components to make SURAgrid "real" within the next few months
- Need to provide information on SURAgrid environment variables to applications that will be running on SURAgrid nodes. This should be done sooner rather than later, in order to support applications that are in progress as well as communicate to other applications about the potential to run on SURAgrid. Related points:
- Some of the variables are static, some will be dynamic. GPIR could publish this information if variables were retrievable from some known location for each SURAgrid node.
- What information needs to be known? OSG has found a total of five environment variables that are needed: 2 flavors of temp (staging, working), home directory, application area (OSG mandates that nodes provide a specific size space per VO), data area. Could begin with this model rather than reinventing the wheel and adapt to differences as they come up. For example, SURAgrid might think about VO areas more generally, such as by application, or general user types, like. We should also look at a more exhaustive list of variables at some point in case we want to add, e.g. file systems in use, what compilers are available. Should include a review of other notable models such as Teragrid, and also consider scaling issues for the future (static allocations are easier; dynamic allocations could make better use of resources, particularly if resources become more limited).
- Relevant URLs from Shawn & Ashok: OSG environment variables, http://osg.ivdgl.org/twiki/bin/view/Documentation/OsgCEInstallGuide#Configuration_and_Setup_of_OSG_C; Teragrid environment variables, http://www.teragrid.org/userinfo/guide_environment.html
Did not define next steps for this but we need to start work. Would like to discuss immediate next steps in the 9/12 call.
Update from September 12 SURAgrid call:
Formed a working group to discuss the environment variable question (identification as well as potential minimum requirements) and provide a recommendation to the list in the October 10 SURAgrid call. Team will work on its own til then, with MFY providing phone bridge if needed. Group includes: Warren Smith, Ashok Adiga, John-Paul Robinson, Shawn McKee, Victor Bolet, Judith Utley, Jerry Perez.
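To make the environment variable discussion concrete, here is a minimal sketch of how a node (or a job wrapper) might check that the five kinds of information from the OSG model are published. The variable names are hypothetical; the notes only list the kinds of information needed, and the working group has not chosen names.

```python
import os

# Hypothetical names for the five OSG-style items listed in the notes.
# The real names would come from the working group's recommendation.
REQUIRED_VARIABLES = {
    "SURAGRID_TMP_STAGING": "temp space for staging files",
    "SURAGRID_TMP_WORKING": "temp space for running jobs",
    "SURAGRID_HOME": "home directory",
    "SURAGRID_APP": "application area",
    "SURAGRID_DATA": "data area",
}


def missing_variables(environment):
    """Return the sorted names of required variables that are unset or empty."""
    return sorted(
        name for name in REQUIRED_VARIABLES if not environment.get(name)
    )


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        for name in missing:
            print(f"missing {name}: {REQUIRED_VARIABLES[name]}")
    else:
        print("all required variables are set")
```

A check like this only covers the static case; dynamic values would need to be looked up at job time or published through something like GPIR, as noted above.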
- Pre-requisites – We should further define pre-requisites for SURAgrid node installation and document these from the Web site.
- Incorporate Ashok’s notes to the list before the meeting.
- Question on why limit to just Linux. Latest Rocks, for example, is nearly op sys agnostic.
- On pre-Web Globus - when to move to GT4? O.K. now if pre-Web versions of necessary components are made available during installation.
- Other things we might specify: Should we mandate that there be some compiler? If so, which one(s)? (Had moved compiler earlier on to be specified as an application requirement). How about NTP or other universally synchronized time source – maybe not mandate which one but should we make a recommendation? Any clarifications/requirements about MPI or is this also an application-specific requirement?
- Overall, we decided that we should create a list of pre-requisites and document at various levels – step-by-step or “do this specifically” info for those who just want to be told what to do (maybe it’s latest Rocks – compare our list when done against theirs to see) and more functional (not implementation specific) requirements for those who want more flexibility and to tinker at a deeper level.
Decided this would be the subject of a SURAgrid call over the next few weeks, with MFY integrating discussion-to-date into a strawman in advance to kick off discussion.
Update from September 12 SURAgrid call:
We had some additional discussion regarding this and agreed that MFY will draft a strawman for discussion in a (not too far in the) future SURAgrid call.
Update since the call – currently targeting the October 24 call for this discussion
- Local workload management – We decided this doesn’t matter to SURAgrid but we could provide a list of what’s been known to work. So far, this includes Condor, PBS (Open and Pro), LSF, and SGE. This also ties in with the need to know environment variables at the resource level. Application needs to know the variables, communicates to the scheduler, app needs to know whether there are different schedulers that. Define grid job w/RSL – if building before submitting, GK must be able to translate that into something that the local workload manager understands. Part of the RSL is those environment variables – application defined, script defined first & picks up env. variables. Depending on what you publish through portal or MDS - environment variables need to be publishable. Jobs need env variables at a minimum.
- Metascheduler – Decided that this is nice to have but not necessary to provide at the overall grid level. It can be (and is often being implemented today as) a VO or user option rather than provided or recommended from the overall grid perspective, particularly since there aren’t any available yet that everyone agrees on (good research topic or sub-project!). Meta-schedulers on the horizon: Nimrod, emerging TACC product, MARS from UMich, whole bunch of others. The best way for SURAgrid to stay aware of the state of development in this area at this time is through the direct involvement of many qualified SURAgrid participants :-). However, the info we are discussing to publish for environment variables is a vital prerequisite for running a metascheduler and could/should eventually lead to one. To further inform the environment variable discussion, we might review a list of meta-schedulers available today and the information that they rely on. Decided this would be the subject of a SURAgrid call over the next few weeks, with MFY integrating discussion-to-date into a strawman in advance to kick off discussion.
Update from September 12 SURAgrid call:
Targeting late October or later for this discussion, once initial environment variable work is done.
- When to move to Web services version of Globus? We didn’t make it around to this discussion and need to discuss it in a future call. MFY input after the meeting: See UK Engineering Task Force evaluation of GT4 for one input to this discussion: http://www.nesc.ac.uk/technical_papers/UKeS-2005-03.pdf.
Update from September 12 SURAgrid call:
We re-confirmed that this is still not a high priority and that we should stick with what we have specified to date (pre-Web Globus any version) until the fundamental next steps that have been identified (e.g., env variables, application readiness) are addressed. Will revisit this after the beginning of the next calendar year.
- Development of a time line for deployment over the coming year (including any anticipated changes in recommended components).
- Plans to evolve areas/components as necessary towards the longer term objective of scalable, sustainable, general-purpose infrastructure.
We addressed each of the above together but briefly since we needed to move on to the next topic area. In addition to the technical topics already mentioned, we decided that progress was needed in the following areas over the next few months:
- Deployment of resources will be ongoing, with tracking and assistance from TACC as well as MFY.
- Further definition of what SURAgrid is - benefits, requirements, levels of participation. There is preliminary documentation of this on the SURAgrid Web, with potential contributions listed as expertise, resources or applications but we need to document in more depth and more formally. We reconfirmed that the majority of those in attendance didn’t want to make active contribution of resources in particular a requirement for participating in SURAgrid (basically being included on meetings, calls and decision-making) although some felt that setting such a requirement could be motivational. Preferred timeline: No later than mid-October. Decided this would be the subject of a SURAgrid call over the next few weeks, with MFY integrating discussion-to-date into a strawman in advance to kick off discussion.
- More detailed documentation of SURAgrid resource requirements (environment variables, pre-reqs, etc. as noted in previous sections) – Preferred timeline: By end of October. Decided this would be the subject of a SURAgrid call over the next few weeks, with MFY integrating discussion-to-date into a strawman in advance to kick off discussion.
- Explore the scheduling of another in-person meeting – End of this year or early next. MFY to begin thinking on this…
Update from September 12 SURAgrid call:
MFY to work on development of strawman documents and schedule SURAgrid call discussions as noted above. Probably will slip into November vs. October, given other items already identified (prereqs, env. variables). Will begin thinking on timing of in-person meeting right away, possibly in conjunction with one of the SURA Cyberinfrastructure workshops coming up (December 2005, January 2006).
Next Steps in authN/authZ - led by Jim Jokl
Jim worked through the attached presentation, adding notes and action items throughout. Some related details and additional action items are noted below.
Update from September 12 SURAgrid call:
See slides from Jim for September 8. He had intended to send these for a sanity check before sending to the list but MFY jumped the gun ;-). Please send feedback to Jim if you have comments/changes for this presentation from the meeting.
- Policy development and documentation needed to give working "credibility" to existing components
State of SURAgrid policy elements:
Pieces needed: |
SURAgrid practice |
Best practice |
Local CA Certificate Policy |
Trust each site in terms of content |
Document with content/format something like PKI-Lite |
Bridge CA Certificate Policy |
Basic practices documented as part of mechanics on Jim’s Web site |
Document with content/format based on something like PKI-Lite, or less if needed since Bridge CA is interim |
Local CA Certificate Practice |
Currently integrated with policy |
Covered in PKI-Lite format |
Bridge CA Certificate Practice |
Currently integrated with policy |
Covered in PKI-Lite format |
Certificate Profile |
|
Recommended profiles at HEPKI |
- Acceptable Use Policy – Updated the draft on the SURAgrid Web site based on SURAgrid call on 8/29 and ongoing input is welcome. Still need links to participant site AUPs though!
USC, http://www.usc.edu/hpcc/systems/account.php
Update from September 12 SURAgrid call:
MFY will add these to the AUP page as received.
- Benefits/obligations as they will be further described in the item c of the Grid Building discussion.
- Governance/Charter (and related sorts of things…) – Do we need to make more progress on this? If so, how? Long term, a SURAgrid legal entity/organization could be managed under SURA’s 501(c)3, as we do now for SoX and are discussing for ViDe. For immediate next steps, would like to make progress towards more formality but not at the expense of grid-building or application efforts. OSG has made progress in this area and has several documents we can learn from and possibly model. OSG doc URL from Shawn: http://www.opensciencegrid.org/index.php?option=com_content&task=view&id=58&elMenu=Documentation. Decided this would be the subject of a SURAgrid call over the next few weeks, with MFY integrating discussion-to-date into a strawman in advance to kick off discussion.
Update from September 12 SURAgrid call:
MFY to work on development of strawman documents and schedule SURAgrid call discussions on governance & charter. Probably will slip to the end of the year given other items identified to address earlier on.
- Identification and time line for "automation" of account management on dedicated and shared systems.
- See Jim’s slides for detail on this. Also, clarified that “dedicated” means a specific resource – full system or static designated nodes – available all of the time to SURAgrid users; “shared” means a portion or percentage of a resource will be made available to SURAgrid users as an allocation – could be “used up” in a particular time frame, for instance. Those currently planning to work with Jim on procedures for dedicated systems include: UAB, TTU, GSU, UVA, TAMU. Those that can provide input/scenarios for shared systems include: Vanderbilt, TACC, TAMU, USC, ULL, UKY.
- Currently, Jim is completing code that UAB is scheduled to pilot within one month. Once they see how this goes, will expand the trial.
- UVA is currently using a UMich interface for accounting (SourceForge & NMI: PBS XML Accounting) to show who is doing what on their campus grid and if sites are getting their expected benefit. The UMich product currently works with PBS (OpenPBS and PBSPro) (now also Condor) and Jim may be able to incorporate this for SURAgrid accounting as well. He might also be able to map information coming out of a SURAgrid account management system to what other products need. We also discussed that each site is now able to see the usage of their own systems and, if this info could be aggregated in a database, we’d be able to query the other way around as well – for a site to see what other sites’ systems they had made use of. URL from Ashok on all of the accounting packages. Possible Accounting pilot project here: who’s in, who’s in charge, when does it happen? MFY to add this for discussion in a future SURAgrid call.
Update from September 12 SURAgrid call:
Will keep this on the list for future discussion but no action yet. Could probably use a simple format for sharing this information - possibly just aggregating information into a database that can be queried - and it will be important to use GGF Accounting XML group schema for whatever we do.
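The idea of aggregating per-site usage into one queryable database could be sketched roughly as follows. The table layout, column names and site names are all hypothetical, and the sketch does not attempt to model the GGF accounting schema; it only shows that one aggregate table supports queries in both directions.

```python
import sqlite3


def build_usage_db(records):
    """Load (resource_site, user_site, cpu_hours) records into an in-memory DB.

    The schema is a hypothetical simplification, not the GGF schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_usage ("
        "resource_site TEXT, user_site TEXT, cpu_hours REAL)"
    )
    conn.executemany("INSERT INTO job_usage VALUES (?, ?, ?)", records)
    return conn


def usage_on_site(conn, resource_site):
    """Who used this site's systems (the view each site already has)."""
    rows = conn.execute(
        "SELECT user_site, SUM(cpu_hours) FROM job_usage "
        "WHERE resource_site = ? GROUP BY user_site",
        (resource_site,),
    )
    return dict(rows.fetchall())


def usage_by_site(conn, user_site):
    """Which other sites' systems this site's users made use of."""
    rows = conn.execute(
        "SELECT resource_site, SUM(cpu_hours) FROM job_usage "
        "WHERE user_site = ? AND resource_site != ? GROUP BY resource_site",
        (user_site, user_site),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    # Made-up sites and numbers, for illustration only.
    sample = [
        ("SiteA", "SiteB", 10.0),
        ("SiteA", "SiteB", 5.0),
        ("SiteB", "SiteA", 2.5),
        ("SiteA", "SiteA", 1.0),
    ]
    db = build_usage_db(sample)
    print(usage_on_site(db, "SiteA"))
    print(usage_by_site(db, "SiteA"))
```

In practice each site would export its own records in an agreed format and a central process would load them; the two query functions correspond to the "who used my systems" and "whose systems did I use" views discussed above.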
- Taking some of the steps towards b on-site if that makes sense and there's time – Wasn’t any time :-(.
12:30 – Lunch w/lunch speaker!!!
John-Paul Robinson spoke about one of his latest projects related to Shibboleth and used in conjunction with (among other things) UABgrid authN/authZ: OpenIDP, http://www.openidp.org. More from John-Paul if he wants to send it…
Catalyzing Applications - led by Art Vandenberg
- application identification and development of time lines and next steps for all suggested applications, including those for I2 demo
- brainstorming re: potential SUN proposal (Gary Crane)
- gathering means/methods we collectively know for grid-enabling applications and how we can assist with that
- ways to promote SURAgrid at participating institutions and beyond, including prep for pres/demo at the September I2 meeting if there's time.
Art – I really lost track of taking notes here since we were skipping around a bit. Please send important points you want “on the record” and I will incorporate them into the notes.
Below are some last minute points we pondered as to why institutions might get/stay involved in SURAgrid, from resource contribution and also application perspectives. Will incorporate these into the Participation documentation mentioned in earlier section on grid-building:
What would help most in getting your resources online?
- Prove that institutions benefit by getting resources back (or other benefits) [over time, should get out what you put in; short term, may be able to get access to more or special resources]
- More things like the Install Fest (always include with in-person)
- Remember: It is the mission of the university to advance science, research and education.
What would help most in getting your applications online?
- Define what SURAgrid benefits are – why run on SURAgrid rather than locally
- Assistance in grid-enabling applications
- Remember: It is the mission of the university to advance science, research and education :-).