tagging_workshop_2006_sep_paul

Outlined notes from Paul McKenney


Thu Sep 14, 2006

  • Diane Peters -- introduction.
  • Kees Cook, OSDL.
    • "Claims language"
    • Differing vocabularies
    • Gail -- one should read the application as well as the claims. Examiners don't rely strictly on claims language.
    • Map tagging to USPTO classes (but quite coarse grained).
    • Perfection the enemy of any real system.  ;-)
    • "Tagging wizard".
  • Ross Turk, OSTG (SourceForge.net)
    • Largest number of open-source projects.
    • Want users to find the projects, hence tagging.
    • Want community to collaborate on tagging/categorization.
    • Current approach:
      • Software map (browsing, filtering, categorization)
      • Keyword search
    • Challenges:
      • Search results difficult
      • Only get 255 characters to describe a project
      • SF.net controls hierarchy, admins control content. Need to empower community! Hierarchy not related to USPTO categorization: different audience.
      • Not part of a larger community.
    • 125K projects, 239K searches/day, 24M unique visitors/mo, 1.3M registered users.
    • Tagging must be "organic", cannot force a "flag day". Will take time.
    • Automation?
    • Bogus tags? Cannot assume that all members of the community are benevolent! Can patent examiners evaluate tags or produce tags? USPTO is looking into examiner-produced tags, but might not be able to publicize them.
    • Each SF project can use up to 5 tags out of a universe of about 500.
    • Syntactic or semantic analysis for automated tagging? Analyze Makefile and/or source code? Leverage libraries used? (But be careful of automatically generated makefiles/code.)
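One way to picture the "leverage libraries used" idea is a small Python sketch that scans a Makefile for `-l<library>` linker flags and maps them to tags. The `LIBRARY_TAGS` table and the function name `suggest_tags` are illustrative assumptions, not SourceForge's real tag universe or any tool discussed at the workshop; only the cap of five tags comes from the notes.

```python
import re

# Hypothetical mapping from linked libraries to tags (an assumption for
# illustration; not the real SourceForge tag universe).
LIBRARY_TAGS = {
    "ssl": "cryptography",
    "crypto": "cryptography",
    "png": "graphics",
    "jpeg": "graphics",
    "sqlite3": "database",
    "curl": "networking",
}

MAX_TAGS = 5  # the notes mention up to 5 tags per project


def suggest_tags(makefile_text, max_tags=MAX_TAGS):
    """Suggest tags from -l<library> linker flags found in Makefile text."""
    tags = []
    # Match "-lfoo" only when it is not part of a longer option such as "--long".
    for lib in re.findall(r"(?<![\w-])-l(\w+)", makefile_text):
        tag = LIBRARY_TAGS.get(lib)
        if tag and tag not in tags:
            tags.append(tag)
    return tags[:max_tags]
```

As the notes caution, automatically generated makefiles may list libraries a project never really uses, so suggestions like these would presumably still need human review.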
  • Ken Krugler - Files, files everywhere and not a tag to see.
    • Vertical search engine for programmers. 500+ repositories, 20M files. 50K+ metadata repositories. 40M pages from key domains.
    • Add "code wikis" to talk about code.
    • Cannot reasonably -build- all of this code! Different languages, gcc versions, ...
    • "DOAP" -- XML-based description of OSS projects -- but not everyone uses this!
    • Significant UI issues -- user might specify a version, but work only with latest version. Need to propagate tags as code is harvested by different projects, possibly with modifications.
    • Tags in different languages -- but tags are often too short to automatically recognize the language.
    • Willing to make APIs available gratis to USPTO. (Google makes APIs available under Creative Commons license. Choose to throttle, put up with people "harvesting".)
    • Expose tag values and usage to users? Folksonomy quality still a matter for research.
    • Tags on "what it does" vs. "how it does it".
    • Fuzzy matching to relate good documentation to related code? Structure matching requires heuristics, which differ from language to language. Need things to be automated and scalable to 100K projects.
  • Ward Cunningham - Experience extracting tags from Eclipse commit comments
    • "200 lines of perl is my contribution" "wikiwikiweb is what will be on my gravestone" [reply from audience: will there be an "edit" button?]
    • Eclipse wanted tags to lead busy developers through a complex code base.
    • Have experimented with tags as a supplement to project visibility tools.
    • "Commits explorer" -- OLAP view into CVS repository. dash.eclipse.org
    • "Tag Cloud" by project over time from checkin comments "tags.cgi".
    • Show tags with font size based on number of uses. Considering differential tagging -- month's frequency compared to long-term usage frequency. Generally, the mid-range of usage is the most interesting. The highest usage tends to be things like "fixed".
    • Very small perl script.  ;-)
    • In future, perhaps extract words from comments as well. Perhaps also from code diffs.
    • Want to use comment-like entities that reference multiple pieces of code -- calling out relationships.
    • Might be applicable to other types of repositories, but could always convert to CVS...
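The tag-cloud ideas above (font size by number of uses, and "differential tagging" that compares a recent window against long-term usage) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not a reconstruction of the Perl script described in the talk: the stopword list, point sizes, smoothing, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "for", "on"}


def count_tags(comments):
    """Count candidate tag words across a list of commit comments."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z][a-z0-9_]+", comment.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


def font_sizes(counts, min_pt=10, max_pt=32):
    """Scale each word's count linearly into a font size in points."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    return {
        word: min_pt if span == 0
        else round(min_pt + (n - lo) * (max_pt - min_pt) / span)
        for word, n in counts.items()
    }


def differential(recent, long_term):
    """Ratio of a word's recent share of usage to its long-term share."""
    recent_total = sum(recent.values())
    long_total = sum(long_term.values())
    scores = {}
    for word, n in recent.items():
        recent_share = n / recent_total
        # Add-one smoothing so words unseen long-term do not divide by zero.
        long_share = (long_term.get(word, 0) + 1) / (long_total + 1)
        scores[word] = recent_share / long_share
    return scores
```

With a score like this, a perennial word such as "fixed" scores low even when it is frequent, which matches the observation that the highest-usage words are rarely the interesting ones.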
  • Eric Hestenes -- Peer to Patent Project
    • Incorporating public input to examination process.
    • Collecting use cases -- collaborating on use cases.
    • Community deliberates and ranks prior-art references. Initial thought is to send in top 10. Might also send "the rest" separately categorized.
    • Also include examiner usage in ranking? Also include a list of other applications to which the community felt the reference was relevant.
    • No need to be perfect. If five of ten are right on, the extra five "false positives" aren't a real problem.
    • Retain old commentary -- keep track of "tree" that points backwards in time to progressively older prior art.
    • Why are applicants doing this to themselves? (1) expedited review (2) greater presumption of validity.
  • Best Practices for Manual Software Tagging
    • Incenting people to tag software.
      • "street cred" from wannabe developers.
      • rating tags for relevance and usefulness.
    • [Gail] Only a small fraction of patents describe things at the code level, so tagging documentation may be more important than code. Examiners have to hit on the "right" tag, which might not be sufficient. General description of project would be more helpful in most cases. Would like standard terminology... Or contextual thesaurus. (Can wikipedia be leveraged for this?) Portions of this might exist on a per-examiner basis. Most patent applications are "big animal" inventions -- very few at the low level.
    • Tag with URL to definition.
  • Ward Cunningham: "Archival Quality Software"
    • Hyperlinked CDC6600 assembly with patterns indicated.
  • Thesaurus:
    • Definition and terms, with context.
    • Terms in IDE -- as pulldown.
  • Timestamping
    • Forensic -- Received headers. Should be OK for past work, but belt and suspenders would be good. (Generally, case law applies only going forward.)
    • Hashing -- see Jan's email message. Publish hashes. In Germany, digital signatures must be re-signed periodically (6 years or so?).
    • timestamp.com.
    • Jan Kechel's timestamp service.
    • Perhaps archive.org (the Internet Archive) could save plaintext.
    • Library of Congress keeping old source code? (IP.com does this as a final step.)
    • Google hashes during crawl, but does not guarantee to keep actual file indefinitely (or for any particular time period, for that matter).
    • Note that one must prove integrity as well as date!!!
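The "publish hashes" idea can be illustrated with a short Python sketch. It assumes SHA-256 as the digest and a simple record layout; the function names and record fields are hypothetical and do not describe any of the services listed above.

```python
import hashlib
import hmac


def file_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_record(name, data, claimed_date):
    """Build a record that could be published; the layout is hypothetical."""
    return {"name": name, "sha256": file_digest(data), "claimed_date": claimed_date}


def verify(record, data):
    """Check that the data still matches the digest stored in the record."""
    return hmac.compare_digest(record["sha256"], file_digest(data))
```

A matching digest only shows integrity: that the file in hand is the one that was hashed. The date would still have to come from an independent party having published or countersigned the record at the time, which is the point of the last bullet.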
  • Potential Friday Topics:
    • "gaming" of tags.
    • USPTO categories as tags.
    • Tagging of public non-open-source software -- corporate SDKs.
    • Incenting people to tag software.
    • Overview of examination process.
  • Friday topics:
    • USPTO viewpoint and needs (Gail and Tariq)
    • Google synonym feature (Karl)
    • Sourceforge/Freshmeat Trove (Ross)
    • Repository Federation/Unification/"why can't we all be friends" Standard Panel (Kees, Dan, Karl, Ross, Chris)

Fri Sep 15, 2006

  • Gail Hayes, Tariq Hafiz: USPTO
    • Patent filings increasing faster than expected. Hiring 1,200 patent examiners per year, total of 4,500. 3,000 of these are "electrical", including computer software/hardware. Exceeding space -- senior examiners often work from home, which makes training more challenging. New patent examiners might or might not have knowledge of older technologies.
    • More explicit training required, given the very large number of new hires.
    • Senior examiners often have only five years of experience.
    • Average examiner paid ~$50-60K out of college, senior examiners might get $130K.
    • Training/travel set-asides. Paid consultants, field trips.
    • Examiner process --
      • Most time spent on detailed description.
      • Searching common. Some examiners use categories, others do not.
      • Background provides some education.
      • Incentives for rejection? No publicity... [Could scan patent applications and make noise about good rejections...]
      • "Quota" for examiners includes rejections as well as allowances.
      • RMS's article.  ;-)
    • Wishlist
      • Call out to industry experts [policy issue!]
      • Training.
      • Associating patents with projects. Pledged patents!
      • Tagging should include synopsis of problem and how it was solved.
  • Karl Fogel (Google)
    • google.com/coop
    • Customize search engine -- Google knows about you, and adjusts searches -- special results presented in shaded box.
      • Gail: Policy -- no assisted search. But, given prototype, policy might change.
  • Ross Turk: SourceForge/Freshmeat "Trove"
    • Multi-dimensional map -- tree-structured representation.
    • Categories: topic, user interface, translations, programming language, license, development status, database environment, intended audience, operating systems
    • Browsing starts with topic.
    • Searchable text books. O'Reilly Safari.
    • Mapping between USPTO categories and other locations.
    • Different types of tags -- name as description vs. separate searchable description.
    • Finding prior art similar to finding useful code.
  • Jan Kechel: Mapping between OSS tags and prior art.
    • On OSAPA wiki.
    • USPTO could use something like this internally -- external participation would require a policy change.
  • Tariq Hafiz: Recategorization
    • Initiated when a given subclass has more than a certain number of patents (e.g., 5,000).
    • Also when technology changes.
  • Karl Fogel, Chris Conrad, Ross Turk, Kees Cook
    • One-stop shopping for research searching. Hosted by both producers and consumers.
    • Jeffrey Kruelen's nomination for the name: "DOAP on SOAP".
    • Overview
      • Goals:
        • Assist searches across entire open-source software space.
        • Data source known to be up to date.
        • Extensible by third parties.
        • Both major repositories and bit players represented.
        • Consumers able to select publishers.
        • Common editing interface (includes tagging wizard).
        • RSS feeds to consumers.
      • Single shared project-ID namespace
        • Naming conflicts handled by disambiguation redirection like Wikipedia.
      • Common Interchange Format (XML)
        • DOAP (Description of a Project) Includes "who is authoritative source".
          • Authoritative vs. non-authoritative tags.
          • XML audit trail for tagging.
            • Potentially determine authoritativeness from audit trail.
          • Extend tags to include problem and solution (or maybe description).
        • "Dublin Core" RDF (Resource Description Framework)
      • OSUOSL hosting -- central master database
      • Change publication (e.g., RSS feed)
    • --- Roles for existing players: Authority...
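To make the "authoritative vs. non-authoritative tags" and "audit trail" items concrete, here is a minimal Python sketch. The XML layout below is a simplified, hypothetical interchange record invented for illustration; it is not DOAP's actual RDF/XML vocabulary, and the element names, attributes, and `.invalid` hosts are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: each <tag> carries who applied it and when, which
# serves as a tiny audit trail. This is NOT real DOAP markup.
SAMPLE = """\
<project id="example-project">
  <authority>example-repo.invalid</authority>
  <tag by="example-repo.invalid" at="2006-09-15">parser</tag>
  <tag by="third-party.invalid" at="2006-09-16">compiler</tag>
</project>
"""


def split_tags(xml_text):
    """Split tags into (authoritative, other) using each tag's 'by' field."""
    root = ET.fromstring(xml_text)
    authority = root.findtext("authority")
    authoritative, other = [], []
    for tag in root.findall("tag"):
        if tag.get("by") == authority:
            authoritative.append(tag.text)
        else:
            other.append(tag.text)
    return authoritative, other
```

Here a tag counts as authoritative only when it was applied by the project's declared authoritative source; a consumer could then choose whether to show third-party tags at all, in line with the "consumers able to select publishers" goal.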