Wednesday, May 30, 2007

US probes how TB traveler crossed border (AP)

US probes how TB traveler crossed border (AP): The government is investigating how a globe-trotting tuberculosis patient drove back into the country after his name was put on the no-fly list — and given to U.S. border guards — a major gap in the nation's system to keep the direst of diseases from crossing borders.

We have heard a lot about false positives in the no-fly process, but why should we be surprised by false negatives, except that false negatives are, by their very nature, not recognized until something goes wrong?

DNA-based watermarks using the DNA-Crypt algorithm

DNA-based watermarks using the DNA-Crypt algorithm: The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program leaving the encrypted information intact.

I don't know if this will work in practice, but if it does it could be very problematic. In any case, something like it is likely to work well enough in the lab that some company would get the idea of using it. But the watermarks could leak out in several ways, including pollination and bacteria-mediated exchanges of genetic material. An innocent farmer could be sued because some of their produce is "watermarked" involuntarily through one of those routes. They might not be able to afford to defend themselves against the charge of infringing someone's patent on a genetically engineered plant, and end up having to pay protection money (shades of the RIAA).

Making Network Neutrality Sustainable

Making Network Neutrality Sustainable: SMART Letter #100: Executive Summary: Network Neutrality as currently conceived requires changes in carrier behavior that are contrary to their corporate culture and business model, so we can expect their active opposition even after Network Neutrality becomes law. If carrier resistance prevails, the Internet stands to lose its key success factor. The Network Neutrality movement can learn from history; the demise of Unbundled Network Elements (UNEs) and the ensuing collapse of telephone and Internet competition provides a parallel. The solution is a strategy that is more ambitious and more patient, one that addresses industry structure rather than carrier behavior.

Read it all. Unless you have lived inside old telecom like David (and, to a lesser extent, me), you might think that his views are alarmist, but they are not. The AT&T monopoly is now almost reconstructed, but without the regulatory constraints of the old Bell System.

Incidentally, Apple's exclusive deal with AT&T/Cingular for iPhone service is a bad mistake. Would they have made an exclusive deal with an ISP for Mac Internet service?

Sunday, May 27, 2007

U.S. colleges retool programming classes (AP)

U.S. colleges retool programming classes (AP): They had spent the last few months teaching their personal "Scribbler" robots to draw shapes and chirp on command. Now they were being asked to navigate a daunting obstacle course of Girl Scout cookie boxes scattered over a grid.

I've been reading about the Georgia Tech program, which has interesting aspects, but I can't swallow the gimmickry. Robotics is a great context, but these are still toy problems. My argument is that even beginners can contribute to a large, complex project. I've had freshmen and sophomores make significant contributions to research projects. Granted, those are some of the best students at Penn, but I'm convinced that, with careful structure, it is possible to engage many more students in meaningful projects.

I've been in many discussions and panels on the drop in computer science enrollments. There are lots of theories about it, and precious little empirical evidence. Some changes in freshman classes help. But most of the factors in the drop are likely to be beyond our control.

Instead of trying this and that in the hope that it might make a difference, we should think about what an excellent computer science education should be like. The current models, even in their best manifestations, wring out much of the main joy of the field, which is to muster deep thought and computational processes to solve real problems.

New Music

Until digital tracks are higher-quality and DRM-free, I'll keep buying CDs. Fortunately, AKA Music is a 10-minute walk away. Here are some recent purchases that I am enjoying:

Sclavis gets ever more interesting with each hearing. His new record invites us to go back and compare and contrast it with Napoli's Walls and Les Violences de Rameau.

Friday, May 25, 2007

Learning by Doing Part 3

Great comments on Learning by Doing Part 2, thanks! Since writing the earlier posts, I realized that the term "topic" can be confusing, so I'll use "subject" instead, which also makes it easier to discuss the relationship between these ideas and traditional programs. Here are some tentative answers to earlier comments and to questions I came up with since:
  • Sample project: It's easiest to use one of my own research areas, biomedical information access. We have been working on systems for integrating biomedical text with databases that describe biomedical text. What we have built so far only scratches the surface of what would be possible with enough effort over a few years. We have had good experiences with undergraduates working on the project for independent studies and summer research. A project like this has many opportunities for learning standard subjects: object-oriented programming, data structures and algorithms, software engineering, databases, networking, machine learning, distributed systems, user interface design, basic statistics and experimental methods, as well as basic mathematical subjects like formal language theory, linear algebra, probability, and optimization. By carefully factoring projects and subjects, it would be possible to use the same subject module in different projects, and to ensure that all students cover a broad set of subjects in their work. Think of the subjects as the books that you would have to consult to work successfully on your projects.
  • Projects in current Penn programs: I totally agree that we should extend the senior project down into the junior and sophomore years. Faculty have been discussing this, but we haven't worked out the full model yet. This discussion may help. There might be an incremental path to learning by doing: increasing project-oriented work and short courses while reducing standard lecture courses. Another possibility is to start small with a special program with limited enrollment, to allow developing and debugging the proposal. Incidentally, I think it would be entirely possible to create an ABET-accredited computer science learning-by-doing major, because what ABET cares about is subject and skill coverage, not particular course structures, as we found out when going through the accreditation process last year.
  • Too much work: Having designed and redesigned a major freshman class over the last five years, I can't imagine that anything could be more work than that... There would be more interactive, event-driven work, but we would give fewer lectures, and we would not have to create and grade assignments, which is a major part of traditional course management. Teaching by doing is already what we do with graduate students. We would be bringing teaching and research much closer together at all levels, which would make teaching more creative and enjoyable, and improve research output. The NSF would love it for those reasons. I'm also well aware of the complexities of academic finances, course units, and faculty workload. I made some rough calculations while exercising at the gym (you need to use that time well...), and I concluded that there isn't an order-of-magnitude mismatch between current faculty resources and the demands of a learning-by-doing program. Of course, it would require extra work to set up, but so does any significant change in the curriculum.
  • Is it too hard to do with freshmen: Maybe, but Michael Kearns has had great success with in-class activities and projects that have led to major publications in his Networked Life class. In our freshman introduction to programming, we have successfully used open-ended, realistic assignments that are not so far from project tasks. We use unit testing to check assignment solutions, as we would for project contributions (see the sketch after this list).
  • Grades and transcripts: We would need to organize projects and subjects carefully to make sure that we can certify a level of accomplishment in different subjects. In fact, this wouldn't need to be very different from what we do in a traditional program. As students progress, they move to positions of greater responsibility in their chosen projects, and their roles demand more advanced subjects. A freshman might do basic program maintenance tasks that require her to read specifications and bug reports, read and understand varied code, and modify it to pass existing unit tests and new ones she would craft. With faculty guidance, she would take on a number of such tasks to learn about different programming concepts and paradigms. Later, she would be assigned to select and implement (or obtain from libraries) the best data structures for various tasks, also producing documentation that rigorously explains the complexity arguments in favor of her choices. All of her contributions in different subjects would be evaluated and packaged into subject bundles that would lead to transcript entries.
  • Do it right away: As I keep saying, when I'm asked to do something new, I reply by asking what you want me to stop doing. A school interested in going this way would have to allocate start-up resources to release a coherent faculty team to put a pilot program in place. Given the challenges of current academic finances, with decreasing real-dollar research funding, shrinking enrollments in areas like computer science, and downward pressure on tuition increases while expenses like health care explode, it won't be easy to convince an administration to come up with the money. But it might be possible to build something incrementally, by converting the more advanced applied subjects first (operating systems, databases, AI, ...) and spreading into the more basic parts of the curriculum as the demonstration succeeds and is able to attract resources from grants, for example.
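To illustrate the unit-testing point from the freshmen item above: a beginner's maintenance task might come with existing tests like these, and her contribution would be accepted when they, plus new tests she writes herself, all pass. This is a hypothetical toy example, not an actual Penn assignment.

```python
import unittest

def tokenize(text):
    """Task for the student: split text into lowercase word tokens, ignoring extra spaces."""
    return text.lower().split()

class TestTokenize(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(tokenize("Scribbler robots chirp"),
                         ["scribbler", "robots", "chirp"])

    def test_whitespace_only(self):
        self.assertEqual(tokenize("   "), [])

    def test_new_case_written_by_student(self):
        # The kind of additional test a student would be expected to add herself.
        self.assertEqual(tokenize("Girl  Scout   cookies"),
                         ["girl", "scout", "cookies"])

if __name__ == "__main__":
    unittest.main()
```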

Wednesday, May 23, 2007

Learning by Doing Part 2

Is it not hypocrisy to slam current education practices while being a professor who practices them?

In my defense, it took me a while as a teacher to understand what is wrong with our education. In other words, I had to learn by doing. I can't say much with confidence about education before college, although I have opinions as a parent of two kids who went through worse and better schools.

However, I think that it would be worth trying a radical experiment in college-level computer science education. The experiment would start with a small group of students, but it would be conducted with serious attention to scalability. A four-year program would be based on a small set of big projects addressing problems we do not know how to solve; if the projects succeeded, they could have major impact. The projects would be defined by a rolling committee of faculty and students with external input. They would be hard, but not unrealistically hard for making substantial progress in four years; they would involve several areas, skills, and talents; and they would be manageable by faculty and students. The projects would be organized to cover the core knowledge, skills, and open questions that we believe computer scientists need to command. Let's call those the topics.

Incoming students would go through an initial period in which they would select a subset of projects with a balanced coverage of topics, with the help of a faculty advisor and more senior students. As when selecting a major, students would sample different projects before settling on a portfolio. Instead of signing up for classes, students would sign up for tasks in projects. Their evaluation would depend only on how they perform on their selected tasks. Specific tasks would need specific knowledge and skills, which students would be free to acquire in whichever way works best for them.

Faculty (and possibly others) would offer short courses on important topics that serve multiple projects, but the main role of faculty would be to guide projects and their participants. There would be no difference between research and teaching. Faculty would ensure that projects and project assignments are demanding, so that students have to acquire deep knowledge and skills across a range of topics. Faculty would work with students to help them find the best way to succeed. But each student would have an individualized curriculum tied to their projects and their topic coverage requirements.

In other words, undergraduate education would be run like a (large) set of graduate research or cutting-edge industrial R&D projects. Lectures, readings, and assignments would be provided to satisfy problem-solving demand, instead of the current model in which they are imposed on students independently of where they are in their education.

Projects would have concrete outcomes, for example papers, online services, and open-source software. Individual faculty, or small groups of faculty, would be project directors, with students playing appropriate roles in different project functions.


This is a very rough sketch. I'm sure there are many objections. But something like it has to emerge. Fun and profit demand it.

Newsweek Ranks Schools; Monkey High Still Tops

Newsweek Ranks Schools; Monkey High Still Tops: Newsweek has once again issued its list of America’s Best High Schools. They’re using the same goofy formula as before: the number of students from a school who show up for AP or IB exams, divided by the number who graduate. Just showing up for an exam raises your school’s rating; graduating lowers your school’s rating. As before, my hypothetical Monkey High is still the best high school in the universe. Monkey High has a strict admissions policy, allowing only monkeys to enroll. The monkeys are required to attend AP and IB exams; but they learn nothing and thus fail to graduate. Monkey High has an infinite rating on Newsweek’s scale.
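The perverse incentive is easy to see in a few lines of Python; this is a toy sketch of the formula as Felten describes it, with made-up numbers:

```python
def newsweek_index(ap_ib_exam_takers, graduates):
    """Newsweek's ranking as described above: exam takers divided by graduates.

    Showing up for an exam raises the score; graduating lowers it.
    """
    if graduates == 0:
        return float("inf")  # Monkey High: exams attended, nobody graduates
    return ap_ib_exam_takers / graduates

# Made-up numbers for illustration:
print(newsweek_index(350, 400))  # a typical school: 0.875
print(newsweek_index(350, 0))    # Monkey High: inf
```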

Ed Felten quotes a summary of research that confirms the idiocy of this ranking formula:

Studies by U.S. Department of Education senior researcher Clifford Adelman in 1999 and 2005 showed that the best predictors of college graduation were not good high-school grades or test scores, but whether or not a student had an intense academic experience in high school. Such experiences were produced by taking higher-level math and English courses and struggling with the demands of college-level courses like AP or IB. Two recent studies looked at more than 150,000 students in California and Texas and found if they had passing scores on AP exams they were more likely to do well academically in college.

What is both amazing and depressing is that we would never think of evaluating athletic performance with a single test. What matters is sustained performance over many events, which involves a wide range of physical and psychological processes. I became a decent skier (and a decent programmer) not by focusing on one test, but by working on harder and harder problems, and by overcoming many failures, over a long period. Staying with it in the face of failure was in some ways much more important than scoring a perfect run (or writing a bug-free function the first time) now and then. We all know that this is what it takes. So why is our education so disconnected from it?

Maybe part of the problem is that education has become totally disconnected from doing. Most educational systems serve their own internal objectives, not the achievement of something of independent value.

All of the computer science I have learned — my college degree was in pure math — I learned because I was trying to solve some problem or understand someone else's solution. I would not have it any other way.

Even much of the math I learned in class only became alive when I had to remember it in solving some computational problem. I did pretty well in real and functional analysis, measure theory, differential geometry, abstract algebra, and mathematical logic in college, but all of that became much more real to me as I used it here and there in understanding problems in programming language design, formal linguistics, machine learning, and speech recognition.

The latest New Yorker has an interesting article on Gordon Bell. What struck me most there (leaving aside the questions of how to search a life's record) was the story of the young Gordon Bell learning by doing in his father's electrical shop. The formalization and bureaucratization of work has taken away most opportunities for effective apprenticeship. I doubt that we can fix our educational system before we create effective ways of learning by doing.

Learning by doing makes much more sense in one's internal economy. In standard education, you invest a lot of effort upfront in the hopes of some uncertain return in the distant future. In learning by doing, the return on investment is immediate: if I learn this piece of math or physics, I may be able to do this task and be rewarded (materially, psychologically, or socially) for my contribution to the project's success.

Learning by doing is also more efficient in individual resources, in that what we learn has an immediate use and is tailored to our interests, abilities, and needs. One-size-fits-all academic education is hugely wasteful in that it is based on fixed user-independent bundles of knowledge and skill. Once again, our processes and systems fail the long tail.

Tagging fantasies

Interview: David Weinberger: This seems to have been my “interview very smart people” month. A couple of weeks ago I was lucky enough to spend an hour talking with David Weinberger about his fascinating new book, Everything Is Miscellaneous. The full interview with Weinberger is now up at Salon.

Rosenberg includes some interview answers that were deemed too technical:

You mean WinFS, the file-system revamp that was supposed to be a part of Vista and then got chopped?
No, a skunkworks Microsoft project — called Tesla. It was XP-based, an application, a layer on top, and instead of having folders, you had tags, and you could organize by that. If people wanted that, if there were demand for it, somebody wrote it well, we'd have it. We're very used to the Windows foldering system, which emulates a second order system. It has a third order capability that is unused, or not implemented well enough for people to actually use. There's also a cognitive step to be taken to get used to it. Not much of a cognitive step. We'd be doing this for photographs within Windows if Windows gave us good enough tools.

It's not a cognitive step, it's an economic step. To tag an item is to invest effort upfront for an uncertain return later. The tags I assign today may not be the ones that connect in my mind to that kind of item in three months. As a result, it is not clear that tagging is worth it for any individual. Tagging is indexing in disguise, and good indexing is very hard. Tagging of images may be successful because there is no other good way of retrieving images. But the co-occurrence statistics of file contents and of our interactions with them are there without us making any extra effort. Manual tagging may be a good solution for the unusually organized, but for the rest of us, search that better exploits content and context is where the sweet spot is. I never tag (let alone folder) most of my huge email archive, but I search it all the time. I might be persuaded by a good automated method for suggesting tags, though.
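Since I raised the possibility of automated tag suggestion, here is a minimal sketch of the kind of thing I have in mind: it suggests tags for a new item from the terms that co-occurred with each tag in previously tagged items. Everything here (function names, data) is made up for illustration, not a description of any real system.

```python
from collections import Counter, defaultdict

def build_tag_profiles(tagged_items):
    """For each tag, count the content terms it has co-occurred with."""
    profiles = defaultdict(Counter)
    for terms, tags in tagged_items:
        for tag in tags:
            profiles[tag].update(terms)
    return profiles

def suggest_tags(terms, profiles, k=3):
    """Score known tags by how strongly their profiles overlap the new item's terms."""
    scores = {tag: sum(profile[t] for t in terms) for tag, profile in profiles.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [tag for tag, score in ranked if score > 0][:k]

# Made-up example: previously tagged items as (terms, tags) pairs.
tagged = [({"invoice", "tax", "2006"}, {"finance"}),
          ({"ski", "trip", "photos"}, {"travel"}),
          ({"receipt", "tax", "refund"}, {"finance"})]
profiles = build_tag_profiles(tagged)
print(suggest_tags({"tax", "receipt", "2007"}, profiles))  # ['finance']
```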

I like Weinberger's book a lot for its diagnosis of the situation, but I am not so convinced by his favored solutions.

Monday, May 21, 2007

Unbundling review, highlight, distribution, discussion

The net has made it possible to unbundle the various roles of traditional science venues:

  • Review: Maintain good scientific standards. A paper that passes review should be sound (within the limits of human ability to check), understandable, and reproducible.
  • Highlight: We all have opinions about which sound papers are most significant. We also tend to trust certain people or venues to select or highlight the most significant papers. This function is separate from basic reviewing. For example, PLoS ONE will publish any paper that passes review, leaving aside subjective importance criteria. This is possible because an online journal has no page limit.
  • Distribution: Open access distribution is easy, cheap, and does not need to be tied to physical venues (dead trees or meetings).
  • Discussion: Online discussion (as the PLoS journals offer) and meetings provide complementary channels for communication and discussion of ideas. Highlighting can draw attention to particular papers for more intense discussion; meetings with separate tracks or talk/poster distinctions do this implicitly.

My proposal for connecting an open access journal with an annual meeting is just one way of re-bundling these functions differently from what is currently available; it is designed to promote better reviewing, easy distribution, effective highlighting through the tracks of a meeting, and more discussion than a journal alone offers.

Sunday, May 20, 2007

Open Access Computational Linguistics

Hal's opening shot has many interesting responses in the comments, and my blog reply also yielded interesting comments. Instead of commenting on the comments individually, here are some additional points:

  • I understand that Computational Linguistics and the other ACL publications are distinct, but the ACL owns Computational Linguistics and could change its operation if it so wished.
  • Open access Computational Linguistics seems a good idea, but...
  • Open access by itself does not solve the critical problem with our field, which is the poor quality of reviewing for conferences, and the slow reviewing for journals.
  • These two problems are linked. Reviewing resources tied up in conference reviewing are not available for journal reviewing. Journals get even slower, authors give up on journals, submit more to conferences, and conference reviewing gets even worse as a result of the increased load.
  • Conference reviewing can never be as good as journal reviewing because it is done under time pressure, because it does not allow for careful editorial consideration, and most of all because it generates a surge of demand over a short period that cannot be met by competent reviewers.
  • Journals spread their load evenly all year round, and revisions or resubmissions with detailed answers to previous reviews are possible (and desirable).
  • Journals have an institutional memory, including a memory of reviewer quality and timeliness that conferences do not have with their annual change in program chairs.

An open access Computational Linguistics would also need to be a fast turn-around journal with quick and effective editorial control. After getting a paper through PLoS Computational Biology in eight months from initial submission to publication of a revised version, with two rounds of extraordinarily helpful reviews and editorial comments, I cannot accept anything less for our field. However, I don't see how this can happen unless current reviewers move their reviewing efforts from conferences to journals. It is evident that we are scraping the bottom of the reviewing barrel at the moment, from the many experiences we have all had with uninformed, sloppy, and cursory conference reviews. High-quality journal reviews take time, easily several times longer than typical conference paper reviews. Something has to give.

However, if we had an efficient open access journal, we could easily create an excellent conference without extra work. All of the papers accepted for publication in a given year would be eligible for presentation at the conference. Of those, the editorial board would select a subset of nuggets for plenary talk presentation. The rest could be presented as short talks or posters. We would have all of the social advantages of a conference with the higher quality of a well-edited journal.

I hear that some faculty oppose ideas like this because it removes the conference deadline pressure from their graduate students. I say to those faculty that if deadlines are all you have to motivate your students, you are in deep trouble. Conference deadlines are terrible for quality. My worst published mistakes are all in hastily written conference papers, and they were not caught by the deadline-driven reviewers.

Tuesday, May 15, 2007

Whence JCLR?

Whence JCLR?: Hal advocates an open-access "Journal of Computational Linguistics Research" (JCLR) on the model of JMLR. I'm with him all the way, until his final sentence:

The biggest thing that would be required would be a bunch of people with white hair who are willing to commit body-and-soul to such a move.

Many of those of us with white (or missing) hair are already up to our necks (or worse) in administration. When I'm asked to do something new, I always ask back, "What do you want me to stop doing?" But the current ACL leadership is already spending a lot of time managing publication in conferences and one journal. Except that this effort is mostly wasted on obsolete publication methods instead of leading the move to open access and fast turn-around, quality reviewing. When I had some executive power at ACL, many years ago, before there was a web, I was one of the movers, with Stuart Shieber, in getting rid of the restrictive copyright assignment for conference and journal publications. What are current ACL executives doing?

Saturday, May 12, 2007

Zellig Harris, natural language processing, and search

Two quotes from Zellig Harris's Language and Information, which I keep coming back to when I am trying to figure out the confusions of natural language processing (NLP) and search. Discussing language in general:

But natural language has no external metalanguage. We cannot describe the structure of natural language in some other kind of system, for any system in which we could identify the elements and meanings of a given language would have to have already the same essential structure of words and sentences as the language to be described.

Discussing science sublanguages:

Though the sentences of a sublanguage are a subset of the sentences of, say, English, the grammar of the sublanguage is not a subgrammar of English. The sublanguage has important constraints which are not in the language: the particular word subclasses, and the particular sentence types made by these. And the language has important constraints which are not followed in the sublanguage. Of course, since the sentences of the sublanguage are also sentences of the language, they cannot violate the constraints of the language, but they can avoid the conditions that require those constraints. Such are the likelihood differences among arguments in respect to operators; those likelihoods may be largely or totally disregarded in sublanguages. Such also is the internal structure of phrases, which is irrelevant to their membership in a particular word class of a sublanguage (my emphasis).

Recently, we found clear empirical evidence for this last point, and indirect evidence for the more general point, in the failure of several teams to achieve significant domain adaptation from newswire parsing to biochemical abstract parsing.

In general, discussions of natural language processing in search fail to distinguish between search in general text material and search in narrow technical domains. Both rule-based and statistical methods perform very differently in the two kinds of search, and the reason is implicit in Harris's analysis of the differences between general language and technical sublanguages: the very different distributional properties of general language and sublanguages.

Some of the most successful work on biomedical text mining relies on a parser that descends in a direct line from Harris's ideas on the grammar of science sublanguages.

Harris observed the very different distributions in general language and technical sublanguages. Although he didn't put it this way, the distributions in sublanguages are sharp, that is, light-tailed; in general language, they are heavy-tailed (Zipfian). Both manual lexicon and rule construction and most of the machine learning methods applied to text fail to capture the long tail of general text. The paradoxical effect is that "deeper" analysis leads to more errors, because analysis systems are overconfident in their analyses and in the resulting classifications or rankings.
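To make the heavy-tail/light-tail contrast concrete, here is a small sketch (with arbitrary parameters) comparing how much probability mass falls outside the 1,000 most frequent word types under a Zipfian distribution versus a light-tailed geometric one:

```python
import numpy as np

V = 1_000_000                # vocabulary size (arbitrary)
ranks = np.arange(1, V + 1)

zipf = 1.0 / ranks           # Zipf: p(rank r) proportional to 1/r
zipf /= zipf.sum()

geom = 0.99 ** ranks         # light-tailed: geometric decay
geom /= geom.sum()

top = 1000
print(f"Zipf mass beyond the top {top} types:      {zipf[top:].sum():.3f}")
print(f"Geometric mass beyond the top {top} types: {geom[top:].sum():.5f}")
# The Zipfian tail still carries roughly half the mass; the geometric tail is negligible.
```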

In contrast, in technical sublanguages there is hope that both rule-based and machine learning methods can achieve very high coverage. Additional resources, such as reference book tables of contents, thesauri, and other hierarchical classifications, provide relatively stable side information to help the automation. Recently, I had the opportunity to spend some time with Peter Jackson and his colleagues at Thomson and see some of the impressive results they have achieved in large-scale automatic classification of legal documents and in document recommendation. The law is very interesting in that it has a very technical core, but it connects to just about every area of human activity and thus to a wide range of language. However, Harris's distributional observations still apply to the technical core, and can be exploited by skilled language engineers to achieve much better accuracy than would be possible with the same methods on general text.

More speculatively, the long tail in general language may have a lot to do with the statistical properties of the graph of relationships among words. Harris again:

At what point do words get meaning? One should first note something that may not be immediately obvious, and that is that meanings do not suffice to identify words. They can give a property to words that are already identified, but they don't identify words. Another way of saying this is that, as everybody who has used Roget's Thesaurus knows, there is no usable classification and structure of meanings per se, such that we could assign the words of a given language to an a priori organization of meanings. Meanings over the whole scope of language cannot be arranged independently of the stock of words and their sentential relations. They can be set up independently only for kinship relations, for numbers, and for some other strictly organized parts of the perceived world.

Rule-based and parametric machine learning methods in NLP are based on the assumption that language can be "carved at the joints" and reduced to the free combination of a relatively small number of factors (small compared to the number of distinct tokens). Although David Weinberger in Everything Is Miscellaneous does not write about NLP, his arguments are directly applicable here. Going further, to the extent that general search works, it is because it is non-parametric: the ranking of documents in response to a query is mostly determined by the particular terms in the query and the documents and by their distributions, not by some abstract parametric model of ranking. If and when we can do machine learning and NLP this way, accurately and efficiently, we may have a real hope of changing general search significantly. In the meantime, our parametric methods have a good chance in sublanguages that matter, like law or biomedicine. The work I mentioned already demonstrates this.
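As a toy illustration of what I mean by term-driven, non-parametric ranking, here is a minimal tf-idf scorer: a document's rank for a query depends only on the query terms and their observed distributions in the collection, not on any fitted parametric model. The documents and function names are made up for illustration.

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank documents by the summed tf-idf weight of the query terms they contain."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))  # document frequencies
    n = len(docs)
    scored = []
    for doc, tokens in zip(docs, tokenized):
        tf = Counter(tokens)  # term frequencies in this document
        score = sum(tf[t] * math.log((n + 1) / (df[t] + 1))
                    for t in query.lower().split())
        scored.append((score, doc))
    return [doc for score, doc in sorted(scored, reverse=True)]

docs = ["the court construed the patent claim narrowly",
        "a recipe for lentil soup",
        "patent claim construction in the federal circuit"]
print(rank("patent claim", docs))
```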

Matt Hurst on The Future of Search

The Future of Search: I spent Thursday and Friday last week in the Bay area. On Friday I participated in Berkeley's Future of Search (FoS) event [...] (Peter Norvig) acknowledged that there is still much to be done (for Google) in terms of understanding document structure, in particular the interpretation of tables. I've never seen any real evidence of good document analysis in main stream search which is actually very surprising given the constraints that document structure provides which can only help with relevance and other issues.

This is not at all surprising. Document structure doesn't have a stable semantics: a given configuration may express many different relations depending on context, or no relation at all, serving purely stylistic goals. The problem with document structure analysis, as with natural language analysis, is that the analyzer is often wrong, but it has no way of knowing when.

As with NLP, document structure analysis is a great research topic. In both cases, there are good opportunities for researchers to work with search experts who understand the diversity of queries and documents, to find out what, if anything, of the research may be effectively applicable to search. But I don't believe we can just will our current simplistic analysis methods into search success.